2018 I/ITSEC - 9250

Machine Supported Entity Resolution in the Cyber Domain (Room S320C)

28 Nov 18
8:30 AM - 9:00 AM
The intelligence community has prioritized training new analysts to identify nation-state actors, and its curriculum hinges on new data-driven methods for seeking and tracking malicious behavior in geographically uncertain landscapes. A key cyber challenge is to construct and understand networks whose nodes (actors, events, and organizations) are linked by shared relationships. To automate building such networks, we investigate large-scale, multi-label classifiers for finding key entities, both named (persons, organizations, and locations) and unnamed (web addresses, malware hashes, and dates). We evaluate machine learning methods on two novel datasets describing advanced persistent threats (APTs) and their common attributes (countries, groups, tactics, and targets). The first dataset employs crowd-sourced entity tags from human-curated cybersecurity reports. The second automatically mines 10 years’ worth of APT reports totaling nearly 6,000 pages. The core mission centers on natural language processing of complex narratives, often dominated by inconsistent foreign or technical terms. With 97% accuracy, we automate the identification of new threat reports responsive to APT nation-states, as distinct from general web reports on vulnerabilities, blogs, and data feeds. We bootstrap from a small subset of human-scored APT reports, generalize the rules implicitly applied to each country, and provide simple decision trees with 91% accuracy that might help an intelligence analyst sift through large and continuously updated repositories. Because the training challenge often begins with learning the domain-specific vocabulary, we find novel APT ontologies that both supplement existing teaching resources and automate manual steps, freeing the analyst for new kinds of more complex data mining and discovery.
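
As an illustration of the unnamed entities mentioned above (web addresses, malware hashes, and dates), the following minimal sketch pulls such tokens out of free text with regular expressions. It is not the method evaluated in the paper, which relies on multi-label classifiers; the pattern set, the function name `extract_unnamed_entities`, and the sample sentence are all assumptions made for this example.

```python
import re

# Hypothetical patterns: the abstract does not say how unnamed entities are
# detected, so these regular expressions are illustrative assumptions only.
PATTERNS = {
    "url": re.compile(r"https?://[^\s<>\"')\]]+"),
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
    "sha1": re.compile(r"\b[a-fA-F0-9]{40}\b"),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),  # ISO-style dates only
}


def extract_unnamed_entities(text):
    """Return a dict mapping each entity label to its sorted unique matches."""
    found = {}
    for label, pattern in PATTERNS.items():
        matches = sorted(set(pattern.findall(text)))
        if matches:
            found[label] = matches
    return found


# Made-up sample sentence. The hash is the MD5 of the empty string, used
# here as a harmless placeholder rather than a real malware indicator.
SAMPLE = (
    "On 2017-05-12 the dropper (MD5 d41d8cd98f00b204e9800998ecf8427e) "
    "contacted http://example.com/payload for further instructions."
)

if __name__ == "__main__":
    for label, values in extract_unnamed_entities(SAMPLE).items():
        print(label, values)
```

Fixed patterns like these only cover entities with a rigid surface form. Named entities such as persons, organizations, and locations have no such form, which is one reason a learned classifier is the more natural fit for them.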