WO2022176209A1 - Search device, search method, and search program - Google Patents

Search device, search method, and search program

Info

Publication number
WO2022176209A1
Authority
WO
WIPO (PCT)
Prior art keywords
attacker
behavior
search
document
unit
Prior art date
Application number
PCT/JP2021/006696
Other languages
English (en)
Japanese (ja)
Inventor
雄己 川口
麿与 山嵜
Original Assignee
日本電信電話株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電信電話株式会社
Priority to PCT/JP2021/006696
Publication of WO2022176209A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying

Definitions

  • The present invention relates to a search device, a search method, and a search program.
  • Non-Patent Document 2 proposes a weakly supervised method that emphasizes tf-idf and word distribution for neural information retrieval, which performs retrieval in consideration of semantic content by using feature values that represent the semantics of natural sentences.
  • The present invention has been made in view of the above, and aims to make it possible to search, with high accuracy, for documents describing information about attacks by using the behavior of attackers.
  • The search device includes: a collection unit that collects documents describing information about attacks; an extraction unit that extracts natural sentences from the collected documents; a generation unit that, when the similarity between a phrase included in an extracted natural sentence and a natural sentence representing the behavior of an attacker is greater than or equal to a predetermined threshold, generates a label to be added to the document indicating that the attacker's behavior is included; and a learning unit that, using documents labeled as including the attacker's behavior as training data, trains a model that outputs the degree of relevance of a document to the attacker's behavior.
  • FIG. 1 is a schematic diagram illustrating a schematic configuration of a search device.
  • FIG. 2 is a diagram for explaining the processing of the search device.
  • FIG. 3 is a flow chart showing a search processing procedure.
  • FIG. 4 is a flowchart showing a search processing procedure.
  • FIG. 5 is a diagram illustrating a computer executing a search program.
  • FIG. 1 is a schematic diagram illustrating a schematic configuration of a search device. Also, FIG. 2 is a diagram for explaining the processing of the search device.
  • The search device 10 of the present embodiment is realized by a general-purpose computer such as a personal computer, and includes an input unit 11, an output unit 12, a communication control unit 13, a storage unit 14, and a control unit 15.
  • The input unit 11 is implemented using input devices such as a keyboard and a mouse, and inputs various instruction information, such as an instruction to start processing, to the control unit 15 in response to input operations by the operator.
  • the output unit 12 is implemented by a display device such as a liquid crystal display, a printing device such as a printer, an information communication device, or the like.
  • the communication control unit 13 is realized by a NIC (Network Interface Card) or the like, and controls communication between the control unit 15 and an external device such as a management server that manages information about attacks such as security reports via the network.
  • The storage unit 14 is implemented by semiconductor memory devices such as RAM (Random Access Memory) and flash memory, or storage devices such as hard disks and optical disks. Note that the storage unit 14 may be configured to communicate with the control unit 15 via the communication control unit 13. In the present embodiment, the storage unit 14 stores, for example, a model 14a generated by a search process to be described later.
  • the control unit 15 is implemented using a CPU (Central Processing Unit), NP (Network Processor), FPGA (Field Programmable Gate Array), etc., and executes a processing program stored in memory. Thereby, the control unit 15 functions as a collecting unit 15a, an extracting unit 15b, a generating unit 15c, a learning unit 15d, and a searching unit 15e, as illustrated in FIG. Note that these functional units may be implemented in different hardware.
  • For example, the learning unit 15d may be implemented as a separate learning device.
  • the control unit 15 may include other functional units.
  • The collection unit 15a collects documents describing information about attacks, such as security reports. For example, the collection unit 15a collects such documents via the input unit 11, or via the communication control unit 13 from a management server or the like that manages information about attacks such as security reports.
  • The collection unit 15a may cause the storage unit 14 to store the collected documents. Alternatively, the collection unit 15a may pass the collected documents to subsequent processing without storing them in the storage unit 14.
  • A document describing information about an attack is, for example, a security vendor's public blog post, public report, paid report, or private report in a format such as HTML or PDF from which natural sentences can be obtained.
  • the collection unit 15a collects documents corresponding to security reports from security vendor's public blogs, paid reports, etc., as pre-processing for the processing of the learning unit 15d, which will be described later (see S1 in FIG. 2).
  • the collecting unit 15a also collects security reports such as published reports, paid reports, and non-disclosed reports (see S11 in FIG. 2) as preprocessing for processing by the searching unit 15e, which will be described later.
  • the extraction unit 15b extracts natural sentences from the collected documents. For example, the extraction unit 15b extracts a portion described in natural sentences from the collected HTML or other documents as preprocessing for the processing of the learning unit 15d, which will be described later (see S2 in FIG. 2).
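The extraction of natural sentences from an HTML document can be illustrated with a minimal sketch. It assumes that a "natural sentence" can be approximated as a run of visible text with at least a few words; the class name `_TextCollector`, the function `extract_natural_sentences`, and the `min_words` cutoff are hypothetical and do not come from the application.

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects visible text chunks, skipping <script> and <style> content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_natural_sentences(html, min_words=4):
    """Return text fragments that look like sentences (assumed heuristic)."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    sentences = []
    for chunk in parser.chunks:
        # Naive split: break after ., ! or ? when followed by whitespace.
        for candidate in re.split(r"(?<=[.!?])\s+", chunk):
            if len(candidate.split()) >= min_words:
                sentences.append(candidate)
    return sentences
```

Short fragments such as headings and menu entries fall below the word cutoff and are dropped; a production extractor would likely need a proper sentence segmenter and PDF support as well.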
  • The extraction unit 15b also extracts natural sentences from the descriptions of signatures linked to the analysis results of the IoCs (Indicators of Compromise) used in the attack in the collected security reports, as preprocessing for the processing of the search unit 15e, which will be described later (see S12 to S13 in FIG. 2).
  • the security report contains non-natural text IoC information such as the hash value of the malware and the IP address of the communication destination.
  • the signatures used for analysis are posted in the analysis results of sandbox analysis, log analysis, etc. for IoC.
  • a signature is a rule for detecting specific behavior of an attack. Therefore, signature rules can be thought of as representing attack behavior.
  • The extraction unit 15b obtains the analysis results either by retrieving existing results from an online sandbox or the like, or by obtaining samples from VirusTotal or the like and performing its own sandbox analysis. Alternatively, when the IoC is an IP address, the extraction unit 15b acquires, from VirusTotal or the like, information about malware samples that have been linked to that IP address in the past.
  • the extracting unit 15b can extract the IoC information and add the behavior of the attacker in natural sentences to the security report using the explanation of the signature that detected the IoC.
  • For example, a signature whose rule is that a PE32 executable file is dropped and its file path includes the Users AppData folder carries a natural-sentence description such as "Drops an executable to the user AppData folder".
  • the extraction unit 15b can extend the IoC information extracted from the security report and add the attacker's behavior in natural sentences to the security report.
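The expansion of IoC information into natural-sentence behaviors can be sketched as follows. This is only an illustration under stated assumptions: the two lookup tables stand in for sandbox analysis results and signature descriptions, the signature identifiers, the second description, and the example hash and IP address are placeholders, and the regular expressions cover only MD5/SHA-1/SHA-256-length hashes and unvalidated IPv4 addresses.

```python
import re

# Hypothetical table: signature identifier -> natural-sentence description.
SIGNATURE_DESCRIPTIONS = {
    "drops_exe_appdata": "Drops an executable to the user AppData folder",
    "contacts_remote_host": "Connects to a remote host over the network",
}

# Hypothetical stand-in for analysis results: IoC -> signatures that fired.
ANALYSIS_RESULTS = {
    "d41d8cd98f00b204e9800998ecf8427e": ["drops_exe_appdata"],
    "203.0.113.7": ["contacts_remote_host"],
}

# Hex strings of SHA-256, SHA-1, or MD5 length (longest alternative first).
HASH_RE = re.compile(r"\b(?:[a-fA-F0-9]{64}|[a-fA-F0-9]{40}|[a-fA-F0-9]{32})\b")
# Dotted-quad pattern; octet ranges are not validated in this sketch.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def extract_iocs(text):
    """Return hash-like and IPv4-like strings found in the report text."""
    return HASH_RE.findall(text) + IPV4_RE.findall(text)


def expand_with_behaviors(report_text):
    """Map each IoC to the descriptions of the signatures that detected it."""
    behaviors = []
    for ioc in extract_iocs(report_text):
        for signature in ANALYSIS_RESULTS.get(ioc.lower(), []):
            description = SIGNATURE_DESCRIPTIONS[signature]
            if description not in behaviors:
                behaviors.append(description)
    return behaviors
```

The returned sentences can then be appended to the report text so that a report containing only hash values or IP addresses still carries searchable descriptions of attacker behavior.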
  • When the similarity between a phrase included in an extracted natural sentence and a natural sentence representing the behavior of an attacker is greater than or equal to a predetermined threshold, the generation unit 15c generates a label to be added to the document indicating that the attacker's behavior is included (see S3 in FIG. 2). For example, the generation unit 15c obtains natural sentences representing attacker behavior from the signatures of security devices and from ATT&CK (Adversarial Tactics, Techniques, and Common Knowledge), which documents the tactics, techniques, and behaviors that attackers use. The generation unit 15c also vectorizes phrases, such as noun phrases and verb phrases, in the natural sentences extracted from the document.
  • The generation unit 15c likewise vectorizes the attacker behaviors, written as natural sentences, that are used as search queries in the processing of the search unit 15e, which will be described later. For this vectorization, the generation unit 15c uses, for example, a weighted sum of word2vec vectors, which can express the semantic features of a natural sentence. As a result, natural sentences with similar semantic content yield similar vectors even when they are worded differently.
  • The generation unit 15c calculates the degree of similarity between each vectorized phrase and the behavior of the attacker and, if the degree of similarity is equal to or greater than a predetermined threshold, generates a label indicating that the natural sentence extracted from the document contains the attacker's behavior. This makes it possible to assign labels to expressions with similar meanings.
  • the generation unit 15c attaches the generated label to the document, and uses it as teacher data used in the processing of the learning unit 15d, which will be described later.
  • The generated label is a binary value indicating whether or not the phrase is similar, that is, whether or not the document includes the behavior of an attacker. Therefore, unlike conventional techniques that generate labels as continuous values, differences between documents that match a label are not actively learned, and the contribution of a small number of behaviors among multiple behaviors can be prevented from becoming extremely large.
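The weak-labeling step described above can be sketched in a few lines. The three-dimensional word vectors below are toy values standing in for word2vec embeddings, and the function names, the uniform default weights, and the threshold of 0.9 are assumptions made for illustration only.

```python
import math

# Toy word vectors; hypothetical stand-ins for trained word2vec embeddings.
WORD_VECTORS = {
    "drops": [0.9, 0.1, 0.0],
    "writes": [0.8, 0.2, 0.0],
    "executable": [0.1, 0.9, 0.0],
    "binary": [0.2, 0.8, 0.1],
    "sends": [0.0, 0.1, 0.9],
    "email": [0.1, 0.0, 0.9],
}


def embed(phrase, weights=None):
    """Weighted sum of word vectors; words without a vector are skipped."""
    weights = weights or {}
    total = [0.0, 0.0, 0.0]
    for word in phrase.lower().split():
        vector = WORD_VECTORS.get(word)
        if vector is None:
            continue
        w = weights.get(word, 1.0)
        total = [t + w * v for t, v in zip(total, vector)]
    return total


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def weak_label(phrases, behavior, threshold=0.9):
    """Binary label: 1 if any phrase is similar enough to the behavior."""
    behavior_vector = embed(behavior)
    return int(any(cosine(embed(p), behavior_vector) >= threshold for p in phrases))
```

With these toy vectors, the phrase "writes binary" is labeled as containing the behavior "drops executable" even though no word is shared, while "sends email" is not. The label stays binary regardless of how far above the threshold the similarity lies.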
  • Using documents labeled as including the attacker's behavior as training data, the learning unit 15d trains the model 14a, which outputs the degree of relevance of a document to the attacker's behavior (see S4 in FIG. 2). Specifically, the learning unit 15d learns the relationship between the natural-sentence phrases extracted from the learning target documents and the behavior of the attacker to generate the model 14a. The learning unit 15d stores the generated model 14a in the storage unit 14 as a trained model.
  • the learning unit 15d learns, as the model 14a, a neural search model for performing ad-hoc searches using dynamic search queries.
  • the model 14a can be implemented by CKNRM (Convolutional Kernel Neural Ranking Model), but is not particularly limited.
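For readers unfamiliar with kernel-based neural ranking, the kernel-pooling step used by K-NRM-style models can be sketched as follows. This shows only that one step, without the convolutional n-gram layers that CKNRM adds and without the final learned ranking layer; the kernel centers `mus`, the width `sigma`, and the function names are illustrative assumptions, and nothing here is taken from the application's actual implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def kernel_pooling(query_vecs, doc_vecs, mus=(1.0, 0.5, 0.0, -0.5), sigma=0.1):
    """Return one soft-match feature per Gaussian kernel.

    For each kernel, the soft match count of every query term against all
    document terms is computed, log-scaled, and summed over query terms.
    """
    features = []
    for mu in mus:
        total = 0.0
        for q in query_vecs:
            soft_count = sum(
                math.exp(-((cosine(q, d) - mu) ** 2) / (2 * sigma ** 2))
                for d in doc_vecs
            )
            # Clamp before the logarithm to avoid log(0).
            total += math.log(max(soft_count, 1e-10))
        features.append(total)
    return features
```

Each kernel counts how many query-document term pairs have a similarity near its center, so the kernel centered at 1.0 behaves like a soft exact-match count. A ranking model would feed these features into a learned layer to produce the relevance score.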
  • The learning unit 15d trains the model 14a by pairwise learning of the relationship among the attacker behavior used as a search query in the processing of the search unit 15e described later, a document that includes that behavior, and a document that does not include it.
  • In this way, the search device 10 can easily generate a highly accurate model 14a by creating and learning from a sufficient amount of weak teacher data, that is, data assigned weak teacher labels that are not true labels. Note that learning with such weak teacher labels carries a risk of erroneous learning. However, even a small dataset with correct labels is useful, because a small number of correct labels may correct the erroneous learning.
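Pairwise learning on (query, document with the behavior, document without the behavior) triples can be illustrated with a deliberately simple model. The sketch below substitutes a two-feature linear scorer trained with a hinge loss for the neural ranking model; the feature definitions, hyperparameters, and function names are assumptions chosen to keep the example small.

```python
def features(query, doc):
    """Hypothetical features: query-word overlap and normalized doc length."""
    q_words = set(query.lower().split())
    d_words = doc.lower().split()
    overlap = len(q_words & set(d_words)) / len(q_words) if q_words else 0.0
    length = min(len(d_words), 50) / 50.0
    return [overlap, length]


def score(w, x):
    """Linear relevance score."""
    return sum(wi * xi for wi, xi in zip(w, x))


def train_pairwise(triples, epochs=50, lr=0.1, margin=1.0):
    """Train so that the positive document outscores the negative one.

    Each triple is (query, positive_doc, negative_doc). The weights are
    updated only while the score gap is smaller than the margin, which is
    the subgradient step for the pairwise hinge loss.
    """
    w = [0.0, 0.0]
    for _ in range(epochs):
        for query, pos_doc, neg_doc in triples:
            x_pos = features(query, pos_doc)
            x_neg = features(query, neg_doc)
            if score(w, x_pos) - score(w, x_neg) < margin:
                w = [wi + lr * (p - n) for wi, p, n in zip(w, x_pos, x_neg)]
    return w
```

Because the loss depends only on the score difference within each pair, the binary weak labels are sufficient: the model never needs a graded relevance value for any single document.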
  • The search unit 15e uses the trained model 14a to output the degree of relevance of each search-target document to the attacker's behavior, and thereby searches the target documents for documents containing that behavior. Specifically, the search unit 15e inputs the natural sentences extracted from the documents to be searched and arbitrary attacker behaviors to the trained model 14a (see S11 to S14 in FIG. 2). The search unit 15e then outputs, as the search result, a ranking of the documents in descending order of how many of the attacker behaviors used in the search they contain, to the output unit 12 or the like (see S16 in FIG. 2).
  • the search unit 15e uses a set of multiple behaviors of an attacker as a search query, and searches for documents including the behavior of the search query from the documents to be searched (see the search phase in FIG. 2). In this way, the search unit 15e can search for documents containing behaviors of multiple attackers from the documents to be searched for multiple search queries.
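A search over several behavior queries at once can be sketched as follows. The `relevance` function is a hypothetical word-overlap stand-in for the trained model 14a, and averaging the per-query scores is one assumed way of combining them; the application does not specify the aggregation.

```python
def relevance(query, doc):
    """Hypothetical stand-in for the trained model: query-word overlap."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0


def search(queries, documents):
    """Rank documents by mean relevance over all behavior queries.

    `documents` maps a document identifier to its extracted text. Ties are
    broken by identifier so that the ordering is deterministic.
    """
    if not queries:
        return []
    scored = []
    for doc_id, text in documents.items():
        scores = [relevance(q, text) for q in queries]
        scored.append((sum(scores) / len(scores), doc_id))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(doc_id, s) for s, doc_id in scored]
```

Averaging gives every query the same weight, so a document that matches all behaviors ranks above one that matches a single behavior very strongly.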
  • The search query is not limited to an attacker's behavior described in natural sentences (see S14 in FIG. 2). An IoC, such as a hash value of malware, may also be input as a search query. In that case, the extraction unit 15b expands the input IoC information with the descriptions of the signatures whose conditions were satisfied in the IoC analysis, so that it is input to the model 14a as a search query in natural sentences (see S13 in FIG. 2).
  • the search device 10 can also search for attacks with different IoC values due to differences in attack timing, malware customization, and the like. In this way, the search device 10 can efficiently use information obtained at the security incident response site to search for information desired by the user without omission. Therefore, according to the search device 10, it is possible to easily search for related security incidents without overlooking them, and it is possible to reduce the cost of searching for similar incidents and efficiently prevent recurrence.
  • FIG. 3 shows the processing procedure of the learning phase.
  • the flowchart of FIG. 3 is started, for example, when an input instructing the start of the learning phase is received.
  • the collection unit 15a collects documents for learning (step S1). Also, the extraction unit 15b extracts natural sentences from the collected documents (step S2).
  • When the similarity between a phrase included in an extracted natural sentence and a natural sentence representing the behavior of an attacker is equal to or greater than a predetermined threshold, the generation unit 15c generates a label to be added to the document indicating that the attacker's behavior is included (step S3).
  • Specifically, the generation unit 15c vectorizes the semantic features of the phrases included in the natural sentences and of the behavior of the attacker.
  • The generation unit 15c then calculates the degree of similarity between each vectorized phrase and the behavior of the attacker and, if the degree of similarity is equal to or greater than a predetermined threshold, generates a label indicating that the natural sentence extracted from the document contains the attacker's behavior.
  • processing up to this point may be performed in advance prior to the processing of the learning unit 15d below.
  • The learning unit 15d trains the model 14a, which assigns a label indicating that a document includes the behavior of the attacker, using documents labeled as including the behavior of the attacker as training data (step S4). This completes the learning phase.
  • FIG. 4 shows the processing procedure of the search phase.
  • the flowchart in FIG. 4 is started, for example, when an input instructing the start of the search phase is received.
  • the collection unit 15a collects documents for search (step S11). For example, the collection unit 15a collects security reports.
  • the extraction unit 15b extracts natural sentences from the collected documents (step S12). For example, the extracting unit 15b extracts the IoC information from the security report, and adds the attacker's behavior in natural sentences to the security report using the signature description linked to the IoC analysis result.
  • processing up to this point may be performed in advance prior to the processing of the search unit 15e described below.
  • The search unit 15e uses the trained model 14a to output the degree of relevance of each search-target document to the attacker's behavior, and thereby searches the target documents for documents containing the attacker's behavior (step S14). The search unit 15e then outputs the search result (step S16). This completes the search phase.
  • As described above, the collection unit 15a collects documents describing information about attacks, and the extraction unit 15b extracts natural sentences from the collected documents. When the similarity between a phrase included in an extracted natural sentence and a natural sentence representing the behavior of an attacker is equal to or greater than a predetermined threshold, the generation unit 15c generates a label to be added to the document indicating that the attacker's behavior is included. The learning unit 15d then trains the model 14a, which outputs the degree of relevance of a document to the attacker's behavior, using documents labeled as including the attacker's behavior as training data.
  • the search device 10 can create training data for the behavior of attackers with different semantic granularities.
  • the retrieval device 10 can also create teacher data for behaviors of attackers that have similar meanings but different expressions. Therefore, according to the search device 10, it is possible to highly accurately search for a document describing information about an attack by using the behavior of the attacker.
  • The search unit 15e uses the trained model 14a to output the degree of relevance of each search-target document to the attacker's behavior, and thereby searches for documents that contain the attacker's behavior.
  • the search device 10 can perform flexible searches, such as searches using the behaviors of attackers with arbitrary granularity or the behaviors of attackers with similar meanings but different expressions as search queries.
  • The collection unit 15a collects security reports, and the extraction unit 15b extracts natural sentences from the descriptions of signatures linked to the analysis results of IoCs used in attacks in the collected security reports. The search unit 15e then treats the security reports as the documents to be searched.
  • the search device 10 can perform a search with high accuracy using the behavior of the attacker as a search query from the security report.
  • the search unit 15e searches for documents containing behaviors of multiple attackers from the search target documents in response to multiple search queries.
  • By handling a plurality of search queries in parallel, the search device 10 can perform a more comprehensive search that is not biased toward any single search query.
  • For an IoC input as a search query, the search unit 15e searches the target documents using the natural sentences that the extraction unit 15b extracted from the descriptions of the signatures linked to the analysis results of that IoC.
  • the search device 10 can efficiently use information obtained at the security incident response site to search for information desired by the user.
  • the learning unit 15d learns the model 14a through pairwise learning of the behavior of the attacker, the document containing the behavior of the attacker, and the document not containing the behavior of the attacker. This enables the retrieval device 10 to prepare sufficient teacher data and easily learn the highly accurate model 14a.
  • the search device 10 can be implemented by installing a search program for executing the above search processing as package software or online software on a desired computer.
  • the information processing device can function as the search device 10 by causing the information processing device to execute the search program.
  • information processing devices include mobile communication terminals such as smartphones, mobile phones and PHS (Personal Handyphone Systems), and slate terminals such as PDAs (Personal Digital Assistants).
  • the functions of the search device 10 may be implemented in a cloud server.
  • FIG. 5 is a diagram showing an example of a computer that executes a search program.
  • Computer 1000 includes, for example, memory 1010, CPU 1020, hard disk drive interface 1030, disk drive interface 1040, serial port interface 1050, video adapter 1060, and network interface 1070. These units are connected by a bus 1080.
  • the memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012 .
  • the ROM 1011 stores a boot program such as BIOS (Basic Input Output System).
  • Hard disk drive interface 1030 is connected to hard disk drive 1031 .
  • Disk drive interface 1040 is connected to disk drive 1041 .
  • a removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1041, for example.
  • a mouse 1051 and a keyboard 1052 are connected to the serial port interface 1050, for example.
  • a display 1061 is connected to the video adapter 1060 .
  • the hard disk drive 1031 stores an OS 1091, application programs 1092, program modules 1093 and program data 1094, for example. Each piece of information described in the above embodiment is stored in the hard disk drive 1031 or the memory 1010, for example.
  • the search program is stored in the hard disk drive 1031 as a program module 1093 in which commands to be executed by the computer 1000 are written, for example.
  • the hard disk drive 1031 stores a program module 1093 that describes each process executed by the search device 10 described in the above embodiment.
  • Data used for information processing by the search program is stored as program data 1094 in the hard disk drive 1031, for example. Then, the CPU 1020 reads out the program module 1093 and the program data 1094 stored in the hard disk drive 1031 to the RAM 1012 as necessary, and executes each procedure described above.
  • Note that the program module 1093 and program data 1094 related to the search program are not limited to being stored in the hard disk drive 1031.
  • For example, they may be stored in a removable storage medium and read by the CPU 1020 via the disk drive 1041 or the like.
  • Alternatively, the program module 1093 and program data 1094 related to the search program may be stored in another computer connected via a network such as a LAN (Local Area Network) or WAN (Wide Area Network), and read by the CPU 1020 via the network interface 1070.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A collection unit (15a) collects documents describing information about attacks. An extraction unit (15b) extracts natural sentences from the collected documents. When the similarity between a phrase included in an extracted natural sentence and a natural sentence representing an attacker's behavior is greater than or equal to a predetermined threshold, a generation unit (15c) generates a label to be added to the document indicating that the document includes the attacker's behavior. Using documents labeled as including the attacker's behavior as training data, a learning unit (15d) trains a model (14a) that outputs the degree of relevance of a document to the attacker's behavior.
PCT/JP2021/006696 2021-02-22 2021-02-22 Search device, search method, and search program WO2022176209A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2021/006696 WO2022176209A1 (fr) 2021-02-22 2021-02-22 Search device, search method, and search program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2021/006696 WO2022176209A1 (fr) 2021-02-22 2021-02-22 Search device, search method, and search program

Publications (1)

Publication Number Publication Date
WO2022176209A1 (fr) 2022-08-25

Family

ID=82930559

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/006696 WO2022176209A1 (fr) 2021-02-22 2021-02-22 Search device, search method, and search program

Country Status (1)

Country Link
WO (1) WO2022176209A1 (fr)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016200978A (ja) * 2015-04-10 2016-12-01 株式会社日立製作所 Teacher data generation device
WO2017146094A1 (fr) * 2016-02-24 2017-08-31 日本電信電話株式会社 Attack code detection device, attack code detection method, and attack code detection program
WO2019053844A1 (fr) * 2017-09-14 2019-03-21 三菱電機株式会社 Email inspection device, email inspection method, and email inspection program

Similar Documents

Publication Publication Date Title
Piplai et al. Creating cybersecurity knowledge graphs from malware after action reports
US10417350B1 (en) Artificial intelligence system for automated adaptation of text-based classification models for multiple languages
CN110837550B (zh) Knowledge graph-based question answering method, apparatus, electronic device, and storage medium
US10740545B2 (en) Information extraction from open-ended schema-less tables
Long et al. Collecting indicators of compromise from unstructured text of cybersecurity articles using neural-based sequence labelling
Giasemidis et al. A semi-supervised approach to message stance classification
US10394868B2 (en) Generating important values from a variety of server log files
US20180260382A1 (en) Domain-specific method for distinguishing type-denoting domain terms from entity-denoting domain terms
Xu et al. Signature based trouble ticket classification
US11650996B1 (en) Determining query intent and complexity using machine learning
Alves et al. Leveraging BERT's Power to Classify TTP from Unstructured Text
US10614100B2 (en) Semantic merge of arguments
Pal et al. Exploring the limits of transfer learning with unified model in the cybersecurity domain
US20210263732A1 (en) Context-based word embedding for programming artifacts
Alneyadi et al. A semantics-aware classification approach for data leakage prevention
US9910889B2 (en) Rapid searching and matching of data to a dynamic set of signatures facilitating parallel processing and hardware acceleration
WO2022176209A1 (fr) Dispositif de recherche, procédé de recherche et programme de recherche
Alsmadi et al. Issues related to the detection of source code plagiarism in students assignments
Bordes et al. Label ranking under ambiguous supervision for learning semantic correspondences
Assaggaf et al. Development of Graph-Based Knowledge on Ransomware Attacks Using Twitter Data
Guo et al. Predicting missing information of key aspects in vulnerability reports
Othman et al. VULDAT: Automated Vulnerability Detection from Cyberattack Text
Ghosh et al. Social media sentiment analysis on third booster dosage for COVID-19 vaccination: a holistic machine learning approach
Oswal Identifying and categorizing offensive language in social media
Butcher Contract Information Extraction Using Machine Learning

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21926638

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21926638

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP