WO2022064579A1 - Classification device, classification method, and classification program - Google Patents

Classification device, classification method, and classification program

Info

Publication number
WO2022064579A1
Authority
WO
WIPO (PCT)
Prior art keywords
document
threat
unit
classification
pseudo
Prior art date
Application number
PCT/JP2020/035873
Other languages
French (fr)
Japanese (ja)
Inventor
麿与 山嵜
Original Assignee
日本電信電話株式会社
Priority date
Filing date
Publication date
Application filed by 日本電信電話株式会社
Priority to PCT/JP2020/035873
Publication of WO2022064579A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation

Definitions

  • the present invention relates to a classification device, a classification method, and a classification program.
  • noise reduction methods have also been proposed in the field of image processing, including noise model networks for reducing the noise caused by False Positive and False Negative labels in multi-label tasks.
  • the conventional method of extracting the behavior of an attacker has the following problems.
  • in the rule-based extraction method, a manually created knowledge base is used, so the cost is high, and it may be difficult to secure a sufficient number of rules because of the processing load and cost imposed on the creator. Therefore, with the rule-based extraction method, the accuracy of behavior extraction may decrease.
  • the extraction method using supervised learning has the problem that the available teacher data is scarce. For example, it is conceivable to use Web pages hyperlinked from the ATT&CK knowledge base as teacher data. However, the amount of data obtained in this way is small. In addition, since such data was not constructed as teacher data, it may contain a large amount of False Negative data, that is, data to which the label that should originally be assigned is not attached. Therefore, even with this method, the accuracy of behavior extraction may decrease.
  • the present invention has been made in view of the above, and an object thereof is to improve cyber security capability.
  • the pseudo-teacher creation unit creates pseudo-teacher data, based on the pseudo-teacher data creation rules, from a threat document containing a description of cyber threats.
  • the learning unit learns the noise model and the classification model in parallel, using their mutual relationship, based on the threat document and the pseudo-teacher data created by the pseudo-teacher creation unit.
  • the document classification unit classifies an input threat document to be classified, using the classification model generated by the learning of the learning unit.
  • the cyber security capability can be improved.
  • FIG. 1 is a block diagram of a classification device according to the first embodiment.
  • FIG. 2 is a block diagram showing details of the pseudo-teacher extraction unit.
  • FIG. 3 is a diagram showing an example of a rule for creating pseudo teacher data.
  • FIG. 4 is a block diagram showing details of the learning unit.
  • FIG. 5 is a flowchart of the learning process.
  • FIG. 6 is a flowchart of a pseudo-teacher data generation process by the pseudo-teacher extraction unit.
  • FIG. 7 is a flowchart of a classification model generation process by learning using pseudo teacher data in consideration of noise by the learning unit.
  • FIG. 8 is a flowchart of the threat document classification process by the classification device.
  • FIG. 9 is a diagram showing an example of a computer that executes a classification program.
  • FIG. 1 is a block diagram of a classification device.
  • the classification device 1 is an information processing device such as a server.
  • the classification device 1 is a device that classifies the behavior of an attacker with respect to an information processing system.
  • the classification device 1 has a model learning unit 10 and an inference unit 20.
  • the model learning unit 10 performs machine learning to create a classification model that classifies the behavior of an attacker.
  • the model learning unit 10 has a document collection unit 101, a threat document extraction unit 102, a pseudo-teacher extraction unit 103, and a learning unit 104.
  • the document collection unit 101 collects a set of documents from the WEB 2 and the like. Then, the document collection unit 101 outputs the collected document set to the threat document extraction unit 102.
  • the threat document extraction unit 102 receives the input of the document set from the document collection unit 101. Next, the threat document extraction unit 102 extracts threat documents related to cyber threats from the document set. For example, the threat document extraction unit 102 classifies threat documents and other documents by using the structure, rules, statistical methods, and the like of the WEB site from which each document included in the document set is acquired. After that, the threat document extraction unit 102 outputs the extracted threat document to the pseudo-teacher extraction unit 103.
  • the pseudo-teacher extraction unit 103 receives the input of the threat document extracted from the document set from the threat document extraction unit 102. Next, the pseudo-teacher extraction unit 103 generates pseudo-teacher data from the threat document. After that, the pseudo-teacher extraction unit 103 outputs the threat document to the learning unit 104 together with the generated pseudo-teacher data.
  • FIG. 2 is a block diagram showing details of the pseudo-teacher extraction unit.
  • the pseudo-teacher extraction unit 103 includes a sentence division unit 131, a token division unit 132, a rule conforming unit 133, a synthesis rule conforming unit 134, a convergence test unit 135, an external knowledge reference unit 136, and a pseudo-teacher generation unit 137.
  • the sentence division unit 131 acquires the threat document input from the threat document extraction unit 102. Then, the sentence division unit 131 divides the threat document into sentence units. After that, the sentence division unit 131 outputs each sentence obtained by dividing the threat document to the token division unit 132 and the rule conforming unit 133.
  • the token division unit 132 receives the input of each sentence obtained by dividing the threat document from the sentence division unit 131. Then, the token dividing unit 132 divides the acquired sentence into token units.
  • a token is a character string consisting of one or more characters in the character sequence of a threat document, and is the smallest unit of character string that carries meaning. After that, the token division unit 132 outputs each token obtained by dividing each sentence included in the threat document to the rule conforming unit 133.
  • the rule conforming unit 133 receives the input of each sentence obtained by dividing the threat document from the sentence dividing unit 131. Further, the rule conforming unit 133 receives the input of each token obtained by dividing each sentence included in the threat document from the token dividing unit 132.
  • as described above, the threat documents that are the source of the sentences and tokens acquired by the rule conforming unit 133 have been identified from the document set by using the structure, rules, statistical methods, and the like of the WEB site from which each document was acquired. Therefore, rules can be created on the premise of a threat document, and the rule conforming unit 133 holds rules for creating pseudo teacher data that were created on that premise.
  • the rule conforming unit 133 holds, as rules for creating teacher data, token string matching, regular expression matching, matching using a combination with the syntactic structure, and matching that combines external knowledge such as the NVD (National Vulnerability Database) with token string matching, regular expression matching, or the syntactic structure.
  • the rules for creating teacher data are not limited to these, and the rule conforming unit 133 may use other rules as long as the rules are created on the premise of a threat document.
  • FIG. 3 is a diagram showing an example of a rule for creating pseudo teacher data.
  • each rule in FIG. 3 is a rule stating that, when a threat document matches the condition on the left side of the arrow, the description of the attacker's behavior on the right side of the arrow exists in that threat document. That is, when a certain threat document is used as input data, the attacker's behavior described in that threat document, which is the output data, can be identified by using a pseudo-teacher data creation rule whose condition the threat document satisfies. Therefore, by applying the pseudo-teacher data creation rules to threat documents, pseudo-teacher data consisting of input data and output data can be generated.
  • the rule whose type is a phrase is an example of a creation rule corresponding to token string matching.
  • a rule whose type is a regular expression is an example of a creation rule corresponding to regular expression matching.
  • a rule whose type is CVE (Common Vulnerabilities and Exposures) or CPE (Common Platform Enumeration) is an example of a creation rule corresponding to matching that combines external knowledge such as the NVD with token string matching, regular expression matching, or the syntactic structure.
  • the rule whose type is a logical operation is a rule that, when all the creation rules on the left side of the arrow are satisfied, assigns the attacker's behavior on the right side to the sentence as pseudo teacher data.
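The following is a minimal sketch of how pseudo-teacher data creation rules of the kinds listed above (token string matching, regular expression matching, and logical-operation synthesis) could be applied to a sentence. The rule tables, clue labels, and technique labels are illustrative assumptions, not rules taken from the patent.

```python
import re

# Hypothetical pseudo-teacher creation rules mirroring the rule types of FIG. 3:
# a phrase rule fires on a token string match, a regex rule on a regular-expression
# match, and a logical-operation rule combines the results of other rules.
PHRASE_RULES = {
    "sql injection": "x_sql_injection_clue",   # clue only, prefixed with "x"
    "spearphishing attachment": "T1566.001",   # illustrative behavior label
}
REGEX_RULES = {
    r"\bCVE-\d{4}-\d{4,7}\b": "x_cve_mention",
}
LOGICAL_RULES = [
    # if both clue labels fired on the same sentence, assign the behavior label
    ({"x_sql_injection_clue", "x_cve_mention"}, "T1190"),
]

def match_sentence(sentence: str, tokens: list[str]) -> set[str]:
    """Apply the creation rules to one sentence and return the pseudo labels."""
    labels: set[str] = set()
    lowered = " ".join(tokens).lower()
    for phrase, label in PHRASE_RULES.items():
        if phrase in lowered:                  # token string matching
            labels.add(label)
    for pattern, label in REGEX_RULES.items():
        if re.search(pattern, sentence):       # regular expression matching
            labels.add(label)
    for required, label in LOGICAL_RULES:      # synthesis by logical operation
        if required <= labels:
            labels.add(label)
    return labels
```

In this sketch, labels prefixed with x would be treated as clues only and dropped from the final pseudo-teacher data, as described for FIG. 3.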
  • the rule conforming unit 133 outputs, to the external knowledge reference unit 136, an acquisition request for the observation information used to determine whether a teacher data creation rule is applicable.
  • the observation information used for the conformity determination with the teacher data creation rules includes, for example, clues suggesting each attacker's behavior such as SQL (Structured Query Language) injection, the names of programs and files used to execute each behavior, API (Application Programming Interface) and function names, command names, and external information associated with CVEs (Common Vulnerabilities and Exposures).
  • the rule conforming unit 133 acquires, from the external knowledge reference unit 136, the observation information used for the conformity determination with the teacher data creation rules.
  • the rule conforming unit 133 applies the observation information acquired from the external knowledge reference unit 136 to the teacher data creation rules and determines whether each teacher data creation rule is applicable at the sentence level and the token level of the threat document.
  • the conformity determination of a teacher data creation rule for a certain threat document means judging whether the attacker's behavior can be obtained for that threat document by using the teacher data creation rule under determination.
  • the rule conforming unit 133 outputs the teacher data creation rules determined to be applicable to the synthesis rule conforming unit 134, together with the sentences and tokens included in the threat document subject to the determination.
  • the rule conforming unit 133 performs this conformity determination of the teacher data creation rules for all the threat documents corresponding to the acquired sentences and tokens, and outputs the teacher data creation rules determined to be applicable to the synthesis rule conforming unit 134, together with the sentences and tokens included in the corresponding threat documents.
  • the synthesis rule conforming unit 134 receives, from the rule conforming unit 133, the input of the teacher data creation rules applicable to each threat document and the sentences and tokens included in the corresponding threat document. Next, the synthesis rule conforming unit 134 generates a synthesis rule for each threat document by a logical operation over the teacher data creation rules determined to be applicable by the rule conforming unit 133.
  • the synthesis rule conforming unit 134 outputs, to the external knowledge reference unit 136, an acquisition request for the observation information used to determine whether the generated synthesis rule is applicable.
  • the observation information used for the conformity determination with the synthesis rules includes, in addition to the information acquired by the rule conforming unit 133, clues suggesting each attacker's behavior, the names of programs and files used to execute each behavior, API and function names, command names, and external information associated with CVEs.
  • the synthesis rule conforming unit 134 acquires observation information used for determining conformability using the synthesis rule from the external knowledge reference unit 136.
  • the synthesis rule conforming unit 134 uses the observation information acquired from the external knowledge reference unit 136 for the synthesis rule to determine whether or not the synthesis rule is conformable at the sentence level and the token level of the threat document.
  • the rule whose type is a logical operation in FIG. 3 is an example of a synthesis rule that determines conformability by a logical operation as a result of a plurality of rule conformances.
  • a behavior whose name starts with x does not directly correspond to an attacker's behavior, but is information used to assign an attacker's behavior according to a synthesis rule.
  • a behavior starting with x is, for example, information indicating a match for SQL injection in a rule whose type is a phrase, or a predetermined behavior in a rule whose type is CVE.
  • when it is determined that a synthesis rule is applicable, the synthesis rule conforming unit 134 outputs the synthesis rule determined to be applicable to the convergence test unit 135, together with the sentences and tokens included in the threat document subject to the determination. Further, the synthesis rule conforming unit 134 outputs the teacher data creation rules applicable to each threat document, acquired from the rule conforming unit 133, to the convergence test unit 135.
  • when a new synthesis rule is generated, the synthesis rule conforming unit 134 determines whether the new synthesis rule is applicable to the threat document for which it was generated. After that, the synthesis rule conforming unit 134 outputs the synthesis rules determined to be applicable, including the new synthesis rule, to the convergence test unit 135, together with the sentences and tokens included in the threat document subject to the determination.
  • the external knowledge reference unit 136 refers to an external knowledge group by being connected to an external server (not shown) or the like in which knowledge is accumulated.
  • the external knowledge reference unit 136 receives a request for acquisition of observation information from the rule conforming unit 133 and the synthesis rule conforming unit 134. Then, the external knowledge reference unit 136 acquires the observation information used for the designated pseudo-teacher data creation rule and the synthesis rule from the external knowledge group. Then, the external knowledge reference unit 136 outputs the acquired observation information to the rule conforming unit 133 or the synthesis rule conforming unit 134, which is the transmission source of the acquisition request.
  • the convergence test unit 135 receives, from the synthesis rule conforming unit 134, the input of the synthesis rules applicable to each threat document and the sentences and tokens included in the corresponding threat document.
  • when a threat document conforms to a certain synthesis rule, a new synthesis rule that may become applicable can arise.
  • for example, a rule generated by a logical operation between an existing synthesis rule and an original teacher data creation rule may become a new synthesis rule.
  • the convergence test unit 135 determines whether or not a new attacker's behavior is added to the behavior set after the synthesis rule is applied to the behavior set that collects the attacker's behavior included in the threat document.
  • when a new attacker's behavior is added, the convergence test unit 135 determines that the threat document may conform to a synthesis rule involving the newly added behavior. Then, the convergence test unit 135 requests the synthesis rule conforming unit 134 to determine the applicability of synthesis rules involving the new attacker's behavior.
  • when no new attacker's behavior is added, the convergence test unit 135 determines that the conformity test of the synthesis rules has converged. After that, the convergence test unit 135 outputs the teacher data creation rules and synthesis rules applicable to each threat document to the pseudo-teacher generation unit 137, together with the sentences and tokens included in the corresponding threat document.
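The interaction between the synthesis rule conforming unit 134 and the convergence test unit 135 described above amounts to a fixed-point iteration: synthesis rules are applied repeatedly until no new attacker's behavior is added. A minimal sketch, assuming synthesis rules are represented as (required label set, label to add) pairs as in the previous example:

```python
def generate_behavior_set(initial_labels: set[str],
                          synthesis_rules: list[tuple[set[str], str]]) -> set[str]:
    """Repeatedly apply synthesis rules until no new behavior label is added.

    The loop corresponds to the convergence test: as long as a rule adds a new
    behavior, all rules are re-checked; when nothing changes, the conformity
    test has converged.
    """
    behaviors = set(initial_labels)
    while True:
        added = False
        for required, label in synthesis_rules:
            if required <= behaviors and label not in behaviors:
                behaviors.add(label)   # a new behavior may enable further rules
                added = True
        if not added:                  # convergence: no rule adds anything new
            return behaviors
```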
  • the pseudo-teacher generation unit 137 receives input of teacher data creation rules and synthesis rules suitable for each threat document, as well as sentences and tokens included in the corresponding threat document from the convergence test unit 135. Then, the pseudo-teacher generation unit 137 generates pseudo-teacher data for the sentences and tokens included in the threat document by using the teacher data creation rule and the synthesis rule suitable for the threat document. After that, the pseudo-teacher generation unit 137 outputs the generated pseudo-teacher data to the learning unit 104. Further, the pseudo-teacher extraction unit 103 outputs a threat document to the learning unit 104 in addition to the generated pseudo-teacher data.
  • the learning unit 104 receives the input of the threat document and the pseudo-teacher data from the pseudo-teacher extraction unit 103.
  • since pseudo teacher data is used, noise exists in the teacher data. Therefore, the learning unit 104 generates a classification model by performing learning that takes the noise into consideration using the pseudo teacher data. After that, the learning unit 104 outputs the generated classification model to the document classification unit 203 of the inference unit 20.
  • FIG. 4 is a block diagram showing details of the learning unit.
  • the learning unit 104 has a document expression layer 141, a label classification layer 142, a token expression layer 143, an encode layer 144, a certainty layer 145, a noise distribution layer 146, and an observation label classification layer 147.
  • the document expression layer 141 acquires the threat document output from the pseudo-teacher extraction unit 103. Then, the document expression layer 141 acquires the characteristics of each threat document as a whole. The characteristics of the entire threat document are information that expresses the meaning of the threat document.
  • the document representation layer 141 generates one vector for one threat document. After that, the document expression layer 141 outputs the characteristics of each threat document as a whole to the label classification layer 142.
  • the document expression layer 141 can be realized by using, for example, BERT (Bidirectional Encoder Representations from Transformers). However, the method of realizing the document expression layer 141 is not limited to BERT, and an implementation with another configuration can be used.
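As one possible realization (the patent names BERT only as an example), the document expression layer could be sketched with the Hugging Face transformers library as below; the model name and the use of the [CLS] vector as the single document vector are assumptions.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # model choice is an assumption
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()

def document_vector(threat_document: str) -> torch.Tensor:
    """Return one vector for one threat document (document expression layer 141)."""
    inputs = tokenizer(threat_document, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state[:, 0, :]  # [CLS] embedding, shape (1, hidden_size)
```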
  • the token expression layer 143 acquires the threat document output from the pseudo-teacher extraction unit 103. Then, the token expression layer 143 extracts the token included in each threat document. Here, the token expression layer 143 may acquire the token included in each threat document created by the pseudo-teacher extraction unit 103.
  • the token expression layer 143 acquires the characteristics of each token included in each threat document.
  • the characteristics of a token are information that expresses the meaning of the token.
  • since each token is different, the characteristics of the tokens also differ. Because the token expression layer 143 generates one vector per token, as many vectors as the number of tokens contained in a threat document are generated for that document. After that, the token expression layer 143 outputs the characteristics of each token to the encode layer 144.
  • the token expression layer 143 can also be realized by using BERT, for example, like the document expression layer 141. However, the method of realizing the token expression layer 143 is not limited to BERT, and implementation by other configurations can be used.
  • the label classification layer 142 receives input of the overall characteristics of the threat document from the document expression layer 141. Then, the label classification layer 142 performs machine learning using a neural network with the overall characteristics of the threat document as an input, and predicts a true label corresponding to the behavior of the attacker given to the threat document.
  • the label is information indicating what kind of attacker's behavior is included in each threat document. When there are K labels, the label classification layer 142 generates a label probability, which is a vector representing the probability that each of the K labels is included.
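The patent does not specify the network used by the label classification layer 142; a minimal sketch under the assumption of a single linear layer with a per-label sigmoid output (multi-label classification) could look like this:

```python
import torch
import torch.nn as nn

class LabelClassificationLayer(nn.Module):
    """Maps the document vector to K independent label probabilities."""

    def __init__(self, doc_dim: int, num_labels: int):
        super().__init__()
        self.head = nn.Linear(doc_dim, num_labels)

    def forward(self, doc_vec: torch.Tensor) -> torch.Tensor:
        # doc_vec: (batch, doc_dim) -> (batch, K) label probabilities
        return torch.sigmoid(self.head(doc_vec))
```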
  • the encode layer 144 is a layer that converts a variable length token sequence into a fixed length.
  • the encode layer 144 receives an input of a variable-length vector representing the characteristics of each token from the token expression layer 143. Then, the encode layer 144 converts the characteristics of each acquired token into a fixed length. Next, the encode layer 144 integrates the information representing the fixed-length token characteristics to generate one vector. Further, the encode layer 144 converts the generated vector into a 4K-dimensional vector, where K is the number of labels. After that, the encode layer 144 outputs one fixed-length, 4K-dimensional vector collectively representing the characteristics of each token to the certainty layer 145.
  • the encode layer 144 can be realized by using, for example, an RNN (Recurrent Neural Network). However, the method of realizing the encode layer 144 is not limited to the RNN, and it is also possible to use an implementation with another configuration.
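A sketch of the encode layer 144 under the assumption of a GRU (the patent only requires some RNN-like realization) whose final hidden state is projected to 4K dimensions:

```python
import torch
import torch.nn as nn

class EncodeLayer(nn.Module):
    """Reads a variable-length sequence of token feature vectors and returns one
    fixed-length, 4K-dimensional vector per document (K = number of labels).
    The GRU and the hidden size are assumptions."""

    def __init__(self, token_dim: int, num_labels: int, hidden_dim: int = 256):
        super().__init__()
        self.rnn = nn.GRU(token_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, 4 * num_labels)

    def forward(self, token_features: torch.Tensor) -> torch.Tensor:
        # token_features: (batch, num_tokens, token_dim); num_tokens varies per document
        _, h_n = self.rnn(token_features)   # h_n: (1, batch, hidden_dim)
        return self.proj(h_n.squeeze(0))    # (batch, 4K)
```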
  • the certainty layer 145 is a convolution layer representing the certainty of conversion from the true value (0 or 1) of each label to the value (0 or 1) of the observation label which is a label containing noise.
  • the certainty is, for each of the K labels, the certainty of the transition from each value of the true label to each value of the observation label. There are four transition patterns because the true label takes two values and the observation label also takes two values.
  • the certainty layer 145 has 4K kernels of size 1×1 and bias parameters.
  • the certainty layer 145 receives an input of one fixed-length, 4K-dimensional vector that collectively represents the characteristics of each token from the encode layer 144. Then, the certainty layer 145 applies to the acquired vector a convolution operation using the kernels and bias parameters it holds, and generates a 2×2×K tensor representing the certainty.
  • the certainty layer 145 outputs the generated 2×2×K tensor representing the certainty to the noise distribution layer 146.
  • the noise distribution layer 146 receives an input of the certainty represented by a 2×2×K tensor from the certainty layer 145. Then, the noise distribution layer 146 normalizes the acquired certainty with the Softmax function to obtain the noise distribution represented by a 2×2×K tensor.
  • the noise distribution is a probability distribution that represents, for a threat document, the transition probability with which each of the K true labels transitions to each of the K observation labels. After that, the noise distribution layer 146 outputs the noise distribution represented by the 2×2×K tensor to the observation label classification layer 147.
  • the observation label classification layer 147 receives the input of the noise distribution created by the noise distribution layer 146 from the output of the token expression layer 143 representing the characteristics of each token. Further, the observation label classification layer 147 obtains, from the label classification layer 142, the label probability, which is a vector representing the probability that each true label is included in each threat document. Next, the observation label classification layer 147 calculates the sum of the output of the label classification layer 142 weighted by the output of the noise distribution layer 146, thereby predicting the noisy observation labels to be given to each threat document. Then, when there are K labels, the observation label classification layer 147 generates the observation label probability, which is a vector representing the probability that each of the K observation labels is included in each threat document.
  • the document expression layer 141 and the label classification layer 142 in the learning unit 104 correspond to the classification model that calculates the appearance probability of the attacker's behaviors included in a threat document and classifies the threat document. Further, the token expression layer 143, the encode layer 144, the certainty layer 145, the noise distribution layer 146, and the observation label classification layer 147 in the learning unit 104 constitute the noise model used to reduce noise when learning the classification model.
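A minimal sketch of the noise model (certainty layer 145, noise distribution layer 146, and observation label classification layer 147). The exact arrangement of the 1×1 convolution and the tensor shapes are assumptions consistent with the description: the 4K certainties are reshaped into a 2×2 transition matrix per label, normalized by Softmax into a noise distribution, and combined with the true-label probabilities by a weighted sum to predict the observation labels.

```python
import torch
import torch.nn as nn

class NoiseModel(nn.Module):
    """Certainty layer + noise distribution layer + observation label classification layer."""

    def __init__(self, num_labels: int):
        super().__init__()
        self.num_labels = num_labels
        # 4K kernels of size 1x1 with bias, applied to the 4K-dimensional encoding
        # treated as 4K channels of length 1.
        self.certainty = nn.Conv1d(4 * num_labels, 4 * num_labels, kernel_size=1, bias=True)

    def forward(self, encoding: torch.Tensor, label_probs: torch.Tensor) -> torch.Tensor:
        # encoding: (batch, 4K) from the encode layer
        # label_probs: (batch, K) true-label probabilities from the label classification layer
        batch = encoding.size(0)
        c = self.certainty(encoding.unsqueeze(-1)).squeeze(-1)   # certainty: (batch, 4K)
        c = c.view(batch, self.num_labels, 2, 2)                 # per-label 2x2 certainties
        # Noise distribution: P(observed value | true value), normalized over the observed axis.
        noise = torch.softmax(c, dim=-1)                         # (batch, K, 2, 2)
        p_true1 = label_probs
        p_true0 = 1.0 - label_probs
        # Weighted sum: P(obs=1) = P(true=0)*P(obs=1|true=0) + P(true=1)*P(obs=1|true=1)
        return p_true0 * noise[..., 0, 1] + p_true1 * noise[..., 1, 1]   # (batch, K)
```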
  • the actual learning by the learning unit 104 is carried out in the following flow. Since pseudo-teacher data is used, the labels that can be obtained from the threat documents are observation labels. Therefore, in order to properly learn the noise model in the observation label classification layer 147, the learning unit 104 first performs first-half learning, which adjusts the parameters of the label classification layer 142 to improve the prediction accuracy of the true labels, and then completes the classification model by learning that takes noise into consideration using the noise model. After the first-half learning is completed, the learning unit 104 shifts to the second-half learning.
  • in the first-half learning, the learning unit 104 uses the document expression layer 141 and the label classification layer 142, without using the layers from the token expression layer 143 to the observation label classification layer 147, and performs learning based on the pseudo teacher data while treating the observation labels as true labels. In this case, the learning unit 104 learns so as to minimize the error between the output of the label classification layer 142 and the observation labels. The learning unit 104 determines, for example, that the first-half learning is completed when the prediction that treats the observation labels as true labels has been performed a predetermined number of times.
  • in the second-half learning, the learning unit 104 performs learning based on the pseudo teacher data using all the layers from the document expression layer 141 to the observation label classification layer 147. In this case, the learning unit 104 learns so as to minimize the error between the output of the observation label classification layer 147 and the observation labels given by the pseudo teacher data. As a result, the learning unit 104 can realize, with the document expression layer 141 and the label classification layer 142, a classification model in which the influence of the noise contained in the pseudo teacher data is suppressed.
  • the learning unit 104 determines, for example, that the second-half learning is completed when the prediction of the true labels has been performed a predetermined number of times. After the second-half learning is completed, the learning unit 104 outputs the generated classification model to the document classification unit 203 of the inference unit 20.
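The two-phase learning described above can be sketched as follows. The optimizer, loss function, epoch counts, and the assumed model interface (returning both the label probability and the observation label probability) are assumptions for illustration.

```python
import torch
import torch.nn as nn

def train(model, loader, num_first_half_epochs: int, num_epochs: int):
    """First half: treat observation labels as true labels and fit the classification
    path (document expression layer + label classification layer). Second half: fit
    the full network so the noise model absorbs the False Positive / False Negative
    noise in the pseudo teacher data."""
    bce = nn.BCELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
    for epoch in range(num_epochs):
        for docs, observed_labels in loader:       # observed labels come from the pseudo teacher data
            label_probs, obs_probs = model(docs)   # assumed interface
            if epoch < num_first_half_epochs:
                loss = bce(label_probs, observed_labels)   # first-half learning
            else:
                loss = bce(obs_probs, observed_labels)     # second-half learning
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```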
  • the inference unit 20 classifies the behavior of the attacker included in the threat document of the input document using the classification model generated by the model learning unit 10.
  • the inference unit 20 has an input reception unit 201, a threat document extraction unit 202, a document classification unit 203, and an output unit 204.
  • the input reception unit 201 accepts the input of a document. Then, the input reception unit 201 outputs the received document to the threat document extraction unit 202.
  • the threat document extraction unit 202 receives the input of the document from the input reception unit 201. Next, the threat document extraction unit 202 extracts threat documents related to cyber threats from the acquired documents. The threat document extraction unit 202 classifies threat documents and other documents by using the structure, rules, statistical methods, etc. of the WEB site from which each document included in the document set is acquired. After that, the threat document extraction unit 202 outputs the extracted threat document to the document classification unit 203.
  • the document classification unit 203 receives the input of the threat document extracted from the input document from the threat document extraction unit 202. Further, the document classification unit 203 acquires the classification model generated by the learning unit 104 of the model learning unit 10. Then, the document classification unit 203 classifies threat documents using the acquired classification model. That is, the document classification unit 203 obtains the probability that a certain threat document includes the behavior of an attacker. After that, the document classification unit 203 outputs the classification result to the output unit 204.
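At inference time, only the classification model is needed; the noise model is not consulted. A small sketch, where the threshold and the helper names are assumptions:

```python
import torch

def classify_threat_document(model, threat_document: str,
                             label_names: list[str], threshold: float = 0.5) -> dict[str, float]:
    """Return the attacker's behaviors whose predicted probability exceeds the threshold."""
    with torch.no_grad():
        label_probs, _ = model([threat_document])  # assumed interface; noise model output ignored
    probs = label_probs.squeeze(0).tolist()
    return {name: p for name, p in zip(label_names, probs) if p >= threshold}
```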
  • the output unit 204 receives the classification result of the input document from the document classification unit 203. Then, the output unit 204 outputs the classification result of the input document to a monitor or the like, thereby notifying the user of the classification result. Specifically, the output unit 204 outputs information such as which attacker's behaviors are included in the threat documents within the input document and to what extent.
  • FIG. 5 is a flowchart of the learning process.
  • the document collection unit 101 acquires a document set by collecting documents from WEB 2 or the like (step S1). Next, the document collection unit 101 outputs the acquired document set to the threat document extraction unit 102.
  • the threat document extraction unit 102 receives the input of the document set from the document collection unit 101. Next, the threat document extraction unit 102 extracts a threat document related to the cyber threat from the document set (step S2). After that, the threat document extraction unit 102 outputs the extracted threat document to the pseudo-teacher extraction unit 103.
  • the pseudo-teacher extraction unit 103 receives the input of the threat document extracted from the document set from the threat document extraction unit 102. Next, the pseudo-teacher extraction unit 103 generates pseudo-teacher data from the threat document (step S3). After that, the pseudo-teacher extraction unit 103 outputs the generated pseudo-teacher data to the learning unit 104.
  • the learning unit 104 receives the input of the threat document, the token included in each threat document, and the pseudo teacher data from the pseudo teacher extraction unit 103. Next, the learning unit 104 performs learning in consideration of noise using pseudo teacher data to generate a classification model (step S4). After that, the learning unit 104 outputs the generated classification model to the document classification unit 203 of the inference unit 20.
  • FIG. 6 is a flowchart of a pseudo-teacher data generation process by the pseudo-teacher extraction unit.
  • the series of processes shown in the flowchart of FIG. 6 corresponds to an example of the processes executed in step S3 of the flowchart shown in FIG.
  • the sentence division unit 131 acquires the threat document input from the threat document extraction unit 102 (step S101).
  • the sentence division unit 131 divides the threat document into sentence units (step S102). After that, the sentence division unit 131 outputs each sentence obtained by dividing the threat document to the token division unit 132 and the rule conforming unit 133.
  • the token division unit 132 receives the input of each sentence obtained by dividing the threat document from the sentence division unit 131. Then, the token dividing unit 132 divides the acquired sentence into token units (step S103). After that, the token dividing unit 132 outputs each token obtained by dividing each sentence included in the threat document to the rule conforming unit 133.
  • the rule conforming unit 133 receives the input of each sentence obtained by dividing the threat document from the sentence division unit 131. Further, the rule conforming unit 133 receives the input of each token obtained by dividing each sentence included in the threat document from the token division unit 132. Then, the rule conforming unit 133 applies the information acquired through the external knowledge reference unit 136 to the teacher data creation rules it holds, and determines whether each teacher data creation rule is applicable at the sentence level and the token level of the threat document. As a result, the rule conforming unit 133 identifies the teacher data creation rules applicable to each threat document (step S104). After that, the rule conforming unit 133 outputs the teacher data creation rules determined to be applicable to the synthesis rule conforming unit 134, together with the sentences and tokens included in the threat document subject to the determination.
  • the synthesis rule conforming unit 134 receives input of the teacher data creation rule conforming to each threat document and the sentences and tokens included in the corresponding threat document from the rule conforming unit 133. Next, the synthesis rule conforming unit 134 generates a synthesis rule for each threat document by a logical operation of a teacher data creation rule that can be conformed by the rule conforming unit 133. Then, the synthesis rule conforming unit 134 determines whether or not the synthesis rule is conformable at the sentence level and the token level of the threat document by using the information acquired through the external knowledge reference unit 136 for the synthesis rule (step S105).
  • the synthesis rule conforming unit 134 outputs the synthesis rule determined to be conformable to the convergence test unit 135 together with the sentence and the token included in the threat document targeted for the determination. Further, the synthesis rule conforming unit 134 outputs the creation rule of the teacher data conforming to each threat document acquired from the rule conforming unit 133 to the pseudo teacher generation unit 137.
  • the convergence test unit 135 receives the input of the composition rule conforming to each threat document and the sentences and tokens included in the corresponding threat document from the composition rule conforming unit 134. Next, the convergence test unit 135 synthesizes depending on whether or not a new attacker's behavior is added to the behavior set after the synthesis rule is applied to the behavior set that collects the attacker's behavior contained in each threat document. It is determined whether or not the conformity of the rules has converged (step S106).
  • when the conformity of the synthesis rules has not converged (step S106: No), the convergence test unit 135 requests the synthesis rule conforming unit 134 to determine the applicability of synthesis rules that may arise from the new attacker's behavior.
  • the synthesis rule conforming unit 134 adds a new synthesis rule to the synthesis rule for the threat document in which the new synthesis rule has occurred (step S107). After that, the synthesis rule conforming unit 134 returns to step S105.
  • when the conformity of the synthesis rules has converged (step S106: Yes), the convergence test unit 135 outputs the teacher data creation rules and synthesis rules applicable to each threat document to the pseudo-teacher generation unit 137, together with the sentences and tokens included in the corresponding threat document.
  • the pseudo-teacher generation unit 137 receives input of the teacher data creation rules and synthesis rules applicable to each threat document, and the sentences and tokens included in the corresponding threat document, from the convergence test unit 135. Then, the pseudo-teacher generation unit 137 generates pseudo-teacher data for the sentences and tokens included in the threat document by using the teacher data creation rules and synthesis rules applicable to that threat document (step S108). After that, the pseudo-teacher generation unit 137 outputs the generated pseudo-teacher data to the learning unit 104.
  • FIG. 7 is a flowchart of a classification model generation process by learning using pseudo teacher data in consideration of noise by the learning unit.
  • the series of processes shown in the flowchart of FIG. 7 corresponds to an example of the processes executed in step S4 of the flowchart shown in FIG.
  • the document expression layer 141 and the token expression layer 143 acquire the threat document output from the pseudo-teacher extraction unit 103 (step S201).
  • the document expression layer 141 acquires the characteristics of the entire threat document (step S202). After that, the document expression layer 141 outputs the characteristics of each threat document as a whole to the label classification layer 142.
  • the learning unit 104 determines whether or not the learning of the first half is completed based on whether or not the calculation of the label probability when the observed label is regarded as a true label has been performed a predetermined number of times (step S203).
  • when the first-half learning is not completed (step S203: No), the learning unit 104 uses the document expression layer 141 and the label classification layer 142, without using the layers from the token expression layer 143 to the observation label classification layer 147, treats the observation labels as true labels, and performs learning based on the pseudo teacher data (step S204).
  • the label classification layer 142 receives input of the overall characteristics of the threat document from the document expression layer 141.
  • the label classification layer 142 performs machine learning on the overall characteristics of the threat document using a neural network and, treating the observation label corresponding to the attacker's behavior given to the threat document as a true label, calculates the label probability. After that, the learning unit 104 returns to step S201.
  • when the first-half learning is completed (step S203: Yes), the learning unit 104 shifts to the second-half learning.
  • the token expression layer 143 extracts tokens contained in each threat document.
  • the token expression layer 143 acquires the characteristics of each token included in each threat document (step S205).
  • the token expression layer 143 outputs the characteristics of each acquired token to the encode layer 144.
  • the encode layer 144 receives the input of the characteristics of each token from the token expression layer 143. Then, the encode layer 144 converts the characteristics of each acquired token into a fixed length. Next, the encode layer 144 integrates information representing the characteristics of the fixed-length token to generate one vector. Further, when the number of labels given to the generated vector is K, the encode layer 144 converts the vector into a 4K-dimensional vector (step S206). After that, the encode layer 144 outputs one fixed-length and 4K-dimensional vector collectively representing the characteristics of each token to the certainty layer 145.
  • the certainty layer 145 receives an input of one fixed-length, 4K-dimensional vector that collectively represents the characteristics of each token from the encode layer 144. Then, the certainty layer 145 applies to the acquired vector a convolution operation using the kernels and bias parameters it holds, and calculates the certainty (step S207). After that, the certainty layer 145 outputs the calculated certainty to the noise distribution layer 146.
  • the noise distribution layer 146 receives input of certainty from the certainty layer 145. Then, the noise distribution layer 146 normalizes the acquired certainty with the Softmax function to acquire the noise distribution (step S208). After that, the noise distribution layer 146 outputs the noise distribution to the observation label classification layer 147.
  • the label classification layer 142 receives input of the overall characteristics of the threat document from the document expression layer 141. Then, the label classification layer 142 performs machine learning using the overall characteristics of the threat document, predicts the true label corresponding to the behavior of the attacker given to the threat document, and generates the label probability.
  • the observation label classification layer 147 receives the input of the noise distribution from the noise distribution layer 146. Further, the observation label classification layer 147 obtains, from the label classification layer 142, the label probability, which is a vector representing the probability that each true label is included in each threat document. Next, the observation label classification layer 147 calculates the sum of the output of the label classification layer 142 weighted by the output of the noise distribution layer 146, thereby predicting the noisy observation labels to be given to each threat document.
  • observation label classification layer 147 generates the observation label probability.
  • the learning unit 104 learns the classification model using the noise model, based on the label probability calculated by the label classification layer 142 and the observation label probability calculated by the observation label classification layer 147 (step S209).
  • the learning unit 104 determines whether or not the second-half learning is completed based on whether the number of times the classification model has been trained using the noise model has reached a predetermined number (step S210). If the second-half learning is not completed (step S210: No), the learning unit 104 returns to step S201.
  • when the second-half learning is completed (step S210: Yes), the learning unit 104 ends the classification model generation process.
  • FIG. 8 is a flowchart of the threat document classification process by the classification device.
  • the input receiving unit 201 accepts the input of the document (step S11). Then, the input reception unit 201 outputs the received document to the threat document extraction unit 202.
  • the threat document extraction unit 202 receives the input of the document from the input reception unit 201. Next, the threat document extraction unit 202 extracts a threat document related to the cyber threat from the acquired document (step S12). After that, the threat document extraction unit 202 outputs the extracted threat document to the document classification unit 203.
  • the document classification unit 203 receives the input of the threat document extracted from the input document from the threat document extraction unit 202. Further, the document classification unit 203 acquires the classification model generated by the learning unit 104 of the model learning unit 10. Then, the document classification unit 203 classifies the threat document using the acquired classification model (step S13). That is, the document classification unit 203 obtains the probability that a certain threat document includes the behavior of an attacker. After that, the document classification unit 203 outputs the classification result to the output unit 204.
  • the output unit 204 receives the classification result of the input document from the document classification unit 203. Then, the output unit 204 outputs the classification result of the input document to a monitor or the like and notifies the user of the classification result (step S14).
  • the classification device 1 generates pseudo-teacher data from an acquired threat document by using the pseudo-teacher data creation rules that match the threat document. Then, the classification device 1 learns the classification model and the noise model simultaneously based on the acquired threat document and the pseudo teacher data. After that, the classification device 1 classifies the threat documents included in the input documents by using the learned classification model. That is, the classification device 1 can create, without human intervention, pseudo teacher data representing attackers' behaviors from threat documents, and can use the generated natural-sentence pseudo teacher data, which contains noise, to generate a classification model that takes the noise into consideration.
  • in order to train a device that extracts the attacker's behavior from threat documents, the classification device 1 automatically generates pseudo teacher data based on extraction rules derived from heuristics that assume threat documents as input.
  • in this way, pseudo-teacher data creation rules that are strongly tied to the behaviors and have little notational variation can be constructed. Therefore, it is possible to generate pseudo teacher data whose labels are the true labels with high probability.
  • the classification device 1 uses the characteristics of the entire document to construct the classification model and the characteristics of each token to construct the noise model, so that a noise model network for variable-length natural-sentence input can be realized. As a result, the classification device 1 has a network structure that reduces the influence of False Negative and False Positive data when generating a classification model from noisy natural-sentence pseudo teacher data.
  • when training a statistical learning model for classifying attackers' behaviors, the classification device 1 can construct teacher data at low cost, without manual annotation work on each document, by using a creation rule for each label. In addition, with the classification device 1, a noise modeling network can be learned from noisy teacher data such as pseudo teacher data, without using clean, noise-free data, so that threat documents can be classified with a reduced influence of False Positive and False Negative data. Therefore, by extracting attackers' behaviors with high accuracy, the classification device 1 makes it possible to analyze complicated attacker behaviors using large-scale threat document sets that are difficult to analyze manually, and the cyber security capability can be improved.
  • each component of each illustrated device is a functional concept and does not necessarily have to be physically configured as shown in the figures. That is, the specific forms of distribution and integration of each device are not limited to those shown in the figures, and all or part of them may be functionally or physically distributed or integrated in arbitrary units according to various loads and usage conditions. Further, each processing function performed by each device may be realized by a CPU (Central Processing Unit) and a program analyzed and executed by the CPU, or as hardware based on wired logic.
  • the classification device 1 can be implemented by installing a classification program that executes the above information processing as package software or online software on a desired computer. For example, by causing the information processing device to execute the above classification program, the information processing device can be made to function as the classification device 1.
  • the information processing device referred to here includes a desktop type or notebook type personal computer.
  • in addition, the information processing devices include mobile communication terminals such as smartphones, mobile phones, and PHS (Personal Handy-phone System) terminals, as well as slate terminals such as PDAs (Personal Digital Assistants).
  • the classification device 1 can be implemented as a management server device in which the terminal device used by the user is a client and the service related to the above management process is provided to the client.
  • the management server device is implemented as a server device that receives a config input request as an input and provides a management service for inputting a config.
  • the management server device may be implemented as a Web server, or may be implemented as a cloud that provides services related to the above management processing by outsourcing.
  • FIG. 9 is a diagram showing an example of a computer that executes a classification program.
  • the computer 1000 has, for example, a memory 1010 and a CPU 1020.
  • the computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. Each of these parts is connected by a bus 1080.
  • the memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM (Random Access Memory) 1012.
  • the ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System).
  • the hard disk drive interface 1030 is connected to the hard disk drive 1090.
  • the disk drive interface 1040 is connected to the disk drive 1100.
  • a removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100.
  • the serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120.
  • the video adapter 1060 is connected to, for example, the display 1130.
  • the hard disk drive 1090 stores, for example, the OS 1091, the application program 1092, the program module 1093, and the program data 1094. That is, a classification program that defines each process of the classification device 1 is implemented as the program module 1093, in which computer-executable code is described.
  • the program module 1093 is stored in, for example, the hard disk drive 1090.
  • the program module 1093 for executing the same processing as the functional configuration in the classification device 1 is stored in the hard disk drive 1090.
  • the hard disk drive 1090 may be replaced by an SSD (Solid State Drive).
  • the setting data used in the processing of the above-described embodiment is stored as program data 1094 in, for example, a memory 1010 or a hard disk drive 1090. Then, the CPU 1020 reads the program module 1093 and the program data 1094 stored in the memory 1010 and the hard disk drive 1090 into the RAM 1012 as needed, and executes the process of the above-described embodiment.
  • the program module 1093 and the program data 1094 are not limited to those stored in the hard disk drive 1090, but may be stored in, for example, a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (LAN (Local Area Network), WAN (Wide Area Network), etc.). Then, the program module 1093 and the program data 1094 may be read from another computer by the CPU 1020 via the network interface 1070.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Computer Security & Cryptography (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A classification device (1) comprises a pseudo-teacher extraction unit (103), a learning unit (104), and a document classification unit (203). The pseudo-teacher extraction unit (103) creates pseudo-teacher data from threat documents that include descriptions relating to cyber threats, on the basis of a rule for creating pseudo-teacher data. The learning unit (104) learns a noise model and a classification model in parallel using the relationship between these models on the basis of the threat documents and the pseudo-teacher data created by the pseudo-teacher extraction unit (103). The document classification unit (203) classifies an input threat document to be classified, using the classification model generated from the learning by the learning unit (104).

Description

Classification device, classification method, and classification program
 The present invention relates to a classification device, a classification method, and a classification program.
 In recent years, as a security countermeasure approach, research on techniques that take countermeasures based on attackers' behaviors has become active. In such techniques, in order to accurately detect an attacker's behavior, it is important to collect as many patterns of attacker behavior as possible. Therefore, as a way of collecting various attacker behavior patterns, research on extraction methods that extract attackers' behaviors from threat documents related to cyber threats has been actively conducted.
In research on extracting attacker behavior, typified by ATT&CK (registered trademark), various extraction methods have been proposed. Examples include rule-based extraction methods that use a manually created ontology and extraction rules, and extraction methods that use supervised learning such as an SVM (Support Vector Machine). Machine learning on attacker behavior is a multi-label task in which multiple labels are assigned to a single data item.
In relation extraction tasks in the field of natural language processing, there is a technique called distant supervision, which generates labeled teacher data from unlabeled data by using a manually created knowledge base and extraction rules. Distant supervision relies on the facts stored in a knowledge base that represents relations between entities: when a single sentence mentions two entities and the knowledge base contains a relation between them, that relation label is assigned as pseudo teacher data, and learning is performed on the basis of this heuristic.
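As a concrete illustration of this heuristic, the following is a minimal Python sketch; the knowledge base entry, entity names, and sentence are hypothetical examples, not data from any actual system.

    # Minimal distant-supervision labeling sketch (hypothetical knowledge base and entities).
    knowledge_base = {("AttackerX", "MalwareY"): "uses"}

    def distant_label(sentence, entity_pair):
        # Assign the knowledge-base relation as a pseudo label when both entities
        # appear in the sentence and the relation exists in the knowledge base.
        e1, e2 = entity_pair
        if e1 in sentence and e2 in sentence and entity_pair in knowledge_base:
            return knowledge_base[entity_pair]
        return None

    print(distant_label("AttackerX uses MalwareY for initial access.", ("AttackerX", "MalwareY")))  # -> "uses"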
Although pseudo teacher data created with distant supervision contains False Positive and False Negative examples, distant supervision makes it possible to create large-scale teacher data without manual annotation. Beyond relation extraction, learning methods that use pseudo teacher data created with different heuristics have also been studied for sequence labeling and document classification.
Furthermore, for distant supervision, methods have been proposed for reducing the noise present in pseudo teacher data. In relation extraction tasks using deep learning, False Positives are handled by relaxing the assumption that every sentence containing the two entities mentions the relation, and this approach is also applicable to multi-label tasks in which multiple labels are assigned to a single data item.
In addition, noise reduction methods have been proposed in the field of image processing as well, and noise model networks for reducing noise by False Positive and False Negative in multi-label tasks have been proposed.
However, conventional methods of extracting attacker behavior have the following problems. For example, a rule-based extraction method uses a manually created knowledge base, so its cost is high, and it may be difficult to secure a sufficient number of rules in terms of the processing load and cost imposed on the rule creators. As a result, the accuracy of behavior extraction with rule-based methods may decrease.
On the other hand, extraction methods that use supervised learning suffer from a scarcity of usable teacher data. For example, it is conceivable to use Web pages hyperlinked from the ATT&CK knowledge base as teacher data. However, the amount of data obtained in this way is small. Moreover, because such data was not constructed as teacher data, it may contain a large amount of False Negative data in which labels that should have been assigned are missing. Therefore, the accuracy of behavior extraction may also decrease with this approach.
In addition, when conventional attacker behavior extraction methods learn a model from teacher data that contains noise, the learning model does not take the influence of that noise into account, so model parameters are updated on the basis of incorrect teacher data, which can cause a decrease in accuracy. For these reasons, it is difficult to improve cyber security capability with conventional attacker behavior extraction methods.
Distant supervision can address the shortage of teacher data, but the conventional techniques have not examined the use of distant supervision in the task of extracting attacker behavior. Therefore, with the conventional techniques, it is difficult to cope with the shortage of teacher data and difficult to improve cyber security capability.
Furthermore, in distant supervision, False Positives are addressed in order to reduce noise, but addressing both False Positives and False Negatives is not considered. As a method of dealing with False Positives and False Negatives, a document classification method using a noise model layer has been proposed, but it does not target multi-label tasks. Therefore, even with these techniques, it is difficult to reduce the noise involved in extracting attacker behavior and difficult to improve cyber security capability.
The present invention has been made in view of the above, and an object thereof is to improve cyber security capability.
In order to solve the above-described problems and achieve the object, a pseudo-teacher creation unit creates pseudo teacher data from threat documents having descriptions relating to cyber threats, on the basis of rules for creating pseudo teacher data. A learning unit learns a noise model and a classification model in parallel, using the relationship between them, on the basis of the threat documents and the pseudo teacher data created by the pseudo-teacher creation unit. A document classification unit classifies an input threat document to be classified, using the classification model generated by the learning of the learning unit.
According to the present invention, the cyber security capability can be improved.
FIG. 1 is a block diagram of a classification device according to the first embodiment. FIG. 2 is a block diagram showing details of the pseudo-teacher extraction unit. FIG. 3 is a diagram showing an example of a rule for creating pseudo teacher data. FIG. 4 is a block diagram showing details of the learning unit. FIG. 5 is a flowchart of the learning process. FIG. 6 is a flowchart of a pseudo-teacher data generation process by the pseudo-teacher extraction unit. FIG. 7 is a flowchart of a classification model generation process by learning using pseudo teacher data in consideration of noise by the learning unit. FIG. 8 is a flowchart of the threat document classification process by the classification device. FIG. 9 is a diagram showing an example of a computer that executes a classification program.
Hereinafter, one embodiment of the classification device, the classification method, and the classification program disclosed in the present application will be described in detail based on the drawings. The following embodiments do not limit the classification device, classification method, and classification program disclosed in the present application.
[Structure of classification device]
The configuration of the classification device will be described with reference to FIG. 1. FIG. 1 is a block diagram of the classification device. The classification device 1 is an information processing device such as a server. The classification device 1 is a device that classifies the behavior of an attacker with respect to an information processing system. As shown in FIG. 1, the classification device 1 has a model learning unit 10 and an inference unit 20.
The model learning unit 10 performs machine learning to create a classification model that classifies the behavior of an attacker. The model learning unit 10 has a document collection unit 101, a threat document extraction unit 102, a pseudo-teacher extraction unit 103, and a learning unit 104.
The document collection unit 101 collects a document set from the WEB 2 or the like. Then, the document collection unit 101 outputs the collected document set to the threat document extraction unit 102.
The threat document extraction unit 102 receives the input of the document set from the document collection unit 101. Next, the threat document extraction unit 102 extracts threat documents related to cyber threats from the document set. For example, the threat document extraction unit 102 separates threat documents from other documents by using the structure of the Web site from which each document in the document set was acquired, rules, statistical methods, and the like. After that, the threat document extraction unit 102 outputs the extracted threat documents to the pseudo-teacher extraction unit 103.
The pseudo-teacher extraction unit 103 receives the input of the threat document extracted from the document set from the threat document extraction unit 102. Next, the pseudo-teacher extraction unit 103 generates pseudo teacher data from the threat document. After that, the pseudo-teacher extraction unit 103 outputs the threat document to the learning unit 104 together with the generated pseudo teacher data.
Here, the functions of the pseudo-teacher extraction unit 103 will be described in detail with reference to FIG. 2. FIG. 2 is a block diagram showing details of the pseudo-teacher extraction unit. As shown in FIG. 2, the pseudo-teacher extraction unit 103 has a sentence division unit 131, a token division unit 132, a rule conforming unit 133, a synthesis rule conforming unit 134, a convergence test unit 135, an external knowledge reference unit 136, and a pseudo-teacher generation unit 137.
The sentence division unit 131 acquires the threat document input from the threat document extraction unit 102. Then, the sentence division unit 131 divides the threat document into sentence units. After that, the sentence division unit 131 outputs each sentence obtained by dividing the threat document to the token division unit 132 and the rule conforming unit 133.
The token division unit 132 receives the input of each sentence obtained by dividing the threat document from the sentence division unit 131. Then, the token division unit 132 divides the acquired sentences into token units. A token is a character string consisting of one or more characters in the character sequence of the threat document and is the smallest unit of character string that carries meaning. After that, the token division unit 132 outputs each token obtained by dividing each sentence included in the threat document to the rule conforming unit 133.
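The following is a minimal Python sketch of these sentence- and token-splitting steps; it uses simple whitespace and punctuation heuristics as stand-ins, since the description does not prescribe a specific splitter.

    import re

    def split_sentences(document):
        # Split the threat document into sentence units at sentence-ending punctuation.
        return [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]

    def split_tokens(sentence):
        # Split a sentence into tokens: runs of word characters or single punctuation marks.
        return re.findall(r"\w+|[^\w\s]", sentence)

    doc = "The actor used SQL injection. A PowerShell loader was then executed."
    for sentence in split_sentences(doc):
        print(split_tokens(sentence))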
The rule conforming unit 133 receives the input of each sentence obtained by dividing the threat document from the sentence division unit 131. Further, the rule conforming unit 133 receives the input of each token obtained by dividing each sentence included in the threat document from the token division unit 132.
Here, the threat documents from which the sentences and tokens acquired by the rule conforming unit 133 are derived are, as described above, identified within the document set by using the structure of the acquisition-source Web site, rules, statistical methods, and the like. Rules can therefore be created on the premise that the input is a threat document. Accordingly, the rule conforming unit 133 holds in advance rules for creating pseudo teacher data that are designed on the premise of threat documents.
For example, the rule conforming unit 133 has, as rules for creating teacher data, matching of token sequences, matching of regular expressions, matching that combines syntactic structure with token-sequence matching and regular-expression matching, and matching that combines external knowledge such as the NVD (National Vulnerability Database) with token-sequence matching, regular-expression matching, and syntactic structure. However, the rules for creating teacher data are not limited to these, and the rule conforming unit 133 may use other rules as long as they are created on the premise of threat documents.
FIG. 3 is a diagram showing an example of rules for creating pseudo teacher data. Each rule in FIG. 3 states that, when a threat document matches the condition on the left side of the arrow, a description of the attacker behavior on the right side of the arrow exists in that threat document. That is, when a certain threat document is used as input data, the attacker behaviors described in that threat document, which become the output data, can be identified by using the pseudo-teacher-data creation rules whose conditions the threat document satisfies. Therefore, by applying the pseudo-teacher-data creation rules to threat documents, pseudo teacher data including input data and output data can be generated.
In FIG. 3, a rule whose type is a phrase is an example of a creation rule based on token-sequence matching. A rule whose type is a regular expression is an example of a creation rule based on regular-expression matching. A rule whose type is CVE (Common Vulnerabilities and Exposures) or CPE (Common Platform Enumeration) is an example of a creation rule that combines external knowledge such as the NVD with token-sequence matching, regular-expression matching, and syntactic structure. A rule whose type is a logical operation combines other rules, as described later. In every creation rule, when the condition on the left side of the arrow is satisfied, the attacker behavior on the right side is assigned to the corresponding sentence as pseudo teacher data.
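To make the rule types concrete, the following is a minimal Python sketch of phrase-type and regular-expression-type creation rules; the rule contents and behavior names are hypothetical illustrations, not the actual rules of FIG. 3.

    import re

    # Hypothetical phrase-type and regex-type rules: condition -> attacker behavior label.
    phrase_rules = {"sql injection": "Exploit Public-Facing Application"}
    regex_rules = {r"\bpowershell(\.exe)?\b": "Command and Scripting Interpreter"}

    def match_creation_rules(sentence):
        # Return the set of attacker behaviors assigned to the sentence as pseudo teacher data.
        behaviors = set()
        lowered = sentence.lower()
        for phrase, behavior in phrase_rules.items():
            if phrase in lowered:
                behaviors.add(behavior)
        for pattern, behavior in regex_rules.items():
            if re.search(pattern, lowered):
                behaviors.add(behavior)
        return behaviors

    print(match_creation_rules("The actor used SQL injection and a PowerShell loader."))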
The rule conforming unit 133 outputs, to the external knowledge reference unit 136, an acquisition request for the observation information used to determine whether the teacher data creation rules are applicable. The observation information used for this determination includes, for example, clue words suggesting each attacker behavior, such as SQL (Structured Query Language) injection, the names of programs and files used to carry out each behavior, the names of APIs (Application Programming Interfaces), functions, and commands, and external information associated with CVEs (Common Vulnerabilities and Exposures). After that, the rule conforming unit 133 acquires the observation information used for the applicability determination with the teacher data creation rules from the external knowledge reference unit 136.
The rule conforming unit 133 uses the observation information acquired from the external knowledge reference unit 136 with the teacher data creation rules to determine, at the sentence level and the token level of the threat document, whether the creation rules are applicable. Here, determining whether a teacher data creation rule is applicable to a threat document means determining whether an attacker behavior can be derived for that threat document by using the creation rule under consideration.
After that, the rule conforming unit 133 outputs the teacher data creation rules determined to be applicable to the synthesis rule conforming unit 134, together with the sentences and tokens included in the threat document subjected to the determination. The rule conforming unit 133 determines the applicability of the teacher data creation rules for all the threat documents corresponding to the acquired sentences and tokens, and outputs the creation rules based on these determinations to the synthesis rule conforming unit 134, together with the sentences and tokens included in the corresponding threat documents.
The synthesis rule conforming unit 134 receives, from the rule conforming unit 133, the teacher data creation rules applicable to each threat document and the sentences and tokens included in the corresponding threat document. Next, the synthesis rule conforming unit 134 generates synthesis rules for each threat document by logical operations on the teacher data creation rules determined to be applicable by the rule conforming unit 133.
Then, the synthesis rule conforming unit 134 outputs, to the external knowledge reference unit 136, an acquisition request for the observation information used to determine whether the generated synthesis rules are applicable. As with the information acquired by the rule conforming unit 133, the observation information used for this determination includes clue words suggesting each attacker behavior, the names of programs and files used to carry out each behavior, the names of APIs, functions, and commands, and external information associated with CVEs. After that, the synthesis rule conforming unit 134 acquires the observation information used for the applicability determination with the synthesis rules from the external knowledge reference unit 136.
The synthesis rule conforming unit 134 uses the observation information acquired from the external knowledge reference unit 136 with the synthesis rules to determine, at the sentence level and the token level of the threat document, whether the synthesis rules are applicable. For example, the rule whose type is a logical operation in FIG. 3 is an example of a synthesis rule that determines applicability by a logical operation over the results of applying a plurality of rules. Behaviors whose names start with x do not directly correspond to attacker behaviors; they are information used to assign attacker behaviors through the synthesis rules. A behavior starting with x is, for example, information assigned upon a match of SQL injection by a rule whose type is a phrase, or information that assigns a predetermined behavior in a rule whose type is CVE.
When it determines that a synthesis rule is applicable, the synthesis rule conforming unit 134 outputs that synthesis rule to the convergence test unit 135 together with the sentences and tokens included in the threat document subjected to the determination. Further, the synthesis rule conforming unit 134 outputs the teacher data creation rules applicable to each threat document, acquired from the rule conforming unit 133, to the convergence test unit 135.
After that, when the convergence test unit 135 described later requests a determination of the applicability of new synthesis rules, the synthesis rule conforming unit 134 determines whether the new synthesis rules are applicable to the threat documents in which they arose. The synthesis rule conforming unit 134 then outputs the synthesis rules, including the new ones, that are determined to be applicable to the convergence test unit 135, together with the sentences and tokens included in the threat documents subjected to the determination.
The external knowledge reference unit 136 refers to external knowledge by connecting to an external server (not shown) or the like in which knowledge is accumulated. The external knowledge reference unit 136 receives acquisition requests for observation information from the rule conforming unit 133 and the synthesis rule conforming unit 134. The external knowledge reference unit 136 then acquires, from the external knowledge, the observation information used with the designated pseudo-teacher-data creation rules or synthesis rules, and outputs the acquired observation information to the rule conforming unit 133 or the synthesis rule conforming unit 134, whichever sent the acquisition request.
The convergence test unit 135 receives, from the synthesis rule conforming unit 134, the synthesis rules applicable to each threat document and the sentences and tokens included in the corresponding threat document. Here, when a threat document conforms to a certain synthesis rule, a new synthesis rule that may also apply can arise; for example, a synthesis rule generated by a logical operation between an existing synthesis rule and an original teacher data creation rule may become such a new synthesis rule. However, if using a new synthesis rule does not add any attacker behavior for the threat document, the amount of teacher data does not increase even if teacher data is generated with that rule. The convergence test unit 135 therefore determines whether, after the synthesis rules are applied to the behavior set that collects the attacker behaviors included in the threat document, any new attacker behavior is added to that behavior set.
When a new attacker behavior is added, the convergence test unit 135 determines that the threat document may also conform to synthesis rules that would add that new attacker behavior. The convergence test unit 135 then requests the synthesis rule conforming unit 134 to determine the applicability of the synthesis rules that would add the new attacker behavior.
On the other hand, when no new behavior is added, the convergence test unit 135 determines that the application of the synthesis rules has converged. After that, the convergence test unit 135 outputs the teacher data creation rules and synthesis rules applicable to each threat document to the pseudo-teacher generation unit 137, together with the sentences and tokens included in the corresponding threat document.
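The repeated application of synthesis rules until no new behavior is added can be sketched as a simple fixed-point loop; the rule below is a hypothetical example in which a logical AND over two auxiliary x-behaviors assigns an attacker behavior.

    # Hypothetical synthesis rules: (condition over already-assigned behaviors, behavior to add).
    synthesis_rules = [
        (lambda b: "x_sql_injection" in b and "x_web_cve" in b, "Exploit Public-Facing Application"),
    ]

    def apply_until_converged(behaviors, rules):
        behaviors = set(behaviors)
        changed = True
        while changed:  # convergence: stop once a full pass adds no new behavior
            changed = False
            for condition, behavior in rules:
                if condition(behaviors) and behavior not in behaviors:
                    behaviors.add(behavior)
                    changed = True
        return behaviors

    print(apply_until_converged({"x_sql_injection", "x_web_cve"}, synthesis_rules))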
The pseudo-teacher generation unit 137 receives, from the convergence test unit 135, the teacher data creation rules and synthesis rules applicable to each threat document, as well as the sentences and tokens included in the corresponding threat document. Then, the pseudo-teacher generation unit 137 generates pseudo teacher data for the sentences and tokens included in each threat document by using the teacher data creation rules and synthesis rules applicable to that threat document. After that, the pseudo-teacher generation unit 137 outputs the generated pseudo teacher data to the learning unit 104. In addition to the generated pseudo teacher data, the pseudo-teacher extraction unit 103 also outputs the threat documents to the learning unit 104.
The learning unit 104 receives the input of the threat document and the pseudo teacher data from the pseudo-teacher extraction unit 103. Here, since pseudo teacher data is used, noise exists in the teacher data. Therefore, the learning unit 104 generates a classification model by performing learning in consideration of noise using the pseudo teacher data. After that, the learning unit 104 outputs the generated classification model to the document classification unit 203 of the inference unit 20.
Here, the function of the learning unit 104 will be described in detail with reference to FIG. 4. FIG. 4 is a block diagram showing details of the learning unit. As shown in FIG. 4, the learning unit 104 has a document expression layer 141, a label classification layer 142, a token expression layer 143, an encode layer 144, a certainty layer 145, a noise distribution layer 146, and an observation label classification layer 147.
The document expression layer 141 acquires the threat documents output from the pseudo-teacher extraction unit 103. Then, the document expression layer 141 acquires the characteristics of each threat document as a whole. The characteristics of the entire threat document are information that expresses the meaning of the threat document. The document expression layer 141 generates one vector for one threat document. After that, the document expression layer 141 outputs the characteristics of each entire threat document to the label classification layer 142. The document expression layer 141 can be realized by using, for example, BERT (Bidirectional Encoder Representations from Transformers). However, the implementation of the document expression layer 141 is not limited to BERT, and implementations with other configurations may also be used.
The token expression layer 143 acquires the threat documents output from the pseudo-teacher extraction unit 103. Then, the token expression layer 143 extracts the tokens included in each threat document. Here, the token expression layer 143 may instead acquire the tokens of each threat document that were created by the pseudo-teacher extraction unit 103.
Next, the token expression layer 143 acquires the characteristics of each token included in each threat document. The characteristics of a token are information expressing the meaning of that token. Here, because the token sequences differ in length, the resulting sequences of token characteristics also differ in length. The token expression layer 143 generates one vector per token, so for one threat document it generates as many vectors as there are tokens in the document. After that, the token expression layer 143 outputs the characteristics of the individual tokens to the encode layer 144. Like the document expression layer 141, the token expression layer 143 can also be realized by using BERT, for example. However, the implementation of the token expression layer 143 is likewise not limited to BERT, and implementations with other configurations may also be used.
By arranging the document expression layer 141 and the token expression layer 143 in this way, label classification can take the meaning of the whole text into account, while noise modeling can take into account the elements within individual tokens that contribute to errors.
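As one way of realizing the two representation layers, the following Python sketch obtains a whole-document vector and per-token vectors with a pretrained BERT model via the Hugging Face transformers library; using the [CLS] vector as the document characteristic is an assumption, since the description only names BERT as one possible implementation.

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    encoder = AutoModel.from_pretrained("bert-base-uncased")

    text = "The actor used SQL injection against the public web server."
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)

    doc_vector = outputs.last_hidden_state[:, 0]         # one vector for the whole document ([CLS])
    token_vectors = outputs.last_hidden_state[:, 1:-1]   # one vector per (sub-word) token
    print(doc_vector.shape, token_vectors.shape)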
The label classification layer 142 receives the input of the overall characteristics of the threat document from the document expression layer 141. Then, the label classification layer 142 performs machine learning with a neural network that takes the overall characteristics of the threat document as input, and predicts the true labels corresponding to the attacker behaviors to be assigned to the threat document. A label is information indicating which descriptions of attacker behaviors are contained in each threat document. When K labels are to be assigned, the label classification layer 142 generates a label probability, which is a vector representing the probability that each of the K labels is included.
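A minimal sketch of such a label classification layer is shown below, assuming a 768-dimensional document vector (as produced by BERT), K = 5 labels, and an independent sigmoid per label for the multi-label setting; these concrete sizes and the sigmoid choice are assumptions, not values fixed by the description.

    import torch
    from torch import nn

    K, doc_dim = 5, 768
    label_classifier = nn.Sequential(nn.Linear(doc_dim, K), nn.Sigmoid())

    doc_vector = torch.randn(1, doc_dim)        # overall characteristics of one threat document
    label_probs = label_classifier(doc_vector)  # probability that each of the K true labels applies
    print(label_probs.shape)                    # torch.Size([1, 5])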
The encode layer 144 is a layer that converts the variable-length token sequence into a fixed length. The encode layer 144 receives, from the token expression layer 143, variable-length sequences of vectors representing the characteristics of the individual tokens. The encode layer 144 converts the acquired token characteristics into a fixed length and integrates the fixed-length information representing the token characteristics into a single vector. Further, where K is the number of labels to be assigned, the encode layer 144 converts the generated vector into a 4K-dimensional vector. After that, the encode layer 144 outputs this single fixed-length, 4K-dimensional vector, which collectively represents the characteristics of the tokens, to the certainty layer 145. The encode layer 144 can be realized by using, for example, an RNN (Recurrent Neural Network). However, the implementation of the encode layer 144 is not limited to an RNN, and implementations with other configurations may also be used.
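The following is a minimal sketch of such an encode layer, assuming a GRU whose final hidden state is projected to the 4K-dimensional vector; the GRU, the hidden size, and the projection are assumptions, since the description only names an RNN as one possible implementation.

    import torch
    from torch import nn

    K, feat_dim, hidden_dim = 5, 768, 128
    gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)   # consumes the variable-length token sequence
    project = nn.Linear(hidden_dim, 4 * K)                 # maps the summary to a 4K-dimensional vector

    token_characteristics = torch.randn(1, 12, feat_dim)   # e.g., 12 token vectors for one document
    _, last_hidden = gru(token_characteristics)            # final hidden state: [1, 1, hidden_dim]
    encoded = project(last_hidden.squeeze(0))              # fixed-length 4K-dimensional vector
    print(encoded.shape)                                   # torch.Size([1, 20])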
The certainty layer 145 is a convolution layer that represents the certainty of the conversion from the true value (0 or 1) of each label to the value (0 or 1) of the observation label, which is a label containing noise. When there are K labels, the certainty is the certainty of the transition from each value of the true label to each value of the observation label for each of the K labels. Since the true label takes two values and the observation label also takes two values, there are four possible transitions per label. The certainty layer 145 has 4K kernels of size 1 × 1 and bias parameters.
The certainty layer 145 receives, from the encode layer 144, the single fixed-length, 4K-dimensional vector that collectively represents the characteristics of the tokens. The certainty layer 145 then performs a convolution operation on the acquired vector using the kernels and bias parameters it holds, and generates a 2 × 2 × K tensor representing the certainties. Because the classification device 1 uses pseudo teacher data for learning, observation labels containing noise arise. It is therefore necessary to determine, for the obtained observation labels, how the true labels behave. By using the certainties with which each true label transitions to each observation label, the true labels can be estimated from the obtained observation labels. The certainty layer 145 outputs the generated 2 × 2 × K tensor of certainties to the noise distribution layer 146.
The noise distribution layer 146 receives the certainties represented by the 2 × 2 × K tensor from the certainty layer 145. The noise distribution layer 146 then normalizes the acquired certainties with the Softmax function to obtain a noise distribution represented by a 2 × 2 × K tensor. The noise distribution is a probability distribution representing, for each of the K labels in a given threat document, the transition probability from each value of the true label to each value of the observation label. After that, the noise distribution layer 146 outputs the noise distribution represented by the 2 × 2 × K tensor to the observation label classification layer 147.
The observation label classification layer 147 receives the noise distribution created by the noise distribution layer 146 on the basis of the output of the token expression layer 143, which represents the characteristics of the individual tokens. The observation label classification layer 147 also acquires, from the label classification layer 142, the label probability of each threat document, that is, the vector representing the probability that each true label is included. Next, the observation label classification layer 147 predicts the noisy observation labels to be assigned to each threat document by computing the weighted sum of the output of the label classification layer 142 with the output of the noise distribution layer 146. When K labels are to be assigned, the observation label classification layer 147 generates, for each threat document, an observation label probability, which is a vector representing the probability that each of the K observation labels is included.
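The computation from the encoded token vector to the observation label probabilities can be sketched as follows; the 1 × 1 convolution is simplified here to a reshape of the 4K-dimensional vector, and the axis over which the Softmax normalizes (the observation-label-value axis) is an assumption consistent with the description above.

    import torch

    K = 5
    encoded = torch.randn(4 * K)                     # fixed-length 4K-dim vector from the encode layer
    certainty = encoded.view(K, 2, 2)                # per label: 2 x 2 scores [true value, observed value]
    noise_dist = torch.softmax(certainty, dim=-1)    # rows sum to 1: P(observation label | true label)

    label_probs = torch.rand(K)                      # p(true label k = 1) from the label classification layer
    p_true = torch.stack([1 - label_probs, label_probs], dim=-1)        # [K, 2]
    obs_probs = (p_true.unsqueeze(-1) * noise_dist).sum(dim=1)[:, 1]    # p(observation label k = 1)
    print(obs_probs.shape)                           # torch.Size([5])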
The document expression layer 141 and the label classification layer 142 in the learning unit 104 correspond to a classification model for calculating the appearance probability of the attacker's behavior included in the threat document and classifying the threat document. Further, the token expression layer 143, the encode layer 144, the certainty layer 145, the noise distribution layer 146, and the observation label classification layer 147 in the learning unit 104 constitute the noise model used to reduce noise when learning the classification model.
Here, the actual learning by the learning unit 104 proceeds as follows. Because pseudo teacher data is used, the labels that can be obtained from the threat documents are observation labels. Therefore, in order to learn the noise model appropriately in the observation label classification layer 147, the learning unit 104 first performs the first half of the learning, which adjusts the parameters of the label classification layer 142 to improve the prediction accuracy of the true labels, and then completes the classification model by performing learning that takes noise into account using the noise model. The learning unit 104 moves to the second half of the learning after the first half is completed.
In the first half of the learning, the learning unit 104 does not use the layers from the token expression layer 143 through the observation label classification layer 147; it uses the document expression layer 141 and the label classification layer 142, regards the observation labels as true labels, and performs learning on the basis of the pseudo teacher data. In this case, the learning unit 104 learns so as to minimize the error between the output of the label classification layer 142 and the observation labels. The learning unit 104 determines that the first half of the learning is completed when, for example, prediction with the observation labels regarded as true labels has been performed a predetermined number of times.
In the latter half of the learning, the learning unit 104 performs learning based on the pseudo teacher data using all the layers from the document expression layer 141 to the observation label classification layer 147. In this case, the learning unit 104 learns so as to minimize the error between the output of the observation label classification layer 147 and the observation label given by the pseudo teacher data. As a result, the learning unit 104 can realize, with the document expression layer 141 and the label classification layer 142, a classification model in which the influence of noise due to the pseudo teacher data is suppressed. The learning unit 104 determines, for example, that the latter half of the learning is completed when the prediction of the true label has been performed a predetermined number of times. After the latter half of the learning is completed, the learning unit 104 outputs the generated classification model to the document classification unit 203 of the inference unit 20.
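The two-phase procedure can be sketched as the following toy training loop; the small linear modules stand in for the actual layers (BERT, RNN, 1 × 1 convolution), and the batch size, dimensions, and step counts are arbitrary assumptions.

    import torch
    from torch import nn

    K, doc_dim, tok_dim = 5, 16, 16
    classifier = nn.Sequential(nn.Linear(doc_dim, K), nn.Sigmoid())  # stands in for layers 141-142
    noise_head = nn.Linear(tok_dim, 4 * K)                           # stands in for layers 143-146

    def predict_observed(doc_vec, tok_vec):
        p_true = classifier(doc_vec)                                          # [B, K]
        noise = torch.softmax(noise_head(tok_vec).view(-1, K, 2, 2), dim=-1)  # [B, K, 2, 2]
        p_t = torch.stack([1 - p_true, p_true], dim=-1)                       # [B, K, 2]
        return (p_t.unsqueeze(-1) * noise).sum(dim=2)[..., 1], p_true         # p(observed=1), p(true=1)

    bce = nn.BCELoss()
    opt = torch.optim.Adam(list(classifier.parameters()) + list(noise_head.parameters()))
    doc_vec, tok_vec = torch.randn(8, doc_dim), torch.randn(8, tok_dim)
    observed = torch.randint(0, 2, (8, K)).float()                   # pseudo (observation) labels

    for step in range(20):
        opt.zero_grad()
        p_obs, p_true = predict_observed(doc_vec, tok_vec)
        # First half: fit the classifier as if the observation labels were true labels.
        # Second half: fit the whole model so its observation-label prediction matches the pseudo labels.
        loss = bce(p_true, observed) if step < 10 else bce(p_obs, observed)
        loss.backward()
        opt.step()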
The inference unit 20 classifies the attacker behaviors included in the threat documents among the input documents, using the classification model generated by the model learning unit 10. The inference unit 20 has an input reception unit 201, a threat document extraction unit 202, a document classification unit 203, and an output unit 204.
The input reception unit 201 accepts the input of a document. Then, the input reception unit 201 outputs the received document to the threat document extraction unit 202.
The threat document extraction unit 202 receives the input of the document from the input reception unit 201. Next, the threat document extraction unit 202 extracts threat documents related to cyber threats from the acquired documents. The threat document extraction unit 202 separates threat documents from other documents by using the structure of the Web site from which each document was acquired, rules, statistical methods, and the like. After that, the threat document extraction unit 202 outputs the extracted threat documents to the document classification unit 203.
The document classification unit 203 receives, from the threat document extraction unit 202, the threat documents extracted from the input documents. The document classification unit 203 also acquires the classification model generated by the learning unit 104 of the model learning unit 10. The document classification unit 203 then classifies the threat documents using the acquired classification model. That is, for a given threat document, the document classification unit 203 obtains the probability that each attacker behavior is included in it. After that, the document classification unit 203 outputs the classification results to the output unit 204.
The output unit 204 receives the input of the classification result of the input document from the document classification unit 203. Then, the output unit 204 outputs the classification result of the input document to a monitor or the like, and notifies the user of the classification result of the input document. Specifically, the output unit 204 outputs information such as what kind of attacker's behavior is included in the threat document in the input document and to what extent.
In this way, by classifying threat documents and determining which attacker behaviors they contain and to what extent, it is possible, for example, to identify which attacker behaviors appear frequently. In addition, if threat documents often contain both a first attacker behavior and a second attacker behavior, it can be judged that the second behavior is also likely to occur when the first behavior is detected, which enables prompt countermeasures.
[Classification process]
Next, with reference to FIG. 5, the entire flow of the learning process for generating the classification model by the model learning unit 10 of the classification device 1 will be described. FIG. 5 is a flowchart of the learning process.
The document collection unit 101 acquires a document set by collecting documents from the WEB 2 or the like (step S1). Next, the document collection unit 101 outputs the acquired document set to the threat document extraction unit 102.
The threat document extraction unit 102 receives the input of the document set from the document collection unit 101. Next, the threat document extraction unit 102 extracts a threat document related to the cyber threat from the document set (step S2). After that, the threat document extraction unit 102 outputs the extracted threat document to the pseudo-teacher extraction unit 103.
The pseudo-teacher extraction unit 103 receives the input of the threat document extracted from the document set from the threat document extraction unit 102. Next, the pseudo-teacher extraction unit 103 generates pseudo teacher data from the threat document (step S3). After that, the pseudo-teacher extraction unit 103 outputs the generated pseudo teacher data to the learning unit 104.
The learning unit 104 receives the threat documents, the tokens included in each threat document, and the pseudo teacher data from the pseudo-teacher extraction unit 103. Next, the learning unit 104 generates a classification model by performing learning that takes noise into account using the pseudo teacher data (step S4). After that, the learning unit 104 outputs the generated classification model to the document classification unit 203 of the inference unit 20.
Next, with reference to FIG. 6, the flow of the pseudo-teacher data generation process by the pseudo-teacher extraction unit 103 will be described. FIG. 6 is a flowchart of a pseudo-teacher data generation process by the pseudo-teacher extraction unit. The series of processes shown in the flowchart of FIG. 6 corresponds to an example of the processes executed in step S3 of the flowchart shown in FIG. 5.
The sentence division unit 131 acquires the threat document input from the threat document extraction unit 102 (step S101).
Next, the sentence division unit 131 divides the threat document into sentence units (step S102). After that, the sentence division unit 131 outputs each sentence obtained by dividing the threat document to the token division unit 132 and the rule conforming unit 133.
The token division unit 132 receives the input of each sentence obtained by dividing the threat document from the sentence division unit 131. Then, the token division unit 132 divides the acquired sentences into token units (step S103). After that, the token division unit 132 outputs each token obtained by dividing each sentence included in the threat document to the rule conforming unit 133.
The rule conforming unit 133 receives the input of each sentence obtained by dividing the threat document from the sentence division unit 131. The rule conforming unit 133 also receives the input of each token obtained by dividing each sentence included in the threat document from the token division unit 132. Then, the rule conforming unit 133 determines, at the sentence level and the token level of the threat document, whether the teacher data creation rules it holds are applicable, using information acquired via the external knowledge reference unit 136. In this way, the rule conforming unit 133 identifies the teacher data creation rules applicable to each threat document (step S104). After that, the rule conforming unit 133 outputs the teacher data creation rules determined to be applicable to the synthesis rule conforming unit 134, together with the sentences and tokens included in the threat document subjected to the determination.
The synthesis rule conforming unit 134 receives, from the rule conforming unit 133, the teacher data creation rules applicable to each threat document and the sentences and tokens included in the corresponding threat document. Next, the synthesis rule conforming unit 134 generates synthesis rules for each threat document by logical operations on the teacher data creation rules determined to be applicable by the rule conforming unit 133. Then, the synthesis rule conforming unit 134 determines, at the sentence level and the token level of the threat document, whether the synthesis rules are applicable, using information acquired via the external knowledge reference unit 136 (step S105). After that, the synthesis rule conforming unit 134 outputs the synthesis rules determined to be applicable to the convergence test unit 135, together with the sentences and tokens included in the threat document subjected to the determination. The synthesis rule conforming unit 134 also outputs the teacher data creation rules applicable to each threat document, acquired from the rule conforming unit 133, to the pseudo-teacher generation unit 137.
The convergence test unit 135 receives, from the synthesis rule conforming unit 134, the synthesis rules applicable to each threat document and the sentences and tokens included in the corresponding threat document. Next, the convergence test unit 135 determines whether the application of the synthesis rules has converged, based on whether any new attacker behavior is added to the behavior set collecting the attacker behaviors included in each threat document after the synthesis rules are applied (step S106).
When the application of the synthesis rules has not converged (step S106: No), the convergence test unit 135 requests the synthesis rule conforming unit 134 to determine the applicability of the possibly applicable synthesis rules that would yield new attacker behaviors. The synthesis rule conforming unit 134 adds the new synthesis rules to the synthesis rules for the threat documents in which they arose (step S107). After that, the synthesis rule conforming unit 134 returns to step S105.
 If, on the other hand, the application of the synthesis rules has converged (step S106: Yes), the convergence determination unit 135 outputs the teacher data creation rules and synthesis rules matching each threat document, together with the sentences and tokens of the corresponding threat document, to the pseudo-teacher generation unit 137. The pseudo-teacher generation unit 137 receives these from the convergence determination unit 135 and generates pseudo teacher data for the sentences and tokens contained in each threat document by using the teacher data creation rules and synthesis rules matching that threat document (step S108). The pseudo-teacher generation unit 137 then outputs the generated pseudo teacher data to the learning unit 104.
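 One possible reading of steps S105 through S108 is an iteration that keeps applying synthesis rules until no new attacker behavior is added to the behavior set, and then emits the converged set as pseudo labels. The sketch below, in which `creation_rules` and `synthesis_rules` are hypothetical callables returning label sets, only illustrates that control flow and is not the exact implementation of the embodiment.

```python
# Sketch of the convergence loop (steps S105-S108), assuming each rule maps a
# document to a set of attacker-behavior labels.

def generate_pseudo_labels(document, creation_rules, synthesis_rules):
    behaviors = set()
    for rule in creation_rules:
        behaviors |= rule(document)          # step S104: applicable creation rules

    while True:
        before = set(behaviors)
        for rule in synthesis_rules:
            behaviors |= rule(document)      # step S105: apply synthesis rules
        if behaviors == before:              # step S106: no new behavior -> converged
            break
        # step S107: in the embodiment, new candidate synthesis rules would be added here

    # step S108: the converged behavior set becomes the pseudo (noisy) labels
    return sorted(behaviors)
```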
 Next, with reference to FIG. 7, the flow of the process in which the learning unit 104 generates a classification model by learning with noise-aware pseudo teacher data will be described. FIG. 7 is a flowchart of the classification model generation process by learning using pseudo teacher data in consideration of noise by the learning unit. The series of processes shown in the flowchart of FIG. 7 is an example of the processing executed in step S4 of the flowchart shown in FIG. 5.
 The document representation layer 141 and the token representation layer 143 acquire the threat documents output from the pseudo-teacher extraction unit 103 (step S201).
 The document representation layer 141 acquires features of each threat document as a whole (step S202) and then outputs these document-level features to the label classification layer 142.
 Here, the learning unit 104 determines whether the first half of the training is complete, for example by checking whether the label probabilities computed with the observed labels regarded as true labels have been calculated a predetermined number of times (step S203). If the first half of the training is not complete (step S203: No), the learning unit 104 trains using only the document representation layer 141 and the label classification layer 142, without using the layers from the token representation layer 143 through the observation label classification layer 147, treating the observed labels as true labels and learning from the pseudo teacher data (step S204). At this time, the label classification layer 142 receives the document-level features of the threat document from the document representation layer 141, performs machine learning on them with a neural network, and, regarding the observed labels corresponding to the attacker behaviors assigned to the threat document as true labels, predicts them and computes label probabilities. The learning unit 104 then returns to step S201.
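 The first training phase can be pictured as ordinary multi-label classification on document-level features, with the observed (pseudo) labels treated as ground truth. The PyTorch-style sketch below assumes a document encoder that already produces a fixed-size feature vector and K behavior labels; the dimensions, loss, and optimizer are assumptions, not the exact network of the embodiment.

```python
import torch
import torch.nn as nn

K = 8            # number of behavior labels (assumed)
DOC_DIM = 256    # size of the document-level feature vector (assumed)

label_classifier = nn.Linear(DOC_DIM, K)   # stand-in for the label classification layer 142
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(label_classifier.parameters(), lr=1e-3)

def first_half_step(doc_features, observed_labels):
    """doc_features: (batch, DOC_DIM); observed_labels: (batch, K) 0/1 pseudo labels."""
    logits = label_classifier(doc_features)      # predicted label scores
    loss = criterion(logits, observed_labels)    # observed labels treated as true labels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```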
 If, on the other hand, the first half of the training is complete (step S203: Yes), the learning unit 104 moves on to the second half of the training. The token representation layer 143 extracts the tokens contained in each threat document and acquires the features of each individual token (step S205). The token representation layer 143 then outputs the acquired per-token features to the encoding layer 144.
 The encoding layer 144 receives the per-token features from the token representation layer 143 and converts them into a fixed-length representation. Next, the encoding layer 144 integrates the fixed-length token features into a single vector. Further, letting K be the number of labels, the encoding layer 144 converts that vector into a 4K-dimensional vector (step S206). The encoding layer 144 then outputs this single fixed-length, 4K-dimensional vector summarizing the features of all tokens to the confidence layer 145.
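 One way to realize step S206 is to pool the variable-length token features into a fixed-length vector and project that vector to 4K dimensions. In the sketch below the mean pooling, the feature sizes, and the linear projection are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

K = 8            # number of labels (assumed)
TOKEN_DIM = 128  # per-token feature size (assumed)

project_4k = nn.Linear(TOKEN_DIM, 4 * K)   # stand-in for the encoding layer 144

def encode_tokens(token_features: torch.Tensor) -> torch.Tensor:
    """token_features: (num_tokens, TOKEN_DIM); num_tokens varies per document."""
    pooled = token_features.mean(dim=0)    # fixed-length summary (assumed mean pooling)
    return project_4k(pooled)              # single 4K-dimensional vector
```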
 The confidence layer 145 receives the fixed-length, 4K-dimensional vector summarizing the per-token features from the encoding layer 144. It then performs a convolution operation on the received vector using the kernel and bias parameters it holds, and computes confidence values (step S207). The confidence layer 145 then outputs the computed confidence values to the noise distribution layer 146.
 The noise distribution layer 146 receives the confidence values from the confidence layer 145, normalizes them with the softmax function, and thereby obtains the noise distribution (step S208). The noise distribution layer 146 then outputs the noise distribution to the observation label classification layer 147.
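 Steps S207 and S208 can be sketched as a small convolution with a learned kernel and bias followed by softmax normalization. The 1-D convolution below and the assumption that the 4K-dimensional vector is laid out as four values per label are illustrative guesses; the embodiment does not specify this layout.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

K = 8  # number of labels (assumed)

# Stand-in for the confidence layer 145: a convolution with its own kernel and bias.
conf_conv = nn.Conv1d(in_channels=1, out_channels=1, kernel_size=4, stride=4, bias=True)

def noise_distribution(encoded_4k: torch.Tensor) -> torch.Tensor:
    """encoded_4k: (4K,) vector from the encoding layer (sketch)."""
    x = encoded_4k.view(1, 1, -1)          # (batch=1, channel=1, 4K)
    confidence = conf_conv(x).view(K)      # one confidence value per label (assumed layout)
    return F.softmax(confidence, dim=0)    # step S208: normalize into a noise distribution
```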
 The label classification layer 142 receives the document-level features of the threat document from the document representation layer 141, performs machine learning on them, predicts the true labels corresponding to the attacker behaviors assigned to the threat document, and generates label probabilities. The observation label classification layer 147 receives the noise distribution from the noise distribution layer 146 and also obtains from the label classification layer 142 the label probabilities, that is, a vector giving, for each threat document, the probability that each true label applies. Next, the observation label classification layer 147 predicts the noisy observed labels to be assigned to each threat document by computing a weighted sum of the output of the label classification layer 142 and the output of the noise distribution layer 146, and generates observation label probabilities. Based on the label probabilities computed by the label classification layer 142 and the observation label probabilities computed by the observation label classification layer 147, the learning unit 104 trains the classification model using the noise model (step S209).
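 A minimal sketch of step S209, assuming per-label probabilities and a single mixing weight: the clean-label prediction and the noise distribution are combined by a weighted sum, and the loss is taken against the noisy observed labels so that gradients reach both the classification model and the noise model. The fixed weight `alpha` and the exact combination are assumptions; the embodiment only states that a weighted sum is used.

```python
import torch
import torch.nn.functional as F

def observed_label_probability(label_logits, noise_dist, alpha=0.5):
    """label_logits: (K,) output of the label classification layer 142 (sketch)
    noise_dist:   (K,) output of the noise distribution layer 146 (sketch)
    alpha:        mixing weight (assumed)."""
    clean_prob = torch.sigmoid(label_logits)                 # per-label true-label probability
    return alpha * clean_prob + (1.0 - alpha) * noise_dist   # noisy observed-label prediction

def second_half_loss(label_logits, noise_dist, observed_labels):
    obs_prob = observed_label_probability(label_logits, noise_dist)
    # Training against the noisy observed labels lets gradients flow into both the
    # classification model and the noise model, so the two are learned jointly.
    return F.binary_cross_entropy(obs_prob.clamp(1e-6, 1 - 1e-6), observed_labels)
```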
 Next, the learning unit 104 determines whether the second half of the training is complete, for example by checking whether the number of training iterations of the classification model with the noise model has reached a predetermined number (step S210). If the second half of the training is not complete (step S210: No), the learning unit 104 returns to step S201.
 If, on the other hand, the second half of the training is complete (step S210: Yes), the learning unit 104 ends the classification model generation process.
 Next, with reference to FIG. 8, the flow of the threat document classification process performed by the inference unit 20 of the classification device 1 will be described. FIG. 8 is a flowchart of the threat document classification process by the classification device.
 The input reception unit 201 accepts an input document (step S11) and outputs the received document to the threat document extraction unit 202.
 The threat document extraction unit 202 receives the document from the input reception unit 201 and extracts threat documents related to cyber threats from the acquired document (step S12). The threat document extraction unit 202 then outputs the extracted threat documents to the document classification unit 203.
 The document classification unit 203 receives, from the threat document extraction unit 202, the threat documents extracted from the input document. It also acquires the classification model generated by the learning unit 104 of the model learning unit 10. The document classification unit 203 then classifies the threat documents using the acquired classification model (step S13); that is, it computes the probability that each attacker behavior is contained in a given threat document. After that, the document classification unit 203 outputs the classification results to the output unit 204.
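 The inference flow of steps S11 to S14 can be summarized as the sketch below, where `is_threat_document`, `classification_model`, `behavior_names`, and the decision threshold are hypothetical names and values introduced only for illustration.

```python
# Sketch of the inference flow (steps S11-S14) with hypothetical helpers.

THRESHOLD = 0.5  # assumed decision threshold

def classify_document(raw_document, is_threat_document, classification_model, behavior_names):
    if not is_threat_document(raw_document):             # step S12: keep only threat documents
        return []
    probabilities = classification_model(raw_document)   # step S13: per-behavior probabilities
    return [
        (name, p)
        for name, p in zip(behavior_names, probabilities)
        if p >= THRESHOLD                                 # behaviors reported to the user (step S14)
    ]
```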
 The output unit 204 receives the classification results of the input document from the document classification unit 203, outputs them to a monitor or the like, and thereby notifies the user of the classification results (step S14).
[Effects of the classification device and classification processing]
 As described above, the classification device 1 generates pseudo teacher data from acquired threat documents by using pseudo teacher data creation rules that match those documents. The classification device 1 then trains the classification model and the noise model simultaneously based on the acquired threat documents and the pseudo teacher data. After that, the classification device 1 classifies the threat documents contained in an input document using the trained classification model. In other words, the classification device 1 can create pseudo teacher data for attacker-behavior extraction from threat documents without human intervention, and can use the generated noisy, natural-language pseudo teacher data to build a classification model that takes noise into consideration.
 In particular, in order to train a device that extracts attacker behaviors from threat documents, the classification device 1 automatically generates pseudo teacher data based on extraction rules derived from heuristics about threat documents. By using, in these extraction rules, system-level observables such as program names and file names that appear when a given behavior is carried out, creation rules for pseudo teacher data can be built that are strongly tied to the behavior and exhibit little notational variation. As a result, pseudo teacher data that carries true labels with high probability can be generated.
 In addition, the classification device 1 uses the features of the entire document to build the classification model and the features of individual tokens to build the noise model, thereby realizing noise modeling in a network that accepts variable-length natural-language input. The classification device 1 therefore has a network structure that reduces the influence of False Negative and False Positive data when a classification model is generated from noisy, natural-language pseudo teacher data.
 In other words, when training the statistical learning model that classifies attacker behaviors, the classification device 1 uses per-label creation rules, which makes it possible to build teacher data at low cost without manual annotation of individual documents. Furthermore, by using the classification device 1, a noise modeling network can be learned from noisy teacher data such as pseudo teacher data, without requiring noise-free clean data, and threat document classification with reduced influence of False Positive and False Negative data can be realized. Consequently, the high-accuracy extraction of attacker behaviors by the classification device 1 makes it possible to analyze increasingly complex attacker behaviors using large-scale collections of threat documents that are difficult to analyze manually, and thereby to improve cybersecurity capabilities.
[System configuration, etc.]
 Each component of each illustrated device is a functional concept and does not necessarily have to be physically configured as illustrated. That is, the specific form of distribution and integration of each device is not limited to the illustrated one; all or part of each device can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. Furthermore, all or any part of the processing functions performed by each device may be realized by a CPU (Central Processing Unit) and a program analyzed and executed by that CPU, or as hardware based on wired logic.
 Of the processes described in this embodiment, all or part of a process described as being performed automatically can also be performed manually, and all or part of a process described as being performed manually can be performed automatically by a known method. In addition, the processing procedures, control procedures, specific names, and information including various data and parameters shown in the above description and drawings can be changed arbitrarily unless otherwise specified.
[Program]
 In one embodiment, the classification device 1 can be implemented by installing, on a desired computer, a classification program that executes the above information processing as package software or online software. For example, by causing an information processing device to execute the above classification program, the information processing device can be made to function as the classification device 1. The information processing device here includes desktop and notebook personal computers. In addition, mobile communication terminals such as smartphones, mobile phones, and PHS (Personal Handy-phone System) devices, as well as slate terminals such as PDAs (Personal Digital Assistants), also fall into this category.
 The classification device 1 can also be implemented as a management server device that treats a terminal device used by a user as a client and provides the client with services related to the above management processing. For example, the management server device is implemented as a server device that receives a config input request as input and provides a management service that performs the config input. In this case, the management server device may be implemented as a Web server, or as a cloud that provides services related to the above management processing by outsourcing.
 FIG. 9 is a diagram showing an example of a computer that executes the classification program. The computer 1000 has, for example, a memory 1010 and a CPU 1020. The computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.
 The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM (Random Access Memory) 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1090. The disk drive interface 1040 is connected to a disk drive 1100. A removable storage medium such as a magnetic disk or an optical disc is inserted into the disk drive 1100. The serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120. The video adapter 1060 is connected to, for example, a display 1130.
 The hard disk drive 1090 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. That is, the classification program defining each process of the classification device 1 is implemented as a program module 1093 in which computer-executable code is described. The program module 1093 is stored in, for example, the hard disk drive 1090. For example, a program module 1093 for executing processing equivalent to the functional configuration of the classification device 1 is stored in the hard disk drive 1090. The hard disk drive 1090 may be replaced by an SSD (Solid State Drive).
 The setting data used in the processing of the above embodiment is stored as program data 1094 in, for example, the memory 1010 or the hard disk drive 1090. The CPU 1020 reads the program module 1093 and the program data 1094 stored in the memory 1010 or the hard disk drive 1090 into the RAM 1012 as needed, and executes the processing of the above embodiment.
 The program module 1093 and the program data 1094 are not limited to being stored in the hard disk drive 1090; they may be stored in, for example, a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (LAN (Local Area Network), WAN (Wide Area Network), etc.) and read from that computer by the CPU 1020 via the network interface 1070.
 1 Classification device
 2 WEB
 10 Model learning unit
 20 Inference unit
 101 Document collection unit
 102 Threat document extraction unit
 103 Pseudo-teacher extraction unit
 104 Learning unit
 131 Sentence splitting unit
 132 Token splitting unit
 133 Rule conforming unit
 134 Synthesis rule conforming unit
 135 Convergence determination unit
 136 External knowledge reference unit
 137 Pseudo-teacher generation unit
 141 Document representation layer
 142 Label classification layer
 143 Token representation layer
 144 Encoding layer
 145 Confidence layer
 146 Noise distribution layer
 147 Observation label classification layer
 201 Input reception unit
 202 Threat document extraction unit
 203 Document classification unit
 204 Output unit

Claims (8)

  1.  A classification device comprising:
     a pseudo-teacher creation unit that creates pseudo teacher data from a threat document having a description relating to a cyber threat, based on creation rules for pseudo teacher data;
     a learning unit that learns a noise model and a classification model in parallel, using their mutual relationship, based on the threat document and the pseudo teacher data created by the pseudo-teacher creation unit; and
     a document classification unit that classifies an input classification-target threat document using the classification model generated by the learning of the learning unit.
  2.  The classification device according to claim 1, further comprising a threat document extraction unit that extracts a threat document having a description relating to a cyber threat.
  3.  The classification device according to claim 1 or 2, wherein the creation rules are rules based on extraction rules for extracting the threat document from a document set.
  4.  The classification device according to any one of claims 1 to 3, wherein the pseudo-teacher creation unit creates the pseudo teacher data based on the creation rules and on synthesis rules generated by combining a plurality of the creation rules.
  5.  The classification device according to any one of claims 1 to 4, wherein the learning unit learns the noise model based on tokens contained in the threat document and learns the classification model based on the threat document as a whole.
  6.  The classification device according to any one of claims 1 to 5, wherein the learning unit performs first-half learning in which the classification model is learned based on the threat document and the pseudo teacher data, and performs second-half learning in which the classification model and the noise model are learned in parallel using the mutual relationship, based on the threat document, the pseudo teacher data, and the classification model for which the first-half learning has been completed.
  7.  A classification method comprising:
     a creation step of creating pseudo teacher data from a threat document having a description relating to a cyber threat, based on creation rules for pseudo teacher data;
     a learning step of learning a noise model and a classification model in parallel, using their mutual relationship, based on the threat document and the pseudo teacher data created in the creation step; and
     a document classification step of classifying an input classification-target threat document using the classification model generated by the learning in the learning step.
  8.  A classification program that causes a computer to execute:
     a creation step of creating pseudo teacher data from a threat document having a description relating to a cyber threat, based on creation rules for pseudo teacher data;
     a learning step of learning a noise model and a classification model in parallel, using their mutual relationship, based on the threat document and the pseudo teacher data created in the creation step; and
     a document classification step of classifying an input classification-target threat document using the classification model generated by the learning in the learning step.
PCT/JP2020/035873 2020-09-23 2020-09-23 Classification device, classification method, and classification program WO2022064579A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/035873 WO2022064579A1 (en) 2020-09-23 2020-09-23 Classification device, classification method, and classification program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/035873 WO2022064579A1 (en) 2020-09-23 2020-09-23 Classification device, classification method, and classification program

Publications (1)

Publication Number Publication Date
WO2022064579A1 true WO2022064579A1 (en) 2022-03-31

Family

ID=80844604

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/035873 WO2022064579A1 (en) 2020-09-23 2020-09-23 Classification device, classification method, and classification program

Country Status (1)

Country Link
WO (1) WO2022064579A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017167854A (en) * 2016-03-16 2017-09-21 株式会社東芝 Learning device, method and program

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017167854A (en) * 2016-03-16 2017-09-21 株式会社東芝 Learning device, method and program

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ISHAN JINDAL; DANIEL PRESSEL; BRIAN LESTER; MATTHEW NOKLEBY: "An Effective Label Noise Model for DNN Text Classification", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 18 March 2019 (2019-03-18), 201 Olin Library Cornell University Ithaca, NY 14853 , XP081154787 *
LONG ZI; TAN LIANZHI; ZHOU SHENGPING; HE CHAOYANG; LIU XIN: "Collecting Indicators of Compromise from Unstructured Text of Cybersecurity Articles using Neural-Based Sequence Labelling", 2019 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), IEEE, 14 July 2019 (2019-07-14), pages 1 - 8, XP033621979, DOI: 10.1109/IJCNN.2019.8852142 *
OKADA, GOKI: "label Security Reports with LDA", IEICE TECHNICAL REPORT, vol. 117, no. 481, 30 November 2017 (2017-11-30), JP , pages 151 - 156, XP009535762, ISSN: 0913-5685 *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20955173

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20955173

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP