CN111435375A

CN111435375A - Threat information automatic labeling method based on FastText

Info

Publication number: CN111435375A
Application number: CN201811587862.3A
Authority: CN
Inventors: 翟江涛; 徐留杰; 孙中军
Original assignee: Nanjing Zhichangrong Information Technology Co ltd
Current assignee: Nanjing Zhichangrong Information Technology Co ltd
Priority date: 2018-12-25
Filing date: 2018-12-25
Publication date: 2020-07-21

Abstract

The invention discloses a FastText-based method for automatically labeling threat information, which comprises the following steps in sequence: (1) establishing an automatic threat information labeling model; (2) automatically labeling threat intelligence by using the automatic labeling model. In the method, a word bank dedicated to threat information is constructed by collecting a large volume of threat information and applying techniques such as word segmentation and word frequency statistics, and this word bank is combined with a FastText classifier, so that a higher recall ratio and precision ratio can be obtained and automatic labeling of threat information is realized.

Description

Threat information automatic labeling method based on FastText

Technical Field

The invention relates to a network and information security technology, in particular to an automatic labeling method for network security threat information based on FastText.

Background

The concept of threat intelligence was first proposed in 2012, in the big data research and development initiative released by the U.S. government. Threat intelligence work converts isolated, disordered threat information into threat intelligence with a fixed format, so that the threat information can be normalized and organized, which facilitates in-depth analysis of threat data. At present, however, threat intelligence organizations do not share a uniform understanding of threat intelligence, so threat intelligence on the network takes many different forms. To improve the efficiency of query and analysis, more and more researchers are working on automatic labeling methods for threat intelligence and have obtained many research results.

The FastText-based threat information labeling method is a brand-new threat information labeling method. FastText is a text classifier proposed by Facebook AI Research in 2016. Its distinguishing feature is that, compared with other text classification models such as SVM, Logistic Regression and neural networks, FastText greatly shortens training time while maintaining classification performance.

The FastText method comprises three parts: the model architecture, hierarchical softmax and N-gram features. The FastText model takes a sequence of words (a piece of text or a sentence) as input and outputs the probability that the sequence belongs to each category. The words and phrases in the sequence form a feature vector, the feature vector is mapped to the middle layer through a linear transformation, and the middle layer is then mapped to the label. FastText uses a non-linear activation function when predicting labels, but does not use a non-linear activation function in the middle layer. The FastText model architecture is very similar to the CBOW model in Word2Vec. The difference is that FastText predicts the label, whereas the CBOW model predicts the middle word.
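The data flow described above can be illustrated with a minimal sketch. It assumes that the middle layer is the plain average of the input word vectors (the usual description of FastText, not spelled out in this text) and that the non-linearity at the output is a softmax; the function name, the toy embedding table and the weight values are hypothetical and chosen only for illustration.

```python
import math

def fasttext_forward(token_ids, embeddings, output_weights):
    """Average the token embeddings, apply a linear layer, then a softmax."""
    dim = len(embeddings[0])
    # Middle layer: plain average of the input vectors, with no non-linearity.
    hidden = [sum(embeddings[t][d] for t in token_ids) / len(token_ids)
              for d in range(dim)]
    # Output layer: one linear score per label.
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in output_weights]
    # Softmax turns the scores into label probabilities.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy parameters: 4 vocabulary entries, 2 dimensions, 3 labels.
EMBEDDINGS = [[0.1, 0.3], [0.5, -0.2], [-0.4, 0.8], [0.0, 0.1]]
OUTPUT_WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

probs = fasttext_forward([0, 2, 3], EMBEDDINGS, OUTPUT_WEIGHTS)
print(probs)
```

The returned list holds one probability per label; the label with the largest probability would be taken as the predicted tag.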

The first part: the model architecture of FastText is similar to that of CBOW. Both models are based on hierarchical softmax, and both are three-layer architectures: input layer, hidden layer and output layer. The CBOW model is based on the N-gram model and the BOW model; it takes w(t-N+1), ..., w(t-1) as input to predict w(t), whereas the FastText model takes the whole text as features to predict the category of the text.

The second part: mapping between layers. The words and phrases in the input layer form a feature vector, and the feature vector is mapped to the hidden layer through a linear transformation. The hidden layer solves the maximum likelihood function, a Huffman tree is then constructed according to the weight of each category and the model parameters, and the Huffman tree is used as the output.
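As an illustration of the Huffman tree mentioned above, the following sketch builds binary codes from category weights, assuming that the weight of a category is simply its sample count. The function name and the example counts are hypothetical; an actual hierarchical softmax implementation additionally stores a parameter vector at every inner node of the tree.

```python
import heapq
import itertools

def huffman_codes(label_counts):
    """Return a binary code per label; frequent labels get shorter codes."""
    tiebreak = itertools.count()  # keeps heap entries comparable on equal counts
    heap = [(count, next(tiebreak), {label: ""})
            for label, count in label_counts.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees under a new inner node.
        count_a, _, codes_a = heapq.heappop(heap)
        count_b, _, codes_b = heapq.heappop(heap)
        merged = {label: "0" + code for label, code in codes_a.items()}
        merged.update({label: "1" + code for label, code in codes_b.items()})
        heapq.heappush(heap, (count_a + count_b, next(tiebreak), merged))
    return heap[0][2]

# Hypothetical category weights (sample counts per label).
LABEL_COUNTS = {"malware": 50, "phishing": 30, "vulnerability": 15, "botnet": 5}
codes = huffman_codes(LABEL_COUNTS)
print(codes)
```

Because frequent categories sit closer to the root, predicting them requires fewer binary decisions, which is where the speed-up over a flat softmax comes from.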

The third part: the N-gram features of FastText. A commonly used feature is the bag-of-words model (converting the input data into the corresponding BOW form). However, the bag-of-words model cannot take the order of words into account, so FastText also adds N-gram features. For example, the bag-of-words features of the sentence "I love her" are "I", "love" and "her". These are the same as the features of the sentence "she loves me" (the same three words in a different order, disregarding English inflection). If 2-gram features are added, the first sentence additionally has the features "I-love" and "love-her", so the two sentences "I love her" and "she loves me" can be distinguished. To improve efficiency, low-frequency N-grams need to be filtered out.
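The difference between bag-of-words features and word 2-gram features can be checked with a small sketch. The two token lists below are hypothetical stand-ins for the two example sentences, written with identical tokens in reverse order so that the comparison does not depend on English inflection.

```python
def bag_of_words(tokens):
    """Order-insensitive features: the sorted list of tokens."""
    return sorted(tokens)

def word_ngrams(tokens, n=2):
    """Order-sensitive features: every run of n consecutive tokens."""
    return ["-".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

first = ["I", "love", "her"]
second = ["her", "love", "I"]  # same tokens as `first`, reversed order

print(bag_of_words(first) == bag_of_words(second))  # bag-of-words cannot tell them apart
print(word_ngrams(first))
print(word_ngrams(second))
```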

In FastText, a low-dimensional vector is associated with each word. The hidden representation is shared among the classifiers of all classes, so that text information learned for one class can also be used for other classes. This kind of representation is called a bag of words (word order is ignored). FastText also uses vector representations of word N-grams to take local word order into account, which is important for many text classification problems.

The main advantages of the FastText-based automatic threat information labeling method are: (1) support for multiple languages: since threat intelligence in China is still at an early stage, much of the intelligence comes from abroad, and the FastText method supports multiple languages including English, German, Spanish and French; (2) suitability for big data together with high training speed: more than 1 billion words can be processed within 10 minutes on a standard multi-core CPU.

At present, no automatic threat intelligence labeling method has been disclosed.

Disclosure of Invention

The purpose of the invention is as follows: aiming at the defects in the prior art, the invention aims to provide an automatic threat information labeling method based on FastText.

The technical solution is as follows: a FastText-based automatic threat information labeling method, characterized in that the method comprises the following steps in sequence: (1) establishing an automatic threat information labeling model; (2) automatically labeling threat intelligence by using the automatic labeling model.

Wherein the specific process of the step (1) is as follows:

(1.1) setting a sample data collector: threat intelligence data are crawled from the major threat intelligence websites by using web crawler technology, and threat intelligence data are also obtained by subscribing to relevant intelligence mails and parsing the mail contents with mail parsing technology.

(1.2) setting a data processor: the data acquired in step (1.1) are processed by the data processor, that is, each part of each piece of threat intelligence is split into fields, and the original label is extracted and stored as a separate field.

(1.3) constructing a threat information word bank: vocabulary specific to threat information is added to the original word bank, thereby improving the accuracy of the labeling result.

(1.4) setting a text segmenter: the text segmenter is used to perform word segmentation on the content and description parts of the threat intelligence.

(1.5) setting a summary generator: the word list produced in step (1.4) is analyzed to obtain the phrases with higher relevance, and a summary is formed by establishing an abstract semantic representation and using natural language generation technology.

(1.6) training a model: the phrases generated in step (1.5) are used for training with the FastText method to obtain the automatic threat information labeling model.
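Steps (1.2) to (1.6) end with labeled summaries being handed to FastText. A minimal sketch of that hand-over is given below, assuming the plain-text training format used by the fastText tooling, in which each line starts with `__label__<label>` followed by the text. The helper name and the two sample records are hypothetical.

```python
def to_fasttext_line(label, summary_words):
    """Format one sample as '__label__<label> word1 word2 ...'."""
    safe_label = label.strip().replace(" ", "_")  # labels must not contain spaces
    return "__label__" + safe_label + " " + " ".join(summary_words)

# Hypothetical (label, summary words) pairs produced by the earlier steps.
samples = [
    ("vulnerability", ["buffer", "overflow", "remote", "code", "execution"]),
    ("phishing", ["email", "credential", "fake", "login", "page"]),
]

lines = [to_fasttext_line(label, words) for label, words in samples]
training_text = "\n".join(lines) + "\n"
print(training_text)
```

Written to a file, text in this format could then be passed to a trainer such as `fasttext.train_supervised(input=path)` from the fastText Python package; that call is not exercised here.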

Step (2) automatically labels threat intelligence by using the model trained with FastText, and specifically comprises the following steps:

(2.1) setting an information collector: threat intelligence is collected by means of web crawlers, mail parsing and the like; the anti-crawler strategies of threat intelligence websites need to be handled, and mail parsing software needs to be designed.

(2.2) setting a data processor: the description and body text of the threat intelligence acquired in step (2.1) are extracted as test data.

(2.3) setting a text segmenter: the threat intelligence text is segmented by using the Jieba algorithm in combination with the threat information word bank constructed in step (1.3) to obtain the corresponding word list.

(2.4) setting a summary generator: the word list generated in step (2.3) is analyzed by using the TextRank algorithm, and a summary is formed by establishing an abstract semantic representation and using natural language generation technology.

(2.5) judging the labeling result: the summary generated in step (2.4) is input into the automatic threat information labeling model of step (1.6), which outputs a label; this label is compared with the label obtained in step (1.2). If the two are the same, the labeling is correct; if they differ, the labeling is wrong.

Further, the sample data collector in step (1.1) is web crawler and mail parsing software.

Further, the text segmenter in step (1.4) is text segmentation software based on the Jieba algorithm and the threat information word bank.

Further, the summary generator in step (1.5) is automatic summary generation software based on the TextRank algorithm.

Further, the automatic labeling model trained in step (1.6) is automatic text classification software based on the FastText algorithm.

Advantageous effects: the method first obtains a labeled sample data set through web crawler technology and mail parsing technology, and then constructs a word bank dedicated to threat information. The obtained text is segmented by the Jieba method in combination with this dedicated word bank. A graph model is then built over the segmented words by the TextRank method, the important components of the text are ranked by a voting mechanism, and the 10 top-ranked words are taken as the summary of the text. An automatic labeling model is built on the text summaries by the FastText method, and finally the originally obtained threat intelligence is automatically labeled by the model. By training a FastText-based automatic threat information labeling model and combining it with the TextRank automatic summarization technique, the invention can effectively improve the reliability, accuracy and efficiency of automatic threat intelligence labeling.
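The graph model and voting mechanism of TextRank can be sketched in a few lines. This is a simplified, unweighted variant written for illustration only, not the implementation used in the embodiment; the window size, damping factor, iteration count and example tokens are assumptions.

```python
from collections import defaultdict

def textrank_keywords(words, top_k=10, window=2, damping=0.85, iterations=50):
    """Rank words by TextRank over an undirected co-occurrence graph."""
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        # Link each word to the words that follow it within the window.
        for other in words[i + 1:i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        updated = {}
        for word in neighbors:
            # Each neighbor votes for this word, splitting its own score
            # evenly among all of its neighbors.
            vote = sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            updated[word] = (1 - damping) + damping * vote
        scores = updated
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]

# Hypothetical segmented text.
tokens = ["trojan", "steals", "credentials", "trojan", "contacts", "server",
          "trojan", "encrypts", "files"]
keywords = textrank_keywords(tokens, top_k=3)
print(keywords)
```

With `top_k=10`, the returned list corresponds to the 10 top-ranked words that are taken as the summary of a text.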

Drawings

FIG. 1 is a flow chart of establishing the automatic threat intelligence labeling model in the embodiment.

FIG. 2 is a flow chart of automatic threat intelligence labeling in the embodiment.

FIG. 3 is a graph showing how the labeling results change as the number of training samples varies in the embodiment.

Detailed Description

The technical solution of the present invention is described in detail below, but the scope of the present invention is not limited to the embodiments.

Example one:

The FastText-based automatic threat intelligence labeling method in this embodiment comprises the following processes:

1. An automatic threat information labeling model is established, as shown in FIG. 1. The specific flow is as follows:

(1) Setting a sample data collector: threat intelligence data are collected through web crawler technology and email parsing technology. The collected intelligence data comprise 2 thousand pieces of domestic and foreign vulnerability information, 2 thousand pieces of threat intelligence from threat intelligence websites, and 2 thousand technical reports from security organizations. The label of each piece of intelligence is obtained at the same time, and the data serve as the sample set for training the model.

(2) Setting a data processor: the obtained threat intelligence data are cleaned by using Python, the text description part is screened out, and the text description and the corresponding label are stored in two separate fields.

(3) Constructing a threat information word bank: a vocabulary specific to threat information is downloaded from the Internet, and the words that occur with higher frequency in the threat intelligence texts are counted and added to the threat information word bank, thereby improving the accuracy of automatic labeling.

(4) Setting a text segmenter: the threat intelligence is segmented by the Jieba method in combination with the threat information word bank, splitting the text into words.

(5) Setting a summary generator: if all the words generated in step (4) were fed into the model, the computation speed of the model would suffer once the text became too large. The segmented threat intelligence is therefore further ranked by the TextRank method, and the top 10 words are selected as the summary of the text.

(6) Training a model: the automatic threat information labeling model is trained by using the FastText method, with samples drawn from the sample data source serving as the training set of the model.
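The word frequency statistics used in step (3) to enrich the word bank can be sketched as follows. The function name, the frequency threshold `min_count` and the example texts are hypothetical; the embodiment does not state which threshold is used.

```python
from collections import Counter

def extend_lexicon(base_lexicon, tokenized_texts, min_count=2):
    """Add words seen at least `min_count` times to a copy of the base lexicon."""
    counts = Counter(word for text in tokenized_texts for word in text)
    frequent = {word for word, count in counts.items() if count >= min_count}
    return set(base_lexicon) | frequent

# Hypothetical downloaded vocabulary and tokenized threat intelligence texts.
base = {"malware", "exploit"}
texts = [
    ["ransomware", "encrypts", "files"],
    ["ransomware", "exploit", "kit"],
    ["botnet", "ransomware"],
]
lexicon = extend_lexicon(base, texts)
print(sorted(lexicon))
```

The resulting word set could then be written to a user dictionary file for the segmenter of step (4), for example via Jieba's `jieba.load_userdict`, which is not exercised here.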

2. The automatic threat information labeling model is used for labeling, as shown in FIG. 2. The specific process is as follows:

(A) Setting an information collector: threat intelligence data are collected through web crawler technology and email parsing technology. The collected intelligence data comprise 5000 pieces of domestic and foreign vulnerability information, 5000 pieces of threat intelligence from threat intelligence websites, and 5000 technical reports from security organizations.

(B) Setting a data processor: the obtained threat intelligence data are cleaned by using Python, the text description part is screened out, and the text description and the corresponding label are stored in two separate fields.

(C) Setting a text segmenter: the threat intelligence is segmented by the Jieba method in combination with the threat information word bank, splitting the text into words.

(D) Setting a summary generator: if all the words generated in step (C) were fed into the model, the computation speed of the model would suffer once the text became too large. The segmented threat intelligence is therefore further ranked by the TextRank method, and the top 10 words are selected as the summary of the text.

(E) Judging the labeling result: 5000 pieces of data are sampled from the data collected in step (A) as the test set for automatic threat information labeling, and the recall ratio and precision ratio of the model are evaluated by comparing the original labels with the predicted labels.
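The comparison in step (E) can be made concrete with a small sketch that computes the precision ratio and recall ratio for one label. The function name and the example label lists are hypothetical; how the embodiment aggregates the per-label values into a single curve is not stated in the text.

```python
def precision_recall(original_labels, predicted_labels, target):
    """Precision and recall for one label, from original vs. predicted labels."""
    pairs = list(zip(original_labels, predicted_labels))
    true_pos = sum(1 for truth, pred in pairs if truth == target and pred == target)
    predicted = sum(1 for _, pred in pairs if pred == target)
    actual = sum(1 for truth, _ in pairs if truth == target)
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall

# Hypothetical original and predicted labels for five test items.
original = ["phishing", "malware", "malware", "phishing", "malware"]
predicted = ["phishing", "malware", "phishing", "phishing", "malware"]

precision, recall = precision_recall(original, predicted, "malware")
print(precision, recall)
```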

As shown in FIG. 3, the size of the sampled training set in this embodiment is increased from 5000 to 11000, and the accuracy and recall of automatic threat intelligence labeling are shown in the result graph. The blue line indicates accuracy and the yellow line indicates recall. As can be seen from FIG. 3, threat intelligence can be labeled automatically and well by applying the method. As the size of the training set increases, the accuracy and the recall gradually increase; given a sufficiently large training set combined with a rich threat information word bank, the method can label threat intelligence automatically and very effectively.

Claims

1. A threat information automatic labeling method based on FastText is characterized in that: the method sequentially comprises the following steps: (1) establishing an automatic threat information labeling model; (2) automatically labeling the threat information by using an automatic labeling model;

wherein the specific process of the step (1) is as follows:

(1.1) setting a sample data collector: using a web crawler technology to capture threat information data of each large threat information website, and analyzing mail contents to obtain the threat information data by subscribing related information mails and combining a mail analysis technology;

(1.2) setting a data processor: processing the data acquired in the step (1.1) by using a data processor, namely splitting each module of each threat intelligence into fields, and extracting the original label to be divided into one field;

(1.3) constructing a threat information word bank: adding vocabulary specific to threat information on the basis of the original word bank, thereby improving the accuracy of the labeling result;

(1.4) setting a text segmenter: using the text segmenter to perform word segmentation on the content and description parts of the threat intelligence;

(1.5) setting a summary generator: analyzing the word list produced in step (1.4) to obtain the phrases with higher relevance, and forming a summary by establishing an abstract semantic representation and using natural language generation technology;

(1.6) training a model: training on the phrases generated in step (1.5) by using the FastText method to obtain the automatic threat information labeling model;

(2.1) setting an information collector: collecting threat intelligence by means of web crawlers, mail parsing and the like, wherein the anti-crawler strategies of threat intelligence websites need to be handled and mail parsing software needs to be designed;

(2.2) setting a data processor: extracting description and text content in the threat information obtained in the step (2.1) as test data;

(2.3) setting a text segmenter: carrying out word segmentation processing on the threat information text by using a Jieba algorithm in combination with the threat information word bank constructed in the step (1.3) to obtain a corresponding word list;

(2.4) setting a summary generator: analyzing the word list generated in step (2.3) by using the TextRank algorithm, and forming a summary by establishing an abstract semantic representation and using natural language generation technology;

2. The automated FastText-based threat intelligence tagging method of claim 1, wherein: the sample data collector in step (1.1) is web crawler and mail parsing software.

3. The automated FastText-based threat intelligence tagging method of claim 1, wherein: the text segmenter in step (1.4) is text segmentation software based on the Jieba algorithm and the threat information word bank.

4. The automated FastText-based threat intelligence tagging method of claim 1, wherein: the summary generator in step (1.5) is automatic summary generation software based on the TextRank algorithm.

5. The automated FastText-based threat intelligence tagging method of claim 1, wherein: the automatic labeling model trained in step (1.6) is automatic text classification software based on the FastText algorithm.