CN111435375A - Threat information automatic labeling method based on FastText - Google Patents


Info

Publication number
CN111435375A
Authority
CN
China
Prior art keywords
threat
threat information
labeling
information
fasttext
Prior art date
Legal status
Pending
Application number
CN201811587862.3A
Other languages
Chinese (zh)
Inventor
翟江涛
徐留杰
孙中军
Current Assignee
Nanjing Zhichangrong Information Technology Co ltd
Original Assignee
Nanjing Zhichangrong Information Technology Co ltd
Priority date
2018-12-25
Filing date
2018-12-25
Publication date
2020-07-21
Application filed by Nanjing Zhichangrong Information Technology Co ltd
Priority to CN201811587862.3A
Publication of CN111435375A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a FastText-based method for automatically labeling threat intelligence, comprising the following steps in sequence: (1) establishing an automatic threat intelligence labeling model; (2) automatically labeling threat intelligence with that model. The method collects a large volume of threat intelligence, builds a dedicated threat intelligence word bank using techniques such as word segmentation and word frequency statistics, and combines the word bank with a FastText classifier, so that higher recall and precision can be obtained and threat intelligence is labeled automatically.

Description

Threat information automatic labeling method based on FastText
Technical Field
The invention relates to a network and information security technology, in particular to an automatic labeling method for network security threat information based on FastText.
Background
The concept of threat intelligence was first proposed in 2012 in the big data research and development initiative released by the U.S. government. Threat intelligence work converts isolated, unordered threat information into a fixed format, so that the information can be normalized and organized and threat data can be analyzed in depth. At present, however, threat intelligence organizations do not share a uniform understanding of threat intelligence, so the threat intelligence found on the network takes many different forms. To make query and analysis more efficient, a growing number of researchers are working on automatic labeling methods for threat intelligence and have obtained many research results.
The FastText-based threat intelligence labeling method is a brand-new labeling method. FastText is a text classifier proposed by Facebook AI Research in 2016. Compared with other text classification models such as SVM, logistic regression and neural networks, FastText greatly shortens training time while maintaining classification quality.
The FastText method comprises three parts: the model architecture, hierarchical softmax and N-gram features. The FastText model takes a sequence of words (a piece of text or a sentence) as input and outputs the probability that the sequence belongs to each category. The words and phrases in the sequence form a feature vector; the feature vector is mapped to the middle layer by a linear transformation, and the middle layer is then mapped to the label. FastText uses a non-linear activation function when predicting labels but none in the middle layer. The FastText model architecture is very similar to the CBOW model in Word2Vec; the difference is that FastText predicts a label, whereas CBOW predicts the middle word.
First part: the model architecture of FastText is similar to that of CBOW. Both models are based on hierarchical softmax, and both are three-layer architectures: input layer, hidden layer and output layer. The CBOW model is based on the N-gram model and the bag-of-words (BOW) model and takes W(t-N+1) ... W(t-1) as input to predict W(t), whereas the FastText model takes the whole text as the feature to predict the category of the text.
Second part: mapping between layers. The words and phrases in the input layer form a feature vector, which is mapped to the hidden layer by a linear transformation. The hidden layer solves a maximum likelihood function, and a Huffman tree is then constructed from the weight of each category and the model parameters and used as the output.
Third part: the N-gram features of FastText. A common feature representation is the bag-of-words model (the input data is converted into the corresponding BOW form). The bag-of-words model cannot take word order into account, however, so FastText also adds N-gram features. For example, the bag-of-words features of the sentence "I love her" are "I", "love" and "her". In the original Chinese, where both sentences consist of the same three words, these are identical to the features of the sentence "she loves me". If 2-gram features are added, the first sentence also has the features "I-love" and "love-her", and the two sentences can be distinguished. To improve efficiency, low-frequency N-grams need to be filtered out.
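The effect of adding word N-grams can be sketched in a few lines of Python. This is only an illustration of the idea described above, not FastText's internal implementation; the function names and the toy tokens are assumptions made for the example.

```python
def word_ngrams(tokens, n):
    """Return the contiguous word n-grams of a token list, joined with '-'."""
    return ["-".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def features(tokens, max_n=2):
    """Collect all word n-grams up to max_n and sort them, so that only
    membership matters and the position of a feature in the text is ignored."""
    feats = []
    for n in range(1, max_n + 1):
        feats.extend(word_ngrams(tokens, n))
    return sorted(feats)


# Two toy sentences built from the same three tokens in opposite order.
a = ["A", "loves", "B"]
b = ["B", "loves", "A"]

# With unigrams only, the two sentences have identical features.
same_bag = features(a, max_n=1) == features(b, max_n=1)

# Adding bigrams makes the feature sets differ.
same_with_bigrams = features(a, max_n=2) == features(b, max_n=2)
```

With unigrams alone the two token lists are indistinguishable; once bigrams such as "A-loves" and "B-loves" are included, they are not.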
FastText associates a low-dimensional vector with each word. The hidden representation is shared among the classifiers for all categories, so that text information learned for one category can be used for the others. This kind of representation is called a bag of words because word order is ignored. FastText also uses vector representations of word N-grams to take local word order into account, which is important for many text classification problems.
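The architecture described above (word vectors averaged into a hidden representation with no activation, then a linear output layer followed by a non-linearity) can be sketched as follows. This is a minimal sketch under stated assumptions: it uses a plain softmax rather than the hierarchical softmax the text mentions, and the embeddings, weights and label names are made-up toy values.

```python
import math


def average(vectors):
    """Element-wise mean of equal-length vectors (the hidden layer, no activation)."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(tokens, embeddings, weights, labels):
    """Average the token vectors, apply a linear output layer, then softmax."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        raise ValueError("no known tokens in input")
    hidden = average(known)
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in weights]
    return dict(zip(labels, softmax(scores)))


# Toy 2-dimensional embeddings and a 2-class output matrix (all values are made up).
embeddings = {"trojan": [1.0, 0.0], "phishing": [0.0, 1.0], "email": [0.0, 1.0]}
weights = [[2.0, 0.0],   # output row for the label "malware"
           [0.0, 2.0]]   # output row for the label "phishing"
labels = ["malware", "phishing"]

probs = predict_proba(["phishing", "email"], embeddings, weights, labels)
```

Because the same embedding table feeds every output row, the hidden representation is shared across all categories, which is the sharing the paragraph above refers to.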
The main advantages of the FastText-based automatic threat intelligence labeling method are: (1) support for multiple languages: since threat intelligence in China is still at an early stage, much of the intelligence comes from abroad, and FastText supports multiple languages including English, German, Spanish and French; (2) suitability for large data volumes together with a high training speed: more than 1 billion words can be processed within 10 minutes on a standard multi-core CPU.
At present, no automatic threat intelligence labeling method has been disclosed.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the defects in the prior art, the invention aims to provide an automatic threat information labeling method based on FastText.
The technical scheme is as follows: a threat information automatic labeling method based on FastText, characterized in that the method sequentially comprises the following steps: (1) establishing an automatic threat information labeling model; (2) automatically labeling the threat intelligence by using the automatic labeling model.
Wherein the specific process of the step (1) is as follows:
(1.1) Setting a sample data collector: threat intelligence data is captured from the major threat intelligence websites using web crawler technology, and further threat intelligence data is obtained by subscribing to relevant intelligence mailings and parsing the mail content with mail analysis technology.
(1.2) Setting a data processor: the data acquired in step (1.1) is processed by the data processor, that is, each part of each piece of threat intelligence is split into fields, and the original label is extracted into a separate field.
(1.3) Constructing a threat intelligence word bank: vocabulary specific to threat intelligence is added to the original word bank to improve the accuracy of the labeling result.
(1.4) Setting a text segmenter: the content and description parts of the threat intelligence are segmented into words by the text segmenter.
(1.5) Setting a summary generator: the word list produced in step (1.4) is analyzed to obtain the most relevant phrases, and a summary is formed by building an abstract semantic representation and applying natural language generation technology.
(1.6) Training a model: the phrases generated in step (1.5) are used to train a model with the FastText method, yielding the automatic threat intelligence labeling model.
Step (2) automatically labels threat intelligence using the model trained with FastText, and specifically comprises the following steps:
(2.1) Setting an information collector: threat intelligence is collected by web crawlers, mail analysis and other means; this requires coping with the anti-crawler measures of threat intelligence websites and designing mail analysis software.
(2.2) Setting a data processor: the description and text content of the threat intelligence acquired in step (2.1) are extracted as test data.
(2.3) Setting a text segmenter: the threat intelligence text is segmented using the Jieba algorithm combined with the threat intelligence word bank constructed in step (1.3), yielding the corresponding word list.
(2.4) Setting a summary generator: the word list generated in step (2.3) is analyzed using the TextRank algorithm, and a summary is formed by building an abstract semantic representation and applying natural language generation technology.
(2.5) Judging the labeling result: the summary generated in step (2.4) is input into the automatic threat intelligence labeling model from step (1.6), which outputs a label; this label is compared with the label obtained in step (1.2). If the two are the same, the labeling is correct; if they differ, the labeling is wrong.
Further, the sample data collector in step (1.1) is web crawler and mail analysis software.
Further, the text segmenter in step (1.4) is text segmentation software based on the Jieba algorithm and the threat intelligence word bank.
Further, the summary generator in step (1.5) is automatic summarization software based on the TextRank algorithm.
Further, the automatic labeling model trained in step (1.6) is automatic text classification software based on the FastText algorithm.
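Steps (1.6) and (2.5) can be illustrated with the plain-text format that FastText's supervised mode reads by default: one sample per line, with the label prefixed by `__label__`. The sketch below only formats such lines. The sample labels and keywords are hypothetical, and the source does not state how its training data was actually laid out.

```python
def to_fasttext_line(label, keywords):
    """Format one sample as '__label__<label> <space-separated keywords>'."""
    clean_label = label.strip().replace(" ", "_")
    return "__label__{} {}".format(clean_label, " ".join(keywords))


def write_training_file(samples, path):
    """Write (label, keywords) pairs to a UTF-8 text file, one sample per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for label, keywords in samples:
            handle.write(to_fasttext_line(label, keywords) + "\n")


# Hypothetical labelled summaries; real ones would come from steps (1.2) and (1.5).
samples = [
    ("vulnerability", ["buffer", "overflow", "remote", "code", "execution"]),
    ("phishing", ["email", "credential", "fake", "login", "page"]),
]
lines = [to_fasttext_line(label, keywords) for label, keywords in samples]
```

If the `fasttext` Python package is available, a file written this way could be passed to `fasttext.train_supervised(input=path)`; whether the inventors used that package or the command-line tool is not stated in the source.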
Advantageous effects: the method first obtains a labeled sample data set through web crawler and mail analysis technology and then constructs a dedicated threat intelligence word bank. The acquired text is segmented using the Jieba method combined with this word bank. A graph model is built over the segmented words with the TextRank method, the important components of the text are ranked by a voting mechanism, and the 10 top-ranked words are taken as the summary of the text. An automatic labeling model is then built on the text summaries using the FastText method, and finally the originally acquired threat intelligence is labeled automatically by the model. By training a FastText-based automatic labeling model and combining it with TextRank automatic summarization, the invention can effectively improve the reliability, accuracy and efficiency of automatic threat intelligence labeling.
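The graph-and-voting step described above can be sketched as a small TextRank-style keyword ranker. This is a simplified illustration, not the inventors' implementation: the window size, damping factor, iteration count and toy tokens are all assumptions, and the graph is unweighted.

```python
from collections import defaultdict


def build_graph(words, window=2):
    """Link every pair of distinct words that co-occur within `window` positions."""
    graph = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1:i + 1 + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph


def textrank(words, window=2, damping=0.85, iterations=50):
    """Score words with a PageRank-style iteration over the co-occurrence graph."""
    graph = build_graph(words, window)
    scores = {word: 1.0 for word in graph}
    for _ in range(iterations):
        updated = {}
        for word in graph:
            # Each neighbour "votes" with its own score, split among its links.
            vote = sum(scores[n] / len(graph[n]) for n in graph[word])
            updated[word] = (1 - damping) + damping * vote
        scores = updated
    return scores


def top_keywords(words, k=10):
    """Return the k highest-scoring words, ties broken alphabetically."""
    scores = textrank(words)
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


# Already-segmented toy text (made up for illustration).
tokens = ("malware uses phishing email malware drops payload "
          "malware contacts server phishing email steals password").split()
summary = top_keywords(tokens, k=3)
```

Calling `top_keywords(tokens, k=10)` would correspond to the "top 10 words" summary described in the text; words that co-occur with many other words collect the most votes.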
Drawings
FIG. 1 is a flow chart of automated threat intelligence labeling model establishment in an embodiment.
FIG. 2 is a flow chart of automatic threat intelligence labeling in the embodiment.
FIG. 3 is a graph showing how the labeling results vary with the number of training samples in the embodiment.
Detailed Description
The technical solution of the present invention is described in detail below, but the scope of the present invention is not limited to the embodiments.
Example one:
The FastText-based automatic threat intelligence labeling method in this embodiment comprises the following processes:
1. An automatic threat intelligence labeling model is established, as shown in fig. 1; the specific flow is as follows:
(1) Setting a sample data collector: threat intelligence data is collected through web crawler and email analysis technology. The collected data comprises 2 thousand items of domestic and foreign vulnerability information, 2 thousand items of threat intelligence from threat intelligence websites and 2 thousand technical reports from security organizations. The label of each item is obtained at the same time, and the items serve as the sample set for training the model.
(2) Setting a data processor: the acquired threat intelligence data is cleaned using Python, the text description part is screened out, and the text description and its corresponding label are stored in two separate fields.
(3) Constructing a threat intelligence word bank: a vocabulary library specific to threat intelligence is downloaded from the internet, and the higher-frequency words in each threat intelligence text are counted and added to the word bank, which improves the accuracy of automatic labeling.
(4) Setting a text segmenter: the threat intelligence is segmented using the Jieba method combined with the threat intelligence word bank, splitting the text into words.
(5) Setting a summary generator: if all the words generated in step (4) were fed into the model, the computation speed of the model would suffer once the text became too large. The segmented threat intelligence is therefore further ranked with the TextRank method, and the top 10 words are selected as the summary of the text.
(6) Training a model: the automatic threat intelligence labeling model is trained with the FastText method, using samples drawn from the sample data source as the training set.
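Step (3) above, extending a seed vocabulary with high-frequency words, can be sketched with a simple counter. The threshold, the stopword handling and the sample data are assumptions for illustration; the source does not say what frequency cut-off was used.

```python
from collections import Counter


def extend_word_bank(base_terms, segmented_texts, min_count=2, stopwords=()):
    """Add words seen at least `min_count` times across the corpus to the base word bank."""
    counts = Counter(
        word
        for tokens in segmented_texts
        for word in tokens
        if word not in stopwords
    )
    frequent = {word for word, count in counts.items() if count >= min_count}
    return set(base_terms) | frequent


# Hypothetical seed vocabulary and already-segmented texts.
base_terms = {"trojan", "botnet"}
corpus = [
    ["ransomware", "encrypts", "the", "files"],
    ["ransomware", "spreads", "via", "the", "botnet"],
    ["new", "trojan", "found"],
]
word_bank = extend_word_bank(base_terms, corpus, min_count=2, stopwords={"the"})
```

Here only "ransomware" occurs often enough to be added, so the resulting word bank is the seed vocabulary plus that one term.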
2. Labeling is performed with the automatic threat intelligence labeling model, as shown in fig. 2; the specific flow is as follows:
(A) Setting an information collector: threat intelligence data is collected through web crawler and email analysis technology. The collected data comprises 5000 items of domestic and foreign vulnerability information, 5000 items of threat intelligence from threat intelligence websites and 5000 technical reports from security organizations.
(B) Setting a data processor: the acquired threat intelligence data is cleaned using Python, the text description part is screened out, and the text description and its corresponding label are stored in two separate fields.
(C) Setting a text segmenter: the threat intelligence is segmented using the Jieba method combined with the threat intelligence word bank, splitting the text into words.
(D) Setting a summary generator: if all the words generated in step (C) were fed into the model, the computation speed of the model would suffer once the text became too large. The segmented threat intelligence is therefore further ranked with the TextRank method, and the top 10 words are selected as the summary of the text.
(E) Judging the labeling result: 5000 items are sampled from the data collected in step (A) as the test set for automatic threat intelligence labeling, and the recall and precision of the model are determined by comparing the original labels with the predicted labels.
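The segmentation steps, (4) and (C), combine a general segmenter with a domain word bank. The sketch below shows why the word bank matters, using greedy forward maximum matching as a deliberately simple stand-in. It is not the Jieba algorithm, and the word lists and sample sentence are hypothetical.

```python
def segment(text, word_bank, max_len=None):
    """Greedy forward maximum matching: take the longest word-bank entry at each position."""
    if max_len is None:
        max_len = max((len(w) for w in word_bank), default=1)
    max_len = max(1, max_len)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when nothing in the word bank matches.
            if size == 1 or piece in word_bank:
                tokens.append(piece)
                i += size
                break
    return tokens


# General vocabulary: hacker, exploit, launch, vulnerability, attack.
general = {"黑客", "利用", "发起", "漏洞", "攻击"}
# Hypothetical domain entries: zero-day vulnerability, phishing attack.
threat_terms = {"零日漏洞", "钓鱼攻击"}
# "Hackers exploit a zero-day vulnerability to launch a phishing attack."
text = "黑客利用零日漏洞发起钓鱼攻击"

without_terms = segment(text, general)
with_terms = segment(text, general | threat_terms)
```

Without the domain entries, terms such as "零日漏洞" are broken into fragments; with them, each threat term survives as a single token. With the `jieba` package, the comparable step would be loading a user dictionary (for example via `jieba.load_userdict`) before segmenting.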
As shown in fig. 3, the number of sampled training samples in this embodiment is increased from 5000 to 11000, and the resulting accuracy and recall of automatic threat intelligence labeling are plotted. The blue line indicates accuracy and the yellow line indicates recall. As can be seen from fig. 3, the method labels threat intelligence well. Accuracy and recall rise gradually as the training set grows; given a sufficiently large training set and a rich threat intelligence word bank, the method can label threat intelligence automatically and very effectively.
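The comparison of original and predicted labels in step (E) can be sketched as a per-label precision and recall calculation. The label names and the five sample predictions below are made up; the source reports its results only as the curves in fig. 3.

```python
def precision_recall(true_labels, predicted_labels, target):
    """Precision and recall for one label, comparing original and predicted labels."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up original and predicted labels for five test items.
original = ["vuln", "vuln", "phishing", "malware", "vuln"]
predicted = ["vuln", "phishing", "phishing", "malware", "vuln"]

p, r = precision_recall(original, predicted, "vuln")
```

In this toy case every item predicted as "vuln" really is one, so precision is 1.0, while one of the three true "vuln" items is missed, so recall is 2/3.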

Claims (5)

1. A threat information automatic labeling method based on FastText, characterized in that the method sequentially comprises the following steps: (1) establishing an automatic threat information labeling model; (2) automatically labeling the threat information by using the automatic labeling model;
wherein the specific process of step (1) is as follows:
(1.1) setting a sample data collector: capturing threat intelligence data from the major threat intelligence websites using web crawler technology, and obtaining further threat intelligence data by subscribing to relevant intelligence mailings and parsing the mail content with mail analysis technology;
(1.2) setting a data processor: processing the data acquired in step (1.1) with the data processor, that is, splitting each part of each piece of threat intelligence into fields and extracting the original label into a separate field;
(1.3) constructing a threat intelligence word bank: adding vocabulary specific to threat intelligence to the original word bank to improve the accuracy of the labeling result;
(1.4) setting a text segmenter: segmenting the content and description parts of the threat intelligence into words with the text segmenter;
(1.5) setting a summary generator: analyzing the word list produced in step (1.4) to obtain the most relevant phrases, and forming a summary by building an abstract semantic representation and applying natural language generation technology;
(1.6) training a model: training on the phrases generated in step (1.5) with the FastText method to obtain the automatic threat intelligence labeling model;
wherein step (2) automatically labels threat intelligence using the model trained with FastText, and specifically comprises the following steps:
(2.1) setting an information collector: collecting threat intelligence by web crawlers, mail analysis and other means, which requires coping with the anti-crawler measures of threat intelligence websites and designing mail analysis software;
(2.2) setting a data processor: extracting the description and text content of the threat intelligence acquired in step (2.1) as test data;
(2.3) setting a text segmenter: segmenting the threat intelligence text using the Jieba algorithm combined with the threat intelligence word bank constructed in step (1.3) to obtain the corresponding word list;
(2.4) setting a summary generator: analyzing the word list generated in step (2.3) using the TextRank algorithm, and forming a summary by building an abstract semantic representation and applying natural language generation technology;
(2.5) judging the labeling result: inputting the summary generated in step (2.4) into the automatic threat intelligence labeling model from step (1.6), outputting a label, and comparing it with the label obtained in step (1.2); if the two are the same, the labeling is correct, and if they differ, the labeling is wrong.
2. The FastText-based automatic threat intelligence labeling method of claim 1, wherein: the sample data collector in step (1.1) is web crawler and mail analysis software.
3. The FastText-based automatic threat intelligence labeling method of claim 1, wherein: the text segmenter in step (1.4) is text segmentation software based on the Jieba algorithm and the threat intelligence word bank.
4. The FastText-based automatic threat intelligence labeling method of claim 1, wherein: the summary generator in step (1.5) is automatic summarization software based on the TextRank algorithm.
5. The FastText-based automatic threat intelligence labeling method of claim 1, wherein: the automatic labeling model trained in step (1.6) is automatic text classification software based on the FastText algorithm.
CN201811587862.3A 2018-12-25 2018-12-25 Threat information automatic labeling method based on FastText Pending CN111435375A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811587862.3A CN111435375A (en) 2018-12-25 2018-12-25 Threat information automatic labeling method based on FastText

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811587862.3A CN111435375A (en) 2018-12-25 2018-12-25 Threat information automatic labeling method based on FastText

Publications (1)

Publication Number Publication Date
CN111435375A 2020-07-21

Family

ID=71579713

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811587862.3A Pending CN111435375A (en) 2018-12-25 2018-12-25 Threat information automatic labeling method based on FastText

Country Status (1)

Country Link
CN (1) CN111435375A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112765366A (en) * 2021-01-24 2021-05-07 中国电子科技集团公司第十五研究所 APT (advanced persistent threat) organization portrait construction method based on knowledge graph
CN113343241A (en) * 2021-07-20 2021-09-03 南京中孚信息技术有限公司 Dynamic label generation method based on online malicious software scanning platform
CN113688240A (en) * 2021-08-25 2021-11-23 南京中孚信息技术有限公司 Threat element extraction method, device, equipment and storage medium
CN113688240B (en) * 2021-08-25 2024-01-30 南京中孚信息技术有限公司 Threat element extraction method, threat element extraction device, threat element extraction equipment and storage medium
CN114706972A (en) * 2022-03-21 2022-07-05 北京理工大学 Unsupervised scientific and technical information abstract automatic generation method based on multi-sentence compression

Similar Documents

Publication Publication Date Title
Hassan et al. Sentiment analysis on bangla and romanized bangla text using deep recurrent models
CN109858041B (en) Named entity recognition method combining semi-supervised learning with user-defined dictionary
Dashtipour et al. Exploiting deep learning for Persian sentiment analysis
CN106919673A (en) Text mood analysis system based on deep learning
CN110502753A (en) A kind of deep learning sentiment analysis model and its analysis method based on semantically enhancement
CN111435375A (en) Threat information automatic labeling method based on FastText
CN112417863B (en) Chinese text classification method based on pre-training word vector model and random forest algorithm
Hassan et al. Sentiment analysis on bangla and romanized bangla text (BRBT) using deep recurrent models
CN111506732B (en) Text multi-level label classification method
CN109492105B (en) Text emotion classification method based on multi-feature ensemble learning
CN113157859B (en) Event detection method based on upper concept information
CN112069312B (en) Text classification method based on entity recognition and electronic device
CN114676255A (en) Text processing method, device, equipment, storage medium and computer program product
US20230073602A1 (en) System of and method for automatically detecting sarcasm of a batch of text
CN110297986A (en) A kind of Sentiment orientation analysis method of hot microblog topic
CN109446299A (en) The method and system of searching email content based on event recognition
Rajalakshmi et al. Sentimental analysis of code-mixed Hindi language
Mahmud et al. Deep learning based sentiment analysis from Bangla text using glove word embedding along with convolutional neural network
Parvin et al. Multi-class textual emotion categorization using ensemble of convolutional and recurrent neural network
Jagadeesan et al. Twitter Sentiment Analysis with Machine Learning
CN112579730A (en) High-expansibility multi-label text classification method and device
Ruposh et al. A computational approach of recognizing emotion from Bengali texts
CN110888983B (en) Positive and negative emotion analysis method, terminal equipment and storage medium
CN110413972B (en) Intelligent table name field name complementing method based on NLP technology
CN112528653A (en) Short text entity identification method and system

Legal Events

Date Code Title Description
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200721