CN112464666B

CN112464666B - Method for automatic discovery of unknown network threats based on darknet data

Info

Publication number: CN112464666B
Application number: CN201910763695.1A
Authority: CN
Inventors: 刘亮; 李孟铭; 郑荣锋
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2019-08-19
Filing date: 2019-08-19
Publication date: 2023-07-21
Anticipated expiration: 2039-08-19
Also published as: CN112464666A

Abstract

The invention discloses a method for automatically discovering unknown network threats based on darknet data. The method comprises the following steps: 1) crawling darknet forums and trading platforms and labeling the texts as training sets; 2) constructing a word2vec model and a darknet text named entity recognition model from the text training set; 3) obtaining the features of each text with the word2vec model, training a darknet text classification model, and classifying darknet text with that model; 4) analyzing "database leakage" darknet texts with the named entity recognition model, extracting company named entities, and discovering database leakage events; 5) analyzing "hacking tool" and "malicious code" darknet texts with the named entity recognition model, extracting hacking tool and malicious code named entities, and judging by a search-engine-based method whether each entity is an unknown hacking tool or unknown malicious code. The method helps security personnel respond to network threats in a timely manner.

Description

Method for automatic discovery of unknown network threats based on darknet data

Technical Field

The invention belongs to the field of network data analysis and text mining, and provides a method for automatically discovering unknown network threats based on darknet data.

Background

According to the Wikipedia definition, the darknet (dark web) is web content that exists on darknets, which are overlay networks that can only be accessed with special software, special authorization, or special computer settings. The darknets that make up the dark web include small friend-to-friend (F2F) peer-to-peer networks as well as large, popular networks operated by public organizations and individuals, such as Tor, Freenet, I2P, and Riffle.

For a long time, because the darknet conceals the true identity and network information of its users, lawbreakers have used it as a platform for spreading malicious code and trading illicit goods. In recent years, with the rapid growth of network attacks and the underground economy, more and more hacking tools, malicious code, and database leakage events are widely circulated and discussed on darknet forums and darknet trading platforms before security researchers discover them.

At present, because of the huge volume of data on darknet forums and trading platforms, it is difficult to discover unknown network threats from the darknet in a timely and effective manner through manual browsing and analysis.

Disclosure of Invention

In view of the characteristics of the text on darknet forums and darknet trading platforms, the invention provides a method for automatically discovering unknown network threats based on darknet data, which can automatically discover, from darknet forums and trading platforms, network threats that security personnel are not yet aware of. By processing darknet text, the method can discover stolen database information at an early stage, as well as new hacking tools and new malicious code sold on the darknet. Because discovery is automatic, network security researchers no longer need to browse and analyze large volumes of darknet text, and the ability to respond to darknet threats is improved.

The key technical points of the invention are as follows.

1. The darknet text is preprocessed with a spelling correction algorithm, the Porter stemming algorithm, and a regular-expression-based text replacement method, which resolves the word confusion caused by hacker jargon in darknet text and reduces or removes low-frequency features.

2. A word2vec model is used to build a darknet text feature mapping model, which removes the influence of low-frequency words on darknet text classification, reduces the dimension of the text features, and improves both the efficiency of model construction and the accuracy of classification.

3. A darknet text named entity recognition model is built by combining the constructed word2vec model with the BiLSTM-CRF neural network model, which addresses the difficulty existing named entity models have in extracting named entities from darknet text.

4. A search-engine-based method for judging unknown named entities exploits the relationship between network security companies and new hacking tools and new malicious code to determine efficiently whether a named entity is already known to network security researchers.

To reduce the cost of manual labeling, the forum posts and the commodity information of darknet trading platforms are labeled by crawling data that already carries classification labels on the darknet forums and trading platforms and then checking it manually. To label darknet text sequences, a company set, a database software set, a common software set, a known hacking tool set, and a known malicious code set are collected from Wikipedia, and a set of common hacker jargon is summarized from expert experience; the text sequences are then labeled with an existing named entity model, the labels are corrected using the collected term sets, and the result is verified manually.

Conventional text analysis methods typically use whitelist filtering, which discovers unknown named entities by removing known ones, and therefore have difficulty automatically discovering unknown network threats related to "database leakage", "hacking tools", and "malicious code". The invention uses the BiLSTM-CRF neural network model to build a darknet text named entity recognition model that can automatically extract company named entities from text and thereby discover database data sold on the darknet. It can also automatically extract hacking tool and malicious code named entities from text and, combined with the search-engine-based unknown named entity judgment method, discover hacking tools and malicious code not yet known to network security researchers.

Traditional text classification methods generally suffer from high text dimensionality, the introduction of low-frequency vocabulary, and reduced classification efficiency. The word2vec model built in combination with the preprocessing method effectively reduces the feature dimension of darknet text, removes low-frequency vocabulary, and strengthens the text features.

The specific scheme of the method is as follows.

1) Crawl darknet forum posts and commodity information on darknet trading platforms. Select well-known darknet forums and well-known darknet trading platforms, and write a crawler to crawl the forum posts and commodity information. Label the data according to the existing classification labels of the posts or commodity information, and manually review the crawled data.

2) Label the text sequences. Collect a company set, a database software set, a common software set, a known hacking tool set, and a known malicious code set from Wikipedia, and summarize a set of common hacker jargon from expert experience. Then label the darknet text sequences with the existing named entity model Stanford NER, correct the labeled sequences using the collected proper-noun sets, and finally review the labeled text sequences manually.

3) Construct the word2vec model. First, preprocess the darknet text that carries classification labels. The preprocessing comprises: correcting misspelled words with a spelling correction algorithm; reducing words to their stems with the Porter stemming algorithm, thereby reducing the feature dimension of the text; and replacing low-frequency special strings with high-frequency text features through regular-expression-based text replacement. Then construct the word2vec model using the Continuous Bag of Words algorithm.

4) Construct the darknet text classification model. Process the preprocessed darknet text with the word2vec model, map each word of each text to a word2vec vector, and obtain the feature vector of each text by averaging the word vectors of that text. Train a support vector machine model on the text feature vectors to obtain the darknet text classification model.

5) Construct the darknet text named entity recognition model. Process the preprocessed darknet text with the word2vec model, map each word of each text to a word2vec vector, and use the word2vec vectors together with the labeled text sequences as training samples to build the darknet text named entity recognition model with the BiLSTM-CRF neural network model.

6) Classify the darknet text. Process the darknet text to be classified with the text preprocessing method above and classify it with the constructed darknet text classification model. Texts classified as "database leakage", "hacking tool", or "malicious code" are then processed in different ways.

7) Discover database leakage events. For darknet texts classified as "database leakage", apply the darknet text named entity recognition model; if a company named entity is successfully extracted, a database leakage event has been discovered.

8) Discover emerging hacking tools and malicious code. First, collect website information for a series of network security companies, network security teams, and network security practitioners. Then, for darknet texts classified as "hacking tool" or "malicious code", use the darknet text named entity recognition model to extract named entities of the "hacking tool" and "malicious code" categories. Finally, judge whether each named entity is already known to network security personnel with the search-engine-based unknown named entity judgment method: search for the named entity with a search engine, and if no website link of a network security company, team, or practitioner appears in the first 20 search results, a hacking tool or malicious code not yet known to network security researchers has been discovered.
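
The routing described in steps 6) to 8) can be sketched as a small dispatcher. This is only an illustration under stated assumptions: the function name `discover_threats`, the three injected callables, the `COMPANY` label, and the returned tuples are hypothetical and are not defined in the source. In a real system, the callables would wrap the trained classifier, the named entity recognition model, and the search-engine check.

```python
def discover_threats(text, classify, extract, is_unknown):
    """Route one preprocessed darknet text through steps 6) to 8).

    classify   -- callable: text -> category string (step 6)
    extract    -- callable: (text, entity_label) -> list of entity strings
    is_unknown -- callable: entity string -> bool (search-engine check)
    All three are hypothetical stand-ins for the trained models.
    """
    category = classify(text)
    if category == "database leakage":
        # Step 7: any extracted company name signals a leakage event.
        return [("database leakage event", c) for c in extract(text, "COMPANY")]
    if category in ("hacking tool", "malicious code"):
        # Step 8: keep only entities the search-engine check reports as unknown.
        return [("unknown " + category, e)
                for e in extract(text, category) if is_unknown(e)]
    return []  # other categories are not processed further
```

Passing the models in as callables keeps the routing logic independent of any particular library, which makes it easy to exercise with stubs before the real models exist.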

Compared with the prior art, the invention has the following positive effects.

1. Unknown network threats not yet known to network security personnel can be discovered automatically from darknet text, reducing the cost of manual analysis and retrieval.

2. The method is more timely and helps security personnel respond to network threats promptly.

3. The method has higher accuracy.

Drawings

Fig. 1 is a flow chart of the present invention.

Fig. 2 is a preprocessing flow chart of the present invention.

Fig. 3 is a flow chart of the word2vec model construction of the present invention.

Fig. 4 is a flow chart of the darknet text classification model training of the present invention.

Detailed Description

The invention is described in further detail below with reference to the accompanying drawings.

The processing method of the invention is as follows.

Step 1: crawl darknet forum posts and commodity information from darknet trading platforms.

(1) Crawl the darknet forum posts and the commodity information of the darknet trading platforms using Scrapy, keeping the labels of the posts and commodities as the class labels of the darknet text during crawling.

(2) Manually check the class labels and remove incorrectly labeled darknet text.

Step 2: label the darknet text sequences.

(1) Collect sets of company names, database software, common software, known hacking tools, and known malicious code from Wikipedia, and summarize a set of common hacker jargon from expert experience.

(2) Label the darknet text with the Stanford NER named entity recognition model, and correct the labeled sequences using the collected proper-noun sets.

(3) Manually check the labeled text sequences and remove incorrectly labeled darknet text.
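
The correction of the NER output with the collected term sets can be sketched as a gazetteer lookup that overwrites tags wherever a known term matches. This is a minimal illustration, not the patented implementation: the BIO tag scheme, the label names (`TOOL`, `COMPANY`), the function name `correct_with_gazetteer`, and the sample terms are assumptions introduced here. The source states only that the collected sets are used to correct the labeled sequence.

```python
def correct_with_gazetteer(tokens, tags, gazetteer):
    """Overwrite BIO tags wherever a gazetteer phrase matches the tokens.

    tokens    -- list of word strings
    tags      -- BIO tags from an upstream NER model, same length as tokens
    gazetteer -- dict mapping a label to a set of known phrases (hypothetical)
    """
    corrected = list(tags)
    lowered = [t.lower() for t in tokens]
    # Split every phrase into words; longer phrases are tried first so that
    # they win when two gazetteer entries overlap.
    entries = sorted(
        ((phrase.lower().split(), label)
         for label, phrases in gazetteer.items() for phrase in phrases),
        key=lambda entry: len(entry[0]), reverse=True)
    claimed = [False] * len(tokens)
    for words, label in entries:
        n = len(words)
        if n == 0:
            continue  # ignore empty phrases
        for i in range(len(tokens) - n + 1):
            if lowered[i:i + n] == words and not any(claimed[i:i + n]):
                corrected[i] = "B-" + label
                for j in range(i + 1, i + n):
                    corrected[j] = "I-" + label
                for j in range(i, i + n):
                    claimed[j] = True
    return corrected
```

The corrected sequence would still go through the manual review described above, since a plain string match cannot tell a tool name from an ordinary word with the same spelling.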

Step 3: construct the word2vec model.

(1) Preprocess the darknet text that carries category labels using the preprocessing method shown in Fig. 2. Misspelled words are corrected with Python's SpellChecker library, word stems are extracted with the Porter stemming algorithm from Python's nltk library, and the regular-expression-based text replacement is implemented with Python's re library.

(2) Construct the word2vec model with the Continuous Bag of Words algorithm as implemented in Python's Gensim library.
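
Two of the preprocessing operations can be sketched with the standard library alone. The sketch below is an illustration under assumptions, not the implementation used in the invention: the placeholder tokens (`<EMAIL>`, `<URL>`, and so on), the regular expressions, and the short list of top-level domains are invented here, and `difflib` is a stand-in for the spell checker named in the source. Porter stemming is left out to keep the example free of third-party dependencies.

```python
import difflib
import re

# Hypothetical patterns for the entity types named in claim 2
# (IP, domain name, URL, price, mailbox). Order matters: e-mail addresses
# and URLs are replaced before bare domain names.
PATTERNS = [
    ("<EMAIL>", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("<URL>", re.compile(r"https?://\S+", re.IGNORECASE)),
    ("<IP>", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
    ("<DOMAIN>", re.compile(r"\b(?:[a-z0-9-]+\.)+(?:onion|com|net|org)\b",
                            re.IGNORECASE)),
    ("<PRICE>", re.compile(r"\$\s?\d+(?:\.\d+)?|\b\d+(?:\.\d+)?\s?(?:btc|usd)\b",
                           re.IGNORECASE)),
]


def replace_entities(text):
    """Replace each matched low-frequency string with its type placeholder."""
    for placeholder, pattern in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


def correct_spelling(text, vocabulary, cutoff=0.8):
    """Map each unknown word to its closest vocabulary entry, if one is close enough.

    vocabulary -- list of correctly spelled words (hypothetical input)
    """
    known = set(vocabulary)
    out = []
    for word in text.split():
        if word.startswith("<") or word in known:
            out.append(word)  # placeholders and known words stay as they are
            continue
        match = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        out.append(match[0] if match else word)
    return " ".join(out)
```

Collapsing every address or price into one shared token turns many rare strings into a single frequent feature, which is the effect the source attributes to the replacement step.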

Step 4: construct the darknet text classification model.

(1) For the preprocessed darknet texts, map each word of each text to a word2vec vector, and obtain the feature vector of each text by averaging the word vectors of that text.

(2) Train the darknet text classification model on the feature vectors using Python's scikit-learn library.
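
The text feature computation can be sketched as follows. The sketch reads the source's "average sum" as the arithmetic mean of the word vectors, which is an assumption; the function name `text_feature_vector`, the plain dictionary standing in for a trained word2vec model, and the handling of out-of-vocabulary words are also assumptions made for illustration.

```python
def text_feature_vector(tokens, embeddings, dim):
    """Return the mean of the word vectors of all tokens that have an embedding.

    tokens     -- list of preprocessed words of one text
    embeddings -- dict mapping a word to a list of floats (stand-in for word2vec)
    dim        -- dimensionality of the word vectors
    """
    total = [0.0] * dim
    count = 0
    for token in tokens:
        vector = embeddings.get(token)
        if vector is None:
            continue  # out-of-vocabulary word: skipped (an assumption)
        for k in range(dim):
            total[k] += vector[k]
        count += 1
    if count == 0:
        return total  # all-zero vector for a text with no known words
    return [value / count for value in total]
```

Every text then has a feature vector of the same fixed length regardless of how many words it contains, which is what lets a classifier such as a support vector machine be trained on it.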

Step 5: construct the darknet text named entity recognition model.

(1) For the preprocessed darknet text, use the word2vec model to map each word of each text to a word2vec vector.

(2) Using the word2vec vectors and the labeled text sequences as training samples, implement the BiLSTM-CRF neural network model with Python's TensorFlow library to build the darknet text named entity recognition model.
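
In a BiLSTM-CRF tagger, the BiLSTM produces a score for every tag at every token, the CRF layer adds a score for each tag-to-tag transition, and the best tag sequence is usually found with Viterbi decoding. The sketch below shows only that decoding step in plain Python, as an aid to understanding and not as the TensorFlow implementation mentioned in the source. The function name `viterbi_decode` and the list-of-lists score layout are assumptions.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring sequence of tag indices.

    emissions   -- per-token lists of tag scores, shape [T][K]
                   (in a BiLSTM-CRF these come from the BiLSTM)
    transitions -- transitions[i][j] is the score of moving from tag i to tag j
                   (learned by the CRF layer)
    """
    if not emissions:
        return []
    num_tags = len(emissions[0])
    score = list(emissions[0])  # best score of any path ending in each tag
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(num_tags):
            # Best previous tag for a path that ends in tag j at this token.
            best_prev = max(range(num_tags),
                            key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][j] + emit[j])
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag to recover the path.
    path = [max(range(num_tags), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path
```

The transition scores are what allow the model to rule out ill-formed sequences, such as an inside tag that directly follows an outside tag, even when the per-token scores alone would favor them.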

Step 6: classify the darknet text with the trained darknet text classification model, identifying the darknet texts of the "database leakage", "hacking tool", and "malicious code" categories.

Step 7: for a "database leakage" text, if a company named entity is extracted by the darknet text named entity recognition model, a leak of that company's database is identified as a data leakage event.
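
Turning the tagger output into a leakage event can be sketched as below. The BIO tag scheme, the `COMPANY` label, the function names, and the shape of the returned event records are assumptions made for illustration; the source specifies only that a successfully extracted company named entity indicates a database leakage event.

```python
def extract_entities(tokens, tags, label):
    """Collect the text of every BIO-tagged span that carries the given label."""
    entities, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-" + label:
            if current:
                entities.append(" ".join(current))
            current = [token]  # a new entity starts here
        elif tag == "I-" + label and current:
            current.append(token)  # continue the open entity
        else:
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities


def find_leak_events(tokens, tags):
    """Report one event per company entity found in a 'database leakage' text."""
    return [{"event": "database leakage", "company": company}
            for company in extract_entities(tokens, tags, "COMPANY")]
```

An empty result means that no company name was recognized, in which case the text does not yield a leakage event under the rule above.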

Step 8: discover newly emerging hacking tools and malicious code.

(1) For darknet texts of the "hacking tool" and "malicious code" categories, extract the "hacking tool" and "malicious code" named entities with the darknet text named entity recognition model.

(2) For each extracted named entity, search for it with Google as the search engine, using Python's Mechanize library to simulate a browser.

(3) If no security company appears in the first 20 search results, the named entity is determined to be a hacking tool or malicious code not yet known to network security researchers.
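
The decision rule applied to the search results can be sketched as follows. Fetching the results is deliberately left out: the sketch assumes the caller already has the result URLs in ranked order. The function name `is_unknown_entity`, the matching by host name, and the example domains are assumptions; the source specifies only that the first 20 results are checked for links to the collected security sites.

```python
from urllib.parse import urlparse


def is_unknown_entity(result_urls, security_domains, top_n=20):
    """Return True if none of the first top_n results points to a known security site.

    result_urls      -- ranked list of absolute result URLs for one named entity
    security_domains -- collected domains of security companies, teams, practitioners
    """
    for url in result_urls[:top_n]:
        host = (urlparse(url).hostname or "").lower()
        for domain in security_domains:
            domain = domain.lower()
            # Match the domain itself or any of its subdomains.
            if host == domain or host.endswith("." + domain):
                return False  # a security site already covers this entity
    return True
```

Matching on the host name, not on the raw URL string, avoids false hits when a security vendor's name merely appears in the path or query of an unrelated page.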

The above examples are only illustrative of the technical solution of the present invention and not limiting thereof, and those skilled in the art may make modifications and equivalents to the technical solution of the present invention without departing from the spirit and scope of the invention, and the scope of the invention shall be defined by the claims.

Claims

1. A method for automatically discovering unknown network threats based on darknet data, characterized by comprising:

A. crawling known darknet forums and darknet trading platforms, and labeling the forum posts and the commodity information of the darknet trading platforms, wherein the labeled categories comprise the "hacking tool", "malicious code", and "database leakage" categories, to obtain a labeled darknet text training set;

B. extracting the text information of each darknet text in the darknet text training set, preprocessing it, and constructing a word2vec model from the preprocessed darknet text;

C. preprocessing the darknet text training set, labeling the text sequences, and constructing a darknet text named entity recognition model by combining the word2vec model with the BiLSTM-CRF neural network model;

D. processing each darknet text with the constructed word2vec model to obtain text features, training a darknet text classification model on the text features, and classifying the preprocessed darknet text to be classified with the classification model to obtain its category;

E. analyzing "database leakage" darknet texts with the darknet text named entity recognition model, and discovering a database leakage event if a company named entity is successfully extracted;

F. analyzing "hacking tool" and "malicious code" darknet texts with the darknet text named entity recognition model, extracting hacking tool and malicious code named entities, and judging, by a search-engine-based unknown named entity judgment method, whether each named entity is an unknown hacking tool or unknown malicious code.

2. The method for automatically discovering unknown cyber threats based on the darknet data according to claim 1, wherein said step B further comprises the steps of:

B1. correcting misspelled words with a spelling correction algorithm, extracting word stems with the Porter stemming algorithm, and applying a regular-expression-based text replacement method, wherein the regular expressions match the IP, domain name, URL, price, and email address entity types in the darknet text, and each matched entity is replaced with its entity type;

B2. constructing the word2vec model from the darknet text using the Continuous Bag of Words algorithm.

3. The method for automatically discovering unknown cyber threats based on the darknet data according to claim 1, wherein said step C further comprises the steps of:

C1. processing the darknet text training set with the preprocessing method;

C2. labeling the text training set to form a darknet text training set with sequence labels;

C3. constructing the darknet text named entity recognition model by combining the word2vec model with the BiLSTM-CRF neural network model.

4. The method for automatically discovering unknown cyber threats based on the darknet data according to claim 1, wherein said step D further comprises the steps of:

D1. mapping the words of each darknet text in the darknet text training set to word2vec vectors with the constructed word2vec model, and obtaining the feature vector of each text by averaging the word vectors of that text;

D2. training a support vector machine model on the text feature vectors to obtain the darknet text classification model;

D3. extracting the feature vector of the darknet text to be classified by the method of step D1, and classifying the darknet text with the darknet text classification model.

5. The method for automatically discovering unknown cyber threats based on the darknet data according to claim 1, wherein said step F further comprises the steps of:

F1. collecting a list of network security companies together with their website and domain name information;

F2. searching for the unknown named entity with a search engine, and, if none of the collected network security company information appears among the top search results, identifying the named entity as a hacking tool or malicious code not yet known to security personnel.