CN112464666B - Unknown network threat automatic discovery method based on hidden network data - Google Patents

Unknown network threat automatic discovery method based on hidden network data

Info

Publication number
CN112464666B
Authority
CN
China
Prior art keywords
text
dark net
model
network
dark
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910763695.1A
Other languages
Chinese (zh)
Other versions
CN112464666A (en)
Inventor
刘亮
李孟铭
郑荣锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN201910763695.1A priority Critical patent/CN112464666B/en
Publication of CN112464666A publication Critical patent/CN112464666A/en
Application granted granted Critical
Publication of CN112464666B publication Critical patent/CN112464666B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks

Abstract

The invention discloses a method for automatically discovering unknown network threats based on dark net data. The method comprises the following steps: 1) crawling dark net forums and trading platforms and labeling the texts as a training set; 2) constructing a word2vec model and a dark net text named entity recognition model from the text training set; 4) obtaining the features of each text with the word2vec model, training a dark net text classification model, and classifying dark net texts with that model; 6) analyzing "database leakage" dark net texts with the named entity recognition model, extracting company named entities, and thereby discovering database leakage events; 7) analyzing "hacking tool" and "malicious code" dark net texts with the named entity recognition model, extracting hacking tool and malicious code named entities, and judging with a search-engine-based method whether each extracted entity is an unknown hacking tool or unknown malicious code. The method helps security personnel respond to network threats in a timely manner.

Description

Unknown network threat automatic discovery method based on hidden network data
Technical Field
The invention belongs to the field of network data analysis and text mining, and relates to a method for automatically discovering unknown network threats based on dark net data.
Background
According to the Wikipedia definition, the dark net (dark web) is web content that exists on darknets, which are overlay networks that can only be accessed with special software, special authorization, or special computer configuration. The darknets that make up the dark net include small friend-to-friend (F2F) peer-to-peer networks and large, popular networks operated by public organizations and individuals, such as Tor, Freenet, I2P, and Riffle.
For a long time, because the dark net can mask the true identity and network information of its users, lawbreakers have used it as a platform for spreading malicious code and trading illicit goods. In recent years, with the rapid growth of network attacks and the underground economy, more and more hacking tools, malicious code, and database leakage events have been widely circulated and discussed on dark net forums and dark net trading platforms before security researchers discover them.
At present, because of the huge volume of data on dark net forums and trading platforms, it is difficult to discover unknown network threats on the dark net in a timely and effective way through manual browsing and analysis.
Disclosure of Invention
In view of the characteristics of the text found on dark net forums and dark net trading platforms, the invention provides a method for automatically discovering unknown network threats based on dark net data, which can automatically discover, from dark net forums and trading platforms, network threats not yet known to security personnel. By processing dark net text, the method can find stolen database information at an early stage, as well as new hacking tools and new malicious code sold on the dark net. Because discovery is automatic, network security researchers no longer need to browse and analyze large amounts of dark net text, and the ability to respond to dark net threats is improved.
The key technical points of the invention are as follows.
1. The dark net text is preprocessed with a misspelling correction algorithm, the Porter stemming algorithm, and a regular-expression-based text replacement method, which resolves the vocabulary confusion caused by hacker jargon in dark net text and reduces or removes low-frequency features.
2. A word2vec model is used to build the feature mapping model for dark net text, which removes the influence of low-frequency words on dark net text classification, reduces the dimensionality of the text features, and improves both the efficiency of model construction and the accuracy of classification.
3. A dark net text named entity recognition model is built by combining the constructed word2vec model with a BiLSTM-CRF neural network model, addressing the difficulty that existing named entity recognition models have in effectively extracting named entities from dark net text.
4. A search-engine-based method for judging unknown named entities exploits the relationship between network security companies and new hacking tools and new malicious code to judge efficiently whether a named entity is already known to network security researchers.
To reduce the cost of manual labeling, forum posts and trading platform commodity information are labeled by crawling the classification labels that already exist on the dark net forums and trading platforms and then checking them manually. To label dark net text sequences, a company set, a database software set, a common software set, a known hacking tool set, and a known malicious code set are collected from Wikipedia, and a set of common hacker jargon is compiled from expert experience; the text sequences are first labeled with an existing named entity recognition model, the labels are then corrected with the collected term sets, and finally the result is verified manually.
Conventional text analysis methods typically use whitelist filtering, which discovers unknown named entities by removing known ones, and this makes it difficult to automatically discover unknown network threats related to "database leakage", "hacking tools", and "malicious code". The invention uses a BiLSTM-CRF neural network model to build a named entity recognition model for dark net text. The model can automatically extract company named entities from text and thereby find database data sold on the dark net; it can also automatically extract hacking tool and malicious code named entities and, combined with the search-engine-based unknown named entity judging method, discover hacking tools and malicious code not yet known to network security researchers.
Traditional text classification methods generally suffer from high text dimensionality, the introduction of low-frequency vocabulary, and reduced classification efficiency. The word2vec model built in combination with the preprocessing method effectively reduces the feature dimensionality of dark net text, removes low-frequency vocabulary, and strengthens the text features.
The specific scheme of the method is as follows.
1) Crawl dark net forum posts and commodity information on dark net trading platforms. Select well-known dark net forums and dark net trading platforms and write a crawler to crawl forum posts and commodity information. Label the crawled data according to the existing classification labels of the posts or commodity information, and audit the labels manually.
2) Annotate the text sequences. Collect a company set, a database software set, a common software set, a known hacking tool set, and a known malicious code set from Wikipedia, and compile a set of common hacker jargon from expert experience. Label the dark net text sequences with the existing named entity recognition model Stanford NER, correct the labeled sequences with the collected term sets, and finally audit the labeled text sequences manually.
3) Construct the word2vec model. First preprocess the dark net text that carries classification labels. The preprocessing consists of: correcting misspelled words with a misspelling correction algorithm; reducing words to their stems with the Porter stemming algorithm, which reduces the feature dimensionality of the text; and replacing low-frequency special tokens with high-frequency text features through the regular-expression-based text replacement method. Then build the word2vec model with the Continuous Bag of Words (CBOW) algorithm.
4) Construct the dark net text classification model. Process the preprocessed dark net text with the word2vec model, map each word of each text to a word2vec vector, and average the word vectors of a text to obtain its feature vector. Train a support vector machine model on the text feature vectors to obtain the dark net text classification model.
5) Construct the dark net text named entity recognition model. Process the preprocessed dark net text with the word2vec model and map each word of each text to a word2vec vector. Using the word2vec vectors and the labeled text sequences as training samples, build the dark net text named entity recognition model with a BiLSTM-CRF neural network model.
6) Classify the dark net text. Preprocess the dark net text to be classified with the preprocessing method above and classify it with the constructed dark net text classification model. Texts classified as "database leakage", "hacking tool", and "malicious code" are then processed in different ways.
7) Discover database leakage events. For dark net text classified as "database leakage", apply the constructed dark net text named entity recognition model; if a company named entity is successfully extracted, a database leakage event has been discovered.
8) Discover emerging hacking tools and malicious code. First, collect website information for a series of network security companies, network security teams, and network security practitioners. Then, for dark net text classified as "hacking tool" or "malicious code", use the constructed dark net text named entity recognition model to extract named entities of the "hacking tool" and "malicious code" categories. Finally, judge whether each named entity is already known to network security personnel with the search-engine-based unknown named entity judging method: search for the named entity with a search engine, and if no website link of a network security company, network security team, or network security practitioner appears in the first 20 search results, the entity is deemed a hacking tool or malicious code not yet known to network security researchers.
Compared with the prior art, the invention has the following positive effects.
1. Unknown network threats not yet known to network security personnel can be discovered automatically from dark net text, reducing the cost of manual analysis and retrieval.
2. The method is more timely and helps security personnel respond to network threats promptly.
3. The method has higher accuracy.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a preprocessing flow chart of the present invention.
FIG. 3 is a flow chart of the word2vec model construction of the present invention.
FIG. 4 is a flow chart of dark net text classification model training of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings.
The processing method of the invention is as follows.
Step one: crawl dark net forum posts and commodity information from dark net trading platforms.
(1) Use Scrapy to crawl dark net forum posts and commodity information from dark net trading platforms, retaining the labels of the posts and commodities during crawling as the category labels of the dark net text.
(2) Manually check the category labels and remove incorrectly labeled dark net text.
Step two: annotate the dark net text sequences.
(1) Collect company name sets, database software sets, common software sets, known hacking tool sets, and known malicious code sets from Wikipedia, and compile a set of common hacker jargon from expert experience.
(2) Label the dark net text with the Stanford NER named entity recognition model, and correct the labeled sequences with the collected term sets.
(3) Manually check the labeled text sequences and remove incorrectly labeled dark net text.
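The correction of model labels with the collected term sets can be pictured as a dictionary overlay on top of the tags produced by the existing recognizer. The sketch below is only an illustration under assumptions: the BIO tag scheme, the type names `MAL` and `ORG`, the sample terms, and the helper name `correct_tags` are hypothetical and do not come from the patent.

```python
def correct_tags(tokens, tags, gazetteers):
    """Overwrite model tags wherever a gazetteer phrase matches the tokens.

    tokens     -- list of word strings
    tags       -- list of BIO tags from an existing NER model (same length)
    gazetteers -- dict mapping an entity type to a set of phrases
    """
    corrected = list(tags)
    lowered = [t.lower() for t in tokens]
    # Try longer phrases first so a multi-word term wins over its prefix.
    entries = sorted(
        ((phrase.lower().split(), etype)
         for etype, phrases in gazetteers.items()
         for phrase in phrases if phrase.strip()),
        key=lambda entry: len(entry[0]), reverse=True)
    covered = [False] * len(tokens)
    for words, etype in entries:
        n = len(words)
        for i in range(len(tokens) - n + 1):
            if lowered[i:i + n] == words and not any(covered[i:i + n]):
                corrected[i] = "B-" + etype
                for j in range(i + 1, i + n):
                    corrected[j] = "I-" + etype
                for j in range(i, i + n):
                    covered[j] = True
    return corrected


tokens = ["Selling", "Zeus", "botnet", "dump", "from", "Acme", "Corp"]
model_tags = ["O"] * len(tokens)  # pretend the base model found nothing
terms = {"MAL": {"Zeus"}, "ORG": {"Acme Corp"}}
print(correct_tags(tokens, model_tags, terms))
```

Matching is case-insensitive and exact on whole tokens, which is a simplification; the manual check in sub-step (3) remains the safeguard against wrong overlays.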
Step three: construct the word2vec model.
(1) Preprocess the dark net text that carries category labels with the preprocessing method shown in FIG. 2. Misspelled words are corrected with Python's SpellChecker library, word stems are obtained with the Porter stemming algorithm from Python's nltk library, and the regular-expression-based text replacement is implemented with Python's re library.
(2) Build the word2vec model with the Continuous Bag of Words (CBOW) algorithm as implemented in Python's Gensim library.
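The regular-expression replacement step can be sketched with the standard `re` module. Claim 2 names the entity types (IP, domain name, URL, price, mailbox), but the patent does not give the patterns, so every pattern below, the placeholder tokens, and the function name `replace_entities` are illustrative assumptions. Spelling correction and stemming are left out to keep the example dependency-free.

```python
import re

# Order matters: URLs and e-mail addresses contain domain names, so they are
# replaced before the bare-domain pattern runs. All patterns are simplified.
PATTERNS = [
    ("URL", re.compile(r"\b(?:https?|ftp)://\S+", re.IGNORECASE)),
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("IP", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
    ("PRICE", re.compile(
        r"(?:\$\s?\d+(?:\.\d+)?|\b\d+(?:\.\d+)?\s?(?:usd|btc)\b)",
        re.IGNORECASE)),
    ("DOMAIN", re.compile(
        r"\b(?:[a-z0-9-]+\.)+(?:com|net|org|onion|io)\b", re.IGNORECASE)),
]


def replace_entities(text):
    """Replace rare concrete values with one shared token per entity type."""
    for label, pattern in PATTERNS:
        text = pattern.sub(label, text)
    return text


sample = "contact admin@example.com or visit http://abc.onion/shop price $25 server 10.0.0.1"
print(replace_entities(sample))
```

Collapsing each concrete address or price into one token turns many low-frequency strings into a single high-frequency feature, which is the effect the preprocessing aims for.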
Step four: construct the dark net text classification model.
(1) For the preprocessed dark net texts, map each word of each text to a word2vec vector, and obtain the feature vector of each text by averaging its word vectors.
(2) Train the dark net text classification model on the feature vectors with Python's scikit-learn library.
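The averaging in sub-step (1) is simple enough to show without any library. In the sketch below the embedding table is a plain dictionary standing in for a trained word2vec model, and skipping out-of-vocabulary words and returning a zero vector for an empty text are assumptions, since the patent does not say how those cases are handled.

```python
def text_vector(tokens, word_vectors, dim):
    """Average the vectors of the tokens that have an embedding."""
    total = [0.0] * dim
    count = 0
    for token in tokens:
        vec = word_vectors.get(token)
        if vec is None:
            continue  # out-of-vocabulary word: skipped (an assumption)
        for i, value in enumerate(vec):
            total[i] += value
        count += 1
    if count == 0:
        return total  # no known word: zero vector (an assumption)
    return [value / count for value in total]


# Toy two-dimensional embeddings, purely for illustration.
toy_vectors = {"sell": [1.0, 0.0], "dump": [3.0, 2.0]}
print(text_vector(["sell", "dump", "xyz"], toy_vectors, 2))
```

Each text, whatever its length, ends up as one fixed-size vector, which is the form a support vector machine classifier expects as input.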
Step five: construct the dark net text named entity recognition model.
(1) For the preprocessed dark net text, use the word2vec model to map each word of each text to a word2vec vector.
(2) Using the word2vec vectors and the labeled text sequences as training samples, implement a BiLSTM-CRF neural network model with Python's TensorFlow library to build the dark net text named entity recognition model.
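The patent does not describe the internals of the BiLSTM-CRF model. As background, a linear-chain CRF layer is commonly decoded with the Viterbi algorithm: the BiLSTM supplies per-token tag scores (emissions), the CRF supplies tag-to-tag transition scores, and decoding picks the best-scoring tag sequence. The sketch below shows only that decoding step on hand-written scores; the tag names and all numbers are hypothetical.

```python
def viterbi_decode(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions   -- list over positions; each item is a list of per-tag scores
    transitions -- transitions[i][j] is the score of moving from tag i to tag j
    tags        -- list of tag names, indexed like the score lists
    """
    if not emissions:
        return []
    n_tags = len(tags)
    score = list(emissions[0])  # best score of a path ending in each tag
    backpointers = []
    for emission in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_tags):
            best_i = max(range(n_tags),
                         key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emission[j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Walk back from the best final tag.
    best = max(range(n_tags), key=lambda i: score[i])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return [tags[i] for i in path]


tags = ["O", "B-TOOL", "I-TOOL"]
# Forbid the invalid move O -> I-TOOL with a large negative score.
transitions = [[0.0, 0.0, -100.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
emissions = [[2.0, 1.0, 0.0], [0.0, 0.0, 1.5]]
print(viterbi_decode(emissions, transitions, tags))
```

In the toy input the first token slightly prefers `O`, but the transition penalty makes the consistent pair `B-TOOL`, `I-TOOL` the best overall sequence. This sequence-level consistency is what a CRF layer adds over independent per-token predictions.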
Step six: classify the dark net text with the trained dark net text classification model and pick out the texts of the "database leakage", "hacking tool", and "malicious code" categories.
Step seven: for "database leakage" text, if a company named entity is extracted by the constructed dark net text named entity recognition model, the text is identified as a data leakage event involving that company's database.
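Step seven reduces to reading company entities out of the tagger output for a text that the classifier has already placed in the "database leakage" category. The following sketch assumes BIO tags and a company type named `ORG`; both, and the helper names, are hypothetical choices not stated in the patent.

```python
def extract_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) pairs."""
    entities, current_type, current_words = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_words:
                entities.append((current_type, " ".join(current_words)))
            current_type, current_words = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_words.append(token)
        else:
            if current_words:
                entities.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
    if current_words:
        entities.append((current_type, " ".join(current_words)))
    return entities


def find_leak_companies(tokens, tags, company_type="ORG"):
    """Return company names found in a text classified as 'database leakage'."""
    return [text for etype, text in extract_entities(tokens, tags)
            if etype == company_type]


tokens = ["Fresh", "db", "dump", "from", "Acme", "Corp", "for", "sale"]
tags = ["O", "O", "O", "O", "B-ORG", "I-ORG", "O", "O"]
print(find_leak_companies(tokens, tags))
```

A non-empty result corresponds to the condition in step seven: a company entity was extracted, so the text is reported as a leakage event for that company.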
Step eight: discover emerging hacking tools and malicious code.
(1) For dark net text of the "hacking tool" and "malicious code" categories, extract named entities of the "hacking tool" and "malicious code" categories with the dark net text named entity recognition model.
(2) Search for each extracted named entity with Google as the search engine, using Python's Mechanize library to simulate a browser.
(3) If no security company name appears in the first 20 search results, the named entity in the dark net text is deemed a hacking tool or malicious code not yet known to network security researchers.
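The decision rule in sub-step (3) can be sketched independently of how the search itself is performed. The function below takes an already retrieved list of result URLs and a set of security-vendor domains and applies the top-20 test by matching website links, as the summary of the scheme describes. The function name, the domain-suffix matching rule, and the example domains are assumptions made for illustration.

```python
from urllib.parse import urlparse


def is_unknown_entity(result_urls, security_domains, top_n=20):
    """True if none of the first top_n result URLs belongs to a known security site."""
    known = {d.lower() for d in security_domains}
    for url in result_urls[:top_n]:
        host = (urlparse(url).hostname or "").lower()
        # Match the domain itself or any of its subdomains.
        if any(host == d or host.endswith("." + d) for d in known):
            return False
    return True


vendors = {"vendor-a.example"}  # placeholder for the collected security sites
results = ["https://blog.vendor-a.example/report", "https://forum.example.net/t/1"]
print(is_unknown_entity(results, vendors))
```

A `True` result means no collected security site appears among the first 20 links, so the entity would be reported as not yet known to security researchers. The quality of the verdict depends on how complete the collected list of security sites is.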
The above examples only illustrate the technical solution of the invention and do not limit it. Those skilled in the art may modify the technical solution or substitute equivalents without departing from the spirit and scope of the invention, and the scope of protection shall be defined by the claims.

Claims (5)

1. A method for automatically discovering unknown network threats based on dark net data, characterized by comprising the following steps:
A. crawling known dark net forums and dark net trading platforms, and labeling the forum posts and the commodity information of the trading platforms, wherein the labeled categories include "hacking tool", "malicious code" and "database leakage", to obtain a labeled dark net text training set;
B. extracting the text information of each dark net text in the dark net text training set, preprocessing it, and constructing a word2vec model from the preprocessed dark net text;
C. preprocessing the dark net text training set, labeling the text sequences, and constructing a dark net text named entity recognition model by combining the word2vec model with a BiLSTM-CRF neural network model;
D. processing each dark net text with the constructed word2vec model to obtain text features, training a dark net text classification model on the text features, and classifying the preprocessed dark net text to be classified with the classification model to obtain its category;
E. analyzing dark net text of the "database leakage" category with the dark net text named entity recognition model, and, if a company named entity is successfully extracted, determining that a database leakage event has been found;
F. analyzing dark net text of the "hacking tool" and "malicious code" categories with the dark net text named entity recognition model, extracting named entities of the hacking tool and malicious code categories, and judging whether each named entity is an unknown malicious code or hacking tool with a search-engine-based unknown named entity judging method.
2. The method for automatically discovering unknown network threats based on dark net data according to claim 1, wherein said step B further comprises the steps of:
B1. correcting misspelled words with a misspelling correction algorithm, extracting word stems with the Porter stemming algorithm, and applying a regular-expression-based text replacement method, wherein the regular expressions match IP, domain name, URL, price and e-mail address entities in the dark net text, and the matched text is replaced with the corresponding entity type;
B2. constructing the word2vec model for dark net text with the Continuous Bag of Words algorithm.
3. The method for automatically discovering unknown network threats based on dark net data according to claim 1, wherein said step C further comprises the steps of:
C1. processing the dark net text training set with the preprocessing method;
C2. labeling the text training set to form a dark net text training set with sequence labels;
C3. constructing the dark net text named entity recognition model by combining the word2vec model with the BiLSTM-CRF neural network model.
4. The method for automatically discovering unknown network threats based on dark net data according to claim 1, wherein said step D further comprises the steps of:
D1. mapping each word of each dark net text in the dark net text training set to a word2vec vector with the constructed word2vec model, and obtaining the feature vector of each text by averaging its word vectors;
D2. training a support vector machine model on the text feature vectors to obtain the dark net text classification model;
D3. extracting the feature vector of the dark net text to be classified by the method of step D1, and classifying the dark net text with the dark net text classification model.
5. The method for automatically discovering unknown network threats based on dark net data according to claim 1, wherein said step F further comprises the steps of:
F1. collecting information on a series of network security companies, including their websites and domain names;
F2. searching for the unknown named entity with a search engine and, if none of the collected information on the network security companies appears in the top search results, identifying the named entity as a hacking tool or malicious code not yet known to security personnel.
CN201910763695.1A 2019-08-19 2019-08-19 Unknown network threat automatic discovery method based on hidden network data Active CN112464666B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910763695.1A CN112464666B (en) 2019-08-19 2019-08-19 Unknown network threat automatic discovery method based on hidden network data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910763695.1A CN112464666B (en) 2019-08-19 2019-08-19 Unknown network threat automatic discovery method based on hidden network data

Publications (2)

Publication Number Publication Date
CN112464666A CN112464666A (en) 2021-03-09
CN112464666B CN112464666B (en) 2023-07-21

Family

ID=74807078

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910763695.1A Active CN112464666B (en) 2019-08-19 2019-08-19 Unknown network threat automatic discovery method based on hidden network data

Country Status (1)

Country Link
CN (1) CN112464666B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113076464B (en) * 2021-04-13 2022-07-22 国家计算机网络与信息安全管理中心 Multi-channel network clue discovery method and device based on reconstruction coding anomaly detection
CN113328994B (en) * 2021-04-30 2022-07-12 新华三信息安全技术有限公司 Malicious domain name processing method, device, equipment and machine readable storage medium
CN114692593B (en) * 2022-03-21 2023-04-07 中国刑事警察学院 Network information safety monitoring and early warning method
CN115002045B (en) * 2022-07-19 2022-12-09 中国电子科技集团公司第三十研究所 Twin network-based dark website session identification method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107046543A (en) * 2017-04-26 2017-08-15 国家电网公司 A kind of threat intelligence analysis system traced to the source towards attack
CN108874943A (en) * 2018-06-04 2018-11-23 上海交通大学 A kind of darknet resource detection system based on shot and long term Memory Neural Networks
CN110046260A (en) * 2019-04-16 2019-07-23 广州大学 A kind of darknet topic discovery method and system of knowledge based map
CN110119469A (en) * 2019-05-22 2019-08-13 北京计算机技术及应用研究所 A kind of data collection and transmission and method towards darknet

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9392003B2 (en) * 2012-08-23 2016-07-12 Raytheon Foreground Security, Inc. Internet security cyber threat reporting system and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yong Fang, et al. Analyzing and Identifying Data Breaches in Underground Forums. IEEE Access, 2019, vol. 7, 48770-48777. *

Also Published As

Publication number Publication date
CN112464666A (en) 2021-03-09

Similar Documents

Publication Publication Date Title
CN112464666B (en) Unknown network threat automatic discovery method based on hidden network data
CN103559235B (en) A kind of online social networks malicious web pages detection recognition methods
CN103544436B (en) System and method for distinguishing phishing websites
CN103544255B (en) Text semantic relativity based network public opinion information analysis method
CN110781308B (en) Anti-fraud system for constructing knowledge graph based on big data
CN104899508B (en) A kind of multistage detection method for phishing site and system
CN108038173B (en) Webpage classification method and system and webpage classification equipment
CN110177114A (en) The recognition methods of network security threats index, unit and computer readable storage medium
CN107341399A (en) Assess the method and device of code file security
Bannur et al. Judging a site by its content: learning the textual, structural, and visual features of malicious web pages
CN110569350B (en) Legal recommendation method, equipment and storage medium
CN111723371B (en) Method for constructing malicious file detection model and detecting malicious file
CN112989831B (en) Entity extraction method applied to network security field
CN107943514A (en) The method for digging and system of core code element in a kind of software document
CN103324886B (en) A kind of extracting method of fingerprint database in network intrusion detection and system
CN107437026A (en) A kind of malicious web pages commercial detection method based on advertising network topology
CN114915468B (en) Intelligent analysis and detection method for network crime based on knowledge graph
CN111931935A (en) Network security knowledge extraction method and device based on One-shot learning
CN111460803B (en) Equipment identification method based on Web management page of industrial Internet of things equipment
CN112148956A (en) Hidden net threat information mining system and method based on machine learning
CN109194605B (en) Active verification method and system for suspicious threat indexes based on open source information
Wu et al. Price tag: towards semi-automatically discovery tactics, techniques and procedures of E-commerce cyber threat intelligence
CN112307314A (en) Method and device for generating fine selection abstract of search engine
CN108717637B (en) Automatic mining method and system for E-commerce safety related entities
CN115760495A (en) Method and device for realizing automatic labeling of legal cases

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant