WO2023138047A1 - Cyber threat information extraction method, device, storage medium, and apparatus

Info

Publication number: WO2023138047A1
Application number: PCT/CN2022/113831
Authority: WO
Inventors: 唐杰; 吴龙平; 莫建平; 余凯
Original assignee: 三六零科技集团有限公司
Priority date: 2022-01-20
Filing date: 2022-08-22
Publication date: 2023-07-27
Also published as: CN116522331A

Abstract

The present invention relates to the field of internet technologies, and discloses a cyber threat information extraction method, a device, a storage medium, and an apparatus. The method comprises: performing natural language processing on unstructured cyber threat information to obtain a purpose of attack and a means of attack; performing means-of-attack prediction with respect to the purpose of attack by means of a preset machine learning model to obtain an unknown means of attack; and generating structured cyber threat information according to the purpose of attack, the means of attack, and the unknown means of attack. Because the purpose of attack and the means of attack of an attacker are automatically identified and extracted from the unstructured cyber threat information on the basis of natural language processing and the preset machine learning model, the process of analyzing the cyber threat information can be simplified and the security defense capability can be improved.

Description

Network threat information extraction method, equipment, storage medium and device

Technical field

The present invention relates to the technical field of the Internet, in particular to a network threat information extraction method, equipment, storage medium and device.

Background

With the explosive growth of network threat attacks, extracting and sharing the information in threat analysis reports about the techniques and tactics used by attackers and the attack implementation process (TTP) is crucial to building network security. However, because there is no standard structured language for describing the technical and tactical intelligence in network threat reports, and no technology for extracting and analyzing it automatically, analyzing complex and unstructured threat analysis reports is very time-consuming and laborious.

The above content is only used to assist in understanding the technical solution of the present invention, and does not mean that the above content is admitted as prior art.

Summary of the invention

The main purpose of the present invention is to provide a network threat information extraction method, equipment, storage medium and device, aiming to solve the technical problem in the prior art that analyzing complex and unstructured threat analysis reports is very time-consuming and laborious, because there is no standard structured language for describing the technical and tactical intelligence in network threat reports and no technology for extracting and analyzing it automatically.

In order to achieve the above object, the present invention provides a network threat information extraction method, which includes the following steps:

Perform natural language processing on unstructured network threat information to obtain attack purpose and attack means;

Predict attack means for the attack purpose through a preset machine learning model to obtain unknown attack means;

Generate structured network threat information according to the attack purpose, the attack means and the unknown attack means.

Optionally, the step of performing natural language processing on the unstructured network threat information to obtain the attack purpose and attack means includes:

Perform text preprocessing on unstructured network threat information to obtain simplified network threat information;

performing in-depth sentence segmentation on the simplified network threat information to obtain threat sentences;

Performing semantic dependency analysis on the threat sentence to obtain a standard threat sentence;

performing lexical tagging on the standard threat sentence to obtain a threat corpus;

performing synonym expansion on the threat corpus to obtain a target corpus;

The attack purpose and attack means are determined according to the target corpus.

Optionally, the step of performing semantic dependency analysis on the threat sentence to obtain a standard threat sentence includes:

Performing semantic dependency analysis on the threat sentence to obtain the dependency relationship between the words in the threat sentence;

The threat sentence is standardized according to the dependency relationship to obtain a standard threat sentence.

Optionally, the step of standardizing the threat sentence according to the dependency relationship to obtain a standard threat sentence includes:

Obtain the part-of-speech information of each word in the threat sentence;

Standardize the threat sentence according to the part-of-speech information and the dependency relationship to obtain a standard threat sentence.

Optionally, the step of performing synonym expansion on the threat corpus to obtain a target corpus includes:

Obtain the frequency of occurrence of each keyword in the threat corpus, and determine the threat keyword according to the frequency of occurrence;

The threat keywords are synonymously expanded based on a preset dictionary to obtain a target corpus.

Optionally, the step of performing in-depth sentence segmentation on the simplified network threat information to obtain a threat sentence includes:

Obtaining the sentence end symbols, coordinating relative conjunctions and progressive relative conjunctions in the simplified network threat information;

The simplified network threat information is segmented in depth according to the sentence end symbol, the coordinating relative conjunction and the progressive relative conjunction to obtain a threat sentence.

Optionally, the step of performing lexical tagging on the standard threat sentence to obtain a threat corpus includes:

Obtain the necessary score of each part of the standard threat sentence;

The standard threat sentence is simplified according to the necessary scores to obtain a threat corpus.

Optionally, the step of performing text preprocessing on unstructured network threat information to obtain simplified network threat information further includes:

Obtain random information in unstructured cyber threat information;

The random information is simplified to obtain simplified network threat information.

Optionally, before the step of predicting the attack means of the attack purpose through the preset machine learning model, and obtaining the unknown attack means, it also includes:

Constructing a training set corpus according to the target corpus;

The initial machine learning model is trained according to the training set corpus to obtain a preset machine learning model.

Optionally, the step of constructing a training set corpus according to the target corpus includes:

Obtain the number of sentences of synonymous threat sentences in the target corpus;

Selecting threat sentence samples from the target corpus according to the number of sentences;

A training set corpus is constructed based on the threat sentence samples.

Optionally, the step of selecting a threat sentence sample from the target corpus according to the number of sentences includes:

sorting the threat sentences in the target corpus according to the number of sentences;

Receiving semantic labels input by a user based on the target corpus;

Select threat sentence samples from the target corpus according to the sorting result and the semantic labels.
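As a non-limiting illustration, the sample selection described above can be sketched in Python as follows. The function name `select_samples`, the `top_k` cutoff, the dictionary-based inputs, and the example sentences and labels are all assumptions made for illustration; the source does not specify how the sorting result and the semantic labels are combined.

```python
from typing import Dict, List, Tuple


def select_samples(
    synonym_counts: Dict[str, int],
    semantic_labels: Dict[str, str],
    top_k: int = 2,
) -> List[Tuple[str, str]]:
    """Pick labeled threat sentences, most-paraphrased first (hypothetical rule)."""
    # Sort threat sentences by how many synonymous sentences each one has.
    ranked = sorted(synonym_counts, key=lambda s: synonym_counts[s], reverse=True)
    # Keep only the sentences for which the user supplied a semantic label.
    labeled = [(s, semantic_labels[s]) for s in ranked if s in semantic_labels]
    return labeled[:top_k]


# Hypothetical inputs: sentence -> number of synonymous sentences in the corpus.
counts = {
    "trojan modifies registry": 3,
    "virus starts at boot": 5,
    "trojan sends keyboard log": 1,
}
# Hypothetical user-supplied semantic labels.
labels = {
    "trojan modifies registry": "persistence",
    "trojan sends keyboard log": "exfiltration",
}
samples = select_samples(counts, labels)
```

In this sketch the unlabeled sentence is skipped even though it ranks first, so the training set contains only sentences that carry a user-confirmed label.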

Optionally, the step of predicting the attack means of the attack purpose through a preset machine learning model, and obtaining unknown attack means includes:

Obtain multi-platform cyber threat information;

Based on the multi-platform network threat information, the attack method is predicted for the attack purpose through a preset machine learning model, and an unknown attack method is obtained.

In addition, in order to achieve the above object, the present invention also proposes network threat information extraction equipment, which includes a memory, a processor, and a network threat information extraction program stored in the memory and operable on the processor, the network threat information extraction program being configured to implement the network threat information extraction method as described above.

In addition, in order to achieve the above object, the present invention also proposes a storage medium on which a network threat information extraction program is stored, and when the network threat information extraction program is executed by a processor, the network threat information extraction method as described above is realized.

In addition, in order to achieve the above purpose, the present invention also proposes a network threat information extraction device, the network threat information extraction device includes: a language processing module, a means prediction module and an information generation module;

The language processing module is used to perform natural language processing on unstructured network threat information to obtain attack purposes and attack methods;

The means prediction module is used to predict attack means for the attack purpose through a preset machine learning model to obtain unknown attack means;

The information generating module is configured to generate structured network threat information according to the attack purpose, the attack means and the unknown attack means.

Optionally, the language processing module is further configured to perform text preprocessing on unstructured network threat information to obtain simplified network threat information;

The language processing module is further configured to perform in-depth sentence segmentation on the simplified network threat information to obtain threat sentences;

The language processing module is further configured to perform semantic dependency analysis on the threat sentence to obtain a standard threat sentence;

The language processing module is further configured to perform lexical tagging on the standard threat sentence to obtain a threat corpus;

The language processing module is further configured to perform synonym expansion on the threat corpus to obtain a target corpus;

The language processing module is further configured to determine an attack purpose and an attack method according to the target corpus.

Optionally, the language processing module is further configured to perform semantic dependency analysis on the threat sentence to obtain the dependency relationship between the words in the threat sentence;

The language processing module is further configured to standardize the threat sentence according to the dependency relationship to obtain a standard threat sentence.

Optionally, the language processing module is further configured to obtain part-of-speech information of each word in the threat sentence;

The language processing module is further configured to standardize the threat sentence according to the part-of-speech information and the dependency relationship to obtain a standard threat sentence.

Optionally, the language processing module is further configured to obtain the frequency of occurrence of each keyword in the threat corpus, and determine the threat keyword according to the frequency of occurrence;

The language processing module is further configured to perform synonym expansion on the threat keywords based on a preset dictionary to obtain a target corpus.

Optionally, the language processing module is further configured to acquire sentence-ending symbols, coordinating relative conjunctions, and progressive relative conjunctions in the simplified network threat information;

The language processing module is further configured to perform in-depth sentence segmentation on the simplified network threat information according to the sentence end symbol, the coordinating relative conjunction and the progressive relative conjunction, to obtain a threat sentence.

In the present invention, natural language processing is performed on unstructured network threat information to obtain the attack purpose and attack means; attack means are predicted for the attack purpose through a preset machine learning model to obtain unknown attack means; and structured network threat information is generated according to the attack purpose, the attack means, and the unknown attack means. Because the present invention automatically identifies and extracts the attacker's attack purpose and attack means from the unstructured network threat information based on natural language processing and the preset machine learning model, the analysis process of network threat information can be simplified and the security defense capability can be improved.

Brief description of the drawings

FIG. 1 is a schematic structural diagram of a network threat information extraction device in a hardware operating environment involved in the solution of an embodiment of the present invention;

FIG. 2 is a schematic flowchart of a first embodiment of a method for extracting network threat information according to the present invention;

FIG. 3 is a schematic flowchart of a second embodiment of a method for extracting network threat information according to the present invention;

FIG. 4 is a schematic flowchart of a third embodiment of a method for extracting network threat information according to the present invention;

FIG. 5 is a schematic diagram of semantic dependency analysis in an embodiment of the network threat information extraction method of the present invention;

FIG. 6 is a schematic flowchart of a fourth embodiment of a method for extracting network threat information according to the present invention;

FIG. 7 is a structural block diagram of the first embodiment of the device for extracting network threat information according to the present invention.

The realization, functional characteristics and advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.

Detailed description

It should be understood that the specific embodiments described here are only used to explain the present invention, not to limit the present invention.

Referring to FIG. 1, FIG. 1 is a schematic structural diagram of a device for extracting network threat information in a hardware operating environment according to an embodiment of the present invention.

As shown in FIG. 1, the network threat information extraction device may include: a processor 1001, such as a central processing unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to realize connection and communication between these components. The user interface 1003 may include a display screen (Display). Optionally, the user interface 1003 may also include a standard wired interface and a wireless interface. In the present invention, the wired interface of the user interface 1003 may be a USB interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a Wireless-Fidelity (Wi-Fi) interface). The memory 1005 may be a high-speed random access memory (RAM), or a non-volatile memory (NVM), such as a disk memory. Optionally, the memory 1005 may also be a storage device independent of the aforementioned processor 1001.

Those skilled in the art can understand that the structure shown in FIG. 1 does not constitute a limitation on the network threat information extraction device, which may include more or fewer components than those shown in the figure, combine some components, or arrange the components differently.

As shown in FIG. 1, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and a network threat information extraction program.

In the network threat information extraction device shown in FIG. 1, the network interface 1004 is mainly used to connect to a background server and perform data communication with the background server; the user interface 1003 is mainly used to connect to a user device; the network threat information extraction device calls the network threat information extraction program stored in the memory 1005 through the processor 1001, and executes the network threat information extraction method provided by the embodiment of the present invention.

Based on the above hardware structure, an embodiment of the network threat information extraction method of the present invention is proposed.

Referring to FIG. 2, FIG. 2 is a schematic flowchart of the first embodiment of the network threat information extraction method of the present invention.

In the first embodiment, the method for extracting network threat information includes the following steps:

Step S10: Perform natural language processing on unstructured network threat information to obtain attack purpose and attack means.

It should be understood that the execution subject of the method in this embodiment may be a network threat information extraction device with functions of data processing, network communication, and program operation, such as a server, or other electronic devices capable of achieving the same or similar functions, which is not limited in this embodiment.

It can be understood that, with the explosive growth of network threat attacks, extracting and sharing the information in threat analysis reports about the techniques and tactics used by attackers and the attack implementation process (TTP) is crucial to building network security. However, because there is no standard structured language for describing the technical and tactical intelligence in network threat reports, and no technology for extracting and analyzing it automatically, analyzing complex and unstructured threat analysis reports is very time-consuming and laborious.

The existing TRAM project, implemented by MITRE based on machine learning (ML) technology, and the TTPDrill project, implemented by the University of North Carolina based on information retrieval (IR) technology, can process threat analysis reports relatively easily. However, because both processing methods are designed for English reports, they cannot be applied to Chinese threat analysis reports, whose descriptions are complex and changeable. Moreover, the output accuracy and false alarm rate of these two projects are not ideal.

Therefore, in order to overcome the above-mentioned defects, in this embodiment, based on natural language processing and preset machine learning models, the attacker's attack purpose and attack means in unstructured network threat information are automatically identified and extracted, thereby simplifying the analysis process of network threat information and improving security defense capabilities.

It should be noted that the natural language processing may be at least one of text preprocessing, in-depth sentence segmentation, semantic dependency analysis, lexical tagging, and synonym expansion, which is not limited in this embodiment.

It should be noted that the attack purpose may be the techniques and tactics adopted by the attacker in the unstructured network threat information; for example, the techniques and tactics may be that the virus continues to run on the computer.

The attack means may be the attack implementation process of the attacker in the unstructured network threat information; for example, the attack implementation process may be modifying the registry or starting automatically at boot.

Step S20: Predict the attack means of the attack purpose through the preset machine learning model, and obtain the unknown attack means.

It should be noted that the preset machine learning model may be configured in advance. In this embodiment and other embodiments, a Bag of Words (BOW) model is used as an example for illustration.

The bag-of-words model puts all words into a bag, regardless of their grammar and word order, that is, each word is independent.
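As a minimal sketch of the bag-of-words idea, the following Python snippet counts the words of a sentence without regard to grammar or word order and compares two such counts by cosine similarity. The whitespace tokenizer, the function names, the use of cosine similarity, and the example sentences are illustrative assumptions; the source does not specify how the bag-of-words representation is consumed by the preset machine learning model.

```python
from collections import Counter
import math


def bag_of_words(sentence: str) -> Counter:
    """Count each word, ignoring grammar and word order."""
    # Assumption: simple lowercase whitespace tokenization.
    return Counter(sentence.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


s1 = bag_of_words("trojan modifies the registry")
s2 = bag_of_words("the registry modifies trojan")  # same words, different order
s3 = bag_of_words("virus sends email")             # no words in common with s1
```

Because word order is discarded, `s1` and `s2` are the same bag, while `s1` and `s3` share no words and therefore have zero similarity.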

Step S30: Generate structured network threat information according to the attack purpose, the attack means and the unknown attack means.

It should be understood that generating structured network threat information according to attack purpose, attack means and unknown attack means may be to aggregate attack purpose, attack means and unknown attack means to obtain structured network threat information.
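For illustration only, the aggregation in step S30 could be sketched in Python as below. The `StructuredThreatInfo` class, its field names, and the JSON output format are assumptions, since the source does not define the structured format. The attack purpose and attack means reuse the examples given above, while the unknown attack means value is a hypothetical placeholder.

```python
from dataclasses import dataclass, field, asdict
from typing import List
import json


@dataclass
class StructuredThreatInfo:
    """Hypothetical container for the three aggregated items."""
    attack_purpose: str
    attack_means: List[str] = field(default_factory=list)
    unknown_attack_means: List[str] = field(default_factory=list)


def aggregate(purpose: str, means: List[str], unknown_means: List[str]) -> StructuredThreatInfo:
    """Aggregate purpose, extracted means, and predicted means into one record."""
    return StructuredThreatInfo(purpose, list(means), list(unknown_means))


info = aggregate(
    "the virus continues to run on the computer",
    ["modifying the registry", "starting automatically at boot"],
    ["creating a scheduled task"],  # hypothetical predicted value
)
# Serialize the record; JSON is an assumed output format.
structured_json = json.dumps(asdict(info), ensure_ascii=False, indent=2)
```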

In the first embodiment, natural language processing is performed on unstructured network threat information to obtain the attack purpose and attack means; attack means are predicted for the attack purpose through the preset machine learning model to obtain unknown attack means; and structured network threat information is generated according to the attack purpose, the attack means and the unknown attack means. Because this embodiment automatically identifies and extracts the attacker's attack purpose and attack means from the unstructured network threat information based on natural language processing and the preset machine learning model, the analysis process of network threat information can be simplified and the security defense capability can be improved.

Referring to FIG. 3, FIG. 3 is a schematic flowchart of a second embodiment of the network threat information extraction method of the present invention. Based on the first embodiment shown in FIG. 2 above, a second embodiment of the network threat information extraction method of the present invention is proposed.

In the second embodiment, the step S10 includes:

Step S101: Perform text preprocessing on unstructured network threat information to obtain simplified network threat information.

It should be understood that, in order to improve the processing effect of natural language processing, in this embodiment text preprocessing may be performed on the unstructured network threat information first, followed by in-depth sentence segmentation, semantic dependency analysis, lexical tagging, and synonym expansion, to obtain the attack purpose and attack means of the attacker in the unstructured network threat information.

It can be understood that performing text preprocessing on unstructured network threat information to obtain simplified network threat information may be obtaining random information in the unstructured network threat information, performing simplified processing on the random information, and obtaining simplified network threat information.

Step S102: Perform in-depth sentence segmentation on the simplified network threat information to obtain threat sentences.

It should be understood that, in order to simplify the network threat information and ensure that each sentence to be analyzed independently expresses one set of techniques, tactics and implementation process (TTP), in this embodiment the simplified network threat information may be segmented in depth to obtain threat sentences.

It can be understood that performing in-depth sentence segmentation on the simplified network threat information to obtain the threat sentence may be: obtaining the sentence end symbols, coordinating relative conjunctions, and progressive relative conjunctions in the simplified network threat information, and performing in-depth sentence segmentation on the simplified network threat information according to the sentence end symbols, coordinating relative conjunctions, and progressive relative conjunctions to obtain the threat sentence.

Step S103: Perform semantic dependency analysis on the threat sentence to obtain a standard threat sentence.

It should be understood that, in order to standardize and unify complex and changeable descriptions, in this embodiment semantic dependency analysis may be performed on the threat sentence to obtain a standard threat sentence.

It can be understood that performing semantic dependency analysis on the threat sentence to obtain a standard threat sentence may be: performing semantic dependency analysis on the threat sentence to obtain the dependency relationship between the words in the threat sentence, and standardizing the threat sentence according to the dependency relationship to obtain a standard threat sentence.

Step S104: Perform lexical tagging on the standard threat sentences to obtain a threat corpus.

It should be understood that, in order to converge the threat corpus, in this embodiment, the necessary scores of each part of the standard threat sentence can be obtained first, and then the standard threat sentence can be simplified according to the necessary score to obtain the threat corpus.

It can be understood that the lexical tagging of the standard threat sentences to obtain the threat corpus may be to obtain the necessary scores of each part of the standard threat sentences, and simplify the standard threat sentences according to the necessary scores to obtain the threat corpus.

Step S105: performing synonym expansion on the threat corpus to obtain a target corpus.

It should be understood that, in order to improve the recall rate of subsequent model predictions, in this embodiment, synonym expansion may be performed on high-frequency keywords in the threat corpus.

It can be understood that performing synonym expansion on the threat corpus to obtain the target corpus may be: obtaining the occurrence frequency of each keyword in the threat corpus, determining the threat keywords according to the occurrence frequency, and performing synonym expansion on the threat keywords based on a preset dictionary to obtain the target corpus.
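A minimal Python sketch of frequency-based synonym expansion follows. The frequency threshold `min_count`, the whitespace tokenization, the strategy of adding one sentence variant per synonym, and the contents of the example dictionary are all assumptions made for illustration; the source only states that threat keywords are chosen by occurrence frequency and expanded using a preset dictionary.

```python
from collections import Counter
from typing import Dict, List


def expand_synonyms(
    corpus: List[str],
    synonym_dict: Dict[str, List[str]],
    min_count: int = 2,
) -> List[str]:
    """Return the corpus plus variants in which frequent keywords are replaced by synonyms."""
    # Occurrence frequency of every word in the threat corpus.
    counts = Counter(word for sentence in corpus for word in sentence.split())
    # Threat keywords: frequent words that the preset dictionary can expand.
    keywords = {w for w, c in counts.items() if c >= min_count and w in synonym_dict}

    expanded = list(corpus)
    for sentence in corpus:
        words = sentence.split()
        for i, word in enumerate(words):
            if word not in keywords:
                continue
            for synonym in synonym_dict[word]:
                variant = " ".join(words[:i] + [synonym] + words[i + 1:])
                if variant not in expanded:
                    expanded.append(variant)
    return expanded


# Hypothetical threat corpus and preset dictionary.
threat_corpus = ["trojan modifies registry", "virus modifies startup"]
preset_dict = {"modifies": ["changes", "alters"], "trojan": ["backdoor"]}
target_corpus = expand_synonyms(threat_corpus, preset_dict)
```

Here only "modifies" reaches the assumed threshold, so "trojan" is left unexpanded even though the dictionary contains an entry for it.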

Step S106: Determine the attack purpose and attack means according to the target corpus.

It should be noted that the attack purpose may be the techniques and tactics adopted by the attacker in the unstructured network threat information; for example, the techniques and tactics may be that the virus continues to run on the computer.

In the second embodiment, text preprocessing is performed on unstructured network threat information to obtain simplified network threat information; in-depth sentence segmentation is performed on the simplified network threat information to obtain threat sentences; semantic dependency analysis is performed on the threat sentences to obtain standard threat sentences; lexical tagging is performed on the standard threat sentences to obtain a threat corpus; synonym expansion is performed on the threat corpus to obtain a target corpus; and the attack purpose and attack means are determined according to the target corpus. Because this embodiment first performs text preprocessing on the unstructured network threat information, and then performs in-depth sentence segmentation, semantic dependency analysis, lexical tagging, and synonym expansion to obtain the attacker's attack purpose and attack means, the processing effect of natural language processing is improved.

In the second embodiment, the step S20 includes:

Step S201: Obtain multi-platform network threat information.

It should be understood that, in order to obtain multi-dimensional unknown attack methods, in this embodiment, multi-platform network threat information may be obtained first, and then based on the multi-platform network threat information, the attack method is predicted for the attack purpose through a preset machine learning model to obtain unknown attack methods.

It should be noted that the multi-platform network threat information may be network threat information detected and obtained by multiple security platforms.

Step S202: Based on the multi-platform network threat information, predict attack means for the attack purpose through a preset machine learning model to obtain unknown attack means.

In the second embodiment, it is disclosed that multi-platform network threat information is obtained, and based on the multi-platform network threat information, attack means are predicted for the attack purpose through a preset machine learning model to obtain unknown attack means; since this embodiment predicts attack means based on multi-platform network threat information, multi-dimensional unknown attack means can be obtained.

Referring to FIG. 4, FIG. 4 is a schematic flowchart of a third embodiment of the network threat information extraction method of the present invention. Based on the second embodiment shown in FIG. 3 above, a third embodiment of the network threat information extraction method of the present invention is proposed.

In the third embodiment, the step S101 includes:

Step S1011: Obtain random information in unstructured network threat information.

It should be understood that, in order to reduce input randomness and improve information processing speed, in this embodiment, random information in unstructured network threat information may be obtained first, and then the random information may be simplified to obtain simplified network threat information.

Step S1012: Simplify the random information to obtain simplified network threat information.

In a specific implementation, for example, the unstructured network threat information is "a Trojan horse program that releases a normal TP program TPHelper.exe and a malicious TPHelperBase.dll in the %TEMP% directory after it runs to constitute a dll hijacking." After the random information in the unstructured network threat information is simplified, the simplified network threat information "a Trojan horse program that releases a normal TP program EXE file and a malicious DLL file in a specific directory after it runs to constitute a dll hijacking" is obtained.
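The simplification in this example can be approximated with a few regular-expression rewrite rules. The following Python sketch is an assumption about one possible realization, not the method of the source: the rule list `RULES`, the patterns, and the function name `simplify_random_info` are hypothetical, and real reports would need a much broader rule set.

```python
import re

# Hypothetical rewrite rules: each replaces a kind of "random" detail
# (concrete file names, environment-variable directories) with a generic phrase.
RULES = [
    (re.compile(r"\b[\w.-]+\.exe\b", re.IGNORECASE), "EXE file"),
    (re.compile(r"\b[\w.-]+\.dll\b", re.IGNORECASE), "DLL file"),
    (re.compile(r"the %\w+% directory"), "a specific directory"),
]


def simplify_random_info(text: str) -> str:
    """Apply every rewrite rule in order and return the simplified text."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


raw = (
    "a Trojan horse program that releases a normal TP program TPHelper.exe "
    "and a malicious TPHelperBase.dll in the %TEMP% directory after it runs "
    "to constitute a dll hijacking."
)
simplified = simplify_random_info(raw)
```

The bare word "dll" in "dll hijacking" is left untouched because the file-name patterns require a name followed by a dot and the extension.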

In the third embodiment, the random information in the unstructured network threat information is obtained and then simplified to obtain the simplified network threat information, thereby reducing input randomness and improving information processing speed.

In the third embodiment, the step S102 includes:

Step S1021: Obtain the sentence-end symbols, coordinating relative conjunctions and progressive relative conjunctions in the simplified network threat information.

It should be noted that the sentence end symbols may include ?, !, ……, quotation marks, etc. Coordinating relative conjunctions may include "and", "with", etc. Progressive relative conjunctions may include "not only", "but also", "not to mention", etc.

Step S1022: Perform in-depth sentence segmentation on the simplified network threat information according to the sentence end symbol, the coordinating relative conjunction and the progressive relative conjunction to obtain a threat sentence.

It can be understood that performing in-depth sentence segmentation on the simplified network threat information according to the sentence end symbols, the coordinating relative conjunctions and the progressive relative conjunctions to obtain the threat sentence may be: obtaining the positions of the sentence end symbols, the coordinating relative conjunctions and the progressive relative conjunctions in the simplified network threat information, and performing in-depth sentence segmentation on the simplified network threat information according to those positions to obtain the threat sentence.
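As a rough sketch of position-based in-depth segmentation, the Python snippet below splits text wherever a sentence end symbol or one of the conjunctions occurs. The marker lists are English stand-ins chosen for illustration (the full stops "." and "。" are added as assumptions), and the function names and the example text are hypothetical.

```python
import re

# Assumed marker lists; English stand-ins for the markers named in the text.
SENTENCE_END = ["?", "!", "……", "。", "."]
COORDINATING = ["and"]
PROGRESSIVE = ["not only", "but also", "not to mention"]


def build_splitter():
    """Compile one pattern that matches any end symbol or whole-word conjunction."""
    symbols = "|".join(re.escape(s) for s in SENTENCE_END)
    conjunctions = sorted(COORDINATING + PROGRESSIVE, key=len, reverse=True)
    words = "|".join(rf"\b{re.escape(w)}\b" for w in conjunctions)
    return re.compile(rf"(?:{symbols}|{words})", re.IGNORECASE)


SPLITTER = build_splitter()


def deep_segment(text: str):
    """Split at every marker position and drop empty fragments."""
    parts = SPLITTER.split(text)
    return [p.strip(" ,;") for p in parts if p.strip(" ,;")]


threat_sentences = deep_segment(
    "The Trojan modifies the registry and starts automatically at boot. "
    "The virus steals passwords, but also uploads files!"
)
```

Each resulting fragment describes a single action, which is the property the segmentation step is meant to ensure before dependency analysis.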

In the third embodiment, the sentence-end symbols, coordinating relative conjunctions and progressive relative conjunctions in the simplified network threat information are obtained, and in-depth sentence segmentation is performed on the simplified network threat information according to them to obtain threat sentences. Because the simplified network threat information is segmented in depth, it is broken into shorter units, ensuring that each sentence to be analyzed independently expresses a technical strategy and an implementation process.

In the third embodiment, the step S103 includes:

Step S1031: Perform semantic dependency analysis on the threat sentence to obtain the dependency relationship between the words in the threat sentence.

It should be noted that the dependency relationship may be a dependency relationship between parent and child words.

Step S1032: Perform standardization processing on the threat sentence according to the dependency relationship to obtain a standard threat sentence.

It can be understood that after the threat sentence is standardized according to the dependency relationship, the tools, approaches, spatial locations, implementation scope, achieved effects, etc. used by the attacker in the sentence can be output in a standardized form.

In a specific implementation, for example, the subject words and the sentence patterns can be standardized and unified.

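Further, in order to improve the effect of standardization processing, the step S1032 includes:

Obtain the part-of-speech information of each vocabulary in the threat sentence;

Standardize the threat sentence according to the part-of-speech information and the dependency relationship to obtain a standard threat sentence.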

For ease of understanding, description is made with reference to FIG. 5, but this solution is not limited thereto. FIG. 5 is a schematic diagram of semantic dependency analysis. In the figure, the threat sentence is "the Trojan will send the obtained keyboard log to a configurable email address". ROOT represents the root node, which is the core node of the whole sentence; mDEPD represents an auxiliary word; FEAT represents a modifier; PAT represents the object of the subject's operation (the object changes); rPAT represents the object of the subject's operation (the object changes, passive sentence); CONT represents the object of the subject's operation (the object does not change significantly); rCONT represents the object of the subject's operation (the object does not change significantly, passive sentence); mRELA represents conjunctions and prepositions, such as "but" and "and"; AGT represents the subject; LOC represents space; and mPUNC represents punctuation marks.

In the third embodiment, semantic dependency analysis is performed on the threat sentence to obtain the dependency relationship between the words in the threat sentence, and the threat sentence is standardized according to the dependency relationship to obtain the standard threat sentence, so that complex and changeable descriptions can be standardized and unified.

In the third embodiment, the step S104 includes:

Step S1041: Obtain the necessary scores of each part in the standard threat sentence.

It should be noted that the necessary score is used to measure the degree of necessity of each word within the standard threat sentence.

Step S1042: Simplify the standard threat sentence according to the necessary score to obtain a threat corpus.

In a specific implementation, for example, the standard threat sentence is "the Trojan horse will send the obtained keyboard logs to a configurable email address". After the standard threat sentence is simplified according to the necessary scores of each part in the standard threat sentence, the entry "send keyboard logs to an email address" in the threat corpus is obtained.
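The disclosure does not say how the necessary scores are computed. As a rough sketch only, the Python example below assumes that each part of the sentence already carries one of the semantic dependency labels of FIG. 5 and that a score is looked up per label. The score values, the threshold, the labels assigned to the example sentence, and the function name are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical necessity scores per dependency label. The labels follow
# FIG. 5; the numeric values are assumptions for this sketch.
NECESSITY = {
    "ROOT": 1.0, "PAT": 0.9, "CONT": 0.9, "LOC": 0.8, "mRELA": 0.7,
    "AGT": 0.3, "FEAT": 0.2, "mDEPD": 0.1, "mPUNC": 0.0,
}

# Assumed labelling of the example sentence (not taken from FIG. 5).
EXAMPLE_TOKENS: List[Tuple[str, str]] = [
    ("the Trojan horse", "AGT"),
    ("will", "mDEPD"),
    ("send", "ROOT"),
    ("the obtained", "FEAT"),
    ("keyboard logs", "CONT"),
    ("to", "mRELA"),
    ("a configurable", "FEAT"),
    ("email address", "LOC"),
]


def simplify_sentence(tokens: List[Tuple[str, str]],
                      threshold: float = 0.5) -> str:
    """Keep only the parts whose necessity score reaches the threshold."""
    kept = [
        word for word, label in tokens
        if NECESSITY.get(label, 0.0) >= threshold
    ]
    return " ".join(kept)


if __name__ == "__main__":
    print(simplify_sentence(EXAMPLE_TOKENS))
```

With these assumed scores, the subject, auxiliary word and modifiers fall below the threshold, leaving the action, its object and its destination.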

In the third embodiment, the necessary scores of each part of the standard threat sentence are obtained, and the standard threat sentence is simplified according to the necessary scores to obtain the threat corpus. Because the standard threat sentence is reduced to its necessary parts, the threat corpus can be converged.

In the third embodiment, the step S105 includes:

Step S1051: Obtain the occurrence frequency of each keyword in the threat corpus, and determine the threat keyword according to the occurrence frequency.

It can be understood that determining the threat keyword according to the frequency of occurrence may be sorting the keywords according to the frequency of occurrence, and determining the threat keyword according to the ranking result.

Step S1052: Perform synonym expansion on the threat keywords based on a preset dictionary to obtain a target corpus.

It should be noted that the preset dictionary can be preset, and synonyms corresponding to each keyword can be stored in the preset dictionary.

In a specific implementation, for example, the entry "the Trojan horse collects account names in the domain" in the threat corpus can be expanded to "the Trojan horse collects user names in the domain", "the Trojan horse collects user accounts in the domain", and "the Trojan horse collects user login names in the domain".
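Steps S1051 and S1052 can be sketched together in Python. The candidate keyword list, the contents of the preset dictionary, and both function names are hypothetical. The disclosure only states that keywords are ranked by frequency and that synonyms are stored per keyword.

```python
from collections import Counter
from typing import Dict, List


def top_keywords(corpus: List[str], candidates: List[str],
                 top_n: int = 1) -> List[str]:
    """Rank candidate keywords by occurrence count and keep the top ones."""
    counts: Counter = Counter()
    for sentence in corpus:
        for keyword in candidates:
            counts[keyword] += sentence.count(keyword)
    return [kw for kw, n in counts.most_common(top_n) if n > 0]


def expand_synonyms(corpus: List[str], keywords: List[str],
                    dictionary: Dict[str, List[str]]) -> List[str]:
    """Add one variant sentence per synonym of each threat keyword."""
    expanded = list(corpus)
    for sentence in corpus:
        for keyword in keywords:
            if keyword not in sentence:
                continue
            for synonym in dictionary.get(keyword, []):
                variant = sentence.replace(keyword, synonym)
                if variant not in expanded:
                    expanded.append(variant)
    return expanded


if __name__ == "__main__":
    threat_corpus = [
        "collect account name in domain",
        "send account name to server",
        "send keyboard logs to email address",
    ]
    # Hypothetical preset dictionary.
    preset = {"account name": ["user name", "user account", "user login name"]}
    keywords = top_keywords(threat_corpus, ["account name", "keyboard logs"])
    for line in expand_synonyms(threat_corpus, keywords, preset):
        print(line)
```

Restricting expansion to high-frequency keywords keeps the target corpus from growing with variants of rarely used terms.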

In the third embodiment, the frequency of occurrence of each keyword in the threat corpus is obtained, the threat keywords are determined according to the frequency of occurrence, and synonym expansion is performed on the threat keywords based on the preset dictionary to obtain the target corpus. Because synonym expansion is performed on the high-frequency keywords in the threat corpus, the recall rate of subsequent model predictions can be improved.

Referring to FIG. 6 , FIG. 6 is a schematic flowchart of a fourth embodiment of the network threat information extraction method of the present invention. Based on the second embodiment shown in FIG. 3 above, the fourth embodiment of the network threat information extraction method of the present invention is proposed.

In the fourth embodiment, before step S201, it also includes:

Step S110: Construct a training set corpus according to the target corpus.

It should be understood that, in order to improve the accuracy of the preset machine learning model, in this embodiment, the initial machine learning model may be trained first to obtain the preset machine learning model.

It can be understood that constructing the training set corpus according to the target corpus may be to randomly select training samples from the target corpus to construct the training set corpus.

Further, in order to cluster the training set corpus, the step S110 includes:

A training set corpus is constructed based on the threat sentence samples.

It should be understood that a synonymous threat statement may be a statement with the same semantics.

It can be understood that selecting the threat sentence samples from the target corpus according to the number of sentences may be sorting the synonymous threat sentences according to the number of sentences in descending order, and using the top-ranked preset number of synonymous threat sentences as the threat sentence samples.
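The ranking-based selection can be sketched in Python as below. The sketch assumes the synonymous threat sentences have already been grouped (how the grouping is done is not covered here), reads "preset number" as the number of top-ranked groups to keep, and uses hypothetical group keys and a hypothetical function name.

```python
from typing import Dict, List


def select_samples(groups: Dict[str, List[str]],
                   preset_number: int) -> List[str]:
    """Pick the sentences of the largest synonymous groups as samples.

    `groups` maps a hypothetical group key to the synonymous threat
    sentences in that group.
    """
    # Sort groups by their number of sentences, largest first.
    ranked = sorted(groups.items(), key=lambda item: len(item[1]),
                    reverse=True)
    samples: List[str] = []
    for _, sentences in ranked[:preset_number]:
        samples.extend(sentences)
    return samples


if __name__ == "__main__":
    grouped = {
        "exfiltrate-logs": ["send keyboard logs to email address"],
        "collect-accounts": [
            "collect account name in domain",
            "collect user name in domain",
            "collect user account in domain",
        ],
        "dll-hijack": [
            "release DLL file in a specific directory",
            "drop DLL file in a specific directory",
        ],
    }
    for sample in select_samples(grouped, preset_number=2):
        print(sample)
```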

Further, in order to improve the reliability of the threat sentence samples, the step of selecting the threat sentence samples from the target corpus according to the number of sentences includes:

Receiving user input semantic tags based on the target corpus;

It should be understood that, in order to improve the reliability of the threat sentence samples, the user may also input semantic tags to mark each threat sentence in the target corpus.

It can be understood that selecting threat sentence samples from the target corpus according to the sorting results and the semantic tags may be taking, as the threat sentence samples, a preset number of top-ranked synonymous threat sentences that carry a preset semantic tag. The preset semantic tag may be set in advance, which is not limited in this embodiment.

Step S120: Train the initial machine learning model according to the training set corpus to obtain a preset machine learning model.

It should be understood that training the initial machine learning model according to the training set corpus to obtain the preset machine learning model may be inputting each threat sentence sample in the training set corpus into the initial machine learning model, and adjusting the initial machine learning model according to the output results, so as to train the initial machine learning model and obtain the preset machine learning model.

In the fourth embodiment, the training set corpus is constructed according to the target corpus, and the initial machine learning model is trained according to the training set corpus to obtain the preset machine learning model. Because the initial machine learning model is trained in advance, the accuracy of the preset machine learning model is improved.

In addition, an embodiment of the present invention also proposes a storage medium, on which a network threat information extraction program is stored, and when the network threat information extraction program is executed by a processor, the network threat information extraction method as described above is implemented.

In addition, referring to FIG. 7 , an embodiment of the present invention also proposes a network threat information extraction device, the network threat information extraction device includes: a language processing module 10, a method prediction module 20, and an information generation module 30;

The language processing module 10 is configured to perform natural language processing on unstructured network threat information to obtain attack purpose and attack means.

It should be noted that the purpose of the attack may be the techniques and tactics adopted by the attacker in the unstructured network threat information, for example, the techniques and tactics may be that the virus continues to run on the computer.

The method prediction module 20 is configured to predict the attack method for the attack purpose through a preset machine learning model, and obtain an unknown attack method.

The information generating module 30 is configured to generate structured network threat information according to the attack purpose, the attack means and the unknown attack means.

It should be understood that generating structured network threat information based on attack purpose, attack means and unknown attack means can be to aggregate attack purpose, attack means and unknown attack means to obtain structured network threat information.
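One possible way to aggregate the three parts is sketched below in Python. The record layout, the field names, the JSON output format, and the example attack means are assumptions made for illustration. The disclosure does not prescribe a particular structured format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class StructuredThreatInfo:
    """Hypothetical record aggregating the three extracted parts."""
    attack_purpose: str
    attack_means: List[str]
    unknown_attack_means: List[str] = field(default_factory=list)


def generate_structured_info(purpose: str, means: List[str],
                             unknown_means: List[str]) -> str:
    """Aggregate purpose, known means and predicted means into JSON."""
    info = StructuredThreatInfo(purpose, list(means), list(unknown_means))
    return json.dumps(asdict(info), ensure_ascii=False, indent=2)


if __name__ == "__main__":
    print(generate_structured_info(
        "virus continues to run on the computer",
        ["release DLL file in a specific directory"],   # extracted (example)
        ["add a startup entry"],                        # predicted (example)
    ))
```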

In this embodiment, it is disclosed that natural language processing is performed on unstructured network threat information to obtain the attack purpose and attack means, and the attack purpose is predicted by a preset machine learning model to obtain unknown attack means, and structured network threat information is generated according to the attack purpose, attack means, and unknown attack means; since this embodiment automatically identifies and extracts the attack purpose and attack means of the attacker in the unstructured network threat information based on natural language processing and preset machine learning models, the analysis process of network threat information can be simplified, and security defense capabilities can be improved.

For other embodiments or specific implementations of the device for extracting network threat information in the present invention, reference may be made to the above-mentioned method embodiments, which will not be repeated here.

It should be noted that, in this document, the terms "comprising", "including" or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or system comprising a series of elements includes not only those elements, but also other elements not explicitly listed, or elements inherent to such a process, method, article or system. Without further limitations, an element defined by the phrase "comprising a..." does not preclude the presence of additional identical elements in the process, method, article or system comprising that element.

The serial numbers of the above embodiments of the present invention are for description only, and do not represent the advantages and disadvantages of the embodiments.

Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is a better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a read-only memory (ROM)/random access memory (RAM), a magnetic disk, or an optical disc) and includes several instructions to enable a terminal device (which can be a mobile phone, computer, server, air conditioner, or network device, etc.) to perform the methods described in each embodiment of the present invention.

The above are only preferred embodiments of the present invention and do not limit the patent scope of the present invention. Any equivalent structure or equivalent process transformation made using the description and accompanying drawings of the present invention, whether applied directly or indirectly in other related technical fields, is likewise included in the scope of patent protection of the present invention.

The present invention discloses A1. A method for extracting network threat information. The method for extracting network threat information includes the following steps:

A2. The method for extracting network threat information as described in A1, wherein the steps of performing natural language processing on unstructured network threat information to obtain attack purpose and attack means include:

performing synonym expansion on the threat corpus to obtain a target corpus;

A3. The method for extracting network threat information as described in A2, the step of performing semantic dependency analysis on the threat sentence and obtaining the dependency relationship between the words in the threat sentence includes:

A4. The method for extracting network threat information as described in A3, wherein the step of standardizing the threat statement according to the dependency relationship to obtain a standard threat statement includes:

A5. The method for extracting network threat information as described in A2, the step of performing synonym expansion on the threat corpus to obtain the target corpus includes:

A6. The method for extracting network threat information as described in A2, the step of performing in-depth sentence segmentation on the simplified network threat information to obtain a threat sentence includes:

A7. The method for extracting network threat information as described in A2, wherein the step of performing lexical tagging on the standard threat sentence and obtaining a threat corpus includes:

obtaining the necessary scores for each part of said standard threat statement;

A8. The method for extracting network threat information as described in A2, wherein the step of performing text preprocessing on unstructured network threat information to obtain simplified network threat information further includes:

Obtain random information in unstructured cyber threat information;

A9. The method for extracting network threat information as described in A2, before the step of predicting the attack means of the attack purpose through the preset machine learning model, and obtaining the unknown attack means, it also includes:

Constructing a training set corpus according to the target corpus;

A10. The method for extracting network threat information as described in A9, the step of constructing a training set corpus according to the target corpus includes:

A training set corpus is constructed based on the threat sentence samples.

A11. The method for extracting network threat information as described in A10, the step of selecting a threat sentence sample from the target corpus according to the number of sentences includes:

Receiving user input semantic tags based on the target corpus;

A12. The network threat information extraction method described in any one of A1 to A11, the step of predicting the attack means for the attack purpose through a preset machine learning model, and obtaining unknown attack means, including:

Obtain multi-platform cyber threat information;

The present invention also discloses B13, a network threat information extraction device. The network threat information extraction device includes: a memory, a processor, and a network threat information extraction program stored in the memory and operable on the processor. When the network threat information extraction program is executed by the processor, the network threat information extraction method as described above is realized.

The present invention also discloses C14, a storage medium, on which a network threat information extraction program is stored, and when the network threat information extraction program is executed by a processor, the above-mentioned network threat information extraction method is realized.

The present invention also discloses D15, a network threat information extraction device, the network threat information extraction device includes: a language processing module, a means prediction module and an information generation module;

The language processing module is used to perform natural language processing on unstructured network threat information to obtain attack purposes and attack means;

D16. The device for extracting network threat information as described in D15, wherein the language processing module is further configured to perform text preprocessing on unstructured network threat information to obtain simplified network threat information;

D17. The device for extracting network threat information as described in D16, wherein the language processing module is further configured to perform semantic dependency analysis on the threat sentence, and obtain a dependency relationship between each vocabulary in the threat sentence;

D18. The device for extracting network threat information as described in D17, wherein the language processing module is further configured to obtain part-of-speech information of each vocabulary in the threat sentence;

D19. The device for extracting network threat information as described in D16, wherein the language processing module is further configured to obtain the frequency of occurrence of each keyword in the threat corpus, and determine the threat keyword according to the frequency of occurrence;

D20. The device for extracting network threat information as described in D16, wherein the language processing module is further configured to acquire the sentence-end symbols, coordinating relative conjunctions, and progressive relative conjunctions in the simplified network threat information;

Claims

A method for extracting network threat information, characterized in that the method for extracting network threat information comprises the following steps:

Perform natural language processing on unstructured network threat information to obtain attack purpose and attack means;

Predict the attack method of the attack purpose through the preset machine learning model, and obtain the unknown attack method;

Generate structured network threat information according to the attack purpose, the attack means and the unknown attack means.
The network threat information extraction method according to claim 1, wherein the step of performing natural language processing on the unstructured network threat information to obtain the attack purpose and attack means includes:

Perform text preprocessing on unstructured network threat information to obtain simplified network threat information;

performing in-depth sentence segmentation on the simplified network threat information to obtain threat sentences;

Performing semantic dependency analysis on the threat sentence to obtain a standard threat sentence;

performing lexical tagging on the standard threat sentence to obtain a threat corpus;

performing synonym expansion on the threat corpus to obtain a target corpus;

The attack purpose and attack means are determined according to the target corpus.
The method for extracting network threat information according to claim 2, wherein the step of performing semantic dependency analysis on the threat sentence to obtain the dependency relationship between the words in the threat sentence includes:

Performing semantic dependency analysis on the threat sentence to obtain the dependency relationship between the words in the threat sentence;

The threat sentence is standardized according to the dependency relationship to obtain a standard threat sentence.
The method for extracting network threat information according to claim 3, wherein the step of standardizing the threat sentence according to the dependency relationship to obtain a standard threat sentence includes:

Obtain the part-of-speech information of each vocabulary in the threat sentence;

Standardize the threat sentence according to the part-of-speech information and the dependency relationship to obtain a standard threat sentence.
The method for extracting network threat information according to claim 2, wherein the step of expanding the threat corpus with synonyms to obtain the target corpus includes:

Obtain the frequency of occurrence of each keyword in the threat corpus, and determine the threat keyword according to the frequency of occurrence;

The threat keywords are synonymously expanded based on a preset dictionary to obtain a target corpus.
The method for extracting network threat information according to claim 2, wherein the step of performing in-depth sentence segmentation on the simplified network threat information to obtain threat sentences includes:

Obtaining the sentence end symbols, coordinating relative conjunctions and progressive relative conjunctions in the simplified network threat information;

The simplified network threat information is segmented in depth according to the sentence end symbol, the coordinating relative conjunction and the progressive relative conjunction to obtain a threat sentence.
The method for extracting network threat information according to claim 2, wherein the step of performing lexical tagging on the standard threat sentence to obtain a threat corpus includes:

obtaining the necessary scores for each part of said standard threat statement;

The standard threat sentence is simplified according to the necessary score to obtain a threat corpus.
A network threat information extraction device, characterized in that the network threat information extraction device comprises: a memory, a processor, and a network threat information extraction program stored on the memory and operable on the processor, and the network threat information extraction program is executed by the processor to implement the network threat information extraction method according to any one of claims 1 to 7.
A storage medium, wherein a network threat information extraction program is stored on the storage medium, and when the network threat information extraction program is executed by a processor, the network threat information extraction method according to any one of claims 1 to 7 is realized.
A network threat information extraction device, characterized in that the network threat information extraction device includes: a language processing module, a means prediction module, and an information generation module;

The language processing module is used to perform natural language processing on unstructured network threat information to obtain attack purposes and attack means;

The method prediction module is used to predict the attack method for the attack purpose through a preset machine learning model, and obtain an unknown attack method;

The information generating module is configured to generate structured network threat information according to the attack purpose, the attack means and the unknown attack means.