CN115687979A - Method and device for identifying specified technology in threat intelligence, electronic equipment and storage medium - Google Patents

Publication number
CN115687979A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211387653.0A
Other languages
Chinese (zh)
Inventor
贾蓉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Original Assignee
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Application filed by Beijing Topsec Technology Co Ltd, Beijing Topsec Network Security Technology Co Ltd, Beijing Topsec Software Co Ltd

Abstract

The application provides a method and a device for identifying a specified technology in threat intelligence, an electronic device, and a storage medium. The method comprises: preprocessing cyber threat intelligence to obtain a word sequence corresponding to each paragraph in the cyber threat intelligence; adding a word mask to the word sequence and inputting it into a trained cloze (fill-in-the-blank) model to obtain the predicted word that the cloze model outputs for the word mask; inputting the word sequence into a trained technical classification model to obtain multiple prediction categories output by the technical classification model and a confidence corresponding to each prediction category, and selecting the several prediction categories with the highest confidences as the target prediction categories corresponding to the paragraph; for each paragraph, judging whether any target prediction category corresponding to the paragraph contains the predicted word corresponding to the paragraph; and determining, according to the judgment result corresponding to each paragraph, whether the paragraph includes the specified technology. The method and the device can accurately identify content describing the specified technology in cyber threat intelligence.

Description

Method and device for identifying specified technology in threat intelligence, electronic equipment and storage medium
Technical Field
The present application relates to the field of network security technologies, and in particular, to a method and an apparatus for identifying a specified technology in threat intelligence, an electronic device, and a computer-readable storage medium.
Background
Threat intelligence is defined as "evidence-based knowledge, including background, mechanism, indicators, impacts, and actionable recommendations, that relates to existing or emerging threats or hazards to assets and can be used to inform decisions about the subject's response to those threats or hazards". Threat intelligence in the field of network security, or cyber threat intelligence, can provide relevant information in time, such as the characteristics of an attack, which helps reduce the uncertainty in identifying potential security vulnerabilities and attacks. Individuals or businesses may obtain cyber threat intelligence from social media (e.g., blogs), vendor announcements (Microsoft, Cisco, etc.), hacker forums, and the like.
However, the format of cyber threat intelligence is not fixed: a technology involved may be marked with a standard identifier, or it may be described only in prose without one. For example, the "Sudo and Sudo Caching" technology may appear in cyber threat intelligence as "T1548.003 Sudo and Sudo Caching", which states the technology name directly, or as a textual description such as "Adversaries may perform sudo caching and/or use the sudoers file to elevate privileges. Adversaries may do this to execute commands as other users or spawn processes with higher privileges."
Users of cyber threat intelligence (individuals or businesses) may need to pay special attention to certain technologies in order to improve their ability to defend against cyber threats that use those technologies. A solution that can accurately identify a specified technology in cyber threat intelligence is therefore needed.
Disclosure of Invention
The embodiments of the present application aim to provide a method and an apparatus for identifying a specified technology in threat intelligence, an electronic device, and a computer-readable storage medium, which are used to accurately identify content related to the specified technology in cyber threat intelligence.
In one aspect, the present application provides a method for identifying a specified technology in threat intelligence, including:
preprocessing cyber threat intelligence to obtain a word sequence corresponding to each paragraph in the cyber threat intelligence;
for the word sequence corresponding to each paragraph, adding a word mask to the word sequence and inputting it into a trained cloze model to obtain the predicted word, corresponding to the word mask, that is output by the cloze model;
for the word sequence corresponding to each paragraph, inputting the word sequence into a trained technical classification model to obtain a plurality of prediction categories output by the technical classification model and a confidence corresponding to each prediction category, and selecting the several prediction categories with the highest confidences as the target prediction categories corresponding to the paragraph; wherein each prediction category indicates a technology name belonging to the specified technology;
for each paragraph, judging whether any target prediction category corresponding to the paragraph contains the predicted word corresponding to the paragraph;
and determining whether the paragraph includes the specified technology according to the judgment result corresponding to each paragraph.
Through the above measures, after the cyber threat intelligence is divided into a plurality of paragraphs, the specified technology is identified in each paragraph by means of the cloze model and the technical classification model, so that paragraphs containing content related to the specified technology are accurately identified.
In an embodiment, before the preprocessing the cyber-threat intelligence to obtain a word sequence corresponding to each paragraph in the cyber-threat intelligence, the method further includes:
performing regular-expression matching on the cyber threat intelligence against a plurality of technology names under the specified technology, and judging whether any technology name is matched;
if any technology name is matched, determining that the cyber threat intelligence includes the specified technology;
and if no technology name is matched, continuing to execute the step of preprocessing the cyber threat intelligence.
Through the above measures, when the cyber threat intelligence contains a technology name under the specified technology, the specified technology can be identified quickly, which reduces the workload of the identification task.
In an embodiment, the preprocessing the cyber threat intelligence to obtain a word sequence corresponding to each paragraph in the cyber threat intelligence includes:
dividing the network threat intelligence into a plurality of paragraphs;
performing word segmentation on each paragraph, and filtering stop words and invalid words from word segmentation results;
and aiming at each paragraph, performing word stem extraction on the word segmentation result subjected to filtering processing to obtain a word sequence corresponding to the paragraph.
Through the measures, the network threat intelligence can be processed into word sequences corresponding to a plurality of paragraphs.
In one embodiment, the cloze model is trained by:
for each sample corpus in a sample data set, replacing at least one word in the sample corpus with a word mask to obtain a specified sample corpus;
inputting the specified sample corpus into a pre-trained model to obtain a sample prediction result corresponding to the word mask in the specified sample corpus;
and adjusting model parameters of the pre-trained model according to the sample prediction result corresponding to the word mask in the specified sample corpus and the replaced word, to obtain the cloze model.
Through the above measures, the cloze model can be obtained through training.
In one embodiment, the sample corpus includes a technology name and a technology description;
the replacing at least one word in the sample corpus with a word mask comprises:
selecting a word from the technology name contained in the sample corpus and replacing the word with a word mask; and/or,
selecting a related word of the specified technology from the technical description contained in the sample corpus and replacing the related word with a word mask; and/or,
randomly selecting at least one word in the sample corpus and replacing the word with a word mask.
Through the above measures, the sample corpus can be processed into the specified sample corpus.
In one embodiment, the technical classification model is trained by:
inputting technical descriptions included in sample corpora in a sample data set into a classification model to obtain a sample prediction category output by the classification model;
and adjusting the model parameters of the classification model according to the difference between the sample prediction category of the sample corpus and the technical name contained in the sample corpus to obtain the technical classification model.
Through the measures, the technical classification model can be obtained through training.
In an embodiment, the determining whether the paragraph includes the specified technique according to the determination result corresponding to each paragraph includes:
if the judgment result corresponding to any paragraph indicates that the target prediction category comprising the prediction words exists, determining that the paragraph comprises the specified technology;
and if the judgment result corresponding to any paragraph indicates that the target prediction category comprising the prediction words does not exist, determining that the paragraph does not comprise the specified technology.
By the above measures, a plurality of paragraphs containing the specified technology can be identified from the network threat intelligence.
In another aspect, the present application further provides an apparatus for identifying a specified technology in threat intelligence, comprising:
a preprocessing module, configured to preprocess cyber threat intelligence to obtain a word sequence corresponding to each paragraph in the cyber threat intelligence;
a first prediction module, configured to, for the word sequence corresponding to each paragraph, add a word mask to the word sequence and input it into a trained cloze model to obtain the predicted word, corresponding to the word mask, that is output by the cloze model;
a second prediction module, configured to, for the word sequence corresponding to each paragraph, input the word sequence into a trained technical classification model to obtain a plurality of prediction categories output by the technical classification model and a confidence corresponding to each prediction category, and select the several prediction categories with the highest confidences as the target prediction categories corresponding to the paragraph; wherein each prediction category indicates a technology name belonging to the specified technology;
a judging module, configured to judge, for each paragraph, whether any target prediction category corresponding to the paragraph contains the predicted word corresponding to the paragraph;
and a determining module, configured to determine whether the paragraph includes the specified technology according to the judgment result corresponding to each paragraph.
In addition, the present application also provides an electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the above method for identifying a specified technology in threat intelligence.
Further, the present application also provides a computer-readable storage medium storing a computer program, which is executable by a processor to perform the above method for identifying a specified technology in threat intelligence.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required to be used in the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic diagram of an application scenario of a method for identifying a specified technology in threat intelligence provided in an embodiment of the present application;
Fig. 2 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
Fig. 3 is a schematic flowchart of a method for identifying a specified technology in threat intelligence according to an embodiment of the present application;
Fig. 4 is a schematic flowchart of a preliminary identification method for a specified technology in threat intelligence according to an embodiment of the present application;
Fig. 5 is a detailed flowchart of step 310 in Fig. 3 according to an embodiment of the present application;
Fig. 6 is a schematic flowchart of a training method for a cloze model according to an embodiment of the present application;
Fig. 7 is a schematic flowchart of a training method for a technical classification model according to an embodiment of the present application;
Fig. 8 is an overall schematic diagram of a method for identifying a specified technology in threat intelligence provided in an embodiment of the present application;
Fig. 9 is a schematic flowchart of a method for identifying a specified technology in threat intelligence according to another embodiment of the present application;
Fig. 10 is a block diagram of an apparatus for identifying a specified technology in threat intelligence provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
Like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
Fig. 1 is a schematic diagram of an application scenario of a method for identifying a specified technology in threat intelligence provided in an embodiment of the present application. As shown in Fig. 1, the application scenario includes a client 20 and a server 30. The client 20 may be a user terminal such as a host, a mobile phone, or a tablet computer, and is configured to send a manually constructed sample data set to the server 30. The server 30 may be a server, a server cluster, or a cloud computing center; it may train a cloze model and a technical classification model based on the sample corpora in the sample data set, and then identify content containing features of the specified technology in cyber threat intelligence by means of the cloze model and the technical classification model.
As shown in Fig. 2, the present embodiment provides an electronic device 1 including at least one processor 11 and a memory 12; one processor 11 is taken as an example in Fig. 2. The processor 11 and the memory 12 are connected by a bus 10. The memory 12 stores instructions executable by the processor 11, and when the instructions are executed by the processor 11, the electronic device 1 can execute all or part of the flow of the methods in the embodiments described below. In one embodiment, the electronic device 1 may be the server 30, which is used to execute the method for identifying a specified technology in threat intelligence.
The memory 12 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.
The present application also provides a computer-readable storage medium storing a computer program executable by a processor 11 to perform a method for identifying a technology specified in threat intelligence provided by the present application.
Fig. 3 is a schematic flowchart of a method for identifying a specified technology in threat intelligence provided by an embodiment of the present application. As shown in Fig. 3, the method may include the following steps 310 to 350.
Step 310: preprocessing the network threat intelligence to obtain a word sequence corresponding to each paragraph in the network threat intelligence.
This scheme is used to identify content containing a specified technology in cyber threat intelligence. Here, the specified technology is a technology that users of cyber threat intelligence are particularly concerned about, and it can be configured according to the users' needs. There may be one or more specified technologies. Illustratively, the specified technology may be a technique in the MITRE ATT&CK (Adversarial Tactics, Techniques, and Common Knowledge) knowledge base, or a technique in the CAPEC (Common Attack Pattern Enumeration and Classification) dataset. A specified technology may include multiple sub-techniques.
After the server obtains the cyber threat intelligence to be identified from the internet or from local storage, it can preprocess the cyber threat intelligence, splitting it into a plurality of paragraphs and obtaining the word sequence corresponding to each paragraph. Each word sequence includes a plurality of words from its paragraph.
Step 320: For the word sequence corresponding to each paragraph, add a word mask to the word sequence and input it into the trained cloze model to obtain the predicted word, corresponding to the word mask, that is output by the cloze model.
After obtaining the word sequence of each paragraph in the cyber threat intelligence, the server may add a word mask to the word sequence corresponding to each paragraph, where the word mask indicates the position in the word sequence at which a new word is to be predicted. For example, the word mask may be added at the head of the word sequence, i.e., the word mask is immediately followed by the word sequence; alternatively, the word mask may be added at the end of the word sequence, i.e., the word sequence is immediately followed by the word mask. The form of the word mask may be pre-configured; for example, the word mask may be MASK.
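The two mask placements described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `add_word_mask`, the `position` parameter, and the sample word sequence are assumptions introduced here, and a real model's tokenizer may require its own mask token rather than the literal string MASK.

```python
from typing import List

# Mask form named in the text; an actual model tokenizer may define its own token.
MASK = "MASK"


def add_word_mask(words: List[str], position: str = "tail") -> List[str]:
    """Return a copy of the word sequence with one word mask added (hypothetical helper)."""
    if position == "head":
        return [MASK] + words  # the mask comes first, the word sequence follows
    if position == "tail":
        return words + [MASK]  # the word sequence comes first, the mask follows
    raise ValueError("position must be 'head' or 'tail'")


# Hypothetical stemmed word sequence for one paragraph.
sequence = ["adversari", "use", "sudoer", "file", "elevat", "privileg"]
masked_head = add_word_mask(sequence, "head")
masked_tail = add_word_mask(sequence, "tail")
```

The original sequence is left unmodified, so the same word sequence can later be passed unmasked to the classification step.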
After the word mask is added to the word sequence, the word sequence may be input into the cloze model. The cloze model can be obtained by training a natural language model and is used to predict a new word based on textual context. The natural language model may be, but is not limited to, BERT (Bidirectional Encoder Representations from Transformers), T5 (Text-To-Text Transfer Transformer), mT5 (a massively multilingual pre-trained text-to-text transformer), and the like. Through the cloze model, the server can generate a predicted word for the position of the word mask from the other words in the word sequence. The predicted word is the word most likely to occur at the position of the word mask given the other words in the word sequence.
For each paragraph, the cloze model outputs the predicted word corresponding to the word sequence of the paragraph.
Step 330: For the word sequence corresponding to each paragraph, input the word sequence into the trained technical classification model to obtain a plurality of prediction categories output by the technical classification model and a confidence corresponding to each prediction category, and select the several prediction categories with the highest confidences as the target prediction categories corresponding to the paragraph; wherein each prediction category indicates a technology name belonging to the specified technology.
For the word sequence corresponding to each paragraph, the server may input the word sequence into the technical classification model. The technical classification model is used to classify text and can be obtained by training a classification model. The classification model may be, but is not limited to, fastText, SVM (Support Vector Machine), GBDT (Gradient Boosting Decision Tree), and the like.
The prediction categories that the technical classification model can output may be configured as required. For example, if the specified technology includes the technology names of n sub-techniques, the technical classification model may be trained to classify text into those n sub-techniques.
The server processes the word sequence through the technical classification model, which outputs a plurality of prediction categories and the confidence corresponding to each prediction category. The server may sort the prediction categories in descending order of confidence and select the several top-ranked prediction categories as the target prediction categories corresponding to the paragraph. The number of prediction categories selected may be configured as needed; for example, the server may select the two prediction categories with the highest confidences as the target prediction categories.
For each paragraph, the server can thus generate and select several target prediction categories through the technical classification model, where each target prediction category indicates the technology name of a sub-technique that the content of the paragraph may contain.
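As a minimal sketch of the selection in step 330, the snippet below sorts prediction categories by confidence and keeps the top k. The function name `select_target_categories`, the dictionary layout of the model output, and the confidence values are all assumptions made for illustration; the source only states that the number of selected categories is configurable and gives two as an example.

```python
from typing import Dict, List, Tuple


def select_target_categories(
    confidences: Dict[str, float], k: int = 2
) -> List[Tuple[str, float]]:
    """Keep the k prediction categories with the highest confidence (hypothetical helper)."""
    ranked = sorted(confidences.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical classifier output for one paragraph: technology name -> confidence.
model_output = {
    "Setuid and Setgid": 0.18,
    "Sudo and Sudo Caching": 0.71,
    "Bypass User Account Control": 0.11,
}
targets = select_target_categories(model_output, k=2)
target_names = [name for name, _ in targets]
```

Keeping the confidence next to each name makes it easy to log or threshold the selected categories later, although the described method uses only the ranking.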
Step 340: For each paragraph, judge whether any target prediction category corresponding to the paragraph contains the predicted word corresponding to the paragraph.
For any paragraph, the server may check each target prediction category of the paragraph to see whether it contains the predicted word corresponding to the paragraph, and thereby determine whether at least one target prediction category contains the predicted word.
Step 350: Determine whether the paragraph includes the specified technology according to the judgment result corresponding to each paragraph.
The server can determine, according to the judgment result corresponding to each paragraph, whether the paragraph includes content corresponding to the specified technology. If any paragraph of the cyber threat intelligence includes the specified technology, the server can extract that paragraph so that the content related to the specified technology can be used later.
Through the above measures, after the cyber threat intelligence is divided into a plurality of paragraphs, the specified technology is identified in each paragraph by means of the cloze model and the technical classification model, so that paragraphs containing content related to the specified technology are accurately identified.
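Steps 340 and 350 amount to a containment check followed by a per-paragraph decision, which can be sketched as follows. The source says only that a target prediction category "includes" the predicted word; comparing lower-cased whole words is an assumption of this sketch, as are the helper names and the sample data.

```python
from typing import Iterable, List, Tuple


def category_contains_word(category: str, predicted_word: str) -> bool:
    """Assumed rule: case-insensitive whole-word containment in the technology name."""
    return predicted_word.lower() in category.lower().split()


def paragraph_has_specified_technology(
    target_categories: Iterable[str], predicted_word: str
) -> bool:
    """Step 340: true if at least one target category contains the predicted word."""
    return any(category_contains_word(c, predicted_word) for c in target_categories)


# Hypothetical per-paragraph results: (paragraph id, target categories, predicted word).
paragraph_results: List[Tuple[str, List[str], str]] = [
    ("p1", ["Sudo and Sudo Caching", "Setuid and Setgid"], "sudo"),
    ("p2", ["Sudo and Sudo Caching", "Setuid and Setgid"], "registry"),
]

# Step 350: keep the paragraphs judged to include the specified technology.
flagged = [
    pid
    for pid, categories, word in paragraph_results
    if paragraph_has_specified_technology(categories, word)
]
```

Because word sequences are stemmed during preprocessing, a practical implementation would likely need to stem the technology names as well, or compare on stems; that detail is not specified in the text.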
In one embodiment, before the specified technology in the cyber threat intelligence is identified through steps 310 to 350 above, the cyber threat intelligence may undergo a preliminary identification. Fig. 4 is a schematic flowchart of a preliminary identification method for a specified technology in threat intelligence provided by an embodiment of the present application; as shown in Fig. 4, the method may include the following steps 410 to 430.
Step 410: Perform regular-expression matching on the cyber threat intelligence using a plurality of technology names under the specified technology, and judge whether any technology name is matched.
After obtaining the cyber threat intelligence, the server may use the technology names of all sub-techniques under the specified technology to perform regular-expression matching on the cyber threat intelligence and check whether any technology name is matched.
Step 420: If any technology name is matched, determine that the cyber threat intelligence includes the specified technology.
In one case, if any technology name is matched, the cyber threat intelligence contains information related to the specified technology indicated by that technology name. In this case, the identification flow of steps 310 to 350 need not be performed on the cyber threat intelligence.
Step 430: If no technology name is matched, continue to execute the step of preprocessing the cyber threat intelligence.
In the other case, if no technology name under the specified technology is matched, the cyber threat intelligence does not directly contain a technology name. The identification flow of steps 310 to 350 then needs to be performed on the cyber threat intelligence, so that content related to the specified technology can still be identified when the cyber threat intelligence contains only a technical description.
Through this preliminary identification flow, when the cyber threat intelligence contains a technology name under the specified technology, the specified technology can be identified quickly, which reduces the workload of the identification task.
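The preliminary identification of steps 410 to 430 can be sketched with Python's standard `re` module. The helper names, the case-insensitive and word-boundary matching, and the two sample reports are assumptions of this sketch; the source specifies only that technology names are matched by regular expression.

```python
import re
from typing import Iterable, Optional


def build_name_pattern(technology_names: Iterable[str]) -> "re.Pattern[str]":
    """Compile one alternation over all technology names (hypothetical helper)."""
    # Longest names first so that a longer name wins over a shorter name it contains.
    escaped = sorted((re.escape(n) for n in technology_names), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def match_technology_name(text: str, pattern: "re.Pattern[str]") -> Optional[str]:
    """Step 410: return the first matched technology name, or None if nothing matches."""
    found = pattern.search(text)
    return found.group(0) if found else None


names = ["Sudo and Sudo Caching", "Setuid and Setgid"]
pattern = build_name_pattern(names)

report_with_name = "The actor relied on T1548.003 Sudo and Sudo Caching to gain root."
report_without_name = "Adversaries may use the sudoers file to elevate privileges."

hit = match_technology_name(report_with_name, pattern)      # step 420 applies
miss = match_technology_name(report_without_name, pattern)  # step 430 applies
```

When `match_technology_name` returns a name, the report can be flagged immediately; when it returns None, the report proceeds to preprocessing and the model-based flow.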
In an embodiment, Fig. 5 is a detailed flowchart of step 310 in Fig. 3 provided by an embodiment of the present application; as shown in Fig. 5, the preprocessing may include the following steps 311 to 313.
Step 311: Divide the cyber threat intelligence into several paragraphs.
The server may divide the cyber threat intelligence into several paragraphs. The server may directly split out each natural paragraph of the cyber threat intelligence, thereby obtaining a plurality of paragraphs. Alternatively, the server may split out each natural paragraph and then merge adjacent ones (for example, merge every two adjacent natural paragraphs into one paragraph), thereby obtaining a plurality of paragraphs. Alternatively, the server may take a number of consecutive sentences as one paragraph, thereby dividing the cyber threat intelligence into a plurality of paragraphs; illustratively, every 10 consecutive sentences form one paragraph.
Step 312: Perform word segmentation on each paragraph, and filter stop words and invalid words out of the word segmentation results.
For each paragraph, the server may perform word segmentation on the paragraph to obtain a plurality of word segmentation results, each of which is a word. The server may filter the stop words and invalid words out of these results by means of a stop word list and an invalid word list, thereby obtaining the filtered word segmentation results of the paragraph.
Step 313: For each paragraph, perform stem extraction on the filtered word segmentation results to obtain the word sequence corresponding to the paragraph.
For the filtered word segmentation results of any paragraph, the server may check whether there is a word from which a stem can be extracted; if so, the server may remove the suffix and extract the stem. For example, common suffixes in English text include "ing" and "s". Words in the word segmentation results that have no suffix need no processing. After stems are extracted from the words that have suffixes, the extracted stems and the other words without suffixes form the word sequence corresponding to the paragraph.
Through the above measures, the cyber threat intelligence can be processed into word sequences corresponding to a plurality of paragraphs.
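Steps 311 to 313 can be sketched with the standard library alone. Everything specific here is an assumption made for illustration: splitting on blank lines stands in for splitting natural paragraphs, the stop word and invalid word lists are tiny placeholders, and the suffix stripper is a deliberately naive substitute for a real stemmer such as the Porter algorithm.

```python
import re
from typing import List

# Placeholder lists; real stop word and invalid word lists would be far larger.
STOP_WORDS = {"a", "an", "and", "as", "may", "of", "or", "the", "to"}
INVALID_WORDS = {"http", "https", "www"}
# Naive suffix list, checked in order; not a full stemming algorithm.
SUFFIXES = ("ing", "es", "s")


def split_paragraphs(text: str) -> List[str]:
    """Step 311: treat blank-line-separated blocks as natural paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def stem(word: str) -> str:
    """Step 313: strip one known suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def to_word_sequence(paragraph: str) -> List[str]:
    """Steps 312 and 313: segment, filter, then stem."""
    tokens = re.findall(r"[a-z0-9]+", paragraph.lower())
    kept = [t for t in tokens if t not in STOP_WORDS and t not in INVALID_WORDS]
    return [stem(t) for t in kept]


report = (
    "Adversaries may use the sudoers file.\n\n"
    "Spawning processes as other users."
)
sequences = [to_word_sequence(p) for p in split_paragraphs(report)]
```

For the sample report, the first paragraph becomes `["adversari", "use", "sudoer", "file"]` and the second becomes `["spawn", "process", "other", "user"]`.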
In an embodiment, referring to fig. 6, a flowchart of a training method for a complete blank-filling model provided in an embodiment of the present application is shown in fig. 6, and the method may include the following steps 610 to 630.
Step 610: and replacing at least one word in the sample corpus by the word mask according to the sample corpus in the sample data set to obtain the specified sample corpus.
The sample data set may include a plurality of sample corpuses, each sample corpus including a technical name of a subdivision technique under a specified technique and a technical description of the subdivision technique.
For any sample corpus, the server side can select at least one word from the sample corpus, and replace the position of the word in the sample corpus with a word mask, so that the specified sample corpus is obtained. Illustratively, the sample corpus includes 10 words, and the 2 nd word is selected to be replaced with a word mask, thereby obtaining a specified sample corpus of 9 words plus 1 word mask.
In an embodiment, when replacing at least one word in the sample corpus with a word mask, the server may complete the replacement in one or more of the following ways.
First way: the server can select a word from the technical name contained in the sample corpus and replace it with a word mask. Because a technical name usually consists of several words, a plurality of specified sample corpora can be obtained from one sample corpus by selecting different words in the technical name and replacing each with a word mask.
Second way: the server can select a related word of the specified technology from the technical description contained in the sample corpus and replace it with a word mask. Here, related words are words relevant to the specified technology, and they may be preconfigured manually. For example, the related words may be "protocol" and "command". The server can search the technical description contained in the sample corpus for the preconfigured related words and replace any related word found with a word mask. Because the technical description may include several related words, a plurality of specified sample corpora can be obtained from one sample corpus by selecting different related words in the technical description and replacing each with a word mask.
Third way: the server can randomly select at least one word in the sample corpus and replace each randomly selected word with a word mask.
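The three replacement ways above can be sketched as follows. This is an illustrative sketch only: the `[MASK]` token, the function names, and the assumption that a sample corpus is the technical-name words followed by the technical-description words are hypothetical and not taken from the application.

```python
import random

MASK = "[MASK]"  # placeholder; the real mask symbol depends on the model used


def mask_at(words, index):
    """Return (masked copy of words, replaced word)."""
    masked = list(words)
    replaced = masked[index]
    masked[index] = MASK
    return masked, replaced


def mask_name_words(name_words, description_words):
    """First way: one specified sample corpus per word of the technical name."""
    corpus = name_words + description_words
    return [mask_at(corpus, i) for i in range(len(name_words))]


def mask_related_words(name_words, description_words, related_words):
    """Second way: one specified sample corpus per related word in the description."""
    corpus = name_words + description_words
    offset = len(name_words)
    return [
        mask_at(corpus, offset + i)
        for i, word in enumerate(description_words)
        if word in related_words
    ]


def mask_random_words(name_words, description_words, count, rng):
    """Third way: mask `count` randomly chosen positions in one corpus."""
    corpus = name_words + description_words
    masked = list(corpus)
    replaced = {}  # position -> original word
    for i in rng.sample(range(len(corpus)), count):
        replaced[i] = masked[i]
        masked[i] = MASK
    return masked, replaced
```

Each function returns the replaced word (or words) together with the masked corpus, since the replaced word is what the prediction is later compared against during training.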
Step 620: input the specified sample corpus into a pre-trained model to obtain a sample prediction result corresponding to the word mask in the specified sample corpus.
The server may input the specified sample corpus into a pre-trained model, which may be a model obtained by training a natural language model such as BERT, T5, or mT5. Through the pre-trained model, the server can predict the most probable words at the position of the word mask in the specified sample corpus to obtain the sample prediction result. The sample prediction result may include a plurality of sample prediction words, each with a corresponding matching degree between 0 and 1.
Step 630: adjust the model parameters of the pre-trained model according to the sample prediction result corresponding to the word mask in the specified sample corpus and the replaced word, to obtain the cloze model.
For the sample prediction result corresponding to the word mask in each specified sample corpus, the server can look up, in the sample prediction result, the word that was replaced at the position of the word mask, thereby obtaining the matching degree of the replaced word. The server can then use a loss function to evaluate the difference between this matching degree and the target matching degree of the replaced word, and adjust the model parameters of the pre-trained model accordingly. Here, the target matching degree is 1.
After the model parameters are adjusted, the method may return to step 620 and input the specified sample corpus into the adjusted pre-trained model again. After multiple rounds of iterative training, the trained cloze model is obtained.
Through the above measures, a cloze model capable of outputting prediction words for the positions of word masks in text can be trained.
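As a worked illustration of step 630, the sketch below computes a loss from the matching degree of the replaced word against the target matching degree of 1. The application does not name a specific loss function; negative log-likelihood (cross-entropy) is used here purely as an assumption, and `matching_degree` and `mask_loss` are hypothetical names. The parameter update itself, which would be done by a training framework, is not shown.

```python
import math


def matching_degree(prediction: dict[str, float], replaced_word: str) -> float:
    """Look up the matching degree of the replaced word (0.0 if not predicted)."""
    return prediction.get(replaced_word, 0.0)


def mask_loss(prediction: dict[str, float], replaced_word: str,
              eps: float = 1e-12) -> float:
    """Negative log-likelihood of the replaced word; zero when its degree is 1."""
    p = matching_degree(prediction, replaced_word)
    return -math.log(max(p, eps))  # eps avoids log(0) for unpredicted words


# Hypothetical sample prediction result for one word mask.
prediction = {"command": 0.5, "protocol": 0.3, "script": 0.2}
loss = mask_loss(prediction, "command")  # equals -log(0.5)
```

The loss shrinks toward zero as the matching degree of the replaced word approaches the target of 1, which is the direction in which the model parameters are adjusted over the training rounds.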
In an embodiment, referring to fig. 7, which is a flowchart of a training method for the technical classification model provided in an embodiment of the present application, the method may include the following steps 710 to 720.
Step 710: input the technical description included in the sample corpus into a classification model to obtain the sample prediction category output by the classification model.
Here, the classification model may be any one of FastText, SVM, GBDT, and the like.
The server can input the technical description in the sample corpus into the classification model to obtain the sample prediction category output by the classification model. For a classification model that can process natural language directly, the technical description can be input into the classification model as is; for a classification model that cannot process natural language directly, the technical description can first be converted into a corresponding multidimensional vector by word vector conversion, and the multidimensional vector is then input into the classification model.
Step 720: adjust the model parameters of the classification model according to the difference between the sample prediction category of the sample corpus and the technical name contained in the sample corpus, to obtain the technical classification model.
The server can evaluate, through a loss function, the difference between the sample prediction category of the sample corpus and the technical name of the sample corpus, and adjust the model parameters of the classification model accordingly. After the adjustment, the method may return to step 710 and input the technical description in the sample corpus into the classification model again to further adjust the model parameters. After multiple rounds of iterative training, the trained technical classification model is obtained.
Through the above measures, a technical classification model that performs technical classification based on a corpus can be obtained through training.
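Steps 710 and 720 can be illustrated with a small stand-in classifier. The sketch below is not FastText, SVM, or GBDT; it substitutes a plain softmax regression over bag-of-words count vectors, trained with a cross-entropy loss, only to show the loop of predicting, measuring the difference from the technical name, and adjusting parameters. All function names, the learning rate, and the epoch count are assumptions.

```python
import math


def bag_of_words(words, vocab):
    """Convert a word sequence into a count vector over a fixed vocabulary."""
    vector = [0.0] * len(vocab)
    for word in words:
        if word in vocab:
            vector[vocab[word]] += 1.0
    return vector


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train(samples, labels, epochs=200, lr=0.1):
    """samples: technical descriptions as word lists; labels: technical names."""
    vocab = {w: i for i, w in enumerate(sorted({w for s in samples for w in s}))}
    names = sorted(set(labels))
    weights = [[0.0] * len(vocab) for _ in names]
    for _ in range(epochs):
        for words, label in zip(samples, labels):
            x = bag_of_words(words, vocab)
            probs = softmax([sum(w * v for w, v in zip(row, x)) for row in weights])
            for k, row in enumerate(weights):
                # Gradient of the cross-entropy loss with respect to class score k.
                error = probs[k] - (1.0 if names[k] == label else 0.0)
                for j, value in enumerate(x):
                    row[j] -= lr * error * value
    return vocab, names, weights


def predict(words, vocab, names, weights):
    """Return (technical name, confidence) pairs, highest confidence first."""
    x = bag_of_words(words, vocab)
    probs = softmax([sum(w * v for w, v in zip(row, x)) for row in weights])
    return sorted(zip(names, probs), key=lambda pair: pair[1], reverse=True)
```

The ranked output of `predict` has the same shape as what the later identification step expects: several prediction categories, each with a confidence, from which the highest-ranked ones can be taken as target prediction categories.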
In an embodiment, the sample data set may be constructed before the cloze model or the technical classification model is trained. In response to a user operation, the server can extract content related to the specified technology from network threat intelligence and organize it into a corpus by technical name and technical description. After stop words and invalid words are filtered out of the corpus, stems are extracted from the filtered words, and a sample corpus is constructed from the extracted stems and the remaining words without suffixes. The server can construct the sample data set from a plurality of sample corpora.
Referring to fig. 8, which is an overall schematic diagram of the method for identifying a specified technology in threat intelligence provided in an embodiment of the present application: first, a large amount of network threat intelligence may be obtained from a server, and content related to the specified technology is extracted from it by manual operation; the specified technology in fig. 8 is the ATT&CK technology. Sample corpora are constructed from the extracted content, a sample data set is constructed from the plurality of sample corpora, and the cloze model and the technical classification model are trained on the sample data set. After the two models are trained, the cloze model and the technical classification model are used to extract content related to the ATT&CK technology from the threat intelligence to be detected.
In an embodiment, when the server determines, according to the judgment result corresponding to each paragraph, whether the paragraph includes the specified technology: in one case, if the judgment result corresponding to any paragraph indicates that a target prediction category containing the prediction word exists, the server determines that the paragraph includes the specified technology. When one target prediction category contains the prediction word, it may be determined that the paragraph includes content related to the subdivision technique indicated by that target prediction category. If at least two target prediction categories each contain the prediction word, it is determined that the paragraph includes content related to the subdivision technique indicated by the target prediction category with the highest confidence.
In another case, if the judgment result corresponding to any paragraph indicates that no target prediction category containing the prediction word exists, it is determined that the paragraph does not include the specified technology.
Referring to fig. 9, which is a schematic flow diagram of a method for identifying a specified technology in threat intelligence according to another embodiment of the present application: the network threat intelligence is divided into a plurality of paragraphs. For paragraph 1, the cloze model generates a prediction word W1, and the technical classification model generates target prediction categories R1 and R2, where the confidence corresponding to R1 is S1 and the confidence corresponding to R2 is S2.
The server can determine whether W1 is contained in R1 or R2. R1 and R2 are the technical names indicated by the target prediction categories. If paragraph 1 contains the specified technology, the cloze model can generate a word of the technical name from paragraph 1, and the target prediction category obtained when the technical classification model classifies paragraph 1 will necessarily include the word predicted by the cloze model.
In one case, neither target prediction category R1 nor R2 contains the prediction word W1, which indicates that paragraph 1 does not describe content related to the ATT&CK technology.
In another case, the target prediction category R1 contains the prediction word W1 and the target prediction category R2 does not, which indicates that paragraph 1 describes the subdivision technique R1 under ATT&CK.
In another case, the target prediction category R2 contains the prediction word W1 and the target prediction category R1 does not, which indicates that paragraph 1 describes the subdivision technique R2 under ATT&CK.
In yet another case, the target prediction categories R1 and R2 both contain the prediction word W1, and the target prediction category with the higher confidence is the subdivision technique under ATT&CK described in paragraph 1.
Through the above measures, the paragraphs that include ATT&CK-related content can be identified from the network threat intelligence.
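The four cases above reduce to one rule: keep the target prediction categories whose technical name contains the prediction word, and among those pick the one with the highest confidence. A minimal sketch follows. The function name is hypothetical, and the interpretation of "contains" as a case-insensitive whole-word match is an assumption, since the text does not fix the matching rule; the technical names in the usage lines are illustrative inputs only.

```python
def identify_subtechnique(predicted_word, target_categories):
    """Return the matched technical name, or None if no category matches.

    target_categories: list of (technical_name, confidence) pairs, e.g.
    [(R1, S1), (R2, S2)] in the notation of fig. 9.
    """
    matches = [
        (name, confidence)
        for name, confidence in target_categories
        # Assumption: whole-word, case-insensitive containment.
        if predicted_word.lower() in name.lower().split()
    ]
    if not matches:
        return None  # the paragraph does not describe the specified technology
    # One match, or several: take the category with the highest confidence.
    return max(matches, key=lambda pair: pair[1])[0]


# Only R1 contains W1.
only_r1 = identify_subtechnique(
    "command", [("Command and Scripting Interpreter", 0.6), ("Remote Services", 0.3)]
)
# Both contain W1; the higher-confidence category is chosen.
both = identify_subtechnique(
    "services", [("Remote Services", 0.35), ("External Remote Services", 0.55)]
)
# Neither contains W1.
neither = identify_subtechnique(
    "email", [("Command and Scripting Interpreter", 0.6), ("Remote Services", 0.3)]
)
```

Returning `None` for the no-match case keeps the paragraph-level decision (includes the specified technology or not) separate from the choice of subdivision technique.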
Fig. 10 is a block diagram of an apparatus for identifying a specified technology in threat intelligence according to an embodiment of the present invention. As shown in fig. 10, the apparatus may include:
a preprocessing module 1010, configured to preprocess network threat intelligence to obtain a word sequence corresponding to each paragraph in the network threat intelligence;
a first prediction module 1020, configured to, for the word sequence corresponding to each paragraph, add a word mask to the word sequence and input it into a trained cloze model to obtain a prediction word, output by the cloze model, corresponding to the word mask;
a second prediction module 1030, configured to, for the word sequence corresponding to each paragraph, input the word sequence into a trained technical classification model to obtain a plurality of prediction categories output by the technical classification model and a confidence corresponding to each prediction category, and select the several prediction categories with the highest confidences as target prediction categories corresponding to the paragraph; wherein each prediction category indicates a technical name belonging to a specified technology;
a judging module 1040, configured to, for each paragraph, judge whether there exists any target prediction category corresponding to the paragraph that contains the prediction word corresponding to the paragraph;
a determining module 1050, configured to determine, according to the judgment result corresponding to each paragraph, whether the paragraph includes the specified technology.
The implementation process of the functions and actions of each module in the device is specifically detailed in the implementation process of the corresponding step in the identification method of the technology specified in the threat information, and is not described herein again.
In the embodiments provided in the present application, the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, or the part of it that contributes over the prior art, may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

Claims (10)

1. A method for identifying a technology specified in threat intelligence, comprising:
preprocessing network threat intelligence to obtain a word sequence corresponding to each paragraph in the network threat intelligence;
for the word sequence corresponding to each paragraph, adding a word mask to the word sequence and inputting it into a trained cloze (fill-in-the-blank) model to obtain a prediction word, output by the cloze model, corresponding to the word mask;
for the word sequence corresponding to each paragraph, inputting the word sequence into a trained technical classification model to obtain a plurality of prediction categories output by the technical classification model and a confidence corresponding to each prediction category, and selecting the several prediction categories with the highest confidences as target prediction categories corresponding to the paragraph; wherein each prediction category indicates a technical name belonging to a specified technology;
for each paragraph, judging whether there exists any target prediction category corresponding to the paragraph that contains the prediction word corresponding to the paragraph;
and determining, according to the judgment result corresponding to each paragraph, whether the paragraph includes the specified technology.
2. The method of claim 1, wherein before the preprocessing of the network threat intelligence to obtain a word sequence corresponding to each paragraph in the network threat intelligence, the method further comprises:
performing regular-expression matching on the network threat intelligence according to a plurality of technical names under the specified technology, and judging whether any technical name is matched;
if any technical name is matched, determining that the network threat intelligence includes the specified technology;
and if no technical name is matched, continuing to execute the step of preprocessing the network threat intelligence.
3. The method of claim 1, wherein preprocessing the network threat intelligence to obtain a word sequence corresponding to each paragraph in the network threat intelligence comprises:
dividing the network threat intelligence into a plurality of paragraphs;
performing word segmentation on each paragraph, and filtering stop words and invalid words out of the word segmentation result;
and for each paragraph, performing stem extraction on the filtered word segmentation result to obtain the word sequence corresponding to the paragraph.
4. The method of claim 1, wherein the cloze model is trained by:
for a sample corpus in a sample data set, replacing at least one word in the sample corpus with a word mask to obtain a specified sample corpus;
inputting the specified sample corpus into a pre-trained model to obtain a sample prediction result corresponding to the word mask in the specified sample corpus;
and adjusting model parameters of the pre-trained model according to the sample prediction result corresponding to the word mask in the specified sample corpus and the replaced word, to obtain the cloze model.
5. The method of claim 4, wherein the sample corpus comprises a technical name and a technical description;
the replacing at least one word in the sample corpus with a word mask comprises:
selecting a word from the technical name contained in the sample corpus and replacing the word with a word mask; and/or,
selecting a related word of the specified technology from the technical description contained in the sample corpus, and replacing the related word with a word mask; and/or,
randomly selecting at least one word in the sample corpus and replacing the word with a word mask.
6. The method of claim 1, wherein the technical classification model is trained by:
inputting technical descriptions included in sample corpora in a sample data set into a classification model to obtain a sample prediction category output by the classification model;
and adjusting the model parameters of the classification model according to the difference between the sample prediction category of the sample corpus and the technical name contained in the sample corpus to obtain the technical classification model.
7. The method of claim 1, wherein determining whether the paragraph includes the specified technology according to the judgment result corresponding to each paragraph comprises:
if the judgment result corresponding to any paragraph indicates that a target prediction category containing the prediction word exists, determining that the paragraph includes the specified technology;
and if the judgment result corresponding to any paragraph indicates that no target prediction category containing the prediction word exists, determining that the paragraph does not include the specified technology.
8. An apparatus for identifying a technology specified in threat intelligence, comprising:
a preprocessing module, configured to preprocess network threat intelligence to obtain a word sequence corresponding to each paragraph in the network threat intelligence;
a first prediction module, configured to, for the word sequence corresponding to each paragraph, add a word mask to the word sequence and input it into a trained cloze model to obtain a prediction word, output by the cloze model, corresponding to the word mask;
a second prediction module, configured to, for the word sequence corresponding to each paragraph, input the word sequence into a trained technical classification model to obtain a plurality of prediction categories output by the technical classification model and a confidence corresponding to each prediction category, and select the several prediction categories with the highest confidences as target prediction categories corresponding to the paragraph; wherein each prediction category indicates a technical name belonging to a specified technology;
a judging module, configured to, for each paragraph, judge whether there exists any target prediction category corresponding to the paragraph that contains the prediction word corresponding to the paragraph;
and a determining module, configured to determine, according to the judgment result corresponding to each paragraph, whether the paragraph includes the specified technology.
9. An electronic device, characterized in that the electronic device comprises:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the method for identifying a technology specified in threat intelligence of any one of claims 1-7.
10. A computer-readable storage medium, characterized in that the storage medium stores a computer program executable by a processor to perform the method of identifying technology specified in threat intelligence of any one of claims 1 to 7.
CN202211387653.0A 2022-11-07 2022-11-07 Method and device for identifying specified technology in threat intelligence, electronic equipment and storage medium Pending CN115687979A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211387653.0A CN115687979A (en) 2022-11-07 2022-11-07 Method and device for identifying specified technology in threat intelligence, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115687979A true CN115687979A (en) 2023-02-03

Family

ID=85049843

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211387653.0A Pending CN115687979A (en) 2022-11-07 2022-11-07 Method and device for identifying specified technology in threat intelligence, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115687979A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination