CN117725458A - Method and device for obtaining threat information sample data generation model - Google Patents

Method and device for obtaining a threat intelligence sample data generation model

Info

Publication number
CN117725458A
CN117725458A
Authority
CN
China
Prior art keywords
sample
model
classification
label
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311093179.5A
Other languages
Chinese (zh)
Inventor
陈镜冰
高雅丽
李小勇
李娇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Topsec Network Security Technology Co Ltd
Original Assignee
Beijing Topsec Network Security Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Topsec Network Security Technology Co Ltd filed Critical Beijing Topsec Network Security Technology Co Ltd
Priority to CN202311093179.5A priority Critical patent/CN117725458A/en
Publication of CN117725458A publication Critical patent/CN117725458A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The embodiments of the application provide a method and a device for obtaining a threat intelligence sample data generation model. The method includes: converting each sample sentence with a tag into a linear sequence to obtain a tag linearization sample data set; training a language generation model according to the tag linearization sample data set so that the language generation model learns the distribution of words and tags in each tag linearization sample sentence, thereby obtaining a target language generation model; training a relation classification model according to an original data set to obtain a target relation classification model, wherein the original data set is threat intelligence sample data containing classification labels; and taking the target language generation model and the target relation classification model together as a target threat intelligence sample data generation model. By adopting the embodiments of the application, data can be generated through the language generation model to enhance threat intelligence data, thereby alleviating the scarcity of threat intelligence data.

Description

Method and device for obtaining a threat intelligence sample data generation model
Technical Field
The application relates to the field of data enhancement processing, and in particular to a method and a device for obtaining a threat intelligence sample data generation model.
Background
Cyber threat intelligence (CTI) is data material describing network security threats to an organization. Most of it is updated and released in real time by security vendors or professional security institutions, and it exists widely on the internet in the form of unstructured text, for example in security-related communities and forums. Security analysts find trends, patterns and relations of network security threats through correlation analysis of original security threat information collected from multiple sources, and intensively study and understand existing or hidden threats, thereby creating detailed cyber threat intelligence focused on specific organizations and specific situations. Cyber threat intelligence is an important basis and data support for attack detection and prevention, attack group tracking, threat hunting, event monitoring and response, intelligence-driven vulnerability management, and dark-web intelligence discovery.
In the field of cyber threat intelligence extraction, dictionary- or rule-based methods were used initially. In practice, because language domains, diversity and styles all differ, it is difficult to design a rule template covering all languages; doing so is time-consuming and error-prone. Later, methods for extracting cyber threat intelligence based on statistical machine learning were studied, but most statistical machine learning models depend on the quality of annotated data for feature engineering, have huge data requirements, and consume a large amount of resources and labor.
With the rapid development of deep neural networks, named entity recognition and relation extraction methods based on deep learning have been proposed as a potential alternative to traditional methods, and researchers have explored this direction extensively with good results. However, because cyber threat intelligence data are difficult and costly to acquire, the problem of data scarcity persists. Data sets are very important for deep neural networks: a good data set can prevent model overfitting and improve model robustness. Data enhancement techniques have long been used in the field of computer vision, where pictures and photographs can be augmented according to simple, customized rules such as rotation, cropping and masking. Such simple rules are difficult to migrate to the text data field: rotating, cropping or masking a picture generally does not change it, but deleting or substituting a word in a sentence may completely change its original meaning. The prior-art data enhancement methods for addressing sample scarcity therefore cannot be applied to the threat intelligence field.
Disclosure of Invention
The embodiments of the application aim to provide a method and a device for obtaining a threat intelligence sample data generation model, with which data can be generated through the language generation model to enhance threat intelligence data and thereby alleviate the scarcity of threat intelligence data.
In a first aspect, an embodiment of the present application provides a method for obtaining a threat intelligence sample data generation model, where the method includes: converting each sample sentence with a tag into a linear sequence to obtain a tag linearization sample data set, wherein the tag linearization sample data set comprises a plurality of tag linearization sample sentences, each obtained by inserting each tag of the corresponding entity tags between the corresponding words; training a language generation model according to the tag linearization sample data set so that the language generation model learns the distribution of words and tags in each tag linearization sample sentence, thereby obtaining a target language generation model; training a relation classification model according to an original data set to obtain a target relation classification model, wherein the original data set is threat intelligence sample data containing classification labels; and taking the target language generation model and the target relation classification model as a target threat intelligence sample data generation model.
Some embodiments of the present application train a language generation model based on tag linearization sample sentences to obtain a target language generation model, and use the target language generation model together with a target relation classification model as a target threat intelligence sample data generation model for generating threat intelligence sample data. The embodiments of the application thereby overcome the problem that related technologies cannot enrich a sample data set by adopting existing image enhancement modes such as cropping.
In some embodiments, converting each sample sentence with a tag into a linear sequence to obtain a tag linearization sample data set includes: cutting each text corresponding to threat intelligence data into sentences to obtain a plurality of sample sentences; taking one sample sentence and inserting each entity tag in the tag sequence corresponding to that sample sentence before the corresponding word; adding a start tag BOS before and an end tag EOS after the sample sentence to obtain a tag linearization sample sentence; and repeating these steps until the tag linearization of every sample sentence is completed, thereby obtaining the tag linearization sample data set.
According to the method and the device, sample sentences are linearized, i.e., tagged sentences (sample sentences with labels) are converted into linear sequences, so that the language generation model can learn the distribution of words and tags in the input data, improving the effect of training the language generation model.
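The linearization steps above can be sketched as follows. This is a hypothetical illustration only: the tag names follow the common BIO scheme for entity annotation, which the patent does not fix, and the `keep_o` option is an assumption about how untagged words are handled.

```python
# Hypothetical sketch of tag linearization: each entity tag from the label
# sequence is inserted before its word, and BOS/EOS markers wrap the sentence.

def linearize(words, tags, keep_o=False):
    """Convert a tagged sentence into a single linear token sequence."""
    seq = ["BOS"]
    for word, tag in zip(words, tags):
        # Insert the entity tag before the word it annotates; plain "O"
        # tags may optionally be dropped to shorten the sequence.
        if tag != "O" or keep_o:
            seq.append(tag)
        seq.append(word)
    seq.append("EOS")
    return seq

words = ["APT28", "targeted", "government", "networks"]
tags = ["B-actor", "O", "B-target", "I-target"]
print(linearize(words, tags))
# ['BOS', 'B-actor', 'APT28', 'targeted', 'B-target', 'government', 'I-target', 'networks', 'EOS']
```

A language model trained on such sequences can then emit tags and words jointly, which is what makes the generated sentences directly usable as labeled samples.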
In some embodiments, cutting each text corresponding to threat intelligence data into sentences includes: cutting any text in the threat intelligence data according to the sentence-ending punctuation marks it contains, wherein the sentence-ending punctuation marks include periods, exclamation marks or question marks.
According to the method and the device, sentence splitting is carried out on the text through sentence ending punctuation marks, so that each sentence to be linearized is obtained, and sentence linearization can be achieved.
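A minimal sketch of this sentence-cutting rule, assuming English text and the three punctuation marks named above; the exact splitting rule in the patent may differ.

```python
import re

def split_sentences(text):
    """Split a threat-intelligence text on sentence-ending punctuation."""
    # Split after . ! or ? followed by whitespace, keeping the punctuation
    # attached to its sentence; drop any empty fragments.
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

text = "APT28 deployed new malware. Was the campaign targeted? Patch now!"
print(split_sentences(text))
# ['APT28 deployed new malware.', 'Was the campaign targeted?', 'Patch now!']
```

Each resulting sentence then becomes one candidate for tag linearization.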
In some embodiments, the language generation model includes a bidirectional long short-term memory (LSTM) network and a unidirectional LSTM network connected after it, wherein the unidirectional LSTM network is configured to integrate the plurality of feature representations output by the bidirectional LSTM network and further learn the correlations between them.
By connecting a unidirectional LSTM network after the bidirectional LSTM network, some embodiments of the present application further expand the context view of the model and enable the trained language generation model to extract richer context information.
In some embodiments, the language generation model further comprises a sentence input layer, an embedding layer, a linear layer, a softmax layer and a prediction layer. The sentence input layer is configured to receive input tag linearization sample sentences from the tag linearization sample data set. The embedding layer is configured to receive the tag linearization sample sentences output by the sentence input layer and vectorize them to obtain a first group of vectors. The bidirectional LSTM network is configured to receive the first group of vectors output by the embedding layer and mine the distribution of tags and words in them to obtain a second group of vectors. The unidirectional LSTM network layer is configured to receive the second group of vectors output by the bidirectional LSTM network and, while integrating their feature representations, further learn the correlations among those features to obtain a third group of vectors. The linear layer is configured to receive the third group of vectors and process them in a fully connected manner to obtain a fourth group of vectors. The softmax layer is configured to receive the fourth group of vectors and normalize them to obtain a group of probability vectors. The prediction layer is configured to receive the group of probability vectors and select the combination with the largest probability as the output of the model.
Some embodiments of the present application provide a language generation model including multiple layers, through which learning effects on input data can be improved.
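The layer stack described above can be sketched in PyTorch as below. This is a minimal illustration under stated assumptions: all layer sizes are hypothetical, the vocabulary is taken to contain both words and linearized tags, and the patent does not specify any hyperparameters.

```python
import torch
import torch.nn as nn

class TagLinearizedLM(nn.Module):
    """Sketch of the described stack: embedding -> BiLSTM -> LSTM -> linear -> softmax."""

    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Bidirectional LSTM mines the distribution of tags and words.
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        # Unidirectional LSTM integrates BiLSTM features and learns
        # correlations among them, widening the context view.
        self.lstm = nn.LSTM(2 * hidden_dim, hidden_dim, batch_first=True)
        self.linear = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        x = self.embed(token_ids)   # (batch, seq, embed_dim)
        x, _ = self.bilstm(x)       # (batch, seq, 2 * hidden_dim)
        x, _ = self.lstm(x)         # (batch, seq, hidden_dim)
        logits = self.linear(x)     # (batch, seq, vocab_size)
        # Softmax normalizes logits into next-token probabilities.
        return torch.softmax(logits, dim=-1)

model = TagLinearizedLM(vocab_size=1000)
probs = model(torch.randint(0, 1000, (1, 5)))
print(probs.shape)  # torch.Size([1, 5, 1000])
```

The prediction layer of the patent would then take the argmax (or a sample) over the last dimension of `probs` at each step.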
In some embodiments, the relationship classification model includes a second long-short-term memory network and an attention mechanism layer.
By adopting a long short-term memory network, the method and the device mitigate the long-distance dependency problem of traditional deep learning methods, while the attention mechanism effectively analyzes the correlation between model input and output, thereby acquiring more contextual semantic information.
In some embodiments, the relation classification model further comprises a word vector generation layer, a pooling layer, a feature fusion classification layer and a classification layer. The word vector generation layer is configured to take the original corpus and entity position information as input, generate word vectors from the original corpus, and append two columns identifying the entity positions after the word vectors to generate a first group of classification vectors. The second long short-term memory network is configured to receive the first group of classification vectors and mine their characteristics to obtain a second group of classification vectors. The attention mechanism layer is configured to receive the second group of classification vectors, calculate attention probabilities, and obtain a third group of classification vectors weighted by the attention probabilities. The pooling layer is configured to receive the third group of classification vectors and apply max pooling to obtain a fourth group of classification vectors representing the overall characteristics of the sentence. The feature fusion classification layer is configured to receive the fourth group of classification vectors, fuse them with the entity features, and normalize the fused vectors through a softmax activation function to obtain a fifth group of classification vectors. The classification layer is configured to receive the fifth group of classification vectors, restore them according to the number of entities to obtain the corresponding relation matrix, and output the relation matrix.
Some embodiments of the present application provide an architecture of a relational classification model, through which accuracy of a target relational classification model obtained by training on an input data classification result can be improved.
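The attention-then-pooling portion of this architecture can be sketched in numpy as below. The shapes, the scoring vector, and the dot-product attention form are illustrative assumptions; the patent does not specify how attention probabilities are computed.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_and_pool(features, score_vec):
    """Weight LSTM outputs by attention, then max-pool to a sentence vector.

    features:  (seq_len, hidden) outputs of the LSTM (second group of vectors)
    score_vec: (hidden,) hypothetical attention scoring vector
    """
    attn = softmax(features @ score_vec)   # attention probabilities per token
    weighted = features * attn[:, None]    # third group of classification vectors
    return weighted.max(axis=0)            # max pooling: overall sentence feature

rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 8))            # 6 tokens, hidden size 8
sent_vec = attend_and_pool(feats, rng.normal(size=8))
print(sent_vec.shape)  # (8,)
```

The resulting sentence vector would then be fused with entity features and passed through a softmax classifier, as the layer description above states.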
In a second aspect, some embodiments of the present application provide a method of acquiring threat intelligence sample data, the method comprising: feeding a start tag BOS corresponding to a sample sentence to be generated into the target language generation model included in a target threat intelligence sample data generation model obtained according to any embodiment of the first aspect; calculating the probability of the next word or tag through the target language generation model, and repeating this next-word-or-tag prediction until the end tag EOS corresponding to the sample sentence is generated, obtaining a threat intelligence sample sentence, wherein the target language generation model predicts and selects the word or tag with the highest probability as the next word or tag through a linear layer and a softmax layer; and repeating the above process until a target number of threat intelligence sample sentences are generated.
By predicting the tags of each sample sentence and the words corresponding to each predicted tag through the trained target language generation model, the method and the device can generate a plurality of threat intelligence sample sentences, enriching the quantity of threat intelligence sample data and alleviating the technical problem of data scarcity.
In some embodiments, calculating the probability of the next word or tag through the target language generation model comprises: in the process of generating the threat intelligence sample sentence, feeding the start tag BOS directly into the target language generation model, while the tags other than the start tag included in the threat intelligence sample sentence are obtained by sampling from the probabilities calculated according to a target formula.
Because sampling adds randomness, the language generation model of some embodiments of the present application can select similar substitutes in the same context, promoting the richness of the resulting sample data.
In some embodiments, the target formula is:

$$i^{*} \sim p(w_t = w_i \mid w_{<t}) = \frac{\exp(s_{t-1,i})}{\sum_{i'=1}^{V} \exp(s_{t-1,i'})}$$

where $s_{t-1}$ characterizes the state at stage $t-1$ (the minimum value of $t$ is 1 and the maximum value is the length of the sentence), $i^{*}$ is the sampled index of the $i$-th word $w_i$ in the vocabulary, $V$ characterizes the size of the vocabulary, $w_t$ characterizes the $t$-th word, $w_{<t}$ characterizes the words preceding the $t$-th word, and $s_{t-1,i}$, appearing in $\exp(s_{t-1,i})$, is the $i$-th element of $s_{t-1}$.
Some embodiments of the present application provide a calculation formula for quantifying probability, so that the calculation of probability values is more objective and accurate.
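The generation loop built on this formula can be sketched as below. This is a toy illustration: the `next_state` function is a hypothetical stand-in for the trained language model's output $s_{t-1}$ (its logit schedule is invented for the example), and the tiny vocabulary is likewise illustrative.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def generate(next_state, vocab, eos="EOS", max_len=20, seed=0):
    """Starting from BOS, sample the next token from the softmax
    distribution of the target formula until EOS appears."""
    rng = np.random.default_rng(seed)
    tokens = ["BOS"]
    while len(tokens) < max_len:
        probs = softmax(next_state(tokens))   # p(w_t | w_<t)
        token = rng.choice(vocab, p=probs)    # sample, rather than argmax
        tokens.append(token)
        if token == eos:
            break
    return tokens

vocab = ["B-actor", "APT28", "dropped", "malware", "EOS"]

def next_state(tokens):
    # Hypothetical model output: the EOS logit rises as the sentence grows,
    # so generation typically terminates with EOS well before max_len.
    s = np.zeros(len(vocab))
    s[-1] = 3.0 * (len(tokens) - 1)
    return s

print(generate(next_state, vocab))
```

Replacing `rng.choice` with `vocab[int(np.argmax(probs))]` would give the greedy (highest-probability) variant described in the second aspect.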
In some embodiments, calculating the probability of the next word or tag through the target language generation model comprises: using the tag generated in the previous step as input to the target language generation model to generate the next tag.
By predicting the next tag from the tag of the previous step, the method and the device improve the accuracy of tag prediction and then generate the whole sample sentence.
In some embodiments, the method further comprises: inputting each obtained threat information sample sentence into a target relation classification model to obtain a classification result; and taking the classification result as the generated threat information sample data.
Some embodiments of the application obtain a classification result of the generated threat information data through a target relation classification model, and use the classification result as a label, thereby obtaining richer threat information sample data.
In some embodiments, inputting each obtained threat intelligence sample sentence into the target relation classification model to obtain a classification result includes: importing the feature vectors corresponding to the threat intelligence sample sentences into a bidirectional long short-term memory network for correlation analysis to obtain a first vector; calculating attention probabilities with an attention mechanism and weighting the output features of the bidirectional LSTM network by the attention probabilities to obtain a second vector; applying max pooling to the second vector to obtain the overall text features; and fusing the local text features with the overall text features, feeding the fused features into a classifier for classification, and outputting the classification result.
Some embodiments of the present application obtain the classification result of each generated threat intelligence sample sentence through the provided target relation classification model, thereby improving the accuracy of the obtained classification result.
In a third aspect, some embodiments of the present application provide a method of training a threat intelligence classification model, the method comprising: generating sample data according to the target threat intelligence sample data generation model obtained according to any one of the embodiments of the first aspect; training the threat information classification model at least based on the sample data to obtain a target threat information classification model.
In a fourth aspect, some embodiments of the present application provide a method of threat intelligence classification, the method comprising: inputting threat information data to be classified into a target threat information classification model obtained according to the embodiment of the third aspect; and obtaining the category of the threat information data to be classified through the target threat information classification model.
In a fifth aspect, some embodiments of the present application provide an apparatus for obtaining a threat intelligence sample data generation model, the apparatus comprising: a tag linearization sample data set acquisition module configured to convert each sample sentence with a tag into a linear sequence to obtain a tag linearization sample data set, wherein the tag linearization sample data set comprises a plurality of tag linearization sample sentences, each obtained by inserting each tag of the corresponding entity tags between the corresponding words, a word and its corresponding tag forming a word pair; a language generation model training module configured to train a language generation model according to the tag linearization sample data set so that the language generation model learns the distribution of words and tags in each tag linearization sample sentence, thereby obtaining a target language generation model; a relation classification model training module configured to train a relation classification model according to an original data set to obtain a target relation classification model, wherein the original data set is threat intelligence sample data containing classification labels; and a target threat intelligence sample data generation model acquisition module configured to take the target language generation model and the target relation classification model as one target threat intelligence sample data generation model.
In a sixth aspect, some embodiments of the present application provide an apparatus for acquiring threat intelligence sample data, the apparatus comprising: a start tag input module configured to feed a start tag BOS corresponding to a sample sentence to be generated into the target language generation model included in a target threat intelligence sample data generation model obtained according to any embodiment of the first aspect; and an in-sentence tag or word determining module configured to calculate the probability of the next word or tag through the target language generation model and repeat this next-word-or-tag prediction until the end tag EOS corresponding to the sample sentence is generated, obtaining a threat intelligence sample sentence, wherein the target language generation model predicts and selects the word or tag with the highest probability as the next word or tag through a linear layer and a softmax layer.
In a seventh aspect, some embodiments of the present application provide an apparatus for training a threat intelligence classification model, the apparatus comprising: a sample data generation module configured to generate sample data according to a target threat intelligence sample data generation model as obtained by any of the embodiments of the first aspect; and the training module is configured to train the threat information classification model at least based on the sample data to obtain a target threat information classification model.
In an eighth aspect, some embodiments of the present application provide an apparatus for threat intelligence classification, the apparatus comprising: a to-be-classified data input module configured to input threat intelligence data to be classified into the target threat intelligence classification model obtained according to the embodiments of the third aspect; and an output module configured to obtain the category of the threat intelligence data to be classified through the target threat intelligence classification model.
In a ninth aspect, some embodiments of the present application provide a computer readable storage medium having stored thereon a computer program, which when executed by a processor, performs a method as in any of the embodiments described above.
In a tenth aspect, some embodiments of the present application provide an electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, may implement a method as in any one of the embodiments above.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and should not be considered as limiting the scope; other related drawings may be obtained from these drawings without inventive effort by a person skilled in the art.
FIG. 1 is a schematic diagram of a process for obtaining a target language generation model according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a process for obtaining a target relationship classification model according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a target threat intelligence sample data generation model according to an embodiment of the present application;
FIG. 4 is one of the flowcharts of the method for obtaining threat intelligence sample data generation model provided in the embodiments of the present application;
fig. 5 is a schematic diagram of a process for obtaining a tag linearization sample sentence according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a language generation model provided in an embodiment of the present application;
FIG. 7 is a schematic diagram of a relational classification model provided in an embodiment of the present application;
FIG. 8 is a block diagram of an apparatus for obtaining a threat intelligence sample data generation model according to an embodiment of the present application;
FIG. 9 is a block diagram of an apparatus for acquiring threat intelligence sample data according to an embodiment of the present application;
FIG. 10 is a block diagram of an apparatus for training threat intelligence classification models provided in an embodiment of the application;
FIG. 11 is a block diagram of an apparatus for threat intelligence classification provided in an embodiment of the application;
fig. 12 is a schematic diagram of electronic device composition according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only to distinguish the description, and are not to be construed as indicating or implying relative importance.
To address the scarcity of threat intelligence data and the fact that sample data cannot be enriched by traditional data enhancement modes (for example, image cropping), embodiments of the present application provide a model capable of generating threat intelligence sample data. The target threat intelligence sample data generation model performs generation-based threat intelligence data enhancement and mainly comprises two modules: a language generation module and a relation classification module. The language generation module completes the generation of labeled data, and the relation classification module completes the classification of entities in the generated data, thereby completing the enhancement of the cyber threat intelligence data set.
Referring to fig. 1, fig. 1 shows the process of obtaining the target language generation model included in the target threat intelligence sample data generation model provided in some embodiments of the present application. In fig. 1, sample data processed by the tag linearization processing module is input into the language generation model for training; when training finishes, the network parameters are saved, and the saved network parameters are then loaded into the corresponding model to construct the target language generation model.
The implementation of the tag linearization processing module is described below; to avoid redundancy, it is not elaborated here.
Fig. 2 is a schematic diagram of the process of obtaining the target relation classification model according to an embodiment of the present application. Data of the original data set (i.e., collected threat intelligence sample data with annotated classification labels) are input into the relation classification model to train it; when training is completed, the network parameters are saved and then loaded into the corresponding network model, so that the target relation classification model shown in fig. 2 is obtained.
FIG. 3 shows a target threat intelligence sample data generation model provided in some embodiments of the present application. In some embodiments the model includes a target language generation model and a target relation classification model; in other embodiments it further includes a normalization processing module, which may convert the tag-linearized data generated by the target language generation model into generated data in a target format, for example the standard BIO format.
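The normalization step described above can be sketched as follows: a tag-linearized generated sequence is converted back into standard BIO format, one word/label pair per token. The tag names are illustrative assumptions, as is the convention that a generated tag token applies to the word immediately after it.

```python
# Hypothetical sketch of normalization: linearized sequence -> BIO pairs.

def to_bio(linearized, tag_prefixes=("B-", "I-")):
    pairs = []
    pending = "O"
    for token in linearized:
        if token in ("BOS", "EOS"):
            continue                 # boundary markers carry no label
        if token.startswith(tag_prefixes):
            pending = token          # a tag token labels the next word
        else:
            pairs.append((token, pending))
            pending = "O"            # words without a preceding tag get "O"
    return pairs

seq = ["BOS", "B-actor", "APT28", "dropped", "B-malware", "Emotet", "EOS"]
print(to_bio(seq))
# [('APT28', 'B-actor'), ('dropped', 'O'), ('Emotet', 'B-malware')]
```

The resulting word/label pairs match the format of common NER training corpora, so the generated data can be appended directly to the original data set.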
It should be noted that, in some embodiments of the present application, at least one of the normalization processing module or the tag linearization processing module of fig. 1 may be integrated into the target language generation model or the target relationship classification model.
The process of obtaining the target threat intelligence sample data generation model of fig. 3 is described in exemplary detail below in conjunction with fig. 4.
As shown in fig. 4, an embodiment of the present application provides a method for obtaining a threat intelligence sample data generation model, the method includes:
S101, converting each sample sentence with a tag into a linear sequence to obtain a tag linearization sample data set.
It should be noted that the tag linearization sample data set includes a plurality of tag linearization sample sentences, and each tag linearization sample sentence is obtained by inserting each entity tag before the corresponding word.
S102, training a language generation model according to the label linearization sample data set so that the language generation model learns the word and the distribution rule of the labels in each label linearization sample sentence to obtain a target language generation model.
And S103, training the relation classification model according to the original data set to obtain a target relation classification model.
It should be noted that the original data set is threat intelligence sample data including a classification tag.
S104, taking the target language generation model and the target relation classification model as a target threat information sample data generation model.
It is to be understood that some embodiments of the present application train a language generation model based on tag linearization sample sentences to obtain a target language generation model, and use the target language generation model together with a target relationship classification model as a target threat information sample data generation model for generating threat information sample data. In this way, embodiments of the present application can solve the problem that related technologies cannot enrich such a sample data set with existing image-style augmentation techniques such as cropping.
The implementation of the above steps is exemplarily described below.
In some embodiments of the present application, S101 illustratively includes:
the first step, each text corresponding to threat information data is cut into sentences, and a plurality of sample sentences are obtained.
In some embodiments of the present application, the first step illustratively includes: cutting each text in the threat information data according to the sentence-ending punctuation marks it contains, wherein the sentence-ending punctuation marks include periods, exclamation marks, and question marks. By splitting the text at sentence-ending punctuation marks, each sentence to be linearized is obtained, so that sentence-level linearization can then be performed.
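As a sketch, the sentence-cutting step might be implemented with a regular expression; the exact splitting rules of the application's data processing code are not given, so the helper below is an assumption:

```python
import re

def split_sentences(text):
    """Split text into sentences at terminal punctuation (., !, ?).

    A simplified sketch of the sentence-cutting step; real threat
    intelligence text may need extra rules (abbreviations, URLs, etc.).
    """
    # Split on whitespace that follows a sentence-ending punctuation mark,
    # keeping the punctuation attached to its sentence.
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

sentences = split_sentences("OceanLotus is from Vietnam. It uses watering hole attacks!")
# → ["OceanLotus is from Vietnam.", "It uses watering hole attacks!"]
```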
And secondly, taking out one sample sentence, and inserting an entity tag in a tag sequence corresponding to the taken out sample sentence before the corresponding word.
And thirdly, respectively adding a start tag BOS and an end tag EOS of the sample sentence before and after the extracted sample sentence to obtain a tag linearization sample sentence.
And repeatedly executing the steps until the label linearization of each sample sentence is completed, and obtaining the label linearization sample data set.
The implementation process of S101 is described below in conjunction with the example of fig. 5, and performing S101 completes tag linearization, which in some embodiments of the present application is:
Sample sentence linearization is performed in some embodiments of the present application to convert a labeled sentence into a linear sequence, so that the language generation model can learn the distribution of words and labels in the golden data. As shown in fig. 5, during linearization, i.e., during execution of S101, each tag (e.g., B-ORG, I-ORG, B-LOC, I-LOC, and O in fig. 5) is inserted before the corresponding word, so that the tag and the word are treated as a word pair. Since O tags appear very frequently in cyber threat intelligence named entity recognition and relation extraction tasks, embodiments of the present application delete these O tags from the linearized sequence during execution of S101, and only the remaining entity tags (B-ORG, I-ORG, B-LOC, I-LOC, etc.) and their corresponding words are retained in the tag linearization sample sentence. After a sentence is completely linearized, some embodiments of the present application mark its beginning and end with the special tags [BOS] and [EOS]; marking sentence boundaries in this way facilitates model training and data generation.
The tag row of fig. 5 characterizes the tag determined for each word of the sentence, and the linearization row of fig. 5 shows the result of applying the tag linearization process to the sentence (this process is also the function of the tag linearization processing module of fig. 1); this result may be used as a tag linearization sample sentence.
That is, in some embodiments of the present application this S101 illustratively includes:
step 1, cutting a text into sentences, splitting at punctuation marks that represent the end of a sentence, such as periods, exclamation marks, and question marks;
step 2, taking out a sentence, and inserting each non-O entity tag in the tag sequence before the corresponding word;
step 3, respectively adding tags BOS and EOS before and after the sentence to obtain a tag linearization sample sentence;
and 4, returning to the step 2, and repeating until the label linearization of all sentences is completed, so as to obtain a label linearization sample data set.
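Steps 1-4 above can be sketched as follows; the [BOS]/[EOS] bracket notation follows fig. 5, and the helper name is hypothetical:

```python
def linearize(words, tags):
    """Convert one BIO-labeled sentence into a tag-linearized sequence.

    Every non-O entity tag is inserted before its word, O tags are
    dropped from the sequence, and the sentence is wrapped in the
    [BOS]/[EOS] boundary markers.
    """
    seq = ["[BOS]"]
    for word, tag in zip(words, tags):
        if tag != "O":            # O tags are deleted from the linearized sequence
            seq.append(tag)
        seq.append(word)
    seq.append("[EOS]")
    return seq

# Example modeled on fig. 5 of the application:
words = ["OceanLotus", "is", "from", "Vietnam"]
tags  = ["B-ORG", "O", "O", "B-LOC"]
linear = linearize(words, tags)
# → ['[BOS]', 'B-ORG', 'OceanLotus', 'is', 'from', 'B-LOC', 'Vietnam', '[EOS]']
```

Repeating this over every sentence of the corpus yields the tag linearization sample data set.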
It is to be understood that some embodiments of the present application first perform sample sentence linearization, converting a labeled sentence (i.e., a sample sentence with a tag) into a linear sequence, so that the language generation model can learn the distribution of words and labels in the input data, which improves the training of the language generation model.
After obtaining the linearly labeled sentences shown in fig. 5 in the embodiments of the present application, the language generation model can be used to learn the distribution of words and tags. The architecture of the language generation model is exemplarily set forth below, and it can be appreciated that the target language generation model has the same architecture as the language generation model.
In some embodiments of the present application, the language generation model of S102 includes: a two-way long and short term memory network and a one-way long and short term memory network connected to and after the two-way long and short term memory network; wherein the one-way long-short-term memory network is configured to integrate a plurality of feature representations output by the two-way long-short-term memory network and further learn correlations between the plurality of feature representations.
That is, the language generation model of some embodiments of the present application uses a two-layer bidirectional long-short-term memory recurrent neural network (i.e., the two-way long-short-term memory network) followed by a one-way long-short-term memory network, which mainly achieves the following three advantages:
1. Extracting richer context information: the two-way long-short-term memory network considers both forward and backward context information at each time step. However, in the network threat intelligence information extraction task, certain specific context patterns or semantic information may not be captured using only a two-way long-short-term memory network. By connecting a one-way long-short-term memory network after the two-way long-short-term memory network, embodiments of the present application further enlarge the context view of the model and extract richer context information.
2. The expression capacity of the model is increased, namely, the two-way long-short-term memory network learns different characteristic representations in the forward and backward processing processes respectively. After stitching these representations, a more comprehensive and rich feature representation can be obtained. However, simple stitching may not fully exploit the correlation between the forward and backward representations. Some embodiments of the present application allow models to integrate these feature representations while further learning the correlations between them and enhancing the expressive power of the models by interfacing with a one-way long and short term memory network.
3. The LSTM is a neural network model suitable for modeling sequence data, and can well process time sequence information. Embodiments of the present application may better capture long-term dependencies and timing patterns in a sequence by using a combination of two-way long-short memory networks and one-way long-short memory networks. The two-way long-short-term memory network can transfer information in two directions, and the one-way long-short-term memory network can further integrate and process the information, so that modeling capacity of the model on sequence data is improved.
The modeling of the language generation model comprises the following specific steps: step 1, establishing a language generation model by using a two-way long-short-term memory network to connect one long-short-term memory network in series; step 2, taking linearization data (namely, each label linearization sample sentence in a label linearization sample set) generated in label linearization as the input of a model, and training the model; and 3, adjusting parameters, and comparing experiments to obtain a model with an optimal effect as a target language generation model.
It will be appreciated that some embodiments of the present application may further expand the contextual view of the model by connecting a one-way long-short-term memory network after a two-way long-short-term memory network, and allow the trained language generation model to extract more rich contextual information.
As shown in fig. 6, the language generation model provided in some embodiments of the present application further includes: a sentence input layer (corresponding to the sentence layer of fig. 6), an embedding layer (corresponding to the embedding layer of fig. 6), a bidirectional long-short-term memory network (corresponding to the bidirectional LSTM layer of fig. 6), a unidirectional long-short-term memory network (corresponding to the LSTM layer of fig. 6), a linear layer (corresponding to the linear layer of fig. 6), a softmax layer (corresponding to the Softmax layer of fig. 6), and a prediction layer (corresponding to the prediction layer of fig. 6). The sentence input layer is configured to receive each tag linearization sample sentence in the input tag linearization sample data set; the embedding layer is configured to receive each tag linearization sample sentence output by the sentence input layer and vectorize it to obtain a first group of vectors; the bidirectional long-short-term memory network is configured to receive the first group of vectors output by the embedding layer and mine the distribution of tags and words in the first group of vectors to obtain a second group of vectors; the unidirectional long-short-term memory network is configured to receive the second group of vectors output by the bidirectional long-short-term memory network and, while integrating the feature representations of the second group of vectors, further learn the correlations among those features to obtain a third group of vectors; the linear layer is configured to receive the third group of vectors and process them in a fully-connected manner to obtain a fourth group of vectors; the softmax layer is configured to receive the fourth group of vectors and normalize them to obtain a group of probability vectors; the prediction layer is configured to receive the group of probability vectors and select the combination with the largest probability as the output of the model.
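The layer stack of fig. 6 can be sketched in numpy with untrained random weights to show the data flow only; the dimensions, the weight initialization, and the minimal LSTM cell below are illustrative assumptions, not the trained model of the application:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm(xs, d_hidden, rng):
    """Run a single-direction LSTM over a sequence of input vectors."""
    d_in = xs.shape[1]
    W = rng.normal(0, 0.1, (4 * d_hidden, d_in))     # input weights (i, f, o, g gates)
    U = rng.normal(0, 0.1, (4 * d_hidden, d_hidden)) # recurrent weights
    b = np.zeros(4 * d_hidden)
    h, c, outs = np.zeros(d_hidden), np.zeros(d_hidden), []
    for x in xs:
        z = W @ x + U @ h + b
        i, f, o = (sigmoid(z[k * d_hidden:(k + 1) * d_hidden]) for k in range(3))
        g = np.tanh(z[3 * d_hidden:])
        c = f * c + i * g
        h = o * np.tanh(c)
        outs.append(h)
    return np.array(outs)

def language_model_forward(token_ids, vocab_size=20, d_emb=8, d_hidden=8):
    """Sketch of the fig. 6 stack: embedding -> BiLSTM -> LSTM -> linear -> softmax."""
    E = rng.normal(0, 0.1, (vocab_size, d_emb))      # embedding layer
    xs = E[token_ids]                                # first group of vectors
    fwd = lstm(xs, d_hidden, rng)                    # forward direction
    bwd = lstm(xs[::-1], d_hidden, rng)[::-1]        # backward direction
    bi = np.concatenate([fwd, bwd], axis=1)          # BiLSTM output (second group)
    uni = lstm(bi, d_hidden, rng)                    # unidirectional LSTM (third group)
    Wl = rng.normal(0, 0.1, (vocab_size, d_hidden))  # linear layer
    logits = uni @ Wl.T                              # fourth group of vectors
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax layer
    return probs.argmax(axis=1), probs               # prediction layer picks the max

pred, probs = language_model_forward(np.array([0, 3, 5, 1]))
```

Each row of `probs` is a probability vector over the vocabulary, and `pred` holds the highest-probability choice per time step.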
In fig. 6, the sentence-layer input is a generated tag linearization sample sentence, which includes, in order, the following tags and words: the tag BOS, the tag B-ORG, the word OceanLotus, the word from, and so on, each of which is vectorized in turn by the embedding layer of fig. 6.
It should be appreciated that some embodiments of the present application provide a language generation model that includes multiple layers, through which learning effects on input data may be enhanced.
It should be noted that the relationship classification model of some embodiments of the present application includes a second long-short-term memory network and an attention mechanism layer.
It is easy to understand that some embodiments of the present application use long-short-term memory networks to avoid long-distance dependence of traditional deep learning methods, and at the same time use attention mechanisms to effectively analyze the correlation between model input and output, thereby acquiring more context semantic information.
As shown in fig. 7, in some embodiments of the present application, the relationship classification model in S103 further includes: a word vector generation layer (corresponding to the word vector generation layer of fig. 7), a layer comprising a second long-short-term memory network (corresponding to the LSTM model of fig. 7), an attention mechanism layer (corresponding to the attention mechanism layer of fig. 7), a pooling layer (corresponding to the pooling layer of fig. 7), and a feature fusion classification layer (corresponding to the feature fusion classification layer of fig. 7). The word vector generation layer is configured to take the original corpus and entity position information as input, generate word vectors from the original corpus, and append two columns identifying the entity positions after the word vectors to generate a first group of classification vectors; the second long-short-term memory network is configured to receive the first group of classification vectors and mine their features to obtain a second group of classification vectors; the attention mechanism layer is configured to receive the second group of classification vectors, calculate attention probabilities, and obtain a third group of classification vectors weighted by the attention probabilities; the pooling layer is configured to receive the third group of classification vectors and perform maximum pooling on them to obtain a fourth group of classification vectors representing the overall features of the sentence; the feature fusion classification layer is configured to receive the fourth group of classification vectors, fuse them with the entity features, and normalize the fused vectors through a softmax activation function to obtain a fifth group of classification vectors; the classification layer is configured to receive the fifth group of classification vectors, restore them according to the number of entities to obtain the corresponding relation matrices, and output the relation matrices.
Some embodiments of the present application provide an architecture of a relational classification model, through which accuracy of a target relational classification model obtained by training on an input data classification result can be improved.
That is, the construction method (i.e., step 1 and step 2 described below) and the application process (i.e., step 3 described below) of the relationship classification model according to the embodiment of the present application are as follows:
and 1, constructing a relationship classification model by using a two-way long-short-term memory network and an attention mechanism.
And 2, training a relationship classification model by taking the original data set as input of the relationship classification model to obtain a target relationship classification model.
And 3, taking the generated data as the input of a target relation classification model, and obtaining the prediction classification of the relation of the entities in the generated data through the model.
The process of acquiring threat intelligence sample data from the resulting target threat intelligence sample data generation model is described below by way of example.
Some embodiments of the present application provide a method of acquiring threat intelligence sample data, the method comprising: sending a start tag BOS corresponding to a sample sentence to be generated into a target language generation model included in the target threat information sample data generation model obtained according to any one of the above embodiments; calculating, through the target language generation model, the probability of the word or tag to be generated next, and repeating this next-step prediction until the end tag EOS corresponding to the sample sentence to be generated is produced, thereby obtaining one threat information sample sentence, wherein the target language generation model predicts and selects, through its linear layer and softmax layer, the word or tag with the highest probability as the word or tag generated in the next step; and repeating the above process until a target number of threat intelligence sample sentences are generated.
It is easy to understand that some embodiments of the present application predict labels of each sample sentence and words or words corresponding to each prediction label through the target language generating model obtained by training, so that a plurality of threat information sample sentences can be generated, the number of threat information sample data is enriched, and the technical problem of scarcity of related data is solved.
The steps in the method of obtaining threat intelligence sample data are described below by way of example.
In some embodiments of the present application, the calculating, by the target language generation model, the probability of the word or tag generated in the next step includes: in the process of generating the threat information sample sentence, the starting tag BOS is directly sent into the target language generation model, and tags except the starting tag included in the threat information sample sentence are obtained by sampling the probability calculated according to a target formula.
It will be appreciated that, because sampling increases randomness, the language generation model of some embodiments of the present application may choose similar alternatives in the same context, which promotes the richness of the resulting sample data.
In some embodiments of the present application, the target formula is:

P(w_t = w_i | w_{<t}) = exp(s_{t-1,i}) / Σ_{i*=1..|V|} exp(s_{t-1,i*})    (2)

where s_{t-1} characterizes the state of stage t-1, i* indexes the i-th word w_i in the vocabulary, |V| characterizes the size of the vocabulary, w_t characterizes the t-th word generated, w_{<t} characterizes the words generated before the t-th word, and s_{t-1,i} characterizes the i-th element of s_{t-1}.
Some embodiments of the present application provide a calculation formula for quantifying probability, so that the calculation of probability values is more objective and accurate.
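Under the definitions given for equations (1) and (2) below — a vocabulary-sized score vector s_{t-1} = M^T d'_{t-1} normalized by a softmax — the sampling step can be sketched in numpy; the toy dimensions and random weights are assumptions:

```python
import numpy as np

def next_token_distribution(M, d_prev):
    """Equations (1)-(2): scores s = M^T d', then a softmax over the vocabulary."""
    s = M.T @ d_prev                 # shape (|V|,): one score per vocabulary entry
    e = np.exp(s - s.max())          # subtract the max for numerical stability
    return e / e.sum()               # probability of each candidate word or tag

rng = np.random.default_rng(42)
r, V = 4, 6                          # LSTM hidden size r, vocabulary size |V| (toy values)
M = rng.normal(size=(r, V))          # learnable weight matrix M
d_prev = rng.normal(size=r)          # LSTM state d'_{t-1}
p = next_token_distribution(M, d_prev)
token = rng.choice(V, p=p)           # sample the next word or tag from p
```

Sampling from `p` (rather than always taking the argmax) is what introduces the randomness discussed above.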
In some embodiments of the present application, the calculating, by the target language generating model, a probability of a word or a label generated in a next step includes: the label generated in the last step is used as input of the target language generating model to generate the next label.
According to the method and the device, the next label is predicted through the label in the last step, so that the accuracy of label prediction is improved, and then the whole sample sentence is generated.
That is, some embodiments of the present application, after deriving the target language generation model, use it to generate sample data for the named entity recognition task of threat intelligence. In some embodiments, in the process of generating each threat intelligence sample sentence, only the start tag [BOS] of the sample sentence to be generated is directly sent to the target language generation model, and the subsequent tags of the sentence are sampled according to the probability calculated by equation 2 (i.e., the target formula). That is, each threat intelligence sample sentence is generated by giving a start [BOS] and is then completed automatically, with the tag generated in the previous step used as input to generate the next tag. As shown in equation 2, during sampling in the process of generating threat intelligence sample sentences based on the target language generation model, tags with high probability are more likely to be selected. Because sampling increases randomness, the model may select similar alternatives in the same context. For example, in some embodiments of the present application, when predicting the tag following "OceanLotus from Vietnam often uses," the probability of "B-MEH" is much higher than that of other choices, because the model has seen many similar examples in the training data, such as "OceanLotus often uses B-MEH," "ATP30 uses B-MEH," and so on. Some embodiments of the present application then predict the following word given "OceanLotus from Vietnam often uses." In the training data, every "B-MEH" is followed by a word representing an attack pattern, so "watering hole attacks," "harpoon attacks," "0day attacks," etc. are all possible choices, and their probabilities are very close.
Due to the increased randomness, the model of embodiments of the present application may choose any one of them.
s_{t-1} = M^T d'_{t-1}    (1)

where i* indexes the word w_i in the vocabulary, |V| is the size of the vocabulary, M is a learnable weight matrix, r is the dimension of the LSTM hidden state, and s_{t-1,i} is the i-th element of s_{t-1}. The definitions of the remaining symbols are given above and are not repeated here.
That is, the specific steps of generating threat intelligence sample sentences in some embodiments of the present application:
step 1, sending a starting tag BOS of the beginning of the sentence into a target language generation model.
And 2, calculating the probability of the word or label generated in the next step, and predicting and selecting the word or label with the highest probability through the linear layer and the softmax layer.
And 3, repeating the step 2 until the EOS label is generated.
And 4, repeating the steps 1 to 3 until a sufficient number of sentences are generated.
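The four steps above can be sketched as a minimal generation loop; the transition table standing in for the trained model's linear and softmax layers is purely hypothetical:

```python
def generate_sentence(start="[BOS]", end="[EOS]", max_len=20):
    """Feed [BOS], repeatedly pick the most probable next token,
    and stop when [EOS] is produced (steps 1-3 above).

    next_token() is a hypothetical stand-in for the target language
    generation model; the max_len guard prevents runaway generation.
    """
    transitions = {                  # toy stand-in for learned next-token choices
        "[BOS]": "B-ORG", "B-ORG": "OceanLotus", "OceanLotus": "uses",
        "uses": "B-MEH", "B-MEH": "watering-hole-attacks",
        "watering-hole-attacks": "[EOS]",
    }
    def next_token(tok):
        return transitions.get(tok, end)
    out = [start]
    while out[-1] != end and len(out) < max_len:
        out.append(next_token(out[-1]))
    return out

sent = generate_sentence()
# → ['[BOS]', 'B-ORG', 'OceanLotus', 'uses', 'B-MEH', 'watering-hole-attacks', '[EOS]']
```

Step 4 then simply calls this loop until the target number of sentences has been produced.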
It should be noted that, in some embodiments of the present application, the method further includes: inputting each obtained threat information sample sentence into a target relation classification model to obtain a classification result; and taking the classification result as the generated threat information sample data.
It is easy to understand that some embodiments of the present application obtain, through a target relationship classification model, a classification result of the generated threat information data, and use the classification result as a tag, thereby obtaining richer threat information sample data.
In some embodiments of the present application, the process of inputting each obtained threat intelligence sample sentence into the target relationship classification model to obtain a classification result includes: the feature vectors corresponding to the threat information sample sentences are imported into a two-way long-short-term memory network for correlation analysis to obtain a first vector; calculating attention probability by adopting an attention mechanism, and acquiring the output characteristics of the two-way long-short-term memory network according to the attention probability to obtain a second vector; carrying out maximum pooling treatment on the second vector to obtain the overall characteristics of the text; and fusing the text local features and the text integral features, guiding the fused features into a classifier for classification, and outputting the classification result.
It will be appreciated that in some embodiments of the present application, after the synthesized sample data is generated by the target language generation model, extraction of entity relationships in the synthesized data may be accomplished by the target relationship classification model. Specifically, some embodiments of the application adopt long-term and short-term memory networks to avoid the long-distance dependence problem of the traditional deep learning method, and adopt an attention mechanism to effectively analyze the correlation between the input and the output of the model, so that more context semantic information is acquired. The structure of the object relation classification model is shown in fig. 7. Some embodiments of the application introduce feature vectors into a two-way long-short-term memory network, calculate attention probability by adopting an attention mechanism, analyze importance degree of correlation of input and output of a long-term memory network model, and acquire output features of the two-way long-term memory network according to the attention probability; carrying out maximum pooling processing on the output characteristics of the long-short-period memory network after the attention mechanism is introduced, and obtaining the overall characteristics of the text; and fusing the local text features and the integral text features, guiding the fused features into a classifier for classification, and outputting classification results.
Some embodiments of the present disclosure obtain a classification result of each threat information sample sentence generated through the provided target relationship classification model, thereby improving the accuracy of the obtained classification result.
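The attention-plus-max-pooling stage described above can be sketched in numpy; the scoring function (a single query vector) and all dimensions are assumptions, since the application does not specify them:

```python
import numpy as np

def attention_pool(H):
    """Attention weighting followed by max pooling over BiLSTM features.

    H: (seq_len, d) output features of the bidirectional LSTM.
    Attention probabilities are computed with a simple query vector
    (one common formulation; the exact scoring function used by the
    application is not given), applied to H, and max pooling then
    yields a sentence-level overall feature vector.
    """
    rng = np.random.default_rng(1)
    q = rng.normal(size=H.shape[1])      # attention query vector (hypothetical)
    scores = H @ q
    e = np.exp(scores - scores.max())
    alpha = e / e.sum()                  # attention probabilities over time steps
    weighted = H * alpha[:, None]        # features scaled by attention probability
    return weighted.max(axis=0)          # max pooling -> overall sentence feature

H = np.random.default_rng(2).normal(size=(5, 8))
feat = attention_pool(H)                 # one d-dimensional sentence feature
```

In the full model this sentence feature would then be fused with the entity features and passed to the softmax classifier.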
The process of obtaining a target language classification model, obtaining a target relationship classification model, and generating threat intelligence sample data based on these models provided by some embodiments of the present application are described below in connection with two examples.
Example 1
Assuming that a small BIO-labeled dataset for named entity recognition and relation extraction in the threat intelligence field is provided, the specific steps of the threat information data enhancement method (i.e., obtaining threat information sample sentences) based on the generation method of the embodiments of the present application are as follows:
and 1, performing label linearization processing on data in a data set by using a data processing code (namely a code corresponding to a label linearization processing module in FIG. 1), wherein the processing principle is that a beginning label BOS is added to a sentence head, an ending label EOS is added to a sentence tail, and a non-O label is inserted before a corresponding word.
And 2, taking the data subjected to the linearization processing of the label as input of a language generating model, and training the language generating model to obtain a target language generating model.
And 3, generating sample data with a certain scale by using the trained language generation model (namely, the target language generation model).
And 4, restoring the generated data of the tag linearization into a standard BIO format by using a data processing code.
And 5, taking the generated data restored into the standard BIO format as the input of a target relationship classification model, and obtaining the prediction of the relationship between the entities in the generated data through the target relationship classification model.
And 6, mixing the generated data (namely the sample data obtained in the step 3 and the predicted relation data in the step 5) with the original data (namely the actually collected and marked threat information sample data), and then taking the mixed data as the input of a threat information naming entity identification and relation extraction experiment, and carrying out an entity identification and relation extraction experiment.
And 7, if the experimental result does not meet the expectation, returning to the step 3.
And 8, ending the experiment if the experimental result meets the expectation.
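Step 4's restoration of the tag-linearized generated data to standard BIO format can be sketched as the inverse of the linearization step; the token conventions ([BOS]/[EOS] markers, B-/I- tag prefixes) follow fig. 5, and the helper name is hypothetical:

```python
def restore_bio(linear_seq):
    """Convert a tag-linearized sequence back into (word, tag) BIO pairs.

    Entity tags (B-*/I-*) apply to the word that follows them; words
    with no preceding entity tag receive the O tag, and the [BOS]/[EOS]
    boundary markers are stripped.
    """
    pairs, pending = [], None
    for tok in linear_seq:
        if tok in ("[BOS]", "[EOS]"):
            continue
        if tok.startswith(("B-", "I-")):  # an entity tag labels the next word
            pending = tok
        else:
            pairs.append((tok, pending or "O"))
            pending = None
    return pairs

seq = ["[BOS]", "B-ORG", "OceanLotus", "is", "from", "B-LOC", "Vietnam", "[EOS]"]
bio = restore_bio(seq)
# → [('OceanLotus', 'B-ORG'), ('is', 'O'), ('from', 'O'), ('Vietnam', 'B-LOC')]
```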
Example 2
Threat information named entity identification and relation extraction are the basis for threat intelligence utilization, and tasks such as network attack tracing and situation awareness depend on them. However, the current threat intelligence named entity identification and relation extraction tasks often face the problem of data scarcity, which affects the training of entity identification and relation extraction models. The method of some embodiments of the present application can be used to perform data enhancement on an existing dataset and generate brand new data, thereby expanding the dataset and realizing efficient and accurate extraction of threat intelligence entities and relations. Given a known high-quality small dataset in the threat intelligence field, an efficient and reliable threat intelligence named entity identification and relation extraction task is completed with the following specific steps:
The method comprises the steps of 1, carrying out label linearization processing on data in a data set by using a data processing code, wherein the processing principle is that a sentence head is labeled with BOS, a sentence tail is labeled with EOS, and a non-O label is inserted before a corresponding word;
step 2, taking the data processed by the label linearization as the input of a language generating model, and training the language generating model;
step 3, generating generation data of a certain scale (e.g., 1,000 sentences) by using the trained language model;
step 4, restoring the generated data of the tag linearization into a standard BIO format by using a data processing code;
step 5, taking the generated data restored into the standard BIO format as the input of a classification model, and obtaining the prediction of the relation between the entities in the generated data through the classification model, thereby obtaining the generated named entity identification data and relation extraction data;
step 6, mixing the generated data with the original data, and then performing entity identification and relation extraction experiments as input of threat information naming entity identification and relation extraction experiments;
step 7, if the experimental result does not accord with the expectation, returning to the step 3;
and 8, ending the experiment if the experimental result meets the expectation.
Some embodiments of the present application provide a method of training a threat intelligence classification model, the method comprising: generating sample data according to the target threat information sample data generation model obtained in any one of the embodiments; training the threat information classification model at least based on the sample data to obtain a target threat information classification model.
It can be understood that, in some embodiments of the present application, training data is generated by using the target threat information sample data generation model obtained in the foregoing embodiments, then a threat information classification model for classifying threat information is trained based on the rich training data, and after the training is finished, the target threat information classification model is obtained, and then the threat information to be classified can be classified directly by using the target threat information classification model.
Some embodiments of the present application provide a method of threat intelligence classification, the method comprising: inputting threat information data to be classified into the target threat information classification model obtained in any one of the embodiments; and obtaining the category of the threat information data to be classified through the target threat information classification model.
It is easy to understand that some embodiments of the present application classify threat information to be classified through a target threat information classification model obtained through training, and improve accuracy of threat information data classification.
Referring to fig. 8, fig. 8 illustrates an apparatus for obtaining a threat information sample data generation model provided in an embodiment of the present application. It should be understood that the apparatus corresponds to the method embodiment of fig. 4 and is capable of executing the steps involved in that method embodiment; for the specific functions of the apparatus, reference may be made to the above description, and detailed descriptions are omitted here to avoid repetition. The apparatus includes at least one software functional module that can be stored in a memory in the form of software or firmware, or solidified in an operating system of the apparatus. The apparatus for obtaining a threat information sample data generation model comprises: a label linearization sample data set acquisition module 101, a language generation model training module 102, a relationship classification model training module 103 and a target threat information sample data generation model acquisition module 104.
The label linearization sample data set acquisition module is configured to convert each sample sentence with a label into a linear sequence to obtain a label linearization sample data set, wherein the label linearization sample data set comprises a plurality of label linearization sample sentences, each label linearization sample sentence is obtained by inserting each label in a corresponding entity label sequence between the corresponding words, and a word and its corresponding label form a word pair.
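The label linearization performed by this module can be illustrated with a minimal sketch; the function name, the BOS/EOS token spelling and the example tag set are assumptions for illustration, not the patent's actual code:

```python
def linearize(words, bio_labels):
    """Convert a BIO-labelled sentence into a linear sequence by inserting
    each non-O entity tag directly before its word (forming a word pair),
    framed by start/end tags BOS and EOS."""
    seq = ["BOS"]
    for word, label in zip(words, bio_labels):
        if label != "O":
            seq.append(label)   # the inserted tag and the word form a word pair
        seq.append(word)
    seq.append("EOS")
    return seq

words = ["Emotet", "infected", "the", "host"]
labels = ["B-Malware", "O", "O", "O"]
print(linearize(words, labels))
# ['BOS', 'B-Malware', 'Emotet', 'infected', 'the', 'host', 'EOS']
```

A language generation model trained on such sequences sees tags and words as one vocabulary, which is what lets it learn their joint distribution.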
The language generation model training module is configured to train a language generation model according to the label linearization sample data set so that the language generation model learns the word and the distribution rule of the labels in each label linearization sample sentence to obtain a target language generation model.
The relation classification model training module is configured to train the relation classification model according to an original data set to obtain a target relation classification model, wherein the original data set is threat information sample data containing classification labels.
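The relation classification model is described elsewhere in this application as combining a long-short-term memory network with an attention mechanism layer. The attention-weighted pooling step can be sketched as follows; the fixed scoring function here is a stand-in for the weights the real model would learn jointly with the network:

```python
import math

def attention_pool(features):
    """Attention-style pooling: score each timestep's feature vector,
    softmax the scores into attention probabilities, and return the
    probability-weighted sum of the vectors (a sentence-level feature)."""
    scores = [sum(v) for v in features]          # illustrative fixed scorer
    m = max(scores)                              # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]          # attention probabilities
    dim = len(features[0])
    return [sum(w * v[i] for w, v in zip(weights, features)) for i in range(dim)]

# Three toy per-timestep feature vectors; the third scores highest and
# therefore dominates the pooled representation.
feats = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
pooled = attention_pool(feats)
print(pooled)  # each component is close to 2, pulled toward [2.0, 2.0]
```

In the actual model the scores come from a learned projection of the LSTM outputs rather than a fixed sum, but the softmax-and-weighted-sum structure is the same.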
And the target threat intelligence sample data generation model acquisition module is configured to take the target language generation model and the target relation classification model as one target threat intelligence sample data generation model.
It will be clear to those skilled in the art that, for convenience and brevity of description, reference may be made to the corresponding procedure in the foregoing method for the specific working procedure of the apparatus described above, and this will not be repeated here.
Referring to fig. 9, fig. 9 shows an apparatus for acquiring threat information sample data provided in an embodiment of the present application. It should be understood that the apparatus corresponds to the above method embodiment and is capable of executing each step related to the above method embodiment; for the specific functions of the apparatus, reference may be made to the above description, and detailed descriptions are omitted here as appropriate to avoid redundancy. The apparatus includes at least one software functional module that can be stored in a memory in the form of software or firmware, or solidified in an operating system of the apparatus. The apparatus for acquiring threat information sample data comprises: a start tag input module 201 and a module 202 for determining each tag or word in a sentence.
And the initial tag input module is configured to send an initial tag BOS corresponding to the sample sentence to be generated into a target language generation model included in the target threat information sample data generation model obtained according to any one of the embodiments.
And each label or word determining module in the sentence is configured to calculate the probability of the word or label generated in the next step through the target language generating model, repeatedly execute the process of predicting the word or label in the next step until the end label EOS corresponding to the sample sentence to be generated is generated, and obtain a threat information sample sentence, wherein the target language generating model predicts and selects the word or label with the highest probability as the word or label generated in the next step through a linear layer and a softmax layer.
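The BOS-to-EOS decoding loop performed by this module can be sketched with a toy next-token table standing in for the model's linear and softmax layers; the transition table, token names and probabilities are purely illustrative assumptions:

```python
# Toy next-token distribution standing in for the target language
# generation model's linear + softmax output.
TRANSITIONS = {
    "BOS": {"B-Malware": 0.6, "The": 0.4},
    "B-Malware": {"Emotet": 0.9, "WannaCry": 0.1},
    "Emotet": {"spreads": 0.7, "EOS": 0.3},
    "The": {"attacker": 1.0},
    "attacker": {"EOS": 1.0},
    "spreads": {"EOS": 1.0},
}

def generate(start="BOS", end="EOS", max_len=20):
    """Feed the start tag BOS, then repeatedly select the highest-probability
    next word or label (greedy argmax over the softmax output) until the
    end tag EOS is produced."""
    token, sentence = start, []
    for _ in range(max_len):
        probs = TRANSITIONS.get(token, {end: 1.0})
        token = max(probs, key=probs.get)  # argmax, as the prediction layer does
        if token == end:
            break
        sentence.append(token)
    return sentence

print(generate())  # ['B-Malware', 'Emotet', 'spreads']
```

Repeating this loop until a target number of sentences is reached yields the sample sentences that are later restored to BIO format and classified.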
It will be appreciated that by means of the apparatus of fig. 9, a set number of sample sentences can be generated, which can in turn be used to implement training of a classification model for classifying threat intelligence.
It will be clear to those skilled in the art that, for convenience and brevity of description, reference may be made to the corresponding procedure in the foregoing method for the specific working procedure of the apparatus described above, and this will not be repeated here.
Referring to fig. 10, fig. 10 illustrates an apparatus for training a threat information classification model according to an embodiment of the present application. It should be understood that the apparatus corresponds to the foregoing method embodiment and is capable of executing the steps involved in that method embodiment; for the specific functions of the apparatus, reference may be made to the foregoing description, and detailed descriptions are omitted here as appropriate to avoid redundancy. The apparatus includes at least one software functional module that can be stored in a memory in the form of software or firmware, or solidified in the operating system of the apparatus. The apparatus for training a threat information classification model comprises: a sample data generation module 301 and a training module 302.
The sample data generating module is configured to generate sample data according to the target threat intelligence sample data generating model obtained according to any one of the embodiments.
And the training module is configured to train the threat information classification model at least based on the sample data to obtain a target threat information classification model.
It will be clear to those skilled in the art that, for convenience and brevity of description, reference may be made to the corresponding procedure in the foregoing method for the specific working procedure of the apparatus described above, and this will not be repeated here.
Referring to fig. 11, fig. 11 shows a threat information classification apparatus provided in an embodiment of the present application. It should be understood that the apparatus corresponds to the above method embodiment and is capable of executing the steps involved in that method embodiment; for the specific functions of the apparatus, reference may be made to the above description, and detailed descriptions are omitted here as appropriate to avoid redundancy. The apparatus includes at least one software functional module that can be stored in a memory in the form of software or firmware, or solidified in the operating system of the apparatus. The threat information classification apparatus comprises: a to-be-classified data input module 401 and an output module 402.
The data input module to be classified is configured to input threat information data to be classified into the target threat information classification model obtained in the embodiment.
And the output module is configured to obtain the category of the threat information data to be classified through the target threat information classification model.
It will be clear to those skilled in the art that, for convenience and brevity of description, reference may be made to the corresponding procedure in the foregoing method for the specific working procedure of the apparatus described above, and this will not be repeated here.
Some embodiments of the present application provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs a method as described in any of the embodiments above.
As shown in fig. 12, some embodiments of the present application provide an electronic device 700, including a memory 710, a processor 720, and a computer program stored on the memory 710 and executable on the processor 720, wherein the processor 720 can implement the method according to any one of the embodiments described above when reading the program and executing the program through a bus 730.
Processor 720 may process digital signals and may include various computing structures, such as a complex instruction set computer architecture, a reduced instruction set computer architecture, or an architecture implementing a combination of instruction sets. In some examples, processor 720 may be a microprocessor.
Memory 710 may be used for storing instructions to be executed by processor 720 or data related to execution of the instructions. Such instructions and/or data may include code to implement some or all of the functions of one or more modules described in embodiments of the present application. The processor 720 of the disclosed embodiments may be configured to execute instructions in the memory 710 to implement the method shown in fig. 4. Memory 710 includes dynamic random access memory, static random access memory, flash memory, optical memory, or other memory known to those skilled in the art.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other manners as well. The apparatus embodiments described above are merely illustrative, for example, flow diagrams and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, the functional modules in the embodiments of the present application may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application may be embodied, essentially or in the part contributing to the prior art or in part, in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application, and various modifications and variations may be suggested to one skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application should be included in the protection scope of the present application. It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

Claims (21)

1. A method of obtaining a threat intelligence sample data generation model, the method comprising:
converting each sample sentence with a tag into a linear sequence to obtain a tag linearization sample data set, wherein the tag linearization sample data set comprises a plurality of tag linearization sample sentences, and each tag linearization sample sentence is obtained by inserting each tag in a corresponding entity tag between corresponding words;
training a language generation model according to the label linearization sample data set so that the language generation model learns the word and the distribution rule of the labels in each label linearization sample sentence to obtain a target language generation model;
training the relation classification model according to an original data set to obtain a target relation classification model, wherein the original data set is threat information sample data containing classification labels;
and taking the target language generating model and the target relation classifying model as a target threat information sample data generating model.
2. The method of claim 1, wherein converting each sample sentence with a tag into a linear sequence, resulting in a tag linearized sample data set, comprises:
Cutting each text corresponding to threat information data into sentences to obtain a plurality of sample sentences;
taking out one sample sentence, and inserting an entity tag in a tag sequence corresponding to the taken out sample sentence before a corresponding word;
respectively adding a sample sentence start tag BOS and a sample sentence end tag EOS before and after the extracted sample sentence to obtain a tag linearization sample sentence;
and repeatedly executing the steps until the label linearization of each sample sentence is completed, and obtaining the label linearization sample data set.
3. The method of claim 2, wherein the cutting each text corresponding to threat intelligence data into sentences comprises:
cutting the corresponding text according to sentence ending punctuation marks included in any text in the threat information data, wherein the sentence ending punctuation marks comprise: periods, exclamation marks, or question marks.
4. The method of claim 1, wherein the language generation model comprises:
a two-way long and short term memory network and a one-way long and short term memory network connected to and after the two-way long and short term memory network;
Wherein the one-way long-short-term memory network is configured to integrate a plurality of feature representations output by the two-way long-short-term memory network and further learn correlations between the plurality of feature representations.
5. The method of claim 4, wherein the language generation model further comprises: sentence input layer, embedded layer, linear layer, full connection layer and prediction layer; wherein,
the sentence input layer is configured to receive each label linearization sample sentence in the label linearization sample data set;
the embedding layer is configured to receive the label linearization sample sentences output by the sentence input layer, and perform vectorization processing on the label linearization sample sentences to obtain a first group of vectors;
the two-way long-short-term memory network is configured to receive the first group of vectors output by the embedding layer and mine the distribution of labels and words in the first group of vectors to obtain a second group of vectors;
the one-way long-short-term memory network layer is configured to receive the second group of vectors output by the two-way long-short-term memory network, and further learn correlations among features of the second group of vectors while integrating feature representations of the second group of vectors to obtain a third group of vectors;
The linear layer is configured to receive the third group of vectors, and process the input third group of vectors in a fully-connected mode to obtain a fourth group of vectors;
the full connection layer is configured to receive the fourth set of input vectors, normalize the fourth set of vectors and obtain a set of probability vectors;
the prediction layer is configured to receive the input set of probability vectors, and select a set of combinations with the largest probability according to the set of probability vectors as the output of the model.
6. The method of any one of claims 1-5, wherein the relationship classification model comprises a second long-short-term memory network and an attention mechanism layer.
7. The method of claim 6, wherein the relationship classification model further comprises: a word vector generation layer, a pooling layer, and a feature fusion classification layer, wherein,
the word vector generation layer is configured to input original corpus and entity position information, generate word vectors according to the original corpus, and add positions of two columns of identification entities after the word vectors to generate a first group of classification vectors;
the second long-short-term memory network is configured to receive the input first group of classification vectors and mine the characteristics of the first group of classification vectors to obtain a second group of classification vectors;
The attention mechanism layer is configured to receive the input second group of classification vectors, calculate attention probability and obtain a third group of classification vectors processed according to the attention probability;
the pooling layer is configured to receive the input third group of classification vectors, and perform maximum pooling processing on the third group of classification vectors to obtain a fourth group of classification vectors for representing the integral characteristics of sentences;
the feature fusion classification layer is configured to receive the fourth group of input classification vectors, perform feature fusion on the fourth group of classification vectors and entity features, and perform normalization processing on the fused vectors through a softmax activation function to obtain a fifth group of classification vectors;
the classifying layer is configured to receive the input fifth group of classifying vectors, restore the fifth group of classifying vectors according to the number of the entities to obtain corresponding relation matrixes, and output the relation matrixes.
8. A method of acquiring threat intelligence sample data, the method comprising:
feeding a start tag BOS corresponding to a sample sentence to be generated into a target language generation model included in a target threat intelligence sample data generation model obtained according to any one of claims 1 to 7;
Calculating the probability of a word or a label generated in the next step through the target language generating model, repeatedly executing the process of predicting the word or the label in the next step until an end label EOS corresponding to the sample sentence to be generated is generated, and obtaining a threat information sample sentence, wherein the target language generating model predicts and selects the word or the label with the highest probability as the word or the label generated in the next step through a linear layer and a softmax layer;
repeating the above process until a target number of threat intelligence sample sentences are generated.
9. The method of claim 8, wherein,
the calculating, by the target language generating model, the probability of the word or the label generated in the next step includes:
in the process of generating the threat information sample sentence, the starting tag BOS is directly sent into the target language generation model, and tags except the starting tag included in the threat information sample sentence are obtained by sampling the probability calculated according to a target formula.
10. The method of claim 9, wherein the target formula is:

p(w_t = w_i | w_{<t}) = exp(s_{t-1, i}) / Σ_{i* = 1}^{V} exp(s_{t-1, i*})

wherein s_{t-1} is used for characterizing the state of the t-1 stage, the minimum value of t being 1 and the maximum value of t being the sentence length; i* is used for characterizing the i*-th character or word w_{i*} in the vocabulary; V is used for characterizing the size of the vocabulary; w_t is used for characterizing the t-th character or word; w_{<t} is used for characterizing the characters or words preceding the t-th character or word; and exp(s_{t-1, i}) is used for characterizing an exponential operation performed on the i-th element of s_{t-1}.
11. The method of claim 8, wherein the calculating, by the target language generation model, a probability of a word or tag to be generated next comprises: the label generated in the last step is used as input of the target language generating model to generate the next label.
12. The method of any one of claims 8-11, wherein the method further comprises:
inputting each obtained threat information sample sentence into a target relation classification model to obtain a classification result;
and taking the classification result as the generated threat information sample data.
13. The method of claim 12, wherein inputting each of the obtained threat intelligence sample sentences into a target relationship classification model to obtain classification results comprises:
the feature vectors corresponding to the threat information sample sentences are imported into a two-way long-short-term memory network for correlation analysis to obtain a first vector;
Calculating attention probability by adopting an attention mechanism, and acquiring the output characteristics of the two-way long-short-term memory network according to the attention probability to obtain a second vector;
carrying out maximum pooling treatment on the second vector to obtain the overall characteristics of the text;
and fusing the text local features and the text integral features, guiding the fused features into a classifier for classification, and outputting the classification result.
14. A method of training a threat intelligence classification model, the method comprising:
generating sample data according to a target threat intelligence sample data generation model obtained according to any one of claims 1-7;
training the threat information classification model at least based on the sample data to obtain a target threat information classification model.
15. A method of threat intelligence classification, the method comprising:
inputting threat information data to be classified into the target threat information classification model obtained in claim 14;
and obtaining the category of the threat information data to be classified through the target threat information classification model.
16. An apparatus for obtaining a threat intelligence sample data generation model, the apparatus comprising:
The label linearization sample data set acquisition module is configured to convert each sample sentence with a label into a linear sequence to obtain a label linearization sample data set, wherein the label linearization sample data set comprises a plurality of label linearization sample sentences, each label linearization sample sentence is obtained by inserting each label in a corresponding entity label between corresponding words, and the words and the corresponding labels form a word pair;
the language generation model training module is configured to train a language generation model according to the label linearization sample data set so that the language generation model learns the word and the distribution rule of the labels in each label linearization sample sentence to obtain a target language generation model;
the relation classification model training module is configured to train the relation classification model according to an original data set to obtain a target relation classification model, wherein the original data set is threat information sample data containing classification labels;
and the target threat intelligence sample data generation model acquisition module is configured to take the target language generation model and the target relation classification model as one target threat intelligence sample data generation model.
17. An apparatus for obtaining threat intelligence sample data, the apparatus comprising:
a start tag input module configured to send a start tag BOS corresponding to a sample sentence to be generated into a target language generation model included in a target threat intelligence sample data generation model obtained according to any one of claims 1 to 7;
and each label or word determining module in the sentence is configured to calculate the probability of the word or label generated in the next step through the target language generating model, repeatedly execute the process of predicting the word or label in the next step until the end label EOS corresponding to the sample sentence to be generated is generated, and obtain a threat information sample sentence, wherein the target language generating model predicts and selects the word or label with the highest probability as the word or label generated in the next step through a linear layer and a softmax layer.
18. An apparatus for training a threat intelligence classification model, the apparatus comprising:
a sample data generation module configured to generate sample data from a target threat intelligence sample data generation model as obtained in any of claims 1-7;
And the training module is configured to train the threat information classification model at least based on the sample data to obtain a target threat information classification model.
19. An apparatus for threat intelligence classification, the apparatus comprising:
a to-be-classified data input module configured to input to-be-classified threat information data into the target threat information classification model as obtained in claim 14;
and the output module is configured to obtain the category of the threat information data to be classified through the target threat information classification model.
20. A computer readable storage medium having stored thereon a computer program, which when executed by a processor, is adapted to carry out the method of any of claims 1-15.
21. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor is operable to implement a method as claimed in any one of claims 1 to 15 when the program is executed by the processor.
CN202311093179.5A 2023-08-28 2023-08-28 Method and device for obtaining threat information sample data generation model Pending CN117725458A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311093179.5A CN117725458A (en) 2023-08-28 2023-08-28 Method and device for obtaining threat information sample data generation model


Publications (1)

Publication Number Publication Date
CN117725458A true CN117725458A (en) 2024-03-19

Family

ID=90200434

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311093179.5A Pending CN117725458A (en) 2023-08-28 2023-08-28 Method and device for obtaining threat information sample data generation model

Country Status (1)

Country Link
CN (1) CN117725458A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination