CN114297375A - Training method and extraction method for a network model of network security entities and relationships - Google Patents


Info

Publication number
CN114297375A
Authority
CN
China
Prior art keywords
entity
classifier
training
entity pair
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111404928.2A
Other languages
Chinese (zh)
Inventor
潘季明 (Pan Jiming)
姚剑文 (Yao Jianwen)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Original Assignee
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Topsec Technology Co Ltd, Beijing Topsec Network Security Technology Co Ltd, Beijing Topsec Software Co Ltd filed Critical Beijing Topsec Technology Co Ltd
Priority to CN202111404928.2A priority Critical patent/CN114297375A/en
Publication of CN114297375A publication Critical patent/CN114297375A/en
Pending legal-status Critical Current


Abstract

The invention discloses a training method and an extraction method for a network model of network security entities and relationships. The training method comprises the steps of: obtaining sample data comprising a plurality of sentence samples and adding training labels to the sentence samples; sampling each entity of the sentence samples to obtain a target sequence set; training a first classifier with each element in the target sequence set so as to determine the entity category feature of each element through the first classifier; constructing a set of candidate entity pairs based on the entities of the sentence sample; fusing the entity category features of each candidate entity pair in the candidate entity pair set, the feature vectors corresponding to the candidate entities, and the context feature of the entity pair to determine the fusion feature of the candidate entity pair; training a second classifier with the fusion feature of the candidate entity pair so that the second classifier outputs the relation classification of the entity pair; and jointly adjusting the parameters of the first classifier and the second classifier to complete training. The disclosed method automatically extracts the relationships between entities while identifying the entities, and has low complexity.

Description

Training method and extraction method for a network model of network security entities and relationships
Technical Field
The invention relates to the technical field of network security, and in particular to a training method and an extraction method for a network model of network security entities and relationships.
Background
Advanced Persistent Threat (APT) attacks are a serious threat in the current network field, and the ability to cope with them is an important guarantee of network security. At present, the common method of constructing an APT organization portrait is to generate APT organization labels by collecting and analyzing APT organization information and behaviors. This method is inefficient and its degree of data structuring is low, making it difficult to cope with today's network attacks with complex means, so a refined data usage method is urgently needed to facilitate the utilization of threat intelligence data and thereby achieve efficient APT attack response in network security. With the development of natural language processing technology, Named Entity Recognition (NER) and Relation Extraction (RE) have become very important in the field of network security: they help researchers extract cyber-threat information from unstructured text sources, and the extracted network entities or key expressions may be used to model network attacks described in open-source text.
In the existing scheme, named entities are identified from a text data set as follows: a character-level model is pre-trained on massive data using word2vec, low-dimensional vectors of the character sequences are obtained from the word2vec model, a combined Bi-LSTM and CRF model is trained, and the trained combined model is used for entity recognition. For relation extraction, the entities identified by NER in the previous step are fed to the trained word2vec model to obtain the corresponding vector representations, the entity vectors are concatenated, and finally a fully connected layer classifies the relation. The problems are that character-level word2vec embeddings make the model less robust than a language model pre-trained with a transformer structure; using a CRF as the classification layer cannot solve the entity-nesting problem; and the generalization ability of the relation extraction is too weak because context is not considered.
Disclosure of Invention
The embodiment of the invention provides a training method and an extraction method for a network model of network security entities and relationships.
The embodiment of the invention provides a training method for extracting a network model of a network security entity and relationship, which comprises the following steps: acquiring sample data comprising a plurality of sentence samples, and adding a training label to each sentence sample; sampling each entity of the sentence sample in the sample data added with the training label to obtain a target sequence set; training a first classifier by using each element in the target sequence set to determine entity class characteristics of each element through the first classifier; constructing a set of candidate entity pairs based on each entity of the sentence sample; fusing the entity category characteristics of each candidate entity pair in the candidate entity pair set, the characteristic vector corresponding to the candidate entity and the context characteristics of the entity pair to determine the fusion characteristics of the candidate entity pair; training a second classifier by using the fusion characteristics of the candidate entity pair so as to output the relation classification of the entity pair by using the second classifier; jointly adjusting parameters of the first classifier and the second classifier to complete training.
In some embodiments, sampling each entity of the sentence sample to obtain the target sequence set comprises:
enumerating each entity of the sentence sample as a first sample set; and
setting a sliding window, and sampling based on each entity boundary of the sentence sample to serve as a second sample set;
selecting elements which do not belong to the sentence sample and do not belong to the second sample set from the first sample set as a third sample set;
and extracting a subset from the third sample set, and taking the union of the subset and the second sample set as the target sequence set.
In some embodiments, the size of the sliding window is set to the number of words of the entity in the sentence sample.
In some embodiments, training the first classifier based on the elements in the target sequence set is performed based on the encoding of a pre-trained language model.
In some embodiments, the encoding based on a pre-trained language model comprises:
taking the target sequence set of the sentence sample as an input sequence of the language model, and inserting a first identifier at a specified position of the input sequence by using the language model so as to associate context information of the sentence sample; and
and outputting the feature vector of each element in the target sequence set by using the language model to complete the coding.
In some embodiments, constructing the set of candidate entity pairs based on the entities of the sentence sample is accomplished by extracting a plurality of entity pairs based on the first set of samples.
In some embodiments, fusing the entity category features of each candidate entity pair in the set of candidate entity pairs, the feature vector corresponding to the candidate entity, and the context features of the entity pair to determine the fused features of the candidate entity pair includes:
for the candidate entity pair, extracting the feature vector of the candidate entity pair by utilizing maximum pooling;
adding the feature vector of the candidate entity pair and the entity category feature of the candidate entity pair and calculating an average value to obtain a fusion sub-feature of the entity pair;
and splicing the fused sub-feature of the entity pair with the context feature of the entity pair to obtain the fused feature of the candidate entity pair.
In some embodiments, jointly adjusting the parameters of the first classifier and the second classifier to complete training comprises:
taking the sum of the loss of the first classifier and the loss of the second classifier as a target loss, and adjusting parameters to be optimal in the training process; and
parameters of the language model are adjusted during the training process.
The embodiment of the present disclosure further provides a method for extracting network security entities and relationships, which is implemented by using a first classifier and a second classifier trained by the training method according to the embodiments of the present disclosure, and includes the following steps: acquiring text data to be detected; encoding the text data to be detected, and determining entity category characteristics of the text data to be detected by using a first classifier; constructing a candidate entity pair set based on each entity of the sentence of the text data to be detected; fusing the entity category characteristics of each candidate entity pair in the candidate entity pair set, the characteristic vector corresponding to the candidate entity and the context characteristics of the entity pair to determine the fusion characteristics of the candidate entity pair; and outputting the relation classification of the entity pair by using the second classifier according to the fusion characteristics of the candidate entity pair.
The embodiment of the present disclosure further provides an apparatus for extracting a network security entity and a relationship, including a memory and a processor, where the memory stores a computer program, and the computer program implements the steps of the method for extracting a network security entity and a relationship when the computer program is invoked and executed by the processor.
The method disclosed by the invention determines the entity class characteristics by training the first classifier, performs fusion based on the entity class characteristics and the context characteristics, and then trains the second classifier to complete the relationship classification between the entity pairs. The method disclosed by the invention can identify the entities and simultaneously realize the automatic extraction of the relationship between the entities, and has low complexity.
The foregoing description is only an overview of the technical solutions of the present invention; the embodiments of the invention are described below so that the technical means of the present invention can be more clearly understood and the above and other objects, features, and advantages of the present invention become more readily apparent.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a basic flow diagram of a training method of the present disclosure;
fig. 2 is a sub-flow diagram of obtaining a target sample set of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The scheme of the disclosure identifies and classifies entities, including nested entities that may exist inside other entities. In attack descriptions, entities include APT organizations, malware, vulnerabilities, and the like, and such entities may contain sub-entities of different spans, so actual detection needs to recognize entities of different spans. For example, the character sequence [海, 莲, 花] ("OceanLotus") maps to spans including "海" ("sea"), "莲花" ("lotus"), and "海莲花" (the APT organization OceanLotus). Detection therefore needs to distinguish entity spans from non-entity spans and to classify the relations between pairs of entity spans.
Based on this, an embodiment of the present invention provides a method for training a network model for extracting network security entities and relationships, as shown in fig. 1, including the following steps:
in step S101, sample data including a plurality of sentence samples is obtained, and a training tag is added to each sentence sample. Specifically, articles related to safety can be crawled from various sources on the network, such as safety technology blogs; security event articles issued by each large network security company; an APT event report; WeChat public account tweets related to security events, and the like.
The collected data includes both text data and PDF data, and the latter must be converted into text data. First, the pictures contained in a PDF file are extracted, and the text appearing in them is recognized using OCR. The PDF file is then converted into text data with a pdf2text tool, and finally the OCR results and the pdf2text conversion results are merged. Since entity recognition and relation extraction are performed in sentence units, the text data is segmented into sentences, yielding sample data comprising a plurality of sentence samples; a label is then added to each sentence sample.
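As a minimal illustrative sketch of the sentence-segmentation step, the following splits merged report text into sentence samples. The function name and the splitting rule are assumptions, not part of the patent; a production pipeline would apply this to the merged OCR and pdf2text output described above.

```python
import re

def split_into_sentences(text):
    """Segment merged report text into sentence samples.

    A deliberately simple rule: split on ASCII and CJK sentence
    terminators.  Real reports need care with abbreviations and URLs.
    """
    parts = re.split(r"[.!?。！？]+\s*", text)
    return [p.strip() for p in parts if p.strip()]

sentences = split_into_sentences(
    "APT29 used spear phishing. The payload beaconed to a C2 server."
)
```

Each resulting sentence would then receive its entity and relation labels.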
In step S102, each entity of the sentence samples in the sample data with training labels added is sampled to obtain a target sequence set. Because entities of different spans in the sentence sample must be identified, the scheme of the present disclosure determines the target sequence set by sampling; the target sequence set thus contains a plurality of sequences usable as input to the first classifier, and the entity sequences carry corresponding labels.
In step S103, a first classifier is trained with each element in the target sequence set, so as to determine the entity category feature of each element through the first classifier. In a specific training process, each element in the target sequence set may be encoded, so that the first classifier is trained on the feature of each span obtained after encoding. For example, the first classifier may include a max-pooling layer, through which the vector of each span is extracted, and a fully connected layer, through which entity classification is then performed to determine the entity category feature of each element.
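The span classifier just described (max pooling over the span's token vectors followed by a fully connected layer) can be sketched in plain Python. The toy dimensions, weight values, and class names below are illustrative assumptions, not values from the patent.

```python
def max_pool(span_vectors):
    """Element-wise max over the token vectors of one span."""
    return [max(col) for col in zip(*span_vectors)]

def linear(x, weights, bias):
    """Fully connected layer: one logit per entity class."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

# Toy span: two token embeddings of dimension 3.
span = [[0.2, -1.0, 0.5],
        [0.4,  0.3, 0.1]]
pooled = max_pool(span)        # element-wise max of the two vectors
# Toy weights for two classes (e.g. "malware" vs. non-entity).
W = [[1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]
b = [0.0, 0.0]
logits = linear(pooled, W, b)  # per-class scores for the span
```

In the real model the span vectors would come from the language-model encoder and the weights would be learned jointly with it.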
In step S104, a set of candidate entity pairs is constructed based on the entities of the sentence sample. That is, after the entity category features are obtained, the relationship between the entity pairs with spans is further determined, and a candidate entity pair set can be constructed based on each entity of the sentence sample.
In step S105, the entity category feature of each candidate entity pair in the candidate entity pair set, the feature vectors corresponding to the candidate entities, and the context feature of the entity pair are fused to determine the fusion feature of the candidate entity pair. After feature fusion, a feature vector corresponding to the entity pair is obtained, which can be input to a second classifier to determine the relation category between the entity pair.
In step S106, a second classifier is trained by using the fusion features of the candidate entity pair, so as to output a relationship classification of the entity pair by using the second classifier.
In step S107, parameters of the first classifier and the second classifier are jointly adjusted to complete training.
The method disclosed by the invention determines the entity class characteristics by training the first classifier, performs fusion based on the entity class characteristics and the context characteristics, and then trains the second classifier to complete the relationship classification between the entity pairs. The method disclosed by the invention can identify the entities and simultaneously realize the automatic extraction of the relationship between the entities, and has low complexity.
In some embodiments, sampling the entities of the sentence sample to obtain the target sequence set, as shown in fig. 2, includes:
in step S201, enumerating each entity of the sentence sample as a first sample set; and
and setting a sliding window, and sampling based on each entity boundary of the sentence sample to serve as a second sample set. For example, the input sentence is represented as P = {p1, p2, p3, …, pn} and the set of labeled entities as Y = {y1, y2, y3, …, yk}, where n is the length of the input text and k is the number of entities. The entities may be sampled as follows: let Z be the longest segment-sequence length, determined by the longest entity length; all segment sequences are then enumerated as the set

S = { (pi, …, pj) | 1 ≤ i ≤ j ≤ n, j − i + 1 ≤ Z }.
In some embodiments, the size of the sliding window is set to the number of words of the entity in the sentence sample. Specifically, all the labeled entities in a sentence sample are used in training the model: a window whose size equals the number of words in the entity is slid around the entity, one position to the left and one to the right, to perform entity-boundary-based sampling, expressed as W1 = {w1, w2, w3, …, w2k}.
In step S202, elements that belong neither to the labeled entities of the sentence sample nor to the second sample set are selected from the first sample set as a third sample set. That is, all elements of the set S that belong neither to Y nor to W1 are selected as the second negative-sample candidate set (the third sample set), denoted W2.
In step S203, a subset is extracted from the third sample set, and the union of the subset and the second sample set is used as the target sequence set. Concretely, a subset W3 is randomly drawn from the second negative-sample candidate set, where the size of W3 is ⌈β·|W2|⌉, with 0 < β < 1 and ⌈·⌉ denoting rounding up. The finally obtained target sequence set is X = W1 ∪ W3.
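The sampling procedure above can be sketched as follows. This is one interpretation of the described scheme; in particular, counting the labeled entity itself among the boundary samples is an assumption, and all function and variable names are hypothetical.

```python
import math
import random

def sample_spans(sentence, entities, beta=0.5, seed=0):
    """Build the target sequence set X = W1 ∪ W3 for one sentence.

    `sentence` is a token list; `entities` holds labeled (start, end)
    spans with `end` exclusive.
    """
    n = len(sentence)
    z = max(e - s for s, e in entities)  # longest entity length -> Z
    # Enumerate every segment sequence of length <= Z (the set S).
    all_spans = {(i, j) for i in range(n)
                 for j in range(i + 1, min(i + z, n) + 1)}
    # W1: boundary samples -- each entity window slid one position
    # left and right (the entity itself included here by assumption).
    w1 = set()
    for s, e in entities:
        for shift in (-1, 0, 1):
            if 0 <= s + shift and e + shift <= n:
                w1.add((s + shift, e + shift))
    # W2: enumerated spans that are neither entities nor in W1.
    w2 = sorted(all_spans - set(entities) - w1)
    # W3: random subset of W2 of size ceil(beta * |W2|), 0 < beta < 1.
    w3 = set(random.Random(seed).sample(w2, math.ceil(beta * len(w2))))
    return w1 | w3
```

For a 5-token sentence with one labeled entity span (1, 3), W1 holds the three windows (0, 2), (1, 3), (2, 4), and W3 holds 3 of the 6 remaining enumerated spans.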
In some embodiments, training the first classifier based on the elements in the target sequence set is performed based on the encoding of a pre-trained language model.
In some embodiments, the encoding based on a pre-trained language model comprises:
taking the target sequence set of the sentence sample as an input sequence of the language model, and inserting a first identifier at a specified position of the input sequence by using the language model so as to associate context information of the sentence sample; and
and outputting the feature vector of each element in the target sequence set by using the language model to complete the coding.
In particular, the pre-trained language model BERT may be used as the core. For the input sequence, a first identifier, which may for example be the special character [CLS], is inserted at the first position to represent the context information of the whole associated sentence. Special characters representing the respective entity classes are then inserted in sequence, all initialized with the same position vector. The embedded sequence (c, t1, t2, t3, …, tm, e1, e2, e3, …, en) output by BERT has length n + m + 1, where n is the length of the input text (the number of words) and m is the number of inserted entity classes; c corresponds to [CLS] and represents the context information of the whole sentence, t1, t2, …, tm are the feature vectors of the entity classes, and e1, e2, e3, …, en are the feature vectors of the words. In this way the spans in the input sequence can be identified by the first classifier, for example an input sequence such as [海, 莲, 花] ("OceanLotus"), which maps to the spans "海", "莲花", and "海莲花".
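The construction of the model input, a [CLS] identifier followed by one special marker per entity class and then the sentence tokens, might be sketched as below. The marker spelling and the example tokens are assumptions for illustration only.

```python
def build_input(tokens, entity_classes):
    """Assemble the classifier input: [CLS], one special marker per
    entity class (all sharing one position vector in the real model),
    then the sentence tokens.  Total length: n + m + 1 tokens."""
    markers = [f"[{c.upper()}]" for c in entity_classes]
    return ["[CLS]"] + markers + tokens

seq = build_input(["APT29", "dropped", "Cobalt", "Strike"],
                  ["org", "malware"])
```

After encoding, position 0 yields c, positions 1..m yield t1..tm, and the remaining positions yield the word vectors e1..en.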
Describing the training process of relation extraction further: in some embodiments, constructing the set of candidate entity pairs based on the entities of the sentence sample is accomplished by extracting a plurality of entity pairs based on the first sample set, i.e., the entity pairs are determined on the basis of the segment-sequence set S enumerated above. For example, each candidate entity pair (s1, s2) may be extracted from S × S.
In some embodiments, fusing the entity category features of each candidate entity pair in the set of candidate entity pairs, the feature vector corresponding to the candidate entity, and the context features of the entity pair to determine the fused features of the candidate entity pair includes:
and for the candidate entity pair, extracting the feature vector of the candidate entity pair by utilizing maximum pooling. Maximum pooling is employed for candidate entity pairs (s1, s2) in the set of candidate entity pairs to extract feature vectors for the s1 and s2 entities.
The feature vector of the candidate entity pair and its entity category feature are added and averaged to obtain the fused sub-features of the entity pair. For example, each entity's feature vector and the corresponding entity-class feature vector may be added and averaged to obtain the fused sub-features F(s1) and F(s2), which incorporate the entity class.
The fused sub-features of the entity pair are then spliced with the context feature of the entity pair to obtain the fusion feature of the candidate entity pair. Besides the fused sub-features F(s1) and F(s2), C(s1, s2) denotes the context feature of the entity pair: the text feature between the two entities, whose contextual meaning helps the model understand the relationship between the entity pair. The context feature may be extracted by max pooling over the span from the end of the first entity to the beginning of the second entity. Splicing yields the entity-pair relation classification vector X = [F(s1); C(s1, s2); F(s2)]. X is input to the second classifier for training, and the second classifier outputs the relation classification of the entity pair.
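The fusion step, max-pooled span vectors averaged with class features and then concatenated with a pooled context feature, might be sketched as follows, using plain Python lists in place of model tensors; all names and the toy values are illustrative assumptions.

```python
def fuse(span1_vecs, span2_vecs, ctx_vecs, class1_vec, class2_vec):
    """Fusion feature X = [F(s1); C(s1, s2); F(s2)] for one pair."""
    def mpool(vs):   # max pooling over a list of token vectors
        return [max(col) for col in zip(*vs)]
    def avg(a, b):   # add two vectors and take the average
        return [(x + y) / 2 for x, y in zip(a, b)]
    f1 = avg(mpool(span1_vecs), class1_vec)  # F(s1)
    f2 = avg(mpool(span2_vecs), class2_vec)  # F(s2)
    c = mpool(ctx_vecs)                      # C(s1, s2): between-span text
    return f1 + c + f2                       # list concatenation

x = fuse([[1.0, 0.0], [0.0, 1.0]],  # s1: two token vectors
         [[2.0, 2.0]],              # s2: one token vector
         [[0.5, 0.5]],              # context between s1 and s2
         [1.0, 1.0], [0.0, 0.0])    # entity-class feature vectors
```

The resulting vector has three times the embedding dimension and is fed to the second classifier.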
In some embodiments, jointly adjusting the parameters of the first classifier and the second classifier to complete training comprises:
taking the sum of the loss of the first classifier and the loss of the second classifier as a target loss, and adjusting parameters to be optimal in the training process; and
parameters of the language model are adjusted during the training process.
The training process includes fine-tuning the BERT model and defining a joint loss function for the first classifier and the second classifier: L = Ls + Lr, where Ls is the loss of the first classifier and Lr is the loss of the second classifier. For the first classifier, the added [CLS] does not participate in the loss calculation, while the added entity-class special characters do, their class being the corresponding class ID.
For the second classifier, the labeled relations can be used as positive samples, and several negative samples are drawn randomly from entity pairs without relation labels. The number of training epochs, the batch size, the learning rate (lr), the maximum input length (max_len), and the β parameter are set; both the entity classifier and the relation classifier use the cross-entropy loss function, and training is performed while adjusting the hyper-parameters to the optimal state.
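A minimal sketch of the joint objective L = Ls + Lr with cross-entropy losses follows; the function names and the toy logits are illustrative assumptions, and a real implementation would compute these with the deep-learning framework's built-in loss.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of one softmax-normalised prediction."""
    m = max(logits)                      # stabilise the softmax
    exps = [math.exp(x - m) for x in logits]
    return -math.log(exps[target] / sum(exps))

def joint_loss(ent_logits, ent_targets, rel_logits, rel_targets):
    """Joint objective L = Ls + Lr: entity-classifier loss plus
    relation-classifier loss, both cross-entropy."""
    ls = sum(cross_entropy(l, t) for l, t in zip(ent_logits, ent_targets))
    lr = sum(cross_entropy(l, t) for l, t in zip(rel_logits, rel_targets))
    return ls + lr
```

Backpropagating this single scalar updates the two classifier heads and the shared encoder together, which is what makes the training "joint".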
The training method disclosed herein uses a pre-trained language model to insert, according to the defined relation set, the special characters corresponding to the entity classes at the head of the sentence and lets them participate in training the NER model, obtaining entity category features adapted to the context and fused with context features. This solves the problem of simultaneously incorporating entity category features and context features into a joint entity-relation extraction model, and improves the performance of the classifiers.
The embodiment of the present disclosure further provides a method for extracting network security entities and relationships, which is implemented by using a first classifier and a second classifier trained by the training method according to the embodiments of the present disclosure, and includes the following steps: acquiring text data to be detected; encoding the text data to be detected, and determining entity category characteristics of the text data to be detected by using a first classifier; constructing a candidate entity pair set based on each entity of the sentence of the text data to be detected; fusing the entity category characteristics of each candidate entity pair in the candidate entity pair set, the characteristic vector corresponding to the candidate entity and the context characteristics of the entity pair to determine the fusion characteristics of the candidate entity pair; and outputting the relation classification of the entity pair by using the second classifier according to the fusion characteristics of the candidate entity pair.
The embodiment of the present disclosure further provides an apparatus for extracting a network security entity and a relationship, including a memory and a processor, where the memory stores a computer program, and the computer program implements the steps of the method for extracting a network security entity and a relationship when the computer program is invoked and executed by the processor.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (10)

1. A training method for extracting a network model of network security entities and relationships is characterized by comprising the following steps:
acquiring sample data comprising a plurality of sentence samples, and adding a training label to each sentence sample;
sampling each entity of the sentence sample in the sample data added with the training label to obtain a target sequence set;
training a first classifier by using each element in the target sequence set to determine entity class characteristics of each element through the first classifier;
constructing a set of candidate entity pairs based on each entity of the sentence sample;
fusing the entity category characteristics of each candidate entity pair in the candidate entity pair set, the characteristic vector corresponding to the candidate entity and the context characteristics of the entity pair to determine the fusion characteristics of the candidate entity pair;
training a second classifier by using the fusion characteristics of the candidate entity pair so as to output the relation classification of the entity pair by using the second classifier;
jointly adjusting parameters of the first classifier and the second classifier to complete training.
2. The method of claim 1, wherein sampling each entity of the sentence sample to obtain the target sequence set comprises:
enumerating each entity of the sentence sample as a first sample set; and
setting a sliding window, and sampling based on each entity boundary of the sentence sample to serve as a second sample set;
selecting elements which do not belong to the sentence sample and do not belong to the second sample set from the first sample set as a third sample set;
and extracting a subset from the third sample set, and taking the union of the subset and the second sample set as the target sequence set.
3. The method for training a network model according to claim 1, wherein the size of the sliding window is set to the number of words of the entity in the sentence sample.
4. A method for training a network model according to claim 1, wherein the training of the first classifier based on the elements in the target sequence set is performed based on the encoding of a pre-trained language model.
5. The method for training a network model according to claim 4, wherein the encoding based on the pre-trained language model comprises:
taking the target sequence set of the sentence sample as the input sequence of the language model, and inserting a first identifier at a specified position of the input sequence with the language model, so as to associate the context information of the sentence sample; and
outputting, with the language model, the feature vector of each element in the target sequence set to complete the encoding.
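An illustrative sketch of the claim-5 encoding step. The marker token and its position are assumptions (chosen by analogy with BERT's classification token); the toy vectors merely stand in for a language model's context-aware output.

```python
def encode_with_identifier(elements, marker="[CLS]"):
    """Prepend a marker token (the 'first identifier'; token and position
    are assumptions, by analogy with BERT's classification token), then
    emit one toy feature vector per original element."""
    seq = [marker] + list(elements)
    n = len(seq)
    # Toy context mixing: each vector carries its position and the length
    # of the whole sequence, standing in for the LM's attention output.
    return [[i / n, float(n)] for i in range(1, n)]
```

A real implementation would feed `seq` to the pre-trained language model and take the per-token hidden states as the feature vectors.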
6. The method of claim 2, wherein the candidate entity pair set is constructed by extracting a plurality of entity pairs from the first sample set.
7. The method of claim 1, wherein determining the fusion feature of the candidate entity pair from the entity category features of the pair, the feature vectors of the candidate entities, and the context feature of the pair comprises:
extracting the feature vector of the candidate entity pair by max pooling;
adding the feature vector of the candidate entity pair to the entity category feature of the candidate entity pair and averaging, to obtain the fusion sub-feature of the entity pair; and
concatenating the fusion sub-feature of the entity pair with the context feature of the entity pair to obtain the fusion feature of the candidate entity pair.
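The three fusion steps of claim 7 (max pooling, add-and-average, concatenate) can be sketched directly on plain lists; vector dimensions and names here are illustrative only.

```python
def max_pool(vectors):
    """Column-wise max over the token vectors covered by an entity pair."""
    return [max(col) for col in zip(*vectors)]

def fusion_feature(pair_token_vecs, category_vec, context_vec):
    pooled = max_pool(pair_token_vecs)                         # pair feature vector
    sub = [(p + c) / 2 for p, c in zip(pooled, category_vec)]  # add, then average
    return sub + list(context_vec)                             # concatenate context
```

For example, token vectors [1, 4] and [3, 2] pool to [3, 4]; averaged with category feature [1, 0] this gives [2.0, 2.0], to which the context feature is appended.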
8. The method of claim 5, wherein jointly adjusting the parameters of the first classifier and the second classifier to complete training comprises:
taking the sum of the loss of the first classifier and the loss of the second classifier as the target loss, and optimizing the parameters against this target loss during training; and
adjusting the parameters of the language model during training.
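A toy sketch of the claim-8 joint objective: the target loss is simply the sum of the two classifier losses, and one shared parameter is updated against that sum. The quadratic stand-in losses and the finite-difference gradient are purely illustrative assumptions.

```python
def target_loss(loss_first, loss_second):
    """Claim-8 target loss: the sum of the two classifier losses."""
    return loss_first + loss_second

def grad_step(w, lr=0.1, eps=1e-6):
    """One joint update of a shared parameter w via finite differences;
    the quadratic stand-in losses are purely illustrative."""
    def total(x):
        return target_loss((x - 1.0) ** 2, (x - 3.0) ** 2)
    g = (total(w + eps) - total(w - eps)) / (2 * eps)
    return w - lr * g
```

Because both stand-in losses depend on the same parameter, minimizing their sum settles between their individual minima, which is the point of the joint adjustment.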
9. A method for extracting network security entities and relationships, using a first classifier and a second classifier trained by the training method according to any one of claims 1 to 8, the method comprising:
acquiring text data to be detected;
encoding the text data to be detected, and determining its entity category features with the first classifier;
constructing a candidate entity pair set based on the entities of each sentence of the text data to be detected;
fusing the entity category features of each candidate entity pair in the candidate entity pair set, the feature vectors of the candidate entities, and the context feature of the entity pair, to determine the fusion feature of the candidate entity pair; and
outputting, with the second classifier, the relation classification of each entity pair according to its fusion feature.
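The inference flow of claim 9 can be sketched end to end; both classifiers are passed in as plain callables and are hypothetical stand-ins for the trained models, and the single-token candidate spans are a simplification.

```python
def extract_relations(tokens, first_clf, second_clf):
    """Toy inference pipeline mirroring claim 9; both classifiers are
    hypothetical stand-ins supplied as plain callables."""
    spans = [(i, i + 1) for i in range(len(tokens))]        # candidate spans
    entities = [s for s in spans if first_clf(tokens, s)]   # entity category step
    triples = []
    for a in entities:
        for b in entities:
            if a != b:
                rel = second_clf(tokens, a, b)              # relation classification
                if rel:
                    triples.append((a, rel, b))
    return triples
```

With a first classifier that accepts capitalized tokens and a second classifier that labels left-to-right pairs as "targets", the sentence "Trojan infects Windows" yields one (entity, relation, entity) triple.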
10. An apparatus for extracting network security entities and relationships, comprising a memory and a processor, the memory storing a computer program which, when invoked and executed by the processor, implements the steps of the method for extracting network security entities and relationships of claim 9.
CN202111404928.2A 2021-11-24 2021-11-24 Training method and extraction method of network model of network security entity and relationship Pending CN114297375A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111404928.2A CN114297375A (en) 2021-11-24 2021-11-24 Training method and extraction method of network model of network security entity and relationship


Publications (1)

Publication Number Publication Date
CN114297375A true CN114297375A (en) 2022-04-08

Family

ID=80964846

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111404928.2A Pending CN114297375A (en) 2021-11-24 2021-11-24 Training method and extraction method of network model of network security entity and relationship

Country Status (1)

Country Link
CN (1) CN114297375A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116702785A (en) * 2023-08-03 2023-09-05 腾讯科技(深圳)有限公司 Processing method and device of relational tag, storage medium and electronic equipment
CN116702785B (en) * 2023-08-03 2023-10-24 腾讯科技(深圳)有限公司 Processing method and device of relational tag, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN111581396B (en) Event graph construction system and method based on multi-dimensional feature fusion and dependency syntax
CN109977416B (en) Multi-level natural language anti-spam text method and system
CN108595708A A knowledge-graph-based method for classifying abnormal-information texts
CN109446333A A method and related device for Chinese text classification
CN112559734B (en) Brief report generating method, brief report generating device, electronic equipment and computer readable storage medium
CN112241456B (en) False news prediction method based on relationship network and attention mechanism
CN112667813B (en) Method for identifying sensitive identity information of referee document
CN113947161A (en) Attention mechanism-based multi-label text classification method and system
CN115577095B (en) Electric power standard information recommendation method based on graph theory
Bansal et al. An Evolving Hybrid Deep Learning Framework for Legal Document Classification.
CN114153978A (en) Model training method, information extraction method, device, equipment and storage medium
CN115631365A (en) Cross-modal contrast zero sample learning method fusing knowledge graph
CN114461890A (en) Hierarchical multi-modal intellectual property search engine method and system
CN114724156B (en) Form identification method and device and electronic equipment
CN115357904A (en) Multi-class vulnerability detection method based on program slice and graph neural network
CN113609857B (en) Legal named entity recognition method and system based on cascade model and data enhancement
CN114297375A (en) Training method and extraction method of network model of network security entity and relationship
CN116432660A (en) Pre-training method and device for emotion analysis model and electronic equipment
CN116108127A (en) Document level event extraction method based on heterogeneous graph interaction and mask multi-head attention mechanism
CN110765108A (en) False message early detection method based on crowd-sourcing data fusion
CN114298040A (en) Training method and recognition method of nested secure entity recognition model
Liu IntelliExtract: An End-to-End Framework for Chinese Resume Information Extraction from Document Images
CN113806338B (en) Data discrimination method and system based on data sample imaging
CN116702048B (en) Newly added intention recognition method, model training method, device and electronic equipment
Nagendar et al. Contrastive graph learning with graph convolutional networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination