CN112989832A

CN112989832A - Entity linking method applied to network security field

Info

Publication number: CN112989832A
Application number: CN202110344549.2A
Authority: CN
Inventors: 陆以勤; 谢树禄; 覃健诚; 李智鹏; 陈帅豪; 洪炜妍; 陈嘉睿
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2021-03-29
Filing date: 2021-03-29
Publication date: 2021-06-18
Anticipated expiration: 2041-03-29
Also published as: CN112989832B

Abstract

The invention discloses an entity linking method for the field of network security. The method comprises: generating security candidate entities by using an entity query reference table; segmenting the to-be-linked security text corresponding to the entity mention to obtain a first joint embedded vector; segmenting the security text corresponding to each security candidate entity to obtain a second joint embedded vector; inputting the first joint embedded vector and the second joint embedded vector in turn into a BiLSTM model and then a CNN model to obtain first feature information and second feature information of the security texts, respectively; applying a neural network attention mechanism to the feature information to enhance the corresponding security text features; and computing the cosine similarity of the enhanced security text vectors and linking the candidate entity with the highest score to the entity mention, thereby realizing entity linking in the field of network security. The invention effectively improves the performance of entity linking systems in the field of network security.

Description

Entity linking method applied to network security field

Technical Field

The invention relates to the field of natural language processing, in particular to an entity linking method applied to the field of network security.

Background

The rapid development of modern computer technology has led to explosive growth of security data on the internet. As network technology develops rapidly, the number of security incidents and of network security vulnerabilities exploited to attack networks keeps increasing.

To protect cyberspace effectively, network security experts now deploy security monitoring systems at numerous key positions to detect various network security threats. These large monitoring systems produce a large amount of security data, and analyzing this data is important for preventing and controlling network security risks.

However, most current security data analysis is performed manually or relies on a single matching method. Applying the entity linking technology of natural language processing to security data greatly improves analysis capability, gives network security monitoring personnel sound support for judging the network security situation accurately, and thereby improves the security of cyberspace. At present there is relatively little research on entity linking in the network security field, which makes such research all the more urgent.

Disclosure of Invention

The embodiment of the invention provides an entity linking method applied to the field of network security, which effectively improves the entity linking performance in the field of network security.

An entity linking method applied in the field of network security comprises the following steps:

step one: constructing a candidate entity query reference table in the network security field, and generating security candidate entities by using the candidate entity query reference table;

step two: utilizing a word segmentation tool to segment the to-be-linked security text corresponding to the entity mention, inputting the segmented to-be-linked security text into a trained Word2vec model, the Word2vec model outputting a first character vector and a first word vector of the to-be-linked security text, simultaneously generating a corresponding first position vector, and adding the first character vector, the first word vector and the first position vector to obtain a first joint embedded vector;

step three: utilizing a word segmentation tool to segment the security text from the security knowledge base corresponding to the security candidate entity, inputting the segmented security text into the trained Word2vec model, the Word2vec model outputting a second character vector and a second word vector of the security text, simultaneously generating a corresponding second position vector, and adding the second character vector, the second word vector and the second position vector to obtain a second joint embedded vector;

step four: sequentially inputting the first joint embedded vector and the second joint embedded vector into a BiLSTM model to obtain a first security text vector containing first context semantic information and a second security text vector containing second context semantic information; respectively inputting the first security text vector and the second security text vector into a CNN model to respectively obtain first feature information and second feature information of the security texts;

step five: applying a neural network attention mechanism to the first feature information and the second feature information to enhance the corresponding security text features;

step six: performing cosine similarity calculation on the enhanced security text vector corresponding to the entity mention and the enhanced security text vector corresponding to each candidate entity, sorting the scores of all calculation results in descending order, and linking the candidate entity with the highest score to the entity mention.
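The candidate-generation step at the head of this list can be pictured as a dictionary lookup from a mention's surface form to the entities it might denote. The following is a minimal sketch under stated assumptions, not the patented implementation: the alias data, the entity identifiers, and the function names `build_reference_table` and `generate_candidates` are all hypothetical, and normalising mentions by trimming and lower-casing is an assumption.

```python
from collections import defaultdict


def build_reference_table(alias_pairs):
    """Map each normalised surface form to the entity IDs it may refer to."""
    table = defaultdict(list)
    for alias, entity_id in alias_pairs:
        key = alias.strip().lower()
        if entity_id not in table[key]:
            table[key].append(entity_id)
    return dict(table)


def generate_candidates(mention, table):
    """Return the candidate entities for a mention, or an empty list if unknown."""
    return table.get(mention.strip().lower(), [])


# Hypothetical alias data, for illustration only.
ALIAS_PAIRS = [
    ("WannaCry", "entity:wannacry_ransomware"),
    ("WannaCrypt", "entity:wannacry_ransomware"),
    ("Trojan", "entity:trojan_horse_malware"),
    ("Trojan", "entity:trojan_war_legend"),
]

table = build_reference_table(ALIAS_PAIRS)
candidates = generate_candidates("trojan", table)
```

An ambiguous mention such as "Trojan" returns several candidates, and the later steps exist to choose among them.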

Preferably, the training of the Word2vec model comprises: acquiring network security text and cleaning it; segmenting the security text with a word segmentation tool, and inputting the segmented security text into the Word2vec model for pre-training.

Preferably, the cleaning of the network security text comprises format conversion, traditional-to-simplified Chinese conversion, and case conversion.
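The cleaning operations could look roughly like the sketch below. It is illustrative only: the patent does not say what format conversion covers, so stripping HTML tags and NFKC normalisation are assumptions, the three-entry traditional-to-simplified map stands in for a full conversion table, and the function name `clean_security_text` is hypothetical.

```python
import html
import re
import unicodedata

# Tiny illustrative traditional-to-simplified map (an assumption); a real
# system would use a complete conversion table.
T2S = str.maketrans({"網": "网", "絡": "络", "擊": "击"})

_TAG_RE = re.compile(r"<[^>]+>")
_WS_RE = re.compile(r"\s+")


def clean_security_text(raw):
    """Apply format, simplified-Chinese and case conversion to raw text."""
    # Format conversion (assumed): drop markup, decode HTML entities,
    # and fold full-width characters to their half-width forms.
    text = html.unescape(_TAG_RE.sub(" ", raw))
    text = unicodedata.normalize("NFKC", text)
    # Traditional-to-simplified conversion via the illustrative map.
    text = text.translate(T2S)
    # Case conversion: lower-case all Latin letters.
    text = text.lower()
    # Collapse the whitespace left behind by the removed tags.
    return _WS_RE.sub(" ", text).strip()


cleaned = clean_security_text("<p>ＳＱＬ Injection 網絡攻擊 &amp; XSS</p>")
```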

Preferably, the calculation formula of the position vector in step two and step three is:

wherein the variable pos represents the position of the security word in the security text, the variable d represents the dimension of the security word vector, 2i denotes the even-numbered dimensions of the vector, and 2i+1 denotes the odd-numbered dimensions.
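The formula itself did not survive in this copy of the text. The variables defined here (pos, d, even dimension 2i, odd dimension 2i+1) match the widely used sinusoidal positional encoding, so the missing formula (1) plausibly has the form below. This is an assumption: the base constant 10000 in particular is taken from the standard encoding and is not confirmed by the source.

```latex
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)
```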

Preferably, in step two and step three, the formula for performing joint embedding is:

V_Joint = V_char + V_word + V_position (2)

wherein V_Joint represents the joint vector, V_char represents the character vector, V_word represents the word vector, and V_position represents the position vector.
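Formula (2) is an element-wise sum, which the following sketch makes concrete. Several things here are assumptions rather than details from the patent: the position vector uses the standard sinusoidal encoding with base 10000 (the patent's own position formula is missing from this copy), the character and word vectors are taken to have the same dimension for each token, the numeric values are made up, and the function names are hypothetical.

```python
import math


def position_vector(pos, d, base=10000.0):
    """Sinusoidal position vector of dimension d for token position pos."""
    vec = []
    for k in range(d):
        i = k // 2  # pair index shared by dimensions 2i and 2i+1
        angle = pos / (base ** (2 * i / d))
        # Even dimensions use sine, odd dimensions use cosine.
        vec.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return vec


def joint_embedding(v_char, v_word, v_position):
    """Element-wise sum V_Joint = V_char + V_word + V_position."""
    if not (len(v_char) == len(v_word) == len(v_position)):
        raise ValueError("all three vectors must share one dimension")
    return [c + w + p for c, w, p in zip(v_char, v_word, v_position)]


# Toy 4-dimensional vectors standing in for Word2vec outputs (made-up values).
v_char = [0.1, 0.2, 0.3, 0.4]
v_word = [0.5, 0.5, 0.5, 0.5]
v_pos = position_vector(pos=0, d=4)
v_joint = joint_embedding(v_char, v_word, v_pos)
```

Because the three vectors are added rather than concatenated, the joint vector keeps the dimension d of its inputs.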

Preferably, the cosine similarity of the two security text features in step six is calculated as:

$$\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$

wherein A and B represent two n-dimensional vectors, A = [A1, A2, A3, ..., An] and B = [B1, B2, B3, ..., Bn], and θ represents the angle between vectors A and B.
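Step six amounts to scoring every candidate against the mention and sorting. A minimal sketch follows, assuming the enhanced text vectors are already available as plain lists of floats. The vectors, the entity identifiers, and the function names are hypothetical, and returning 0.0 for a zero vector is a convention chosen here, not something the patent specifies.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two n-dimensional vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def rank_candidates(mention_vec, candidate_vecs):
    """Return (score, entity_id) pairs sorted by score in descending order."""
    scored = [
        (cosine_similarity(mention_vec, vec), entity_id)
        for entity_id, vec in candidate_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# Made-up vectors, for illustration only.
mention_vec = [1.0, 0.0, 1.0]
candidate_vecs = {
    "entity:a": [1.0, 0.0, 1.0],
    "entity:b": [0.0, 1.0, 0.0],
    "entity:c": [1.0, 1.0, 0.0],
}

ranking = rank_candidates(mention_vec, candidate_vecs)
best_score, best_entity = ranking[0]  # the candidate linked to the mention
```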

Compared with the prior art, the invention has the following advantages:

Compared with existing entity linking methods in the network security field, the method pre-trains the Word2vec model on a crawled network security corpus, which improves the fit between the Word2vec model and the network security domain; it jointly embeds character, word and position vectors, which enriches the semantics of the security text; and it introduces deep learning models and an attention mechanism, which improves the performance of entity linking systems in the network security field.

Drawings

FIG. 1 is a schematic flow chart of the entity linking method applied in the field of network security according to the present invention.

Detailed Description

Embodiments of the present invention will be described in detail below with reference to the accompanying drawings. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the invention.

As shown in fig. 1, an entity linking method applied in the network security field includes the following steps:

Step one: crawl a network security corpus using Python-based web crawler technology; security-related texts are crawled mainly from the web pages of the national security vulnerability database, Wikipedia, the 360 security company, and the like.

Step two: segment the security text crawled in step one with the Jieba word segmentation tool, mainly in its precise mode, to obtain the security corpus to be trained on.

Step three: train a Word2vec model on the security corpus generated in step two, where Word2vec is pre-trained with the Skip-gram model.

It should be noted that steps one to three constitute the pre-training of the Word2vec model. The model is pre-trained so that it can be applied directly in the entity linking steps that follow.
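Skip-gram, the training objective named in step three, learns word vectors by predicting the words that surround each centre word. The sketch below shows only how such (centre, context) training pairs are derived from a segmented sentence. It assumes the tokens have already been produced by the word segmentation tool; the example tokens, the window size, and the function name `skipgram_pairs` are illustrative, and actual training would rely on a Word2vec implementation rather than this helper.

```python
def skipgram_pairs(tokens, window=2):
    """List the (centre, context) pairs a Skip-gram model is trained on."""
    pairs = []
    for center_idx, center in enumerate(tokens):
        # Context words lie within `window` positions on either side.
        start = max(0, center_idx - window)
        stop = min(len(tokens), center_idx + window + 1)
        for ctx_idx in range(start, stop):
            if ctx_idx != center_idx:
                pairs.append((center, tokens[ctx_idx]))
    return pairs


# Hypothetical segmented security text: "buffer", "overflow", "vulnerability".
tokens = ["缓冲区", "溢出", "漏洞"]
pairs = skipgram_pairs(tokens, window=1)
```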

Step four: construct a candidate entity query reference table in the network security field, and use it to quickly generate security candidate entities.

Step five: use a word segmentation tool to segment the to-be-linked security text corresponding to the entity mention, and likewise segment the security text from the security knowledge base corresponding to each security candidate entity.

Step six: input the segmented to-be-linked security text into the trained Word2vec model, which outputs a first character vector and a first word vector of the to-be-linked security text; simultaneously generate a corresponding first position vector, and add the first character vector, the first word vector and the first position vector to obtain a first joint embedded vector. Input the segmented security text into the trained Word2vec model, which outputs a second character vector and a second word vector of the security text; simultaneously generate a corresponding second position vector, and add the second character vector, the second word vector and the second position vector to obtain a second joint embedded vector.

The position vector is calculated by a formula in which the variable pos represents the position of the security word in the security text, the variable d represents the dimension of the security word vector, 2i denotes the even-numbered dimensions of the vector, and 2i+1 denotes the odd-numbered dimensions.

The formula for the joint embedding is:

V_Joint = V_char + V_word + V_position (2)

Step seven: sequentially input the first joint embedded vector and the second joint embedded vector into a BiLSTM model to obtain a first security text vector containing first context semantic information and a second security text vector containing second context semantic information; then input the first security text vector and the second security text vector into a CNN model to obtain first feature information and second feature information of the security texts, respectively.

Step eight: apply a neural network attention mechanism to the first feature information and the second feature information to enhance the corresponding security text features.

Step nine: compute the cosine similarity between the enhanced security text vector corresponding to the entity mention and the enhanced security text vector corresponding to each candidate entity, sort all resulting scores in descending order, and link the candidate entity with the highest score to the entity mention. The cosine similarity of the two security text features, taken as n-dimensional vectors A and B separated by the angle θ, is calculated as follows:

$$\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$

The above embodiments are preferred embodiments of the present invention, but the present invention is not limited thereto; any other modification or equivalent substitution that does not depart from the technical spirit of the present invention falls within the scope of the present invention.

Claims

1. An entity linking method applied in the field of network security is characterized by comprising the following steps:

2. The entity linking method applied in the field of network security as claimed in claim 1, wherein the training of Word2vec model comprises:

acquiring a network security text and cleaning;

segmenting the security text with a word segmentation tool, and inputting the segmented security text into the Word2vec model for pre-training.

3. The entity linking method applied in the network security field as claimed in claim 1, wherein the cleaning of the network security text includes format conversion, traditional-to-simplified Chinese conversion, and case conversion.

4. The entity linking method applied in the network security field as claimed in claim 1, wherein the calculation formula of the position vector in the second step and the third step is:

5. The entity linking method applied in the network security field as claimed in claim 1, wherein in step two and step three, the formula for performing the joint embedding is:

V_Joint = V_char + V_word + V_position (2)

6. The entity linking method applied in the network security field according to claim 1, wherein the calculation formula of the cosine similarity of the two security text features in the sixth step is: