CN112989832A - Entity linking method applied to network security field - Google Patents


Info

Publication number
CN112989832A
CN112989832A CN202110344549.2A CN202110344549A CN112989832A CN 112989832 A CN112989832 A CN 112989832A CN 202110344549 A CN202110344549 A CN 202110344549A CN 112989832 A CN112989832 A CN 112989832A
Authority
CN
China
Prior art keywords
vector
text
entity
security
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110344549.2A
Other languages
Chinese (zh)
Other versions
CN112989832B (en)
Inventor
陆以勤
谢树禄
覃健诚
李智鹏
陈帅豪
洪炜妍
陈嘉睿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT
Priority to CN202110344549.2A
Publication of CN112989832A
Application granted
Publication of CN112989832B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3347 Query execution using vector based model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/50 Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate

Abstract

The invention discloses an entity linking method for the network security field. The method comprises: generating security candidate entities by means of an entity query reference table; segmenting the to-be-linked security text corresponding to an entity mention and embedding it to obtain a first joint embedding vector; segmenting the security text corresponding to each security candidate entity and embedding it to obtain a second joint embedding vector; inputting the first and second joint embedding vectors in turn into a BiLSTM model and then a CNN model to obtain first and second feature information of the security texts; applying a neural-network attention mechanism to the feature information to enhance the corresponding security text features; and computing the cosine similarity between the enhanced security text vectors and linking the highest-scoring candidate entity to the entity mention, thereby achieving entity linking in the network security field. The invention effectively improves the performance of entity linking systems in the network security field.

Description

Entity linking method applied to network security field
Technical Field
The invention relates to the field of natural language processing, in particular to an entity linking method applied to the field of network security.
Background
The rapid development of modern computer technology has led to explosive growth of security data on the Internet. As network technology develops rapidly, the number of security incidents and of network security vulnerabilities used to attack networks keeps increasing.
To secure cyberspace effectively, network security experts now deploy cyberspace security monitoring systems at numerous key positions to detect various network security threats. These large monitoring systems hold large amounts of security data, and analyzing that data is important for preventing and controlling network security risks.
However, most current security data analysis is performed manually or relies on a single matching method. Analyzing security data with entity linking, a natural language processing technique, greatly improves analysis capability, gives network security monitoring personnel effective support in accurately judging the network security situation, and thereby improves the security of cyberspace. At present there is relatively little research on entity linking in the network security field, so such research has become more urgent.
Disclosure of Invention
The embodiment of the invention provides an entity linking method applied to the network security field, which effectively improves entity linking performance in that field.
An entity linking method applied to the network security field comprises the following steps:
Step one: construct a candidate entity query reference table for the network security field, and generate security candidate entities using the entity query reference table;
Step two: use a word segmentation tool to segment the to-be-linked security text corresponding to the entity mention; input the segmented to-be-linked security text into a trained Word2vec model, which outputs a first character vector and a first word vector of the to-be-linked security text; generate a corresponding first position vector; and add the first character vector, the first word vector and the first position vector to obtain a first joint embedding vector;
Step three: use a word segmentation tool to segment the security text from the security knowledge base corresponding to the security candidate entity; input the segmented security text into the trained Word2vec model, which outputs a second character vector and a second word vector of the security text; generate a corresponding second position vector; and add the second character vector, the second word vector and the second position vector to obtain a second joint embedding vector;
Step four: input the first joint embedding vector and the second joint embedding vector in turn into a BiLSTM model to obtain a first security text vector containing first context semantic information and a second security text vector containing second context semantic information; then input the first security text vector and the second security text vector separately into a CNN model to obtain first feature information and second feature information of the security texts;
Step five: apply a neural-network attention mechanism to the first feature information and the second feature information to enhance the corresponding security text features;
Step six: compute the cosine similarity between the enhanced security text vector corresponding to the entity mention and the enhanced security text vector corresponding to each candidate entity, sort the resulting scores in descending order, and link the highest-scoring candidate entity to the entity mention.
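The candidate generation in step one can be pictured as a dictionary lookup. The following is a minimal sketch only: the patent does not describe how the reference table is stored, so the alias-to-entity mapping, the normalization rule, the function names and the sample knowledge base below are all hypothetical.

```python
from collections import defaultdict


def build_reference_table(knowledge_base):
    """Map each normalized name or alias to the entity IDs it may refer to."""
    table = defaultdict(list)
    for entity_id, names in knowledge_base.items():
        for name in names:
            key = name.strip().lower()  # assumed normalization: trim and lowercase
            if entity_id not in table[key]:
                table[key].append(entity_id)
    return dict(table)


def generate_candidates(mention, table):
    """Return the candidate entity IDs for a mention, or an empty list."""
    return table.get(mention.strip().lower(), [])


# Hypothetical knowledge base: entity ID -> known names and aliases.
KB = {
    "E1": ["WannaCry", "WannaCrypt", "WCry"],
    "E2": ["EternalBlue", "MS17-010"],
    "E3": ["Trojan horse", "Trojan"],
    "E4": ["Remote access trojan", "RAT", "Trojan"],
}
TABLE = build_reference_table(KB)

print(generate_candidates("wcry", TABLE))    # ['E1']
print(generate_candidates("Trojan", TABLE))  # ['E3', 'E4']
```

An ambiguous mention such as "Trojan" returns several candidates, which is the case the later similarity ranking is meant to resolve.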
Preferably, training the Word2vec model comprises: acquiring network security text and cleaning it; segmenting the security text with a word segmentation tool; and inputting the segmented security text into the Word2vec model for pre-training.
Preferably, cleaning the network security text comprises format conversion, conversion between traditional and simplified Chinese, and case conversion.
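The cleaning stage can be sketched with the Python standard library. This is an illustration under assumptions, not the patented procedure: the text does not say what "format conversion" covers, so HTML stripping and full-width to half-width normalization are guesses, and the three-character traditional-to-simplified map is a hypothetical stand-in for a complete conversion table.

```python
import html
import re
import unicodedata

# Hypothetical, deliberately tiny traditional-to-simplified map.
# A real pipeline would need a complete conversion table.
T2S = str.maketrans({"網": "网", "絡": "络", "擊": "击"})


def clean_security_text(raw):
    """Apply assumed cleaning steps: format, script and case conversion."""
    text = html.unescape(raw)                   # decode entities such as &amp;
    text = re.sub(r"<[^>]+>", " ", text)        # drop HTML tags
    text = unicodedata.normalize("NFKC", text)  # full-width forms to half-width
    text = text.translate(T2S)                  # traditional to simplified
    text = text.lower()                         # case conversion
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace


print(clean_security_text("<p>ＷannaCry 網絡攻擊 &amp; MS17-010</p>"))
# wannacry 网络攻击 & ms17-010
```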
Preferably, the position vector in step two and step three is calculated as:
$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right) \qquad (1)$$
where pos is the position of the security word in the security text, d is the dimension of the security word vector, 2i indexes the even dimensions of d, and 2i+1 indexes the odd dimensions.
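The variables pos, d, 2i and 2i+1 match the widely used sinusoidal position encoding, in which even dimensions take a sine and odd dimensions a cosine. The sketch below assumes that form and the conventional base of 10000; the function name and the `base` parameter are illustrative and not taken from the patent.

```python
import math


def position_vector(pos, d, base=10000.0):
    """Sinusoidal position vector of dimension d for position pos (assumed form)."""
    vec = [0.0] * d
    for k in range(0, d, 2):             # k plays the role of 2i
        angle = pos / (base ** (k / d))
        vec[k] = math.sin(angle)         # even dimension 2i
        if k + 1 < d:
            vec[k + 1] = math.cos(angle)  # odd dimension 2i+1
    return vec


print(position_vector(0, 4))  # [0.0, 1.0, 0.0, 1.0]
```

Because the vector has the same dimension d as the embeddings, it can be added to them element by element.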
Preferably, in step two and step three, the joint embedding is computed as:
$$V_{Joint} = V_{char} + V_{word} + V_{position} \qquad (2)$$
where $V_{Joint}$ denotes the joint vector, $V_{char}$ the character vector, $V_{word}$ the word vector, and $V_{position}$ the position vector.
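Formula (2) is an element-wise sum, which requires the three vectors to share one dimension. The sketch below assumes that, and additionally assumes that the character vector of a multi-character word is the mean of its characters' vectors; the patent does not state how character and word vectors are aligned, so `mean_vector` is a hypothetical helper.

```python
def mean_vector(vectors):
    """Average a non-empty list of equal-length vectors (assumed pooling)."""
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def joint_embedding(v_char, v_word, v_position):
    """Element-wise sum V_char + V_word + V_position as in formula (2)."""
    if not (len(v_char) == len(v_word) == len(v_position)):
        raise ValueError("all three vectors must have the same dimension")
    return [c + w + p for c, w, p in zip(v_char, v_word, v_position)]


# Toy 2-dimensional vectors for one two-character word.
char_vectors = [[1.0, 0.0], [0.0, 1.0]]
v_char = mean_vector(char_vectors)  # [0.5, 0.5]
v_word = [0.25, 0.25]
v_position = [0.0, 1.0]

print(joint_embedding(v_char, v_word, v_position))  # [0.75, 1.75]
```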
Preferably, the cosine similarity of the two security text features in step six is calculated as:
$$\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}} \qquad (3)$$
where A and B are two n-dimensional vectors, A = [A1, A2, A3, ..., An] and B = [B1, B2, B3, ..., Bn], and θ is the angle between the vectors A and B.
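The scoring and ranking in step six can be sketched in a few lines. The vectors and entity IDs below are made-up toy values, and returning 0.0 for a zero vector is a choice made for this sketch rather than something the patent specifies.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # sketch-level convention for zero vectors
    return dot / (norm_a * norm_b)


def rank_candidates(mention_vec, candidate_vecs):
    """Score every candidate and sort the scores in descending order."""
    scored = [(cid, cosine_similarity(mention_vec, vec))
              for cid, vec in candidate_vecs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


mention = [1.0, 0.0]
candidates = {"E3": [1.0, 1.0], "E4": [3.0, 0.0], "E5": [0.0, 2.0]}
ranking = rank_candidates(mention, candidates)
print(ranking[0][0])  # E4
```

The first element of the ranking is the candidate that would be linked to the mention.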
Compared with the prior art, the invention has the following advantages:
Compared with existing entity linking methods in the network security field, the method uses a crawled network security corpus to pre-train the Word2vec model, which improves the fit between the Word2vec model and the network security field; it jointly embeds character, word and position vectors, which enriches the semantics of the security text; and it introduces deep learning models and an attention mechanism, which improves the performance of entity linking systems in the network security field.
Drawings
FIG. 1 is a schematic flow chart of the entity linking method applied to the network security field according to the present invention.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the invention.
As shown in FIG. 1, an entity linking method applied to the network security field comprises the following steps:
Step one: crawl a network security corpus using Python-based web crawler technology, mainly collecting security-related text from the web pages of national security vulnerability databases, Wikipedia, the security company 360, and similar sources.
Step two: segment the security text crawled in step one with the Jieba word segmentation tool, mainly in its accurate mode, to obtain the security corpus to be trained.
Step three: train a Word2vec model on the security corpus generated in step two, using the Skip-gram model for pre-training.
It should be noted that steps one to three constitute the pre-training of the Word2vec model. The model is pre-trained so that it can be applied directly in the entity linking steps that follow.
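To make the Skip-gram objective concrete, the sketch below generates the (center word, context word) pairs that a Skip-gram model learns to predict from a segmented sentence. It only illustrates how training pairs are formed; the actual network training, the window size and the sample sentence are not specified in the patent and are assumptions here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical segmented security sentence.
tokens = ["attacker", "exploits", "buffer", "overflow"]
for pair in skipgram_pairs(tokens, window=1):
    print(pair)
# ('attacker', 'exploits')
# ('exploits', 'attacker')
# ('exploits', 'buffer')
# ('buffer', 'exploits')
# ('buffer', 'overflow')
# ('overflow', 'buffer')
```

Training on pairs drawn from a security corpus is what ties the resulting vectors to security vocabulary rather than to general-domain usage.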
Step four: construct a candidate entity query reference table for the network security field, and use it to quickly generate security candidate entities.
Step five: use a word segmentation tool to segment the to-be-linked security text corresponding to the entity mention, and likewise segment the security text from the security knowledge base corresponding to the security candidate entity.
Step six: input the segmented to-be-linked security text into the trained Word2vec model, which outputs a first character vector and a first word vector of the to-be-linked security text; generate a corresponding first position vector; and add the first character vector, the first word vector and the first position vector to obtain a first joint embedding vector. Input the segmented security text into the trained Word2vec model, which outputs a second character vector and a second word vector of the security text; generate a corresponding second position vector; and add the second character vector, the second word vector and the second position vector to obtain a second joint embedding vector.
The position vector is calculated as:
$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right) \qquad (1)$$
where pos is the position of the security word in the security text, d is the dimension of the security word vector, 2i indexes the even dimensions of d, and 2i+1 indexes the odd dimensions.
The joint embedding is computed as:
$$V_{Joint} = V_{char} + V_{word} + V_{position} \qquad (2)$$
where $V_{Joint}$ denotes the joint vector, $V_{char}$ the character vector, $V_{word}$ the word vector, and $V_{position}$ the position vector.
Step seven: input the first joint embedding vector and the second joint embedding vector in turn into a BiLSTM model to obtain a first security text vector containing first context semantic information and a second security text vector containing second context semantic information; then input the first security text vector and the second security text vector separately into a CNN model to obtain first feature information and second feature information of the security texts.
Step eight: apply a neural-network attention mechanism to the first feature information and the second feature information to enhance the corresponding security text features.
Step nine: compute the cosine similarity between the enhanced security text vector corresponding to the entity mention and the enhanced security text vector corresponding to each candidate entity, sort the resulting scores in descending order, and link the highest-scoring candidate entity to the entity mention. The cosine similarity of the two security text features is calculated as:
$$\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}} \qquad (3)$$
where A and B are two n-dimensional vectors, A = [A1, A2, A3, ..., An] and B = [B1, B2, B3, ..., Bn], and θ is the angle between the vectors A and B.
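The patent does not say which form of attention is applied to the CNN feature information. One common choice, shown here purely as an assumption, is dot-product attention pooling: each feature vector is scored against a query vector, the scores are turned into weights with a softmax, and the weighted sum becomes the enhanced text vector. The query vector would normally be learned; here it is a fixed toy value, and the BiLSTM and CNN layers that would produce `features` are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features, query):
    """Weight each feature vector by its dot-product score against query."""
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in features]
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [sum(w * feat[k] for w, feat in zip(weights, features))
              for k in range(dim)]
    return pooled, weights


# Toy feature vectors standing in for CNN outputs at two positions.
features = [[1.0, 0.0], [0.0, 1.0]]

pooled, weights = attention_pool(features, query=[0.0, 0.0])
print(weights)  # [0.5, 0.5]
print(pooled)   # [0.5, 0.5]
```

With a non-zero query such as `[2.0, 0.0]`, the first feature vector receives the larger weight, so the pooled vector leans toward the positions the query considers relevant.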
The above-mentioned embodiments are preferred embodiments of the present invention, and the present invention is not limited thereto, and any other modifications or equivalent substitutions that do not depart from the technical spirit of the present invention are included in the scope of the present invention.

Claims (6)

1. An entity linking method applied to the network security field, characterized by comprising the following steps:
Step one: construct a candidate entity query reference table for the network security field, and generate security candidate entities using the entity query reference table;
Step two: use a word segmentation tool to segment the to-be-linked security text corresponding to the entity mention; input the segmented to-be-linked security text into a trained Word2vec model, which outputs a first character vector and a first word vector of the to-be-linked security text; generate a corresponding first position vector; and add the first character vector, the first word vector and the first position vector to obtain a first joint embedding vector;
Step three: use a word segmentation tool to segment the security text from the security knowledge base corresponding to the security candidate entity; input the segmented security text into the trained Word2vec model, which outputs a second character vector and a second word vector of the security text; generate a corresponding second position vector; and add the second character vector, the second word vector and the second position vector to obtain a second joint embedding vector;
Step four: input the first joint embedding vector and the second joint embedding vector in turn into a BiLSTM model to obtain a first security text vector containing first context semantic information and a second security text vector containing second context semantic information; then input the first security text vector and the second security text vector separately into a CNN model to obtain first feature information and second feature information of the security texts;
Step five: apply a neural-network attention mechanism to the first feature information and the second feature information to enhance the corresponding security text features;
Step six: compute the cosine similarity between the enhanced security text vector corresponding to the entity mention and the enhanced security text vector corresponding to each candidate entity, sort the resulting scores in descending order, and link the highest-scoring candidate entity to the entity mention.
2. The entity linking method applied to the network security field as claimed in claim 1, wherein training the Word2vec model comprises:
acquiring network security text and cleaning it;
segmenting the security text with a word segmentation tool, and inputting the segmented security text into the Word2vec model for pre-training.
3. The entity linking method applied to the network security field as claimed in claim 1, wherein cleaning the network security text comprises format conversion, conversion between traditional and simplified Chinese, and case conversion.
4. The entity linking method applied to the network security field as claimed in claim 1, wherein the position vector in step two and step three is calculated as:
$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right) \qquad (1)$$
where pos is the position of the security word in the security text, d is the dimension of the security word vector, 2i indexes the even dimensions of d, and 2i+1 indexes the odd dimensions.
5. The entity linking method applied to the network security field as claimed in claim 1, wherein in step two and step three the joint embedding is computed as:
$$V_{Joint} = V_{char} + V_{word} + V_{position} \qquad (2)$$
where $V_{Joint}$ denotes the joint vector, $V_{char}$ the character vector, $V_{word}$ the word vector, and $V_{position}$ the position vector.
6. The entity linking method applied to the network security field according to claim 1, wherein the cosine similarity of the two security text features in step six is calculated as:
$$\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}} \qquad (3)$$
where A and B are two n-dimensional vectors, A = [A1, A2, A3, ..., An] and B = [B1, B2, B3, ..., Bn], and θ is the angle between the vectors A and B.
CN202110344549.2A 2021-03-29 2021-03-29 Entity linking method applied to network security field Active CN112989832B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110344549.2A CN112989832B (en) 2021-03-29 2021-03-29 Entity linking method applied to network security field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110344549.2A CN112989832B (en) 2021-03-29 2021-03-29 Entity linking method applied to network security field

Publications (2)

Publication Number Publication Date
CN112989832A (en) 2021-06-18
CN112989832B (en) 2023-04-28

Family

ID=76338566

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110344549.2A Active CN112989832B (en) 2021-03-29 2021-03-29 Entity linking method applied to network security field

Country Status (1)

Country Link
CN (1) CN112989832B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110866399A (en) * 2019-10-24 2020-03-06 同济大学 Chinese short text entity identification and disambiguation method based on enhanced character vector
CN110991187A (en) * 2019-12-05 2020-04-10 北京奇艺世纪科技有限公司 Entity linking method, device, electronic equipment and medium
CN111310470A (en) * 2020-01-17 2020-06-19 西安交通大学 Chinese named entity recognition method fusing word and word features
CN111401049A (en) * 2020-03-12 2020-07-10 京东方科技集团股份有限公司 Entity linking method and device
CN111460820A (en) * 2020-03-06 2020-07-28 中国科学院信息工程研究所 Network space security domain named entity recognition method and device based on pre-training model BERT
CN111709241A (en) * 2020-05-27 2020-09-25 西安交通大学 Named entity identification method oriented to network security field
US20210034701A1 (en) * 2019-07-30 2021-02-04 Baidu Usa Llc Coreference-aware representation learning for neural named entity recognition

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
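林广和 et al.: "Research on named entity recognition based on fine-grained word representation" *
罗达 et al.: "Single-fact knowledge base question answering method based on a multi-angle attention mechanism", Computer Science (《计算机科学》) *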

Also Published As

Publication number Publication date
CN112989832B (en) 2023-04-28


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant