CN116306632A - Network threat information labeling method and device, electronic equipment and readable storage medium - Google Patents

Network threat information labeling method and device, electronic equipment and readable storage medium

Info

Publication number
CN116306632A
Authority
CN
China
Prior art keywords
network threat
training
threat information
target
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310172469.2A
Other languages
Chinese (zh)
Inventor
顾钊铨
贾焰
张欢
方滨兴
周可
景晓
王乐
谭昊
张钧建
唐可可
张登辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peng Cheng Laboratory
Original Assignee
Peng Cheng Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peng Cheng Laboratory
Priority to CN202310172469.2A
Publication of CN116306632A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/117 Tagging; Marking up; Designating a block; Setting of attributes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

The application discloses a method, a device, electronic equipment and a readable storage medium for labeling network threat information, applied to the technical field of network security. The method for labeling network threat information comprises the following steps: acquiring a target text, wherein the target text carries network threat information; performing sentence splitting on the target text to obtain at least one target sentence; predicting the network threat type of each target sentence according to the target sentence and a network threat label prediction model to obtain a network threat information label, and labeling each target sentence according to its network threat information label. The method and the device solve the technical problem of low labeling efficiency of network threat information.

Description

Network threat information labeling method and device, electronic equipment and readable storage medium
Technical Field
The present disclosure relates to the field of network security technologies, and in particular, to a method and apparatus for labeling network threat information, an electronic device, and a readable storage medium.
Background
Network threat information is an external information resource about direct or potential security threats. It helps security personnel quickly identify malicious attacks and respond and defend in time. Continuous tracking and analysis of network threat information is therefore an important task in network security protection. Such information is published on websites in unstructured form, so information extraction is required to obtain the critical, useful vulnerability and/or attack information.
With the rapid development of science and technology, the demand for network threat information keeps increasing. At present, texts are labeled by professional staff. Manual data labeling requires the labeling staff to have a solid background in network security knowledge and consumes a great deal of time, so the labeling efficiency of network threat information is low.
Disclosure of Invention
The main purpose of the application is to provide a method, a device, electronic equipment and a readable storage medium for labeling network threat information, which aim to solve the technical problem of low labeling efficiency of the network threat information in the prior art.
In order to achieve the above objective, the present application provides a method for labeling network threat information, where the method for labeling network threat information includes:
acquiring a target text, wherein the target text carries network threat information;
performing sentence splitting on the target text to obtain at least one target sentence;
according to the target sentences and the network threat label prediction model, predicting the network threat types of the target sentences respectively to obtain network threat information labels, and marking the target sentences according to the network threat information labels respectively.
In order to achieve the above object, the present application further provides a cyber threat information labeling device, where the cyber threat information labeling device includes:
the acquisition module is used for acquiring a target text, wherein the target text carries network threat information;
the splitting module is used for splitting the target text into sentences to obtain at least one target sentence;
the labeling module is used for respectively predicting the network threat type of each target sentence according to each target sentence and the network threat label prediction model to obtain a network threat information label, and respectively labeling each target sentence according to each network threat information label.
The application also provides an electronic device comprising: the system comprises a memory, a processor and a program of the network threat information labeling method, wherein the program of the network threat information labeling method is stored in the memory and can run on the processor, and the program of the network threat information labeling method can realize the steps of the network threat information labeling method when being executed by the processor.
The application also provides a computer readable storage medium, wherein the computer readable storage medium stores a program for realizing the network threat information labeling method, and the program for realizing the network threat information labeling method realizes the steps of the network threat information labeling method when being executed by a processor.
The present application also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the cyber threat information tagging method described above.
The application provides a method, a device, electronic equipment and a readable storage medium for labeling network threat information. Compared with having professionals label texts, the application acquires a target text, wherein the target text carries network threat information; performs sentence splitting on the target text to obtain at least one target sentence; predicts the network threat type of each target sentence according to the target sentence and the network threat label prediction model to obtain a network threat information label; and labels each target sentence according to its network threat information label. Through sentence splitting and the trained network threat label prediction model, texts carrying network threat information are labeled automatically. This avoids the technical defects that labeling personnel need a solid background in network security knowledge and that a large amount of time is consumed, and thereby improves the labeling efficiency of network threat information.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the description of the embodiments or the prior art will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a schematic flow chart of a first embodiment of a method for labeling network threat information according to the present application;
FIG. 2 is a schematic diagram of a model prediction flow of a network threat label prediction model related to a network threat information labeling method in an embodiment of the present application;
FIG. 3 is a schematic flow chart of a second embodiment of a method for labeling network threat information according to the present application;
FIG. 4 is a schematic diagram of a device structure related to a method for labeling network threat information in an embodiment of the present application;
FIG. 5 is a schematic device structure diagram of a hardware operating environment related to a network threat information labeling method in an embodiment of the present application.
The implementation, functional features and advantages of the present application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed Description
In order to make the above objects, features and advantages of the present application more comprehensible, the technical solutions of the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, but not all, of the embodiments of the present application. All other embodiments obtained by one of ordinary skill in the art based on the embodiments herein without inventive effort fall within the scope of protection of the present application.
Example 1
In a first embodiment of the network threat information labeling method, referring to fig. 1, the network threat information labeling method includes:
Step S10, a target text is obtained, wherein the target text carries network threat information;
As an example, step S10 includes: acquiring content typed by a user, and taking the part of the content that carries network threat information as the target text.
As an example, step S10 includes: performing sentence recognition on a target object to obtain a target text carrying network threat information, wherein the target object may be a webpage, a report or another carrier of text information.
Step S20, carrying out sentence splitting on the target text to obtain at least one target sentence;
Illustratively, punctuation marks in the target text are identified, and the target text is split into sentences with the punctuation marks as splitting marks to obtain at least one target sentence, wherein the punctuation marks include at least one of periods, exclamation marks and question marks; each target sentence comprises at least one sentence element, wherein a sentence element is a character and/or a word.
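The splitting in step S20 can be sketched as follows. This is a minimal illustration, not the implementation of the application: the function name `split_sentences` and the choice to accept both ASCII and full-width marks are assumptions made here.

```python
import re

# Sentence-ending marks named in step S20 (periods, exclamation marks, question marks).
# Including the full-width variants is an assumption for Chinese-language text.
SPLIT_MARKS = ".!?。！？"


def split_sentences(text: str) -> list[str]:
    """Split text right after each sentence-ending mark and drop empty pieces."""
    # The lookbehind matches the empty position after a mark, so the mark stays
    # attached to the sentence it closes.
    pieces = re.split(f"(?<=[{re.escape(SPLIT_MARKS)}])", text)
    return [piece.strip() for piece in pieces if piece.strip()]


sentences = split_sentences("APT29 used phishing. Was CVE-2021-1234 exploited? Yes!")
```

A splitter this simple also breaks on periods inside domain names, IP addresses or version numbers, which are common in threat reports, so a practical system would likely need extra rules for such tokens.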
Step S30, according to the target sentences and the network threat label prediction model, predicting the network threat types of the target sentences respectively to obtain network threat information labels, and marking the target sentences according to the network threat information labels respectively.
As an example, mapping each target sentence into a network threat type corresponding to each target sentence through a network threat label prediction model, generating a network threat information label according to the network threat type, and labeling each target sentence according to each network threat information label.
As an example, each target sentence is mapped to a network threat information label corresponding to each target sentence through a network threat label prediction model, and each target sentence is labeled according to each network threat information label.
Wherein, in step S30, the cyber threat tag prediction model includes a pre-trained language model and a conditional random field model,
the step of predicting the network threat type of each target sentence according to each target sentence and the network threat label prediction model to obtain a network threat information label comprises the following steps:
Step S31, obtaining the preset random inactivation times;
In this embodiment, it should be noted that the preset random inactivation times is a preset number of times that random node deactivation is performed on the hidden layer of the model.
It will be appreciated from this definition that the same preset random inactivation times applies in both model application and model training. During training, a larger value means a longer training time, but the training effect does not improve in proportion (that is, a longer training time does not necessarily yield a better model). The preset random inactivation times should therefore not be set too high; it may be 2, 3, or another value.
Step S32, carrying out semantic analysis on each target sentence through the pre-training language model according to the preset random inactivation times to obtain a semantic vector;
Illustratively, the pre-training language model is adjusted according to the preset random inactivation times, and semantic analysis is performed on each target sentence according to the adjusted pre-training language model to obtain semantic vectors with the number consistent with the preset random inactivation times.
Step S33, predicting the probability that each semantic vector belongs to each network threat type according to the conditional random field model to obtain a prediction result;
As an example, there is a single conditional random field model, and the probability that each semantic vector belongs to each network threat type is predicted with this one model to obtain a prediction result.
As an example, there are a plurality of conditional random field models, the number of the conditional random field models is consistent with the preset random inactivation times, and the probability that each corresponding semantic vector belongs to each network threat type is respectively predicted according to each conditional random field model, so as to obtain a prediction result.
And step S34, generating a network threat information label corresponding to each target statement according to the prediction result.
As an example, among the probabilities in the prediction result that each semantic vector belongs to each network threat type, the network threat type with the highest probability is selected as the network threat information label.
As an example, according to the conditional random field model, predicting to obtain the network threat information label corresponding to each semantic vector.
Specifically, obtaining model parameters of the conditional random field model, and determining the transition probability between every two adjacent network threat types according to the model parameters; and predicting the probability that each corresponding semantic vector belongs to each network threat information label according to the transition probability and the semantic vector.
Optionally, the step of predicting, according to the transition probability and the semantic vector, a probability that each corresponding semantic vector belongs to each cyber threat information tag may specifically include:
[Formula image BDA0004099721370000051 is not reproduced in the source text.]
where $S_g(x, Z)$ is the probability, output by the $g$-th conditional random field model, that the target sentence $x$ belongs to the network threat information label $Z$; the term shown in formula image BDA0004099721370000052 (also not reproduced) is the probability of transferring from label $z_i$ to label $z_{i+1}$; and $U_x[w_i][z_i]$ is the probability that word $w_i$ in target sentence $x$ belongs to the network threat information label $z_i$.
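Because the formula image is missing from this text, the exact expression cannot be confirmed. As a hedged illustration only, the sketch below uses the conventional linear-chain conditional random field score, which adds the per-word label scores to the label-to-label transition scores. The names `path_score` and `best_path`, the integer label indices, and the use of plain nested lists are all assumptions made here, not taken from the application.

```python
from itertools import product


def path_score(emissions, transitions, labels):
    """Score one label sequence under a conventional linear-chain CRF.

    emissions[i][z]   : score of word i under label z (plays the role of U_x[w_i][z_i])
    transitions[a][b] : score of moving from label a to label b
    labels            : one label index per word
    """
    emission_total = sum(emissions[i][z] for i, z in enumerate(labels))
    transition_total = sum(transitions[a][b] for a, b in zip(labels, labels[1:]))
    return emission_total + transition_total


def best_path(emissions, transitions):
    """Brute-force search for the highest-scoring label sequence (tiny inputs only)."""
    label_count = len(transitions)
    candidates = product(range(label_count), repeat=len(emissions))
    return max(candidates, key=lambda z: path_score(emissions, transitions, z))


# Two words, two labels, integer scores chosen for easy checking by hand.
emissions = [[1, 4], [3, 2]]
transitions = [[1, 2], [5, 1]]
score = path_score(emissions, transitions, [1, 0])  # 4 + 3 + 5
winner = best_path(emissions, transitions)
```

Exhaustive search is exponential in sentence length; practical CRF decoders use the Viterbi algorithm instead, and the brute-force version is kept here only because it is easy to verify.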
In step S32, the step of performing semantic analysis on each target sentence through the pre-training language model according to the preset random inactivation times to obtain a semantic vector includes:
step A10, according to the preset random inactivation times, carrying out random node inactivation on the hidden layer of the pre-training language model;
As an example, the pre-trained language model has a plurality of hidden layers, the number of hidden layers is consistent with the preset random inactivation times, and random node deactivation is performed on each hidden layer to obtain the hidden layers after random node deactivation.
Specifically, the hidden layer includes a self-attention mechanism layer and a feedforward layer, and random node deactivation is performed on the self-attention mechanism layer and/or the feedforward layer.
And step A20, carrying out semantic analysis on each target sentence through the pre-training language model after the random node is deactivated, and obtaining a semantic vector.
As an example, semantic analysis is performed on each target sentence through the pre-training language model under each hidden layer, so as to obtain a semantic vector.
As an example, random node deactivation is performed on the hidden layer of the pre-training language model, and semantic analysis is performed on each target sentence through the pre-training language model after the random node deactivation to obtain a semantic vector. The total number of random node deactivations performed on the hidden layer is accumulated and compared with the preset random inactivation times. If the total is less than or equal to the preset random inactivation times, the method returns to the step of performing random node deactivation on the hidden layer of the pre-training language model and repeats the subsequent steps, until the total is greater than the preset random inactivation times.
Specifically, the hidden layer of the pre-training language model comprises an embedded layer, a self-attention mechanism layer, a normalization layer, a feedforward layer, a summation and normalization layer and a full connection layer. And respectively carrying out node random inactivation on the feedforward layer and the self-attention mechanism layer according to the preset random inactivation times to obtain a node random inactivation feedforward layer and a node random inactivation self-attention mechanism layer. Inputting each target sentence into the pre-training language model, and sequentially processing each target sentence through the embedding layer, the node random inactivation self-attention mechanism layer, the normalization layer, the random inactivation feedforward layer and the summation and normalization layer to obtain a processing vector, and performing feature integration on the processing vector through a preset number of full-connection layers to obtain a semantic vector, wherein the preset number can be 2.
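The repeated random node deactivation of steps A10 and A20 can be sketched with a toy example. This is only an illustration under stated assumptions: a real system would apply dropout inside the self-attention and feedforward layers of a pre-trained language model, whereas here a plain list of floats stands in for a hidden representation. The function names, the dropout rate, and the rescaling of surviving values (the common "inverted dropout" convention) are assumptions and are not specified in the application.

```python
import random


def dropout(vector, rate, rng):
    """Zero each element with probability `rate` and rescale the survivors."""
    keep = 1.0 - rate
    return [value / keep if rng.random() < keep else 0.0 for value in vector]


def encode_with_dropout(hidden, passes, rate=0.1, seed=0):
    """Run `passes` independent dropout passes, giving one vector per pass.

    `passes` plays the role of the preset random inactivation times, so the
    number of returned vectors equals that preset value.
    """
    rng = random.Random(seed)
    return [dropout(hidden, rate, rng) for _ in range(passes)]


vectors = encode_with_dropout([1.0, 2.0, 3.0, 4.0], passes=3, rate=0.5)
```

Each pass uses a different random mask, so the same sentence yields several slightly different semantic vectors, which is what allows the later steps to compare or combine the predictions made from them.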
As an example, referring to fig. 2, fig. 2 is a schematic model prediction flow diagram of a network threat tag prediction model related to a network threat information labeling method in an embodiment of the present application. Fig. 2 includes: an embedding layer, a self-attention mechanism layer, a normalization layer, a feedforward layer, a summation and normalization layer and a conditional random field model. Inputting a target text (x in the diagram) into a network threat label prediction model, sequentially processing the target text through an embedding layer, a self-attention mechanism layer, a normalization layer, a feedforward layer, a summation and normalization layer and a conditional random field model in the network threat label prediction model to obtain the probability that the target text belongs to each network threat information label, and taking the network threat information label corresponding to the maximum probability in each probability as a final label.
The embodiment of the application provides a method for labeling network threat information. Compared with having professionals label texts, the embodiment acquires a target text, wherein the target text carries network threat information; performs sentence splitting on the target text to obtain at least one target sentence; predicts the network threat type of each target sentence according to the target sentence and the network threat label prediction model to obtain a network threat information label; and labels each target sentence according to its network threat information label. Through sentence splitting and the trained network threat label prediction model, texts carrying network threat information are labeled automatically. This avoids the technical defects that labeling personnel need a solid background in network security knowledge and that a large amount of time is consumed, and thereby improves the labeling efficiency of network threat information.
Example two
Further, referring to FIG. 3, in another embodiment of the present application, content that is the same as or similar to the first embodiment is described above and is not repeated here. On this basis, in step S30, before the step of predicting the network threat type of each target sentence according to each target sentence and the network threat label prediction model to obtain a network threat information label, the method further includes:
Step B10, a training sample, a real label corresponding to the training sample, preset random inactivation times and a network threat label prediction model to be trained are obtained, wherein the network threat label prediction model to be trained comprises a language model to be trained and a conditional random field model to be trained;
in this embodiment, it should be noted that the training sample is at least one sentence including cyber threat information for performing model training on a cyber threat label prediction model to be trained. The real label is real network threat type information of the training sample. The network threat label prediction model to be trained is an untrained network threat label prediction model.
The method comprises the steps of obtaining preset random inactivation times set by a user or set by a system, training samples, real labels corresponding to the training samples and a network threat label prediction model to be trained.
Step B20, carrying out semantic analysis on each target sentence through the language model to be trained according to the preset random inactivation times to obtain a training semantic vector;
step B30, predicting the probability that each training semantic vector belongs to each network threat information label according to the conditional random field model to be trained, and obtaining a training prediction result;
For example, the specific implementation steps of step B20 to step B30 may refer to the specific implementation contents of step S32 to step S34 and step a10 to step a20, which are not described herein.
Step B40, constructing model loss of the network threat label prediction model to be trained according to the training prediction result, the preset random inactivation times and the real label;
and step B50, performing iterative optimization on the network threat tag prediction model to be trained according to the model loss to obtain the network threat tag prediction model.
Illustratively, steps B40-B50 include: constructing model loss of the network threat label prediction model to be trained according to the training prediction result, the preset random inactivation times and the real label; judging whether the model loss is converged or not, and if the model loss is converged, taking the network threat label prediction model to be trained as the network threat label prediction model; and if the model loss is not converged, returning to the step and the subsequent steps of acquiring the training sample, the real label corresponding to the training sample, the preset random inactivation times and the network threat label prediction model to be trained until the model loss is converged.
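The control flow of steps B40 and B50 can be sketched as a loop that repeats a training round until the loss converges. The application does not state how convergence is judged, so the criterion used here (the loss changes by less than a tolerance between consecutive rounds) is an assumption, as are the names `train_until_converged`, `train_step`, `tol` and `max_rounds`.

```python
def train_until_converged(train_step, tol=1e-4, max_rounds=1000):
    """Call `train_step` repeatedly until the returned loss stops changing.

    train_step : hypothetical callable that runs one training round (fetch the
                 samples, predict, build the loss, update the model) and
                 returns the loss as a float.
    Returns (rounds_run, last_loss).
    """
    previous = None
    for round_index in range(1, max_rounds + 1):
        loss = train_step()
        # Assumed convergence test: negligible change since the previous round.
        if previous is not None and abs(previous - loss) < tol:
            return round_index, loss
        previous = loss
    return max_rounds, previous


# Stand-in for real training: a fixed sequence of loss values.
fake_losses = iter([1.0, 0.5, 0.25, 0.25, 0.1])
rounds, final_loss = train_until_converged(lambda: next(fake_losses))
```

The `max_rounds` cap is a safeguard added for the sketch so that a loss that never settles cannot loop forever.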
In step B10, the obtaining the training sample and the real label corresponding to the training sample includes:
step B11, analyzing the annual reports and the network page information of each security manufacturer to obtain an original sample;
in this embodiment, it should be noted that the original sample may be an APT (Advanced Persistent Threat ) report.
As an example, after step B10, the raw sample is subjected to data cleaning.
Step B12, sample screening and sample content preprocessing are carried out on the original sample to obtain a processed sample;
It can be understood that a directly obtained original sample, for example an APT report, contains detailed information about attack events, attack analysis and the like, and is therefore often long. If the original sample were split directly into sentences and the model were trained on every resulting sentence, the efficiency of model training would easily be low.
Illustratively, sample screening is performed on the original sample to obtain a screened sample, and deduplication processing is performed on the screened sample to obtain a processed sample.
As an example, the text length of each text in the original sample is obtained, and the texts whose length is greater than or equal to a preset length threshold are taken as the length screening sample. The preset length threshold is a preset critical value that ensures the content richness of the original sample, and may be 400 bytes. Text keywords are then extracted from each text in the length screening sample, and the texts whose keyword count is greater than a preset number of times threshold and whose keyword frequency is greater than a preset frequency threshold are taken as the screening sample. The text keywords are words related to the network security field. The preset number of times threshold is a preset critical value for judging that the keyword count of a text meets the requirement of network threat information richness, and the preset frequency threshold is a preset critical value for judging that the keyword frequency of a text meets the same requirement.
Optionally, the step of screening the sample in which the number of times of the keywords of the text in each text in the length screening sample is greater than a preset number of times threshold and the frequency of the keywords is greater than a preset frequency threshold may specifically include:
$$\mathrm{len}(doc \cap V) > a$$
where $\mathrm{len}(doc \cap V)$ is the keyword count of the text, $V$ is the set of text keywords, and $a$ is the preset number of times threshold.
$$p(doc \cap V) = \frac{\mathrm{sum}(w \mid w \in V \cap doc)}{\mathrm{len}(doc)} > b$$
where $p(doc \cap V)$ is the keyword frequency of the text, $\mathrm{sum}(w \mid w \in V \cap doc)$ is the total number of occurrences of the text keywords, $\mathrm{len}(doc)$ is the text length of the length screening sample, and $b$ is the preset frequency threshold.
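The length, keyword count and keyword frequency screens can be sketched as below. Several details are assumptions made for illustration: the keyword count is taken as the number of distinct keywords present in the text, the text length in the frequency ratio is measured in tokens rather than bytes, and the function names and threshold values in the example are hypothetical.

```python
def long_enough(text: str, min_bytes: int = 400) -> bool:
    """Length screen: keep texts of at least `min_bytes` bytes (UTF-8 assumed)."""
    return len(text.encode("utf-8")) >= min_bytes


def passes_keyword_screen(tokens, vocabulary, count_threshold, frequency_threshold):
    """Keyword screen: both the count and the frequency must exceed their thresholds."""
    if not tokens:
        return False
    # Assumed reading of len(doc ∩ V): number of distinct keywords in the text.
    distinct_keywords = len(set(tokens) & set(vocabulary))
    # sum(w | w ∈ V ∩ doc): total keyword occurrences in the text.
    occurrences = sum(1 for token in tokens if token in vocabulary)
    frequency = occurrences / len(tokens)
    return distinct_keywords > count_threshold and frequency > frequency_threshold


tokens = "the malware exploits a vulnerability and the malware spreads".split()
vocabulary = {"malware", "vulnerability", "exploit"}
keep = passes_keyword_screen(tokens, vocabulary, count_threshold=1, frequency_threshold=0.3)
```

In this example the text has 9 tokens, 2 distinct keywords and 3 keyword occurrences, so it passes a count threshold of 1 and a frequency threshold of 0.3.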
As an example, the words contained in each text in the screening sample and the release time of each text are obtained, and the texts whose word repetition rate is less than or equal to a preset repetition rate threshold and/or whose release times are separated by at least a preset time interval are selected as the processed sample. The preset repetition rate threshold is a preset critical value of the word repetition rate, above which texts in the screening sample are judged to share too much repeated content. The preset time interval is a preset critical value of the release interval, below which texts in the screening sample are judged to have been released too close together. For example, texts in the screening sample with a word repetition rate of less than or equal to 0.03, and/or with release times at least 7 days apart, are selected as the processed sample.
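One possible reading of this deduplication step is sketched below. The application does not define how the word repetition rate is computed or how the two conditions are combined, so the choices here are assumptions: the repetition rate is taken as the Jaccard overlap of the two texts' word sets, and a text is dropped only when it both overlaps an already kept text and was released within the time interval of it. The function names are hypothetical.

```python
from datetime import date


def repetition_rate(words_a, words_b):
    """Assumed definition: Jaccard overlap between the two word sets."""
    set_a, set_b = set(words_a), set(words_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def deduplicate(texts, max_repetition=0.03, min_gap_days=7):
    """Keep texts that are not near-duplicates of an earlier kept text.

    texts : list of (words, release_date) pairs, processed in order.
    """
    kept = []
    for words, released in texts:
        is_duplicate = any(
            repetition_rate(words, kept_words) > max_repetition
            and abs((released - kept_date).days) < min_gap_days
            for kept_words, kept_date in kept
        )
        if not is_duplicate:
            kept.append((words, released))
    return kept


texts = [
    (["apt", "phishing"], date(2023, 1, 1)),
    (["apt", "phishing"], date(2023, 1, 3)),   # same words, 2 days later: dropped
    (["apt", "phishing"], date(2023, 2, 1)),   # same words, 31 days later: kept
    (["ransomware"], date(2023, 1, 2)),        # no shared words: kept
]
processed = deduplicate(texts)
```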
It can be understood that the original samples are screened and de-duplicated, so that the obtained processed samples have rich network security knowledge and have reduced processing capacity, and the processed samples are used for model training, so that the model training efficiency is improved on the premise of ensuring the model training accuracy.
Step B13, carrying out sentence splitting on the processed sample to obtain the training sample;
for example, the implementation of step B13 may refer to the implementation of step S20, which is not described herein.
And step B14, determining the real labels corresponding to the training samples according to the training samples.
It can be understood that when model training is performed at present, manual labeling is required for the real labels of all samples, and manual data labeling requires labeling personnel to have a solid network security knowledge background, and a great deal of time is required to be consumed, so that the time required for determining the real labels is long, and further the model training efficiency is low.
Through the processing of the original sample, a training sample is obtained, the volume of the text to be marked is reduced to a certain extent, the time required for determining the real label is reduced, and the model training efficiency is improved.
Wherein in step B14, the real label includes an entity boundary, an entity category, a relationship type and an entity location,
the step of determining the real label corresponding to each training sample according to each training sample comprises the following steps:
step C10, determining at least one entity in each training sample;
in this embodiment, it should be noted that the entity is a network space security entity, which is used to describe a basic constituent unit in the network security domain.
Step C20, acquiring a preset labeling mode, and determining the entity boundary of each entity according to the preset labeling mode;
In this embodiment, it should be noted that the preset labeling mode is a preset mode for labeling entity boundaries, and may be the BIOES (begin, inside, outside, end, single) mode. B indicates that a word in a training sample is at the start of an entity, I indicates that it is inside an entity, E indicates that it is at the end of an entity, S indicates that it forms a single-word entity by itself, and O indicates that it does not belong to any entity.
Step C30, performing category analysis on each entity to obtain an entity category;
In this embodiment, it should be noted that the entity category includes at least one of organization, location, software, malware, indicator, vulnerability, course-of-action, tool, attack pattern, industry and technology.
Step C40, taking the position information corresponding to each entity as the entity position;
Illustratively, the location information of each entity in the relationship triplet is determined and taken as the entity location, the location information indicating whether the entity is the head entity or the tail entity of the triplet.
Step C50, determining the relationship between every two entities to obtain the relationship type.
In this embodiment, it should be noted that the relationship type includes at least one of uses, com-from, has-vulnerability, has-product, related-to, include, belong, track and has-target.
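As a minimal sketch of steps C10 to C50, the real label of one training sample might be represented as follows. The sketch assumes whitespace tokenization and non-overlapping entities; the names `Entity`, `Relation` and `bioes_tags`, the sample sentence, and the made-up entity names in it are hypothetical and do not come from the application.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entity:
    start: int      # index of the first token of the entity
    end: int        # index of the last token of the entity (inclusive)
    category: str   # entity category, e.g. "Malware"


@dataclass
class Relation:
    head: Entity    # head entity of the relationship triplet
    relation: str   # relationship type, e.g. "uses"
    tail: Entity    # tail entity of the relationship triplet


def bioes_tags(num_tokens: int, entities: List[Entity]) -> List[str]:
    """Mark entity boundaries with the BIOES scheme (assumes no overlaps)."""
    tags = ["O"] * num_tokens
    for ent in entities:
        if ent.start == ent.end:
            # A one-token entity is tagged S (single).
            tags[ent.start] = f"S-{ent.category}"
            continue
        tags[ent.start] = f"B-{ent.category}"
        for i in range(ent.start + 1, ent.end):
            tags[i] = f"I-{ent.category}"
        tags[ent.end] = f"E-{ent.category}"
    return tags


# Hypothetical sample: "GroupA" and "FooRAT" are invented names.
tokens = "GroupA uses the FooRAT backdoor".split()
org = Entity(start=0, end=0, category="Organization")
mal = Entity(start=3, end=4, category="Malware")
tags = bioes_tags(len(tokens), [org, mal])
triple = Relation(head=org, relation="uses", tail=mal)
print(list(zip(tokens, tags)))
```

Keeping the boundary tags per token and the relationship as a separate head/relation/tail record mirrors the four components of the real label named in step B14; other encodings are equally possible.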
In step B30, the step of constructing the model loss of the network threat label prediction model to be trained according to the training prediction result, the preset random inactivation times and the real label includes:
Step B31, constructing the conditional random field model loss according to the training prediction result, the preset random inactivation times and the real labels;
the probability that each training semantic vector belongs to each network threat information label is determined according to the training prediction result, and the conditional random field model loss is constructed according to the probability that each training semantic vector belongs to each network threat information label, the preset random inactivation times and the real label.
Optionally, the step of constructing the conditional random field model loss according to the probability that each training semantic vector belongs to each cyber threat information tag, the preset random inactivation times and the real tag may specifically include:
[Formula image BDA0004099721370000111, not reproduced in this text]
where $L^{(CRF)}$ is the conditional random field model loss, $k$ is the preset random inactivation times, $S_g(x_1, Z_1)$ is the probability, output by the $g$-th conditional random field model to be trained, that the training semantic vector $x_1$ belongs to the network threat information label $Z_1$, $S_g(x_1, Z_2)$ is the probability, output by the $g$-th conditional random field model to be trained, that $x_1$ belongs to the network threat information label $Z_2$, and $S_g(x_1, Z_N)$ is the probability, output by the $g$-th conditional random field model to be trained, that $x_1$ belongs to the network threat information label $Z_N$.
Step B32, constructing a relative entropy loss corresponding to the preset random inactivation times according to the training prediction result and the preset random inactivation times;
illustratively, a target training prediction result with the highest probability and a target adjacent training prediction result adjacent to the target training prediction result are selected from the training prediction results, a first divergence value of the target training prediction result relative to the target adjacent training prediction result and a second divergence value of the target adjacent training prediction result relative to the target training prediction result are determined according to the target training prediction result and the target adjacent training prediction result, and a relative entropy loss corresponding to the preset random inactivation number is constructed according to the first divergence value, the second divergence value and the preset random inactivation number.
Optionally, the step of constructing the relative entropy loss corresponding to the preset random inactivation times according to the first divergence value, the second divergence value and the preset random inactivation times may specifically include:
[Formula image BDA0004099721370000121, not reproduced in this text]
where $L^{(KL)}$ is the relative entropy loss corresponding to the preset random inactivation times, $k$ is the preset random inactivation times, $KL(P^*_j \,\|\, P^*_{j+1})$ is the first divergence value of the target training prediction result with respect to the target adjacent training prediction result, and $KL(P^*_{j+1} \,\|\, P^*_j)$ is the second divergence value of the target adjacent training prediction result with respect to the target training prediction result.
Optionally, the step of determining the first divergence value of the target training prediction result with respect to the target neighboring training prediction result according to the target training prediction result and the target neighboring training prediction result may specifically include:
[Formula image BDA0004099721370000122, not reproduced in this text]
where $KL(P^*_j \,\|\, P^*_{j+1})$ is the first divergence value of the target training prediction result with respect to the target adjacent training prediction result, $P^*_j$ is the target training prediction result, and $P^*_{j+1}$ is the target adjacent training prediction result.
Optionally, the step of determining the second divergence value of the target neighboring training prediction result with respect to the target training prediction result according to the target training prediction result and the target neighboring training prediction result may specifically include:
[Formula image BDA0004099721370000131, not reproduced in this text]
where $KL(P^*_{j+1} \,\|\, P^*_j)$ is the second divergence value of the target adjacent training prediction result with respect to the target training prediction result, $P^*_j$ is the target training prediction result, and $P^*_{j+1}$ is the target adjacent training prediction result.
And step B33, integrating the conditional random field model loss and the relative entropy loss into a model loss of the network threat label prediction model to be trained.
Illustratively, the sum of the conditional random field model loss and the relative entropy loss is taken as the model loss of the network threat tag predictive model to be trained.
Optionally, the step of integrating the conditional random field model loss and the relative entropy loss into a model loss of the network threat tag prediction model to be trained may specifically include:
$L = L^{(KL)} + L^{(CRF)}$
where $L$ is the model loss of the network threat label prediction model to be trained, $L^{(KL)}$ is the relative entropy loss, and $L^{(CRF)}$ is the conditional random field model loss.
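The formula images for the individual loss terms are not available in this text, so the following is only a sketch under stated assumptions, not the applicant's exact formulas. It uses the standard definition of relative entropy for discrete distributions, KL(P || Q) = sum_i P_i log(P_i / Q_i), evaluates it in both directions for two prediction distributions, and adds the result to a given conditional random field loss, following the sum of the relative entropy loss and the conditional random field model loss described in step B33. How the two divergence values are weighted by the preset random inactivation times cannot be recovered from the text, so the sketch simply sums them. The function names and example numbers are hypothetical.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Standard relative entropy KL(p || q) for discrete distributions.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def total_loss(p_j: Sequence[float], p_j1: Sequence[float], crf_loss: float) -> float:
    """Sum of a bidirectional KL term and a given CRF loss (assumed weighting)."""
    first = kl_divergence(p_j, p_j1)    # first divergence value
    second = kl_divergence(p_j1, p_j)   # second divergence value
    kl_loss = first + second            # assumption: plain sum, no k weighting
    return kl_loss + crf_loss


# Hypothetical label distributions from two stochastic forward passes.
p_j = [0.7, 0.2, 0.1]
p_j1 = [0.6, 0.3, 0.1]
print(total_loss(p_j, p_j1, crf_loss=1.25))
```

The bidirectional term is zero when the two passes agree exactly and grows as they diverge, which is why it can act as a consistency penalty on top of the conditional random field loss.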
The embodiment of the application provides a network threat information labeling method, which comprises the steps of obtaining a training sample, a real label corresponding to the training sample, preset random inactivation times and a network threat label prediction model to be trained, wherein the network threat label prediction model to be trained comprises a language model to be trained and a conditional random field model to be trained; according to the preset random inactivation times, carrying out semantic analysis on each target sentence through the language model to be trained to obtain a training semantic vector; predicting the probability that each training semantic vector belongs to each network threat information label according to the conditional random field model to be trained, and obtaining a training prediction result; constructing model loss of the network threat label prediction model to be trained according to the training prediction result, the preset random inactivation times and the real label; according to the model loss, iterative optimization is carried out on the network threat label prediction model to be trained to obtain the network threat label prediction model, so that training efficiency is high when the network threat label prediction model to be trained is trained, and model prediction accuracy of the network threat label prediction model obtained through training is high.
Example III
The embodiment of the application also provides a network threat information labeling device, referring to fig. 4, the network threat information labeling device includes:
the acquisition module is used for acquiring a target text, wherein the target text carries network threat information;
the splitting module is used for splitting the target text into sentences to obtain at least one target sentence;
the labeling module is used for respectively predicting the network threat type of each target sentence according to each target sentence and the network threat label prediction model to obtain a network threat information label, and respectively labeling each target sentence according to each network threat information label.
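The three modules above can be sketched as three small functions. This is an illustration only: the regular-expression sentence splitter and the keyword-based `keyword_stub` predictor are hypothetical stand-ins (a trained network threat label prediction model would take the predictor's place), and none of the function names come from the application.

```python
import re
from typing import Callable, List, Tuple


def acquire_text(source: str) -> str:
    """Acquisition module: here the target text is simply passed in as a string."""
    return source.strip()


def split_sentences(text: str) -> List[str]:
    """Splitting module: naive split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def label_sentences(
    sentences: List[str], predict: Callable[[str], str]
) -> List[Tuple[str, str]]:
    """Labeling module: attach the predicted label to each target sentence."""
    return [(s, predict(s)) for s in sentences]


def keyword_stub(sentence: str) -> str:
    """Placeholder predictor, NOT the patent's model: a single keyword rule."""
    return "Malware" if "malware" in sentence.lower() else "O"


text = acquire_text("  The group deployed new malware. Victims were in finance.  ")
sentences = split_sentences(text)
labeled = label_sentences(sentences, keyword_stub)
print(labeled)
```

Passing the predictor in as a callable keeps the splitting and labeling steps independent of whichever model is eventually used.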
Optionally, the cyber threat tag prediction model includes a pre-training language model and a conditional random field model, and the labeling module is further configured to:
acquiring preset random inactivation times;
according to the preset random inactivation times, carrying out semantic analysis on each target sentence through the pre-training language model to obtain a semantic vector;
predicting the probability that each semantic vector belongs to each network threat type according to the conditional random field model to obtain a prediction result;
And generating a network threat information label corresponding to each target statement according to the prediction result.
Optionally, the labeling module is further configured to:
according to the preset random inactivation times, carrying out random node inactivation on the hidden layer of the pre-training language model;
and carrying out semantic analysis on each target sentence through the pre-training language model after the random node is deactivated, so as to obtain a semantic vector.
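The two operations above can be sketched as repeated forward passes with random node deactivation (dropout). The sketch assumes that the preset random inactivation times is the number of stochastic passes; the dropout rate, the inverted-dropout rescaling, the use of a plain list in place of a hidden layer, and the function names are all assumptions made for illustration.

```python
import random
from typing import List


def dropout(hidden: List[float], rate: float, rng: random.Random) -> List[float]:
    """Zero each hidden unit with probability `rate`; rescale the survivors.

    Requires 0 <= rate < 1. Rescaling by 1 / (1 - rate) is the usual
    inverted-dropout convention and is an assumption here.
    """
    keep = 1.0 - rate
    return [h / keep if rng.random() < keep else 0.0 for h in hidden]


def encode_k_times(
    hidden: List[float], k: int, rate: float = 0.1, seed: int = 0
) -> List[List[float]]:
    """Run k independent dropout passes over the same hidden activations."""
    rng = random.Random(seed)
    return [dropout(hidden, rate, rng) for _ in range(k)]


# Stand-in hidden activations for one target sentence (hypothetical values).
hidden = [1.0, 2.0, 3.0]
passes = encode_k_times(hidden, k=4, rate=0.5)
for p in passes:
    print(p)
```

Each pass yields a slightly different vector for the same sentence, which is what allows the outputs of different passes to be compared against one another.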
Optionally, before the step of predicting the cyber threat type of each target sentence according to each target sentence and the cyber threat tag prediction model to obtain the cyber threat information tag, the cyber threat information labeling device is further configured to:
acquiring a training sample, a real label corresponding to the training sample, preset random inactivation times and a network threat label prediction model to be trained, wherein the network threat label prediction model to be trained comprises a language model to be trained and a conditional random field model to be trained;
according to the preset random inactivation times, carrying out semantic analysis on each target sentence through the language model to be trained to obtain a training semantic vector;
predicting the probability that each training semantic vector belongs to each network threat information label according to the conditional random field model to be trained, and obtaining a training prediction result;
Constructing model loss of the network threat label prediction model to be trained according to the training prediction result, the preset random inactivation times and the real label;
and carrying out iterative optimization on the network threat tag prediction model to be trained according to the model loss to obtain the network threat tag prediction model.
Optionally, the network threat information labeling device is further configured to:
analyzing the annual reports and the network page information of each security manufacturer to obtain an original sample;
sample screening and sample content preprocessing are carried out on the original sample to obtain a processed sample;
performing sentence splitting on the processing sample to obtain the training sample;
and determining the real labels corresponding to the training samples according to the training samples.
Optionally, the real label includes an entity boundary, an entity category, a relationship type and an entity location, and the cyber threat information labeling device is further configured to:
determining at least one entity in each of the training samples;
acquiring a preset annotation mode, and determining the entity boundary of each entity according to the preset annotation mode;
performing category analysis on each entity to obtain an entity category;
Taking the position information corresponding to each entity as the entity position;
and determining the relation between every two entities to obtain the relation type.
Optionally, the network threat information labeling device is further configured to:
constructing the conditional random field model loss according to the training prediction result, the preset random inactivation times and the real labels;
constructing a relative entropy loss corresponding to the preset random inactivation times according to the training prediction result and the preset random inactivation times;
and integrating the conditional random field model loss and the relative entropy loss into a model loss of the network threat label prediction model to be trained.
The network threat information labeling device provided by the application adopts the network threat information labeling method in the embodiment, so that the technical problem of low labeling efficiency of the network threat information is solved. Compared with the prior art, the beneficial effects of the network threat information labeling device provided by the embodiment of the application are the same as those of the network threat information labeling method provided by the embodiment, and other technical features in the network threat information labeling device are the same as those disclosed by the method of the embodiment, so that details are not repeated.
Example IV
An embodiment of the present application provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the cyber threat information labeling method of the above embodiment.
Referring now to fig. 5, a schematic diagram of an electronic device suitable for use in implementing embodiments of the present disclosure is shown. The electronic devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (Personal Digital Assistants), PADs (tablet computers), PMPs (Portable Media Players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 5 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
As shown in fig. 5, the electronic apparatus may include a processing device (e.g., a central processing unit, a graphics processor, etc.) that may perform various appropriate actions and processes according to a program stored in a ROM (Read-Only Memory) or a program loaded from a storage device into a RAM (Random Access Memory ). In the RAM, various programs and data required for the operation of the electronic device are also stored. The processing device, ROM and RAM are connected to each other via a bus. An input/output (I/O) interface is also connected to the bus.
In general, the following systems may be connected to the I/O interface: input devices including, for example, touch screens, touch pads, keyboards, mice, image sensors, microphones, accelerometers, gyroscopes, etc.; output devices including, for example, liquid crystal displays (LCDs), speakers, vibrators, etc.; storage devices including, for example, magnetic tape, hard disk, etc.; and a communication device. The communication device may allow the electronic device to communicate with other devices wirelessly or by wire to exchange data. While electronic devices having various systems are shown in the figures, it should be understood that not all of the illustrated systems are required to be implemented or provided. More or fewer systems may alternatively be implemented or provided.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via a communication device, or installed from a storage device, or installed from ROM. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by a processing device.
The electronic equipment provided by the application adopts the network threat information labeling method in the embodiment, so that the technical problem of low labeling efficiency of the network threat information is solved. Compared with the prior art, the beneficial effects of the electronic device provided by the embodiment of the application are the same as those of the network threat information labeling method provided by the embodiment, and other technical features of the electronic device are the same as those disclosed by the method of the embodiment, so that the description is omitted herein.
It should be understood that portions of the present disclosure may be implemented in hardware, software, firmware, or a combination thereof. In the description of the above embodiments, particular features, structures, materials, or characteristics may be combined in any suitable manner in any one or more embodiments or examples.
The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Example V
The present embodiment provides a computer-readable storage medium having computer-readable program instructions stored thereon for performing the cyber threat information labeling method in the above embodiment.
The computer-readable storage medium provided by the embodiments of the present application may be, for example, a USB flash drive, and may be, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system or device, or any combination of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, RAM, ROM, EPROM (Erasable Programmable Read-Only Memory) or flash memory, an optical fiber, a CD-ROM (compact disc read-only memory), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this embodiment, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system or device. Program code embodied on a computer-readable storage medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (Radio Frequency), and the like, or any suitable combination thereof.
The above-described computer-readable storage medium may be contained in an electronic device; or may exist alone without being assembled into an electronic device.
The computer-readable storage medium carries one or more programs that, when executed by an electronic device, cause the electronic device to: acquiring a target text, wherein the target text carries network threat information; performing sentence splitting on the target text to obtain at least one target sentence; according to the target sentences and the network threat label prediction model, predicting the network threat types of the target sentences respectively to obtain network threat information labels, and marking the target sentences according to the network threat information labels respectively.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a LAN (Local Area Network) or WAN (Wide Area Network), or it may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented in software or hardware. In some cases, the name of a module does not constitute a limitation on the module itself.
The computer readable storage medium is stored with the computer readable program instructions for executing the network threat information labeling method, and solves the technical problem of low labeling efficiency of the network threat information. Compared with the prior art, the beneficial effects of the computer readable storage medium provided by the embodiment of the present application are the same as those of the network threat information labeling method provided by the above implementation, and are not described in detail herein.
Example VI
The present application also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the cyber threat information tagging method described above.
The computer program product provided by the application solves the technical problem of low labeling efficiency of the network threat information. Compared with the prior art, the beneficial effects of the computer program product provided by the embodiment of the present application are the same as the beneficial effects of the network threat information labeling method provided by the above embodiment, and are not described in detail herein.
The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the scope of the claims, and all equivalent structures or equivalent processes using the descriptions and drawings of the present application, or direct or indirect application in other related technical fields are included in the scope of the claims.

Claims (10)

1. The network threat information labeling method is characterized by comprising the following steps of:
acquiring a target text, wherein the target text carries network threat information;
performing sentence splitting on the target text to obtain at least one target sentence;
according to the target sentences and the network threat label prediction model, predicting the network threat types of the target sentences respectively to obtain network threat information labels, and marking the target sentences according to the network threat information labels respectively.
2. The cyber threat information labeling method of claim 1, wherein the cyber threat tag prediction model comprises a pre-trained language model and a conditional random field model,
the step of predicting the network threat type of each target sentence according to each target sentence and the network threat label prediction model to obtain a network threat information label comprises the following steps:
acquiring preset random inactivation times;
according to the preset random inactivation times, carrying out semantic analysis on each target sentence through the pre-training language model to obtain a semantic vector;
Predicting the probability that each semantic vector belongs to each network threat type according to the conditional random field model to obtain a prediction result;
and generating a network threat information label corresponding to each target statement according to the prediction result.
3. The method for labeling cyber threat information according to claim 2, wherein the step of performing semantic analysis on each of the target sentences through the pre-training language model according to the preset random inactivation times to obtain semantic vectors comprises:
according to the preset random inactivation times, carrying out random node inactivation on the hidden layer of the pre-training language model;
and carrying out semantic analysis on each target sentence through the pre-training language model after the random node is deactivated, so as to obtain a semantic vector.
4. The cyber threat information labeling method of claim 1, further comprising, before the step of predicting the cyber threat type of each of the target sentences according to each of the target sentences and the cyber threat tag prediction model to obtain cyber threat information tags, respectively:
acquiring a training sample, a real label corresponding to the training sample, preset random inactivation times and a network threat label prediction model to be trained, wherein the network threat label prediction model to be trained comprises a language model to be trained and a conditional random field model to be trained;
According to the preset random inactivation times, carrying out semantic analysis on each target sentence through the language model to be trained to obtain a training semantic vector;
predicting the probability that each training semantic vector belongs to each network threat information label according to the conditional random field model to be trained, and obtaining a training prediction result;
constructing model loss of the network threat label prediction model to be trained according to the training prediction result, the preset random inactivation times and the real label;
and carrying out iterative optimization on the network threat tag prediction model to be trained according to the model loss to obtain the network threat tag prediction model.
5. The method for labeling the cyber threat information of claim 4, wherein the obtaining the training sample and the real label corresponding to the training sample includes:
analyzing the annual reports and the network page information of each security manufacturer to obtain an original sample;
sample screening and sample content preprocessing are carried out on the original sample to obtain a processed sample;
performing sentence splitting on the processing sample to obtain the training sample;
and determining the real labels corresponding to the training samples according to the training samples.
6. The method of claim 5, wherein the real labels include entity boundaries, entity categories, relationship types, and entity locations,
the step of determining the real label corresponding to each training sample according to each training sample comprises the following steps:
determining at least one entity in each of the training samples;
acquiring a preset annotation mode, and determining the entity boundary of each entity according to the preset annotation mode;
performing category analysis on each entity to obtain an entity category;
taking the position information corresponding to each entity as the entity position;
and determining the relation between every two entities to obtain the relation type.
7. The cyber threat information labeling method of claim 4, wherein the step of constructing the model loss of the cyber threat tag prediction model to be trained based on the training prediction result, the preset random inactivation times, and the real tag comprises:
constructing the conditional random field model loss according to the training prediction result, the preset random inactivation times and the real labels;
constructing a relative entropy loss corresponding to the preset random inactivation times according to the training prediction result and the preset random inactivation times;
And integrating the conditional random field model loss and the relative entropy loss into a model loss of the network threat label prediction model to be trained.
8. A network threat information labeling device, characterized in that the network threat information labeling device comprises:
the acquisition module is used for acquiring a target text, wherein the target text carries network threat information;
the splitting module is used for splitting the target text into sentences to obtain at least one target sentence;
the labeling module is used for respectively predicting the network threat type of each target sentence according to each target sentence and the network threat label prediction model to obtain a network threat information label, and respectively labeling each target sentence according to each network threat information label.
9. An electronic device, the electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the cyber threat information labeling method of any of claims 1-7.
10. A computer-readable storage medium, wherein a program for implementing a cyber threat information labeling method is stored on the computer-readable storage medium, the program for implementing the cyber threat information labeling method being executed by a processor to implement the steps of the cyber threat information labeling method according to any of claims 1 to 7.
CN202310172469.2A 2023-02-21 2023-02-21 Network threat information labeling method and device, electronic equipment and readable storage medium Pending CN116306632A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310172469.2A CN116306632A (en) 2023-02-21 2023-02-21 Network threat information labeling method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN116306632A (en) 2023-06-23

Family

ID=86802436

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310172469.2A Pending CN116306632A (en) 2023-02-21 2023-02-21 Network threat information labeling method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN116306632A (en)

Similar Documents

Publication Publication Date Title
CN111090987B (en) Method and apparatus for outputting information
US9923860B2 (en) Annotating content with contextually relevant comments
CN112015859B (en) Knowledge hierarchy extraction method and device for text, computer equipment and readable medium
CN111274815B (en) Method and device for mining entity focus point in text
US20170364587A1 (en) System and Method for Automatic, Unsupervised Contextualized Content Summarization of Single and Multiple Documents
CN111767366B (en) Question and answer resource mining method and device, computer equipment and storage medium
US20150169737A1 (en) Selecting a structure to represent tabular information
US10909320B2 (en) Ontology-based document analysis and annotation generation
CN110276023B (en) POI transition event discovery method, device, computing equipment and medium
CN110598157A (en) Target information identification method, device, equipment and storage medium
CN110245232B (en) Text classification method, device, medium and computing equipment
US20170004130A1 (en) Identifying word-senses based on linguistic variations
US20190114156A1 (en) Mapping of software code via user interface summarization
US11954173B2 (en) Data processing method, electronic device and computer program product
US20200364235A1 (en) Operations to transform dataset to intent
CN110633423A (en) Target account identification method, device, equipment and storage medium
US11095953B2 (en) Hierarchical video concept tagging and indexing system for learning content orchestration
US10042825B2 (en) Detection and elimination for inapplicable hyperlinks
CN115438232A (en) Knowledge graph construction method and device, electronic equipment and storage medium
US11361031B2 (en) Dynamic linguistic assessment and measurement
CN113033707B (en) Video classification method and device, readable medium and electronic equipment
CN116028868B (en) Equipment fault classification method and device, electronic equipment and readable storage medium
CN113033682B (en) Video classification method, device, readable medium and electronic equipment
CN116306632A (en) Network threat information labeling method and device, electronic equipment and readable storage medium
KR20180094738A (en) Apparatus and method for digitizing sentiment and predicting climax using the same

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination