CN115422376B - Network security event source tracing script generation method based on knowledge graph composite embedding - Google Patents


Info

Publication number
CN115422376B
CN115422376B
Authority
CN
China
Prior art keywords
embedding, expression, POS, knowledge graph, relation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211382679.6A
Other languages
Chinese (zh)
Other versions
CN115422376A (en)
Inventor
车洵
孙捷
胡牧
程佳
孙瀚墨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Zhongzhiwei Information Technology Co ltd
Original Assignee
Nanjing Zhongzhiwei Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Zhongzhiwei Information Technology Co ltd
Priority to CN202211382679.6A
Publication of CN115422376A
Application granted
Publication of CN115422376B
Legal status: Active

Classifications

    • G06F16/367 — Information retrieval of unstructured textual data; creation of semantic tools; ontology
    • G06F16/35 — Information retrieval of unstructured textual data; clustering; classification
    • G06F40/30 — Handling natural language data; semantic analysis


Abstract

The invention discloses a network security event source-tracing script generation method based on knowledge graph composite embedding, comprising the following steps. S1: expanding entity relations by introducing a text corpus, used to enrich the entity relations and expand the knowledge graph; S2: extracting common features in the knowledge graph, using a common extraction layer to extract common features of all inputs; S3: extracting relation features in the knowledge graph, using a corresponding relation extraction layer for each embedding relation; S4: projecting the common features and the relation features to the embedding space, and completing the knowledge graph; S5: ordering the knowledge graph obtained in step S4, obtaining POS token embeddings and semantic context scores through the corresponding modules, and generating the network security source-tracing script from the obtained POS token embeddings and semantic context scores through a word copy probability prediction module. The network security event source-tracing script constructed by the method has extremely high applicability and accuracy.

Description

Network security event source tracing script generation method based on knowledge graph composite embedding
Technical Field
The invention relates to the field of knowledge graphs for network security event source-tracing scripts, and in particular to a network security event source-tracing script generation method based on knowledge graph composite embedding.
Background
In recent years, with the rapid development of the internet, network threats have become more serious and more frequent. Faced with new kinds of network threats characterized by high attack speed, long latency, wide attack surface and the like, traditional network security tracing methods are time-consuming and labor-intensive. Even tracing an ordinary event can touch multiple systems, and those systems involve personnel from multiple teams, which imposes a very high labor cost; such tracing may have to be performed several times a day. Security Orchestration, Automation and Response (SOAR) was developed in response to the various problems exposed by the traditional methods.
Compared with the traditional methods, SOAR has the advantages of fast tracing, low labor cost and the like. SOAR has three core technical capabilities: a threat intelligence platform, a security incident response platform, and security orchestration and automation. Among these three, security orchestration and automation is undoubtedly the important, central function. Security orchestration and automation refers to arranging, in an automated manner, the scripts (playbooks) that the traditional method traces manually. In the field of automatically constructing network security event source-tracing scripts, automatic construction based on knowledge graphs has gradually developed, so how to quickly construct a network security event source-tracing script has become a pressing concern.
Disclosure of Invention
To achieve the above purpose, the inventor provides a network security event source-tracing script generation method based on knowledge graph composite embedding, comprising the following steps:
S1: expanding entity relations by introducing a text corpus, used to enrich the entity relations and expand the knowledge graph;
S2: extracting common features in the knowledge graph, using a common extraction layer to extract common features of all inputs;
S3: extracting relation features in the knowledge graph, using a corresponding relation extraction layer for each embedding relation;
S4: projecting the common features and the relation features to the embedding space, and completing the knowledge graph;
S5: ordering the knowledge graph obtained in step S4, obtaining POS token embeddings and semantic context scores through the corresponding modules, and generating the network security source-tracing script from the obtained POS token embeddings and semantic context scores through a word copy probability prediction module.
As a preferred mode of the invention, given an entity pair (h, t) that is not mentioned in the corpus, the lexicalized dependency paths (LDPs) extracted from the text corpus together with mentioned entity pairs are ranked; an encoder f of the entity pair, parameterized by θ, is learned for the subject vector h and the object vector t, and the entity pair (h, t) is encoded by the encoder f as f(h, t; θ);
the input to the encoder f is:

$$x = h \oplus t \oplus (h \odot t) \oplus (h - t)$$

where $\oplus$ denotes the concatenation of vectors, $\odot$ denotes the element-wise multiplication of two vectors, and $(h - t)$ denotes the subject vector h minus the object vector t;
for the LDP set $S_{(h,t)}$ connecting h and t, each LDP $l$ is represented by a vector $l$ using the pre-trained sentence encoder SBERT;
so that the LDPs co-occurring with the entity pair (h, t) are similar to f(h, t; θ), the LDPs associated with both h and t are used as the positive training instances S = {(h, l, t)}, and the LDPs associated with only h or only t are used as the negative training instances $S'_{(h,t)}$:

$$S'_{(h,t)} = \{(h, l', t') \mid t' \in D,\ t' \neq t\} \cup \{(h', l', t) \mid h' \in D,\ h' \neq h\}$$

where t′ and h′ denote object and subject vectors not equal to t and h respectively, l′ denotes the relation of a negative training instance, and D denotes the set of subject and object vectors;
the parameters of f(h, t; θ) are learned by minimizing the margin loss over $S_{(h,t)}$ and $S'_{(h,t)}$:

$$\mathcal{L}(\theta) = \sum_{(h,l,t) \in S} \sum_{(h',l',t') \in S'_{(h,t)}} \max\!\big(0,\ \gamma - f(h,t;\theta)^{\top} l + f(h',t';\theta)^{\top} l'\big)$$

where γ ≥ 0 denotes the margin; f(h, t; θ) is computed using the θ obtained by minimizing the above formula, each LDP $l$ is then scored by the inner product $f(h,t;\theta)^{\top} l$, and the first u LDPs with the highest inner product scores are selected to expand the knowledge graph, where u is a hyper-parameter.
As a preferred mode of the present invention, the step S2 comprises the following steps:
after the knowledge graph is expanded, let $e_s, e_o \in \mathbb{R}^{d_e}$ and $e_r \in \mathbb{R}^{d_r}$, where $e_s$ denotes the subject vector, $e_o$ denotes the object vector, and $e_r$ denotes the relation vector; the subject vector and the relation vector are concatenated:

$$[e_s; e_r]_{1d} \in \mathbb{R}^{d}$$

where $d = d_e + d_r$ and $[a; b]_{1d}$ denotes the vector concatenation of vectors a and b; the concatenated embedding vector is the input of all subsequent layers;
the common features of the vectors are extracted through a common dense layer, whose width is the number of filters of the dense layer, and the size of the kernel contained in each filter is equal to the size of the input embedding;
in the common dense layer, an affine function Ω(·) is applied to the given input embedding; the expression of the common dense layer is:

$$\Omega(x) = W_h x + b_h$$

where $W_h \in \mathbb{R}^{n d_h \times d}$ and $b_h \in \mathbb{R}^{n d_h}$; the width of the common dense layer is given as $n d_h$, i.e. n multiples of $d_h$, where n is a hyper-parameter;
the output of the common feature extraction is obtained by applying a non-linear activation function f(·) to $\Omega([e_s; e_r]_{1d})$.
As a preferred mode of the present invention, the step S3 comprises the following steps:
for the relation r, the encoding function is denoted by $\Omega_r$, and a relation dense layer is used to extract relation-aware features; the encoding function $\Omega_r$ is an affine function:

$$\Omega_r(x) = W_r x + b_r$$

where $W_r \in \mathbb{R}^{d_z \times d}$, $b_r \in \mathbb{R}^{d_z}$, and $d_z$ denotes the output length of $\Omega_r$;
$\Omega_r$ is applied to the input embedding $[e_s; e_r]_{1d} \in \mathbb{R}^{d}$, after which the non-linear activation function f(·) is applied; the relation dense layer has a different encoder for each different relation, used to extract the relation features.
As a preferred mode of the present invention, the step S4 comprises the following steps:
after the latent vectors are obtained from the relation dense layer and the common dense layer, the vectors are concatenated, and the concatenated vector is projected to the embedding space through a projection matrix:

$$z = W_{proj}\,\big[\,f(\Omega([e_s; e_r]_{1d}))\,;\, f(\Omega_r([e_s; e_r]_{1d}))\,\big]$$

then the non-linear activation f(·) is applied, and $h_{sr}$ is defined as:

$$h_{sr} = f(z)$$

where $h_{sr}$ is the predicted result; the link prediction score $\psi(e_s, e_r, e_o)$ is defined as the inner product of $h_{sr}$ and $e_o$:

$$\psi(e_s, e_r, e_o) = h_{sr}^{\top} e_o$$

the scores of all triples are computed, and the loss is computed using a binary cross-entropy function;
using the 1:B training strategy, let B denote the number of all entities in the knowledge graph; the binary cross-entropy loss $\mathcal{L}_{BCE}$ is:

$$\mathcal{L}_{BCE} = -\frac{1}{B} \sum_{i=1}^{B} \Big( y_i \log p_i + (1 - y_i) \log (1 - p_i) \Big), \qquad p_i = \sigma\big(\psi(e_s, e_r, e_o^{(i)})\big)$$

where $e_o^{(i)}$ denotes the i-th object entity, $y_i \in \{0, 1\}$ is the label, and σ denotes the sigmoid function.
as a preferable mode of the present invention, the S4 step further includes the steps of:
extracting original embedded features U, then carrying out random disturbance transformation on the original embedding, extracting the features U' through an extraction layer, wherein the expression of a loss function is as follows:
L MC =KL(U′||U)
wherein the KL function represents a KL divergence;
a composite loss function of
Figure GDA00040143278300000410
As a preferred mode of the present invention, the step S5 further comprises the following steps: POS token embeddings are generated by a POS generator, and semantic context scores are obtained by a semantic context scoring module.
As a preferred mode of the present invention, the step S5 further comprises the following steps:
when the knowledge graph is ordered, the generated triple embedding features are input into an ordering network; given the subject-relation-object triple structure, placeholders are introduced to pad it to a fixed length N; the subject-relation-object triple structure feature $F_{stru}$ and the padding $F_{pad}$ are concatenated and passed through a fully connected layer with softmax to obtain $S_{matrix}$ and predict the ordering sequence $S_{order}$:

$$S_{order} = \mathrm{argmax}_{row}\big(FC_s([F_{stru}; F_{pad}])\big)$$

where $FC_s$ denotes a fully connected layer with softmax, and $\mathrm{argmax}_{row}$ denotes the argmax operation performed on each row;
the sequence prediction task is treated as a classification problem, where N denotes the number of classes, and the cross-entropy loss between the true sequence $G_{order}$ and the ordering sequence $S_{order}$ is computed:

$$L_{sort} = -\sum_{n=0}^{N} G_{order}^{\,n} \log S_{order}^{\,n}$$

where $L_{sort}$ denotes the ordering loss, and n indexes the category, ranging from 0 to N;
the knowledge graph generates an optimal description sequence through the ordering network, the sequence is further decoded into sentences by a word decoder, and syntactic supervision is then applied by a POS generator, namely: conditioned on the knowledge graph order $G_{order}$, the knowledge graph is first linearized by adding the markers <subject>, <relation>, <object> to the corresponding positions of each triple, obtaining $G_{linear}$; the word encoder and the POS generator then take $G_{linear}$ as input and respectively output the word encodings $WI = \{w_i, i \in 1 \ldots M\}$ and the POS tag encodings $PI = \{p_i, i \in 1 \ldots M\}$;
in the fusion module, the token encoding $w_i$ and the POS tag encoding $p_i$ are fused to obtain the updated token encoding $w_i$:

$$w_i = LN\big(FC([w_i; p_i]) + w_i\big)$$

where LN denotes layer normalization; the fused, updated token encodings $w_i$ are decoded in the word decoder into the sentence $WI' = \{w'_i, i \in 1 \ldots K\}$;
the POS generator is supervised by POS tags pre-extracted from the sentences, with the loss function:

$$L_{pos} = -\sum_{i=1}^{M} \log P_{gen}(p_i)$$

where $P_{gen}$ denotes the predicted probability from the POS generator;
the loss function of the word encoder and decoder is:

$$L_{token} = -\sum_{i=1}^{K} \log W_{gen}(w'_i)$$

where $W_{gen}$ denotes the prediction probability of each word token.
As a preferred mode of the present invention, the step S5 further comprises the following steps:
a sliding window is generated for each word to provide local context, with padding for the words at the beginning of the sentence; the context information $F_{context}$ is obtained from the word features within the sliding window and input into an FC layer to obtain the semantic context score $X_{semantic}$:

$$X_{semantic} = \sigma\big(FC(F_{context})\big)$$

where σ denotes the sigmoid function.
As a preferred mode of the present invention, the step S5 further comprises the following steps:
the word copy probability prediction module uses the obtained POS token embedding $v_{p_k}$ and the semantic context score $X_{semantic}$ to compute the probability $p_k^{copy}$ of copying a word from the knowledge graph, which is used when generating a sentence to select whether to use the predicted word from the word decoder or the word in the knowledge graph:

$$\tilde{p}_k = \sigma\big(W_1 v_{p_k} + W_2 s_k + W_3 X_{semantic} + b_{copy}\big)$$

$$p_k^{copy} = \beta\,\tilde{p}_k + (1 - \beta)\,X_{semantic}$$

where $W_1, W_2, W_3$ and $b_{copy}$ are learnable parameters, $v_{p_k}$ denotes the token embedding, $s_k$ denotes the last hidden state of the word decoder at each time step, and β is a balance coefficient set to 0.3; the semantic context scoring module and the word copy probability prediction module are jointly optimized, with the copy-or-predict loss function:

$$L_{copy} = -\sum_{k} \Big( y_k \log p_k^{copy} + (1 - y_k) \log (1 - p_k^{copy}) \Big)$$

where $y_k$ is the ground-truth 0-1 label of whether the word at the k-th time step is copied or predicted, generated from the knowledge graph and the ground-truth sentence;
the total training loss $L_{total}$ consists of four parts: the ordering loss $L_{sort}$, the POS generation loss $L_{pos}$, the word generation loss $L_{token}$, and the copy-or-predict loss $L_{copy}$; the expression for the total training loss is:

$$L_{total} = L_{token} + \lambda_1 L_{pos} + \lambda_2 L_{sort} + \lambda_3 L_{copy}$$

where $\lambda_1, \lambda_2$ and $\lambda_3$ are trade-off factors.
Different from the prior art, the above technical scheme has the following beneficial effects:
(1) The method completes the creation of the network security event source-tracing script by constructing a high-performance knowledge graph; it overcomes the various defects of the traditional methods, and the network security event source-tracing script constructed by the method has extremely high applicability and accuracy;
(2) The invention introduces an additional network security text corpus so that the constructed knowledge graph has more entity relations, and at the same time uses a new embedding method that divides the extraction process into common feature extraction and relation-aware extraction, where the common feature extraction extracts from all inputs and the relation feature extraction extracts each relation separately; this greatly increases the link prediction success rate, and the knowledge graph completed by the method has extremely high applicability and accuracy for generating network security event source-tracing scripts.
Drawings
FIG. 1 is a block diagram of a method according to an embodiment;
FIG. 2 is a flowchart of the augmented knowledge graph according to an embodiment;
FIG. 3 is a flowchart of generating the network security source-tracing script according to the embodiment.
Detailed Description
To explain the technical contents, structural features, objects and effects of the technical solution in detail, the following detailed description is given with reference to the accompanying drawings in conjunction with the embodiments.
As shown in FIG. 1 to FIG. 3, the embodiment provides a network security event source-tracing script generation method based on knowledge graph composite embedding, comprising the following steps:
S1: expanding entity relations by introducing a text corpus, used to enrich the entity relations and expand the knowledge graph;
S2: extracting common features in the knowledge graph, using a common extraction layer to extract common features of all inputs;
S3: extracting relation features in the knowledge graph, using a corresponding relation extraction layer for each embedding relation;
S4: projecting the common features and the relation features to the embedding space, and completing the knowledge graph;
S5: ordering the knowledge graph obtained in step S4, obtaining POS token embeddings and semantic context scores through the corresponding modules, and generating the network security source-tracing script from the obtained POS token embeddings and semantic context scores through a word copy probability prediction module.
The step S1 in the above embodiment specifically comprises the following steps:
given an entity pair (h, t) that is not mentioned, the lexicalized dependency paths (LDPs, i.e. syntactic dependency paths) extracted from the text corpus together with mentioned entity pairs are ranked; an encoder f of the entity pair, parameterized by θ, is learned for the subject vector h and the object vector t, and the entity pair (h, t) is encoded by the encoder f as f(h, t; θ);
the encoder f is embodied as a non-linearly activated multilayer perceptron, whose input is:

$$x = h \oplus t \oplus (h \odot t) \oplus (h - t)$$

where $\oplus$ denotes the concatenation of vectors, $\odot$ denotes the element-wise multiplication of two vectors, and $(h - t)$ denotes the subject vector h minus the object vector t; the above equation independently considers the information in the head and tail entity embeddings and the interaction between their corresponding dimensions.
For the LDP set $S_{(h,t)}$ connecting h and t: because an LDP is a sequence of text labels, it can be represented by a vector using a sentence encoder; in this embodiment each LDP $l$ is represented by a vector $l$ using the pre-trained sentence encoder SBERT (Sentence-BERT: sentence embeddings over twin networks).
So that the LDPs co-occurring with the entity pair (h, t) are similar to f(h, t; θ), the LDPs associated with both h and t (i.e. occurring with both) are used as the positive training instances S = {(h, l, t)}, and the LDPs associated with only h or only t (i.e. not with both) are used as the negative training instances $S'_{(h,t)}$:

$$S'_{(h,t)} = \{(h, l', t') \mid t' \in D,\ t' \neq t\} \cup \{(h', l', t) \mid h' \in D,\ h' \neq h\}$$

where t′ and h′ denote object and subject vectors not equal to t and h respectively, l′ denotes the relation of a negative training instance, and D denotes the set of subject and object vectors;
the parameters of f(h, t; θ) are learned by minimizing the margin loss over $S_{(h,t)}$ and $S'_{(h,t)}$:

$$\mathcal{L}(\theta) = \sum_{(h,l,t) \in S} \sum_{(h',l',t') \in S'_{(h,t)}} \max\!\big(0,\ \gamma - f(h,t;\theta)^{\top} l + f(h',t';\theta)^{\top} l'\big)$$

where γ ≥ 0 denotes the margin, set to 1 in the experiments of this embodiment. To determine which LDPs to borrow for a particular unmentioned entity pair, in this embodiment f(h, t; θ) is computed using the θ obtained by minimizing the above formula; each LDP $l$, where $l$ is obtained from the sentence encoder model, is then scored by the inner product $f(h,t;\theta)^{\top} l$, and the first u LDPs with the highest inner product scores are selected to expand the knowledge graph, where u is a hyper-parameter.
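By way of illustration only — the following sketch is not part of the disclosed embodiment — the entity-pair encoder f and the margin-based ranking used to borrow LDPs could look roughly as follows in PyTorch; the class name PairEncoder, the single hidden layer, and the hidden size are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairEncoder(nn.Module):
    """Encoder f(h, t; θ): a non-linearly activated MLP over the
    concatenation h ⊕ t ⊕ (h ⊙ t) ⊕ (h − t) (hidden size assumed)."""
    def __init__(self, dim: int, ldp_dim: int, hidden: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4 * dim, hidden),   # input: h, t, h⊙t, h−t concatenated
            nn.ReLU(),
            nn.Linear(hidden, ldp_dim),   # project into the SBERT LDP vector space
        )

    def forward(self, h: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        x = torch.cat([h, t, h * t, h - t], dim=-1)
        return self.mlp(x)

def margin_loss(f_pos, l_pos, f_neg, l_neg, gamma: float = 1.0):
    """Margin loss: positive LDPs should score higher (by inner product)
    than negatives by at least γ (γ = 1 as in the embodiment)."""
    pos_score = (f_pos * l_pos).sum(-1)   # f(h, t; θ)ᵀ l for positive instances
    neg_score = (f_neg * l_neg).sum(-1)   # f(h', t'; θ)ᵀ l' for negative instances
    return F.relu(gamma - pos_score + neg_score).mean()

def top_u_ldps(encoder, h, t, ldp_vectors, u: int = 5):
    """Rank all LDP vectors (from SBERT) by inner product with f(h, t; θ)
    and keep the top u to expand the knowledge graph."""
    scores = ldp_vectors @ encoder(h, t)  # (num_ldps,)
    return torch.topk(scores, k=u).indices
```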
The step S2 in the above embodiment further comprises the following steps:
after the knowledge graph is expanded, let $e_s, e_o \in \mathbb{R}^{d_e}$ and $e_r \in \mathbb{R}^{d_r}$, where $e_s$ denotes the subject vector, $e_o$ denotes the object vector, and $e_r$ denotes the relation vector; in order to compute the score function of the embedding in the method, the subject vector and the relation vector are concatenated:

$$[e_s; e_r]_{1d} \in \mathbb{R}^{d}$$

where $d = d_e + d_r$ and $[a; b]_{1d}$ denotes the vector concatenation of vectors a and b; the concatenated embedding vector is the input of all subsequent layers;
the common features of the vectors are then extracted through a common dense layer, whose width is the number of filters of the dense layer, and the size of the kernel contained in each filter is equal to the size of the input embedding;
in the common dense layer, an affine function Ω(·) is applied to the given input embedding; the expression of the common dense layer is:

$$\Omega(x) = W_h x + b_h$$

where $W_h \in \mathbb{R}^{n d_h \times d}$ and $b_h \in \mathbb{R}^{n d_h}$; the width of the common dense layer is given as $n d_h$, i.e. n multiples of $d_h$, where n is a hyper-parameter;
the output of the common feature extraction is obtained by applying a non-linear activation function f(·) to $\Omega([e_s; e_r]_{1d})$.
The step S3 in the above embodiment further comprises the following steps:
in order to extract relation-specific features from the concatenated embedding, a relation-aware encoding function is considered; for the relation r, the encoding function is denoted by $\Omega_r$, and a relation dense layer is used to extract relation-aware features; the encoding function $\Omega_r$ is an affine function:

$$\Omega_r(x) = W_r x + b_r$$

where $W_r \in \mathbb{R}^{d_z \times d}$, $b_r \in \mathbb{R}^{d_z}$, and $d_z$ denotes the output length of $\Omega_r$;
$\Omega_r$ is applied to the input embedding $[e_s; e_r]_{1d} \in \mathbb{R}^{d}$, after which the non-linear activation function f(·) is applied; the relation dense layer has a different encoder for each different relation, used to extract the relation features.
The step S4 in the above embodiment further comprises the following steps:
after the latent vectors are obtained from the relation dense layer and the common dense layer, the vectors are concatenated, and the concatenated vector is projected to the embedding space through a projection matrix:

$$z = W_{proj}\,\big[\,f(\Omega([e_s; e_r]_{1d}))\,;\, f(\Omega_r([e_s; e_r]_{1d}))\,\big]$$

then the non-linear activation f(·) is applied, and $h_{sr}$ is defined as:

$$h_{sr} = f(z)$$

where $h_{sr}$ is the predicted result; the link prediction score $\psi(e_s, e_r, e_o)$ is defined as the inner product of $h_{sr}$ and $e_o$:

$$\psi(e_s, e_r, e_o) = h_{sr}^{\top} e_o$$

the scores of all triples are computed, and the loss is computed using a binary cross-entropy function;
using the 1:B training strategy, let B denote the number of all entities in the knowledge graph; the binary cross-entropy loss $\mathcal{L}_{BCE}$ is:

$$\mathcal{L}_{BCE} = -\frac{1}{B} \sum_{i=1}^{B} \Big( y_i \log p_i + (1 - y_i) \log (1 - p_i) \Big), \qquad p_i = \sigma\big(\psi(e_s, e_r, e_o^{(i)})\big)$$

where $e_o^{(i)}$ denotes the i-th object entity, $y_i \in \{0, 1\}$ is the label, and σ denotes the sigmoid function.
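By way of illustration only, the projection to the embedding space, the inner-product link prediction score, and the 1:B binary cross-entropy loss might be sketched as follows; the class name ProjectAndScore and the use of ReLU for f(·) are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectAndScore(nn.Module):
    """Concatenates common and relation features, projects them back to the
    entity embedding space, and scores all B entities by inner product."""
    def __init__(self, common_dim: int, rel_dim: int, d_e: int):
        super().__init__()
        self.proj = nn.Linear(common_dim + rel_dim, d_e)  # projection matrix
        self.act = nn.ReLU()                              # non-linear activation f(·)

    def forward(self, common_feat, rel_feat, entity_emb):
        z = torch.cat([common_feat, rel_feat], dim=-1)
        h_sr = self.act(self.proj(z))        # predicted embedding h_sr
        return h_sr @ entity_emb.t()         # ψ(e_s, e_r, e_o) for all B entities

def bce_1_to_b(scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """1:B training: each (s, r) pair is scored against every entity and a
    multi-label binary cross-entropy is averaged over all entries
    (the sigmoid σ is applied internally by the loss)."""
    return F.binary_cross_entropy_with_logits(scores, labels)
```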
The step S4 in the above embodiment further comprises the following steps:
in order to improve the robustness of the extraction layer, a self-consistency strategy is introduced: the original embedding features U are extracted, the original embedding is then subjected to a random perturbation transformation, and the features U′ are extracted through the extraction layer; the features extracted by the extraction layer from the two versions are expected to be as similar as possible, and the loss function is:

$$L_{MC} = KL(U' \,\|\, U)$$

where the KL function denotes the KL divergence;
the composite loss function is:

$$\mathcal{L} = \mathcal{L}_{BCE} + L_{MC}$$
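By way of illustration only, the perturbation-consistency term L_MC = KL(U′ ∥ U) might be sketched as follows; the Gaussian noise perturbation and the softmax normalization before the KL divergence are assumptions of this sketch:

```python
import torch
import torch.nn.functional as F

def consistency_kl_loss(extract_layer, embedding: torch.Tensor,
                        noise_std: float = 0.1) -> torch.Tensor:
    """L_MC = KL(U' || U): the features extracted from a randomly perturbed
    embedding should match the features of the original embedding."""
    u = extract_layer(embedding)                         # original features U
    perturbed = embedding + noise_std * torch.randn_like(embedding)
    u_prime = extract_layer(perturbed)                   # perturbed features U'
    # F.kl_div(log_q, p) computes KL(p || q), so this is KL(U' || U)
    return F.kl_div(F.log_softmax(u, dim=-1),
                    F.softmax(u_prime, dim=-1),
                    reduction="batchmean")
```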
As shown in FIG. 2, after the knowledge graph is generated and completed by the above method, the knowledge graph needs to be ordered. Specifically, the step S5 in the above embodiment further comprises the following steps: POS token embeddings are generated by a POS generator, and semantic context scores are obtained by a semantic context scoring module.
When the knowledge graph is ordered, the generated triple embedding features are input into an ordering network, which is supervised with sequences extracted from real sentences. Given the subject-relation-object triple structure, since the length of the knowledge graph is variable, this embodiment introduces placeholders to pad the triple structure to a fixed length N, which also represents the number of possible position classes; the subject-relation-object triple structure feature $F_{stru}$ and the padding $F_{pad}$ are concatenated and passed through a fully connected layer FC with softmax to obtain $S_{matrix}$ and predict the ordering sequence $S_{order}$:

$$S_{order} = \mathrm{argmax}_{row}\big(FC_s([F_{stru}; F_{pad}])\big)$$

where $FC_s$ denotes a fully connected layer with softmax, and $\mathrm{argmax}_{row}$ denotes the argmax operation performed on each row;
the sequence prediction task is treated as a classification problem, where N denotes the number of classes, i.e. the maximum number of triples in the knowledge graph; the cross-entropy loss between the true sequence $G_{order}$ and the ordering sequence $S_{order}$ is therefore computed:

$$L_{sort} = -\sum_{n=0}^{N} G_{order}^{\,n} \log S_{order}^{\,n}$$

where $L_{sort}$ denotes the ordering loss, and n indexes the category, ranging from 0 to N.
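By way of illustration only, the ordering network — padding the triple structure features to the fixed length N, predicting a position class for each slot with a softmax FC layer, and computing L_sort — might be sketched as follows; the learned padding vector F_pad is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrderingNetwork(nn.Module):
    """Pads the triple structure features F_stru to a fixed length N and
    predicts one of N position classes per slot (FC_s with softmax)."""
    def __init__(self, feat_dim: int, max_len: int):
        super().__init__()
        self.max_len = max_len                              # fixed length N
        self.pad = nn.Parameter(torch.zeros(1, feat_dim))   # placeholder F_pad
        self.fc = nn.Linear(feat_dim, max_len)              # FC_s: N position logits

    def forward(self, f_stru: torch.Tensor):
        pad = self.pad.expand(self.max_len - f_stru.size(0), -1)
        x = torch.cat([f_stru, pad], dim=0)        # [F_stru; F_pad]
        logits = self.fc(x)                        # rows: slots, columns: positions
        s_order = logits.argmax(dim=1)             # argmax_row → predicted S_order
        return logits, s_order

def sort_loss(logits: torch.Tensor, g_order: torch.Tensor) -> torch.Tensor:
    """L_sort: cross-entropy between the predicted position logits and the
    ground-truth sequence G_order (one position class per slot)."""
    return F.cross_entropy(logits, g_order)
```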
the knowledge graph generates an optimal description sequence through the sequencing network, the sequence is further decoded into sentences through a word decoder, the word decoder uses a decoder in a Transfomer, and then additional syntactic supervision is applied through a POS generator, and the implementation of the POS generator is as follows: in knowledge map order G order Conditioned by first marking<Main body>,<Relationships between>,<Object>The corresponding position added to each triplet linearizes the knowledge-graph and obtains G linear Then word encoder and POS generator with G linear As input, the word codes WI = { w) are output separately i I ∈ 1 … M } and POS tag code PI = { p = { (p) } i ,i∈1…M};
The token is then encoded w in the fusion module i And POS tag code p i Fusing to obtain the updated token code w i The expression is:
w i =LN(FC([w i ;p i ])+w i )
wherein LN represents layer normalization, and the updated token code w after fusion i Decoded in the word decoder as the statement WI '= { w' i ,i∈1…K};
The POS generator monitors through POS labels pre-extracted from sentences, and the loss function expression is as follows:
Figure GDA0004014327830000131
wherein, P gen Representing a predicted probability from a POS generator;
the loss function expression of the word encoder and decoder is:
Figure GDA0004014327830000132
wherein the word encoder uses a pre-trained bert model, W gen Representing the prediction probability of each word token.
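By way of illustration only, the fusion module w_i = LN(FC([w_i; p_i]) + w_i) might be sketched as follows (the class name POSFusion is an assumption):

```python
import torch
import torch.nn as nn

class POSFusion(nn.Module):
    """Fuses a token encoding w_i (from the word encoder, e.g. BERT) with its
    POS tag encoding p_i: w_i ← LN(FC([w_i; p_i]) + w_i)."""
    def __init__(self, word_dim: int, pos_dim: int):
        super().__init__()
        self.fc = nn.Linear(word_dim + pos_dim, word_dim)
        self.ln = nn.LayerNorm(word_dim)

    def forward(self, w: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
        fused = self.fc(torch.cat([w, p], dim=-1))  # FC([w_i; p_i])
        return self.ln(fused + w)                   # residual connection, then LN
```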
In addition to the syntactic constraint on copied words, this embodiment also designs a semantic context scoring component to evaluate the semantic consistency of a copied or predicted word within a sliding window. A sliding window is generated for each word to provide local context, with padding for the words at the beginning of the sentence; the context information $F_{context}$ is obtained from the word features within the sliding window and input into an FC layer to obtain the semantic context score $X_{semantic}$:

$$X_{semantic} = \sigma\big(FC(F_{context})\big)$$

where σ denotes the sigmoid function.
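By way of illustration only, the sliding-window semantic context scorer might be sketched as follows; the window size and the mean pooling used to form F_context are assumptions:

```python
import torch
import torch.nn as nn

class SemanticContextScorer(nn.Module):
    """Scores each word's fit in its local context: word features inside a
    sliding window are pooled into F_context, passed through an FC layer,
    and squashed with a sigmoid to give X_semantic."""
    def __init__(self, feat_dim: int, window: int = 3):
        super().__init__()
        self.window = window
        self.fc = nn.Linear(feat_dim, 1)

    def forward(self, word_feats: torch.Tensor) -> torch.Tensor:
        # word_feats: (seq_len, feat_dim); pad the front so the words at the
        # beginning of the sentence still have a full window of context
        pad = word_feats.new_zeros(self.window - 1, word_feats.size(1))
        padded = torch.cat([pad, word_feats], dim=0)
        windows = padded.unfold(0, self.window, 1)  # (seq_len, feat_dim, window)
        f_context = windows.mean(dim=-1)            # pooled context features F_context
        return torch.sigmoid(self.fc(f_context)).squeeze(-1)  # X_semantic in (0, 1)
```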
The word copy probability prediction module uses the obtained POS token embedding $v_{p_k}$ and the semantic context score $X_{semantic}$ to compute the probability $p_k^{copy}$ of copying a word from the knowledge graph, which is used at each time step when generating a sentence to select whether to use the predicted word from the word decoder or the word in the knowledge graph:

$$\tilde{p}_k = \sigma\big(W_1 v_{p_k} + W_2 s_k + W_3 X_{semantic} + b_{copy}\big)$$

$$p_k^{copy} = \beta\,\tilde{p}_k + (1 - \beta)\,X_{semantic}$$

where $W_1, W_2, W_3$ and $b_{copy}$ are learnable parameters, $v_{p_k}$ denotes the token embedding, $s_k$ denotes the last hidden state of the word decoder at each time step, and β is a balance coefficient set to 0.3; the semantic context scoring module and the word copy probability prediction module are jointly optimized, with the copy-or-predict loss function:

$$L_{copy} = -\sum_{k} \Big( y_k \log p_k^{copy} + (1 - y_k) \log (1 - p_k^{copy}) \Big)$$

where $y_k$ is the ground-truth 0-1 label of whether the word at the k-th time step is copied or predicted, generated from the knowledge graph and the ground-truth sentence;
finally, the total training loss $L_{total}$ in the model consists of four parts: the ordering loss $L_{sort}$, the POS generation loss $L_{pos}$, the word generation loss $L_{token}$, and the copy-or-predict loss $L_{copy}$; the expression for the total training loss is:

$$L_{total} = L_{token} + \lambda_1 L_{pos} + \lambda_2 L_{sort} + \lambda_3 L_{copy}$$

where $\lambda_1, \lambda_2$ and $\lambda_3$ are trade-off factors.
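By way of illustration only, the word copy probability prediction module and the total training loss might be sketched as follows; as noted above, the exact form in which the balance coefficient β = 0.3 enters is a reconstruction, and the λ values shown are placeholders:

```python
import torch
import torch.nn as nn

class CopyPredictor(nn.Module):
    """Predicts the probability of copying a word from the knowledge graph
    versus generating it with the word decoder (a sketch)."""
    def __init__(self, pos_dim: int, hid_dim: int, beta: float = 0.3):
        super().__init__()
        self.w1 = nn.Linear(pos_dim, 1, bias=False)  # W_1 · v_{p_k}
        self.w2 = nn.Linear(hid_dim, 1, bias=False)  # W_2 · s_k
        self.w3 = nn.Linear(1, 1, bias=False)        # W_3 · X_semantic
        self.b = nn.Parameter(torch.zeros(1))        # b_copy
        self.beta = beta                             # balance coefficient β = 0.3

    def forward(self, v_pk, s_k, x_semantic):
        gate = torch.sigmoid(self.w1(v_pk) + self.w2(s_k)
                             + self.w3(x_semantic.unsqueeze(-1)) + self.b)
        # mix the learned gate with the semantic context score
        return self.beta * gate.squeeze(-1) + (1 - self.beta) * x_semantic

def total_loss(l_token, l_pos, l_sort, l_copy,
               lam1: float = 1.0, lam2: float = 1.0, lam3: float = 1.0):
    """L_total = L_token + λ1·L_pos + λ2·L_sort + λ3·L_copy
    (the trade-off factors λ are hyper-parameters; values assumed)."""
    return l_token + lam1 * l_pos + lam2 * l_sort + lam3 * l_copy
```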
The datasets used by the method are the Comprehensive, Multi-Source Cyber-Security Events dataset and the ADFA intrusion detection dataset. The Comprehensive, Multi-Source Cyber-Security Events dataset is collected from various websites and vulnerability databases on the network, and contains network security and vulnerability information as well as network text data. The ADFA dataset contains data for various intrusions. A large number of experiments show that the method is superior to the most advanced methods.
(Table: link prediction results of the method on the Comprehensive, Multi-Source Cyber-Security Events and ADFA datasets, reporting MRR, HIT@1 and HIT@10.)
The table shows that the method achieves the optimal link prediction performance in MRR (mean reciprocal rank), obtains excellent results on both HIT@1 (top-1 accuracy) and HIT@10 (top-10 accuracy), and performs excellently on both the Comprehensive, Multi-Source Cyber-Security Events and ADFA datasets.
On one dataset, MRR improves by 0.6%, HIT@10 by 0.2% and HIT@1 by 1.1%; on the other, MRR improves by 2.4%, HIT@10 by 2.5% and HIT@1 by 2.6%. This simple approach is superior to shared-layer models such as ConvE, SACN and InteractE, and to relation-specific models such as RGCN. The results of this embodiment show that the framework combining two different encoding functions can effectively improve the performance of knowledge graph embedding. This embodiment analyzes the performance of the method on the different relation types of the Comprehensive, Multi-Source Cyber-Security Events dataset, because it contains more distinct relations than ADFA. Four types of relations are distinguished based on the number of tails connected to a head and the number of heads connected to a tail: one-to-one (1:1), one-to-many (1:N), many-to-one (N:1) and many-to-many (N:N). Using this dataset, this embodiment compares the performance of three models — ConvE, InteractE and ComDensE — under the four types of relations. As shown in the table below, ComDensE is found to be effective on both the complex relation types (i.e. 1:N, N:N, N:1) and the simple relations (i.e. 1:1). Notably, the performance gain on 1:1 is higher, indicating that ComDensE is particularly effective at capturing simple relations.
(Table: performance of ConvE, InteractE and ComDensE on the 1:1, 1:N, N:1 and N:N relation types of the Comprehensive, Multi-Source Cyber-Security Events dataset.)
This embodiment shows that the method completes the creation of the network security event source-tracing script by constructing a high-performance knowledge graph; it overcomes the various defects of the traditional methods, and the network security event source-tracing script constructed by the method has extremely high applicability and accuracy.
In addition, the invention introduces an additional network security text corpus so that the constructed knowledge graph has more entity relations, and at the same time uses a new embedding method that divides the extraction process into common feature extraction and relation-aware extraction, where the common feature extraction extracts from all inputs and the relation feature extraction extracts each relation separately; this greatly increases the link prediction success rate, and the knowledge graph completed by the method has extremely high applicability and accuracy for generating network security event source-tracing scripts.
For generating the network security event source-tracing script, the method optimizes knowledge sequence prediction under sequence supervision, and further enhances the consistency between the generated sentences and the knowledge graph through syntactic and semantic regularization.
Part-of-speech (POS) syntactic tokens are combined to restrict the positions where words are copied from the knowledge graph, and a semantic context scoring function is used to evaluate the suitability of each word in the sentence within its context. The constructed network security event source-tracing script has extremely high usability and accuracy.
It should be noted that, although the above embodiments have been described herein, the invention is not limited thereto. Therefore, based on the innovative concepts of the present invention, changes and modifications to the embodiments described herein, or equivalent structures or equivalent process transformations made using the contents of the present specification and the attached drawings, which directly or indirectly apply the technical solutions of the present invention to other related technical fields, are all included in the protection scope of the present invention.

Claims (5)

1. A network security event source-tracing script generation method based on knowledge graph composite embedding, characterized by comprising the following steps:
S1: expanding entity relations by introducing a text corpus, used to enrich the entity relations and expand the knowledge graph;
S2: extracting common features in the knowledge graph, using a common extraction layer to extract common features of all inputs;
S3: extracting relation features in the knowledge graph, using a corresponding relation extraction layer for each embedding relation;
S4: projecting the common features and the relation features to the embedding space, and completing the knowledge graph;
S5: ordering the knowledge graph obtained in step S4, obtaining POS token embeddings and semantic context scores through the corresponding modules, and generating the network security source-tracing script from the obtained POS token embeddings and semantic context scores through a word copy probability prediction module;
the step S2 comprises the following steps:
after the knowledge graph is expanded, let $e_s, e_o \in \mathbb{R}^{d_e}$ and $e_r \in \mathbb{R}^{d_r}$, where $e_s$ denotes the subject vector, $e_o$ denotes the object vector, and $e_r$ denotes the relation vector; the subject vector and the relation vector are concatenated:

$$[e_s; e_r]_{1d} \in \mathbb{R}^{d}$$

where $d = d_e + d_r$ and $[a; b]_{1d}$ denotes the vector concatenation of vectors a and b; the concatenated embedding vector is the input of all subsequent layers;
the common features of the vectors are extracted through a common dense layer, whose width is the number of filters of the dense layer, and the size of the kernel contained in each filter is equal to the size of the input embedding;
in the common dense layer, an affine function Ω(·) is applied to the given input embedding; the expression of the common dense layer is:

$$\Omega(x) = W_h x + b_h$$

where $W_h \in \mathbb{R}^{n d_h \times d}$ and $b_h \in \mathbb{R}^{n d_h}$; the width of the common dense layer is given as $n d_h$, i.e. n multiples of $d_h$, where n is a hyper-parameter;
the output of the common feature extraction is obtained by applying a non-linear activation function f(·) to $\Omega([e_s; e_r]_{1d})$;
the step S3 comprises the following steps:
for the relation r, the encoding function is denoted by $\Omega_r$, and a relation dense layer is used to extract relation-aware features; the encoding function $\Omega_r$ is an affine function:

$$\Omega_r(x) = W_r x + b_r$$

where $W_r \in \mathbb{R}^{d_z \times d}$, $b_r \in \mathbb{R}^{d_z}$, and $d_z$ denotes the output length of $\Omega_r$;
$\Omega_r$ is applied to the input embedding $[e_s; e_r]_{1d} \in \mathbb{R}^{d}$, after which the non-linear activation function f(·) is applied; the relation dense layer has a different encoder for each different relation, used to extract the relation features;
the step S4 comprises the following steps:
after the latent vectors are obtained from the relation dense layer and the common dense layer, the vectors are concatenated, and the concatenated vector is projected to the embedding space through a projection matrix:

$$z = W_{proj}\,\big[\,f(\Omega([e_s; e_r]_{1d}))\,;\, f(\Omega_r([e_s; e_r]_{1d}))\,\big]$$

then the non-linear activation f(·) is applied, and $h_{sr}$ is defined as:

$$h_{sr} = f(z)$$

where $h_{sr}$ is the predicted result; the link prediction score $\psi(e_s, e_r, e_o)$ is defined as the inner product of $h_{sr}$ and $e_o$:

$$\psi(e_s, e_r, e_o) = h_{sr}^{\top} e_o$$

the scores of all triples are computed, and the loss is computed using a binary cross-entropy function;
using the 1:B training strategy, let B denote the number of all entities in the knowledge graph; the binary cross-entropy loss $\mathcal{L}_{BCE}$ is:

$$\mathcal{L}_{BCE} = -\frac{1}{B} \sum_{i=1}^{B} \Big( y_i \log p_i + (1 - y_i) \log (1 - p_i) \Big), \qquad p_i = \sigma\big(\psi(e_s, e_r, e_o^{(i)})\big)$$

where $e_o^{(i)}$ denotes the i-th object entity, $y_i \in \{0, 1\}$ is the label, and σ denotes the sigmoid function;
the step S5 further comprises the following steps: POS token embeddings are generated by a POS generator, and semantic context scores are obtained by a semantic context scoring module;
the step S5 further comprises the following steps:
when the knowledge graph is ordered, the generated triple embedding features are input into an ordering network; given the subject-relation-object triple structure, placeholders are introduced to pad it to a fixed length N; the subject-relation-object triple structure feature $F_{stru}$ and the padding $F_{pad}$ are concatenated and passed through a fully connected layer with softmax to obtain $S_{matrix}$ and predict the ordering sequence $S_{order}$:

$$S_{order} = \mathrm{argmax}_{row}\big(FC_s([F_{stru}; F_{pad}])\big)$$

where $FC_s$ denotes a fully connected layer with softmax, and $\mathrm{argmax}_{row}$ denotes the argmax operation performed on each row;
the sequence prediction task is treated as a classification problem, where N denotes the number of classes; the cross-entropy loss between the true sequence $G_{order}$ and the ordering sequence $S_{order}$ is computed:

$$L_{sort} = -\sum_{n=0}^{N} G_{order}^{\,n} \log S_{order}^{\,n}$$

where $L_{sort}$ denotes the ordering loss, and n indexes the category, ranging from 0 to N;
the knowledge graph generates an optimal description sequence through the ordering network, the sequence is further decoded into sentences by a word decoder, and syntactic supervision is then applied by a POS generator, namely: conditioned on the knowledge graph order $G_{order}$, the knowledge graph is first linearized by adding the markers <subject>, <relation>, <object> to the corresponding positions of each triple, obtaining $G_{linear}$; the word encoder and the POS generator then take $G_{linear}$ as input and respectively output the word encodings $WI = \{w_i, i \in 1 \ldots M\}$ and the POS tag encodings $PI = \{p_i, i \in 1 \ldots M\}$;
in the fusion module, the token encoding $w_i$ and the POS tag encoding $p_i$ are fused to obtain the updated token encoding $w_i$:

$$w_i = LN\big(FC([w_i; p_i]) + w_i\big)$$

where LN denotes layer normalization; the fused, updated token encodings $w_i$ are decoded in the word decoder into the sentence $WI' = \{w'_i, i \in 1 \ldots K\}$;
the POS generator is supervised by POS tags pre-extracted from the sentences, with the loss function:

$$L_{pos} = -\sum_{i=1}^{M} \log P_{gen}(p_i)$$

where $P_{gen}$ denotes the predicted probability from the POS generator;
the loss function of the word encoder and decoder is:

$$L_{token} = -\sum_{i=1}^{K} \log W_{gen}(w'_i)$$

where $W_{gen}$ denotes the prediction probability of each word token.
2. The method according to claim 1, wherein the step S1 comprises the following steps:
given an entity pair (h, t) that is not mentioned, ranking the LDPs extracted from the text corpus together with mentioned entity pairs; an encoder f of the entity pair, parameterized by θ, is learned for the subject vector h and the object vector t, and the entity pair (h, t) is encoded by the encoder f as f(h, t; θ);
the input to the encoder f is:

$$x = h \oplus t \oplus (h \odot t) \oplus (h - t)$$

where $\oplus$ denotes the concatenation of vectors, $\odot$ denotes the element-wise multiplication of two vectors, and $(h - t)$ denotes the subject vector h minus the object vector t;
for the LDP set $S_{(h,t)}$ connecting h and t, each LDP $l$ is represented by a vector $l$ using the pre-trained sentence encoder SBERT;
so that the LDPs co-occurring with the entity pair (h, t) are similar to f(h, t; θ), the LDPs associated with both h and t are used as the positive training instances S = {(h, l, t)}, and the LDPs associated with only h or only t are used as the negative training instances $S'_{(h,t)}$:

$$S'_{(h,t)} = \{(h, l', t') \mid t' \in D,\ t' \neq t\} \cup \{(h', l', t) \mid h' \in D,\ h' \neq h\}$$

where t′ and h′ denote object and subject vectors not equal to t and h respectively, l′ denotes the relation of a negative training instance, and D denotes the set of subject and object vectors;
the parameters of f(h, t; θ) are learned by minimizing the margin loss over $S_{(h,t)}$ and $S'_{(h,t)}$:

$$\mathcal{L}(\theta) = \sum_{(h,l,t) \in S} \sum_{(h',l',t') \in S'_{(h,t)}} \max\!\big(0,\ \gamma - f(h,t;\theta)^{\top} l + f(h',t';\theta)^{\top} l'\big)$$

where γ ≥ 0 denotes the margin; f(h, t; θ) is computed using the θ obtained by minimizing the above formula, each LDP $l$ is then scored by the inner product $f(h,t;\theta)^{\top} l$, and the first u LDPs with the highest inner product scores are selected to expand the knowledge graph, where u is a hyper-parameter.
3. The method of claim 2, wherein the step S4 further comprises the following steps:
the original embedding features U are extracted; the original embedding is then subjected to a random perturbation transformation, and the features U′ are extracted through the extraction layer; the loss function is:

$$L_{MC} = KL(U' \,\|\, U)$$

where the KL function denotes the KL divergence;
the composite loss function is:

$$\mathcal{L} = \mathcal{L}_{BCE} + L_{MC}$$
4. The method of claim 3, wherein the step S5 further comprises the following steps:
a sliding window is generated for each word to provide local context, with padding for the words at the beginning of the sentence; the context information $F_{context}$ is obtained from the word features within the sliding window and input into an FC layer to obtain the semantic context score $X_{semantic}$:

$$X_{semantic} = \sigma\big(FC(F_{context})\big)$$

where σ denotes the sigmoid function.
5. The method of claim 4, wherein the step S5 further comprises the following steps:
the word copy probability prediction module uses the obtained POS token embedding $v_{p_k}$ and the semantic context score $X_{semantic}$ to compute the probability $p_k^{copy}$ of copying a word from the knowledge graph, which is used when generating a sentence to select whether to use the predicted word from the word decoder or the word in the knowledge graph:

$$\tilde{p}_k = \sigma\big(W_1 v_{p_k} + W_2 s_k + W_3 X_{semantic} + b_{copy}\big)$$

$$p_k^{copy} = \beta\,\tilde{p}_k + (1 - \beta)\,X_{semantic}$$

where $W_1, W_2, W_3$ and $b_{copy}$ are learnable parameters, $v_{p_k}$ denotes the token embedding, $s_k$ denotes the last hidden state of the word decoder at each time step, and β is a balance coefficient set to 0.3; the semantic context scoring module and the word copy probability prediction module are jointly optimized, with the copy-or-predict loss function:

$$L_{copy} = -\sum_{k} \Big( y_k \log p_k^{copy} + (1 - y_k) \log (1 - p_k^{copy}) \Big)$$

where $y_k$ is the ground-truth 0-1 label of whether the word at the k-th time step is copied or predicted, generated from the knowledge graph and the ground-truth sentence;
the total training loss $L_{total}$ consists of four parts: the ordering loss $L_{sort}$, the POS generation loss $L_{pos}$, the word generation loss $L_{token}$, and the copy-or-predict loss $L_{copy}$; the expression for the total training loss is:

$$L_{total} = L_{token} + \lambda_1 L_{pos} + \lambda_2 L_{sort} + \lambda_3 L_{copy}$$

where $\lambda_1, \lambda_2$ and $\lambda_3$ are trade-off factors.
CN202211382679.6A 2022-11-07 2022-11-07 Network security event source tracing script generation method based on knowledge graph composite embedding Active CN115422376B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211382679.6A CN115422376B (en) 2022-11-07 2022-11-07 Network security event source tracing script generation method based on knowledge graph composite embedding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211382679.6A CN115422376B (en) 2022-11-07 2022-11-07 Network security event source tracing script generation method based on knowledge graph composite embedding

Publications (2)

Publication Number Publication Date
CN115422376A CN115422376A (en) 2022-12-02
CN115422376B 2023-03-24

Family

ID=84207874

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211382679.6A Active CN115422376B (en) 2022-11-07 2022-11-07 Network security event source tracing script generation method based on knowledge graph composite embedding

Country Status (1)

Country Link
CN (1) CN115422376B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117951314B (en) * 2024-03-26 2024-06-07 Nanjing Zhongzhiwei Information Technology Co ltd Scenario generation decision method integrating knowledge graph and large language generation model

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114491541B (en) * 2022-03-31 2022-07-22 南京众智维信息科技有限公司 Automatic arrangement method of safe operation script based on knowledge graph path analysis
CN114429198A (en) * 2022-04-07 2022-05-03 南京众智维信息科技有限公司 Self-adaptive layout method for network security emergency treatment script

Also Published As

Publication number Publication date
CN115422376A (en) 2022-12-02


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant