CN114048314A - Natural language steganalysis method - Google Patents
Natural language steganalysis method
- Publication number: CN114048314A (application CN202111330766.2A)
- Authority: CN (China)
- Prior art keywords: text, word, node, nodes, graph
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/35—Information retrieval of unstructured textual data; Clustering; Classification
- G06F40/216—Natural language analysis; Parsing using statistical methods
- G06F40/284—Recognition of textual entities; Lexical analysis, e.g. tokenisation or collocates
- G06N3/04—Neural networks; Architecture, e.g. interconnection topology
- G06N3/08—Neural networks; Learning methods
Abstract
The invention discloses a natural language steganalysis method comprising the following steps: step 1, constructing the data set into a heterogeneous graph with texts and words as nodes, using word-pair correlation and word-text correlation; step 2, acquiring initial text node features and initial word node features; step 3, obtaining node representation vectors containing steganalysis features based on a graph attention neural network; and step 4, inputting the obtained final representation vectors of the text nodes to be analyzed into the trained joint classifier to discriminate among steganographic text, normally generated text, and normal natural text.
Description
Technical Field
The invention relates to the field of text steganalysis and natural language processing, in particular to a natural language steganalysis method based on BERT and a graph attention neural network.
Background
Steganography is a security technique that embeds secret information into a common carrier (e.g., images, text, or audio) so that its presence is imperceptible. Text is the most common and frequently used information carrier in daily life, so hiding information in text is of great practical significance, and text steganography has accordingly attracted extensive attention from researchers. In recent years, with the rapid development of deep learning in natural language processing, text-generation research such as machine translation and dialogue systems has advanced significantly. On this basis, generative text steganography, in which a program automatically generates high-quality text carrying secret information, has become a research hotspot. It differs from traditional steganography in that it generates high-quality, readable text that carries the secret information directly, with no need to modify a given cover text to embed it.
Steganalysis is a technique for detecting whether secret information is hidden in a target text. Early steganalysis methods mainly extracted hand-designed features from the target text, such as word frequency and context similarity. However, such methods apply only to steganographic text produced by a specific steganographic scheme; for text produced by deep-learning-based generative steganography, which is highly similar to natural text and of much higher quality, traditional steganalysis is ineffective. At present, steganalysis research on generative steganographic text mostly treats detection as a binary classification problem: steganographic text versus normally generated text (generated text without embedded secret information), or steganographic text versus normal natural text (ordinary human-written text). No prior study combines the three types of text in a single steganalysis task. In real life, however, steganographic text containing secret information, normal natural text, and automatically generated text without secret information coexist, so recognizing generated steganographic text among normal natural text and normally generated text has higher application value.
Therefore, the invention provides a natural language steganalysis method that not only accurately distinguishes steganographic text, normally generated text, and normal natural text, but also improves the detection performance for steganographic text.
Disclosure of Invention
To realize the purpose of the invention, the following technical scheme is adopted:
a natural language steganalysis method comprises the following steps: step 1, constructing a data set into an abnormal graph with texts and words as nodes by using word pair correlation and word-text correlation; step 2, acquiring initial text node and word node characteristics; step 3, obtaining a node expression vector containing the steganalysis characteristics based on the graph attention neural network; and 4, inputting the obtained final text node expression vector to be analyzed into the trained joint classifier to realize the judgment of the steganographic text, the normally generated text and the normal natural text.
In the natural language steganalysis method, step 1 comprises: constructing, for all labeled and unlabeled texts in the data set, a large heterogeneous text graph comprising a large number of text nodes and word nodes, wherein each text node represents one text, the texts comprising the texts to be analyzed, steganographic texts, normally generated texts, and normal natural texts; each word node represents one word, the words being obtained by deduplicating all words split from the texts. The constructed text graph is represented as

G = (ν, ε)

where ν denotes the nodes in the text graph and ε the edges. ν includes all text nodes T = {t_1, t_2, …, t_{n_doc}}, where n_doc is the number of texts in the data set (texts to be analyzed, steganographic texts, normally generated texts, and normal natural texts), and all word nodes W = {w_1, w_2, …, w_{n_word}}, where n_word is the number of (deduplicated) words in the data set; that is, ν = T ∪ W. The edge set ε represents the relationships between all nodes, including word nodes and text nodes: when the association between a word and a text is high, an edge is constructed between the word node and the text node, otherwise no edge is constructed between them; when the word-pair correlation between two word nodes is high, an edge is constructed between the two word nodes, otherwise no edge is constructed between them; and no edges are constructed between text nodes.
In the natural language steganalysis method, step 1 comprises calculating the association between words and texts. The relevance of word w_i to text t_j combines the TF-IDF weight of w_i in t_j with the average cosine similarity between the word vectors:

F(w_i, t_j) = tfidf(w_i, t_j) · (1/s) Σ_{k=1}^{s} cos(x_i, x_k)

where text t_j is assumed to contain s words, cos() is the cosine similarity function, tfidf() is the TF-IDF function, and x_i is the word vector of w_i obtained with the distributed word representation model Glove.

The calculated association F(w_i, t_j) is compared with a preset threshold δ: if F(w_i, t_j) is higher than δ, an edge is constructed between text node t_j and word node w_i; otherwise, no edge is constructed between them.
In the natural language steganalysis method, step 1 comprises calculating the word-pair correlation between words. The correlation of word w_i and word w_j is measured by pointwise mutual information:

PMI(w_i, w_j) = log( p(w_i, w_j) / (p(w_i) · p(w_j)) )

which is normalized to between -1 and 1 as:

G(w_i, w_j) = PMI(w_i, w_j) / (-log p(w_i, w_j))

where p(w_i, w_j) is the probability that words w_i and w_j co-occur in the data set within a fixed-size context distance, and p(w_i) is the probability that word w_i occurs in the data set. These are calculated as:

p(w_i, w_j) = Count(w_i, w_j) / Count_total
p(w_i) = Count(w_i) / Count_total

where Count(w_i, w_j) is the number of times w_i and w_j occur together within the fixed-size context distance, Count(w_i) is the total number of occurrences of w_i in the data set, and Count_total is the total number of words in the data set.

Whether an edge is constructed between word nodes is determined by comparing the correlation with a preset threshold β: when G(w_i, w_j) > β, the word-pair correlation is high and an edge is constructed for the word pair; when G(w_i, w_j) ≤ β, the correlation is low and no edge is constructed for the word pair.
In the natural language steganalysis method, step 2 comprises: constructing initial node features. For each text node, the text content, i.e., all words and punctuation, is input into a BERT model, and the resulting output vector is the initial representation of the text node; for each word node, the word vector produced by the distributed word representation model Glove is the initial representation of the word node.

The number of nodes N in the text graph is the number of texts in the data set plus the number of words, i.e., N = n_doc + n_word, where n_doc is the number of text nodes and n_word is the number of word nodes. A matrix X is used as the initial node feature matrix; the text node feature matrix and word node feature matrix are X_doc ∈ R^{n_doc×d} and X_word ∈ R^{n_word×d} respectively, where d is the vector dimension. The initial node feature matrix of the model is

X = [X_doc; X_word]

where X_doc is the BERT-based initial text node feature matrix and X_word is the Glove-based initial word node feature matrix.
In the natural language steganalysis method, step 3 comprises:

Step 3.1: X is input into the graph attention neural network model, and the attention mechanism computes attention weights used as the edge weights between nodes. Given a particular node i and a neighbor j, where i, j ∈ N, the attention weight is computed from the node representation vectors x_i and x_j.

The importance of node j to node i is first calculated as

e_ij = LeakyReLU(a^T [W x_i ‖ W x_j])

where LeakyReLU is an activation function providing non-linearity, W is the linear transformation weight matrix applied at each node, and a is a learnable weight vector. Then, the softmax function normalizes over the neighbor nodes of the central node i to obtain the attention weight α_ij from node j to node i:

α_ij = exp(e_ij) / Σ_{k∈N_i} exp(e_ik)

where N_i denotes the set of neighbor nodes of node i.

Step 3.2: For each node, an M-layer graph attention neural network model aggregates adjacent node information through the edge weights, extracts steganalysis features to obtain the node's output features, and updates the node representation. The m-th graph attention layer (m ∈ [1, M]) uses a multi-head attention mechanism to stabilize the learning process, and computes the node output features as

x_i^(m) = ‖_{k=1}^{K} σ( Σ_{j∈N_i} α_ij^k W^k x_j^(m-1) )

where σ is the activation function, α_ij^k is the attention weight normalized by the k-th attention head, W^k is the corresponding weight matrix, N_i denotes the neighbor node set of node i, K is the number of independent attention heads used, and ‖ denotes the concatenation operation.

At the M-th layer, to make the result more stable, the vectors of the attention heads are no longer concatenated; instead, their average is taken:

x_i^M = σ( (1/K) Σ_{k=1}^{K} Σ_{j∈N_i} α_ij^k W^k x_j^(M-1) )
the natural language steganalysis method comprises the following steps of 4: obtaining a representation vector x of the text node after the M-layer graph attention neural network modelMAll the text nodes pass through the M-layer graph attention neural network model to obtain a final text node feature matrixThen the text node feature matrix is processedAnd (3) feeding a softmax layer, wherein the formula is as follows:
where each text node represents a vector xMAnd obtaining a 3-dimensional vector representing three types of text probability values after passing through the softmax layer.
The natural language steganalysis method, wherein step 4 further comprises: constructing an auxiliary classifier based on the BERT text representation, acting directly on part of the input of the graph attention neural network model, namely the initial text node features X_doc computed with the BERT model. The auxiliary classifier is expressed as

Z_BERT = softmax(W X_doc)

where each initial text node representation vector x (x ∈ X_doc) yields, after the softmax layer, a 3-dimensional vector representing the probability values of the three text categories.

A parameter η controls the balance between the two classifiers: η = 1 means only the graph attention neural network classifier Z_GAT is used, and η = 0 means only the BERT-based auxiliary classifier Z_BERT is used. Linearly interpolating the prediction of the graph attention neural network model with the prediction of the BERT-based auxiliary classifier gives the final joint classifier:

Z = η Z_GAT + (1 - η) Z_BERT

where Z is the finally computed probability distribution matrix of all text nodes; the probability distribution of each text node is a 3-dimensional vector representing the probability distribution over the three text categories.
In the natural language steganalysis method, during model training the model parameters are updated with the back-propagation algorithm, using the standard cross-entropy loss as the loss function. The labels of the labeled texts are used to minimize the loss through model iteration, so that the parameters of the BERT model and the graph attention neural network model are jointly optimized to obtain the optimal model. Formally:

L = - Σ_{f∈N_label} Σ_{i=1}^{t} Y_fi log Z_fi

where N_label is the index set of labeled text nodes and t is the number of categories of labeled text nodes; Y is the actual label, i.e., Y_fi = 1 if the category of the f-th labeled text node is i and Y_fi = 0 otherwise; and Z_fi is the model's output for the f-th labeled text node, namely the probability that its category is i.

After training is finished and the optimal model is obtained, the texts to be analyzed are classified by the optimal model, realizing the discrimination of steganographic text, normally generated text, and normal natural text.
Drawings
FIG. 1 is a flow chart of a natural language steganalysis method;
FIG. 2 is a schematic diagram of a natural language steganalysis method framework.
Detailed Description
The invention is described in detail below with reference to FIGS. 1-2.
As shown in FIGS. 1-2, the natural language steganalysis method of the invention comprises: step 1, constructing the data set into a heterogeneous graph with texts and words as nodes, using word-pair correlation and word-text correlation; step 2, acquiring initial text node features and initial word node features; step 3, obtaining node representation vectors containing steganalysis features through a graph attention neural network; and step 4, after obtaining the final representation vectors of the text nodes to be analyzed, feeding them to the constructed graph-attention-based classifier and combining it with the BERT-based auxiliary classifier for joint prediction and classification, realizing the final discrimination of steganographic text, normally generated text, and normal natural text.
Step 1: constructing a data set into a heterogeneous graph with texts and words as nodes
The data set comprises all texts to be analyzed (the test set, without labels) together with steganographic texts, normally generated texts, and normal natural texts (these three types together form the training set, each text carrying a category label).
The method first constructs a large heterogeneous text graph comprising a large number of text nodes and word nodes. Traditional composition methods build a graph only from the words within a single text: by connecting a target word with context words at different distances, they can capture relationships between non-adjacent words, but they ignore global word co-occurrence information and the information between different texts. The generation process of steganographic text causes differences from normally generated text and normal natural text that are generally global, not merely local. Therefore, in order to automatically learn deep local and global features and thereby sensitively perceive the statistical and linguistic differences among steganographic text, normally generated text, and normal natural text, this patent constructs the whole data set into a large heterogeneous text graph with texts and words as nodes, so as to better aggregate global information and text information.
Each text is represented as a text node; the texts comprise the texts to be analyzed, steganographic texts, normally generated texts, and normal natural texts, and these four types of texts in the data set form the graph as text nodes. A transductive learning method is adopted to train the model, so that the information contained in the unlabeled texts to be analyzed is also used during training; exploiting global graph information in this way benefits the node representation learning process and yields higher model performance. Each word node represents one word, obtained by deduplicating all words split from the texts.
As shown in FIG. 2, the method constructs a global text graph from all texts and words in the data set, expressed as G = (ν, ε), where ν denotes the nodes in the text graph and ε the edges. ν includes all text nodes T = {t_1, …, t_{n_doc}}, where n_doc is the number of texts in the data set (texts to be analyzed (unlabeled), steganographic texts, normally generated texts, and normal natural texts), and all word nodes W = {w_1, …, w_{n_word}}, where n_word is the number of (deduplicated) words in the data set; i.e., ν = T ∪ W. The edge set ε represents the relationships between all nodes, including word nodes and text nodes. Since n_doc and n_word are large, connecting all node pairs would make the text graph enormous; the method therefore treats texts as mutually independent, with no edges between text nodes. Whether an edge is formed between a text node and a word node is determined by the association between the word and the text, and whether an edge is formed between two word nodes is determined by the word-pair correlation. The specific steps of the text graph construction are as follows:
(1) calculating the association degree of the words and the texts and determining the construction of the edges between the word nodes and the text nodes
To clarify the relationship between words and texts, an edge between a text and a word is constructed when their association is high. In this way, unimportant words in a text can be omitted: edges between texts and common words are reduced while edges to important words are retained, which better promotes message propagation and yields better text node representations. The relevance of word w_i to text t_j combines the TF-IDF weight of w_i in t_j with the average cosine similarity between the word vectors:

F(w_i, t_j) = tfidf(w_i, t_j) · (1/s) Σ_{k=1}^{s} cos(x_i, x_k)

where text t_j is assumed to contain s words, cos() is the cosine similarity function, tfidf() is the TF-IDF function, and x_i is the word vector of w_i obtained with the distributed word representation model Glove.

The calculated association F(w_i, t_j) is compared with a preset threshold δ: if F(w_i, t_j) is higher than δ, an edge is constructed between text node t_j and word node w_i; otherwise, no edge is constructed between them.
(2) Calculating word pair correlation degree and determining construction of edges between word nodes
To further enrich the textual information, promote message propagation, and better capture global information, the edges between word pairs must be limited; otherwise, as the data set grows, the subgraph covering the word nodes approaches full connectivity. This patent therefore determines whether two word nodes are connected by measuring the correlation of the word pair. The correlation of word w_i and word w_j is measured by pointwise mutual information:

PMI(w_i, w_j) = log( p(w_i, w_j) / (p(w_i) · p(w_j)) )

For a better representation, the word-pair correlation is normalized to between -1 and 1 as:

G(w_i, w_j) = PMI(w_i, w_j) / (-log p(w_i, w_j))

where p(w_i, w_j) is the probability that words w_i and w_j co-occur in the data set within a fixed-size context distance, and p(w_i) is the probability that word w_i occurs in the data set, calculated as:

p(w_i, w_j) = Count(w_i, w_j) / Count_total
p(w_i) = Count(w_i) / Count_total

where Count(w_i, w_j) is the number of times w_i and w_j occur together within the fixed-size context distance, Count(w_i) is the total number of occurrences of w_i in the data set, and Count_total is the total number of words in the data set.

Whether an edge is constructed between word nodes is determined by comparing the correlation with a preset threshold β: when G(w_i, w_j) > β, the word-pair correlation is high and an edge is constructed for the word pair; when G(w_i, w_j) ≤ β, the correlation is low and no edge is constructed for the word pair.
Step two: obtaining initial text node characteristics and initial word node characteristics
With the graph constructed according to the composition method of step one, the initial node features of all nodes need to be obtained.
(1) Glove-based word representation
Compared with Word2vec, which considers only local word information, the distributed word representation model Glove uses a co-occurrence matrix to consider both local information and the global information of the whole corpus, and can represent the syntactic and semantic information of words more accurately. The Glove model is therefore used to obtain the word vector of each word in the dictionary. In the text graph constructed by this patent, the word node features are initialized to Glove-based word vectors, and the initial feature matrix of all word nodes is denoted X_word ∈ R^{n_word×d}.
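For concreteness, a sketch of initializing X_word from a standard pre-trained Glove text file (e.g. glove.6B.300d.txt); the file path, format, and the zero-vector fallback for out-of-vocabulary words are assumptions.

```python
import numpy as np

def init_word_features(glove_path, vocab, dim=300):
    """Build X_word (n_word x d): one pre-trained Glove vector per word node.

    vocab: ordered list of deduplicated words, one per word node."""
    vectors = {}
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if parts[0] in vocab:
                vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    x_word = np.zeros((len(vocab), dim), dtype=np.float32)
    for i, w in enumerate(vocab):
        if w in vectors:
            x_word[i] = vectors[w]  # out-of-vocabulary words stay zero
    return x_word
```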
(2) BERT-based text representation
The text content of each text node in the text graph (i.e., all words and punctuation marks) is input into a pre-trained BERT model, and the vector representation of the text node is obtained through the BERT model's computation. The BERT model is a pre-trained model based on the Transformer encoder and trained with a masking strategy. The input to BERT's Transformer encoder involves three different vectors: 1) the original word vector of each word in the text content, which is randomly initialized; the matrix formed by the word vectors of all words is denoted E_word; 2) the text (segment) vector, whose dimension is consistent with the word vector and whose values are learned automatically during training; the matrix of text vectors of all words is denoted E_segment; 3) the position vector: to encode the sequential information of the text, each word learns a position vector according to its position in the text; the matrix of position vectors of all words is denoted E_position. Finally, the input to BERT is the sum of the three vectors:

x_input = E_word + E_segment + E_position

With x_input as the input of the BERT model, the model computes a vector representation fusing the semantic information of the text content of the text node, denoted x_bert = BERT(x_input). To make the obtained text representation fit the natural language steganalysis task better, this patent fine-tunes the pre-trained BERT model. The text representation vectors finally produced by BERT serve as the initial features of the text nodes, so the initial feature matrix of all text nodes in the constructed text graph is X_doc ∈ R^{n_doc×d}.
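A sketch of extracting initial text node features with a pre-trained BERT, assuming the HuggingFace transformers library, the bert-base-uncased checkpoint, and the [CLS] vector of the last layer as the text representation; in the full method BERT is fine-tuned jointly, whereas this sketch only computes the initial features.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

@torch.no_grad()  # initial features only; drop no_grad when fine-tuning jointly
def init_text_features(texts, batch_size=32):
    """Build X_doc (n_doc x 768): one BERT vector per text node."""
    feats = []
    for i in range(0, len(texts), batch_size):
        enc = tokenizer(texts[i:i + batch_size], padding=True, truncation=True,
                        max_length=128, return_tensors="pt")
        out = bert(**enc)
        feats.append(out.last_hidden_state[:, 0])  # vector at the [CLS] position
    return torch.cat(feats)
```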
In summary, the initial feature matrix comprising all text nodes and word nodes is

X = [X_doc; X_word]

where X_doc is the BERT-based initial text node feature matrix and X_word is the Glove-based initial word node feature matrix.
Step three: node representation vector update
After the initial feature matrices of all nodes are obtained in step two, the attention mechanism of the graph attention neural network calculates the attention weights between target nodes and their adjacent nodes in the text graph as the edge weights between nodes, and each target node aggregates the information of its adjacent nodes through these edge weights to update its node representation.
(1) Edge weight calculation based on attention mechanism
The node feature matrix X of the text graph, the ID values of all texts in the data set, and the ID values of all words are input into the graph attention neural network model, and the attention mechanism calculates attention weights as the edge weights between nodes. The attention mechanism helps a node capture the different importance of its adjacent nodes, which is significant for the steganalysis task. Given a target node i and an adjacent node j (an adjacent node is a node j sharing an edge with node i), where i, j ∈ N, the importance of node j to node i is first calculated as

e_ij = LeakyReLU(a^T [W x_i ‖ W x_j])

where LeakyReLU is an activation function providing non-linearity, W is the linear transformation weight matrix applied at each node, and a is a learnable weight vector. Then, the softmax function normalizes over the neighbor nodes of the central node i to obtain the attention weight α_ij from node j to node i:

α_ij = exp(e_ij) / Σ_{k∈N_i} exp(e_ik)

where N_i denotes the set of neighbor nodes of node i.
(2) updating node representations based on edge weights
For each node, an M-layer graph attention neural network model aggregates adjacent node information through the edge weights, extracts steganalysis features to obtain the node's output features, and updates the node representation vector. The m-th graph attention layer (m ∈ [1, M]) uses a multi-head attention mechanism to stabilize the learning process, and computes the node output features as

x_i^(m) = ‖_{k=1}^{K} σ( Σ_{j∈N_i} α_ij^k W^k x_j^(m-1) )

where σ is the activation function, α_ij^k is the attention weight normalized by the k-th attention head, W^k is the corresponding weight matrix, N_i denotes the neighbor node set of node i, K is the number of independent attention heads used, and ‖ denotes the concatenation operation.
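The following dense-adjacency sketch implements one such multi-head graph attention layer (the formulas for e_ij, α_ij, and the concatenated multi-head update). It is a minimal illustration: a production model would use sparse operations (e.g. torch_geometric's GATConv), and the adjacency matrix is assumed to include self-loops so every node has at least one neighbor.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionLayer(nn.Module):
    """One multi-head graph attention layer over the text graph."""
    def __init__(self, d_in, d_out, heads=8, concat=True):
        super().__init__()
        self.concat = concat
        self.W = nn.Parameter(torch.empty(heads, d_in, d_out))
        self.a = nn.Parameter(torch.empty(heads, 2 * d_out))
        nn.init.xavier_uniform_(self.W)
        nn.init.xavier_uniform_(self.a)

    def forward(self, x, adj):            # x: (N, d_in), adj: (N, N) 0/1 mask
        d_out = self.W.size(-1)
        h = torch.einsum("nd,kdo->kno", x, self.W)          # W^k x per head
        src = (h * self.a[:, :d_out].unsqueeze(1)).sum(-1)  # a_left  . W^k x_i
        dst = (h * self.a[:, d_out:].unsqueeze(1)).sum(-1)  # a_right . W^k x_j
        e = F.leaky_relu(src.unsqueeze(2) + dst.unsqueeze(1), 0.2)  # e_ij
        e = e.masked_fill(adj.unsqueeze(0) == 0, float("-inf"))
        alpha = torch.softmax(e, dim=2)   # alpha_ij: normalized over neighbors
        out = torch.bmm(alpha, h)         # aggregate neighbor features
        if self.concat:                   # hidden layers: concatenate K heads
            return F.elu(out.permute(1, 0, 2).reshape(x.size(0), -1))
        return out.mean(dim=0)            # final layer: average the K heads
```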
At the M-th layer, to make the result more stable, this patent no longer concatenates the vectors of the attention heads but takes their average:

x_i^M = σ( (1/K) Σ_{k=1}^{K} Σ_{j∈N_i} α_ij^k W^k x_j^(M-1) )

Step four: joint prediction and classification
After the M-layer graph attention neural network model, the final representation vector x^M of each text node is obtained, and all text nodes together form the final text node feature matrix X_doc^M. This feature matrix is then fed into a softmax layer:

Z_GAT = softmax(W X_doc^M)

where each text node representation vector x^M yields, after the softmax layer, a 3-dimensional vector representing the probability values of the three text categories.
Preferably, in order to optimize the graph attention neural network model and improve its performance, the invention also constructs an auxiliary classifier based on the BERT text representation, which acts directly on part of the input of the graph attention neural network model, namely the initial text node feature matrix X_doc. The auxiliary classifier is expressed as

Z_BERT = softmax(W X_doc)

where W is a trainable weight matrix and each initial text node representation vector x (x ∈ X_doc) yields, after the softmax layer, a 3-dimensional vector representing the probability values of the three text categories.
To balance the auxiliary classifier with the graph attention neural network model, a parameter η controls the trade-off between the two objectives: η = 1 means only the graph attention neural network model is used, and η = 0 means only the BERT-based auxiliary classifier module is used. When η ∈ (0, 1), the predictions of the two models are balanced, which allows better optimization. The training objective of this patent therefore linearly interpolates the prediction of the graph attention neural network model with the prediction of the auxiliary classifier to obtain the final joint classifier:

Z = η Z_GAT + (1 - η) Z_BERT

where Z is the finally computed probability distribution matrix of all text nodes; the probability distribution of each text node is a 3-dimensional vector representing the probability distribution over the three text categories. For example, if a text node is judged to be steganographic text with probability 0.6, normally generated text with probability 0.3, and normal natural text with probability 0.1, its probability distribution is (0.6, 0.3, 0.1).
All parameters of the BERT model and the graph attention neural network model are obtained through training. The implementation follows a supervised learning framework: the model parameters are updated with the back-propagation algorithm, the standard cross-entropy loss serves as the loss function, and the labels of the labeled texts are used to minimize the loss through iterative optimization, so that the parameters of the BERT model and the graph attention neural network model are jointly optimized to obtain the optimal model. Formally:

L = - Σ_{f∈N_label} Σ_{i=1}^{t} Y_fi log Z_fi

where N_label is the index set of labeled text nodes and t is the number of categories of labeled text nodes; Y is the actual label, i.e., Y_fi = 1 if the category of the f-th labeled text node is i and Y_fi = 0 otherwise; and Z_fi is the model's output for the f-th labeled text node, namely the probability that its category is i.
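A training-loop sketch for the loss above: only labeled text nodes contribute to the cross-entropy, while the unlabeled texts to be analyzed still participate in graph propagation (the transductive setting). The optimizer choice, learning rate, and epoch count are assumptions.

```python
import torch
import torch.nn.functional as F

def train(model, x, adj, labels, labeled_idx, n_doc, epochs=100, lr=1e-3):
    """Minimize L = -sum_f sum_i Y_fi log Z_fi over labeled text nodes.

    labels: class ids (0/1/2) aligned with labeled_idx."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        model.train()
        optimizer.zero_grad()
        z = model(x, adj, n_doc)  # (n_doc, 3) joint probabilities
        # cross-entropy on probabilities: take log, then negative log-likelihood
        loss = F.nll_loss(torch.log(z[labeled_idx] + 1e-12), labels)
        loss.backward()
        optimizer.step()
```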
After training is completed and the optimal model is obtained, the texts to be analyzed are classified by the optimal model, realizing the discrimination of steganographic text, normally generated text, and normal natural text:

Z_no-label = η Z_GAT + (1 - η) Z_BERT

where Z_no-label is the finally computed probability distribution matrix of all text nodes to be analyzed; the probability distribution of each such text node is a 3-dimensional vector representing the probability distribution over the three text categories, and the model predicts the text to be analyzed as the category with the highest probability value. For example, if a text node to be analyzed is judged to be steganographic text with probability 0.8, normally generated text with probability 0.15, and normal natural text with probability 0.05, it is predicted to be steganographic text.
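Finally, a prediction sketch: each text to be analyzed is assigned the category with the highest probability in Z_no-label. The ordering of the three categories is an assumption.

```python
import torch

CLASSES = ["steganographic text", "normally generated text", "normal natural text"]

@torch.no_grad()
def predict(model, x, adj, n_doc, unlabeled_idx):
    model.eval()
    z = model(x, adj, n_doc)                    # joint distribution Z_no-label
    return [CLASSES[c] for c in z[unlabeled_idx].argmax(dim=-1)]
```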
Through the above four steps, the data set is constructed into a large heterogeneous text graph with texts and words as nodes. Word vectors obtained with the distributed word representation model Glove serve as the initial word node features, and text vectors rich in semantic and local information obtained with the large-scale pre-trained model BERT serve as the initial text node features. The graph attention neural network then aggregates adjacent node information for the text nodes to obtain text node representation vectors rich in steganalysis features: its attention mechanism computes attention weights between each node and its adjacent nodes as edge weights, so that a text node can weigh the differing importance of its neighbors and selectively aggregate information from the more important ones, thereby improving the detection accuracy of steganalysis.
Claims (2)
1. A natural language steganalysis method, characterized in that the method comprises the following steps: step 1, constructing the data set into a heterogeneous graph with texts and words as nodes, using word-pair correlation and word-text correlation; step 2, acquiring initial text node and word node features; step 3, obtaining node representation vectors containing steganalysis features based on a graph attention neural network; and step 4, inputting the obtained final representation vectors of the text nodes to be analyzed into the trained joint classifier to realize the discrimination of steganographic text, normally generated text, and normal natural text.
2. The natural language steganalysis method according to claim 1, characterized in that step 1 comprises: constructing, for all labeled and unlabeled texts in the data set, a large heterogeneous text graph comprising a large number of text nodes and word nodes, wherein each text node represents one text.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111330766.2A CN114048314B (en) | 2021-11-11 | 2021-11-11 | Natural language steganalysis method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111330766.2A CN114048314B (en) | 2021-11-11 | 2021-11-11 | Natural language steganalysis method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114048314A true CN114048314A (en) | 2022-02-15 |
CN114048314B CN114048314B (en) | 2024-08-13 |
Family
ID=80208770
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111330766.2A Active CN114048314B (en) | 2021-11-11 | 2021-11-11 | Natural language steganalysis method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114048314B (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101151622A (en) * | 2005-01-26 | 2008-03-26 | 新泽西理工学院 | System and method for steganalysis |
US10755171B1 (en) * | 2016-07-06 | 2020-08-25 | Google Llc | Hiding and detecting information using neural networks |
CN111488734A (en) * | 2020-04-14 | 2020-08-04 | 西安交通大学 | Emotional feature representation learning system and method based on global interaction and syntactic dependency |
Non-Patent Citations (2)
Title |
---|
YUXIAO LIN; YUXIAN MENG et al.: "BertGCN: Transductive Text Classification by Combining GCN and BERT", Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 6 August 2021 (2021-08-06), pages 1456 |
YU JINGMIN; XIANG LINGYUN; ZENG DAOJIAN: "Natural language steganalysis method based on Word2vec" (in Chinese), Computer Engineering, vol. 45, no. 3, 29 March 2018 (2018-03-29) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115169293A (en) * | 2022-09-02 | 2022-10-11 | 南京信息工程大学 | Text steganalysis method, system, device and storage medium |
CN117648681A (en) * | 2024-01-30 | 2024-03-05 | 北京点聚信息技术有限公司 | OFD format electronic document hidden information extraction and embedding method |
CN117648681B (en) * | 2024-01-30 | 2024-04-05 | 北京点聚信息技术有限公司 | OFD format electronic document hidden information extraction and embedding method |
Also Published As
Publication number | Publication date |
---|---|
CN114048314B (en) | 2024-08-13 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
- PB01 | Publication | ||
- SE01 | Entry into force of request for substantive examination | ||
- GR01 | Patent grant | ||