CN110110318A - Text steganography detection method and system based on recurrent neural networks - Google Patents

Text steganography detection method and system based on recurrent neural networks

Info

Publication number
CN110110318A
Authority
CN
China
Prior art keywords
text
neural network
recurrent neural network
detected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910058680.5A
Other languages
Chinese (zh)
Other versions
CN110110318B (en)
Inventor
黄永峰
杨忠良
王颗
杨震
胡雨婷
武楚涵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN201910058680.5A priority Critical patent/CN110110318B/en
Publication of CN110110318A publication Critical patent/CN110110318A/en
Application granted granted Critical
Publication of CN110110318B publication Critical patent/CN110110318B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • G06F18/24137Distances to cluster centroïds
    • G06F18/2414Smoothing the distance, e.g. radial basis function networks [RBFN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention discloses a text steganography detection method and system based on a recurrent neural network. The method comprises: obtaining a word vector matrix, and converting the text to be detected into an input word vector sequence according to the word vector matrix; inputting the input word vector sequence into a pre-built recurrent neural network model to generate a feature vector representing the correlations between the words of the text to be detected; classifying the feature vector with a classifier to judge whether the text to be detected contains hidden information; and, if the text to be detected contains hidden information, estimating the information embedding rate of the text according to the differences between the feature vectors of steganographic texts at different embedding rates. By applying recurrent neural networks to text steganography detection, the method can effectively identify whether a text carrier contains hidden information and can accurately estimate the capacity of the hidden information from the statistical distribution of the extracted features.

Description

Text steganography detection method and system based on recurrent neural networks
Technical field
The present invention relates to the field of text information and communication technology, and in particular to a text steganography detection method and system based on recurrent neural networks.
Background technique
In his monograph on information security, Shannon summarized three basic information security systems: encryption systems, privacy systems and hiding systems.
1. An encryption system encodes information in a special way so that only an authorized party can decode it, while an unauthorized party cannot. It ensures the security of information by making the message difficult to read.
2. A privacy system mainly restricts access to information, so that only authorized users can access important information, while unauthorized users cannot access it in any way under any circumstances. However, although these two systems ensure the security of information, they also expose the existence and importance of the information, making it more vulnerable to attacks such as interception and cracking.
3. A hiding system is quite different from the two secrecy systems above. It embeds confidential information into an ordinary carrier and transmits it over an ordinary channel, concealing the very existence of the secret information, so that it is less likely to arouse suspicion and be attacked.
Steganography is the key technology of hiding systems. Carriers in various media formats can be used for information hiding, including images, audio, text and so on. As the most widely used information carrier in daily life, text has attracted the interest of many researchers in text steganography. In recent years, more and more text-based information hiding methods have appeared.
J. Fridrich concluded that, in general, steganographic algorithms can adopt three different basic frameworks to determine the internal mechanisms of the embedding and extraction algorithms: steganography by carrier retrieval, by carrier modification, and by carrier generation.
1. In steganography by carrier retrieval, all the carriers in a carrier set are first encoded, and hidden message transmission is then achieved by selecting which carriers to transmit. The advantage of this approach is that the carriers are always "100% natural", but its obvious drawback is that very little information can be transmitted.
2. The most widely studied steganography methods at present are those based on carrier modification, which embed confidential information by modifying a given carrier. Such methods are widely applied to carriers such as images, speech and text. In general, however, the redundancy of images and speech is relatively large, so moderate modifications do not cause large visual or auditory changes. Text, by contrast, is more densely encoded and contains less redundant information, which limits the size of the modifiable space. Modification-based methods therefore have difficulty achieving a sufficiently high hidden capacity on text carriers.
3. The third approach is steganography by carrier generation. The carrier is generated automatically according to the confidential information to be transmitted, and the hidden information is embedded during the generation process. Such methods usually have a high hidden capacity; however, earlier models had difficulty generating readable, high-quality text, which limited the concealment of such methods. With the wide application of deep learning in the field of natural language processing (NLP), many generative text steganography methods based on deep learning have appeared. Thanks to the powerful feature learning and representation capabilities of neural networks, readable, high-quality text can be generated and hidden information can be embedded by coding during the generation process, so that the generated text carrying hidden information has both high concealment and a high embedding rate.
However, the above text steganography methods may be exploited by terrorists, hackers and other criminals to transmit hidden information that endangers the public, posing a potential threat to public safety. To counter this threat, it is necessary to identify whether a text contains hidden information, and the key technology for this is text steganography detection.
Traditional text steganography detection techniques are usually based on the following framework: first, some statistical features of the steganographic texts and cover texts in a training set are extracted by a specific method; the differences between these statistical features of steganographic and cover texts are then analyzed; finally, a discriminator is designed based on these differences to determine whether a given text contains hidden information. However, traditional text steganography detection techniques analyze and use manually specified cover-text features, and these features are easily imitated by deep neural networks with powerful feature learning and expression abilities, which can generate steganographic texts that conform to these feature distributions. When facing the latest generative text steganography algorithms based on deep learning, the detection accuracy of these conventional methods therefore drops sharply. If such powerful generative text steganography algorithms are exploited by criminals, they will pose a potential threat to cyberspace and public safety. Identifying generative steganographic text has thus become an urgent problem related to public safety.
Summary of the invention
The present invention aims to solve at least one of the technical problems in the related art to some extent.
To this end, one object of the present invention is to provide a text steganography detection method based on a recurrent neural network. The method applies recurrent neural networks to text steganography detection, can effectively identify whether a text carrier contains hidden information, and accurately estimates the capacity of the hidden information from the statistical distribution of the extracted features.
Another object of the present invention is to provide a text steganography detection system based on a recurrent neural network.
To achieve the above objects, an embodiment of one aspect of the present invention provides a text steganography detection method based on a recurrent neural network, comprising: obtaining a word vector matrix, and converting the text to be detected into an input word vector sequence according to the word vector matrix; inputting the input word vector sequence into a pre-built recurrent neural network model to generate a feature vector representing the correlations between the words of the text to be detected; classifying the feature vector with a classifier to judge whether the text to be detected contains hidden information; and, if the text to be detected contains hidden information, estimating the information embedding rate of the text to be detected according to the differences between the feature vectors of steganographic texts at different embedding rates.
The text steganography detection method based on a recurrent neural network of the embodiments of the present invention applies recurrent neural networks to text steganography detection; it can effectively identify whether a text carrier contains hidden information and accurately estimate the capacity of the hidden information from the statistical distribution of the extracted features. It can effectively overcome the defects of traditional steganography detection methods based on simple features and achieves very high detection accuracy even against neural-network-based generative text steganography methods. The statistical features of the correlations between words are extracted with an LSTM (Long Short-Term Memory) model, which can effectively model the long-distance dependencies in text. Based on the differences in feature distribution between texts at different embedding rates, the hidden information capacity of the text to be detected is accurately estimated.
In addition, the text steganography detection method based on a recurrent neural network according to the above embodiments of the present invention may also have the following additional technical features.
Further, in one embodiment of the present invention, obtaining the word vector matrix comprises: obtaining a large number of generated texts, segmenting and preprocessing the generated texts, and filtering them by word frequency to obtain a common vocabulary set of the generated texts; inputting the common words in the common vocabulary set one by one into a pre-established module to obtain the word vector corresponding to each common word; and forming word vector samples from the word vectors corresponding to the common words and generating the word vector matrix from the word vector samples.
Further, in one embodiment of the present invention, converting the text to be detected into the input word vector sequence according to the word vector matrix comprises: performing low-frequency word filtering, screening and word segmentation on the text to be detected to generate a word sequence of the text to be detected; obtaining the word vectors of all the words in the word sequence according to the word vector matrix, wherein the word vector matrix contains the correspondence between common words and word vectors; and generating the input word vector sequence of the text to be detected from the word vectors of all the words in the text to be detected.
Further, in one embodiment of the present invention, before the input word vector sequence is input into the pre-built recurrent neural network model, the method further includes: obtaining the input word vector sequences corresponding to generated texts with a plurality of known information embedding rates, and building the recurrent neural network model from the input word vector sequences corresponding to the generated texts with the plurality of known information embedding rates.
Further, in one embodiment of the present invention, after the recurrent neural network model has been built, the method further includes: generating, with the recurrent neural network model, the feature vectors corresponding to the generated texts with the known information embedding rates; classifying the feature vectors of the generated texts with the known embedding rates using the classifier, and generating embedding-rate estimation probabilities for these texts; and analyzing the estimation probabilities against the actual embedding rates of the generated texts with the known embedding rates using a preset algorithm to produce an analysis result, and correcting the parameters of the recurrent neural network model and the weight parameters of the classifier according to the analysis result.
To achieve the above objects, an embodiment of another aspect of the present invention provides a text steganography detection system based on a recurrent neural network, comprising: an obtaining module for obtaining a word vector matrix; a first generation module for converting the text to be detected into an input word vector sequence according to the word vector matrix; a second generation module for inputting the input word vector sequence into a pre-built recurrent neural network model to generate a feature vector representing the correlations between the words of the text to be detected; a judgment module for classifying the feature vector with a classifier to judge whether the text to be detected contains hidden information; and an estimation module for, when the text to be detected contains hidden information, estimating the information embedding rate of the text to be detected according to the differences between the feature vectors of steganographic texts at different embedding rates.
The text steganography detection system based on a recurrent neural network of the embodiments of the present invention applies recurrent neural networks to text steganography detection; it can effectively identify whether a text carrier contains hidden information and accurately estimate the capacity of the hidden information from the statistical distribution of the extracted features. It can effectively overcome the defects of traditional steganography detection methods based on simple features and achieves very high detection accuracy even against neural-network-based generative text steganography methods; the statistical features of the correlations between words are extracted with an LSTM model, which can effectively model the long-distance dependencies in text.
In addition, the text steganography detection system based on a recurrent neural network according to the above embodiments of the present invention may also have the following additional technical features.
Further, in one embodiment of the present invention, the obtaining module is specifically configured to: obtain a large number of generated texts, segment and preprocess the generated texts, and filter them by word frequency to obtain a common vocabulary set of the generated texts; input the common words in the common vocabulary set one by one into a pre-established module to obtain the word vector corresponding to each common word; and form word vector samples from the word vectors corresponding to the common words and generate the word vector matrix from the word vector samples.
Further, in one embodiment of the present invention, the first generation module is specifically configured to: perform low-frequency word filtering, screening and word segmentation on the text to be detected to generate a word sequence of the text to be detected; obtain the word vectors of all the words in the word sequence according to the word vector matrix, wherein the word vector matrix contains the correspondence between common words and word vectors; and generate the input word vector sequence of the text to be detected from the word vectors of all the words in the text to be detected.
Further, in one embodiment of the present invention, the system further includes a building module for obtaining the input word vector sequences corresponding to generated texts with a plurality of known information embedding rates and for building the recurrent neural network model from the input word vector sequences corresponding to the generated texts with the plurality of known information embedding rates.
Further, in one embodiment of the present invention, the system further includes a correction module.
The correction module is specifically configured to: generate, with the recurrent neural network model, the feature vectors corresponding to the generated texts with the known information embedding rates; classify the feature vectors of the generated texts with the known embedding rates using the classifier, and generate embedding-rate estimation probabilities for these texts; and analyze the estimation probabilities against the actual embedding rates of the generated texts with the known embedding rates using a preset algorithm to produce an analysis result, and correct the parameters of the recurrent neural network model and the weight parameters of the classifier according to the analysis result.
Additional aspects and advantages of the present invention will be set forth in part in the following description; some will become apparent from the following description, or will be learned through practice of the present invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and easy to understand from the following description of the embodiments in conjunction with the accompanying drawings, in which:
Fig. 1 is a flowchart of the text steganography detection method based on a recurrent neural network according to one embodiment of the present invention;
Fig. 2 illustrates the extraction of word association features using LSTM according to one embodiment of the present invention;
Fig. 3 illustrates the extraction of potential association features between words using a bidirectional recurrent neural network according to one embodiment of the present invention;
Fig. 4 is a schematic structural diagram of the text steganography detection system based on a recurrent neural network according to one embodiment of the present invention.
Detailed description of the embodiments
The embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the present invention and should not be construed as limiting it.
The text steganography detection method and system based on a recurrent neural network proposed according to the embodiments of the present invention are described below with reference to the accompanying drawings, starting with the text steganography detection method based on a recurrent neural network.
Fig. 1 is a flowchart of the text steganography detection method based on a recurrent neural network according to one embodiment of the present invention.
As shown in Fig. 1, the text steganography detection method based on a recurrent neural network comprises the following steps.
In step S101, a word vector matrix is obtained, and the text to be detected is converted into an input word vector sequence according to the word vector matrix.
Further, in one embodiment of the present invention, obtaining the word vector matrix comprises: obtaining a large number of generated texts, segmenting and preprocessing the generated texts, and filtering them by word frequency to obtain a common vocabulary set of the generated texts; inputting the common words in the common vocabulary set one by one into a pre-established module to obtain the word vector corresponding to each common word; and forming word vector samples from the word vectors corresponding to the common words and generating the word vector matrix from the word vector samples.
Further, in one embodiment of the present invention, converting the text to be detected into the input word vector sequence according to the word vector matrix comprises: performing low-frequency word filtering, screening and word segmentation on the text to be detected to generate a word sequence of the text to be detected; obtaining the word vectors of all the words in the word sequence according to the word vector matrix, wherein the word vector matrix contains the correspondence between common words and word vectors; and generating the input word vector sequence of the text to be detected from the word vectors of all the words in the text to be detected.
Specifically, a large number of texts of common types are collected from the Internet, including natural texts and special texts in both Chinese and English. Based on existing generative text steganography methods, a large number of texts with different embedding rates are generated and used as the training data set for steganography detection.
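As a concrete illustration of this preprocessing, the following is a minimal Python sketch; the function names, the whitespace tokenizer and the frequency threshold are assumptions made for illustration and are not taken from the patent.

```python
from collections import Counter

def build_vocab(corpus_texts, min_freq=5):
    """Build the common-vocabulary set of the generated texts,
    filtering out low-frequency words."""
    counts = Counter(w for text in corpus_texts for w in text.lower().split())
    vocab = ["<unk>"] + sorted(w for w, c in counts.items() if c >= min_freq)
    return {w: i for i, w in enumerate(vocab)}

def text_to_indices(text, word2id):
    """Map the text to be detected to a sequence of vocabulary indices;
    out-of-vocabulary (low-frequency) words fall back to <unk>."""
    return [word2id.get(w, word2id["<unk>"]) for w in text.lower().split()]

# Each index then selects a row of the word vector matrix (an embedding table),
# giving the input word vector sequence that is fed to the recurrent neural network.
```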
In step S102, the input word vector sequence is input into a pre-built recurrent neural network model to generate a feature vector representing the correlations between the words of the text to be detected.
Further, in one embodiment of the present invention, before the input word vector sequence is input into the pre-built recurrent neural network model, the method further includes: obtaining the input word vector sequences corresponding to generated texts with a plurality of known information embedding rates, and building the recurrent neural network model from the input word vector sequences corresponding to the generated texts with the plurality of known information embedding rates.
Specifically, the recurrent neural network model consists of one or more hidden layers, each hidden layer containing one or more LSTM (Long Short-Term Memory) units, where the LSTM units may be unidirectional or bidirectional.
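A minimal sketch of such a model in Python/PyTorch is shown below; it assumes a unidirectional stack of LSTM layers and illustrative layer sizes, and is not the patented implementation itself.

```python
import torch.nn as nn

class TextAssociationModel(nn.Module):
    """Embedding layer (the word vector matrix) followed by a stack of LSTM
    hidden layers; the last-time-step output serves as the feature vector
    describing the correlations between words."""
    def __init__(self, vocab_size, emb_dim=128, hidden_size=256, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_size,
                            num_layers=num_layers, batch_first=True)

    def forward(self, token_ids):               # token_ids: (batch, seq_len)
        outputs, _ = self.lstm(self.embed(token_ids))
        return outputs[:, -1, :]                # feature vector, shape (batch, hidden_size)
```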
Further, in one embodiment of the present invention, after the recurrent neural network model has been built, the method further includes: generating, with the recurrent neural network model, the feature vectors corresponding to the generated texts with the known information embedding rates; classifying the feature vectors of the generated texts with the known embedding rates using a classifier, and generating embedding-rate estimation probabilities for these texts; and analyzing the estimation probabilities against the actual embedding rates of the generated texts with the known embedding rates using a preset algorithm to produce an analysis result, and correcting the parameters of the recurrent neural network model and the weight parameters of the classifier according to the analysis result.
Optionally, the preset algorithm may be the back-propagation algorithm and the classifier may be a softmax classifier.
Specifically, the word correlation features of the collected natural texts are modeled, and the corresponding recurrent neural network model and the classifier that discriminates the features are built. A large number of cover texts and steganographic texts with different embedding rates are used as training samples to train the neural network model and the classifier. The model parameters and performance are continuously optimized with the back-propagation algorithm to improve the feature extraction and feature discrimination abilities of the model. The loss value of the model is tested at regular intervals, and the training strategy, such as the learning rate, is adjusted according to the loss value until the parameters and performance of the neural network model become stable.
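The training procedure can be sketched as follows, assuming PyTorch and the model sketch above; the optimizer choice, learning rate and data loader are illustrative assumptions rather than details fixed by the patent.

```python
import torch
import torch.nn as nn

def train_epoch(model, classifier, loader, optimizer):
    """One pass over batches of (token_ids, label) pairs, where label is 1 for
    steganographic text and 0 for cover text; parameters are updated by
    back-propagation of a cross-entropy loss."""
    criterion = nn.BCEWithLogitsLoss()
    total_loss = 0.0
    for token_ids, labels in loader:
        logits = classifier(model(token_ids)).squeeze(-1)   # scalar score per text
        loss = criterion(logits, labels.float())
        optimizer.zero_grad()
        loss.backward()                                     # back-propagation step
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(loader)   # monitored to adjust, e.g., the learning rate

# Example optimizer covering both modules:
# optimizer = torch.optim.Adam(
#     list(model.parameters()) + list(classifier.parameters()), lr=1e-3)
```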
In step S103, the feature vector is classified by the classifier to judge whether the text to be detected contains hidden information.
In step S104, if the text to be detected contains hidden information, the information embedding rate of the text to be detected is estimated according to the differences between the feature vectors of steganographic texts at different embedding rates.
Based on the differences in feature distribution between texts at different embedding rates, the embodiments of the present invention accurately estimate the capacity of the information hidden in the text to be detected. A unified model framework can be built for different types of text (Chinese and English, natural text and text in special formats); the model only needs to be trained on text of the corresponding type to detect steganography in that type of text, which gives the method good scalability. No hand-designed rules or features are required: all the features used by the model are learned from a large number of existing generative steganographic texts, so that, compared with conventional methods, the relevant association features between the words of a text can be extracted more accurately and accurate identification results can be obtained.
Further, the method of the embodiments of the present invention differs from previous steganalysis methods in that it uses a recurrent neural network to model the text to be detected and extract its features, identifies whether the text to be detected contains hidden information, and can estimate the capacity of the hidden information from the extracted features. It outperforms previous related methods in detection precision, recall and accuracy, reaching the current best detection performance, and is of great significance for the steganography detection of generative texts.
In the embodiments of the present invention, generative steganographic texts are identified by designing a better model that extracts reasonable features. How the model constructed in this embodiment identifies steganographic text is described in detail below.
The recurrent neural network model built in this embodiment includes two main modules: a text association analysis module and a feature identification module. The text association analysis module is the feature extraction module; it uses the sequence-modeling and feature-extraction capabilities of recurrent neural networks to extract the statistical distribution of the association features between the words of a text. The feature identification module determines whether the text to be detected contains hidden information by analyzing the extracted features, and estimates the hidden-information capacity embedded in the text from the subtle differences between the text features at different embedding rates.
1. Text association analysis module
For cover (non-steganographic) text, there are strong correlations between words. If hidden information is embedded among the words, these correlations may be weakened. The correlations can therefore be regarded as an indicator of steganography, and the association features extracted between words can be used for steganography detection. The current output of a recurrent neural network is always related to the previous inputs, so it can be used to extract the association features of words.
The association features between words can be defined as follows. Let $T = \{X_1, X_2, \dots, X_m\}$ denote a text consisting of $m$ sentences, where $X_i$ is the $i$-th sentence. Each sentence consists of several words, $X_i = \{x_{i,1}, x_{i,2}, \dots, x_{i,n_i}\}$, where $n_i$ is the length of the $i$-th sentence and $x_{i,j}$ is the $j$-th word of the $i$-th sentence. If all words were statistically independent of one another, i.e. had no correlation at all, the following would hold:

$$P(x_{i,j}, x_{k,l}) = P(x_{i,j}) \cdot P(x_{k,l}).$$
Cases that violate the above formula are therefore likely to indicate an association between words. Further, this method divides the word associations in a text into three broad classes:
(1) Adjacent-word association
A sentence $X_i$ can be modeled as a sequence signal $\{x_{i,1}, x_{i,2}, \dots, x_{i,n_i}\}$. In such a sequence signal composed of words, adjacent words, as signals that occur consecutively, tend to carry stronger semantic coherence and correlation. This kind of word association is defined as adjacent-word association and can be represented as $P(x_{i,j}, x_{k,l} \mid i = k, |l - j| = 1)$.
(2) Cross-word association
Considering syntactic rules, in a sentence containing multiple words a word may have semantic associations not only with its adjacent words but also with more distant words in the same sentence. This kind of long-distance word association within the same sentence is defined as cross-word association and can be represented as $P(x_{i,j}, x_{k,l} \mid i = k, 1 < |l - j| < n_i)$.
(3) Cross-sentence association
More generally, for a text containing multiple sentences, there may also be certain semantic associations between different sentences. Therefore, even if two words are not in the same sentence, there may still be a potential association between them. This kind of long-distance word association across different sentences is defined as cross-sentence association and can be represented as $P(x_{i,j}, x_{k,l} \mid i \neq k,\ j \in [1, n_i],\ l \in [1, n_k])$.
In the text association feature extraction process, the embodiments of the present invention mainly exploit the great ability of the LSTM (Long Short-Term Memory) network to extract and represent features of sequence signals, together with its ability to model long-distance dependencies, to analyze and extract the text association features. The output of an LSTM unit at time $t$ combines the current input with the current hidden state, and the hidden state contains the input information of the preceding $t-1$ time steps, so the output at time $t$ can be expressed as:

$$y_t = f_{\mathrm{LSTM}}(x_t \mid x_1, x_2, \dots, x_{t-1}).$$
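For reference, $f_{\mathrm{LSTM}}$ may be instantiated with the standard LSTM cell update (the common textbook formulation; the specific gate equations are not spelled out in this description):

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i), &\quad f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f), \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o), &\quad \tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c), \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, &\quad h_t &= o_t \odot \tanh(c_t),
\end{aligned}
```

where $\sigma$ is the sigmoid function, $\odot$ denotes element-wise multiplication, and $h_t$ plays the role of the output $y_t$.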
A sentence $X$ containing $L$ words can be regarded as a sequence signal of length $L$, where the $i$-th word $x_i$ is the input signal at time step $i$. The first part of the neural network model of this method is a word encoding module, which maps each word to a dense vector representation in a $d$-dimensional semantic space. The input sentence $X$ can therefore be expressed as a matrix $X \in \mathbb{R}^{L \times d}$ whose $i$-th row represents the $i$-th word of the sentence, that is, $X = [x_1, x_2, \dots, x_L]^{\top}$ with $x_i \in \mathbb{R}^d$.
In general, the recurrent neural network consists of multiple layers, each with multiple LSTM units. In the embodiments of the present invention, $n_j$ denotes the number of LSTM units of the $j$-th hidden layer $U_j$, so the units of the $j$-th layer can be written as $U_j = \{u^j_1, u^j_2, \dots, u^j_{n_j}\}$.
For the first hidden layer, the input of unit $u^1_i$ at the $t$-th time step is the weighted sum of the elements of $x_t$, that is:

$$I^1_{t,i} = \sum_{k=1}^{d} w^1_{i,k}\, x_{t,k} + b^1_i,$$

where $w^1$ and $b^1$ are the weight and bias. At time $t$, the output value of $u^1_i$ is:

$$o^1_{t,i} = f_{\mathrm{LSTM}}\!\left(I^1_{t,i}\right).$$

A vector $h^j_t$ can be used to denote the output of the $j$-th hidden layer at the $t$-th time step; each of its elements is the output value of the corresponding unit of the $j$-th hidden layer at that time step, that is:

$$h^j_t = \left[o^j_{t,1}, o^j_{t,2}, \dots, o^j_{t,n_j}\right].$$
Previous work has shown that, within a certain range, the more layers the neural network has, the stronger its ability to extract and represent features. The embodiments of the present invention therefore build the network model by stacking multiple LSTM layers. Adjacent hidden layers are connected by transfer matrices; for example, the transfer matrix between the $l$-th layer and the $(l+1)$-th layer can be expressed as a matrix $W^{l,l+1} \in \mathbb{R}^{n_l \times n_{l+1}}$.
The input of each unit $u^{l+1}_i$ of the $(l+1)$-th hidden layer at the $t$-th time step is the weighted sum of the output values of the units of the previous layer, that is:

$$I^{l+1}_{t,i} = \sum_{k=1}^{n_l} w^{l,l+1}_{k,i}\, o^{l}_{t,k} + b^{l+1}_i.$$

The output of the $(l+1)$-th layer at time $t$ is then:

$$h^{l+1}_t = \left[f_{\mathrm{LSTM}}\!\left(I^{l+1}_{t,1}\right), f_{\mathrm{LSTM}}\!\left(I^{l+1}_{t,2}\right), \dots, f_{\mathrm{LSTM}}\!\left(I^{l+1}_{t,n_{l+1}}\right)\right].$$

In summary, the output at the $t$-th time step depends not only on the input vector $x_t$ at the current time but also on the vectors held in the units at the previous $t-1$ time steps. The output of a hidden layer at the $t$-th time step can therefore be regarded as a summary of the preceding $t$ time steps, i.e. a fusion of the information of the first $t$ words $\{x_1, x_2, \dots, x_t\}$.
The text association analysis module has the potential to model the three kinds of word associations described above. LSTM has the ability to memorize historical inputs: the LSTM units of the first layer can directly memorize the features of adjacent preceding words, while the LSTM units of later layers take the outputs of the previous layer as input and can memorize increasingly complex historical features. LSTM can therefore easily model correlation features across time steps. Since adjacent-word association, cross-word association and cross-sentence association are simply cross-time associations on different time scales, all of these association features can be extracted by the text association analysis module, which is shown in Fig. 2.
2. Feature identification module
The model shown in Fig. 2 can only extract the correlation between each word and the words before it. In order to further extract the potential correlation between each word and all nearby words (including both the preceding and the following words), the embodiments of the present invention add a reverse RNN (Recurrent Neural Network), so that the entire text association analysis module consists of a bidirectional RNN, as shown in Fig. 3.
In Fig. 3, a bidirectional recurrent neural network (BiRNN) is used to extract the potential association features between words (including the relations with preceding words and the relations with following words), and these association features are then used to identify whether the text to be detected contains hidden information. The difference between the reverse RNN and the forward RNN is that the input of the reverse RNN runs from the last word of the text to the first word; the forward RNN therefore tends to extract the correlation between each word and the words before it, while the reverse RNN focuses on extracting the correlation between each word and the words after it.
Let $\overrightarrow{h}$ and $\overleftarrow{h}$ denote the word association features extracted by the forward RNN and by the reverse RNN, respectively. To merge the two features, the feature vectors of their last time steps are first concatenated, giving the vector $Z$:

$$Z = \left[\overrightarrow{h}_{\mathrm{last}};\ \overleftarrow{h}_{\mathrm{last}}\right].$$

A feature fusion matrix $W_Z$ is then defined, and the above feature $Z$ is fused in the following way:

$$F = Z\, W_Z,$$

where $h$ is the dimension of the feature vector $F = [f_1, f_2, \dots, f_h]$. In order to use the obtained feature $F$ to identify whether the text to be detected contains hidden information, an identification weight vector $C$ of length $h$ is defined; the inner product of $C$ and $F$ is computed and a bias $b$ is added, giving a scalar output $z$:

$$z = C \cdot F + b.$$

The output is normalized to the interval $[0, 1]$ with the sigmoid activation function $S$:

$$S(z) = \frac{1}{1 + e^{-z}}.$$

The final output is:

$$p = S(C \cdot F + b).$$

Owing to the memory function of LSTM, the final output incorporates the output information of all time steps of the first layer. The final output value therefore reflects the probability, as judged by the model, that the text to be detected contains hidden information. By setting a detection threshold $\delta$ (for example 0.5), the final detection result is:

$$\hat{y} = \begin{cases} 1, & p \ge \delta, \\ 0, & p < \delta. \end{cases}$$
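A sketch of this feature-fusion and detection step in Python/PyTorch is given below, under the assumption of a bidirectional LSTM whose outputs are split into forward and backward halves; the dimensions and the 0.5 threshold are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Concatenate the last forward-direction output and the last backward-direction
    output (taken at the first token position) into Z, fuse Z into the feature
    vector F, and map F to the stego-probability p via an inner product with the
    identification weights plus a bias, followed by a sigmoid."""
    def __init__(self, hidden_size=256, feat_dim=128):
        super().__init__()
        self.hidden_size = hidden_size
        self.fuse = nn.Linear(2 * hidden_size, feat_dim, bias=False)  # fusion matrix W_Z
        self.identify = nn.Linear(feat_dim, 1)                        # weights C and bias b

    def forward(self, birnn_out, threshold=0.5):      # birnn_out: (batch, seq, 2*hidden)
        fwd_last = birnn_out[:, -1, :self.hidden_size]
        bwd_last = birnn_out[:, 0, self.hidden_size:]
        z = torch.cat([fwd_last, bwd_last], dim=-1)   # spliced feature Z
        feat = self.fuse(z)                           # feature vector F
        p = torch.sigmoid(self.identify(feat)).squeeze(-1)
        return (p >= threshold).long(), p, feat       # predicted label, probability, F
```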
That is, the model attempts to assign a prediction label to the input text (0 representing cover text and 1 representing steganographic text).
In addition, the overall model can be adapted to estimate the capacity of the hidden information. After the feature vector $F$ is obtained, a softmax layer is added to compute the probability distribution over the hidden capacity of the text. The prediction weights (PW) are defined as a matrix $W_P \in \mathbb{R}^{n \times h}$, where $n$ is the number of possible embedding-rate values and $h$ is the dimension of the feature vector $F$. The matrix $W_P$ is used to compute likelihood scores that the text to be detected belongs to the steganographic texts of each of the $n$ embedding-rate classes, that is:

$$Y = W_P F + b_P,$$

where $W_P$ and $b_P$ are the weight matrix and bias; the values in $W_P$ reflect the importance of each feature of $F$, and the output vector $Y$ has dimension $n$. To estimate the probability distribution over the embedding rate of the text to be detected, the method of the embodiments of the present invention, following previous work, adds a softmax classifier to the output layer to compute the probability of each embedding rate:

$$p_i = \frac{e^{Y_i}}{\sum_{j=1}^{n} e^{Y_j}},$$

where $p_i$ is the probability that the text to be detected belongs to the $i$-th embedding-rate class. Denoting the embedding rate of the $i$-th class of steganographic texts by $r_i$, the embedding rate of the text to be detected can then be estimated as:

$$R = r_k, \quad \text{where } k = \arg\max_i\, p_i.$$
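The embedding-rate estimator can be sketched as follows in Python/PyTorch; the set of candidate embedding rates is an illustrative assumption, since the $n$ rate classes are not enumerated here.

```python
import torch
import torch.nn as nn

class RateEstimator(nn.Module):
    """Softmax classifier over n embedding-rate classes on top of the fused
    feature F; the rate of the most probable class is reported as the estimate R."""
    def __init__(self, feat_dim, candidate_rates=(1.0, 2.0, 3.0, 4.0, 5.0)):
        super().__init__()
        self.register_buffer("rates", torch.tensor(candidate_rates))
        self.proj = nn.Linear(feat_dim, len(candidate_rates))    # W_P and b_P

    def forward(self, feat):                             # feat: (batch, feat_dim)
        probs = torch.softmax(self.proj(feat), dim=-1)   # p_i for each rate class
        k = probs.argmax(dim=-1)                         # k = argmax_i p_i
        return self.rates[k], probs                      # estimated rate R = r_k
```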
All parameters of the neural network, including the word vectors, need to be obtained through training. The embodiments of the present invention follow a supervised learning framework and update the network parameters with the back-propagation algorithm, minimizing a loss function through iterative optimization of the network so as to obtain the optimal model. The loss function consists of two parts: an error term and a regularization constraint.
For the binary classification of whether a text contains hidden information, the loss function is defined as follows:

$$\mathcal{L}_1 = -\frac{1}{N}\sum_{i=1}^{N}\Big[t_i \log y_i + (1 - t_i)\log(1 - y_i)\Big] + \lambda \lVert C \rVert_2^2,$$

where $N$ is the number of training samples in a single batch, $y_i$ is the probability, computed by the neural network, that the $i$-th sample contains hidden information, $t_i$ is the true label of the $i$-th sample, and $C$ is the aforementioned identification weight vector.
For the estimation of the hidden-information capacity, the loss function is defined as follows:

$$\mathcal{L}_2 = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{n} T_{i,j} \log Y_{i,j} + \lambda \lVert W_P \rVert_2^2,$$

where $Y_i$ is the embedding-rate probability distribution output by the classifier for the $i$-th sample, $Y_{i,j}$ is the probability that the $i$-th sample belongs to the steganographic texts of the $j$-th embedding-rate class, $T_i$ is the one-hot encoding of the actual embedding rate of the $i$-th sample (if the sample belongs to the $k$-th embedding-rate class, the $k$-th component of $T_i$ is 1 and the remaining components are 0), and $W_P$ is the aforementioned prediction weight matrix.
The error term of the loss function computes the average cross-entropy between the predicted probabilities and the true labels. Through the self-learning of the model, the prediction error becomes smaller and smaller; that is, the predictions get closer and closer to the true labels. To strengthen regularization and prevent overfitting, the method of the embodiments of the present invention uses a dropout mechanism during training together with an L2-norm constraint on the weight vectors.
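These two objectives can be sketched as follows in Python/PyTorch; the regularization coefficient is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def detection_loss(pred_probs, true_labels, identify_weights, reg=1e-4):
    """Average cross-entropy between the predicted stego-probabilities y_i and the
    true labels t_i, plus an L2 penalty on the identification weight vector C."""
    ce = F.binary_cross_entropy(pred_probs, true_labels.float())
    return ce + reg * identify_weights.pow(2).sum()

def capacity_loss(rate_logits, true_rate_class, prediction_weights, reg=1e-4):
    """Average cross-entropy between the predicted embedding-rate distribution Y_i
    and the one-hot target T_i, plus an L2 penalty on the prediction weights W_P."""
    ce = F.cross_entropy(rate_logits, true_rate_class)
    return ce + reg * prediction_weights.pow(2).sum()
```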
Deep learning is a branch of machine learning: a family of representation-learning methods that attempt to abstract data at increasingly high levels using multiple processing layers composed of complex structures or multiple non-linear transformations. The advantage of deep learning is that it replaces hand-crafted features with efficient unsupervised or semi-supervised feature learning and hierarchical feature extraction. The goal of representation learning is to find better ways of representing data and to build better models that learn these representations from large-scale unlabeled data. Its representations are loosely analogous to those studied in neuroscience and to the understanding of information processing and communication patterns in nervous systems, such as neural coding, which aims to characterize the relationship between stimuli and neuronal responses and the relationships among the electrical activity of neurons in the brain. To date, several deep learning architectures, such as deep neural networks, convolutional neural networks, deep belief networks and recurrent neural networks, have achieved excellent results in fields such as computer vision, speech recognition, natural language processing, audio recognition and bioinformatics.
A recurrent neural network (RNN) is a deep learning architecture: a neural network designed for processing sequence data, consisting of an input layer, hidden layers and an output layer. Its essential characteristic is that the network contains feedback connections at every step, so it can be unrolled along the time dimension to form a "deep" network in time. This structure enables recurrent neural networks to process sequence data. Compared with other deep feed-forward neural networks, a recurrent neural network can process sequences of arbitrary length by using neural units with self-feedback, which makes it a very attractive deep learning structure with many improved variants. Among them, the Long Short-Term Memory (LSTM) model, with its carefully designed hidden-unit structure, handles long-distance dependencies more effectively than an ordinary RNN and can therefore better model natural text.
Recurrent neural networks are widely used in tasks such as speech recognition, language modeling and language generation, and have strong capabilities in feature extraction, representation and semantic understanding. They do not require hand-crafted features but learn various features on their own from massive data. Applying the word-association features of text extracted by a recurrent neural network to steganography detection is the core of the present invention. An RNN can model text better than a Markov model and extract more reasonable word-association features, thereby identifying potential steganographic text more effectively.
In summary, using recurrent neural networks for the detection of generative text steganography has advantages that existing methods cannot match; it can effectively solve the problems of existing methods and outperforms them in all aspects of performance.
According to the text steganography detection method based on a recurrent neural network proposed by the embodiments of the present invention, by applying recurrent neural networks to text steganography detection, it is possible to effectively identify whether a text carrier contains hidden information and to accurately estimate the capacity of the hidden information from the statistical distribution of the extracted features. The method effectively overcomes the defects of traditional steganography detection methods based on simple features and achieves very high detection accuracy even against neural-network-based generative text steganography methods; the statistical features of the correlations between words are extracted with the LSTM model, which can effectively model the long-distance dependencies in text.
The text steganography detection system based on a recurrent neural network proposed according to the embodiments of the present invention is described next with reference to the accompanying drawings.
Fig. 4 is a schematic structural diagram of the text steganography detection system based on a recurrent neural network according to one embodiment of the present invention.
As shown in Fig. 4, the text steganography detection system 10 based on a recurrent neural network includes: an obtaining module 100, a first generation module 200, a second generation module 300, a judgment module 400 and an estimation module 500.
The obtaining module 100 is used to obtain a word vector matrix. The first generation module 200 is used to convert the text to be detected into an input word vector sequence according to the word vector matrix. The second generation module 300 is used to input the input word vector sequence into a pre-built recurrent neural network model and generate a feature vector representing the correlations between the words of the text to be detected. The judgment module 400 is used to classify the feature vector with a classifier and judge whether the text to be detected contains hidden information. The estimation module 500 is used, when the text to be detected contains hidden information, to estimate the information embedding rate of the text to be detected according to the differences between the feature vectors of steganographic texts at different embedding rates. The detection system 10 applies recurrent neural networks to text steganography detection; it can effectively identify whether a text carrier contains hidden information and accurately estimate the capacity of the hidden information from the statistical distribution of the extracted features.
Further, in one embodiment of the present invention, the obtaining module 100 is specifically configured to: obtain a large number of generated texts, segment and preprocess the generated texts, and filter them by word frequency to obtain a common vocabulary set of the generated texts; input the common words in the common vocabulary set one by one into a pre-established module to obtain the word vector corresponding to each common word; and form word vector samples from the word vectors corresponding to the common words and generate the word vector matrix from the word vector samples.
Further, in one embodiment of the present invention, the first generation module 200 is specifically configured to: perform low-frequency word filtering, screening and word segmentation on the text to be detected to generate a word sequence of the text to be detected; obtain the word vectors of all the words in the word sequence according to the word vector matrix, wherein the word vector matrix contains the correspondence between common words and word vectors; and generate the input word vector sequence of the text to be detected from the word vectors of all the words in the text to be detected.
Further, in one embodiment of the present invention, the system further includes a building module.
The building module is used to obtain the input word vector sequences corresponding to generated texts with a plurality of known information embedding rates and to build the recurrent neural network model from the input word vector sequences corresponding to the generated texts with the plurality of known information embedding rates.
Further, in one embodiment of the present invention, the system further includes a correction module.
The correction module is specifically configured to: generate, with the recurrent neural network model, the feature vectors corresponding to the generated texts with the known information embedding rates; classify the feature vectors of the generated texts with the known embedding rates using the classifier, and generate embedding-rate estimation probabilities for these texts; and analyze the estimation probabilities against the actual embedding rates of the generated texts with the known embedding rates using a preset algorithm to produce an analysis result, and correct the parameters of the recurrent neural network model and the weight parameters of the classifier according to the analysis result.
It should be noted that the foregoing explanation of the embodiments of the text steganography detection method based on a recurrent neural network also applies to the system of this embodiment, and is not repeated here.
According to the text steganography detection system based on a recurrent neural network proposed by the embodiments of the present invention, by applying recurrent neural networks to text steganography detection, it is possible to effectively identify whether a text carrier contains hidden information and to accurately estimate the capacity of the hidden information from the statistical distribution of the extracted features.
In addition, term " first ", " second " are used for descriptive purposes only and cannot be understood as indicating or suggesting relative importance Or implicitly indicate the quantity of indicated technical characteristic.Define " first " as a result, the feature of " second " can be expressed or Implicitly include at least one this feature.In the description of the present invention, the meaning of " plurality " is at least two, such as two, three It is a etc., unless otherwise specifically defined.
In the description of this specification, reference term " one embodiment ", " some embodiments ", " example ", " specifically show The description of example " or " some examples " etc. means specific features, structure, material or spy described in conjunction with this embodiment or example Point is included at least one embodiment or example of the invention.In the present specification, schematic expression of the above terms are not It must be directed to identical embodiment or example.Moreover, particular features, structures, materials, or characteristics described can be in office It can be combined in any suitable manner in one or more embodiment or examples.In addition, without conflicting with each other, the skill of this field Art personnel can tie the feature of different embodiments or examples described in this specification and different embodiments or examples It closes and combines.
Although the embodiments of the present invention has been shown and described above, it is to be understood that above-described embodiment is example Property, it is not considered as limiting the invention, those skilled in the art within the scope of the invention can be to above-mentioned Embodiment is changed, modifies, replacement and variant.

Claims (10)

1. A text steganography detection method based on a recurrent neural network, characterized by comprising the following steps:
obtaining a word vector matrix, and converting a text to be detected into an input word vector sequence according to the word vector matrix;
inputting the input word vector sequence into a pre-built recurrent neural network model to generate a feature vector representing the correlations between the words of the text to be detected;
classifying the feature vector with a classifier to judge whether the text to be detected contains hidden information; and
if the text to be detected contains hidden information, estimating the information embedding rate of the text to be detected according to the differences between the feature vectors of steganographic texts at different embedding rates.
2. The text steganography detection method based on a recurrent neural network according to claim 1, characterized in that obtaining the word vector matrix comprises:
obtaining a large number of generated texts, segmenting and preprocessing the generated texts, and filtering them by word frequency to obtain a common vocabulary set of the generated texts;
inputting the common words in the common vocabulary set one by one into a pre-established module to obtain the word vector corresponding to each common word; and
forming word vector samples from the word vectors corresponding to the common words, and generating the word vector matrix from the word vector samples.
3. The text steganography detection method based on a recurrent neural network according to claim 1, characterized in that converting the text to be detected into the input word vector sequence according to the word vector matrix comprises:
performing low-frequency word filtering, screening and word segmentation on the text to be detected to generate a word sequence of the text to be detected;
obtaining the word vectors of all the words in the word sequence according to the word vector matrix, wherein the word vector matrix contains the correspondence between common words and word vectors; and
generating the input word vector sequence of the text to be detected from the word vectors of all the words in the text to be detected.
4. The text steganography detection method based on a recurrent neural network according to claim 1, characterized by further comprising, before inputting the input word vector sequence into the pre-built recurrent neural network model:
obtaining the input word vector sequences corresponding to generated texts with a plurality of known information embedding rates, and building the recurrent neural network model from the input word vector sequences corresponding to the generated texts with the plurality of known information embedding rates.
5. The text steganography detection method based on a recurrent neural network according to claim 4, characterized by further comprising, after building the recurrent neural network model:
generating, with the recurrent neural network model, the feature vectors corresponding to the generated texts with the known information embedding rates;
classifying the feature vectors of the generated texts with the known embedding rates using the classifier, and generating embedding-rate estimation probabilities for the generated texts with the known embedding rates; and
analyzing the estimation probabilities against the actual embedding rates of the generated texts with the known embedding rates using a preset algorithm to produce an analysis result, and correcting the parameters of the recurrent neural network model and the weight parameters of the classifier according to the analysis result.
6. A text steganography detection system based on a recurrent neural network, characterized by comprising:
an obtaining module for obtaining a word vector matrix;
a first generation module for converting a text to be detected into an input word vector sequence according to the word vector matrix;
a second generation module for inputting the input word vector sequence into a pre-built recurrent neural network model to generate a feature vector representing the correlations between the words of the text to be detected;
a judgment module for classifying the feature vector with a classifier to judge whether the text to be detected contains hidden information; and
an estimation module for, when the text to be detected contains hidden information, estimating the information embedding rate of the text to be detected according to the differences between the feature vectors of steganographic texts at different embedding rates.
7. The text steganography detection system based on a recurrent neural network according to claim 6, wherein the obtaining module is specifically configured to:
obtain a large number of generated texts, perform word segmentation and preprocessing on the generated texts, and filter by word frequency to obtain a common vocabulary set of the generated texts;
input the common words in the common vocabulary set in sequence into a pre-established module to obtain the word vector corresponding to each common word;
form word vector samples from the word vectors corresponding to the common words, and generate the word vector matrix from the word vector samples.
8. The text steganography detection system based on a recurrent neural network according to claim 6, wherein the first generation module is specifically configured to:
perform low-frequency word filtering, screening and word segmentation on the text to be detected to generate a word sequence of the text to be detected;
obtain the word vectors of all words in the word sequence according to the word vector matrix, wherein the word vector matrix contains the correspondence between common words and word vectors;
generate the input word vector sequence of the text to be detected according to the word vectors of all words in the text to be detected.
9. The text steganography detection system based on a recurrent neural network according to claim 6, further comprising a construction module;
the construction module is configured to obtain the input word vector sequences corresponding to generated texts with a plurality of known information embedding rates, and to construct the recurrent neural network model according to the input word vector sequences corresponding to the generated texts with the plurality of known information embedding rates.
10. The text steganography detection system based on a recurrent neural network according to claim 6, further comprising a correction module;
the correction module is specifically configured to:
generate, according to the recurrent neural network model, the feature vectors corresponding to the generated texts with the known information embedding rates;
classify the feature vectors of the generated texts with the known embedding rates by means of the classifier to generate embedding rate estimation probabilities for the generated texts with the known embedding rates;
analyze the estimation probabilities and the actual embedding rates of the generated texts with the known embedding rates by means of a preset algorithm to generate an analysis result, and correct the parameters of the recurrent neural network model and the weight parameters of the classifier according to the analysis result.
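Putting the system of claims 6-10 together, the modules could be chained end to end roughly as follows. This is hypothetical glue code that reuses the helpers sketched above (build_word_vector_matrix, text_to_vector_sequence, RNNStegoDetector, estimate_embedding_rate) and assumes that class index 1 means "contains hidden information".

```python
import torch

def detect(text, word_to_index, matrix, model, rate_centroids):
    # First generation module: text -> input word vector sequence.
    seq = torch.from_numpy(text_to_vector_sequence(text, word_to_index, matrix))

    # Second generation module: RNN feature vector plus classifier scores.
    model.eval()
    with torch.no_grad():
        feature, logits = model(seq.unsqueeze(0))  # add batch dimension

    # Judgment module: does the text contain hidden information?
    contains_hidden = logits.argmax(dim=1).item() == 1

    # Estimation module: only invoked when hidden information was detected.
    rate = None
    if contains_hidden:
        rate = estimate_embedding_rate(feature.squeeze(0).numpy(), rate_centroids)
    return contains_hidden, rate
```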
CN201910058680.5A 2019-01-22 2019-01-22 Text steganography detection method and system based on cyclic neural network Active CN110110318B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910058680.5A CN110110318B (en) 2019-01-22 2019-01-22 Text steganography detection method and system based on cyclic neural network

Publications (2)

Publication Number Publication Date
CN110110318A (en) 2019-08-09
CN110110318B (en) 2021-02-05

Family

ID=67483357

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910058680.5A Active CN110110318B (en) 2019-01-22 2019-01-22 Text steganography detection method and system based on cyclic neural network

Country Status (1)

Country Link
CN (1) CN110110318B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1668995A (en) * 2002-06-06 2005-09-14 克瑞迪科公司 Method for improving unpredictability of output of pseudo-random number generators
CN102292700A (en) * 2009-01-24 2011-12-21 惠普开发有限公司 System and method for enhancing security printing
CN106055673A (en) * 2016-06-06 2016-10-26 中国人民解放军国防科学技术大学 Chinese short-text sentiment classification method based on text feature embedding
WO2018212811A1 (en) * 2017-05-19 2018-11-22 Google Llc Hiding information and images via deep learning
CN108923922A (en) * 2018-07-26 2018-11-30 北京工商大学 Text steganography method based on a generative adversarial network

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
Guoming Chen et al., "Discriminative multimodal for steganalysis", 2015 8th International Congress on Image and Signal Processing *
Lingyun Xiang et al., "A Word-Embedding-Based Steganalysis Method for Linguistic Steganography via Synonym Substitution", IEEE Access *
Yubo Luo et al., "Text Steganography with High Embedding Rate: Using Recurrent Neural Networks to Generate Chinese Classic Poetry", Proceedings of the 5th ACM Workshop on Information Hiding and Multimedia Security *
Zhong-Liang Yang et al., "RNN-Stega: Linguistic Steganography Based on Recurrent Neural Networks", IEEE Transactions on Information Forensics and Security *
Zhongliang Yang et al., "TS-CNN: Text Steganalysis from Semantic Space Based on Convolutional Neural Network", arXiv:1810.08136 *
刘明明 et al., "A steganalysis method based on a shallow convolutional neural network", Journal of Shandong University (Natural Science) *
金鹏 et al., "Poetry steganography detection method based on convolutional neural networks", Application of Electronic Technique *
高培贤 et al., "Improvement of convolutional neural network structure for image steganalysis", Computer Engineering *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111859897A (en) * 2019-10-16 2020-10-30 沈阳工业大学 Text steganalysis method based on dynamic routing capsule network
CN112749274A (en) * 2019-10-30 2021-05-04 中南大学 Chinese text classification method based on attention mechanism and interference word deletion
CN110830489A (en) * 2019-11-14 2020-02-21 国网江苏省电力有限公司苏州供电分公司 Method and system for detecting counterattack type fraud website based on content abstract representation
CN110830489B (en) * 2019-11-14 2022-09-13 国网江苏省电力有限公司苏州供电分公司 Method and system for detecting counterattack type fraud website based on content abstract representation
CN113096671A (en) * 2020-01-09 2021-07-09 齐鲁工业大学 Reversible information hiding method and system for high-capacity audio file
CN113096671B (en) * 2020-01-09 2022-05-13 齐鲁工业大学 Reversible information hiding method and system for high-capacity audio file
CN111477320A (en) * 2020-03-11 2020-07-31 北京大学第三医院(北京大学第三临床医学院) Construction system of treatment effect prediction model, treatment effect prediction system and terminal
CN111477320B (en) * 2020-03-11 2023-05-30 北京大学第三医院(北京大学第三临床医学院) Treatment effect prediction model construction system, treatment effect prediction system and terminal
CN111931514A (en) * 2020-07-28 2020-11-13 薛杨杨 Information processing method based on deep learning and big data and block chain service platform
CN112132262A (en) * 2020-09-08 2020-12-25 西安交通大学 Recurrent neural network backdoor attack detection method based on interpretable model
CN113743110A (en) * 2021-11-08 2021-12-03 京华信息科技股份有限公司 Word missing detection method and system based on a fine-tuned generative adversarial network model
CN115169293A (en) * 2022-09-02 2022-10-11 南京信息工程大学 Text steganalysis method, system, device and storage medium

Also Published As

Publication number Publication date
CN110110318B (en) 2021-02-05

Similar Documents

Publication Publication Date Title
CN110110318A (en) Text Stego-detection method and system based on Recognition with Recurrent Neural Network
Yang et al. RNN-stega: Linguistic steganography based on recurrent neural networks
US11443178B2 (en) Deep neural network hardening framework
Cisse et al. Houdini: Fooling deep structured prediction models
CN111061843B (en) Knowledge-graph-guided false news detection method
CN110213244A Network intrusion detection method based on spatio-temporal feature fusion
CN112084331A (en) Text processing method, text processing device, model training method, model training device, computer equipment and storage medium
CN109299657B (en) Group behavior identification method and device based on semantic attention retention mechanism
CN112861945B (en) Multi-mode fusion lie detection method
Yang et al. Rits: Real-time interactive text steganography based on automatic dialogue model
CN111325319A (en) Method, device, equipment and storage medium for detecting neural network model
CN112527966B (en) Network text emotion analysis method based on Bi-GRU neural network and self-attention mechanism
CN112015901A (en) Text classification method and device and warning situation analysis system
CN111985207B (en) Access control policy acquisition method and device and electronic equipment
CN113505307B (en) Social network user region identification method based on weak supervision enhancement
CN108985382A Adversarial example detection method based on critical data path representation
CN113888368A (en) Feature selection method for criminal case detention risk assessment based on image coding
CN113627233A (en) Visual semantic information-based face counterfeiting detection method and device
CN112712099A (en) Double-layer knowledge-based speaker model compression system and method
Zhang et al. Self-supervised adversarial example detection by disentangled representation
CN111737688A (en) Attack defense system based on user portrait
CN116561272A (en) Open domain visual language question-answering method and device, electronic equipment and storage medium
CN113205044B (en) Deep fake video detection method based on characterization contrast prediction learning
CN110049034A Real-time Sybil attack detection method for complex networks based on deep learning
CN114818719A (en) Community topic classification method based on composite network and graph attention machine mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant