CN108363685A - Self-media data text representation method based on a recursive variational autoencoder model - Google Patents

Self-media data text representation method based on a recursive variational autoencoder model

Info

Publication number
CN108363685A
Authority
CN
China
Prior art keywords
text
coding
variation
recurrence
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711417351.2A
Other languages
Chinese (zh)
Other versions
CN108363685B (en)
Inventor
Wang Jiabin (王家彬)
Huang Jiangping (黄江平)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
DIGITAL TELEVISION TECHNOLOGY CENTER BEIJING PEONY ELECTRONIC GROUP Co Ltd
Original Assignee
DIGITAL TELEVISION TECHNOLOGY CENTER BEIJING PEONY ELECTRONIC GROUP Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by DIGITAL TELEVISION TECHNOLOGY CENTER BEIJING PEONY ELECTRONIC GROUP Co Ltd filed Critical DIGITAL TELEVISION TECHNOLOGY CENTER BEIJING PEONY ELECTRONIC GROUP Co Ltd
Priority to CN201711417351.2A
Publication of CN108363685A
Application granted
Publication of CN108363685B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/084: Backpropagation, e.g. using gradient descent

Abstract

The present invention provides a self-media data text representation method based on a recursive variational autoencoder model. The method includes: preprocessing the input corpus text and encoding it with a recursive neural network encoding model to generate a fixed-dimension text vector; generating a mean vector and a variance vector from the fixed-dimension text vector, collecting a sample from the standard normal distribution, and generating a latent coded representation z from the mean vector, the variance vector, and the sample by the method of variational inference; then decoding with a recursive neural network decoding model to obtain a decoded sequence, computing the coding loss between the encoded sequence and the decoded sequence as well as the divergence between the latent coded representation z and the standard normal distribution, and updating the parameters of the recursive variational autoencoder model with the coding loss and the divergence. The method of the present invention has high encoding performance, adapts well to the coded representation of self-media data, and can describe the distribution of the data while fitting its content.

Description

Self-media data text representation method based on a recursive variational autoencoder model
Technical field
The present invention relates to the technical field of deep learning and self-media text content analysis, and in particular to a self-media data text representation method based on a recursive variational autoencoder model.
Background technology
With the development of social media in recent years, users have generated a large amount of short self-media text content. Because such texts lack effective contextual information, they are difficult to represent with traditional bag-of-words models.
Deep learning originates from research on artificial neural networks; a multilayer network containing multiple hidden layers is one kind of deep learning structure. Deep learning combines low-level features to form more abstract high-level attribute categories or features, so as to discover distributed feature representations of the data. The concept of deep learning was proposed by Hinton et al. in 2006, who proposed an unsupervised greedy layer-by-layer training algorithm based on deep belief networks (DBN) to solve the optimization problems associated with deep structures; the multilayer autoencoder deep structure was proposed subsequently. The convolutional neural network proposed by Yann LeCun et al. was the first truly multilayer structure learning algorithm, which uses spatial correlation to reduce the number of parameters and thereby improve training performance. The computation involved in producing an output from an input in deep learning can be represented by a flow graph, in which each node represents a basic computation and the value of that computation, and the result of a computation is applied to the values of the child nodes of that node. Deep learning simulates the human cognitive process, which proceeds layer by layer and abstracts gradually: simple concepts are learned first and are then used to represent more abstract thoughts and concepts. This approach has been successfully applied in recent years to fields such as computer vision and speech recognition; although the application of deep learning methods to natural language processing has received a great deal of attention, most of that work is based on model design and lacks the introduction of knowledge.
As for the representation of text content, traditional representation learning for self-media text content is mostly based on bag-of-words models and uses word representation methods such as one-hot encoding, which inevitably leads to the serious "lexical gap" phenomenon between words, i.e., words with similar semantics are nevertheless mutually orthogonal in their vector representations. Although these methods are rather effective for representing traditional text, applying them to self-media text representation leads to serious data sparsity problems. Traditional methods generally use hand-crafted features for the feature extraction of representation learning on self-media text content, but such methods depend on human experience, and for some professional domains a corresponding expert must build a knowledge base of the self-media data before the representation of these data texts can be realized well.
Various data text analysis methods exist in the prior art, but most of these methods analyze self-media text content for common domains or a few special domains, and they usually fit the data with only a simple text encoding, lacking a description of the data distribution, which causes problems such as inaccurate text representation.
Summary of the invention
It is an object of the present invention to provide a self-media data text representation method based on a recursive variational autoencoder model, which has high encoding performance, adapts well to the coded representation of self-media data, and can describe the distribution of the data while fitting its content.
The present invention provides a self-media data text representation method based on a recursive variational autoencoder model, wherein the method comprises the following steps:
Step S100: preprocess the input corpus text to obtain an encoded sequence;
Step S200: encode the encoded sequence with a recursive neural network encoding model to generate a fixed-dimension text vector;
Step S300: generate a mean vector and a variance vector from the fixed-dimension text vector, then collect a sample from the standard normal distribution and generate a latent coded representation z from the mean vector, the variance vector, and the sample by the method of variational inference;
Step S400: decode the latent coded representation z with a recursive neural network decoding model to obtain a decoded sequence, compute the coding loss between the encoded sequence and the decoded sequence as well as the divergence between the latent coded representation z and the standard normal distribution, and update the parameters of the recursive variational autoencoder model with the coding loss and the divergence.
Preferably, preprocessing the input corpus text in step S100 comprises the following steps:
Step S110: filter each input corpus text, remove its tags, punctuation, and links, and perform word segmentation on its content to generate a text T;
Step S120: count the words in the corpus text and generate a dictionary of the words in the corpus text, and initialize a vector for each word in each corpus text, wherein the dimension of the initialization vector of each word is set according to experimental performance;
Step S130: perform dependency structure analysis on the text T, and serialize the analyzed structure to obtain the encoded sequence.
Preferably, step S130 further comprises:
performing text content analysis on the text T with the Stanford dependency parser to generate a dependency tree structure;
serializing the dependency tree structure as a binary tree to obtain the encoded sequence.
Preferably, the encoded sequence is encoded in step S200 with the recursive neural network encoding model, and the word vectors used during encoding include the initialization vectors and/or pre-trained word vectors.
Preferably, encoding the encoded sequence with the recursive neural network encoding model in step S200 to generate the fixed-dimension text vector comprises the following steps:
S210: select two child nodes c1 and c2, and generate a first parent node p1 from c1 and c2;
S220: form a new pair of child nodes from the generated parent node p1 and a word in the encoded sequence to generate a second parent node p2;
S230: encode recursively as in step S220, each time generating a new parent node from a parent node and a word in the encoded sequence, until all the words in the encoded sequence have been encoded; wherein,
during encoding, the encoding weight We is shared by every encoding step, so that the text encoding produced by the encoder is represented as a vector of the fixed dimension.
Preferably, the mean vector and the variance vector are generated in step S300 by mappings of the same form.
Preferably, step S300 comprises:
collecting, from the standard normal distribution, a variable used to generate the latent coded representation z, the distribution of the variable being used in the divergence computation during model training;
multiplying the variable by the variance vector, and then summing the resulting product with the mean vector to obtain the latent coded representation z.
Preferably, the decoding process of the latent coded representation z in step S400 comprises the following steps:
S410: generate, on the basis of the coded representation z, an input vector x whose dimension is twice that of the coded representation z, wherein one part of the input vector x is a child node c and the other part is a parent node p used for decoding;
S420: continue decoding the parent node p to obtain new child nodes c1' and p1', wherein p1' is the new parent node used for decoding;
S430: decode recursively as in step S420, each time using a new child node as the parent node for the next decoding step, until a decoded sequence of the same length as the encoded sequence is generated.
Preferably, the coding loss between the decoded sequence and the encoded sequence is computed in step S400 by Euclidean distance.
Preferably, the parameters of the recursive variational autoencoder model are updated in step S400 by the backpropagation algorithm.
The present invention has the following advantages and beneficial effects:
1. In terms of text content representation, the self-media data text representation method based on the recursive variational autoencoder model of the present invention overcomes the representation problems caused by the lack of context when traditional methods represent self-media text content, and it introduces heuristic knowledge by using existing text processing tools for the representation of the text content, improving the performance of the text representation.
2. The method of the present invention uses a recursive neural network encoding model that can not only encode text content sequentially but can also encode text content with a tree structure, effectively avoiding the deficiency of conventional methods that can only encode text content sequentially; it represents the text in better accordance with its real structure, so that the structure of the coded representation better meets actual needs.
3. The method of the present invention uses variational inference to better embody the process by which a deep learning method simulates the true distribution of the data.
4. The method of the present invention uses an unfolding recursive neural network decoding model that can reconstruct the input text content; it measures the encoding performance of the model by means such as Euclidean distance computation, and optimizes the model's representation of self-media text content with the new model parameters.
5. The method of the present invention introduces the standard normal distribution and computes the mean vector and variance vector of the input text to obtain the latent coded representation z; the latent coded representation z accumulates knowledge such as word vector knowledge and text structure, satisfies a certain distribution, and its dimension can be set as needed; it contains more feature information than a traditional recursive coding vector, which benefits the representation and computation of text.
6. The method of the present invention can update the parameters of the recursive variational autoencoder model with the coding loss and the divergence, thereby optimizing the model, better fitting the training data, and improving encoding performance.
Description of the drawings
The drawings used in this application are briefly described below; obviously, these drawings are only intended to explain the concept of the present invention.
Fig. 1 is a flowchart of the self-media data text representation method based on the recursive variational autoencoder model of the present invention;
Fig. 2 shows the dependency structure of a text obtained with the dependency parser in the self-media data text representation method based on the recursive variational autoencoder model of the present invention;
Fig. 3 is a structural schematic diagram of encoding with the recursive neural network encoding model in the self-media data text representation method based on the recursive variational autoencoder model of the present invention;
Fig. 4 is a flowchart of generating the mean vector and the variance vector in the self-media data text representation method based on the recursive variational autoencoder model of the present invention;
Fig. 5 is a flowchart of sampling a variable from the standard normal distribution and generating the latent coded representation in the self-media data text representation method based on the recursive variational autoencoder model of the present invention;
Fig. 6 is a structural schematic diagram of the recursive variational autoencoder model of the self-media data text representation method of the present invention.
Specific embodiments
Hereinafter, embodiments of the self-media data text representation method based on the recursive variational autoencoder model of the present invention are described with reference to the accompanying drawings.
The embodiments recorded herein are specific embodiments of the present invention and serve to illustrate its concept; they are explanatory and exemplary and should not be construed as limiting the embodiments of the present invention or the scope of the invention. Besides the embodiments recorded herein, those skilled in the art can also adopt other obvious technical solutions on the basis of the claims of this application and the content disclosed in the specification, including technical solutions that make any obvious substitution or modification of the embodiments recorded herein.
The drawings of this specification are schematic diagrams that aid in illustrating the concept of the present invention and schematically indicate the shapes of the parts and their mutual relationships. Note that, in order to clearly show the structures of the parts of the embodiments of the present invention, the drawings are not necessarily drawn to the same scale. The same or similar reference markers are used to denote the same or similar parts.
Referring to Fig. 1, the present invention provides a self-media data text representation method based on a recursive variational autoencoder model, wherein the method comprises the following steps:
Step S100: preprocess the input corpus text to obtain an encoded sequence;
Step S200: encode the encoded sequence with a recursive neural network encoding model to generate a fixed-dimension text vector;
Step S300: generate a mean vector and a variance vector from the fixed-dimension text vector, then collect a sample from the standard normal distribution and generate a latent coded representation z from the mean vector, the variance vector, and the sample by the method of variational inference;
Step S400: decode the latent coded representation z with a recursive neural network decoding model to obtain a decoded sequence, compute the coding loss between the encoded sequence and the decoded sequence as well as the divergence between the latent coded representation z and the standard normal distribution, and update the parameters of the recursive variational autoencoder model with the coding loss and the divergence.
In the method of the present invention, the latent coded representation z obtained by computation accumulates knowledge such as word vector knowledge and text structure, satisfies a certain distribution, and its dimension can be set according to actual needs; compared with a traditional recursive coding vector it contains more feature information, which benefits the representation and computation of text, and it reduces the encoding dimension and improves computational efficiency. In addition, the method of the present invention computes the coding loss between the encoded sequence and the decoded sequence through the latent coding, as well as the divergence between the latent coded representation z and the standard normal distribution, and automatically updates the parameters of the recursive variational autoencoder model with the coding loss and the divergence, which effectively improves the encoding performance of the model; when different texts are input, the recursive variational autoencoder model can automatically update its parameters according to the text content, so that different texts are represented accurately.
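For concreteness, the training objective implied by step S400 can be written in the standard variational autoencoder form below. This is a reconstruction from the description, assuming the usual closed-form KL divergence between N(μ, σ²) and N(0, I) (consistent with the Kingma et al. reference cited by this patent); the relative weighting of the two terms is not specified by the patent:

\[
\mathcal{L} \;=\; \sum_{t}\,\lVert w_t - w_t' \rVert_2^2 \;+\; \frac{1}{2}\sum_{i=1}^{k}\left(\mu_i^2 + \sigma_i^2 - \log \sigma_i^2 - 1\right),
\]

where the first term is the coding loss between the words w_t of the encoded sequence and the words w_t' of the decoded sequence, and the second term is the divergence D_KL(N(μ, σ²) || N(0, I)) of the latent coded representation z from the standard normal distribution.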
Further, preprocessing the input corpus text in step S100 of the present invention comprises the following steps:
Step S110: filter each input corpus text, remove its tags, punctuation, and links, and perform word segmentation on its content to generate a text T;
Step S120: count the words in the corpus text and generate a dictionary of the words in the corpus text, and initialize a vector for each word in each corpus text, wherein the dimension of the initialization vector of each word is set according to experimental performance;
Step S130: perform dependency structure analysis on the text T, and serialize the analyzed structure to obtain the encoded sequence.
Further, in step S130, text content analysis is performed on the text T with the Stanford dependency parser to generate a dependency tree structure, and the dependency tree structure is serialized as a binary tree to obtain the encoded sequence. By analyzing the structure of the text, the deficiency of conventional methods that can only encode text content sequentially is overcome, and the text is represented in better accordance with its real structure, which better meets actual needs.
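The patent does not publish its serialization code, so the following minimal Python sketch is an illustration only: it assumes a dependency parse is already available as head indices (for example from the Stanford dependency parser) and emits the words of each subtree before their head, a deliberately simplified stand-in for the binary-tree serialization described above; the function name and the toy parse are assumptions.

```python
from collections import defaultdict

def serialize_dependency_tree(words, heads):
    """Serialize a dependency tree into a word order for recursive encoding.

    words: list of tokens, e.g. ["My", "cat", "also", "likes", ...]
    heads: heads[i] is the index of the head of words[i], or -1 for the root.
    Returns the tokens in a depth-first order in which every subtree is
    emitted before its head, so the recursive encoder can combine child
    representations pairwise before combining them upward.
    """
    children = defaultdict(list)
    root = None
    for i, h in enumerate(heads):
        if h == -1:
            root = i
        else:
            children[h].append(i)

    order = []
    def visit(i):
        for c in children[i]:   # dependents first, then the head
            visit(c)
        order.append(words[i])
    visit(root)
    return order

# toy usage with the sentence from Fig. 2; the head indices are an assumed
# parse for illustration, not the parser's actual output
words = ["My", "cat", "also", "likes", "eating", "fish", "and", "hamburger"]
heads = [1, 3, 3, -1, 3, 4, 7, 5]
print(serialize_dependency_tree(words, heads))
```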
Fig. 2 shows the dependency structure of a text obtained with the dependency parser in the self-media data text representation method based on the recursive variational autoencoder model of the present invention. The method of the present invention is further illustrated below with reference to Fig. 2 and a specific embodiment.
Fig. 2 shows the process of analyzing the self-media text content "My cat also likes eating fish and hamburger" with the dependency parser. When the input self-media text data passes through the dependency parser, a dependency tree structure of the text is produced: the word "likes" in the text connects the two parts "My cat" and "eating fish and hamburger"; the adverb "also" modifies the verb "likes"; "My cat" is composed of the words "My" and "cat"; "eating fish and hamburger" can be further split into the two parts "eating" and "fish and hamburger"; and "fish" and "hamburger" form a coordinate structure through the conjunction "and". Through this dependency analysis tool, the structure of the self-media text can be explicitly represented with the knowledge of external resources, and the encoding is performed through this explicit representation. The dependency relations of such a structure intuitively depict the syntactic matching relations between the words in the text; these matching relations are associated with the semantics, which in turn makes the context of the coded representation more coherent.
Further, when the encoded sequence is encoded with the recursive neural network encoding model in step S200, the word vectors used during encoding include the initialization vectors and/or pre-trained word vectors; heuristic knowledge can thus be introduced, reducing the amount of encoding computation and improving encoding efficiency.
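A minimal sketch of this dictionary and word-vector setup follows; it assumes random initialization that is overwritten by pre-trained vectors where available, with the dimension of 100 taken from the example given later for Fig. 6, and all names are illustrative rather than the patent's own:

```python
import numpy as np

def build_embeddings(corpus_tokens, pretrained=None, dim=100, seed=0):
    """Build a word dictionary and an initial embedding matrix.

    corpus_tokens: iterable of token lists, one per corpus text.
    pretrained: optional dict mapping word -> np.ndarray of shape (dim,).
    Returns (vocab, E), where vocab maps each word to its row index in E.
    """
    rng = np.random.default_rng(seed)
    vocab = {}
    for tokens in corpus_tokens:
        for w in tokens:
            if w not in vocab:
                vocab[w] = len(vocab)

    # random initialization; the dimension is an experimental setting
    E = rng.normal(0.0, 0.1, size=(len(vocab), dim))

    # overwrite with pre-trained word vectors where available, which is one
    # way the heuristic knowledge mentioned above can be introduced
    if pretrained:
        for w, i in vocab.items():
            if w in pretrained:
                E[i] = pretrained[w]
    return vocab, E
```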
Specifically, encoding the encoded sequence with the recursive neural network encoding model in step S200 to generate the fixed-dimension text vector comprises the following steps:
S210: select two child nodes c1 and c2, and generate a first parent node p1 from c1 and c2;
S220: form a new pair of child nodes from the generated parent node p1 and a word in the encoded sequence to generate a second parent node p2;
S230: encode recursively as in step S220, each time generating a new parent node from a parent node and a word in the encoded sequence, until all the words in the encoded sequence have been encoded; wherein,
during encoding, the encoding weight We is shared by every encoding step, so that the text encoding produced by the encoder is represented as a vector of the fixed dimension.
Fig. 3 shows the process of encoding a representation of self-media text content; here the encoding process is described with a recursive neural network encoding an input sequence x = w1, w2, ..., w4. The encoding structure first concatenates the input word vectors w1 and w2 into a child-node vector [c1; c2] of dimension 2n, noting that (w1, w2) = (c1, c2); the formula p = f(We[c1; c2] + be) is then used to compute the parent node p1 = f(We[w1; w2] + be). Next, w3 and the computed p1 are combined into a new [c1; c2], i.e. (c1, c2) = (p1, w3), and the formula p = f(We[c1; c2] + be) is used again to compute the parent node p2 = f(We[p1; w3] + be); the parent node p3 is then computed by p3 = f(We[p2; w4] + be), and the recursion continues in this way until all the words in the encoded sequence have been encoded. Since the recursive encoding model uses this binary combination for text representation, the text needs to be expressed as a binary structure in a certain way; the dependency structure analysis of the text in step S130 is precisely the process of expressing the sequential organization of the text as a hierarchical structure, which extends the applicability of the model of the method of the present invention.
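The following runnable numpy sketch shows this shared-weight recursive encoder. The patent specifies the combination formula p = f(We[c1; c2] + be) but not the nonlinearity f, so tanh is assumed here as a common choice, and all variable names are illustrative:

```python
import numpy as np

def recursive_encode(word_vecs, We, be, f=np.tanh):
    """Encode a serialized word sequence into one fixed-dimension vector.

    word_vecs: list of np.ndarray, each of shape (n,), in encoding order.
    We: shared encoding weight of shape (n, 2n); be: bias of shape (n,).
    The same We and be are used at every step, so the final parent node is
    a text vector of fixed dimension n regardless of sequence length.
    """
    p = word_vecs[0]
    for w in word_vecs[1:]:
        c = np.concatenate([p, w])   # child pair [c1; c2] of dimension 2n
        p = f(We @ c + be)           # parent node p = f(We [c1; c2] + be)
    return p

# toy usage: four 8-dimensional word vectors, as in the Fig. 3 example
rng = np.random.default_rng(0)
n = 8
words = [rng.normal(size=n) for _ in range(4)]
We = rng.normal(size=(n, 2 * n)) * 0.1
be = np.zeros(n)
print(recursive_encode(words, We, be).shape)   # (8,)
```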
Further, the mean vector and the variance vector are generated in step S300 by mappings of the same form.
Figs. 4 and 5 show the process of performing variational inference on the obtained coded representation. The generated latent variable representation z needs to satisfy the condition of obeying the distribution N(μ, σ), where μ denotes the generated mean vector and σ denotes the generated variance vector; the process of generating the mean vector and the variance vector is shown in Fig. 4. As shown in Fig. 5, the latent coded representation is generated by z = μ + ε·σ, where ε ~ N(0, I). A variable used to generate the latent coded representation z is collected from the standard normal distribution, and the distribution of this variable is used in the divergence computation during model training; the variable is multiplied by the variance vector, and the resulting product is summed with the mean vector to obtain the latent coded representation z. That is, Figs. 4 and 5 describe the reparameterization of the coded representation by variational inference; since the generated coded representation z obeys the distribution N(μ, σ), the resulting encoding is distributed over a region rather than at a single point, i.e., it better describes the distribution of the data.
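A minimal numpy sketch of this reparameterization step follows. The mean and variance mappings are assumed to be affine layers of the same form, matching the "mappings of the same form" above, and the network is assumed to output the log-variance for numerical stability; both are implementation assumptions rather than details the patent specifies:

```python
import numpy as np

def reparameterize(h, Wm, bm, Wv, bv, rng):
    """Map a fixed-dimension text vector h to a latent coded representation z.

    h: encoder output of shape (n,).
    (Wm, bm) and (Wv, bv): two affine mappings of the same form (k x n
    weights, k-dimensional biases) producing the mean vector and the
    log-variance vector of dimension k.
    """
    mu = Wm @ h + bm                      # mean vector
    log_var = Wv @ h + bv                 # log sigma^2, for stability
    sigma = np.exp(0.5 * log_var)         # variance vector -> std deviation
    eps = rng.standard_normal(mu.shape)   # collect epsilon ~ N(0, I)
    return mu + eps * sigma, mu, log_var  # z = mu + eps * sigma
```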
Specifically, the decoding process of the latent coded representation z in step S400 comprises the following steps. S410: generate, on the basis of the coded representation z, an input vector x whose dimension is twice that of the coded representation z, wherein one part of the input vector x is a child node c and the other part is a parent node p used for decoding. S420: continue decoding the parent node p to obtain new child nodes c1' and p1', wherein p1' is the new parent node used for decoding. S430: decode recursively as in step S420, each time using a new child node as the parent node for the next decoding step, until a decoded sequence of the same length as the encoded sequence is generated.
Fig. 6 is a structural schematic diagram of the recursive variational autoencoder model of the method of the present invention. As can be seen from the figure, after obtaining the latent coded representation z, the method of the present invention converts the generated latent coded representation z into an input representation for decoding. For example, if the word vectors of the self-media text content have a dimension of 100 and the generated coded representation z has a vector dimension of 50, it needs to be converted into a 100-dimensional vector representation through the processing of a neural network. After this transformation of the encoding, the coded representation p3' for generating the child nodes is obtained. Taking the four-word input used above for encoding as the decoding example, p3' first passes through the decoding matrix Wd to generate a 200-dimensional vector, which is divided into two parts: the first 100 dimensions are the decoded w4', and the last 100 dimensions are the parent node p2' for subsequent decoding. The parent node p2' then regenerates w3' and the parent node p1', and that parent node generates w2' and w1', which realizes the decoding process of the model. The coding loss between the decoded sequence and the encoded sequence is computed by Euclidean distance, and the parameters of the recursive variational autoencoder model are updated and the model optimized by the backpropagation algorithm. Through the encoding and decoding of the model, the encoding of the input text and the reconstruction of the text input can be completed, realizing an unsupervised representation of self-media text content; owing to this unsupervised characteristic, the method can better adapt to the coded representation of self-media data.
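Putting the pieces together, the sketch below shows the unfolding decoder and the two training terms: a squared Euclidean reconstruction term as the coding loss, plus the closed-form divergence of N(μ, σ²) from N(0, I). The layer shapes and names are assumptions for illustration, and the backpropagation update itself is omitted:

```python
import numpy as np

def recursive_decode(z, Wz, bz, Wd, bd, num_words, f=np.tanh):
    """Unfold a latent code z back into num_words decoded word vectors.

    Wz, bz: map z (dimension k) to the top parent node (dimension n),
    e.g. 50 -> 100 as in the Fig. 6 example.
    Wd, bd: decoding matrix mapping a parent (n,) to a (2n,) vector that
    splits into a decoded word (first n dims) and the next parent (last n).
    """
    p = f(Wz @ z + bz)           # top parent node, e.g. p3'
    n = p.shape[0]
    decoded = []
    for _ in range(num_words - 1):
        out = f(Wd @ p + bd)     # 2n-dimensional decoder output
        decoded.append(out[:n])  # decoded word, e.g. w4'
        p = out[n:]              # next parent node, e.g. p2'
    decoded.append(p)            # the last parent decodes the final word
    return decoded[::-1]         # reverse into the original word order

def training_terms(enc_words, dec_words, mu, log_var):
    """Coding loss plus the divergence D_KL(N(mu, sigma^2) || N(0, I))."""
    coding_loss = sum(np.sum((e - d) ** 2)          # squared Euclidean
                      for e, d in zip(enc_words, dec_words))
    kl = 0.5 * np.sum(mu ** 2 + np.exp(log_var) - log_var - 1.0)
    return coding_loss, kl
```

Both quantities are then fed to backpropagation to update the shared encoding weight, the mean and variance mappings, and the decoding matrix.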
The method of the present invention encodes the input self-media data text with the recursive neural network encoding model and the recursive neural network decoding model, computes the latent coded representation z, and then decodes the latent coded representation z; by computing the coding loss and the divergence between the latent coded representation z and the standard normal distribution, and updating the parameters of the recursive variational autoencoder model with that coding loss and divergence, the encoding performance of the model is improved. Moreover, the model can generate different latent coded representations z for different input texts, thereby realizing an accurate coded representation of different input texts.
The embodiments of the self-media data text representation method based on the recursive variational autoencoder model of the present invention have been described above. The specific features of the self-media data text representation method based on the recursive variational autoencoder model of the present invention can be designed concretely according to the effects of the features disclosed above; such designs can all be realized by those skilled in the art. Moreover, the technical features disclosed above are not limited to the disclosed combinations with other features; those skilled in the art can also make other combinations of the technical features according to the purpose of the invention, so long as the purpose of the present invention is achieved.

Claims (10)

1. A self-media data text representation method based on a recursive variational autoencoder model, wherein the method comprises the following steps:
Step S100: preprocessing the input corpus text to obtain an encoded sequence;
Step S200: encoding the encoded sequence with a recursive neural network encoding model to generate a fixed-dimension text vector;
Step S300: generating a mean vector and a variance vector from the fixed-dimension text vector, then collecting a sample from the standard normal distribution and generating a latent coded representation z from the mean vector, the variance vector, and the sample by the method of variational inference;
Step S400: decoding the latent coded representation z with a recursive neural network decoding model to obtain a decoded sequence, computing the coding loss between the encoded sequence and the decoded sequence as well as the divergence between the latent coded representation z and the standard normal distribution, and updating the parameters of the recursive variational autoencoder model with the coding loss and the divergence.
2. The self-media data text representation method based on a recursive variational autoencoder model according to claim 1, wherein preprocessing the input corpus text in step S100 comprises the following steps:
Step S110: filtering each input corpus text, removing its tags, punctuation, and links, and performing word segmentation on its content to generate a text T;
Step S120: counting the words in the corpus text and generating a dictionary of the words in the corpus text, and initializing a vector for each word in each corpus text, wherein the dimension of the initialization vector of each word is set according to experimental performance;
Step S130: performing dependency structure analysis on the text T, and serializing the analyzed structure to obtain the encoded sequence.
3. The self-media data text representation method based on a recursive variational autoencoder model according to claim 2, wherein step S130 further comprises:
performing text content analysis on the text T with the Stanford dependency parser to generate a dependency tree structure;
serializing the dependency tree structure as a binary tree to obtain the encoded sequence.
4. The self-media data text representation method based on a recursive variational autoencoder model according to claim 1, wherein the encoded sequence is encoded in step S200 with the recursive neural network encoding model, and the word vectors used during encoding include the initialization vectors and/or pre-trained word vectors.
5. The self-media data text representation method based on a recursive variational autoencoder model according to claim 1, wherein encoding the encoded sequence with the recursive neural network encoding model in step S200 to generate the fixed-dimension text vector comprises the following steps:
S210: selecting two child nodes c1 and c2, and generating a first parent node p1 from c1 and c2;
S220: forming a new pair of child nodes from the generated parent node p1 and a word in the encoded sequence to generate a second parent node p2;
S230: encoding recursively as in step S220, each time generating a new parent node from a parent node and a word in the encoded sequence, until all the words in the encoded sequence have been encoded; wherein,
during encoding, the encoding weight We is shared by every encoding step, so that the text encoding produced by the encoder is represented as a vector of the fixed dimension.
6. The self-media data text representation method based on a recursive variational autoencoder model according to claim 1, wherein the mean vector and the variance vector are generated in step S300 by mappings of the same form.
7. The self-media data text representation method based on a recursive variational autoencoder model according to claim 1, wherein step S300 comprises:
collecting, from the standard normal distribution, a variable used to generate the latent coded representation z, the distribution of the variable being used in the divergence computation during model training;
multiplying the variable by the variance vector, and then summing the resulting product with the mean vector to obtain the latent coded representation z.
8. The self-media data text representation method based on a recursive variational autoencoder model according to claim 1, wherein the decoding process of the latent coded representation z in step S400 comprises the following steps:
S410: generating, on the basis of the coded representation z, an input vector x whose dimension is twice that of the coded representation z, wherein one part of the input vector x is a child node c and the other part is a parent node p used for decoding;
S420: continuing to decode the parent node p to obtain new child nodes c1' and p1', wherein p1' is the new parent node used for decoding;
S430: decoding recursively as in step S420, each time using a new child node as the parent node for the next decoding step, until a decoded sequence of the same length as the encoded sequence is generated.
9. The self-media data text representation method based on a recursive variational autoencoder model according to claim 1, wherein the coding loss between the decoded sequence and the encoded sequence is computed in step S400 by Euclidean distance.
10. The self-media data text representation method based on a recursive variational autoencoder model according to claim 1, wherein the parameters of the recursive variational autoencoder model are updated in step S400 by the backpropagation algorithm.
CN201711417351.2A 2017-12-25 2017-12-25 Self-media data text representation method based on recursive variational autoencoder model Active CN108363685B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711417351.2A CN108363685B (en) 2017-12-25 2017-12-25 Self-media data text representation method based on recursive variational autoencoder model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711417351.2A CN108363685B (en) 2017-12-25 2017-12-25 Self-media data text representation method based on recursive variational autoencoder model

Publications (2)

Publication Number Publication Date
CN108363685A true CN108363685A (en) 2018-08-03
CN108363685B (en) 2021-09-14

Family

ID=63010041

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711417351.2A Active CN108363685B (en) 2017-12-25 2017-12-25 Self-media data text representation method based on recursive variational autoencoder model

Country Status (1)

Country Link
CN (1) CN108363685B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1510931A1 (en) * 2003-08-28 2005-03-02 DVZ-Systemhaus GmbH Process for platform-independent archiving and indexing of digital media assets
CN101645786A * 2009-06-24 2010-02-10 China United Network Communications Group Co., Ltd. Method for issuing blog content and business processing device thereof
US9053431B1 (en) * 2010-10-26 2015-06-09 Michael Lamport Commons Intelligent control with hierarchical stacked neural networks
CN105469065A * 2015-12-07 2016-04-06 Institute of Automation, Chinese Academy of Sciences Recurrent neural network-based discrete emotion recognition method
CN106844327A * 2015-12-07 2017-06-13 iFLYTEK Co., Ltd. Text encoding method and system
CN107220311A * 2017-05-12 2017-09-29 Beijing Institute of Technology Text representation method using locally embedded topic modeling

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Diederik P. Kingma et al.: "Auto-Encoding Variational Bayes", arXiv:1312.6114v10 [stat.ML], 1 May 2014 *
Anonymous: "[Learning Notes] Variational Auto-Encoder (VAE)", https://blog.csdn.net/jackytintin/article/details/53641885 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109213975A * 2018-08-23 2019-01-15 Chongqing University of Posts and Telecommunications Twitter text representation method based on character-level convolutional variational self-coding
CN109213975B * 2018-08-23 2022-04-12 Chongqing University of Posts and Telecommunications Twitter text representation method based on character-level convolutional variational self-coding
CN109886388A * 2019-01-09 2019-06-14 Ping An Technology (Shenzhen) Co., Ltd. Training sample data expansion method and device based on a variational autoencoder
WO2020143321A1 * 2019-01-09 2020-07-16 Ping An Technology (Shenzhen) Co., Ltd. Training sample data augmentation method based on variational autoencoder, storage medium and computer device
CN109886388B * 2019-01-09 2024-03-22 Ping An Technology (Shenzhen) Co., Ltd. Training sample data expansion method and device based on a variational autoencoder
CN111581916A * 2020-05-15 2020-08-25 Beijing ByteDance Network Technology Co., Ltd. Text generation method and device, electronic equipment and computer readable medium
CN111581916B * 2020-05-15 2022-03-01 Beijing ByteDance Network Technology Co., Ltd. Text generation method and device, electronic equipment and computer readable medium
CN113379068A * 2021-06-29 2021-09-10 Harbin Institute of Technology Deep learning architecture searching method based on structured data
CN113379068B * 2021-06-29 2023-08-08 Harbin Institute of Technology Deep learning architecture searching method based on structured data

Also Published As

Publication number Publication date
CN108363685B (en) 2021-09-14

Similar Documents

Publication Publication Date Title
CN106202010B Method and apparatus for building legal text syntax trees based on a deep neural network
CN110321417B (en) Dialog generation method, system, readable storage medium and computer equipment
CN108830287A Chinese image semantic description method using an Inception network fused with multilayer GRUs based on residual connections
CN107203511A Network text named entity recognition method based on neural network probability disambiguation
CN108363695B (en) User comment attribute extraction method based on bidirectional dependency syntax tree representation
CN109359297B (en) Relationship extraction method and system
CN109213975B Twitter text representation method based on character-level convolutional variational self-coding
CN104598611B Method and system for ranking search entries
CN107748757A Question answering method based on a knowledge graph
CN108363685A (en) Based on recurrence variation own coding model from media data document representation method
CN107273913B (en) Short text similarity calculation method based on multi-feature fusion
CN110609899A (en) Specific target emotion classification method based on improved BERT model
CN107608953B (en) Word vector generation method based on indefinite-length context
CN110851575B (en) Dialogue generating system and dialogue realizing method
CN112949647B (en) Three-dimensional scene description method and device, electronic equipment and storage medium
CN113254610B (en) Multi-round conversation generation method for patent consultation
CN113435211B (en) Text implicit emotion analysis method combined with external knowledge
CN112417289B (en) Information intelligent recommendation method based on deep clustering
CN111581966A Aspect-level sentiment classification method and device with context feature fusion
US20200334410A1 (en) Encoding textual information for text analysis
CN111858940A (en) Multi-head attention-based legal case similarity calculation method and system
CN114936287A (en) Knowledge injection method for pre-training language model and corresponding interactive system
CN114528898A (en) Scene graph modification based on natural language commands
CN111540470B (en) Social network depression tendency detection model based on BERT transfer learning and training method thereof
CN113761220A (en) Information acquisition method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant