CN108363685A - Self-media data text representation method based on a recursive variational autoencoder model - Google Patents

Self-media data text representation method based on a recursive variational autoencoder model

Info

Publication number
CN108363685A
Authority
CN
China
Prior art keywords
text
coding
variation
recurrence
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711417351.2A
Other languages
Chinese (zh)
Other versions
CN108363685B (en)
Inventor
Wang Jiabin (王家彬)
Huang Jiangping (黄江平)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
DIGITAL TELEVISION TECHNOLOGY CENTER BEIJING PEONY ELECTRONIC GROUP Co Ltd
Original Assignee
DIGITAL TELEVISION TECHNOLOGY CENTER BEIJING PEONY ELECTRONIC GROUP Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by DIGITAL TELEVISION TECHNOLOGY CENTER BEIJING PEONY ELECTRONIC GROUP Co Ltd filed Critical DIGITAL TELEVISION TECHNOLOGY CENTER BEIJING PEONY ELECTRONIC GROUP Co Ltd
Priority to CN201711417351.2A
Publication of CN108363685A
Application granted
Publication of CN108363685B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/084: Backpropagation, e.g. using gradient descent

Abstract

The present invention provides a self-media data text representation method based on a recursive variational autoencoder model. The method includes: preprocessing the input corpus text and encoding it with a recursive neural network encoding model to generate a fixed-dimension text vector; generating a mean vector and a variance vector from the fixed-dimension text vector, collecting a sample from the standard normal distribution, and generating a latent coded representation z from the mean vector, the variance vector, and the sample by the method of variational inference; then decoding with a recursive neural network decoding model to obtain a decoded sequence, computing the coding loss between the encoded sequence and the decoded sequence as well as the divergence between the latent coded representation z and the standard normal distribution, and updating the parameters of the recursive variational autoencoder model with the coding loss and the divergence. The method of the present invention has high encoding performance, adapts well to the coded representation of self-media data, and can describe the distribution of the data while fitting its content.

Description

Self-media data text representation method based on a recursive variational autoencoder model
Technical field
The present invention relates to the technical field of deep learning and self-media text content analysis, and in particular to a self-media data text representation method based on a recursive variational autoencoder model.
Background technology
With the development of social media in recent years, users have generated a large amount of short self-media text content. Because such texts lack effective contextual information, they are difficult to represent with traditional bag-of-words models.
Deep learning originates from research on artificial neural networks; a multilayer network containing multiple hidden layers is one kind of deep learning structure. Deep learning combines low-level features to form more abstract high-level attribute categories or features, so as to discover distributed feature representations of the data. The concept of deep learning was proposed by Hinton et al. in 2006, who proposed an unsupervised greedy layer-by-layer training algorithm based on deep belief networks (DBN) to solve the optimization problems associated with deep structures; the multilayer autoencoder deep structure was proposed subsequently. The convolutional neural network proposed by Yann LeCun et al. was the first truly multilayer structure learning algorithm, which uses spatial correlation to reduce the number of parameters and thereby improve training performance. The computation involved in producing an output from an input in deep learning can be represented by a flow graph, in which each node represents a basic computation and the value of that computation, and the result of a computation is applied to the values of the child nodes of that node. Deep learning simulates the human cognitive process, which proceeds layer by layer and abstracts gradually: simple concepts are learned first and are then used to represent more abstract thoughts and concepts. This approach has been successfully applied in recent years to fields such as computer vision and speech recognition; although the application of deep learning methods to natural language processing has received a great deal of attention, most of that work is based on model design and lacks the introduction of knowledge.
As for the representation of text content, traditional representation learning for self-media text content is mostly based on bag-of-words models and uses word representation methods such as one-hot encoding, which inevitably leads to the serious "lexical gap" phenomenon between words, i.e., words with similar semantics are nevertheless mutually orthogonal in their vector representations. Although these methods are rather effective for representing traditional text, applying them to self-media text representation leads to serious data sparsity problems. Traditional methods generally use hand-crafted features for the feature extraction of representation learning on self-media text content, but such methods depend on human experience, and for some professional domains a corresponding expert must build a knowledge base of the self-media data before the representation of these data texts can be realized well.
Various data text analysis methods exist in the prior art, but most of these methods analyze self-media text content for common domains or a few special domains, and they usually fit the data with only a simple text encoding, lacking a description of the data distribution, which causes problems such as inaccurate text representation.
Summary of the invention
It is an object of the present invention to provide a self-media data text representation method based on a recursive variational autoencoder model, which has high encoding performance, adapts well to the coded representation of self-media data, and can describe the distribution of the data while fitting its content.
The present invention provides a self-media data text representation method based on a recursive variational autoencoder model, wherein the method comprises the following steps:
Step S100: preprocess the input corpus text to obtain an encoded sequence;
Step S200: encode the encoded sequence with a recursive neural network encoding model to generate a fixed-dimension text vector;
Step S300: generate a mean vector and a variance vector from the fixed-dimension text vector, then collect a sample from the standard normal distribution and generate a latent coded representation z from the mean vector, the variance vector, and the sample by the method of variational inference;
Step S400: decode the latent coded representation z with a recursive neural network decoding model to obtain a decoded sequence, compute the coding loss between the encoded sequence and the decoded sequence as well as the divergence between the latent coded representation z and the standard normal distribution, and update the parameters of the recursive variational autoencoder model with the coding loss and the divergence.
Preferably, preprocessing the input corpus text in step S100 comprises the following steps:
Step S110: filter each input corpus text, remove its tags, punctuation, and links, and perform word segmentation on its content to generate a text T;
Step S120: count the words in the corpus text and generate a dictionary of the words in the corpus text, and initialize a vector for each word in each corpus text, wherein the dimension of the initialization vector of each word is set according to experimental performance;
Step S130: perform dependency structure analysis on the text T, and serialize the analyzed structure to obtain the encoded sequence.
Preferably, step S130 further comprises:
performing text content analysis on the text T with the Stanford dependency parser to generate a dependency tree structure;
serializing the dependency tree structure as a binary tree to obtain the encoded sequence.
Preferably, the encoded sequence is encoded in step S200 with the recursive neural network encoding model, and the word vectors used during encoding include the initialization vectors and/or pre-trained word vectors.
Preferably, encoding the encoded sequence with the recursive neural network encoding model in step S200 to generate the fixed-dimension text vector comprises the following steps:
S210: select two child nodes c1 and c2, and generate a first parent node p1 from c1 and c2;
S220: form a new pair of child nodes from the generated parent node p1 and a word in the encoded sequence to generate a second parent node p2;
S230: encode recursively as in step S220, each time generating a new parent node from a parent node and a word in the encoded sequence, until all the words in the encoded sequence have been encoded; wherein,
during encoding, the encoding weight We is shared by every encoding step, so that the text encoding produced by the encoder is represented as a vector of the fixed dimension.
Preferably, the mean vector and the variance vector are generated in step S300 by mappings of the same form.
Preferably, step S300 comprises:
collecting, from the standard normal distribution, a variable used to generate the latent coded representation z, the distribution of the variable being used in the divergence computation during model training;
multiplying the variable by the variance vector, and then summing the resulting product with the mean vector to obtain the latent coded representation z.
Preferably, the decoding process of the latent coded representation z in step S400 comprises the following steps:
S410: generate, on the basis of the coded representation z, an input vector x whose dimension is twice that of the coded representation z, wherein one part of the input vector x is a child node c and the other part is a parent node p used for decoding;
S420: continue decoding the parent node p to obtain new child nodes c1' and p1', wherein p1' is the new parent node used for decoding;
S430: decode recursively as in step S420, each time using a new child node as the parent node for the next decoding step, until a decoded sequence of the same length as the encoded sequence is generated.
Preferably, the coding loss between the decoded sequence and the encoded sequence is computed in step S400 by Euclidean distance.
Preferably, the parameters of the recursive variational autoencoder model are updated in step S400 by the backpropagation algorithm.
The present invention has the following advantages and beneficial effects:
1. In terms of text content representation, the self-media data text representation method based on the recursive variational autoencoder model of the present invention overcomes the representation problems caused by the lack of context when traditional methods represent self-media text content, and it introduces heuristic knowledge by using existing text processing tools for the representation of the text content, improving the performance of the text representation.
2. The method of the present invention uses a recursive neural network encoding model that can not only encode text content sequentially but can also encode text content with a tree structure, effectively avoiding the deficiency of conventional methods that can only encode text content sequentially; it represents the text in better accordance with its real structure, so that the structure of the coded representation better meets actual needs.
3. The method of the present invention uses variational inference to better embody the process by which a deep learning method simulates the true distribution of the data.
4. The method of the present invention uses an unfolding recursive neural network decoding model that can reconstruct the input text content; it measures the encoding performance of the model by means such as Euclidean distance computation, and optimizes the model's representation of self-media text content with the new model parameters.
5. The method of the present invention introduces the standard normal distribution and computes the mean vector and variance vector of the input text to obtain the latent coded representation z; the latent coded representation z accumulates knowledge such as word vector knowledge and text structure, satisfies a certain distribution, and its dimension can be set as needed; it contains more feature information than a traditional recursive coding vector, which benefits the representation and computation of text.
6. The method of the present invention can update the parameters of the recursive variational autoencoder model with the coding loss and the divergence, thereby optimizing the model, better fitting the training data, and improving encoding performance.
Description of the drawings
The drawings used in this application are briefly described below; obviously, these drawings are only intended to explain the concept of the present invention.
Fig. 1 is a flowchart of the self-media data text representation method based on the recursive variational autoencoder model of the present invention;
Fig. 2 shows the dependency structure of a text obtained with the dependency parser in the self-media data text representation method based on the recursive variational autoencoder model of the present invention;
Fig. 3 is a structural schematic diagram of encoding with the recursive neural network encoding model in the self-media data text representation method based on the recursive variational autoencoder model of the present invention;
Fig. 4 is a flowchart of generating the mean vector and the variance vector in the self-media data text representation method based on the recursive variational autoencoder model of the present invention;
Fig. 5 is a flowchart of sampling a variable from the standard normal distribution and generating the latent coded representation in the self-media data text representation method based on the recursive variational autoencoder model of the present invention;
Fig. 6 is a structural schematic diagram of the recursive variational autoencoder model of the self-media data text representation method of the present invention.
Specific embodiments
Hereinafter, embodiments of the self-media data text representation method based on the recursive variational autoencoder model of the present invention are described with reference to the accompanying drawings.
The embodiments recorded herein are specific embodiments of the present invention and serve to illustrate its concept; they are explanatory and exemplary and should not be construed as limiting the embodiments of the present invention or the scope of the invention. Besides the embodiments recorded herein, those skilled in the art can also adopt other obvious technical solutions on the basis of the claims of this application and the content disclosed in the specification, including technical solutions that make any obvious substitution or modification of the embodiments recorded herein.
The drawings of this specification are schematic diagrams that aid in illustrating the concept of the present invention and schematically indicate the shapes of the parts and their mutual relationships. Note that, in order to clearly show the structures of the parts of the embodiments of the present invention, the drawings are not necessarily drawn to the same scale. The same or similar reference markers are used to denote the same or similar parts.
Referring to Fig. 1, the present invention provides a self-media data text representation method based on a recursive variational autoencoder model, wherein the method comprises the following steps:
Step S100: preprocess the input corpus text to obtain an encoded sequence;
Step S200: encode the encoded sequence with a recursive neural network encoding model to generate a fixed-dimension text vector;
Step S300: generate a mean vector and a variance vector from the fixed-dimension text vector, then collect a sample from the standard normal distribution and generate a latent coded representation z from the mean vector, the variance vector, and the sample by the method of variational inference;
Step S400: decode the latent coded representation z with a recursive neural network decoding model to obtain a decoded sequence, compute the coding loss between the encoded sequence and the decoded sequence as well as the divergence between the latent coded representation z and the standard normal distribution, and update the parameters of the recursive variational autoencoder model with the coding loss and the divergence.
In the method of the present invention, the latent coded representation z obtained by computation accumulates knowledge such as word vector knowledge and text structure, satisfies a certain distribution, and its dimension can be set according to actual needs; compared with a traditional recursive coding vector it contains more feature information, which benefits the representation and computation of text, and it reduces the encoding dimension and improves computational efficiency. In addition, the method of the present invention computes the coding loss between the encoded sequence and the decoded sequence through the latent coding, as well as the divergence between the latent coded representation z and the standard normal distribution, and automatically updates the parameters of the recursive variational autoencoder model with the coding loss and the divergence, which effectively improves the encoding performance of the model; when different texts are input, the recursive variational autoencoder model can automatically update its parameters according to the text content, so that different texts are represented accurately.
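For concreteness, the training objective implied by step S400 can be written in the standard variational autoencoder form below. This is a reconstruction from the description, assuming the usual closed-form KL divergence between N(μ, σ²) and N(0, I) (consistent with the Kingma et al. reference cited by this patent); the relative weighting of the two terms is not specified by the patent:

\[
\mathcal{L} \;=\; \sum_{t}\,\lVert w_t - w_t' \rVert_2^2 \;+\; \frac{1}{2}\sum_{i=1}^{k}\left(\mu_i^2 + \sigma_i^2 - \log \sigma_i^2 - 1\right),
\]

where the first term is the coding loss between the words w_t of the encoded sequence and the words w_t' of the decoded sequence, and the second term is the divergence D_KL(N(μ, σ²) || N(0, I)) of the latent coded representation z from the standard normal distribution.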
Further, preprocessing the input corpus text in step S100 of the present invention comprises the following steps:
Step S110: filter each input corpus text, remove its tags, punctuation, and links, and perform word segmentation on its content to generate a text T;
Step S120: count the words in the corpus text and generate a dictionary of the words in the corpus text, and initialize a vector for each word in each corpus text, wherein the dimension of the initialization vector of each word is set according to experimental performance;
Step S130: perform dependency structure analysis on the text T, and serialize the analyzed structure to obtain the encoded sequence.
Further, in step S130, text content analysis is performed on the text T with the Stanford dependency parser to generate a dependency tree structure, and the dependency tree structure is serialized as a binary tree to obtain the encoded sequence. By analyzing the structure of the text, the deficiency of conventional methods that can only encode text content sequentially is overcome, and the text is represented in better accordance with its real structure, which better meets actual needs.
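The patent does not publish its serialization code, so the following minimal Python sketch is an illustration only: it assumes a dependency parse is already available as head indices (for example from the Stanford dependency parser) and emits the words of each subtree before their head, a deliberately simplified stand-in for the binary-tree serialization described above; the function name and the toy parse are assumptions.

```python
from collections import defaultdict

def serialize_dependency_tree(words, heads):
    """Serialize a dependency tree into a word order for recursive encoding.

    words: list of tokens, e.g. ["My", "cat", "also", "likes", ...]
    heads: heads[i] is the index of the head of words[i], or -1 for the root.
    Returns the tokens in a depth-first order in which every subtree is
    emitted before its head, so the recursive encoder can combine child
    representations pairwise before combining them upward.
    """
    children = defaultdict(list)
    root = None
    for i, h in enumerate(heads):
        if h == -1:
            root = i
        else:
            children[h].append(i)

    order = []
    def visit(i):
        for c in children[i]:   # dependents first, then the head
            visit(c)
        order.append(words[i])
    visit(root)
    return order

# toy usage with the sentence from Fig. 2; the head indices are an assumed
# parse for illustration, not the parser's actual output
words = ["My", "cat", "also", "likes", "eating", "fish", "and", "hamburger"]
heads = [1, 3, 3, -1, 3, 4, 7, 5]
print(serialize_dependency_tree(words, heads))
```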
Fig. 2 shows the dependency structure of a text obtained with the dependency parser in the self-media data text representation method based on the recursive variational autoencoder model of the present invention. The method of the present invention is further illustrated below with reference to Fig. 2 and a specific embodiment.
Fig. 2 shows the process of analyzing the self-media text content "My cat also likes eating fish and hamburger" with the dependency parser. When the input self-media text data passes through the dependency parser, a dependency tree structure of the text is produced: the word "likes" in the text connects the two parts "My cat" and "eating fish and hamburger"; the adverb "also" modifies the verb "likes"; "My cat" is composed of the words "My" and "cat"; "eating fish and hamburger" can be further split into the two parts "eating" and "fish and hamburger"; and "fish" and "hamburger" form a coordinate structure through the conjunction "and". Through this dependency analysis tool, the structure of the self-media text can be explicitly represented with the knowledge of external resources, and the encoding is performed through this explicit representation. The dependency relations of such a structure intuitively depict the syntactic matching relations between the words in the text; these matching relations are associated with the semantics, which in turn makes the context of the coded representation more coherent.
Further, when the encoded sequence is encoded with the recursive neural network encoding model in step S200, the word vectors used during encoding include the initialization vectors and/or pre-trained word vectors; heuristic knowledge can thus be introduced, reducing the amount of encoding computation and improving encoding efficiency.
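A minimal sketch of this dictionary and word-vector setup follows; it assumes random initialization that is overwritten by pre-trained vectors where available, with the dimension of 100 taken from the example given later for Fig. 6, and all names are illustrative rather than the patent's own:

```python
import numpy as np

def build_embeddings(corpus_tokens, pretrained=None, dim=100, seed=0):
    """Build a word dictionary and an initial embedding matrix.

    corpus_tokens: iterable of token lists, one per corpus text.
    pretrained: optional dict mapping word -> np.ndarray of shape (dim,).
    Returns (vocab, E), where vocab maps each word to its row index in E.
    """
    rng = np.random.default_rng(seed)
    vocab = {}
    for tokens in corpus_tokens:
        for w in tokens:
            if w not in vocab:
                vocab[w] = len(vocab)

    # random initialization; the dimension is an experimental setting
    E = rng.normal(0.0, 0.1, size=(len(vocab), dim))

    # overwrite with pre-trained word vectors where available, which is one
    # way the heuristic knowledge mentioned above can be introduced
    if pretrained:
        for w, i in vocab.items():
            if w in pretrained:
                E[i] = pretrained[w]
    return vocab, E
```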
Specifically, encoding the encoded sequence with the recursive neural network encoding model in step S200 to generate the fixed-dimension text vector comprises the following steps:
S210: select two child nodes c1 and c2, and generate a first parent node p1 from c1 and c2;
S220: form a new pair of child nodes from the generated parent node p1 and a word in the encoded sequence to generate a second parent node p2;
S230: encode recursively as in step S220, each time generating a new parent node from a parent node and a word in the encoded sequence, until all the words in the encoded sequence have been encoded; wherein,
during encoding, the encoding weight We is shared by every encoding step, so that the text encoding produced by the encoder is represented as a vector of the fixed dimension.
Fig. 3 shows the process of encoding a representation of self-media text content; here the encoding process is described with a recursive neural network encoding an input sequence x = w1, w2, ..., w4. The encoding structure first concatenates the input word vectors w1 and w2 into a child-node vector [c1; c2] of dimension 2n, noting that (w1, w2) = (c1, c2); the formula p = f(We[c1; c2] + be) is then used to compute the parent node p1 = f(We[w1; w2] + be). Next, w3 and the computed p1 are combined into a new [c1; c2], i.e. (c1, c2) = (p1, w3), and the formula p = f(We[c1; c2] + be) is used again to compute the parent node p2 = f(We[p1; w3] + be); the parent node p3 is then computed by p3 = f(We[p2; w4] + be), and the recursion continues in this way until all the words in the encoded sequence have been encoded. Since the recursive encoding model uses this binary combination for text representation, the text needs to be expressed as a binary structure in a certain way; the dependency structure analysis of the text in step S130 is precisely the process of expressing the sequential organization of the text as a hierarchical structure, which extends the applicability of the model of the method of the present invention.
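The following runnable numpy sketch shows this shared-weight recursive encoder. The patent specifies the combination formula p = f(We[c1; c2] + be) but not the nonlinearity f, so tanh is assumed here as a common choice, and all variable names are illustrative:

```python
import numpy as np

def recursive_encode(word_vecs, We, be, f=np.tanh):
    """Encode a serialized word sequence into one fixed-dimension vector.

    word_vecs: list of np.ndarray, each of shape (n,), in encoding order.
    We: shared encoding weight of shape (n, 2n); be: bias of shape (n,).
    The same We and be are used at every step, so the final parent node is
    a text vector of fixed dimension n regardless of sequence length.
    """
    p = word_vecs[0]
    for w in word_vecs[1:]:
        c = np.concatenate([p, w])   # child pair [c1; c2] of dimension 2n
        p = f(We @ c + be)           # parent node p = f(We [c1; c2] + be)
    return p

# toy usage: four 8-dimensional word vectors, as in the Fig. 3 example
rng = np.random.default_rng(0)
n = 8
words = [rng.normal(size=n) for _ in range(4)]
We = rng.normal(size=(n, 2 * n)) * 0.1
be = np.zeros(n)
print(recursive_encode(words, We, be).shape)   # (8,)
```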
Further, the mean vector and the variance vector are generated in step S300 by mappings of the same form.
Figs. 4 and 5 show the process of performing variational inference on the obtained coded representation. The generated latent variable representation z needs to satisfy the condition of obeying the distribution N(μ, σ), where μ denotes the generated mean vector and σ denotes the generated variance vector; the process of generating the mean vector and the variance vector is shown in Fig. 4. As shown in Fig. 5, the latent coded representation is generated by z = μ + ε·σ, where ε ~ N(0, I). A variable used to generate the latent coded representation z is collected from the standard normal distribution, and the distribution of this variable is used in the divergence computation during model training; the variable is multiplied by the variance vector, and the resulting product is summed with the mean vector to obtain the latent coded representation z. That is, Figs. 4 and 5 describe the reparameterization of the coded representation by variational inference; since the generated coded representation z obeys the distribution N(μ, σ), the resulting encoding is distributed over a region rather than at a single point, i.e., it better describes the distribution of the data.
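A minimal numpy sketch of this reparameterization step follows. The mean and variance mappings are assumed to be affine layers of the same form, matching the "mappings of the same form" above, and the network is assumed to output the log-variance for numerical stability; both are implementation assumptions rather than details the patent specifies:

```python
import numpy as np

def reparameterize(h, Wm, bm, Wv, bv, rng):
    """Map a fixed-dimension text vector h to a latent coded representation z.

    h: encoder output of shape (n,).
    (Wm, bm) and (Wv, bv): two affine mappings of the same form (k x n
    weights, k-dimensional biases) producing the mean vector and the
    log-variance vector of dimension k.
    """
    mu = Wm @ h + bm                      # mean vector
    log_var = Wv @ h + bv                 # log sigma^2, for stability
    sigma = np.exp(0.5 * log_var)         # variance vector -> std deviation
    eps = rng.standard_normal(mu.shape)   # collect epsilon ~ N(0, I)
    return mu + eps * sigma, mu, log_var  # z = mu + eps * sigma
```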
Specifically, the decoding process of the latent coded representation z in step S400 comprises the following steps. S410: generate, on the basis of the coded representation z, an input vector x whose dimension is twice that of the coded representation z, wherein one part of the input vector x is a child node c and the other part is a parent node p used for decoding. S420: continue decoding the parent node p to obtain new child nodes c1' and p1', wherein p1' is the new parent node used for decoding. S430: decode recursively as in step S420, each time using a new child node as the parent node for the next decoding step, until a decoded sequence of the same length as the encoded sequence is generated.
Fig. 6 is a structural schematic diagram of the recursive variational autoencoder model of the method of the present invention. As can be seen from the figure, after obtaining the latent coded representation z, the method of the present invention converts the generated latent coded representation z into an input representation for decoding. For example, if the word vectors of the self-media text content have a dimension of 100 and the generated coded representation z has a vector dimension of 50, it needs to be converted into a 100-dimensional vector representation through the processing of a neural network. After this transformation of the encoding, the coded representation p3' for generating the child nodes is obtained. Taking the four-word input used above for encoding as the decoding example, p3' first passes through the decoding matrix Wd to generate a 200-dimensional vector, which is divided into two parts: the first 100 dimensions are the decoded w4', and the last 100 dimensions are the parent node p2' for subsequent decoding. The parent node p2' then regenerates w3' and the parent node p1', and that parent node generates w2' and w1', which realizes the decoding process of the model. The coding loss between the decoded sequence and the encoded sequence is computed by Euclidean distance, and the parameters of the recursive variational autoencoder model are updated and the model optimized by the backpropagation algorithm. Through the encoding and decoding of the model, the encoding of the input text and the reconstruction of the text input can be completed, realizing an unsupervised representation of self-media text content; owing to this unsupervised characteristic, the method can better adapt to the coded representation of self-media data.
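Putting the pieces together, the sketch below shows the unfolding decoder and the two training terms: a squared Euclidean reconstruction term as the coding loss, plus the closed-form divergence of N(μ, σ²) from N(0, I). The layer shapes and names are assumptions for illustration, and the backpropagation update itself is omitted:

```python
import numpy as np

def recursive_decode(z, Wz, bz, Wd, bd, num_words, f=np.tanh):
    """Unfold a latent code z back into num_words decoded word vectors.

    Wz, bz: map z (dimension k) to the top parent node (dimension n),
    e.g. 50 -> 100 as in the Fig. 6 example.
    Wd, bd: decoding matrix mapping a parent (n,) to a (2n,) vector that
    splits into a decoded word (first n dims) and the next parent (last n).
    """
    p = f(Wz @ z + bz)           # top parent node, e.g. p3'
    n = p.shape[0]
    decoded = []
    for _ in range(num_words - 1):
        out = f(Wd @ p + bd)     # 2n-dimensional decoder output
        decoded.append(out[:n])  # decoded word, e.g. w4'
        p = out[n:]              # next parent node, e.g. p2'
    decoded.append(p)            # the last parent decodes the final word
    return decoded[::-1]         # reverse into the original word order

def training_terms(enc_words, dec_words, mu, log_var):
    """Coding loss plus the divergence D_KL(N(mu, sigma^2) || N(0, I))."""
    coding_loss = sum(np.sum((e - d) ** 2)          # squared Euclidean
                      for e, d in zip(enc_words, dec_words))
    kl = 0.5 * np.sum(mu ** 2 + np.exp(log_var) - log_var - 1.0)
    return coding_loss, kl
```

Both quantities are then fed to backpropagation to update the shared encoding weight, the mean and variance mappings, and the decoding matrix.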
The method of the present invention encodes the input self-media data text with the recursive neural network encoding model and the recursive neural network decoding model, computes the latent coded representation z, and then decodes the latent coded representation z; by computing the coding loss and the divergence between the latent coded representation z and the standard normal distribution, and updating the parameters of the recursive variational autoencoder model with that coding loss and divergence, the encoding performance of the model is improved. Moreover, the model can generate different latent coded representations z for different input texts, thereby realizing an accurate coded representation of different input texts.
The embodiments of the self-media data text representation method based on the recursive variational autoencoder model of the present invention have been described above. The specific features of the self-media data text representation method based on the recursive variational autoencoder model of the present invention can be designed concretely according to the effects of the features disclosed above; such designs can all be realized by those skilled in the art. Moreover, the technical features disclosed above are not limited to the disclosed combinations with other features; those skilled in the art can also make other combinations of the technical features according to the purpose of the invention, so long as the purpose of the present invention is achieved.

Claims (10)

1. A self-media data text representation method based on a recursive variational autoencoder model, wherein the method comprises the following steps:
Step S100: preprocessing the input corpus text to obtain an encoded sequence;
Step S200: encoding the encoded sequence with a recursive neural network encoding model to generate a fixed-dimension text vector;
Step S300: generating a mean vector and a variance vector from the fixed-dimension text vector, then collecting a sample from the standard normal distribution and generating a latent coded representation z from the mean vector, the variance vector, and the sample by the method of variational inference;
Step S400: decoding the latent coded representation z with a recursive neural network decoding model to obtain a decoded sequence, computing the coding loss between the encoded sequence and the decoded sequence as well as the divergence between the latent coded representation z and the standard normal distribution, and updating the parameters of the recursive variational autoencoder model with the coding loss and the divergence.
2. The self-media data text representation method based on a recursive variational autoencoder model according to claim 1, wherein preprocessing the input corpus text in step S100 comprises the following steps:
Step S110: filtering each input corpus text, removing its tags, punctuation, and links, and performing word segmentation on its content to generate a text T;
Step S120: counting the words in the corpus text and generating a dictionary of the words in the corpus text, and initializing a vector for each word in each corpus text, wherein the dimension of the initialization vector of each word is set according to experimental performance;
Step S130: performing dependency structure analysis on the text T, and serializing the analyzed structure to obtain the encoded sequence.
3. The self-media data text representation method based on a recursive variational autoencoder model according to claim 2, wherein step S130 further comprises:
performing text content analysis on the text T with the Stanford dependency parser to generate a dependency tree structure;
serializing the dependency tree structure as a binary tree to obtain the encoded sequence.
4. The self-media data text representation method based on a recursive variational autoencoder model according to claim 1, wherein the encoded sequence is encoded in step S200 with the recursive neural network encoding model, and the word vectors used during encoding include the initialization vectors and/or pre-trained word vectors.
5. The self-media data text representation method based on a recursive variational autoencoder model according to claim 1, wherein encoding the encoded sequence with the recursive neural network encoding model in step S200 to generate the fixed-dimension text vector comprises the following steps:
S210: selecting two child nodes c1 and c2, and generating a first parent node p1 from c1 and c2;
S220: forming a new pair of child nodes from the generated parent node p1 and a word in the encoded sequence to generate a second parent node p2;
S230: encoding recursively as in step S220, each time generating a new parent node from a parent node and a word in the encoded sequence, until all the words in the encoded sequence have been encoded; wherein,
during encoding, the encoding weight We is shared by every encoding step, so that the text encoding produced by the encoder is represented as a vector of the fixed dimension.
6. The self-media data text representation method based on a recursive variational autoencoder model according to claim 1, wherein the mean vector and the variance vector are generated in step S300 by mappings of the same form.
7. The self-media data text representation method based on a recursive variational autoencoder model according to claim 1, wherein step S300 comprises:
collecting, from the standard normal distribution, a variable used to generate the latent coded representation z, the distribution of the variable being used in the divergence computation during model training;
multiplying the variable by the variance vector, and then summing the resulting product with the mean vector to obtain the latent coded representation z.
8. The self-media data text representation method based on a recursive variational autoencoder model according to claim 1, wherein the decoding process of the latent coded representation z in step S400 comprises the following steps:
S410: generating, on the basis of the coded representation z, an input vector x whose dimension is twice that of the coded representation z, wherein one part of the input vector x is a child node c and the other part is a parent node p used for decoding;
S420: continuing to decode the parent node p to obtain new child nodes c1' and p1', wherein p1' is the new parent node used for decoding;
S430: decoding recursively as in step S420, each time using a new child node as the parent node for the next decoding step, until a decoded sequence of the same length as the encoded sequence is generated.
9. The self-media data text representation method based on a recursive variational autoencoder model according to claim 1, wherein the coding loss between the decoded sequence and the encoded sequence is computed in step S400 by Euclidean distance.
10. The self-media data text representation method based on a recursive variational autoencoder model according to claim 1, wherein the parameters of the recursive variational autoencoder model are updated in step S400 by the backpropagation algorithm.
CN201711417351.2A 2017-12-25 2017-12-25 Self-media data text representation method based on recursive variational autoencoder model Active CN108363685B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711417351.2A CN108363685B (en) 2017-12-25 2017-12-25 Self-media data text representation method based on recursive variational autoencoder model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711417351.2A CN108363685B (en) 2017-12-25 2017-12-25 Self-media data text representation method based on recursive variational autoencoder model

Publications (2)

Publication Number Publication Date
CN108363685A true CN108363685A (en) 2018-08-03
CN108363685B (en) 2021-09-14

Family

ID=63010041

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711417351.2A Active CN108363685B (en) 2017-12-25 2017-12-25 Self-media data text representation method based on recursive variational autoencoder model

Country Status (1)

Country Link
CN (1) CN108363685B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1510931A1 (en) * 2003-08-28 2005-03-02 DVZ-Systemhaus GmbH Process for platform-independent archiving and indexing of digital media assets
CN101645786A * 2009-06-24 2010-02-10 China United Network Communications Group Co., Ltd. Method for issuing blog content and business processing device thereof
US9053431B1 (en) * 2010-10-26 2015-06-09 Michael Lamport Commons Intelligent control with hierarchical stacked neural networks
CN105469065A * 2015-12-07 2016-04-06 Institute of Automation, Chinese Academy of Sciences Recurrent neural network-based discrete emotion recognition method
CN106844327A * 2015-12-07 2017-06-13 iFLYTEK Co., Ltd. Text encoding method and system
CN107220311A * 2017-05-12 2017-09-29 Beijing Institute of Technology Text representation method using locally embedded topic modeling

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Diederik P. Kingma et al.: "Auto-Encoding Variational Bayes", arXiv:1312.6114v10 [stat.ML], 1 May 2014 *
Anonymous: "[Learning Notes] Variational Auto-Encoder (VAE)", https://blog.csdn.net/jackytintin/article/details/53641885 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109213975A * 2018-08-23 2019-01-15 Chongqing University of Posts and Telecommunications Twitter text representation method based on character-level convolutional variational self-coding
CN109213975B * 2018-08-23 2022-04-12 Chongqing University of Posts and Telecommunications Twitter text representation method based on character-level convolutional variational self-coding
CN109886388A * 2019-01-09 2019-06-14 Ping An Technology (Shenzhen) Co., Ltd. Training sample data expansion method and device based on a variational autoencoder
WO2020143321A1 * 2019-01-09 2020-07-16 Ping An Technology (Shenzhen) Co., Ltd. Training sample data augmentation method based on variational autoencoder, storage medium and computer device
CN109886388B * 2019-01-09 2024-03-22 Ping An Technology (Shenzhen) Co., Ltd. Training sample data expansion method and device based on a variational autoencoder
CN111581916A * 2020-05-15 2020-08-25 Beijing ByteDance Network Technology Co., Ltd. Text generation method and device, electronic equipment and computer readable medium
CN111581916B * 2020-05-15 2022-03-01 Beijing ByteDance Network Technology Co., Ltd. Text generation method and device, electronic equipment and computer readable medium
CN113379068A * 2021-06-29 2021-09-10 Harbin Institute of Technology Deep learning architecture searching method based on structured data
CN113379068B * 2021-06-29 2023-08-08 Harbin Institute of Technology Deep learning architecture searching method based on structured data

Also Published As

Publication number Publication date
CN108363685B (en) 2021-09-14

Similar Documents

Publication Publication Date Title
CN106202010B Method and apparatus for building legal text syntax trees based on a deep neural network
CN110321417B (en) Dialog generation method, system, readable storage medium and computer equipment
CN108830287A Chinese image semantic description method using an Inception network fused with multilayer GRUs based on residual connections
CN107203511A Network text named entity recognition method based on neural network probability disambiguation
CN108363695B (en) User comment attribute extraction method based on bidirectional dependency syntax tree representation
CN109359297B (en) Relationship extraction method and system
CN109213975B Twitter text representation method based on character-level convolutional variational self-coding
CN104598611B Method and system for ranking search entries
CN107748757A Question answering method based on a knowledge graph
CN108363685A (en) Based on recurrence variation own coding model from media data document representation method
CN107273913B (en) Short text similarity calculation method based on multi-feature fusion
CN110609899A (en) Specific target emotion classification method based on improved BERT model
CN107608953B (en) Word vector generation method based on indefinite-length context
CN110851575B (en) Dialogue generating system and dialogue realizing method
CN112949647B (en) Three-dimensional scene description method and device, electronic equipment and storage medium
CN113254610B (en) Multi-round conversation generation method for patent consultation
CN113435211B (en) Text implicit emotion analysis method combined with external knowledge
CN112417289B (en) Information intelligent recommendation method based on deep clustering
CN111581966A Aspect-level sentiment classification method and device with context feature fusion
US20200334410A1 (en) Encoding textual information for text analysis
CN111858940A (en) Multi-head attention-based legal case similarity calculation method and system
CN114936287A (en) Knowledge injection method for pre-training language model and corresponding interactive system
CN114528898A (en) Scene graph modification based on natural language commands
CN111540470B (en) Social network depression tendency detection model based on BERT transfer learning and training method thereof
CN113761220A (en) Information acquisition method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant