CN108363685B - Self-media data text representation method based on recursive variation self-coding model - Google Patents

Self-media data text representation method based on recursive variation self-coding model

Info

Publication number
CN108363685B
Authority
CN
China
Prior art keywords
coding, text, vector, self, representation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711417351.2A
Other languages
Chinese (zh)
Other versions
CN108363685A (en)
Inventor
王家彬
黄江平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Digital Television Technology Center Of Beijing Peony Electronics Group Co ltd
Original Assignee
Digital Television Technology Center Of Beijing Peony Electronics Group Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Digital Television Technology Center Of Beijing Peony Electronics Group Co ltd filed Critical Digital Television Technology Center Of Beijing Peony Electronics Group Co ltd
Priority to CN201711417351.2A priority Critical patent/CN108363685B/en
Publication of CN108363685A publication Critical patent/CN108363685A/en
Application granted granted Critical
Publication of CN108363685B publication Critical patent/CN108363685B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 Semantic analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention provides a self-media data text representation method based on a recursive variational self-coding model, which comprises the following steps: preprocessing an input corpus text and encoding it with a recursive neural network coding model to generate a text vector of fixed dimensionality; generating a mean vector and a variance vector from the fixed-dimension text vector, sampling from the standard normal distribution, and generating a potential coding representation z from the mean vector, the variance vector, and the sample by a variational inference method; then decoding the potential coding representation z with a recursive neural network decoding model to obtain a decoded sequence, calculating the coding loss between the coding sequence and the decoded sequence and the divergence between the potential coding representation z and the standard normal distribution, and updating the parameters of the recursive variational self-coding model using the coding loss and the divergence. The method has high coding performance, adapts well to the coded representation of self-media data, and describes the distribution of the data while fitting its content.

Description

Self-media data text representation method based on recursive variation self-coding model
Technical Field
The invention relates to the technical field of deep learning and self-media data text content analysis, in particular to a self-media data text representation method based on a recursive variational self-coding model.
Background
With the development of social media in recent years, users generate a large amount of short self-media text content. Such text lacks effective context information and is therefore difficult to represent with the traditional bag-of-words model.
Deep learning originates from research on artificial neural networks; a multi-layer network comprising multiple hidden layers is a deep learning structure. Deep learning combines low-level features to form more abstract high-level representations of attribute classes or features, thereby discovering a distributed feature representation of the data. The concept of deep learning was proposed by Hinton et al. in 2006, who introduced an unsupervised greedy layer-by-layer training algorithm based on the Deep Belief Network (DBN); multilayer auto-encoder deep structures were subsequently proposed to solve the optimization problems associated with deep structures. The convolutional neural network proposed by LeCun et al. was the first true multi-layer structure learning algorithm, which uses spatial relative relationships to reduce the number of parameters and improve training performance. Deep learning treats the computation involved in producing an output from an input as a flow graph, in which each node represents a basic computation and a computed value, and the result of each computation is applied to the values of the children of that node. Deep learning simulates the human cognitive process, which proceeds layer by layer as a step-by-step process of abstraction: simple concepts are learned first and then used to express more abstract ideas and concepts. Deep learning has been successfully applied to fields such as computer vision and speech recognition; although the application of deep learning methods to natural language processing has received great attention in recent years, most of these methods focus on model design and lack the introduction of external knowledge.
Regarding text content representation techniques, most traditional representation learning for self-media text content is based on the bag-of-words model and adopts word representations such as one-hot encoding, which inevitably causes a serious "lexical gap" phenomenon between words, that is, words with similar semantics are mutually orthogonal in their vector representations. While these methods are effective for representing traditional text, applying them to self-media text presents serious data sparseness problems. Traditional methods usually extract features manually for representation learning of media text content, but this relies on manual experience, and for self-media data in some professional fields, corresponding experts are required to construct a knowledge base before the data texts can be well represented.
In the prior art, various data text analysis methods exist, but most of them analyze the text content of self-media data only in common or in a few special fields, and they generally adopt simple text coding that merely fits the data and lacks any description of the data distribution, which leads to problems such as inaccurate text representation.
Disclosure of Invention
The invention aims to provide a self-media data text representation method based on a recursive variational self-coding model, which has high coding performance, can better adapt to the coded representation of self-media data, and can describe the distribution of the data while fitting the content of the data.
The invention provides a self-media data text representation method based on a recursive variational self-coding model, wherein the method comprises the following steps:
S100, preprocessing input corpus texts to obtain a coding sequence;
S200, encoding the coding sequence with a recursive neural network coding model to generate a text vector of fixed dimensionality;
S300, generating a mean vector and a variance vector from the fixed-dimension text vector, then drawing a sample from the standard normal distribution, and generating a potential coding representation z from the mean vector, the variance vector, and the sample by a variational inference method;
S400, decoding the potential coding representation z with a recursive neural network decoding model to obtain a decoded sequence, calculating the coding loss between the coding sequence and the decoded sequence and the divergence between the potential coding representation z and the standard normal distribution, and updating the parameters of the recursive variational self-coding model using the coding loss and the divergence.
Preferably, preprocessing the input corpus text in step S100 comprises the following steps:
step S110, filtering each input corpus text, removing tags, labels, and links from the corpus text, and performing word segmentation on the content of the corpus text to generate a text T;
step S120, counting words in the corpus text, generating a dictionary of words in the corpus text, and performing vector initialization on the words in each corpus text, wherein the initialized vector dimension of the words in each corpus text is set according to experimental performance;
and S130, performing dependency structure analysis on the text T, and performing serialization processing on the analyzed structure to obtain a coding sequence.
Preferably, step S130 further includes:
adopting a Stanford dependency analyzer to analyze the text content of the text T to generate a dependency tree structure;
and carrying out binary tree serialization processing on the dependency tree structure to obtain the coding sequence.
Preferably, in step S200, the coding sequence is encoded with a recursive neural network coding model, and the word vectors used in encoding include the initialization vectors and/or pre-trained word vectors.
Preferably, encoding the coding sequence with a recursive neural network coding model in step S200 and generating the fixed-dimension text vector comprises the following steps:
S210, selecting two child nodes c1 and c2, and generating a first parent node p1 from said c1 and said c2;
S220, combining the generated parent node p1 with a word in said coding sequence as a new pair of child nodes to generate a second parent node p2;
S230, recursively encoding as in step S220, each time generating a new parent node from one parent node and one word in the coding sequence, until all the words in the coding sequence have been encoded; wherein,
during encoding, the encoding weight We is shared at each encoding step, so that the text code generated by the encoding is represented as the vector of the fixed dimension.
Preferably, the mean vector and the variance vector are generated by identity mapping in step S300.
Preferably, step S300 includes:
sampling, from the standard normal distribution, the variable used to generate the potential coding representation z, the distribution of this variable being used for the divergence calculation in model training;
and multiplying the variable by the variance vector, then summing the resulting product with the mean vector to obtain the potential coding representation z.
Preferably, the decoding process of the potential coded representation z in step S400 includes the following steps:
S410, generating, on the basis of the potential coding representation z, an input vector x whose dimension is twice that of the potential coding representation z, wherein one part of the input vector x is a child node c and the other part is a parent node p for decoding;
S420, continuing to decode the parent node p to obtain a new child node c1' and a new node p1', wherein said p1' is the new parent node for decoding;
and S430, recursively decoding as in step S420, wherein a newly generated child node is used as the parent node for the next decoding step, until the decoded sequence with the same length as the coding sequence is generated.
Preferably, in step S400, the coding loss between the decoded sequence and the coding sequence is calculated by the Euclidean distance.
Preferably, in step S400, the parameters of the recursive variational self-coding model are updated by a back-propagation algorithm.
The invention has the following advantages and beneficial effects:
1. In terms of text content representation, the self-media data text representation method based on the recursive variational self-coding model overcomes the representation problems caused by the lack of context in self-media text content; through existing text processing tools, the method introduces empirical knowledge into the representation of text content and improves the performance of text representation.
2. The method adopts a recursive neural network coding model, which can encode not only sequential text content but also text content with a tree structure. This effectively avoids the limitation of traditional methods that can only encode text content sequentially, represents the text in better accordance with its real structure, and makes the structure of the coded representation better match actual requirements.
3. The method uses variational inference, which better reflects how a deep learning method approximates the real distribution of the data.
4. The method adopts an expanded recursive neural network decoding model that can reconstruct the input text content, measures the coding performance of the model by means such as Euclidean distance calculation, and optimizes the model's representation of self-media text content by updating the model parameters.
5. The method introduces the standard normal distribution and computes the mean vector and the variance vector of the input text to obtain the potential coding representation z. This representation contains knowledge such as word-vector knowledge and text structure, satisfies a certain distribution, allows the vector dimension to be set as required, contains more feature information than a traditional recursive coding vector, and facilitates the representation and computation of the text.
6. The self-media data text representation method based on the recursive variational self-coding model can update the parameters of the recursive variational self-coding model by using the coding loss and the divergence, further optimize the model, better fit training data and improve the coding performance.
Drawings
The drawings used in the present application will be briefly described below, and it should be apparent that they are merely illustrative of the concepts of the present invention.
FIG. 1 is a flow chart of a method for text representation of self-media data based on a recursive variational self-coding model according to the present invention;
FIG. 2 is a flow chart of obtaining the text dependency structure with the dependency analyzer in the self-media data text representation method based on the recursive variational self-coding model of the present invention;
FIG. 3 is a schematic structural diagram of encoding with the recursive neural network coding model in the self-media data text representation method based on the recursive variational self-coding model of the present invention;
FIG. 4 is a flow chart of generating the mean vector and the variance vector in the self-media data text representation method based on the recursive variational self-coding model of the present invention;
FIG. 5 is a flow chart of sampling variables from the normal distribution and generating the potential coding representation in the self-media data text representation method based on the recursive variational self-coding model of the present invention;
FIG. 6 is a schematic structural diagram of the recursive variational self-coding model in the self-media data text representation method based on the recursive variational self-coding model of the present invention.
Detailed Description
Hereinafter, an embodiment of a text representation method of self-media data based on a recursive variational self-coding model of the present invention will be described with reference to the accompanying drawings.
The examples described herein are specific embodiments of the present invention, are intended to be illustrative and exemplary in nature, and are not to be construed as limiting the scope of the invention. In addition to the embodiments described herein, those skilled in the art will be able to employ other technical solutions which are obvious based on the disclosure of the claims and the specification of the present application, and these technical solutions include any obvious replacement or modification of the embodiments described herein.
The drawings in the present specification are schematic views to assist in explaining the concept of the present invention, and schematically show the shapes of respective portions and their mutual relationships. It is noted that the drawings are not necessarily to the same scale so as to clearly illustrate the structure of portions of embodiments of the present invention. The same or similar reference numerals are used to denote the same or similar parts.
Referring to fig. 1, the present invention provides a method for text representation of self-media data based on a recursive variational self-coding model, wherein the method comprises the following steps:
S100, preprocessing input corpus texts to obtain a coding sequence;
S200, encoding the coding sequence with a recursive neural network coding model to generate a text vector of fixed dimensionality;
S300, generating a mean vector and a variance vector from the fixed-dimension text vector, then drawing a sample from the standard normal distribution, and generating a potential coding representation z from the mean vector, the variance vector, and the sample by a variational inference method;
and S400, decoding the potential coding representation z with a recursive neural network decoding model to obtain a decoded sequence, calculating the coding loss between the coding sequence and the decoded sequence and the divergence between the potential coding representation z and the standard normal distribution, and updating the parameters of the recursive variational self-coding model using the coding loss and the divergence.
The potential coding representation z calculated by the method of the invention contains knowledge such as word-vector knowledge and text structure, satisfies a certain distribution, allows the vector dimension to be set according to actual needs, contains more feature information than a traditional recursive coding vector, facilitates the representation and computation of the text, reduces the coding dimension, and improves the calculation efficiency. In addition, the method of the invention calculates the coding loss between the coding sequence and the decoded sequence and the divergence between the potential coding representation z and the standard normal distribution, and uses the coding loss and the divergence to automatically update the parameters of the recursive variational self-coding model, which effectively improves the coding performance of the model on different input texts.
Further, preprocessing the input corpus text in step S100 of the present invention comprises the following steps:
step S110, filtering each input corpus text, removing tags, labels, and links from the corpus text, and performing word segmentation on the content of the corpus text to generate a text T;
step S120, counting words in the corpus text, generating a dictionary of words in the corpus text, and performing vector initialization on the words in each corpus text, wherein the initialized vector dimension of the words in each corpus text is set according to experimental performance;
and S130, performing dependency structure analysis on the text T, and performing serialization processing on the analyzed structure to obtain a coding sequence.
Further, in step S130, a Stanford dependency analyzer is used to analyze the text content of the text T and generate a dependency tree structure; binary-tree serialization is then performed on the dependency tree structure to obtain the coding sequence. This structural analysis of the text overcomes the limitation of traditional methods that can only encode text content sequentially, represents the text in better accordance with its real structure, and better meets actual requirements.
FIG. 2 is a flow chart of obtaining the text dependency structure with the dependency analyzer in the self-media data text representation method based on the recursive variational self-coding model of the present invention. The method of the present invention is further illustrated below with reference to FIG. 2 and a specific example.
FIG. 2 shows the process of text structure analysis by the dependency analyzer for the input self-media text content "My cat words extracting and hashing". After the input self-media text data passes through the dependency analyzer, a dependency tree structure of the text is generated: the word "keys" in the text connects two parts, "My cat" and "marking fish and hash"; the adverb "also" modifies the verb "keys"; the phrase "My cat" is composed of the words "My" and "cat"; and "marking fish and hash" can be further divided into two parts, "marking" and "hash", which are joined by the word "and" into a parallel structure. Through the dependency analysis tool, the structure of the self-media text can be explicitly represented using knowledge from external resources, and the text is encoded through this explicit structural representation. The structure intuitively describes the dependency relationships among the words and indicates their syntactic collocations; since these collocations are associated with semantics, the contexts expressed by the coding are more coherent.
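To make the serialization step concrete, the following is a minimal Python sketch, not the patent's exact procedure, that turns the head-index output of a dependency parser into a bottom-up merge order that a binary recursive encoder could consume; the 1-based CoNLL-style head convention, the post-order traversal, and the toy sentence are all illustrative assumptions.

    from collections import defaultdict

    def binarize_dependency_tree(tokens, heads):
        """Serialize a dependency parse into a bottom-up merge order.

        tokens: list of words in the sentence.
        heads:  1-based index of each word's head, 0 for the root
                (the convention used by CoNLL-style parser output).
        Returns a list of (dependent_index, head_index) pairs; applying the
        merges in order combines every word into a single root node, which
        is one way to feed a binary recursive encoder.
        """
        children = defaultdict(list)
        root = None
        for i, h in enumerate(heads, start=1):
            if h == 0:
                root = i
            else:
                children[h].append(i)

        merge_order = []

        def visit(node):
            # Post-order: each subtree is fully merged before its head
            # combines with the head's own parent.
            for child in children[node]:
                visit(child)
                merge_order.append((child, node))
            return node

        visit(root)
        return merge_order

    if __name__ == "__main__":
        # Toy parse of "My cat also likes fish"; the head indices here are
        # illustrative only, not the output of any particular parser.
        tokens = ["My", "cat", "also", "likes", "fish"]
        heads = [2, 4, 4, 0, 4]          # "likes" is the root
        print(binarize_dependency_tree(tokens, heads))
        # [(1, 2), (2, 4), (3, 4), (5, 4)]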
Further, in step S200, a recursive neural network coding model is used to encode the coding sequence, and the word vectors used in encoding include the initialization vectors and/or pre-trained word vectors, so that empirical knowledge can be introduced, reducing the amount of coding computation and improving coding efficiency.
Specifically, encoding the coding sequence with a recursive neural network coding model in step S200 and generating the fixed-dimension text vector comprises the following steps:
S210, selecting two child nodes c1 and c2, and generating a first parent node p1 from c1 and c2;
S220, combining the generated parent node p1 with a word in the coding sequence as a new pair of child nodes to generate a second parent node p2;
S230, recursively encoding as in step S220, each time generating a new parent node from one parent node and one word in the coding sequence, until all the words in the coding sequence have been encoded; wherein,
during encoding, the encoding weight We is shared at each encoding step, so that the text code generated by the encoding is represented as a fixed-dimension vector.
FIG. 3 shows the process of encoding a representation of self-media text content, taking as an example a recursive neural network encoding the input sequence x = (w1, w2, …, w4). The coding structure first concatenates the input word vectors w1 and w2 into a child-node vector [c1; c2] of dimension 2n, i.e. (w1, w2) = (c1, c2). Using the formula p = f(We[c1; c2] + be), the parent node p1 is computed as p1 = f(We[w1; w2] + be). Then w3 and p1 are combined into a new [c1; c2], i.e. (c1, c2) = (p1, w3), and the same formula p = f(We[c1; c2] + be) yields the parent node p2 = f(We[p1; w3] + be). The parent node p3 is then computed as p3 = f(We[p2; w4] + be), and the recursion continues in this way until all the words in the coding sequence have been encoded. Since the recursive coding model represents the text through binary combinations, the text must first be represented as a binary structure in some manner; the dependency structure analysis performed on the text in step S130 is precisely the process of converting the sequential structure of the text into a hierarchical structure, which extends the applicability of the model of the method of the present invention.
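A minimal NumPy sketch of this recursive encoding step follows; the tanh nonlinearity chosen for f, the 100-dimensional word vectors, and the random initialization are assumptions made only for illustration, while the shared weight We and bias be follow the formula p = f(We[c1; c2] + be) described above.

    import numpy as np

    n = 100                                        # word-vector dimensionality (assumed)
    rng = np.random.default_rng(0)
    W_e = rng.normal(scale=0.1, size=(n, 2 * n))   # shared encoding weight We
    b_e = np.zeros(n)                              # shared encoding bias be

    def encode_pair(c1, c2):
        """Parent p = f(We [c1; c2] + be), with f = tanh."""
        return np.tanh(W_e @ np.concatenate([c1, c2]) + b_e)

    def encode_sequence(word_vectors):
        """Left-to-right recursive encoding of a serialized sequence:
        each new parent is combined with the next word until a single
        fixed-dimension text vector remains."""
        parent = encode_pair(word_vectors[0], word_vectors[1])
        for w in word_vectors[2:]:
            parent = encode_pair(parent, w)
        return parent

    words = [rng.normal(size=n) for _ in range(4)]   # w1..w4 (random stand-ins)
    text_vector = encode_sequence(words)
    print(text_vector.shape)                          # (100,)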
Further, a mean vector and a variance vector are generated through identity mapping in step S300.
FIGS. 4 and 5 show the process of variational inference on the resulting coded representation. The generated potential coding representation z must obey the distribution N(μ, σ), where μ is the generated mean vector and σ is the generated variance vector; the process of generating the mean vector and the variance vector is shown in FIG. 4. As shown in FIG. 5, the potential coding representation is generated as z = μ + ε·σ, where ε ~ N(0, I). The variable used to generate the potential coding representation z is sampled from the standard normal distribution, and the distribution of this variable is used for the divergence calculation in model training; the variable is multiplied by the variance vector, and the resulting product is then summed with the mean vector to obtain the potential coding representation z. In other words, FIGS. 4 and 5 describe the reparameterisation process of the variationally inferred coded representation: because the generated coded representation z obeys the distribution N(μ, σ), the resulting code describes a region rather than a single point, and thus better describes the distribution of the data.
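The reparameterisation of FIGS. 4 and 5 can be sketched as follows; the use of a log-variance vector and the closed-form divergence against N(0, I) follow the standard variational auto-encoder formulation and are assumptions about the exact parameterisation rather than details stated in the patent.

    import numpy as np

    rng = np.random.default_rng(0)

    def reparameterize(mu, log_var):
        """z = mu + eps * sigma, with eps sampled from N(0, I)."""
        eps = rng.standard_normal(mu.shape)
        return mu + eps * np.exp(0.5 * log_var)

    def kl_divergence(mu, log_var):
        """KL( N(mu, sigma^2) || N(0, I) ), summed over dimensions."""
        return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

    mu = rng.normal(size=50)          # mean vector derived from the text vector
    log_var = rng.normal(size=50)     # log-variance vector derived from the text vector
    z = reparameterize(mu, log_var)
    print(z.shape, kl_divergence(mu, log_var))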
Specifically, the decoding process of the potential coding representation z in step S400 comprises the following steps: S410, generating, on the basis of the potential coding representation z, an input vector x whose dimension is twice that of the potential coding representation z, wherein one part of the input vector x is a child node c and the other part is a parent node p for decoding; S420, continuing to decode the parent node p to obtain a new child node c1' and a new node p1', wherein p1' is the new parent node for decoding; and S430, recursively decoding as in step S420, wherein a newly generated child node is used as the parent node for the next decoding step, until a decoded sequence with the same length as the coding sequence is generated.
FIG. 6 is a schematic structural diagram of the recursive variational self-coding model of the self-media data text representation method based on the recursive variational self-coding model of the present invention. As can be seen from the figure, after obtaining the potential coding representation z, the method of the present invention converts the generated potential coding representation z into an input representation for decoding. For example, if the word vectors of the self-media text content are 100-dimensional and the generated coding representation z is 50-dimensional, z is converted into a 100-dimensional vector representation by a neural network. After this conversion, the coded representation p3' used to generate the child nodes is obtained. As before, the encoding of four input words is taken as an example: first, p3' is passed through the decoding matrix Wd to generate a 200-dimensional vector, which is divided into two parts, the first 100 dimensions being the decoded w4' and the last 100 dimensions being the parent node p2' to be decoded next; the parent node p2' then regenerates w3' and the parent node p1'; and the parent node p1' generates w2' and w1', completing the decoding process of the model. The coding loss between the decoded sequence and the coding sequence is calculated by the Euclidean distance, the parameters of the recursive variational self-coding model are updated by the back-propagation algorithm, and the model is thereby optimized. Through the encoding and decoding of the model, the input text can be encoded and reconstructed, realizing an unsupervised representation of self-media data text content; owing to this unsupervised characteristic, the method can better adapt to the coded representation of self-media data.
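A companion sketch of the decoding side and the Euclidean coding loss is given below; the dimensions, the tanh nonlinearity, and the random stand-in vectors are illustrative assumptions, and the back-propagation parameter update is omitted for brevity.

    import numpy as np

    n = 100
    rng = np.random.default_rng(1)
    W_d = rng.normal(scale=0.1, size=(2 * n, n))   # shared decoding weight Wd
    b_d = np.zeros(2 * n)

    def decode_step(parent):
        """Expand one vector into [reconstructed word ; next parent]."""
        out = np.tanh(W_d @ parent + b_d)
        return out[:n], out[n:]                    # (child, next parent)

    def decode(first_parent, length):
        """Recursively decode until `length` word vectors are produced.
        Each step emits one reconstructed word and the next parent; the
        very last split yields the final two words (p1' -> w2', w1')."""
        words, parent = [], first_parent
        for _ in range(length - 2):
            child, parent = decode_step(parent)
            words.append(child)
        w_second, w_first = decode_step(parent)
        words.extend([w_second, w_first])
        return words[::-1]                         # ordered w1' .. wN'

    def coding_loss(original, decoded):
        """Euclidean reconstruction loss between the two sequences."""
        return sum(np.linalg.norm(o - d) for o, d in zip(original, decoded))

    original = [rng.normal(size=n) for _ in range(4)]   # stand-ins for w1..w4
    decoded = decode(rng.normal(size=n), length=4)      # stand-in for p3'
    print(coding_loss(original, decoded))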
The method of the invention encodes the input self-media data text with the recursive neural network coding model, computes the potential coding representation z, and then decodes z with the recursive neural network decoding model; by calculating the coding loss and the divergence between the potential coding representation z and the standard normal distribution, and using the coding loss and the divergence to update the parameters of the recursive variational self-coding model, the coding performance of the model is improved. In addition, the model generates different potential coding representations z for different input texts, thereby realizing accurate coded representation of different input texts.
The foregoing describes an embodiment of a method for text representation of self-media data based on a recursive variational self-coding model according to the present invention. The specific features of the text representation method for self-media data based on recursive variational self-coding model according to the present invention can be specifically designed according to the role of the above-disclosed features, and the design can be realized by those skilled in the art. Moreover, the technical features disclosed above are not limited to the combinations with other features disclosed, and other combinations between the technical features can be performed by those skilled in the art according to the purpose of the present invention, so as to achieve the purpose of the present invention.

Claims (6)

1. A method for text representation of self-media data based on a recursive variational self-coding model, wherein the method comprises the steps of:
S100, preprocessing input corpus texts to obtain a coding sequence;
S200, encoding the coding sequence with a recursive neural network coding model to generate a text vector of fixed dimensionality, wherein the word vectors adopted during encoding comprise initialization vectors and/or pre-trained word vectors;
S300, generating a mean vector and a variance vector from the text vector of the fixed dimension, then drawing a sample from a standard normal distribution, and generating a potential coding representation z from the mean vector, the variance vector, and the sample by a variational inference method, wherein the potential coding representation z obeys the distribution N(μ, σ), where μ represents the generated mean vector and σ represents the generated variance vector;
S400, decoding the potential coding representation z with a recursive neural network decoding model to obtain a decoded sequence, calculating the coding loss between the coding sequence and the decoded sequence and the divergence between the potential coding representation z and the standard normal distribution, and updating the parameters of the recursive variational self-coding model using the coding loss and the divergence;
the preprocessing of the corpus text in step S100 includes the following steps:
step S110, filtering each input corpus text, removing tags, labels, and links from the corpus text, and performing word segmentation on the content of the corpus text to generate a text T;
step S120, counting words in the corpus text, generating a dictionary of words in the corpus text, and performing vector initialization on the words in each corpus text, wherein the initialized vector dimension of the words in each corpus text is set according to experimental performance;
s130, performing dependency structure analysis on the text T, and performing serialization processing on the analyzed structure to obtain a coding sequence;
in step S200, encoding the coding sequence with a recursive neural network coding model and generating a fixed-dimension text vector comprises the following steps:
S210, selecting two child nodes c1 and c2, and generating a first parent node p1 from said c1 and said c2;
S220, combining the generated parent node p1 with a word in said coding sequence as a new pair of child nodes to generate a second parent node p2;
S230, recursively encoding as in step S220, each time generating a new parent node from one parent node and one word in the coding sequence, until all the words in the coding sequence have been encoded; wherein,
during encoding, the encoding weight We is shared at each encoding step, so that the text code generated by the encoding is represented as the vector of the fixed dimension;
in step S300, the mean vector and the variance vector are generated through identity mapping.
2. The method for text representation of self-media data based on a recursive variational self-coding model according to claim 1, wherein step S130 further comprises:
adopting a Stanford dependency analyzer to analyze the text content of the text T to generate a dependency tree structure;
and carrying out binary tree serialization processing on the dependency tree structure to obtain the coding sequence.
3. The method of text representation of self-media data based on a recursive variational self-coding model according to claim 1, wherein step S300 comprises:
collecting variables used for generating the potential coding representation z in a standard normal distribution, wherein the distribution of the variables is used for divergence calculation in model training;
and multiplying the variable by the variance vector, then summing the resulting product with the mean vector to obtain the potential coding representation z.
4. The method of textual representation of self-media data based on a recursive variational self-coding model according to claim 1, wherein the decoding process of the potential coded representation z in step S400 comprises the steps of:
s410, generating an input vector x with the dimension being twice as large as the potential encoding representation z on the basis of the potential encoding representation z, wherein one part of the input vector x is a child node c, and the other part of the input vector x is a parent node p for decoding;
S420, continuing to decode the parent node p to obtain a new child node c1' and a new node p1', wherein said p1' is the new parent node for decoding;
and S430, recursively decoding in the step S420, wherein a new child node is used as a parent node for next decoding, and decoding is carried out until the decoding sequence with the same length as the coding sequence is generated.
5. The method of textual representation of self-media data based on a recursive variational self-coding model according to claim 1, wherein the coding loss between the decoded sequence and the coding sequence is calculated by the Euclidean distance in step S400.
6. The method of textual representation of self-media data based on a recursive variational self-coding model according to claim 1, wherein the parameters of the recursive variational self-coding model are updated by a back-propagation algorithm in step S400.
CN201711417351.2A 2017-12-25 2017-12-25 Self-media data text representation method based on recursive variation self-coding model Active CN108363685B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711417351.2A CN108363685B (en) 2017-12-25 2017-12-25 Self-media data text representation method based on recursive variation self-coding model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711417351.2A CN108363685B (en) 2017-12-25 2017-12-25 Self-media data text representation method based on recursive variation self-coding model

Publications (2)

Publication Number Publication Date
CN108363685A CN108363685A (en) 2018-08-03
CN108363685B true CN108363685B (en) 2021-09-14

Family

ID=63010041

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711417351.2A Active CN108363685B (en) 2017-12-25 2017-12-25 Self-media data text representation method based on recursive variation self-coding model

Country Status (1)

Country Link
CN (1) CN108363685B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109213975B (en) * 2018-08-23 2022-04-12 重庆邮电大学 Twitter text representation method based on character level convolution variation self-coding
CN109886388B (en) * 2019-01-09 2024-03-22 平安科技(深圳)有限公司 Training sample data expansion method and device based on variation self-encoder
CN111581916B (en) * 2020-05-15 2022-03-01 北京字节跳动网络技术有限公司 Text generation method and device, electronic equipment and computer readable medium
CN113379068B (en) * 2021-06-29 2023-08-08 哈尔滨工业大学 Deep learning architecture searching method based on structured data


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1510931A1 (en) * 2003-08-28 2005-03-02 DVZ-Systemhaus GmbH Process for platform-independent archiving and indexing of digital media assets
CN101645786A (en) * 2009-06-24 2010-02-10 中国联合网络通信集团有限公司 Method for issuing blog content and business processing device thereof
US9053431B1 (en) * 2010-10-26 2015-06-09 Michael Lamport Commons Intelligent control with hierarchical stacked neural networks
CN105469065A (en) * 2015-12-07 2016-04-06 中国科学院自动化研究所 Recurrent neural network-based discrete emotion recognition method
CN106844327A (en) * 2015-12-07 2017-06-13 科大讯飞股份有限公司 Text code method and system
CN107220311A (en) * 2017-05-12 2017-09-29 北京理工大学 A kind of document representation method of utilization locally embedding topic modeling

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
【Learning Notes】Variational Auto-Encoder (VAE); Anonymous; https://blog.csdn.net/jackytintin/article/details/53641885; 14 December 2016; pp. 1-14 *
Auto-Encoding Variational Bayes; Diederik P. Kingma et al.; arXiv:1312.6114v10 [stat.ML]; 1 May 2014; pp. 1-14 *

Also Published As

Publication number Publication date
CN108363685A (en) 2018-08-03

Similar Documents

Publication Publication Date Title
CN107291693B (en) Semantic calculation method for improved word vector model
CN109359297B (en) Relationship extraction method and system
CN110321417B (en) Dialog generation method, system, readable storage medium and computer equipment
CN106502985B (en) neural network modeling method and device for generating titles
CN106202010B (en) Method and apparatus based on deep neural network building Law Text syntax tree
CN108363685B (en) Self-media data text representation method based on recursive variation self-coding model
CN113127624B (en) Question-answer model training method and device
CN107203511A (en) A kind of network text name entity recognition method based on neutral net probability disambiguation
CN109213975B (en) Twitter text representation method based on character level convolution variation self-coding
CN109582952B (en) Poetry generation method, poetry generation device, computer equipment and medium
CN113254610B (en) Multi-round conversation generation method for patent consultation
CN111966827B (en) Dialogue emotion analysis method based on heterogeneous bipartite graph
CN108197294A (en) A kind of text automatic generation method based on deep learning
CN110032638B (en) Encoder-decoder-based generative abstract extraction method
CN113435211B (en) Text implicit emotion analysis method combined with external knowledge
CN114528898A (en) Scene graph modification based on natural language commands
CN116661805B (en) Code representation generation method and device, storage medium and electronic equipment
CN114528398A (en) Emotion prediction method and system based on interactive double-graph convolutional network
CN114358201A (en) Text-based emotion classification method and device, computer equipment and storage medium
CN111540470B (en) Social network depression tendency detection model based on BERT transfer learning and training method thereof
CN114327483A (en) Graph tensor neural network model establishing method and source code semantic identification method
CN107562729B (en) Party building text representation method based on neural network and theme enhancement
CN117094325B (en) Named entity identification method in rice pest field
CN114519353B (en) Model training method, emotion message generation method and device, equipment and medium
CN113326695B (en) Emotion polarity analysis method based on transfer learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant