CN108363685B - Self-media data text representation method based on recursive variation self-coding model - Google Patents

Self-media data text representation method based on recursive variation self-coding model

Info

Publication number
CN108363685B
Authority
CN
China
Prior art keywords
coding, text, vector, self, representation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711417351.2A
Other languages
Chinese (zh)
Other versions
CN108363685A (en)
Inventor
王家彬
黄江平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Digital Television Technology Center Of Beijing Peony Electronics Group Co ltd
Original Assignee
Digital Television Technology Center Of Beijing Peony Electronics Group Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Digital Television Technology Center Of Beijing Peony Electronics Group Co ltd filed Critical Digital Television Technology Center Of Beijing Peony Electronics Group Co ltd
Priority to CN201711417351.2A priority Critical patent/CN108363685B/en
Publication of CN108363685A publication Critical patent/CN108363685A/en
Application granted granted Critical
Publication of CN108363685B publication Critical patent/CN108363685B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 Semantic analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention provides a self-media data text representation method based on a recursive variational self-coding model, which comprises the following steps: preprocessing an input corpus text and encoding it with a recursive neural network coding model to generate a text vector of fixed dimensionality; generating a mean vector and a variance vector from the fixed-dimension text vector, sampling from the standard normal distribution, and generating a potential coding representation z from the mean vector, the variance vector, and the sample by a variational inference method; then decoding the potential coding representation z with a recursive neural network decoding model to obtain a decoded sequence, calculating the coding loss between the coding sequence and the decoded sequence and the divergence between the potential coding representation z and the standard normal distribution, and updating the parameters of the recursive variational self-coding model using the coding loss and the divergence. The method has high coding performance, adapts well to the coded representation of self-media data, and describes the distribution of the data while fitting its content.

Description

Self-media data text representation method based on recursive variation self-coding model
Technical Field
The invention relates to the technical field of deep learning and self-media data text content analysis, in particular to a self-media data text representation method based on a recursive variational self-coding model.
Background
With the development of social media in recent years, users generate a large amount of short self-media text content. Such text lacks effective context information and is therefore difficult to represent with the traditional bag-of-words model.
Deep learning originates from research on artificial neural networks; a multi-layer network comprising multiple hidden layers is a deep learning structure. Deep learning combines low-level features to form more abstract high-level representations of attribute classes or features, thereby discovering a distributed feature representation of the data. The concept of deep learning was proposed by Hinton et al. in 2006, who introduced an unsupervised greedy layer-by-layer training algorithm based on the Deep Belief Network (DBN); multilayer auto-encoder deep structures were subsequently proposed to solve the optimization problems associated with deep structures. The convolutional neural network proposed by LeCun et al. was the first true multi-layer structure learning algorithm, which uses spatial relative relationships to reduce the number of parameters and improve training performance. Deep learning treats the computation involved in producing an output from an input as a flow graph, in which each node represents a basic computation and a computed value, and the result of each computation is applied to the values of the children of that node. Deep learning simulates the human cognitive process, which proceeds layer by layer as a step-by-step process of abstraction: simple concepts are learned first and then used to express more abstract ideas and concepts. Deep learning has been successfully applied to fields such as computer vision and speech recognition; although the application of deep learning methods to natural language processing has received great attention in recent years, most of these methods focus on model design and lack the introduction of external knowledge.
Regarding text content representation techniques, most traditional representation learning for self-media text content is based on the bag-of-words model and adopts word representations such as one-hot encoding, which inevitably causes a serious "lexical gap" phenomenon between words, that is, words with similar semantics are mutually orthogonal in their vector representations. While these methods are effective for representing traditional text, applying them to self-media text presents serious data sparseness problems. Traditional methods usually extract features manually for representation learning of media text content, but this relies on manual experience, and for self-media data in some professional fields, corresponding experts are required to construct a knowledge base before the data texts can be well represented.
In the prior art, various data text analysis methods exist, but most of them analyze the text content of self-media data only in common or in a few special fields, and they generally adopt simple text coding that merely fits the data and lacks any description of the data distribution, which leads to problems such as inaccurate text representation.
Disclosure of Invention
The invention aims to provide a self-media data text representation method based on a recursive variational self-coding model, which has high coding performance, can better adapt to the coded representation of self-media data, and can describe the distribution of the data while fitting the content of the data.
The invention provides a self-media data text representation method based on a recursive variational self-coding model, wherein the method comprises the following steps:
S100, preprocessing input corpus texts to obtain a coding sequence;
S200, encoding the coding sequence with a recursive neural network coding model to generate a text vector of fixed dimensionality;
S300, generating a mean vector and a variance vector from the fixed-dimension text vector, then drawing a sample from the standard normal distribution, and generating a potential coding representation z from the mean vector, the variance vector, and the sample by a variational inference method;
S400, decoding the potential coding representation z with a recursive neural network decoding model to obtain a decoded sequence, calculating the coding loss between the coding sequence and the decoded sequence and the divergence between the potential coding representation z and the standard normal distribution, and updating the parameters of the recursive variational self-coding model using the coding loss and the divergence.
Preferably, preprocessing the input corpus text in step S100 comprises the following steps:
step S110, filtering each input corpus text, removing tags, labels, and links from the corpus text, and performing word segmentation on the content of the corpus text to generate a text T;
step S120, counting words in the corpus text, generating a dictionary of words in the corpus text, and performing vector initialization on the words in each corpus text, wherein the initialized vector dimension of the words in each corpus text is set according to experimental performance;
and S130, performing dependency structure analysis on the text T, and performing serialization processing on the analyzed structure to obtain a coding sequence.
Preferably, step S130 further includes:
adopting a Stanford dependency analyzer to analyze the text content of the text T to generate a dependency tree structure;
and carrying out binary tree serialization processing on the dependency tree structure to obtain the coding sequence.
Preferably, in step S200, the coding sequence is encoded with a recursive neural network coding model, and the word vectors used in encoding include the initialization vectors and/or pre-trained word vectors.
Preferably, encoding the coding sequence with a recursive neural network coding model in step S200 and generating the fixed-dimension text vector comprises the following steps:
S210, selecting two child nodes c1 and c2, and generating a first parent node p1 from said c1 and said c2;
S220, combining the generated parent node p1 with a word in said coding sequence as a new pair of child nodes to generate a second parent node p2;
S230, recursively encoding as in step S220, each time generating a new parent node from one parent node and one word in the coding sequence, until all the words in the coding sequence have been encoded; wherein,
during encoding, the encoding weight We is shared at each encoding step, so that the text code generated by the encoding is represented as the vector of the fixed dimension.
Preferably, the mean vector and the variance vector are generated by identity mapping in step S300.
Preferably, step S300 includes:
sampling, from the standard normal distribution, the variable used to generate the potential coding representation z, the distribution of this variable being used for the divergence calculation in model training;
and multiplying the variable by the variance vector, then summing the resulting product with the mean vector to obtain the potential coding representation z.
Preferably, the decoding process of the potential coded representation z in step S400 includes the following steps:
S410, generating, on the basis of the potential coding representation z, an input vector x whose dimension is twice that of the potential coding representation z, wherein one part of the input vector x is a child node c and the other part is a parent node p for decoding;
S420, continuing to decode the parent node p to obtain a new child node c1' and a new node p1', wherein said p1' is the new parent node for decoding;
and S430, recursively decoding as in step S420, wherein a newly generated child node is used as the parent node for the next decoding step, until the decoded sequence with the same length as the coding sequence is generated.
Preferably, in step S400, the coding loss between the decoded sequence and the coding sequence is calculated by the Euclidean distance.
Preferably, in step S400, the parameters of the recursive variational self-coding model are updated by a back-propagation algorithm.
The invention has the following advantages and beneficial effects:
1. In terms of text content representation, the self-media data text representation method based on the recursive variational self-coding model overcomes the representation problems caused by the lack of context in self-media text content; through existing text processing tools, the method introduces empirical knowledge into the representation of text content and improves the performance of text representation.
2. The method adopts a recursive neural network coding model, which can encode not only sequential text content but also text content with a tree structure. This effectively avoids the limitation of traditional methods that can only encode text content sequentially, represents the text in better accordance with its real structure, and makes the structure of the coded representation better match actual requirements.
3. The method uses variational inference, which better reflects how a deep learning method approximates the real distribution of the data.
4. The method adopts an expanded recursive neural network decoding model that can reconstruct the input text content, measures the coding performance of the model by means such as Euclidean distance calculation, and optimizes the model's representation of self-media text content by updating the model parameters.
5. The method introduces the standard normal distribution and computes the mean vector and the variance vector of the input text to obtain the potential coding representation z. This representation contains knowledge such as word-vector knowledge and text structure, satisfies a certain distribution, allows the vector dimension to be set as required, contains more feature information than a traditional recursive coding vector, and facilitates the representation and computation of the text.
6. The self-media data text representation method based on the recursive variational self-coding model can update the parameters of the recursive variational self-coding model by using the coding loss and the divergence, further optimize the model, better fit training data and improve the coding performance.
Drawings
The drawings used in the present application will be briefly described below, and it should be apparent that they are merely illustrative of the concepts of the present invention.
FIG. 1 is a flow chart of a method for text representation of self-media data based on a recursive variational self-coding model according to the present invention;
FIG. 2 is a flow chart of obtaining the text dependency structure with the dependency analyzer in the self-media data text representation method based on the recursive variational self-coding model of the present invention;
FIG. 3 is a schematic structural diagram of encoding with the recursive neural network coding model in the self-media data text representation method based on the recursive variational self-coding model of the present invention;
FIG. 4 is a flow chart of generating the mean vector and the variance vector in the self-media data text representation method based on the recursive variational self-coding model of the present invention;
FIG. 5 is a flow chart of sampling variables from the normal distribution and generating the potential coding representation in the self-media data text representation method based on the recursive variational self-coding model of the present invention;
FIG. 6 is a schematic structural diagram of the recursive variational self-coding model in the self-media data text representation method based on the recursive variational self-coding model of the present invention.
Detailed Description
Hereinafter, an embodiment of a text representation method of self-media data based on a recursive variational self-coding model of the present invention will be described with reference to the accompanying drawings.
The examples described herein are specific embodiments of the present invention, are intended to be illustrative and exemplary in nature, and are not to be construed as limiting the scope of the invention. In addition to the embodiments described herein, those skilled in the art will be able to employ other technical solutions which are obvious based on the disclosure of the claims and the specification of the present application, and these technical solutions include any obvious replacement or modification of the embodiments described herein.
The drawings in the present specification are schematic views to assist in explaining the concept of the present invention, and schematically show the shapes of respective portions and their mutual relationships. It is noted that the drawings are not necessarily to the same scale so as to clearly illustrate the structure of portions of embodiments of the present invention. The same or similar reference numerals are used to denote the same or similar parts.
Referring to fig. 1, the present invention provides a method for text representation of self-media data based on a recursive variational self-coding model, wherein the method comprises the following steps:
S100, preprocessing input corpus texts to obtain a coding sequence;
S200, encoding the coding sequence with a recursive neural network coding model to generate a text vector of fixed dimensionality;
S300, generating a mean vector and a variance vector from the fixed-dimension text vector, then drawing a sample from the standard normal distribution, and generating a potential coding representation z from the mean vector, the variance vector, and the sample by a variational inference method;
and S400, decoding the potential coding representation z with a recursive neural network decoding model to obtain a decoded sequence, calculating the coding loss between the coding sequence and the decoded sequence and the divergence between the potential coding representation z and the standard normal distribution, and updating the parameters of the recursive variational self-coding model using the coding loss and the divergence.
The potential coding representation z calculated by the method of the invention contains knowledge such as word-vector knowledge and text structure, satisfies a certain distribution, allows the vector dimension to be set according to actual needs, contains more feature information than a traditional recursive coding vector, facilitates the representation and computation of the text, reduces the coding dimension, and improves the calculation efficiency. In addition, the method of the invention calculates the coding loss between the coding sequence and the decoded sequence and the divergence between the potential coding representation z and the standard normal distribution, and uses the coding loss and the divergence to automatically update the parameters of the recursive variational self-coding model, which effectively improves the coding performance of the model on different input texts.
Further, preprocessing the input corpus text in step S100 of the present invention comprises the following steps:
step S110, filtering each input corpus text, removing tags, labels, and links from the corpus text, and performing word segmentation on the content of the corpus text to generate a text T;
step S120, counting words in the corpus text, generating a dictionary of words in the corpus text, and performing vector initialization on the words in each corpus text, wherein the initialized vector dimension of the words in each corpus text is set according to experimental performance;
and S130, performing dependency structure analysis on the text T, and performing serialization processing on the analyzed structure to obtain a coding sequence.
Further, in step S130, a Stanford dependency analyzer is used to analyze the text content of the text T and generate a dependency tree structure; binary-tree serialization is then performed on the dependency tree structure to obtain the coding sequence. This structural analysis of the text overcomes the limitation of traditional methods that can only encode text content sequentially, represents the text in better accordance with its real structure, and better meets actual requirements.
FIG. 2 is a flow chart of obtaining the text dependency structure with the dependency analyzer in the self-media data text representation method based on the recursive variational self-coding model of the present invention. The method of the present invention is further illustrated below with reference to FIG. 2 and a specific example.
FIG. 2 shows the process of text structure analysis by the dependency analyzer for the input self-media text content "My cat words extracting and hashing". After the input self-media text data passes through the dependency analyzer, a dependency tree structure of the text is generated: the word "keys" in the text connects two parts, "My cat" and "marking fish and hash"; the adverb "also" modifies the verb "keys"; the phrase "My cat" is composed of the words "My" and "cat"; and "marking fish and hash" can be further divided into two parts, "marking" and "hash", which are joined by the word "and" into a parallel structure. Through the dependency analysis tool, the structure of the self-media text can be explicitly represented using knowledge from external resources, and the text is encoded through this explicit structural representation. The structure intuitively describes the dependency relationships among the words and indicates their syntactic collocations; since these collocations are associated with semantics, the contexts expressed by the coding are more coherent.
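To make the serialization step concrete, the following is a minimal Python sketch, not the patent's exact procedure, that turns the head-index output of a dependency parser into a bottom-up merge order that a binary recursive encoder could consume; the 1-based CoNLL-style head convention, the post-order traversal, and the toy sentence are all illustrative assumptions.

    from collections import defaultdict

    def binarize_dependency_tree(tokens, heads):
        """Serialize a dependency parse into a bottom-up merge order.

        tokens: list of words in the sentence.
        heads:  1-based index of each word's head, 0 for the root
                (the convention used by CoNLL-style parser output).
        Returns a list of (dependent_index, head_index) pairs; applying the
        merges in order combines every word into a single root node, which
        is one way to feed a binary recursive encoder.
        """
        children = defaultdict(list)
        root = None
        for i, h in enumerate(heads, start=1):
            if h == 0:
                root = i
            else:
                children[h].append(i)

        merge_order = []

        def visit(node):
            # Post-order: each subtree is fully merged before its head
            # combines with the head's own parent.
            for child in children[node]:
                visit(child)
                merge_order.append((child, node))
            return node

        visit(root)
        return merge_order

    if __name__ == "__main__":
        # Toy parse of "My cat also likes fish"; the head indices here are
        # illustrative only, not the output of any particular parser.
        tokens = ["My", "cat", "also", "likes", "fish"]
        heads = [2, 4, 4, 0, 4]          # "likes" is the root
        print(binarize_dependency_tree(tokens, heads))
        # [(1, 2), (2, 4), (3, 4), (5, 4)]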
Further, in step S200, a recursive neural network coding model is used to encode the coding sequence, and the word vectors used in encoding include the initialization vectors and/or pre-trained word vectors, so that empirical knowledge can be introduced, reducing the amount of coding computation and improving coding efficiency.
Specifically, encoding the coding sequence with a recursive neural network coding model in step S200 and generating the fixed-dimension text vector comprises the following steps:
S210, selecting two child nodes c1 and c2, and generating a first parent node p1 from c1 and c2;
S220, combining the generated parent node p1 with a word in the coding sequence as a new pair of child nodes to generate a second parent node p2;
S230, recursively encoding as in step S220, each time generating a new parent node from one parent node and one word in the coding sequence, until all the words in the coding sequence have been encoded; wherein,
during encoding, the encoding weight We is shared at each encoding step, so that the text code generated by the encoding is represented as a fixed-dimension vector.
FIG. 3 shows the process of encoding a representation of self-media text content, taking as an example a recursive neural network encoding the input sequence x = (w1, w2, …, w4). The coding structure first concatenates the input word vectors w1 and w2 into a child-node vector [c1; c2] of dimension 2n, i.e. (w1, w2) = (c1, c2). Using the formula p = f(We[c1; c2] + be), the parent node p1 is computed as p1 = f(We[w1; w2] + be). Then w3 and p1 are combined into a new [c1; c2], i.e. (c1, c2) = (p1, w3), and the same formula p = f(We[c1; c2] + be) yields the parent node p2 = f(We[p1; w3] + be). The parent node p3 is then computed as p3 = f(We[p2; w4] + be), and the recursion continues in this way until all the words in the coding sequence have been encoded. Since the recursive coding model represents the text through binary combinations, the text must first be represented as a binary structure in some manner; the dependency structure analysis performed on the text in step S130 is precisely the process of converting the sequential structure of the text into a hierarchical structure, which extends the applicability of the model of the method of the present invention.
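A minimal NumPy sketch of this recursive encoding step follows; the tanh nonlinearity chosen for f, the 100-dimensional word vectors, and the random initialization are assumptions made only for illustration, while the shared weight We and bias be follow the formula p = f(We[c1; c2] + be) described above.

    import numpy as np

    n = 100                                        # word-vector dimensionality (assumed)
    rng = np.random.default_rng(0)
    W_e = rng.normal(scale=0.1, size=(n, 2 * n))   # shared encoding weight We
    b_e = np.zeros(n)                              # shared encoding bias be

    def encode_pair(c1, c2):
        """Parent p = f(We [c1; c2] + be), with f = tanh."""
        return np.tanh(W_e @ np.concatenate([c1, c2]) + b_e)

    def encode_sequence(word_vectors):
        """Left-to-right recursive encoding of a serialized sequence:
        each new parent is combined with the next word until a single
        fixed-dimension text vector remains."""
        parent = encode_pair(word_vectors[0], word_vectors[1])
        for w in word_vectors[2:]:
            parent = encode_pair(parent, w)
        return parent

    words = [rng.normal(size=n) for _ in range(4)]   # w1..w4 (random stand-ins)
    text_vector = encode_sequence(words)
    print(text_vector.shape)                          # (100,)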
Further, a mean vector and a variance vector are generated through identity mapping in step S300.
FIGS. 4 and 5 show the process of variational inference on the resulting coded representation. The generated potential coding representation z must obey the distribution N(μ, σ), where μ is the generated mean vector and σ is the generated variance vector; the process of generating the mean vector and the variance vector is shown in FIG. 4. As shown in FIG. 5, the potential coding representation is generated as z = μ + ε·σ, where ε ~ N(0, I). The variable used to generate the potential coding representation z is sampled from the standard normal distribution, and the distribution of this variable is used for the divergence calculation in model training; the variable is multiplied by the variance vector, and the resulting product is then summed with the mean vector to obtain the potential coding representation z. In other words, FIGS. 4 and 5 describe the reparameterisation process of the variationally inferred coded representation: because the generated coded representation z obeys the distribution N(μ, σ), the resulting code describes a region rather than a single point, and thus better describes the distribution of the data.
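The reparameterisation of FIGS. 4 and 5 can be sketched as follows; the use of a log-variance vector and the closed-form divergence against N(0, I) follow the standard variational auto-encoder formulation and are assumptions about the exact parameterisation rather than details stated in the patent.

    import numpy as np

    rng = np.random.default_rng(0)

    def reparameterize(mu, log_var):
        """z = mu + eps * sigma, with eps sampled from N(0, I)."""
        eps = rng.standard_normal(mu.shape)
        return mu + eps * np.exp(0.5 * log_var)

    def kl_divergence(mu, log_var):
        """KL( N(mu, sigma^2) || N(0, I) ), summed over dimensions."""
        return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

    mu = rng.normal(size=50)          # mean vector derived from the text vector
    log_var = rng.normal(size=50)     # log-variance vector derived from the text vector
    z = reparameterize(mu, log_var)
    print(z.shape, kl_divergence(mu, log_var))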
Specifically, the decoding process of the potential coding representation z in step S400 comprises the following steps: S410, generating, on the basis of the potential coding representation z, an input vector x whose dimension is twice that of the potential coding representation z, wherein one part of the input vector x is a child node c and the other part is a parent node p for decoding; S420, continuing to decode the parent node p to obtain a new child node c1' and a new node p1', wherein p1' is the new parent node for decoding; and S430, recursively decoding as in step S420, wherein a newly generated child node is used as the parent node for the next decoding step, until a decoded sequence with the same length as the coding sequence is generated.
FIG. 6 is a schematic structural diagram of the recursive variational self-coding model of the self-media data text representation method based on the recursive variational self-coding model of the present invention. As can be seen from the figure, after obtaining the potential coding representation z, the method of the present invention converts the generated potential coding representation z into an input representation for decoding. For example, if the word vectors of the self-media text content are 100-dimensional and the generated coding representation z is 50-dimensional, z is converted into a 100-dimensional vector representation by a neural network. After this conversion, the coded representation p3' used to generate the child nodes is obtained. As before, the encoding of four input words is taken as an example: first, p3' is passed through the decoding matrix Wd to generate a 200-dimensional vector, which is divided into two parts, the first 100 dimensions being the decoded w4' and the last 100 dimensions being the parent node p2' to be decoded next; the parent node p2' then regenerates w3' and the parent node p1'; and the parent node p1' generates w2' and w1', completing the decoding process of the model. The coding loss between the decoded sequence and the coding sequence is calculated by the Euclidean distance, the parameters of the recursive variational self-coding model are updated by the back-propagation algorithm, and the model is thereby optimized. Through the encoding and decoding of the model, the input text can be encoded and reconstructed, realizing an unsupervised representation of self-media data text content; owing to this unsupervised characteristic, the method can better adapt to the coded representation of self-media data.
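A companion sketch of the decoding side and the Euclidean coding loss is given below; the dimensions, the tanh nonlinearity, and the random stand-in vectors are illustrative assumptions, and the back-propagation parameter update is omitted for brevity.

    import numpy as np

    n = 100
    rng = np.random.default_rng(1)
    W_d = rng.normal(scale=0.1, size=(2 * n, n))   # shared decoding weight Wd
    b_d = np.zeros(2 * n)

    def decode_step(parent):
        """Expand one vector into [reconstructed word ; next parent]."""
        out = np.tanh(W_d @ parent + b_d)
        return out[:n], out[n:]                    # (child, next parent)

    def decode(first_parent, length):
        """Recursively decode until `length` word vectors are produced.
        Each step emits one reconstructed word and the next parent; the
        very last split yields the final two words (p1' -> w2', w1')."""
        words, parent = [], first_parent
        for _ in range(length - 2):
            child, parent = decode_step(parent)
            words.append(child)
        w_second, w_first = decode_step(parent)
        words.extend([w_second, w_first])
        return words[::-1]                         # ordered w1' .. wN'

    def coding_loss(original, decoded):
        """Euclidean reconstruction loss between the two sequences."""
        return sum(np.linalg.norm(o - d) for o, d in zip(original, decoded))

    original = [rng.normal(size=n) for _ in range(4)]   # stand-ins for w1..w4
    decoded = decode(rng.normal(size=n), length=4)      # stand-in for p3'
    print(coding_loss(original, decoded))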
The method of the invention encodes the input self-media data text with the recursive neural network coding model, computes the potential coding representation z, and then decodes z with the recursive neural network decoding model; by calculating the coding loss and the divergence between the potential coding representation z and the standard normal distribution, and using the coding loss and the divergence to update the parameters of the recursive variational self-coding model, the coding performance of the model is improved. In addition, the model generates different potential coding representations z for different input texts, thereby realizing accurate coded representation of different input texts.
The foregoing describes an embodiment of a method for text representation of self-media data based on a recursive variational self-coding model according to the present invention. The specific features of the text representation method for self-media data based on recursive variational self-coding model according to the present invention can be specifically designed according to the role of the above-disclosed features, and the design can be realized by those skilled in the art. Moreover, the technical features disclosed above are not limited to the combinations with other features disclosed, and other combinations between the technical features can be performed by those skilled in the art according to the purpose of the present invention, so as to achieve the purpose of the present invention.

Claims (6)

1. A method for text representation of self-media data based on a recursive variational self-coding model, wherein the method comprises the steps of:
S100, preprocessing input corpus texts to obtain a coding sequence;
S200, encoding the coding sequence with a recursive neural network coding model to generate a text vector of fixed dimensionality, wherein the word vectors adopted during encoding comprise initialization vectors and/or pre-trained word vectors;
S300, generating a mean vector and a variance vector from the text vector of the fixed dimension, then drawing a sample from a standard normal distribution, and generating a potential coding representation z from the mean vector, the variance vector, and the sample by a variational inference method, wherein the potential coding representation z obeys the distribution N(μ, σ), where μ represents the generated mean vector and σ represents the generated variance vector;
S400, decoding the potential coding representation z with a recursive neural network decoding model to obtain a decoded sequence, calculating the coding loss between the coding sequence and the decoded sequence and the divergence between the potential coding representation z and the standard normal distribution, and updating the parameters of the recursive variational self-coding model using the coding loss and the divergence;
the preprocessing of the corpus text in step S100 includes the following steps:
step S110, filtering each input corpus text, removing tags, labels, and links from the corpus text, and performing word segmentation on the content of the corpus text to generate a text T;
step S120, counting words in the corpus text, generating a dictionary of words in the corpus text, and performing vector initialization on the words in each corpus text, wherein the initialized vector dimension of the words in each corpus text is set according to experimental performance;
s130, performing dependency structure analysis on the text T, and performing serialization processing on the analyzed structure to obtain a coding sequence;
in step S200, encoding the coding sequence with a recursive neural network coding model and generating a fixed-dimension text vector comprises the following steps:
S210, selecting two child nodes c1 and c2, and generating a first parent node p1 from said c1 and said c2;
S220, combining the generated parent node p1 with a word in said coding sequence as a new pair of child nodes to generate a second parent node p2;
S230, recursively encoding as in step S220, each time generating a new parent node from one parent node and one word in the coding sequence, until all the words in the coding sequence have been encoded; wherein,
during encoding, the encoding weight We is shared at each encoding step, so that the text code generated by the encoding is represented as the vector of the fixed dimension;
in step S300, the mean vector and the variance vector are generated through identity mapping.
2. The method for text representation of self-media data based on a recursive variational self-coding model according to claim 1, wherein step S130 further comprises:
adopting a Stanford dependency analyzer to analyze the text content of the text T to generate a dependency tree structure;
and carrying out binary tree serialization processing on the dependency tree structure to obtain the coding sequence.
3. The method of text representation of self-media data based on a recursive variational self-coding model according to claim 1, wherein step S300 comprises:
collecting variables used for generating the potential coding representation z in a standard normal distribution, wherein the distribution of the variables is used for divergence calculation in model training;
and multiplying the variable by the variance vector, then summing the resulting product with the mean vector to obtain the potential coding representation z.
4. The method of textual representation of self-media data based on a recursive variational self-coding model according to claim 1, wherein the decoding process of the potential coded representation z in step S400 comprises the steps of:
s410, generating an input vector x with the dimension being twice as large as the potential encoding representation z on the basis of the potential encoding representation z, wherein one part of the input vector x is a child node c, and the other part of the input vector x is a parent node p for decoding;
S420, continuing to decode the parent node p to obtain a new child node c1' and a new node p1', wherein said p1' is the new parent node for decoding;
and S430, recursively decoding in the step S420, wherein a new child node is used as a parent node for next decoding, and decoding is carried out until the decoding sequence with the same length as the coding sequence is generated.
5. The method of textual representation of self-media data based on a recursive variational self-coding model according to claim 1, wherein the coding loss between the decoded sequence and the coding sequence is calculated by the Euclidean distance in step S400.
6. The method of textual representation of self-media data based on a recursive variational self-coding model according to claim 1, wherein the parameters of the recursive variational self-coding model are updated by a back-propagation algorithm in step S400.
CN201711417351.2A 2017-12-25 2017-12-25 Self-media data text representation method based on recursive variation self-coding model Active CN108363685B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711417351.2A CN108363685B (en) 2017-12-25 2017-12-25 Self-media data text representation method based on recursive variation self-coding model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711417351.2A CN108363685B (en) 2017-12-25 2017-12-25 Self-media data text representation method based on recursive variation self-coding model

Publications (2)

Publication Number Publication Date
CN108363685A CN108363685A (en) 2018-08-03
CN108363685B true CN108363685B (en) 2021-09-14

Family

ID=63010041

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711417351.2A Active CN108363685B (en) 2017-12-25 2017-12-25 Self-media data text representation method based on recursive variation self-coding model

Country Status (1)

Country Link
CN (1) CN108363685B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109213975B (en) * 2018-08-23 2022-04-12 重庆邮电大学 Twitter text representation method based on character level convolution variation self-coding
CN109886388B (en) * 2019-01-09 2024-03-22 平安科技(深圳)有限公司 Training sample data expansion method and device based on variation self-encoder
CN111581916B (en) * 2020-05-15 2022-03-01 北京字节跳动网络技术有限公司 Text generation method and device, electronic equipment and computer readable medium
CN113379068B (en) * 2021-06-29 2023-08-08 哈尔滨工业大学 Deep learning architecture searching method based on structured data


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1510931A1 (en) * 2003-08-28 2005-03-02 DVZ-Systemhaus GmbH Process for platform-independent archiving and indexing of digital media assets
CN101645786A (en) * 2009-06-24 2010-02-10 中国联合网络通信集团有限公司 Method for issuing blog content and business processing device thereof
US9053431B1 (en) * 2010-10-26 2015-06-09 Michael Lamport Commons Intelligent control with hierarchical stacked neural networks
CN105469065A (en) * 2015-12-07 2016-04-06 中国科学院自动化研究所 Recurrent neural network-based discrete emotion recognition method
CN106844327A (en) * 2015-12-07 2017-06-13 科大讯飞股份有限公司 Text code method and system
CN107220311A (en) * 2017-05-12 2017-09-29 北京理工大学 A kind of document representation method of utilization locally embedding topic modeling

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
【Learning Notes】Variational Auto-Encoder (VAE); Anonymous; https://blog.csdn.net/jackytintin/article/details/53641885; 14 December 2016; pp. 1-14 *
Auto-Encoding Variational Bayes; Diederik P. Kingma et al.; arXiv:1312.6114v10 [stat.ML]; 1 May 2014; pp. 1-14 *

Also Published As

Publication number Publication date
CN108363685A (en) 2018-08-03

Similar Documents

Publication Publication Date Title
CN107291693B (en) Semantic calculation method for improved word vector model
CN109359297B (en) Relationship extraction method and system
CN110321417B (en) Dialog generation method, system, readable storage medium and computer equipment
CN106502985B (en) neural network modeling method and device for generating titles
CN106202010B (en) Method and apparatus based on deep neural network building Law Text syntax tree
CN108363685B (en) Self-media data text representation method based on recursive variation self-coding model
CN113127624B (en) Question-answer model training method and device
CN107203511A (en) A kind of network text name entity recognition method based on neutral net probability disambiguation
CN109213975B (en) Twitter text representation method based on character level convolution variation self-coding
CN109582952B (en) Poetry generation method, poetry generation device, computer equipment and medium
CN113254610B (en) Multi-round conversation generation method for patent consultation
CN111966827B (en) Dialogue emotion analysis method based on heterogeneous bipartite graph
CN108197294A (en) A kind of text automatic generation method based on deep learning
CN110032638B (en) Encoder-decoder-based generative abstract extraction method
CN113435211B (en) Text implicit emotion analysis method combined with external knowledge
CN114528898A (en) Scene graph modification based on natural language commands
CN116661805B (en) Code representation generation method and device, storage medium and electronic equipment
CN114528398A (en) Emotion prediction method and system based on interactive double-graph convolutional network
CN114358201A (en) Text-based emotion classification method and device, computer equipment and storage medium
CN111540470B (en) Social network depression tendency detection model based on BERT transfer learning and training method thereof
CN114327483A (en) Graph tensor neural network model establishing method and source code semantic identification method
CN107562729B (en) Party building text representation method based on neural network and theme enhancement
CN117094325B (en) Named entity identification method in rice pest field
CN114519353B (en) Model training method, emotion message generation method and device, equipment and medium
CN113326695B (en) Emotion polarity analysis method based on transfer learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant