CN110852070A - Document vector generation method - Google Patents

Document vector generation method Download PDF

Info

Publication number
CN110852070A
Authority
CN
China
Prior art keywords
document
word
attention
list
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911025383.7A
Other languages
Chinese (zh)
Inventor
金霞
杨红飞
张庭正
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Firestone Technology Co Ltd
Original Assignee
Hangzhou Firestone Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Firestone Technology Co Ltd
Priority to CN201911025383.7A
Publication of CN110852070A
Legal status: Pending

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a document vector generation method. First, the words of each sentence in a document are converted into a list, the sentence lists together form the document list, each word in the document list is mapped to a unique integer number, and a window is slid over the document list to obtain training samples for a hierarchical attention network. Second, a hierarchical attention network is constructed, comprising a word-level encoder, a word-level attention layer, a sentence-level encoder, a sentence-level attention layer and a document attention vector. A language model framework is then constructed; together with the hierarchical attention network it forms a language model, which is trained so that the parameters to be trained in the model are fitted. After training, feeding prediction data into the model yields its document vector. When the method is used, no variables need to be retrained for each new document, which greatly reduces the cost of use.

Description

Document vector generation method
Technical Field
The invention relates to the field of document vector generation, in particular to a document vector generation method.
Background
In recent years, deep learning methods that learn low-dimensional real-valued vector representations of words, sentences or documents have been widely applied in natural language processing. In tasks such as text classification, text similarity calculation and sentiment analysis, the quality of the document vector directly affects the result of the task.
The existing method for generating document feature vectors is based on the word2vec word vector model (Le Q V, Mikolov T. Distributed Representations of Sentences and Documents [J]. 2014): a document feature vector is added during the training of the language model, so the document vector is trained directly alongside the word vectors. Although this document vector model captures semantic and order information between words, the range it can capture is limited; and because the document vectors are trained directly, the vectors of new documents must be retrained, so document vectors can only be trained for the training set rather than predicted directly.
Existing document vector generation methods generally find it difficult to attend to the information of the whole document, and when a trained model is used to predict the vector of a new document, some variables must be retrained, which is costly in a production environment.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a document vector generation method based on a hierarchical attention network; the method captures the information of the whole document through the attention network and requires no variables to be retrained at prediction time.
The purpose of the invention is realized by the following technical scheme: a document vector generation method, the method comprising the steps of:
(1) preparing data;
(1.1) converting the words of each sentence in the document into a list, forming the document list from the sentence lists, and finally mapping each word in the document list to a unique integer number; the result is recorded as doc;
(1.2) sliding a window of size window_size over the document list, recording the word list inside the window as words and the word at position window_size//2 within the window as label; one document eventually yields many words/label pairs. The data of one sample comprises doc, words and label, so one document forms a plurality of samples that serve as training samples for the hierarchical attention network; these samples share the same doc but differ in words and label.
(2) Constructing a hierarchical attention network; the hierarchical attention network comprises a word-level encoder (word encoder), a word-level attention layer (word attention), a sentence-level encoder (sentence encoder), a sentence-level attention layer (sentence attention) and a document attention vector (doc_attention); the word encoder layer and the sentence encoder layer are both sequence encoder layers; the word attention layer and the sentence attention layer are both attention networks that differ only in their context vectors: the context of the word attention layer is u_w and the context of the sentence attention layer is u_s.
(3) Language model framework: the doc_attention output by the hierarchical attention network is concatenated with the word vectors corresponding to words and mapped to label with softmax; the word vectors come from an embedding matrix W_e, and the integer number of a word indexes its corresponding word vector in W_e.
(4) The hierarchical attention network and the language model framework form a language model; the language model is trained and the parameters to be trained in the model are fitted. After the language model is trained, the doc_attention corresponding to each document is its final document vector, and inputting prediction data into the model yields the document vector of the prediction data.
Further, in step (1.1), each sentence of the document may alternatively be split into characters; the character list of each sentence then forms the document list, and each character in the document list is mapped to a unique integer number.
Further, in step (1.2), both words and label are results of mapping words to integer numbers.
Further, in step (1.2), the window size window_size is 8.
Further, in step (1.2), window_size//2 denotes the integer part of window_size divided by 2 (for window_size = 8, window_size//2 = 4).
Further, in step (2), the sequence encoder is based on a GRU (Gated Recurrent Unit) network; the GRU updates the hidden state mainly through a relevance gate (reset gate) and an update gate. Denoting the hidden state at the current step t as h_t, the update is:

$$h_t = \Gamma_u \odot \tilde{h}_t + (1 - \Gamma_u) \odot h_{t-1}$$

wherein

$$\Gamma_u = \sigma(W_u[h_{t-1}, x_t]^T + b_u)$$

where σ is the sigmoid function, W_u and b_u are parameters to be trained, $\tilde{h}_t$ is the relevance-gated candidate state, and x_t is the input of the GRU;

the relevance gate is as follows:

$$\tilde{h}_t = \tanh(W_C[\Gamma_r \odot h_{t-1}, x_t]^T + b_C)$$

where tanh denotes the hyperbolic tangent function, and W_C and b_C are parameters to be trained;

the relevance gate weight is as follows:

$$\Gamma_r = \sigma(W_r[h_{t-1}, x_t]^T + b_r)$$

where W_r and b_r are parameters to be trained.
Further, in step (2), the word-level encoder (word encoder) is specifically:

$$x_{it} = W_e w_{it}, \quad t \in [1, T]$$

$$\overrightarrow{h}_{it} = \overrightarrow{\mathrm{GRU}}(x_{it})$$

$$\overleftarrow{h}_{it} = \overleftarrow{\mathrm{GRU}}(x_{it})$$

where x_{it} is the vectorised form of the integer number corresponding to a word, W_e is the embedding matrix, w_{it} is the one-hot (OneHot) encoded vector of the integer number in words, T is the maximum number of words per sentence, and $\overrightarrow{h}_{it}$ and $\overleftarrow{h}_{it}$ are the outputs of the forward and backward GRUs, respectively.
Further, in step (2), the sentence-level encoder (sentence encoder) is specifically:

$$\overrightarrow{h}_{i} = \overrightarrow{\mathrm{GRU}}(s_i), \quad i \in [1, L]$$

$$\overleftarrow{h}_{i} = \overleftarrow{\mathrm{GRU}}(s_i)$$

where s_i is the sentence_attention of the i-th sentence and L is the number of sentences.
Further, in step (2), the word-level attention layer (word attention) is specifically:

$$u_{it} = \tanh(W_w h_{it} + b_w)$$

$$\alpha_{it} = \frac{\exp(u_{it}^{T} u_w)}{\sum_{t}\exp(u_{it}^{T} u_w)}$$

$$s_i = \sum_{t} \alpha_{it} h_{it}$$

where h_{it} is obtained by concatenating $\overrightarrow{h}_{it}$ and $\overleftarrow{h}_{it}$, s_i is the sentence_attention of the i-th sentence, and W_w and b_w are parameters to be trained.
Further, in step (2), the sentence-level attention layer (sentence attention) is specifically:

$$u_{i} = \tanh(W_s h_{i} + b_s)$$

$$\alpha_{i} = \frac{\exp(u_{i}^{T} u_s)}{\sum_{i}\exp(u_{i}^{T} u_s)}$$

$$\mathrm{doc\_attention} = \sum_{i} \alpha_{i} h_{i}$$

where h_i is obtained by concatenating $\overrightarrow{h}_{i}$ and $\overleftarrow{h}_{i}$, and W_s and b_s are parameters to be trained.
The invention has the following beneficial effects: the invention obtains document vectors with a hierarchical attention network. The words of each sentence are mapped to a sentence_attention vector, and the sentence_attention information is in turn mapped to the final document vector, which ensures that the document vector contains information from the whole document; and because the vector is generated by relying on a language model, the resulting document vector does not lose the semantic information between words. Since the document vector is produced by the hierarchical attention network, the hidden-layer weights needed to generate the final document vector are fitted during model training; when the model is used, no variables need to be retrained for each new document, which greatly reduces the cost of use.
Drawings
FIG. 1 is a diagram of a hierarchical attention network framework of the present invention;
FIG. 2 is a diagram of a language model framework according to the present invention.
Detailed Description
The following describes embodiments of the present invention in further detail with reference to the accompanying drawings.
As shown in FIG. 1, the document vector generation method provided by the invention is based on a language model and embeds the document into the language model in the form of a hierarchical attention network. The specific steps are as follows:
(1) preparing data: for each document, the following processing is performed:
1) Converting the words of each sentence in the document into a list, forming the document list from the sentence lists, and finally mapping each word in the document list to a unique integer number, recorded as doc.
For example, the original document is as follows:
A typical industrial recommendation system contains two stages: matching (match) and ranking (rank). The matching stage mainly finds, from some of the user's points of interest, a candidate set of goods the user may be interested in. Because the full set of goods is massive, computing a real-time user's interest in every item is impractical, so a candidate set of potentially interesting goods must be retrieved in advance according to interests, feature strategies and so on. On this basis, the candidate set is ranked in depth by interest category according to a specific model algorithm, and the effect is usually quantified by metrics such as Click-Through Rate (CTR), conversion rate and dwell time. The main purpose of the rank stage is therefore to predict, for the goods a user is interested in, quantities such as the CTR, sort by the predicted score, and finally return the ranking as the recommendation result of the recommendation system.
The converted document list consists of one character list per sentence; the example document is Chinese, so each token is a single Chinese character, and the three sentence-level character lists correspond one-to-one to the integer lists shown below.
After mapping the document list to integer numbers (any mapping works, as long as each character is mapped to a unique integer number):
[[599,227,773,153,138,537,1549,248,1533,693,4371,84,39,3104,465,2,954,1085,1, 1366,435,18,529,393,801,1032,972,17,6,864,2756,18,1913,393,1435,1575,17],
[5,1366,435,2,191,1085,35,449,80,12,653,121,138,385,2,29,1201,624,674,86,1, 616,56,88,90,736,624,674,2,2126,5,302,433,422],
[82,62,985,39,302,433,422,2,363,218,311,1,32,504,96,1841,730,2,138,385,164, 928,818,130,32,186,268,302,433,2,736,624,674,1085,250,12,42,504,1210,2,1,146,46, 133,80,546,336,653,121,29,1201,624,674,4,479,2037,1368,735,22,109,1668,616,138, 385,88,90,736,624,674,2,2126,5,302,433,422,1,5,327,367,4594,38,1,147,653,121,479, 499,2,1068,232,818,370,109,590,261,302,433,422,624,674,262,242,315,864,2756,1, 1103,1171,166,535,778,778,153,191,86,3210,234,18,140,1153,717,1032,1575,0,332, 1913,2034,2063,633,972,0,510,393,801,785,17,4,313,220,234,4,96,241,22,109,218, 220,1,146,46,0,1913,393,1435,1575,0,3104,465,2,449,80,915,2,55,5,62,546,769,29, 39,138,385,5,130,736,624,674,2,302,433,35,2,0,140,332,510,0,22,1,53,169,653,121, 546,769,262,270,2,50,144,590,261,864,2756,1,131,1200,1459,407,864,2756,580,15, 537,1549,248,1533,2,537,1549,1223,1171]]
In this step, the document may also be split character by character: the characters of each sentence are converted into a list, the character lists of the sentences form the document list, and each character in the document list is mapped to a unique integer number (as in the example above).
2) A window of size window_size (8 in this example) is slid over the document list; the word list inside the window is denoted words, and the word at position window_size//2 (the integer part of window_size divided by 2) within the window is denoted label, where words and label are both results of mapping words to integer numbers. A document eventually yields many words/label pairs; the data of one sample comprises doc, words and label, so one document forms many samples that serve as training samples for the hierarchical attention network, with the same doc but different words and label. A minimal sketch of this data preparation is given below.
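For illustration only, the following Python sketch walks through this data-preparation step. The tokenised sentences, the toy vocabulary and the helper build_samples are assumptions made for the example; only the (doc, words, label) structure, the window_size of 8 and the choice of the word at position window_size//2 as label come from the description above (whether the window crosses sentence boundaries is likewise an assumption here).

```python
# Minimal sketch (assumption-laden) of step (1): turning one document into
# (doc, words, label) training samples with a sliding window.

WINDOW_SIZE = 8  # the window_size used in this example

def build_samples(sentences, vocab):
    """sentences: list of sentences, each a list of word strings;
    vocab: word -> unique integer number."""
    # doc: the document as a list of per-sentence lists of integer numbers
    doc = [[vocab[w] for w in sent] for sent in sentences]

    # Slide the window over the words of the document.
    flat = [idx for sent in doc for idx in sent]

    samples = []
    for start in range(len(flat) - WINDOW_SIZE + 1):
        window = flat[start:start + WINDOW_SIZE]
        words = window                          # the word list inside the window
        label = window[WINDOW_SIZE // 2]        # the word at position window_size//2
        samples.append({"doc": doc, "words": words, "label": label})
    return samples

if __name__ == "__main__":
    sentences = [["the", "match", "stage", "finds", "candidate", "items"],
                 ["the", "rank", "stage", "orders", "them", "by", "predicted", "ctr"]]
    vocab = {w: i for i, w in enumerate(sorted({w for s in sentences for w in s}))}
    samples = build_samples(sentences, vocab)
    print(len(samples), samples[0]["words"], samples[0]["label"])
```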
(2) Hierarchical attention network: it is called a hierarchical attention network because two layers of attention networks are used. Each sentence of doc passes through the word encoder and the word attention layer in turn to obtain its sentence_attention, and the sentence_attention of each sentence passes through the sentence encoder and the sentence attention layer in turn to obtain the final doc_attention. The word encoder layer and the sentence encoder layer are both sequence encoder layers; the word attention layer and the sentence attention layer are both attention networks and differ only in their context vectors.
The sequence encoder is based on a GRU (Gated Recurrent Unit) network; the GRU updates the hidden state mainly through a relevance gate (reset gate) and an update gate, and the GRU can be replaced by other recurrent neural networks such as an LSTM (Long Short-Term Memory). Denoting the hidden state of the current step t as h_t, the update is:

$$h_t = \Gamma_u \odot \tilde{h}_t + (1 - \Gamma_u) \odot h_{t-1}$$

wherein

$$\Gamma_u = \sigma(W_u[h_{t-1}, x_t]^T + b_u)$$

where σ is the sigmoid function, W_u and b_u are parameters to be trained, $\tilde{h}_t$ is the relevance-gated candidate state, and x_t is the input of the GRU;

the relevance gate is as follows:

$$\tilde{h}_t = \tanh(W_C[\Gamma_r \odot h_{t-1}, x_t]^T + b_C)$$

where tanh denotes the hyperbolic tangent function, and W_C and b_C are parameters to be trained;

the relevance gate weight is as follows:

$$\Gamma_r = \sigma(W_r[h_{t-1}, x_t]^T + b_r)$$

where W_r and b_r are parameters to be trained.
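The three formulas above can be written out directly as code. The following NumPy sketch of a single GRU step is an illustration, not the patent's implementation; the parameter shapes, the random initialisation and the element-wise interpolation of h_t between the candidate state and h_{t-1} are assumptions consistent with a standard GRU.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(h_prev, x_t, params):
    """One GRU update following the formulas above:
    gamma_u = sigma(W_u [h_{t-1}, x_t] + b_u)        update gate
    gamma_r = sigma(W_r [h_{t-1}, x_t] + b_r)        relevance (reset) gate weight
    h_cand  = tanh(W_C [gamma_r * h_{t-1}, x_t] + b_C)
    h_t     = gamma_u * h_cand + (1 - gamma_u) * h_{t-1}
    """
    W_u, b_u, W_r, b_r, W_C, b_C = params
    concat = np.concatenate([h_prev, x_t])
    gamma_u = sigmoid(W_u @ concat + b_u)
    gamma_r = sigmoid(W_r @ concat + b_r)
    h_cand = np.tanh(W_C @ np.concatenate([gamma_r * h_prev, x_t]) + b_C)
    return gamma_u * h_cand + (1.0 - gamma_u) * h_prev

if __name__ == "__main__":
    hidden, emb = 4, 3
    rng = np.random.default_rng(0)
    shapes = [(hidden, hidden + emb), (hidden,)] * 3   # W_u,b_u, W_r,b_r, W_C,b_C
    params = tuple(rng.normal(size=s) for s in shapes)
    h = np.zeros(hidden)
    for x in rng.normal(size=(5, emb)):                # a toy input sequence
        h = gru_step(h, x, params)
    print(h)
```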
The word-level encoder (word encoder) is specifically:

$$x_{it} = W_e w_{it}, \quad t \in [1, T]$$

$$\overrightarrow{h}_{it} = \overrightarrow{\mathrm{GRU}}(x_{it})$$

$$\overleftarrow{h}_{it} = \overleftarrow{\mathrm{GRU}}(x_{it})$$

where x_{it} is the vectorised form of the integer number corresponding to a word, W_e is the embedding matrix, w_{it} is the one-hot (OneHot) encoded vector of the integer number in words, T is the maximum number of words per sentence, and $\overrightarrow{h}_{it}$ and $\overleftarrow{h}_{it}$ are the outputs of the forward and backward GRUs, respectively.
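A minimal sketch of this encoder under the formulas above: each integer word number is one-hot encoded, projected with the embedding matrix W_e, and run through a forward and a backward GRU whose outputs are concatenated per position. It reuses the gru_step function from the previous sketch, and the shapes and parameter handling are assumptions. The sentence-level encoder below is the same bidirectional construction applied to the sentence vectors s_i instead of word embeddings.

```python
import numpy as np
# Assumes gru_step(h_prev, x_t, params) from the previous sketch is in scope.

def bigru_encode(x_seq, fwd_params, bwd_params, hidden):
    """Return h_t = [forward h_t ; backward h_t] for every position t."""
    T = len(x_seq)
    h_fwd, h_bwd = np.zeros((T, hidden)), np.zeros((T, hidden))
    h = np.zeros(hidden)
    for t in range(T):                        # forward GRU pass
        h = gru_step(h, x_seq[t], fwd_params)
        h_fwd[t] = h
    h = np.zeros(hidden)
    for t in reversed(range(T)):              # backward GRU pass
        h = gru_step(h, x_seq[t], bwd_params)
        h_bwd[t] = h
    return np.concatenate([h_fwd, h_bwd], axis=1)   # shape (T, 2 * hidden)

def word_encoder(word_ids, W_e, fwd_params, bwd_params, hidden):
    """x_it = W_e w_it, with w_it the one-hot vector of the integer word number."""
    vocab_size = W_e.shape[1]
    one_hot = np.eye(vocab_size)[word_ids]    # (T, vocab_size)
    x_seq = one_hot @ W_e.T                   # (T, embedding dim)
    return bigru_encode(x_seq, fwd_params, bwd_params, hidden)
```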
The sentence-level encoder (sentence encoder) is specifically:

$$\overrightarrow{h}_{i} = \overrightarrow{\mathrm{GRU}}(s_i), \quad i \in [1, L]$$

$$\overleftarrow{h}_{i} = \overleftarrow{\mathrm{GRU}}(s_i)$$

where s_i is the sentence_attention of the i-th sentence and L is the number of sentences.
The word-level attention layer (word attention) is specifically:

$$u_{it} = \tanh(W_w h_{it} + b_w)$$

$$\alpha_{it} = \frac{\exp(u_{it}^{T} u_w)}{\sum_{t}\exp(u_{it}^{T} u_w)}$$

$$s_i = \sum_{t} \alpha_{it} h_{it}$$

where h_{it} is obtained by concatenating $\overrightarrow{h}_{it}$ and $\overleftarrow{h}_{it}$, s_i is the sentence_attention of the i-th sentence, and W_w and b_w are parameters to be trained.
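The attention layer in the formulas above reduces to a few lines of code. The sketch below is generic, so the same function can serve the word level (with W_w, b_w, u_w over the h_it of one sentence) and the sentence level (with W_s, b_s, u_s over the h_i); the shapes in the usage example are assumptions.

```python
import numpy as np

def attention(H, W, b, u_ctx):
    """H: (T, d) hidden states. Implements
    u_t = tanh(W h_t + b), alpha_t = softmax_t(u_t . u_ctx), out = sum_t alpha_t h_t."""
    U = np.tanh(H @ W.T + b)              # (T, d_att)
    scores = U @ u_ctx                    # (T,)
    scores -= scores.max()                # numerical stability for the softmax
    alpha = np.exp(scores) / np.exp(scores).sum()
    return alpha @ H                      # attention-weighted summary, shape (d,)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    H = rng.normal(size=(6, 8))           # six word positions, 2*hidden = 8
    W_w, b_w, u_w = rng.normal(size=(8, 8)), np.zeros(8), rng.normal(size=8)
    s_i = attention(H, W_w, b_w, u_w)     # sentence_attention of one sentence
    print(s_i.shape)
```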
The sentence-level attention layer (sentence attention) is specifically:

$$u_{i} = \tanh(W_s h_{i} + b_s)$$

$$\alpha_{i} = \frac{\exp(u_{i}^{T} u_s)}{\sum_{i}\exp(u_{i}^{T} u_s)}$$

$$\mathrm{doc\_attention} = \sum_{i} \alpha_{i} h_{i}$$

where h_i is obtained by concatenating $\overrightarrow{h}_{i}$ and $\overleftarrow{h}_{i}$, and W_s and b_s are parameters to be trained.
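Applied at the sentence level, the same attention function sketched above yields the document vector; in the illustrative fragment below, sent_H stands for the stacked bidirectional outputs h_i of the sentence encoder, and all names are assumptions rather than the patent's own code.

```python
# sent_H: (L, 2*hidden) stacked h_i from the sentence encoder,
# (W_s, b_s, u_s): sentence-level attention parameters and context vector.
doc_attention = attention(sent_H, W_s, b_s, u_s)   # the final document vector
```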
(3) Language model framework: as shown in FIG. 2, the doc_attention of the hierarchical attention network is concatenated with the word vectors corresponding to words and mapped to label; the mapping uses softmax. The word vectors come from the embedding matrix W_e: the integer number of a word indexes its corresponding word vector in W_e. A sketch of this head is given below.
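The following rough sketch illustrates this framework; the parameter names, shapes and the flattening of the window's word vectors are assumptions. The head concatenates doc_attention with the word vectors of words and applies a softmax over the vocabulary, and training then maximises the probability assigned to label. At prediction time these parameters are already fitted, so a new document only needs a forward pass to produce its doc_attention.

```python
import numpy as np

def lm_head_probs(doc_attention, words, W_e, W_out, b_out):
    """Concatenate doc_attention with the word vectors of the window words and
    map the result to a probability distribution over the vocabulary."""
    word_vecs = W_e[:, words].T.reshape(-1)        # look up and flatten the word vectors
    features = np.concatenate([doc_attention, word_vecs])
    logits = W_out @ features + b_out              # one logit per vocabulary word
    logits -= logits.max()                         # numerical stability
    return np.exp(logits) / np.exp(logits).sum()   # softmax over the vocabulary

# Training would minimise the cross-entropy of the label word, e.g.
# loss = -np.log(lm_head_probs(doc_attention, words, W_e, W_out, b_out)[label])
```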
(4) The hierarchical attention network and the language model framework form a language model; the language model is trained and the parameters to be trained in the model are fitted. After training, the doc_attention corresponding to each document is its final document vector. For prediction data, because the parameters to be trained have already been fitted, the document vector doc_attention can likewise be obtained directly. The resulting document vectors can be used in natural language processing tasks such as text classification, text similarity calculation and sentiment analysis.
The invention also trains document vectors on the basis of a language model, but obtains the document vector with a hierarchical attention network instead of training it directly, so the model can attend to the information of the whole document and the cost of use is greatly reduced.
The above-described embodiments are intended to illustrate rather than to limit the invention, and any modifications and variations of the present invention are within the spirit of the invention and the scope of the appended claims.

Claims (10)

1. A method for generating a document vector, the method comprising the steps of:
(1) preparing data;
(1.1) converting the words of each sentence in the document into a list, forming the document list from the sentence lists, and finally mapping each word in the document list to a unique integer number, recorded as doc;
(1.2) sliding a window of size window_size over the document list, recording the word list inside the window as words and the word at position window_size//2 within the window as label; one document eventually yields many words/label pairs, the data of one sample comprising doc, words and label, so one document forms a plurality of samples that serve as training samples for the hierarchical attention network, these samples sharing the same doc but differing in words and label;
(2) constructing a hierarchical attention network, the hierarchical attention network comprising a word-level encoder (word encoder), a word-level attention layer (word attention), a sentence-level encoder (sentence encoder), a sentence-level attention layer (sentence attention) and a document attention vector (doc_attention); the word encoder layer and the sentence encoder layer are both sequence encoder layers; the word attention layer and the sentence attention layer are both attention networks that differ only in their context vectors, the context of the word attention layer being u_w and the context of the sentence attention layer being u_s;
(3) language model framework: splicing the doc_attention output by the hierarchical attention network with the word vectors corresponding to words, and mapping to label with softmax; the word vectors come from an embedding matrix W_e, and the integer number of a word indexes its corresponding word vector in W_e;
(4) the hierarchical attention network and the language model framework form a language model; the language model is trained and the parameters to be trained in the model are fitted; after the language model is trained, the doc_attention corresponding to each document is its final document vector, and inputting prediction data into the model yields the document vector of the prediction data.
2. The method for generating document vectors according to claim 1, wherein in step (1.1), each sentence of the document may alternatively be split into characters, the character list of each sentence forming the document list, and each character in the document list being mapped to a unique integer number.
3. The method according to claim 1, wherein in step (1.2), both words and label are the result of mapping words to integer numbers.
4. The method of claim 1, wherein in step (1.2), the window size is 8.
5. The method of claim 1, wherein in step (1.2), window_size//2 represents the integer part of the quotient of window_size divided by 2.
6. The method of claim 1, wherein in step (2), the sequence encoder is based on a GRU (Gated Recurrent Unit) network; the GRU updates the hidden state mainly through a relevance gate (reset gate) and an update gate; denoting the hidden state of the current step t as h_t, the update is:

$$h_t = \Gamma_u \odot \tilde{h}_t + (1 - \Gamma_u) \odot h_{t-1}$$

wherein

$$\Gamma_u = \sigma(W_u[h_{t-1}, x_t]^T + b_u)$$

where σ is the sigmoid function, W_u and b_u are parameters to be trained, $\tilde{h}_t$ is the relevance-gated candidate state, and x_t is the input of the GRU;

the relevance gate is as follows:

$$\tilde{h}_t = \tanh(W_C[\Gamma_r \odot h_{t-1}, x_t]^T + b_C)$$

where tanh denotes the hyperbolic tangent function, and W_C and b_C are parameters to be trained;

the relevance gate weight is as follows:

$$\Gamma_r = \sigma(W_r[h_{t-1}, x_t]^T + b_r)$$

where W_r and b_r are parameters to be trained.
7. The method for generating document vectors according to claim 1, wherein in step (2), the word-level encoder (word encoder) is specifically:

$$x_{it} = W_e w_{it}, \quad t \in [1, T]$$

$$\overrightarrow{h}_{it} = \overrightarrow{\mathrm{GRU}}(x_{it})$$

$$\overleftarrow{h}_{it} = \overleftarrow{\mathrm{GRU}}(x_{it})$$

where x_{it} is the vectorised form of the integer number corresponding to a word, W_e is the embedding matrix, w_{it} is the one-hot (OneHot) encoded vector of the integer number in words, T is the maximum number of words per sentence, and $\overrightarrow{h}_{it}$ and $\overleftarrow{h}_{it}$ are the outputs of the forward and backward GRUs, respectively.
8. The method for generating document vectors according to claim 1, wherein in step (2), the sentence-level encoder (sentence encoder) is specifically:

$$\overrightarrow{h}_{i} = \overrightarrow{\mathrm{GRU}}(s_i), \quad i \in [1, L]$$

$$\overleftarrow{h}_{i} = \overleftarrow{\mathrm{GRU}}(s_i)$$

where s_i is the sentence_attention of the i-th sentence and L is the number of sentences.
9. The method for generating document vectors according to claim 1, wherein in step (2), the word-level attention layer (word attention) is specifically:

$$u_{it} = \tanh(W_w h_{it} + b_w)$$

$$\alpha_{it} = \frac{\exp(u_{it}^{T} u_w)}{\sum_{t}\exp(u_{it}^{T} u_w)}$$

$$s_i = \sum_{t} \alpha_{it} h_{it}$$

where h_{it} is obtained by concatenating $\overrightarrow{h}_{it}$ and $\overleftarrow{h}_{it}$, s_i is the sentence_attention of the i-th sentence, and W_w and b_w are parameters to be trained.
10. The method for generating document vectors according to claim 1, wherein in step (2), the sentence-level attention layer (sentence attention) is specifically:

$$u_{i} = \tanh(W_s h_{i} + b_s)$$

$$\alpha_{i} = \frac{\exp(u_{i}^{T} u_s)}{\sum_{i}\exp(u_{i}^{T} u_s)}$$

$$\mathrm{doc\_attention} = \sum_{i} \alpha_{i} h_{i}$$

where h_i is obtained by concatenating $\overrightarrow{h}_{i}$ and $\overleftarrow{h}_{i}$, and W_s and b_s are parameters to be trained.
CN201911025383.7A 2019-10-25 2019-10-25 Document vector generation method Pending CN110852070A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911025383.7A CN110852070A (en) 2019-10-25 2019-10-25 Document vector generation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911025383.7A CN110852070A (en) 2019-10-25 2019-10-25 Document vector generation method

Publications (1)

Publication Number Publication Date
CN110852070A true CN110852070A (en) 2020-02-28

Family

ID=69597863

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911025383.7A Pending CN110852070A (en) 2019-10-25 2019-10-25 Document vector generation method

Country Status (1)

Country Link
CN (1) CN110852070A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108153864A (en) * 2017-12-25 2018-06-12 北京牡丹电子集团有限责任公司数字电视技术中心 Method based on neural network generation text snippet
CN108460019A (en) * 2018-02-28 2018-08-28 福州大学 A kind of emerging much-talked-about topic detecting system based on attention mechanism
CN108829818A (en) * 2018-06-12 2018-11-16 中国科学院计算技术研究所 A kind of file classification method
CN109145304A (en) * 2018-09-07 2019-01-04 中山大学 A kind of Chinese Opinion element sentiment analysis method based on word

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
QUOC LE et al.: "Distributed Representations of Sentences and Documents", Proceedings of the 31st International Conference on Machine Learning *
ZICHAO YANG et al.: "Hierarchical Attention Networks for Document Classification", Proceedings of NAACL-HLT 2016 *
欧阳文俊 (OUYANG Wenjun) et al.: "Unsupervised Document Representation Learning Based on a Hierarchical Attention Model", Computer Systems & Applications *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112230990A (en) * 2020-11-10 2021-01-15 北京邮电大学 Program code duplication checking method based on hierarchical attention neural network

Similar Documents

Publication Publication Date Title
CN110287320B (en) Deep learning multi-classification emotion analysis model combining attention mechanism
CN108984724B (en) Method for improving emotion classification accuracy of specific attributes by using high-dimensional representation
CN110929030B (en) Text abstract and emotion classification combined training method
CN112214599B (en) Multi-label text classification method based on statistics and pre-training language model
CN109740148B (en) Text emotion analysis method combining BiLSTM with Attention mechanism
CN111209738B (en) Multi-task named entity recognition method combining text classification
CN110287323B (en) Target-oriented emotion classification method
CN112800776B (en) Bidirectional GRU relation extraction data processing method, system, terminal and medium
Zhang et al. Multi-layer attention based CNN for target-dependent sentiment classification
CN110472042B (en) Fine-grained emotion classification method
CN110765260A (en) Information recommendation method based on convolutional neural network and joint attention mechanism
CN108062388A (en) Interactive reply generation method and device
CN111079409B (en) Emotion classification method utilizing context and aspect memory information
CN107818084B (en) Emotion analysis method fused with comment matching diagram
CN110659411B (en) Personalized recommendation method based on neural attention self-encoder
CN110580287A (en) Emotion classification method based ON transfer learning and ON-LSTM
CN113704546A (en) Video natural language text retrieval method based on space time sequence characteristics
CN111597340A (en) Text classification method and device and readable storage medium
CN113705238B (en) Method and system for analyzing aspect level emotion based on BERT and aspect feature positioning model
CN111538841B (en) Comment emotion analysis method, device and system based on knowledge mutual distillation
CN111368082A (en) Emotion analysis method for domain adaptive word embedding based on hierarchical network
CN114358201A (en) Text-based emotion classification method and device, computer equipment and storage medium
Kalaivani et al. A review on feature extraction techniques for sentiment classification
CN115408603A (en) Online question-answer community expert recommendation method based on multi-head self-attention mechanism
CN116680363A (en) Emotion analysis method based on multi-mode comment data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200228