CN107608953B - Word vector generation method based on indefinite-length context

Info

Publication number: CN107608953B (application CN201710609471.6A)
Authority: CN (China)
Prior art keywords: context, word, vector, corpus, words
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN107608953A (en)
Inventors: 王俊丽 (Wang Junli), 王小敏 (Wang Xiaomin), 杨亚星 (Yang Yaxing)
Original and current assignee: Tongji University
Application filed by Tongji University; priority and filing date: 2017-07-25
Publication dates: CN107608953A, 2018-01-19; CN107608953B (grant), 2020-08-14

Abstract

A word vector generation method based on indefinite-length context. The invention relates to the field of natural language processing, and in particular to a word vector generation method based on indefinite-length contexts. The technical scheme of the invention provides an indefinite-length context division strategy and a word vector generation method built on it. The strategy uses punctuation marks to divide the corpus into contexts of indefinite length but complete semantics. Because the length is not fixed, conventional language models cannot generate word vectors from such contexts. To solve this problem, a language model, the F-Model, that can process variable-length contexts is designed by combining a convolutional neural network and a recurrent neural network. Analysis of the experimental results shows that dividing the corpus by punctuation into contexts with complete semantics improves the quality of the word vectors. The F-Model has good learning ability, and the word vectors it produces contain rich semantics and exhibit a good linear relationship.

Description

Word vector generation method based on indefinite-length context
Technical Field
The invention relates to the field of natural language processing, in particular to a word vector generation method based on indefinite-length context.
Background
Most common natural language processing tasks are implemented on top of word vectors, and the final processing result often depends to a large extent on the quality of those vectors. Generally speaking, the higher the quality of a word vector, the richer and more accurate the semantics it contains, and the easier it is for a machine to resolve the semantics of natural language, which fundamentally improves the results of other natural language processing tasks. How to generate high-quality word vectors is therefore a basic and important task in the field of natural language processing, with a direct and significant influence on subsequent tasks such as machine translation and part-of-speech tagging.
In conventional word vector generation methods, the corpus is divided into fixed-length context units in order to simplify the problem and reduce computational complexity. A fixed-length context, however, is not a complete semantic unit, which causes semantic missing or semantic confusion in the context. This semantic missing and confusion is transferred into the word vectors, directly degrading their semantics.
To solve the semantic missing and semantic confusion caused by fixed-length contexts, the invention makes full use of the original corpus information and uses punctuation marks to divide the corpus into context units with relatively complete semantics. The length of these units is not fixed, so traditional word vector generation methods based on fixed-length contexts are no longer applicable.
The invention therefore provides a word vector generation method based on indefinite-length contexts. The method is built on a convolutional neural network and a recurrent neural network and strengthens the long-range dependency information among words. The experimental results show that the word vectors generated by the method contain richer semantics and exhibit a better linear relationship.
Disclosure of Invention
The invention aims to provide an indefinite-length context division strategy and a word vector generation method based on indefinite-length contexts. The strategy uses punctuation marks to divide the corpus into contexts of indefinite length but complete semantics, solving the semantic missing and confusion caused by the fixed-length contexts used in traditional language models. The generation method built on this strategy uses the characteristics and advantages of the convolutional and recurrent neural networks to strengthen the long-range dependency information among words, ultimately improving the quality of the generated word vectors.
To achieve this purpose, the invention provides a word vector generation method based on indefinite-length contexts, characterized in that the semantic integrity of the context is preserved by using the characteristics and advantages of punctuation marking, probability statistics, a convolutional neural network and a recurrent neural network; the long-range dependency between words is enhanced; and the semantic capacity of the word vectors is improved.
The method first preprocesses the corpus and divides it by punctuation marks into context units of different lengths with complete semantics. A convolutional neural network then learns the weights of the words in each context, and these weights are combined with the global distribution of the corpus to produce the final word weights. The final weights and the word vectors are used to compute a vector representation of the context. This vector representation is used to construct a one-to-many mapping between the context and each word in it. The model is then trained by a stochastic gradient algorithm, finally yielding the word vectors.
The invention is realized by the following technical scheme:
(1) Preprocess the document to obtain a training corpus. Given a set of documents related to a certain professional field, useful information in the corpus is obtained by preprocessing techniques such as removing stop words and low-frequency words, forming the training corpus.
(2) Perform word frequency statistics and corpus distribution statistics. A dictionary of the corpus is generated from the statistics of word occurrence frequencies in the documents; the dictionary contains the words in the corpus, their indexes and their frequencies.
(3) Construct the training set. The corpus is divided into contexts of different lengths according to the punctuation marks in the training corpus, forming the training set.
(4) Compute the weights of the word vectors in the context. The word vectors of the words in a context constitute a context matrix. The weight of each word is obtained by a convolution operation on the context matrix using a convolutional neural network, and is combined with the word's frequency in the corpus to form the final weight.
(5) Compute the distributed representation of the context. The distributed expression of the current context is obtained by combining the word vectors with the weights obtained in step (4). The recurrent neural network uses its historical context information to generate the distributed expression of the latest context and updates its historical information.
(6) Model inference. A one-to-many mapping between the context and the words in it is constructed from the context distribution information obtained in step (5), and the loss function of the model is constructed.
(7) Train the model to obtain the word vectors. The model is optimized on the training set according to the mapping constructed in step (6); the training uses negative sampling and the stochastic gradient descent algorithm.
In the above method, punctuation marks are used in step (3); the punctuation marks here refer to those that segment relatively complete semantics, such as ".", ",", "?", "!" and the like.
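For illustration, the following is a minimal Python sketch of steps (1)-(3) under stated assumptions: the corpus is split on the punctuation marks listed above, and the global word distribution that later serves as Wt1 is computed from the resulting contexts. All function and variable names are illustrative, not taken from the patent.

```python
import re
from collections import Counter

def build_contexts(corpus_text, min_count=5):
    """Divide a corpus into indefinite-length, semantically complete
    contexts using punctuation, and compute the global word distribution."""
    # Split on punctuation marks that close a relatively complete semantic unit.
    raw_contexts = re.split(r"[.,?!]", corpus_text.lower())
    contexts = [c.split() for c in raw_contexts if c.split()]

    # Word frequency statistics over the whole corpus (step (2)).
    counts = Counter(w for ctx in contexts for w in ctx)
    vocab = sorted(w for w, n in counts.items() if n >= min_count)
    index = {w: i for i, w in enumerate(vocab)}           # word -> dictionary index
    total = sum(counts[w] for w in vocab)
    freq = {w: counts[w] / total for w in vocab}          # global distribution d_k

    # Each kept context yields the model input In = (I, Wt1) of equations (1)-(3).
    training_set = []
    for ctx in contexts:
        words = [w for w in ctx if w in index]
        if len(words) >= 2:
            I = [index[w] for w in words]                 # index vector I
            Wt1 = [freq[w] for w in words]                # global distribution vector Wt1
            training_set.append((I, Wt1))
    return training_set, index, freq
```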
In the above method, step (4) uses a convolutional neural network whose convolution kernel has size [1,3,m,1], where m is the dimension of the word vectors. With this network, a context matrix of shape [k,m] is convolved to generate weights of shape [k,1], where k is the number of words in the context. The weights are then combined with the corpus distribution to calculate the final weights.
In the above method, the step (4) specifically includes the following steps:
a) Construct a context matrix. The context matrix is constructed from the input of the F-Model. The input In of the model is a context and comprises two parts: the index vector I of the words in the context and the global distribution vector Wt1 of the words in the context. Each term of I is the index of a word in the dictionary. The length of Wt1 equals the length of the context, and the value in each dimension is the frequency of the corresponding word of I over the whole corpus. The word vectors are looked up in the dictionary table according to the input index vector I by the gather() operation and assembled in order into the context matrix C.
Wt1=(d0,d1,…,dk) (1)
I=(i0,i1,…,ik) (2)
In=(I,Wt1) (3)
C=gather(D,I) (4)
where k is the length of the context, ik denotes the index of the word Wk in the dictionary, and dk denotes the global distribution of Wk.
b) The context matrix C is convolved by the convolutional layer to obtain the weight Wt2 of shape [k,1].
Wt2=f(C)+B (5)
where D is the dictionary, f() is the convolution operation, the shape of the convolution kernel is [1,3,m,1], and B is the bias term of the convolution.
c) Calculate the final weight. The final weight Wt is calculated in combination with the corpus distribution.
Wt=Wt2·Wt1 (6)
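The following NumPy sketch mirrors sub-steps a)-c). Two points are assumptions rather than statements of the patent: the [1,3,m,1] kernel is applied as a width-3 convolution along the word axis with zero ("same") padding, and the product in equation (6) is element-wise, since Wt2 and Wt1 are both of length k.

```python
import numpy as np

def context_weights(D, I, Wt1, kernel, B):
    """Steps a)-c): build the context matrix and compute the final weight Wt.

    D      : [V, m] dictionary of word vectors
    I      : [k]    indices of the context words
    Wt1    : [k]    global distribution of the context words
    kernel : [3, m] width-3 convolution kernel (the [1,3,m,1] kernel flattened)
    B      : scalar bias of the convolution
    """
    C = D[np.array(I)]                        # a) gather: context matrix, shape [k, m]
    k, m = C.shape

    # b) convolve along the word axis with zero ("same") padding -> Wt2, shape [k]
    padded = np.vstack([np.zeros((1, m)), C, np.zeros((1, m))])
    Wt2 = np.array([np.sum(padded[i:i + 3] * kernel) for i in range(k)]) + B

    # c) combine with the corpus distribution: Wt = Wt2 * Wt1 (element-wise)
    Wt = Wt2 * np.asarray(Wt1)
    return C, Wt
```

In a TensorFlow implementation, the same three steps would map naturally to tf.gather, tf.nn.conv2d and an element-wise multiply.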
In the above method, step (5) calculates the semantic vector Ct of the context from the context matrix C and the weight Wt obtained in step (4). The recurrent neural network then computes the latest context semantic vector from the current Ct and the historical context state learned inside the network, so that the resulting context semantic vector also contains the historical context semantics learned in the recurrent neural network.
In the above method, the step (5) specifically includes the following steps:
a) Compute the distributed expression of the current context. Weight information of the words is added on the basis of the bag-of-words model.
The distributed expression of the current context is calculated in the following way:
Ct=C·(Wt) (7)
where C represents a context matrix and · represents a matrix multiplication.
b) Add historical information using the recurrent neural network.
Ct=rnn(Ct) (8)
Wherein rnn () represents the traversal of the recurrent neural network.
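A sketch of step (5) in the same style. The reduction Ct = Cᵀ·Wt realizes equation (7); the rnn() traversal of equation (8) is modeled here as a single Elman-style update, which is an assumption, since the patent does not specify the cell.

```python
import numpy as np

def context_representation(C, Wt, h_prev, W_x, W_h, b):
    """Step (5): weighted bag-of-words context vector, then RNN history update.

    C        : [k, m] context matrix
    Wt       : [k]    final word weights from step (4)
    h_prev   : [m]    historical context state kept inside the RNN
    W_x, W_h : [m, m] RNN input/recurrent weights; b : [m] bias
    """
    # Eq. (7): distributed expression of the current context (weighted sum of rows).
    Ct = C.T @ Wt                                # shape [m]

    # Eq. (8): rnn() -- fold the historical context semantics into the latest vector.
    h = np.tanh(W_x @ Ct + W_h @ h_prev + b)     # latest context semantic vector
    return h
```

Feeding the contexts of a document through this update in order lets each latest context inherit the history, as step (5) requires.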
In the above method, step (6) predicts each word in the context from the context semantic vector Ct, calculating the conditional probability of each target word under the context. The calculation is as follows:
P(Wi|C)=sigmoid(C·θi) (9)
where θi is the parameter of the word Wi.
In the above method, when the model is trained in step (7), the loss function adopts the model perplexity, which measures, averaged over the words of a context, how well the model predicts each target word.
perplexity = (P(W0|C)·P(W1|C)·…·P(Wk|C))^(−1/k) (10)
where k denotes the length of the context and P(Wi|C) denotes the conditional probability of the word Wi under the context C.
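The inference and loss of steps (6) and (7) can then be sketched as follows. Theta is an assumed [V,m] parameter matrix whose row i plays the role of θi, the context semantic vector Ct stands in for the C of equation (9), and the perplexity of equation (10) is computed in log space for numerical stability.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def context_perplexity(Ct, Theta, targets):
    """Steps (6)-(7): score every word of the context against its
    context vector and compute the perplexity loss of eq. (10).

    Ct      : [m]    context semantic vector
    Theta   : [V, m] per-word parameters (theta_i is row i)
    targets : [k]    dictionary indices of the words in the context
    """
    # Eq. (9): P(Wi|C) = sigmoid(Ct . theta_i) for each target word.
    probs = sigmoid(Theta[np.array(targets)] @ Ct)        # shape [k]

    # Eq. (10): perplexity = (prod_i P(Wi|C)) ** (-1/k), computed in log space.
    return np.exp(-np.mean(np.log(probs + 1e-12)))
```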
The advantages of the invention are as follows. The corpus is divided by punctuation marks to form contexts with complete semantics, and the F-Model uses the characteristics and advantages of the convolutional and recurrent neural networks to strengthen the long-range dependency between words and improve the quality of the word vectors.
Drawings
FIG. 1 shows the structure of the F-Model.
FIG. 2 is a block diagram of an implementation of the present invention.
FIG. 3 shows the variation of the F-Model loss under different learning rates.
FIG. 4 (Table 1) shows the similarity analysis of words.
FIG. 5 (Table 2) shows F-Model linear-relationship analysis data (partial).
FIG. 6 (Table 3) shows the accuracy on the questions-words test set under different dimensions and iteration counts.
Detailed Description
To make the objects, technical solutions and advantages of the present invention clearer, the word vector generation method according to the embodiments of the present invention is described in further detail below with reference to the accompanying drawings. It should be understood that the embodiments described here only illustrate the present invention and are not intended to limit it; the scope of the invention is not restricted to the following embodiments, and a person skilled in the art may make appropriate changes within the inventive concept, which still fall within the scope of the invention as defined by the claims.
As shown in the block diagram of FIG. 1, the word vector generation method based on indefinite-length contexts according to the embodiment of the present invention includes the following steps:
1) a preprocessing module:
The corpus is preprocessed: stop words and low-frequency words are removed, numbers are replaced by a fixed token, all words are lowercased, and so on, finally forming an effective training corpus.
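A minimal sketch of this module, assuming an illustrative stop-word list and the fixed token N for numbers mentioned in the implementation below; sentence punctuation is kept because the model construction module still needs it for context division.

```python
import re

STOP_WORDS = {"the", "a", "of", "and"}   # illustrative; use a full stop-word list

def preprocess(text):
    """Lowercase, keep sentence punctuation for the later context division,
    replace numbers with the fixed token N, and drop stop words.
    Low-frequency words are removed later, once corpus-wide counts exist."""
    tokens = re.findall(r"\w+|[.,?!]", text.lower())
    tokens = ["N" if t.isdigit() else t for t in tokens]
    return [t for t in tokens if t not in STOP_WORDS]
```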
2) A model construction module:
and counting word frequency distribution of the training corpus and generating a dictionary, wherein the dictionary content comprises words, indexes and frequencies. The corpus is divided into context units with indefinite length and complete semantics by punctuation marks to form a training set. The input to the F-Model is in context, and comprises two parts: an index I vector of words in the context, a global distribution vector Wt1 of words in the context. The index I then constructs a context matrix C, and the convolutional neural network learns the weights Wt2 of the words in the context matrix through the context matrix. The weight Wt2 and corpus distribution Wt1 form the final weight Wt. The F-Model then calculates the distribution representation Ct of the context according to Wt and the context matrix C. And the F-Model then constructs one-to-many mapping of the context and the words, and utilizes the negative sampling to construct a loss function of the Model, thereby providing an optimization target for subsequent training.
3) A model training module:
The model is trained on the training set by continuously reducing the loss function with a stochastic gradient algorithm, and the quality of the obtained word vectors is improved by adjusting the hyper-parameters of the model. The hyper-parameters include the vector dimension, the number of iterations and the learning rate. In the implementation, three different word vector dimensions are used: 50, 100 and 200. The number of iterations refers to how many passes the training set makes through the model; 10 and 20 iterations are used. For the learning rate, several fixed values are adopted: 0.1, 0.01 and 0.001.
description of the implementation and results:
in the implementation, a total of 4 data sets are used, which are respectively: english corpus news.2011.en.shuffled, ptb.train.txt corpus, Wordsim353 dataset and query-words.txt dataset in the Billion-Words corpus. Wherein the news.2011.en.shuffled corpus is used as a training set, and the ptb.train.txt corpus is used for testing the function of the model. Both the Wordsim353 dataset and the query-words. txt dataset are used as test sets, the Wordsim353 dataset is used for comparing similarity of word vectors, and the query-words. txt dataset is used for evaluating linear relations of word vectors.
The news.2011.en.shuffled corpus is plain text data crawled from 2011 news; it is large, complete, and retains original information such as punctuation. When processing this corpus, words with a frequency below 5 or appearing in the stop-word list are ignored, all words are converted to lowercase, and all numbers are converted to the token N. The ptb.train.txt corpus is relatively small compared with Billion-Words and is used in practice to test the model. The Wordsim353 dataset is a small, manually constructed corpus comprising word pairs and their similarities; the similarities are obtained by human scoring, from a minimum of 0 to a maximum of 10. Wordsim353 contains three subsets: set1, set2 and combined. Set1 contains word-pair similarity scores from 13 annotators, set2 contains scores from 16 annotators, and combined is the average of the similarities in set1 and set2. In the implementation, the Wordsim353 dataset serves as a test set and as the benchmark for similarity evaluation. The questions-words.txt dataset is the dataset used by word2vec in TensorFlow; it is a manually constructed test set for the linear relationships of word vectors. Each test case in this dataset contains 4 words with a linear relationship similar to King − Man + Woman ≈ Queen.
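For the two evaluations, the standard procedures can be sketched as follows: cosine similarity for Wordsim353-style word pairs, and the vector-offset analogy test for questions-words-style cases. Here embedding is an assumed mapping from words to NumPy vectors; the names are illustrative.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(embedding, a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c),
    e.g. analogy(E, 'man', 'king', 'woman') -> 'queen'."""
    query = embedding[b] - embedding[a] + embedding[c]
    best, best_sim = None, -1.0
    for w, vec in embedding.items():
        if w in (a, b, c):
            continue                      # exclude the query words themselves
        sim = cosine(query, vec)
        if sim > best_sim:
            best, best_sim = w, sim
    return best
```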
Analysis of the similarity and linear relationships of the word vectors shows that the quality of word vectors is affected by the model, the context and the dimension size simultaneously. The indefinite-length contexts divided by punctuation marks preserve more complete semantics and improve the quality of the word vectors. The F-Model has good learning ability, and the word vectors obtained by training contain rich semantics and show a good linear relationship. The larger the dimension of the word vectors, the stronger their capacity to contain semantics; however, this also brings the curse of dimensionality and overfitting, which degrades the transferability of the word vectors.
It is to be noted and understood that various modifications and improvements can be made to the invention described in detail above without departing from the spirit and scope of the invention as claimed in the appended claims. Accordingly, the scope of the claimed subject matter is not limited by any of the specific exemplary teachings provided.

Claims (7)

1. A method for generating a word vector based on an indefinite length context,
firstly, preprocessing a corpus, dividing a context by using punctuation marks, and dividing the corpus into context units with different lengths and complete semantics;
then, learning the weight of each word in the context by using a convolutional neural network, and combining the weight with the global distribution of the corpus to generate the final weight of each word in the context;
then using the final weight and the word vector to calculate a vector expression of the context;
then, constructing a one-to-many mapping relation between each word in the context by using the vector expression of the context;
then training a model through a random gradient algorithm, and finally obtaining a word vector;
the method specifically comprises the following steps:
(1) preprocessing a document to obtain a training corpus;
a group of documents related to a certain professional field is given, and useful information in the corpus is obtained by preprocessing techniques such as removing stop words and low-frequency words, thereby forming a training corpus;
(2) carrying out word frequency statistics and corpus distribution statistics;
generating a dictionary of a corpus based on the statistics of the occurrence frequency of words in the document, wherein the dictionary comprises the words in the corpus, the indexes of the words and the frequency of the words;
(3) constructing a training set;
dividing a corpus into contexts with different lengths according to punctuations in a training corpus to form a training set;
(4) calculating the weight of the word vector in the context;
the word vectors of all words in the context form a context matrix; acquiring the weight of each word by convolution operation of the context matrix by utilizing a convolution neural network, and combining the weight with the frequency of the words in the corpus to form a final weight;
(5) computing a distributed representation of a context;
combining the weight of the word vector obtained in the step (4) to obtain the distributed expression of the current context; generating a distributed expression of the latest context by using historical context information in the recurrent neural network, and updating the historical information in the recurrent neural network;
(6) model inference;
constructing a one-to-many mapping relation between the context and words in the context by using the context distribution information obtained in the step (5); constructing a loss function of the model;
(7) training a model to obtain a word vector;
performing optimized training on a training set according to the mapping relation constructed in the step (6), wherein the training method adopts a negative sampling and random gradient descent algorithm;
in the above method, punctuation marks are used in step (3), where the punctuation marks are those containing relatively complete segmentation semantics.
2. The method according to claim 1, wherein the step (4) uses a convolutional neural network, and the size of the convolutional kernel is [1,3, m,1], where m represents the dimension of the word vector; a context matrix with the shape of [ k, m ] is convolved by a convolutional neural network to generate a weight with the shape of [ k,1], wherein k represents the number of words in the context; the weights are then combined with the corpus distribution to calculate the final weights.
3. The method for generating word vectors based on indefinite length contexts as claimed in claim 1, wherein the step (4) comprises the following steps:
a) constructing a context matrix;
constructing a context matrix according to the input of the F-Model; the input In of the model is a context and comprises two parts: the index vector I of the words in the context and the global distribution vector Wt1 of the words in the context; each term in the index vector I represents the index of a word in the dictionary; the length of the global distribution vector Wt1 is the same as that of the context, and the value in each dimension represents the frequency distribution of the corresponding word of the index vector I in the whole corpus; word vectors are looked up in the dictionary table according to the input index vector I by the gather() operation and combined in order into the context matrix C;
Wt1=(d0,d1,…,dk) (1)
I=(i0,i1,…,ik) (2)
In=(I,Wt1) (3)
C=gather(D,I) (4)
where k is the length of the context, ik denotes the index of the word Wk in the dictionary, and dk denotes the global distribution of Wk;
b) the context matrix C is convolved by a convolutional layer to obtain a weight Wt2 with the shape of [ k,1 ];
Wt2=f(C)+B (5)
wherein: D is the dictionary, f() is the convolution operation, the shape of the convolution kernel is [1,3,m,1], and B is the bias term of the convolution;
c) calculating the final weight;
calculating final weight Wt in combination with corpus distribution
Wt=Wt2·Wt1 (6)。
4. The method according to claim 1, wherein in step (5) the context semantic vector Ct is calculated from the context matrix C and the weight Wt obtained in step (4); the recurrent neural network calculates the latest context semantic vector from the current context semantic vector Ct and the historical context state learned inside the network, so that the resulting context semantic vector contains the historical context semantics learned in the recurrent neural network.
5. The method for generating word vectors based on indefinite length contexts as claimed in claim 4, wherein the step (5) comprises the following steps:
a) calculating a distributed expression of the current context; adding weight information of words on the basis of the bag-of-words model;
the distributed expression of the current context is calculated in the following way:
Ct=C·(Wt) (7)
where C represents a context matrix,. represents a matrix multiplication;
b) adding historical information by using a recurrent neural network;
Ct=rnn(Ct) (8)
wherein rnn () represents the traversal of the recurrent neural network.
6. The indefinite length context-based word vector generation method of claim 1, wherein the step (6) predicts each word in the context based on the context semantic vector Ct; calculating the conditional probability of the target word under the context condition; the calculation method is as follows:
P(Wi|C)=sigmoid(C·θi) (9)
wherein θi is the parameter of the word Wi.
7. The method as claimed in claim 1, wherein, when the model is trained in step (7), the loss function adopts the model perplexity, which measures, averaged over the words of a context, how well the model predicts each target word;
perplexity = (P(W0|C)·P(W1|C)·…·P(Wk|C))^(−1/k) (10)
where k denotes the length of the context and P(Wi|C) denotes the conditional probability of the word Wi under the context C.
CN201710609471.6A 2017-07-25 2017-07-25 Word vector generation method based on indefinite-length context Active CN107608953B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710609471.6A CN107608953B (en) 2017-07-25 2017-07-25 Word vector generation method based on indefinite-length context

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710609471.6A CN107608953B (en) 2017-07-25 2017-07-25 Word vector generation method based on indefinite-length context

Publications (2)

Publication Number Publication Date
CN107608953A CN107608953A (en) 2018-01-19
CN107608953B true CN107608953B (en) 2020-08-14

Family

ID=61059516

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710609471.6A Active CN107608953B (en) 2017-07-25 2017-07-25 Word vector generation method based on indefinite-length context

Country Status (1)

Country Link
CN (1) CN107608953B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110119507A (en) * 2018-02-05 2019-08-13 阿里巴巴集团控股有限公司 Term vector generation method, device and equipment
CN110162766B (en) * 2018-02-12 2023-03-24 深圳市腾讯计算机系统有限公司 Word vector updating method and device
CN108733647B (en) * 2018-04-13 2022-03-25 中山大学 Word vector generation method based on Gaussian distribution
CN109582771B (en) * 2018-11-26 2022-11-25 国网湖南省电力有限公司 Intelligent customer interaction method based on mobile application and oriented to electric power field
CN110096697B (en) * 2019-03-15 2022-04-12 华为技术有限公司 Word vector matrix compression method and device, and method and device for obtaining word vectors
CN110287337A (en) * 2019-06-19 2019-09-27 上海交通大学 The system and method for medicine synonym is obtained based on deep learning and knowledge mapping
CN110781678B (en) * 2019-10-14 2022-09-20 大连理工大学 Text representation method based on matrix form
CN111241819B (en) * 2020-01-07 2023-03-14 北京百度网讯科技有限公司 Word vector generation method and device and electronic equipment

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017059205A (en) * 2015-09-17 2017-03-23 パナソニックIpマネジメント株式会社 Subject estimation system, subject estimation method, and program

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101482860B (en) * 2008-01-09 2010-12-01 中国科学院自动化研究所 Automatic extraction and filtration method for Chinese-English phrase translation pairs
US8504361B2 (en) * 2008-02-07 2013-08-06 Nec Laboratories America, Inc. Deep neural networks and methods for using same
CN101685441A (en) * 2008-09-24 2010-03-31 中国科学院自动化研究所 Generalized reordering statistic translation method and device based on non-continuous phrase
CN102231278B (en) * 2011-06-10 2013-08-21 安徽科大讯飞信息科技股份有限公司 Method and system for realizing automatic addition of punctuation marks in speech recognition
CN103324612B (en) * 2012-03-22 2016-06-29 北京百度网讯科技有限公司 A kind of method of participle and device
CN102930055B (en) * 2012-11-18 2015-11-04 浙江大学 The network new word discovery method of the connecting inner degree of polymerization and external discrete information entropy
CN103530284B (en) * 2013-09-22 2016-07-06 中国专利信息中心 Short sentence cutting device, machine translation system and corresponding cutting method and interpretation method
CN104317965B (en) * 2014-11-14 2018-04-03 南京理工大学 Sentiment dictionary construction method based on language material
CN106682220A (en) * 2017-01-04 2017-05-17 华南理工大学 Online traditional Chinese medicine text named entity identifying method based on deep learning

Also Published As

Publication number Publication date
CN107608953A (en) 2018-01-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant