CN109670171B - Word vector representation learning method based on word pair asymmetric co-occurrence - Google Patents
- Publication number
- CN109670171B (application CN201811413427.9A)
- Authority
- CN
- China
- Prior art keywords
- word
- occurrence
- vector representation
- corpus
- low
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Abstract
The invention belongs to the field of natural language processing, and particularly relates to a word vector representation learning method based on word pair asymmetric co-occurrence. The method comprises the following steps. S100, counting a word list from the corpus: count the occurrence frequency of each word in a given corpus and order the words from high to low by frequency. S200, sequentially traversing the words in the corpus and counting a left co-occurrence matrix and a right co-occurrence matrix, denoted X_L and X_R. S300, setting the model hyper-parameters and, adopting the objective function of the GloVe model, training from X_L and X_R respectively the left low-dimensional vector representation V_L and the right low-dimensional vector representation V_R of each word, then concatenating them to obtain the low-dimensional vector representation V = [V_L, V_R]. The invention adopts a parallel computing method to train word vectors from the two co-occurrence matrices simultaneously, greatly reducing program running time.
Description
Technical Field
The invention belongs to the field of natural language processing, and particularly relates to a word vector representation learning method based on word pair asymmetric co-occurrence.
Background
In the field of natural language processing, there are many ways to represent words inside computers; the typical ones are the following:
1) One-hot representation, used in conventional rule-based and statistical natural language processing methods. Each word is represented as a vector whose length is the vocabulary size; exactly one dimension has the value 1, identifying the current word, and all other dimensions are 0. Such a representation is not conducive to semantic computation over words.
2) Distributional representation. The vector length is again the vocabulary size. The vectors are obtained by counting a co-occurrence matrix from a corpus: each row of the matrix corresponds to a word, each column also corresponds to a word, and each element records how often the two words co-occur in the corpus. Each row of the matrix is then the word vector of the corresponding word. This representation carries more semantic information than the one-hot representation.
3) Distributed representation: a low-dimensional dense vector obtained by reducing the dimensionality of the distributional representation through various methods. It overcomes the shortcomings of the distributional representation and supports semantic computation better.
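The three representations can be contrasted in a short Python sketch (a toy illustration with an invented three-word vocabulary and made-up dense vectors, not code from the patent):

```python
import numpy as np

vocab = ["cat", "dog", "apple"]            # toy vocabulary, size n = 3
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # 1) one-hot: a single 1 at the word's position, 0 elsewhere
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

# Any two distinct one-hot vectors are orthogonal, so this encoding
# carries no notion of similarity between words.
assert one_hot("cat") @ one_hot("dog") == 0.0

# 3) distributed: low-dimensional dense vectors (values made up here)
# do allow semantic comparison, e.g. via cosine similarity.
dense = {"cat": np.array([0.9, 0.1]), "dog": np.array([0.8, 0.2])}
cos = dense["cat"] @ dense["dog"] / (
    np.linalg.norm(dense["cat"]) * np.linalg.norm(dense["dog"])
)
```

With these made-up vectors, "cat" and "dog" come out highly similar under the dense encoding while remaining orthogonal under one-hot, which is exactly the gap the low-dimensional representations close.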
The low-dimensional word representation method based on the GloVe model is one of the main representation learning methods at present; the GloVe model's learning algorithm is relatively simple, efficient, and easy to implement, and the trained word vectors perform well on semantic similarity and word inference tasks.
Detailed description of the Glove model is described in the following documents:
Pennington J, Socher R, Manning C. GloVe: Global Vectors for Word Representation [C] // Conference on Empirical Methods in Natural Language Processing. 2014: 1532-1543.
The GloVe model mainly comprises the following steps: set a fixed window size, take the words within the fixed window on both sides of each word (the target word) as its context, count co-occurrence frequencies to generate a co-occurrence matrix, and then train with stochastic gradient descent to obtain the vector representation of each word. Although the model performs well, it does not consider word order: when counting the co-occurrence matrix of a target word, the words on its left and right sides are not treated differently but are mixed together as the target word's context, so the precision of the word vectors trained from such a co-occurrence matrix can be further improved.
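For reference, the symmetric windowing described above can be sketched as follows (a minimal reading of the GloVe counting step with 1/distance weighting; the function and variable names are our own, not from the GloVe code):

```python
from collections import defaultdict

def symmetric_cooccurrence(tokens, window=2):
    """Count weighted co-occurrences, mixing left and right context together."""
    X = defaultdict(float)
    for i in range(len(tokens)):
        target = tokens[i]
        for j in range(max(0, i - window), i):          # left context
            X[(target, tokens[j])] += 1.0 / (i - j)     # weight = 1/distance
        for j in range(i + 1, min(len(tokens), i + window + 1)):  # right context
            X[(target, tokens[j])] += 1.0 / (j - i)     # mixed into the same count
    return X

X = symmetric_cooccurrence("a b a c".split(), window=1)
```

Note that left and right contributions land in the same entry, so the resulting counts are symmetric: X[("a", "b")] equals X[("b", "a")]. This is precisely the order information the patent's left/right split preserves.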
Disclosure of Invention
In order to solve the problems, the invention provides a word vector representation learning method based on word pair asymmetric co-occurrence.
The invention adopts the following technical scheme: a word vector representation learning method based on word pair asymmetric co-occurrence comprises the following steps.
S100, counting a word list from the corpus: count the number of occurrences of each word in a given corpus and order the words by frequency from high to low; c_i denotes the i-th word and f_i its frequency, with 1 ≤ i ≤ n, where n is the number of distinct words in the corpus.
S200, setting the fixed window size to w, sequentially traversing the words in the corpus, and counting a left co-occurrence matrix and a right co-occurrence matrix, denoted X_L and X_R; both matrices are of size n × n.
The rows of each matrix are indexed by the words' sequence numbers in the vocabulary, and so are the columns; the position of the k-th co-occurrence of c_i and c_j in the corpus determines the weight contributed to the corresponding matrix entry.
The process of counting the left co-occurrence matrix and the right co-occurrence matrix is as follows:
S201, initialize the matrices X_L and X_R to 0;
S202, traverse each word in the corpus and find its sequence number i in the word list;
S203, traverse each word co-occurring to the left of the current word within the fixed window, find its sequence number j in the word list, compute a weight from the relative positions of word i and word j, and accumulate it into X^L_{ij}; at the same time, accumulate the same weight into X^R_{ji}. After the traversal finishes, the left co-occurrence matrix X_L and the right co-occurrence matrix X_R have been generated.
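Steps S201 to S203 can be sketched as follows (a hedged reading of the patent: the weight is assumed here to be the reciprocal of the distance between the two words, which the text does not spell out beyond "calculated from the relative position"; names are our own):

```python
import numpy as np

def asymmetric_cooccurrence(token_ids, n, window):
    """Build separate left (X_L) and right (X_R) co-occurrence matrices."""
    XL = np.zeros((n, n))                  # S201: initialize both matrices to 0
    XR = np.zeros((n, n))
    for pos, i in enumerate(token_ids):    # S202: traverse the corpus
        for d in range(1, window + 1):     # S203: words within the left window
            if pos - d < 0:
                break
            j = token_ids[pos - d]
            w = 1.0 / d                    # assumed 1/distance weighting
            XL[i, j] += w                  # j is left context of i ...
            XR[j, i] += w                  # ... so i is right context of j
    return XL, XR

XL, XR = asymmetric_cooccurrence([0, 1, 0], n=2, window=1)
```

Scanning only the left window and mirroring each pair into X_R covers every ordered co-occurrence exactly once per pass; by construction X_R is the transpose of X_L, but the two matrices feed two separate trainings.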
S300, setting the model hyper-parameters, adopting the objective function of the GloVe model, and training from X_L and X_R respectively the left low-dimensional vector representation V_L and the right low-dimensional vector representation V_R of each word, then concatenating them to obtain the low-dimensional vector representation V = [V_L, V_R].
Training V_L, the objective function is:

$$J_L = \sum_{i,j=1}^{n} f\left(X_{ij}^{L}\right)\left( {v_i^{L}}^{\top} \tilde{v}_j^{L} + b_i^{L} + \tilde{b}_j^{L} - \log X_{ij}^{L} \right)^2$$

where $v_i^{L}$ and $\tilde{v}_j^{L}$ respectively denote the left low-dimensional word vectors of $c_i$ and $c_j$, $b_i^{L}$ and $\tilde{b}_j^{L}$ are the corresponding bias terms, and $f(X_{ij}^{L})$ is a weighting function that weights each term of the objective according to the co-occurrence frequency of the word pair.
Training V_R, the objective function is:

$$J_R = \sum_{i,j=1}^{n} f\left(X_{ij}^{R}\right)\left( {v_i^{R}}^{\top} \tilde{v}_j^{R} + b_i^{R} + \tilde{b}_j^{R} - \log X_{ij}^{R} \right)^2$$

where $v_i^{R}$ and $\tilde{v}_j^{R}$ respectively denote the right low-dimensional word vectors of $c_i$ and $c_j$, $b_i^{R}$ and $\tilde{b}_j^{R}$ are the corresponding bias terms, and $f(X_{ij}^{R})$ is a weighting function that weights each term of the objective according to the co-occurrence frequency of the word pair.

$f(X_{ij}^{L})$ and $f(X_{ij}^{R})$ are weighted in the same way as in the GloVe model; the function is:

$$f(x) = \begin{cases} (x/x_{\max})^{\alpha}, & x < x_{\max} \\ 1, & x \ge x_{\max} \end{cases}$$
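That weighting function, taken unchanged from the GloVe paper (where the defaults are x_max = 100 and alpha = 3/4), can be written as:

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(x) = (x / x_max)^alpha for x < x_max, else 1.

    Rare pairs are down-weighted; very frequent pairs are capped at 1 so
    that stop-word co-occurrences do not dominate the objective.
    """
    return (x / x_max) ** alpha if x < x_max else 1.0
```

Note that f(0) = 0, so word pairs that never co-occur contribute nothing to the objective, which also keeps the log term well defined since those entries are simply skipped.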
Compared with the prior art, the invention provides a new windowing mode in which the words in fixed windows before and after the target word are taken as separate contexts, and effectively fuses the word vectors trained under the two windowing modes into a single word representation vector. This improves word vector precision, yields a clear accuracy gain on public test sets for the grammatical word inference task, and lends itself to parallel computation.
The invention improves the way the GloVe model counts the co-occurrence matrix, with three main advantages:
1. an asymmetric mode statistical method of the word pair co-occurrence is provided, and a left co-occurrence matrix and a right co-occurrence matrix are counted.
2. An effective fusion mode of vectors trained by two co-occurrence matrixes is provided, and word expression vectors with higher precision than those under a symmetrical window under the same dimensionality can be obtained.
3. The word vectors are trained by two co-occurrence matrixes simultaneously by adopting a parallel computing method, so that the running time of a program is greatly reduced.
Drawings
FIG. 1 is a flow chart of the present invention.
Fig. 2 is a flow chart for generating a left-side co-occurrence matrix and a right-side co-occurrence matrix.
Detailed Description
As shown in Fig. 1, a word vector representation learning method based on word pair asymmetric co-occurrence includes the following steps.
S100, counting a word list from the corpus: count the number of occurrences of each word in a given corpus and order the words by frequency from high to low; c_i denotes the i-th word and f_i its frequency, with 1 ≤ i ≤ n, where n is the number of distinct words in the corpus.
S200, setting the fixed window size to w, sequentially traversing the words in the corpus, and counting a left co-occurrence matrix and a right co-occurrence matrix, denoted X_L and X_R; both matrices are of size n × n;
The rows of each matrix are indexed by the words' sequence numbers in the vocabulary, and so are the columns; the position of the k-th co-occurrence of c_i and c_j in the corpus determines the weight contributed to the corresponding matrix entry.
The process of counting the left co-occurrence matrix and the right co-occurrence matrix is as follows:
S201, initialize the matrices X_L and X_R to 0;
S202, traverse each word in the corpus and find its sequence number i in the word list;
S203, traverse each word co-occurring to the left of the current word within the fixed window, find its sequence number j in the word list, compute a weight from the relative positions of word i and word j, and accumulate it into X^L_{ij}; at the same time, accumulate the same weight into X^R_{ji}. After the traversal finishes, the left co-occurrence matrix X_L and the right co-occurrence matrix X_R have been generated.
S300, setting the model hyper-parameters, adopting the objective function of the GloVe model, and training from X_L and X_R respectively the left low-dimensional vector representation V_L and the right low-dimensional vector representation V_R of each word, then concatenating them to obtain the low-dimensional vector representation V = [V_L, V_R].
Training V_L, the objective function is:

$$J_L = \sum_{i,j=1}^{n} f\left(X_{ij}^{L}\right)\left( {v_i^{L}}^{\top} \tilde{v}_j^{L} + b_i^{L} + \tilde{b}_j^{L} - \log X_{ij}^{L} \right)^2$$

where $v_i^{L}$ and $\tilde{v}_j^{L}$ respectively denote the left low-dimensional word vectors of $c_i$ and $c_j$, $b_i^{L}$ and $\tilde{b}_j^{L}$ are the corresponding bias terms, and $f(X_{ij}^{L})$ is a weighting function that weights each term of the objective according to the co-occurrence frequency of the word pair;
training V_R, the objective function is:

$$J_R = \sum_{i,j=1}^{n} f\left(X_{ij}^{R}\right)\left( {v_i^{R}}^{\top} \tilde{v}_j^{R} + b_i^{R} + \tilde{b}_j^{R} - \log X_{ij}^{R} \right)^2$$

where $v_i^{R}$ and $\tilde{v}_j^{R}$ respectively denote the right low-dimensional word vectors of $c_i$ and $c_j$, $b_i^{R}$ and $\tilde{b}_j^{R}$ are the corresponding bias terms, and $f(X_{ij}^{R})$ is a weighting function that weights each term of the objective according to the co-occurrence frequency of the word pair.

$f(X_{ij}^{L})$ and $f(X_{ij}^{R})$ are weighted in the same way as in the GloVe model; the function is:

$$f(x) = \begin{cases} (x/x_{\max})^{\alpha}, & x < x_{\max} \\ 1, & x \ge x_{\max} \end{cases}$$
Example:
1. Select the English Wikipedia corpus and generate a vocabulary from the 100000 most frequent words.
2. Set the fixed window size to 10, and count for each word in the corpus the ten words before it and the ten words after it separately, obtaining the left and right co-occurrence matrices X_L and X_R.
3. Set the initial learning rate to 0.05 and the number of iterations to 50; train from X_L and X_R respectively a 300-dimensional left word vector representation V_L and a 300-dimensional right word vector representation V_R, then concatenate them to obtain a 600-dimensional word vector representation.
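The splicing in step 3 amounts to a row-wise concatenation; a sketch with random placeholders standing in for the trained vectors (and a smaller vocabulary than the 100000-word example, for brevity):

```python
import numpy as np

n, d = 1000, 300                        # vocabulary size, per-side dimension
rng = np.random.default_rng(0)
VL = rng.standard_normal((n, d))        # placeholder for the trained left vectors
VR = rng.standard_normal((n, d))        # placeholder for the trained right vectors

V = np.concatenate([VL, VR], axis=1)    # V = [V_L, V_R]: each word's final
                                        # vector is its left and right halves
```

Each row of V is a 600-dimensional word vector whose first 300 components come from the left-context training and whose last 300 come from the right-context training.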
Table 1 compares, on the grammar-based word inference task, the word vector representation trained by this method with that trained by the GloVe model. The GloVe model uses a symmetric window with fixed window size 10, initial learning rate 0.05, 50 iterations, and 600-dimensional word vectors. Four corpora of different sizes were split from the English Wikipedia corpus, containing 200 million, 500 million, 1 billion, and 1.6 billion words respectively, with file sizes of 1.09 GB, 2.71 GB, 5.42 GB, and 8.64 GB. The data in the table compare the accuracy of the 600-dimensional word vectors trained by the invention and by the GloVe model on the grammatical word inference task.
TABLE 1. Comparison of the present invention and the GloVe model on the grammar-based word inference task
The experimental results show that the method achieves higher task accuracy than the GloVe model on corpora of all sizes. Moreover, when training word vectors of the same final dimensionality, the method uses parallel processing to train V_L and V_R simultaneously and then splices them into the word vector V = [V_L, V_R]; since V_L and V_R each have half the dimensionality of the word vector trained by the GloVe model, training time is greatly reduced.
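Because X_L and X_R define two independent optimizations, the two trainings can run concurrently; a minimal sketch (the `train` function below is a placeholder, not the patent's actual trainer):

```python
from concurrent.futures import ThreadPoolExecutor

def train(matrix_name):
    # placeholder standing in for optimizing the GloVe objective on one matrix
    return f"300-d vectors from {matrix_name}"

# X_L and X_R share no parameters, so the two trainings can proceed in
# parallel; each produces vectors of half the final dimensionality.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(train, ["X_L", "X_R"]))
```

In practice a real speedup in Python would require processes (or a multi-threaded C implementation, as the GloVe reference code uses) rather than threads, because of the interpreter lock; the sketch only illustrates the independence of the two jobs.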
Claims (3)
1. A word vector representation learning method based on word pair asymmetric co-occurrence, characterized in that it comprises the following steps:
S100, counting a word list from the corpus: count the number of occurrences of each word in a given corpus and order the words by frequency from high to low; c_i denotes the i-th word and f_i its frequency, with 1 ≤ i ≤ n, where n is the number of distinct words in the corpus;
S200, setting the fixed window size to w, sequentially traversing the words in the corpus, and counting a left co-occurrence matrix and a right co-occurrence matrix, denoted X_L and X_R; both matrices are of size n × n;
the rows of each matrix are indexed by the words' sequence numbers in the vocabulary, and so are the columns; the position of the k-th co-occurrence of c_i and c_j in the corpus determines the weight contributed to the corresponding matrix entry;
S300, setting the model hyper-parameters, adopting the objective function of the GloVe model, and training from X_L and X_R respectively the left low-dimensional vector representation V_L and the right low-dimensional vector representation V_R of each word, then concatenating them to obtain the low-dimensional vector representation V = [V_L, V_R].
2. The word vector representation learning method based on word pair asymmetric co-occurrence according to claim 1, characterized in that: in step S200, the process of counting the left co-occurrence matrix and the right co-occurrence matrix is as follows:
S201, initialize the matrices X_L and X_R to 0;
S202, traverse each word in the corpus and find its sequence number i in the word list;
S203, traverse each word co-occurring to the left of the current word within the fixed window, find its sequence number j in the word list, compute a weight according to the relative positions of c_i and c_j, and accumulate it into X^L_{ij}; at the same time, accumulate the same weight into X^R_{ji}; after the traversal finishes, the left co-occurrence matrix X_L and the right co-occurrence matrix X_R have been generated.
3. The word vector representation learning method based on word pair asymmetric co-occurrence according to claim 2, characterized in that: the step S300 specifically adopts the following method,
training V_L, the objective function is:

$$J_L = \sum_{i,j=1}^{n} f\left(X_{ij}^{L}\right)\left( {v_i^{L}}^{\top} \tilde{v}_j^{L} + b_i^{L} + \tilde{b}_j^{L} - \log X_{ij}^{L} \right)^2$$

where $v_i^{L}$ and $\tilde{v}_j^{L}$ respectively denote the left low-dimensional word vectors of $c_i$ and $c_j$, $b_i^{L}$ and $\tilde{b}_j^{L}$ are the corresponding bias terms, and $f(X_{ij}^{L})$ is a weighting function that weights each term of the objective according to the co-occurrence frequency of the word pair;

training V_R, the objective function is:

$$J_R = \sum_{i,j=1}^{n} f\left(X_{ij}^{R}\right)\left( {v_i^{R}}^{\top} \tilde{v}_j^{R} + b_i^{R} + \tilde{b}_j^{R} - \log X_{ij}^{R} \right)^2$$

where $v_i^{R}$ and $\tilde{v}_j^{R}$ respectively denote the right low-dimensional word vectors of $c_i$ and $c_j$, $b_i^{R}$ and $\tilde{b}_j^{R}$ are the corresponding bias terms, and $f(X_{ij}^{R})$ is a weighting function that weights each term of the objective according to the co-occurrence frequency of the word pair.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811413427.9A CN109670171B (en) | 2018-11-23 | 2018-11-23 | Word vector representation learning method based on word pair asymmetric co-occurrence |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811413427.9A CN109670171B (en) | 2018-11-23 | 2018-11-23 | Word vector representation learning method based on word pair asymmetric co-occurrence |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109670171A CN109670171A (en) | 2019-04-23 |
CN109670171B true CN109670171B (en) | 2021-05-14 |
Family
ID=66142590
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811413427.9A Active CN109670171B (en) | 2018-11-23 | 2018-11-23 | Word vector representation learning method based on word pair asymmetric co-occurrence |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109670171B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110781686B (en) * | 2019-10-30 | 2023-04-18 | 普信恒业科技发展(北京)有限公司 | Statement similarity calculation method and device and computer equipment |
CN111859910B (en) * | 2020-07-15 | 2022-03-18 | 山西大学 | Word feature representation method for semantic role recognition and fusing position information |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106682089A (en) * | 2016-11-26 | 2017-05-17 | 山东大学 | RNNs-based method for automatic safety checking of short message |
CN107577668A (en) * | 2017-09-15 | 2018-01-12 | 电子科技大学 | Social media non-standard word correcting method based on semanteme |
Family Cites Families (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9779085B2 (en) * | 2015-05-29 | 2017-10-03 | Oracle International Corporation | Multilingual embeddings for natural language processing |
US9880999B2 (en) * | 2015-07-03 | 2018-01-30 | The University Of North Carolina At Charlotte | Natural language relatedness tool using mined semantic analysis |
CN105243083B (en) * | 2015-09-08 | 2018-09-07 | 百度在线网络技术(北京)有限公司 | Document subject matter method for digging and device |
US20170161275A1 (en) * | 2015-12-08 | 2017-06-08 | Luminoso Technologies, Inc. | System and method for incorporating new terms in a term-vector space from a semantic lexicon |
US10019438B2 (en) * | 2016-03-18 | 2018-07-10 | International Business Machines Corporation | External word embedding neural network language models |
CN107220220A (en) * | 2016-03-22 | 2017-09-29 | 索尼公司 | Electronic equipment and method for text-processing |
CN106844342B (en) * | 2017-01-12 | 2019-10-08 | 北京航空航天大学 | Term vector generation method and device based on incremental learning |
US20180260381A1 (en) * | 2017-03-09 | 2018-09-13 | Xerox Corporation | Prepositional phrase attachment over word embedding products |
CN108460022A (en) * | 2018-03-20 | 2018-08-28 | 福州大学 | A kind of text Valence-Arousal emotional intensities prediction technique and system |
CN108399163B (en) * | 2018-03-21 | 2021-01-12 | 北京理工大学 | Text similarity measurement method combining word aggregation and word combination semantic features |
CN108829667A (en) * | 2018-05-28 | 2018-11-16 | 南京柯基数据科技有限公司 | It is a kind of based on memory network more wheels dialogue under intension recognizing method |
CN108829672A (en) * | 2018-06-05 | 2018-11-16 | 平安科技(深圳)有限公司 | Sentiment analysis method, apparatus, computer equipment and the storage medium of text |
CN108694476A (en) * | 2018-06-29 | 2018-10-23 | 山东财经大学 | A kind of convolutional neural networks Stock Price Fluctuation prediction technique of combination financial and economic news |
- 2018-11-23 CN CN201811413427.9A patent/CN109670171B/en active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106682089A (en) * | 2016-11-26 | 2017-05-17 | 山东大学 | RNNs-based method for automatic safety checking of short message |
CN107577668A (en) * | 2017-09-15 | 2018-01-12 | 电子科技大学 | Social media non-standard word correcting method based on semanteme |
Non-Patent Citations (4)
Title |
---|
Chinese sign language recognition based on gray-level co-occurrence matrix and other multi-features fusion; Yulong Li et al.; 2009 4th IEEE Conference on Industrial Electronics and Applications; 2009-06-30; full text *
Topic mover's distance based document classification; Xinhui Wu et al.; 2017 IEEE 17th International Conference on Communication Technology (ICCT); 2018-05-17; full text *
Sentiment prediction for image-text fusion media based on convolutional neural networks; Cai Guoyong et al.; Journal of Computer Applications; 2016-02-10; Vol. 36, No. 2; full text *
An attention model for sentiment analysis using recurrent neural networks; Li Songru et al.; Journal of Huaqiao University (Natural Science); 2018-03-31; Vol. 39, No. 2; full text *
Also Published As
Publication number | Publication date |
---|---|
CN109670171A (en) | 2019-04-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107291693B (en) | Semantic calculation method for improved word vector model | |
CN110222349B (en) | Method and computer for deep dynamic context word expression | |
CN107273355B (en) | Chinese word vector generation method based on word and phrase joint training | |
WO2020062770A1 (en) | Method and apparatus for constructing domain dictionary, and device and storage medium | |
CN107358948B (en) | Language input relevance detection method based on attention model | |
CN106202010B (en) | Method and apparatus based on deep neural network building Law Text syntax tree | |
CN106547737B (en) | Sequence labeling method in natural language processing based on deep learning | |
CN108446271B (en) | Text emotion analysis method of convolutional neural network based on Chinese character component characteristics | |
CN109635124A (en) | A kind of remote supervisory Relation extraction method of combination background knowledge | |
CN111291556B (en) | Chinese entity relation extraction method based on character and word feature fusion of entity meaning item | |
CN110222178A (en) | Text sentiment classification method, device, electronic equipment and readable storage medium storing program for executing | |
Huang et al. | Sndcnn: Self-normalizing deep cnns with scaled exponential linear units for speech recognition | |
CN107480143A (en) | Dialogue topic dividing method and system based on context dependence | |
CN111966812B (en) | Automatic question answering method based on dynamic word vector and storage medium | |
CN107526834A (en) | Joint part of speech and the word2vec improved methods of the correlation factor of word order training | |
CN107766320A (en) | A kind of Chinese pronoun resolution method for establishing model and device | |
CN110826338A (en) | Fine-grained semantic similarity recognition method for single-choice gate and inter-class measurement | |
CN110489554B (en) | Attribute-level emotion classification method based on location-aware mutual attention network model | |
CN113204674B (en) | Video-paragraph retrieval method and system based on local-overall graph inference network | |
CN110858480B (en) | Speech recognition method based on N-element grammar neural network language model | |
CN109670171B (en) | Word vector representation learning method based on word pair asymmetric co-occurrence | |
CN115438154A (en) | Chinese automatic speech recognition text restoration method and system based on representation learning | |
CN110874392B (en) | Text network information fusion embedding method based on depth bidirectional attention mechanism | |
CN114254645A (en) | Artificial intelligence auxiliary writing system | |
Rei | Online representation learning in recurrent neural language models |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | ||
TR01 | Transfer of patent right |
Effective date of registration: 2023-10-11
Address after: Room 305, Yunlv Tianxia Maker Space, AB Podium Building, No. 529 South Zhonghuan Street, Xuefu Industrial Park, Shanxi Transformation and Comprehensive Reform Demonstration Zone, Taiyuan City, Shanxi Province, 030006
Patentee after: Shanxi Zhonghuida Technology Co.,Ltd.
Address before: 030006 No. 92, Hollywood Road, Taiyuan, Shanxi
Patentee before: SHANXI University