CN107239444B - Word vector training method and system fusing part of speech and position information - Google Patents

Word vector training method and system fusing part of speech and position information

Info

Publication number
CN107239444B
CN107239444B CN201710384135.6A
Authority
CN
China
Prior art keywords
speech
word
matrix
target
row
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710384135.6A
Other languages
Chinese (zh)
Other versions
CN107239444A (en)
Inventor
文坤梅
李瑞轩
刘其磊
李玉华
辜希武
昝杰
杨琪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN201710384135.6A priority Critical patent/CN107239444B/en
Publication of CN107239444A publication Critical patent/CN107239444A/en
Application granted granted Critical
Publication of CN107239444B publication Critical patent/CN107239444B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a word vector training method and system fusing part-of-speech and position information. The method comprises: preprocessing data to obtain a target text; performing word segmentation and part-of-speech tagging on the target text; modeling the part-of-speech information and the position information; and fusing part-of-speech and position information on the basis of a skip-gram model with a negative sampling strategy to learn word vectors and obtain target word vectors, which are used for evaluation on word analogy tasks and word similarity tasks. The invention considers the part-of-speech and position information of words and, on the basis of modeling both, makes full use of the part-of-speech information of words and the position information between parts of speech to assist the training of word vectors; the updating of parameters during training is also more reasonable.

Description

Word vector training method and system fusing part of speech and position information
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a word vector training method and system fusing part of speech and position information.
Background
In recent years, with the rapid development of mobile internet technology, both the scale and the complexity of data in the internet have grown rapidly. This makes the processing and analysis of such massive unstructured and unlabeled data a difficult problem.
Traditional machine learning adopts feature engineering to symbolize data so as to facilitate modeling and model solving, but the common bag-of-words representation techniques in feature engineering, such as One-hot vectors, suffer from rapidly growing feature dimensionality as data complexity increases, which causes the dimension disaster problem. Methods based on One-hot vector representation also exhibit the semantic gap phenomenon. After the distributional hypothesis ("if two words are similar in context, their semantics are also similar") was proposed, word distribution representation techniques based on it have been proposed continuously. The most important among them are matrix-based distribution representation, cluster-based distribution representation and word-vector-based distribution representation. However, distribution representation methods based on either matrices or clusters can express only simple context information when the feature dimension is small; when the feature dimension is high, the model cannot express the context, especially complex contexts. Word-vector-based representation techniques, by contrast, avoid the dimension disaster problem both for the representation of each word and for the representation of a word's context through linear combination. And because the distance between words can be measured by the cosine or Euclidean distance between the corresponding word vectors, the semantic gap problem of the traditional bag-of-words model is eliminated to a great extent.
However, most existing word vector research focuses on reducing model complexity by simplifying the neural network structure in the model, and some work integrates sentiment, topic and other information; research work that integrates part-of-speech information is very scarce, and what little exists uses a coarse part-of-speech granularity, does not make full use of the part-of-speech information, and does not update the part-of-speech information reasonably.
Disclosure of Invention
In view of the above defects or improvement requirements of the prior art, the present invention aims to provide a word vector training method and system fusing part of speech and position information, so as to solve the technical problems in the prior art that research work fusing part-of-speech information targets a coarse part-of-speech granularity, makes insufficient use of part-of-speech information, and updates part-of-speech information unreasonably.
To achieve the above object, according to an aspect of the present invention, there is provided a word vector training method fusing part of speech and position information, including the steps of:
s1, preprocessing the original text to obtain a target text;
S2, according to the context information of the words, tagging the parts of speech of the words in the target text with the parts of speech in the part-of-speech tag set;
S3, modeling according to the tagged part-of-speech information to construct a part-of-speech associated weight matrix M, modeling the relative position i of the corresponding word pairs according to their parts of speech, and constructing a position part-of-speech associated weight matrix M'_i for each position, wherein the row and column dimensions of the matrix M equal the number of part-of-speech categories in the part-of-speech tag set, each element of the matrix M is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column, the matrix M'_i has the same row and column dimensions as the matrix M, and each element of M'_i is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column at relative position i;
S4, fusing the modeled matrix M and matrices M'_i into a skip-gram word vector model to construct a target model, and performing word vector learning with the target model to obtain target word vectors, wherein the target word vectors are used for word analogy tasks and word similarity tasks.
Preferably, step S2 specifically includes the following sub-steps:
s2.1, segmenting the target text to distinguish all words in the target text;
and S2.2, for each sentence in the target text, according to the context information of the word in the sentence, performing part-of-speech tagging on the word by using the part-of-speech in the part-of-speech tagging set.
Preferably, step S3 specifically includes the following sub-steps:
S3.1, generating, for each word in the target text, a word/part-of-speech pair formed by the word and its corresponding part of speech, and constructing the part-of-speech associated weight matrix M from these pairs, wherein the row and column dimensions of the matrix M equal the number of part-of-speech categories in the part-of-speech tag set, and each element of the matrix M is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column;
S3.2, modeling the relative position i of the corresponding word pairs according to their parts of speech, and constructing the position part-of-speech associated weight matrix M'_i corresponding to each position, wherein the matrix M'_i has the same row and column dimensions as the matrix M, and each element of M'_i is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column at relative position i.
Preferably, step S4 specifically includes the following sub-steps:
S4.1, constructing an initial objective function:

$$\mathcal{L}_0 = \sum_{w \in C} \sum_{\tilde{w} \in Context(w)} \log p(\tilde{w} \mid w)$$

wherein C represents the vocabulary of the whole training corpus, Context(w) represents the context word set consisting of the c words before and after the target word w, and c represents the window size;
S4.2, fusing the modeled matrix M and matrices M'_i into a negative-sampling-based skip-gram word vector model to construct a target model, and constructing a new objective function of the target model from the initial objective function:

$$\mathcal{L} = \sum_{w \in C} \sum_{\tilde{w} \in Context(w)} \sum_{u \in \{w\} \cup NEG(w)} \Big\{ L^w(u)\,\log \sigma\big(M'_i(T_u, T_{\tilde{w}})\, v(\tilde{w})^\top \theta^u\big) + \big(1 - L^w(u)\big)\,\log\big[1 - \sigma\big(M'_i(T_u, T_{\tilde{w}})\, v(\tilde{w})^\top \theta^u\big)\big] \Big\}$$

wherein NEG(w) is the set of negative samples drawn for the target word w, L^w(u) is the label of sample u (a positive sample has label 1 and a negative sample has label 0), θ^u is the auxiliary vector used for the sample word u during model training, v(w̃)^⊤ is the transpose of the word vector v(w̃) corresponding to the context word w̃, σ(·) is the sigmoid function, and M'_i(T_u, T_{w̃}) is the co-occurrence probability of the two parts of speech T_u and T_{w̃} when their relative position relation is i;
S4.3, optimizing the new objective function so as to maximize its value, performing gradient calculation and updates on the parameters θ^u and v(w̃), and obtaining the target word vectors once the whole training corpus has been traversed.
According to another aspect of the present invention, there is provided a word vector training system fusing part of speech and position information, including:
the preprocessing module is used for preprocessing the original text to obtain a target text;
the part-of-speech tagging module is used for tagging the part of speech of the word in the target text by adopting the part of speech in the part-of-speech tagging set according to the context information of the word;
the position part-of-speech fusion module is used for modeling according to the tagged part-of-speech information to construct a part-of-speech associated weight matrix M, modeling the relative position i of the corresponding word pairs according to their parts of speech, and constructing a position part-of-speech associated weight matrix M'_i for each position, wherein the row and column dimensions of the matrix M equal the number of part-of-speech categories in the part-of-speech tag set, each element of the matrix M is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column, the matrix M'_i has the same row and column dimensions as the matrix M, and each element of M'_i is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column at relative position i;
a word vector learning module for fusing the modeled matrix M and matrices M'_i into a skip-gram word vector model to construct a target model, and performing word vector learning with the target model to obtain target word vectors, wherein the target word vectors are used for word analogy tasks and word similarity tasks.
Preferably, the part of speech tagging module comprises:
the word segmentation module is used for segmenting the target text to distinguish all words in the target text;
and the part-of-speech tagging submodule is used for performing part-of-speech tagging on each sentence in the target text by adopting the part-of-speech in the part-of-speech tagging set according to the context information of the word in the sentence.
Preferably, the location part-of-speech fusion module includes:
the part-of-speech information modeling module is used for generating, for each word in the target text, a word/part-of-speech pair formed by the word and its corresponding part of speech, and constructing the part-of-speech associated weight matrix M from these pairs, wherein the row and column dimensions of the matrix M equal the number of part-of-speech categories in the part-of-speech tag set, and each element of the matrix M is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column;
the position information modeling module is used for modeling the relative position i of the corresponding word pairs according to their parts of speech and constructing the position part-of-speech associated weight matrix M'_i corresponding to each position, wherein the matrix M'_i has the same row and column dimensions as the matrix M, and each element of M'_i is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column at relative position i.
Preferably, the word vector learning module comprises:
an initial objective function construction module, configured to construct an initial objective function:

$$\mathcal{L}_0 = \sum_{w \in C} \sum_{\tilde{w} \in Context(w)} \log p(\tilde{w} \mid w)$$

wherein C represents the vocabulary of the whole training corpus, Context(w) represents the context word set consisting of the c words before and after the target word w, and c represents the window size;
a new objective function construction module for fusing the modeled matrix M and matrices M'_i into a negative-sampling-based skip-gram word vector model to construct a target model, and constructing a new objective function of the target model from the initial objective function:

$$\mathcal{L} = \sum_{w \in C} \sum_{\tilde{w} \in Context(w)} \sum_{u \in \{w\} \cup NEG(w)} \Big\{ L^w(u)\,\log \sigma\big(M'_i(T_u, T_{\tilde{w}})\, v(\tilde{w})^\top \theta^u\big) + \big(1 - L^w(u)\big)\,\log\big[1 - \sigma\big(M'_i(T_u, T_{\tilde{w}})\, v(\tilde{w})^\top \theta^u\big)\big] \Big\}$$

wherein NEG(w) is the set of negative samples drawn for the target word w, L^w(u) is the label of sample u (a positive sample has label 1 and a negative sample has label 0), θ^u is the auxiliary vector used for the sample word u during model training, v(w̃)^⊤ is the transpose of the word vector v(w̃) corresponding to the context word w̃, σ(·) is the sigmoid function, and M'_i(T_u, T_{w̃}) is the co-occurrence probability of the two parts of speech T_u and T_{w̃} when their relative position relation is i;
a word vector learning submodule, used for optimizing the new objective function so as to maximize its value, performing gradient calculation and updates on the parameters θ^u and v(w̃), and obtaining the target word vectors once the whole training corpus has been traversed.
In general, compared with the prior art, the method of the invention can obtain the following beneficial effects:
(1) By constructing association matrices based on the part-of-speech association relation and the position association relation, the part-of-speech and position information among words can be well modeled.
(2) By fusing the well-modeled association matrices based on part-of-speech and position information into the negative-sampling-based skip-gram word vector learning model, a better word vector result can be obtained on the one hand, and on the other hand the association relation weights among the parts of speech in the corpus used for model training can also be obtained.
(3) Because the model adopts the optimization strategy of negative sampling, the training speed of the model is higher.
Drawings
Fig. 1 is a schematic flowchart of a word vector training method fusing part of speech and position information according to an embodiment of the present invention;
FIG. 2 is a modeling model diagram of part of speech and location information according to an embodiment of the present invention;
FIG. 3 is a simplified overall flow chart according to an embodiment of the present invention;
fig. 4 is a flowchart illustrating another word vector training method for fusing parts of speech and location information according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The invention provides a word vector learning method fusing part-of-speech and position information, addressing the fact that existing word vector learning methods ignore parts of speech and their importance in natural language. The method aims to take the part-of-speech association relation and the position relation between words into account on the basis of the original skip-gram model, so that the model can train word vector results fused with more information and the learned word vectors can better complete word analogy tasks and word similarity tasks.
Fig. 1 is a schematic flow chart of a word vector learning method with fusion of parts of speech and position information according to an embodiment of the present invention, where the method shown in fig. 1 includes the following steps:
s1, preprocessing the original text to obtain a target text;
Since a large amount of useless information such as XML tags, web page links and picture links exists in the obtained original text, and such useless information does not benefit the training of word vectors and can even become noise data that impairs word vector learning, it needs to be filtered out; the filtering can be done with a Perl script, for example.
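As an illustration, this filtering can be reproduced in a few lines. The following Python sketch is a stand-in for the Perl script mentioned above; its regular expressions are illustrative assumptions, not the patent's actual rules:

```python
import re

def clean_text(raw: str) -> str:
    """Remove useless information (XML tags, web page links, picture links)
    before word vector training; a minimal stand-in for the Perl script."""
    text = re.sub(r"<[^>]+>", " ", raw)        # XML/HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # web page and picture links
    return re.sub(r"\s+", " ", text).strip()   # normalize whitespace

print(clean_text('<doc>I buy an apple. See https://example.com</doc>'))
# -> 'I buy an apple. See'
```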
S2, according to the context information of the words, tagging the parts of speech of the words in the target text with the parts of speech in the part-of-speech tag set;
because part-of-speech information of words is used in the method provided by the invention, part-of-speech tagging needs to be performed on the text by using some part-of-speech tagging tools. In order to solve the problem that a word may have multiple parts of speech due to different contexts, the text can be part of speech tagged in advance, and the part of speech tagging can be performed by means of the context information of the text. Step S2 specifically includes the following substeps:
s2.1, segmenting the target text to distinguish all words in the target text;
the tokenize segmentation tool in openNLP can be used to segment the text, for example, if the common word "applet" in "I buy an applet" is not segmented, the common word "applet" becomes the word "applet", which does not exist, and the learning of the word vector is influenced.
And S2.2, for each sentence in the target text, according to the context information of the word in the sentence, performing part-of-speech tagging on the word by using the part-of-speech in the part-of-speech tagging set.
The part-of-speech tagging is performed on the whole sentence at one time, so that the multiple parts of speech of the same word can be distinguished according to the word's context. The parts of speech assigned to words here belong to the Penn Treebank POS tag set.
For example, the two sentences "I love you." and "She give her son too much love." after part-of-speech tagging:
i _ PRP (pronoun) love _ VBP (verb) you _ PRP. _;
she _ PRP (pronoun) give _ VB (verb) her _ PRP $ (pronoun) son _ NN _ to _ RB (adverb) much _ JJ (adjective) love _ NN.
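For illustration, the same style of annotation can be produced with NLTK's Penn Treebank tagger (a stand-in here, since the embodiment uses OpenNLP; that the tokenizer and tagger models are installed locally is an assumption):

```python
import nltk
# one-time setup, if the models are not yet present:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

sentence = "She give her son too much love."
tokens = nltk.word_tokenize(sentence)  # word segmentation, cf. step S2.1
tagged = nltk.pos_tag(tokens)          # Penn Treebank POS tags, cf. step S2.2
print(tagged)
# e.g. [('She', 'PRP'), ('give', 'VBP'), ('her', 'PRP$'), ('son', 'NN'),
#       ('too', 'RB'), ('much', 'JJ'), ('love', 'NN'), ('.', '.')]
```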
S3, modeling according to the tagged part-of-speech information to construct a part-of-speech associated weight matrix M, modeling the relative position i of the corresponding word pair according to the part-of-speech, and constructing a position part-of-speech associated weight matrix M 'corresponding to the position'iWherein, the dimension of the matrix M is the category size of the part of speech in the part of speech tagging set, the elements in the matrix M are the co-occurrence probability of the part of speech of the word corresponding to the row of the element and the part of speech of the word corresponding to the column of the element, and the matrix M'iIs the same as matrix M in row-column dimension of matrix M'iThe element in (1) is the co-occurrence probability of the part of speech of the word corresponding to the row of the element and the part of speech of the word corresponding to the column of the element at a relative position i; FIG. 2 is a modeling model diagram of part of speech and location information according to an embodiment of the present invention, wherein T is in a row and column0~TNDenotes the part of speech, M'i(Tt,Tt-2) Representing part of speech TtWith part of speech Tt-2Probability of co-occurrence at relative position i.
After the parts of speech of words are obtained, to let the part-of-speech information participate in the word vector learning model and to solve the new model, the part-of-speech information must first be modeled. The aim of the modeling is to establish a part-of-speech association relation matrix whose row and column dimensions equal the number of part-of-speech categories in the tag set, with the elements of the matrix being the co-occurrence probabilities of pairs of parts of speech. In addition, the position relation is modeled, because the positional relation between two parts of speech when they co-occur is important. Step S3 specifically includes the following substeps:
S3.1, generating, for each word in the target text, a word/part-of-speech pair formed by the word and its corresponding part of speech, and constructing the part-of-speech associated weight matrix M from these pairs, wherein the row and column dimensions of the matrix M equal the number of part-of-speech categories in the part-of-speech tag set, and each element of the matrix M is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column;
For example, for the word son in "She give her son too much love.", whose part of speech is NN, and the word her, whose part of speech is PRP, the element specified by the row corresponding to PRP and the column corresponding to NN in the matrix is the co-occurrence probability (i.e., weight) of the two parts of speech.
S3.2, modeling the relative position i of the corresponding word pairs according to their parts of speech, and constructing the position part-of-speech associated weight matrix M'_i corresponding to each position, wherein the matrix M'_i has the same row and column dimensions as the matrix M, and each element of M'_i is the co-occurrence probability (i.e., weight) of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column at relative position i.
For example, if the window size is 2c, then i ∈ [-c, c]. When the window size is 6, six matrices are established: M'_{-3}, M'_{-2}, M'_{-1}, M'_{1}, M'_{2}, M'_{3}.
For example, for son and her in "She give her son too much love.", when son is the target word, the associated weight of the parts of speech and the position corresponding to the two words is M'_{-1}(PRP, NN).
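A minimal sketch of this construction follows, assuming a corpus already tagged as lists of (word, part-of-speech) pairs and reading the matrix elements as normalized co-occurrence counts (the normalization scheme and the (context POS, target POS) row/column orientation are assumptions inferred from the example M'_{-1}(PRP, NN) above):

```python
from collections import Counter

def normalize(cnt):
    total = sum(cnt.values())
    return {k: v / total for k, v in cnt.items()} if total else {}

def build_pos_matrices(tagged_corpus, c=3):
    """Build M and M'_i (i in [-c, c], i != 0) as dicts mapping a
    (context_pos, target_pos) pair to a co-occurrence probability."""
    M = Counter()
    M_i = {i: Counter() for i in range(-c, c + 1) if i != 0}
    for sent in tagged_corpus:
        for t, (_, pos_t) in enumerate(sent):
            for i in M_i:
                if 0 <= t + i < len(sent):
                    pair = (sent[t + i][1], pos_t)  # (context POS, target POS)
                    M[pair] += 1
                    M_i[i][pair] += 1
    return normalize(M), {i: normalize(cnt) for i, cnt in M_i.items()}

sent = [("she", "PRP"), ("give", "VB"), ("her", "PRP$"), ("son", "NN"),
        ("too", "RB"), ("much", "JJ"), ("love", "NN"), (".", ".")]
M, M_i = build_pos_matrices([sent], c=3)
print(M_i[-1][("PRP$", "NN")])  # her one position before target son, cf. M'_{-1}(PRP, NN)
```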
S4, matrix M and matrix M 'after modeling'iAnd fusing the target word vectors into a skip-gram word vector model to construct a target model, and performing word vector learning by using the target model to obtain a target word vector, wherein the target word vector is used for a word analogy task and a word similarity task.
The step S4 specifically includes the following sub-steps:
S4.1, constructing an initial objective function:

$$\mathcal{L}_0 = \sum_{w \in C} \sum_{\tilde{w} \in Context(w)} \log p(\tilde{w} \mid w)$$

wherein C represents the vocabulary of the whole training corpus, Context(w) represents the context word set consisting of the c words before and after the target word w, and c represents the window size;
the Skip-gram model is the same in concept, namely, the target word w is passedtWord v (w) in the prediction contextt+i) Wherein i represents wt+iAnd wtThe positional relationship therebetween. With sample (Context (w)t),wt) For example, where | Context (w)t) 2c, where Context (w)t) Is composed of a word wtThe front and back words are composed of c words. The final optimization goal of the target model is still for the entire training corpus such that all the passing target words wtTo predict the probability maximization of the context word, i.e. to optimize the initial objective function.
For example, in the sample "She give her son too much love.", the word son is the target word w_t; if c is 3, then Context(w_t) = {she, give, her, too, much, love}.
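As a quick illustration of how such samples are read off a sentence (a sketch, not the patent's code):

```python
def context_pairs(tokens, c=3):
    """Yield (target word, Context(w_t)) pairs with c words on each side."""
    for t, w in enumerate(tokens):
        yield w, tokens[max(0, t - c):t] + tokens[t + 1:t + 1 + c]

tokens = ["she", "give", "her", "son", "too", "much", "love"]
print(dict(context_pairs(tokens))["son"])
# -> ['she', 'give', 'her', 'too', 'much', 'love']
```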
S4.2, modeling matrix M and matrix M'iFusing the target model into a skip-gram word vector model based on negative sampling to construct a target model, and constructing a new target function of the target model according to the initial target function:wherein,NEG (w) is a negative sample set, L, that samples the target word ww(u) is the score of sample u, with a positive sample score of 1, a negative sample score of 0, θuFor the auxiliary vectors used by the sample words during the model training process,as context wordsCorresponding word vectorThe transpose of (a) is performed,is TuAndthe co-occurrence probability of the two parts of speech when the relative position relation is i;
For example, for the sample "She give her son too much love.", the word son is a positive sample with label 1, while other words such as dog, flower, etc. are negative samples with label 0.
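A sketch of one training step under this objective follows. Note that the placement of the weight M'_i(T_u, T_{w̃}) inside the sigmoid is this text's reconstruction from the symbol definitions above, not a formula reproduced verbatim from the patent:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(v_ctx, theta, samples, m_weight, lr=0.025):
    """One fused negative-sampling update for a single context word w~.
    v_ctx    : word vector v(w~), updated in place
    theta    : dict u -> auxiliary vector theta^u, updated in place
    samples  : list of (u, label), label 1 for the target word w and
               0 for each word in NEG(w)
    m_weight : dict u -> M'_i(T_u, T_{w~}) at the current relative position i
    """
    e = np.zeros_like(v_ctx)
    for u, label in samples:
        m = m_weight[u]
        g = lr * (label - sigmoid(m * v_ctx.dot(theta[u])))  # prediction error
        e += g * m * theta[u]          # accumulate gradient for v(w~)
        theta[u] += g * m * v_ctx      # update the auxiliary vector theta^u
    v_ctx += e                         # update the context word vector
```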
Fig. 3 is a simplified overall flow diagram disclosed in the embodiment of the present invention, and the constructed target model has three layers, i.e., an input layer, a projection layer, and an output layer. Wherein:
the input of the input layer is the central word w(t), and its output is the word vector corresponding to the central word w(t);
the projection layer mainly projects the output result of the input layer; in this model, both the input and the output of the projection layer are the word vector of the central word w(t);
the output layer mainly uses the central word w(t) to predict the word vectors of the context words such as w(t-2), w(t-1), w(t+1), w(t+2), etc.
The main purpose of the invention is to consider the part-of-speech and position relations between the central word and the context words when predicting the context words with the central word w(t).
S4.3, optimizing the new objective function, maximizing the value of the new objective function, and performing optimization on the parameter thetauAndperforming gradient calculation and updating, and obtaining target word vectors when the whole training corpus is traversed。
For example, a random Gradient Ascent (SGA) method may be used to optimize the new objective function, i.e., maximize the value of the new objective function. And for the parameter thetauAndand gradient calculation and updating, and obtaining the target word vector when the whole training corpus is traversed.
Optionally, the target word vector may be obtained by performing updating and gradient calculation in the following manner:
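The formulas at this point are not preserved in the text; a reconstruction consistent with the fused objective above, with learning rate $\eta$ and shorthand $m = M'_i(T_u, T_{\tilde{w}})$ (an assumption rather than the patent's verbatim derivation), is:

$$g = \eta\left[L^w(u) - \sigma\!\left(m\, v(\tilde{w})^\top \theta^u\right)\right], \qquad \theta^u \leftarrow \theta^u + g\, m\, v(\tilde{w}), \qquad e \leftarrow e + g\, m\, \theta^u,$$

$$v(\tilde{w}) \leftarrow v(\tilde{w}) + e \quad \text{after all } u \in \{w\} \cup NEG(w) \text{ have been processed.}$$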
Fig. 4 is a schematic flow chart of another word vector training method fusing part-of-speech and position information disclosed in the embodiment of the present invention; the method shown in fig. 4 comprises five steps: data preprocessing, word segmentation and part-of-speech tagging, modeling of part-of-speech and position information, word vector training, and task evaluation. The first four are the method steps described in embodiment 1; in task evaluation, the learned target word vectors carrying part-of-speech and position information can be used in tasks such as word analogy and word similarity. The task evaluation mainly comprises the following two parts:
and performing word analogy task by using the learned target word vector. For example, for two word pairs < king, queen > and < man, wman >, the word vectors corresponding to these word pairs are calculated to find that there is a relationship of v (king) -v (queen) -v (man) -v (wman).
Performing the word similarity task with the learned target word vectors. For example, given a word such as "dog", the top N words closely related to "dog", such as "puppy" and "cat", can be obtained by computing the cosine or Euclidean distance between the other words and "dog".
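Both evaluations reduce to a nearest-neighbor search under a vector distance; a minimal sketch over a dict of learned vectors (the vocabulary and vector values are illustrative):

```python
import numpy as np

def cosine(a, b):
    return float(a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(vectors, query, exclude=(), topn=5):
    """Rank vocabulary words by cosine similarity to the query vector."""
    scored = [(w, cosine(v, query)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda p: p[1], reverse=True)[:topn]

# word similarity: nearest neighbors of "dog"
# most_similar(vectors, vectors["dog"], exclude={"dog"})

# word analogy: solve v(king) - v(queen) ≈ v(man) - v(x) for x,
# i.e. search near v(man) - v(king) + v(queen)
# most_similar(vectors, vectors["man"] - vectors["king"] + vectors["queen"],
#              exclude={"king", "queen", "man"})
```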
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (8)

1. A word vector training method fusing part of speech and position information is characterized by comprising the following steps:
s1, preprocessing the original text to obtain a target text;
S2, according to the context information of the words, tagging the parts of speech of the words in the target text with the parts of speech in the part-of-speech tag set;
S3, modeling according to the tagged part-of-speech information to construct a part-of-speech associated weight matrix M, modeling the relative position i of the corresponding word pairs according to their parts of speech, and constructing a position part-of-speech associated weight matrix M'_i for each position, wherein the row and column dimensions of the matrix M equal the number of part-of-speech categories in the part-of-speech tag set, each element of the matrix M is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column, the matrix M'_i has the same row and column dimensions as the matrix M, and each element of M'_i is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column at relative position i;
S4, fusing the modeled matrix M and matrices M'_i into a skip-gram word vector model to construct a target model, and performing word vector learning with the target model to obtain target word vectors, wherein the target word vectors are used for word analogy tasks and word similarity tasks.
2. The method according to claim 1, wherein step S2 comprises the following sub-steps:
s2.1, segmenting the target text to distinguish all words in the target text;
and S2.2, for each sentence in the target text, according to the context information of the word in the sentence, performing part-of-speech tagging on the word by using the part-of-speech in the part-of-speech tagging set.
3. The method according to claim 1 or 2, characterized in that step S3 comprises the following sub-steps:
S3.1, generating, for each word in the target text, a word/part-of-speech pair formed by the word and its corresponding part of speech, and constructing the part-of-speech associated weight matrix M from these pairs, wherein the row and column dimensions of the matrix M equal the number of part-of-speech categories in the part-of-speech tag set, and each element of the matrix M is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column;
S3.2, modeling the relative position i of the corresponding word pairs according to their parts of speech, and constructing the position part-of-speech associated weight matrix M'_i corresponding to each position, wherein the matrix M'_i has the same row and column dimensions as the matrix M, and each element of M'_i is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column at relative position i.
4. The method according to claim 3, wherein step S4 comprises the following sub-steps:
S4.1, constructing an initial objective function:

$$\mathcal{L}_0 = \sum_{w \in C} \sum_{\tilde{w} \in Context(w)} \log p(\tilde{w} \mid w)$$

wherein C represents the vocabulary of the whole training corpus, Context(w) represents the context word set consisting of the c words before and after the target word w, and c represents the window size;
S4.2, fusing the modeled matrix M and matrices M'_i into a negative-sampling-based skip-gram word vector model to construct a target model, and constructing a new objective function of the target model from the initial objective function:

$$\mathcal{L} = \sum_{w \in C} \sum_{\tilde{w} \in Context(w)} \sum_{u \in \{w\} \cup NEG(w)} \Big\{ L^w(u)\,\log \sigma\big(M'_i(T_u, T_{\tilde{w}})\, v(\tilde{w})^\top \theta^u\big) + \big(1 - L^w(u)\big)\,\log\big[1 - \sigma\big(M'_i(T_u, T_{\tilde{w}})\, v(\tilde{w})^\top \theta^u\big)\big] \Big\}$$

wherein NEG(w) is the set of negative samples drawn for the target word w, L^w(u) is the label of sample u (a positive sample has label 1 and a negative sample has label 0), θ^u is the auxiliary vector used for the sample word u during model training, v(w̃)^⊤ is the transpose of the word vector v(w̃) corresponding to the context word w̃, σ(·) is the sigmoid function, and M'_i(T_u, T_{w̃}) is the co-occurrence probability of the two parts of speech T_u and T_{w̃} when their relative position relation is i;
S4.3, optimizing the new objective function so as to maximize its value, performing gradient calculation and updates on the parameters θ^u and v(w̃), and obtaining the target word vectors once the whole training corpus has been traversed.
5. A word vector training system fusing part of speech and position information is characterized by comprising:
the preprocessing module is used for preprocessing the original text to obtain a target text;
the part-of-speech tagging module is used for tagging the part of speech of the word in the target text by adopting the part of speech in the part-of-speech tagging set according to the context information of the word;
the position part-of-speech fusion module is used for modeling according to the tagged part-of-speech information to construct a part-of-speech associated weight matrix M, modeling the relative position i of the corresponding word pairs according to their parts of speech, and constructing a position part-of-speech associated weight matrix M'_i for each position, wherein the row and column dimensions of the matrix M equal the number of part-of-speech categories in the part-of-speech tag set, each element of the matrix M is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column, the matrix M'_i has the same row and column dimensions as the matrix M, and each element of M'_i is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column at relative position i;
the word vector learning module is used for fusing the modeled matrix M and matrices M'_i into a skip-gram word vector model to construct a target model, and performing word vector learning with the target model to obtain target word vectors, wherein the target word vectors are used for word analogy tasks and word similarity tasks.
6. The system of claim 5, wherein the part-of-speech tagging module comprises:
the word segmentation module is used for segmenting the target text to distinguish all words in the target text;
and the part-of-speech tagging submodule is used for performing part-of-speech tagging on each sentence in the target text by adopting the part-of-speech in the part-of-speech tagging set according to the context information of the word in the sentence.
7. The system according to claim 5 or 6, wherein the location part-of-speech fusion module comprises:
the part-of-speech information modeling module is used for generating, for each word in the target text, a word/part-of-speech pair formed by the word and its corresponding part of speech, and constructing the part-of-speech associated weight matrix M from these pairs, wherein the row and column dimensions of the matrix M equal the number of part-of-speech categories in the part-of-speech tag set, and each element of the matrix M is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column;
the position information modeling module is used for modeling the relative position i of the corresponding word pairs according to their parts of speech and constructing the position part-of-speech associated weight matrix M'_i corresponding to each position, wherein the matrix M'_i has the same row and column dimensions as the matrix M, and each element of M'_i is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column at relative position i.
8. The system of claim 7, wherein the word vector learning module comprises:
an initial objective function construction module, configured to construct an initial objective function:

$$\mathcal{L}_0 = \sum_{w \in C} \sum_{\tilde{w} \in Context(w)} \log p(\tilde{w} \mid w)$$

wherein C represents the vocabulary of the whole training corpus, Context(w) represents the context word set consisting of the c words before and after the target word w, and c represents the window size;
a new objective function construction module for fusing the modeled matrix M and matrices M'_i into a negative-sampling-based skip-gram word vector model to construct a target model, and constructing a new objective function of the target model from the initial objective function:

$$\mathcal{L} = \sum_{w \in C} \sum_{\tilde{w} \in Context(w)} \sum_{u \in \{w\} \cup NEG(w)} \Big\{ L^w(u)\,\log \sigma\big(M'_i(T_u, T_{\tilde{w}})\, v(\tilde{w})^\top \theta^u\big) + \big(1 - L^w(u)\big)\,\log\big[1 - \sigma\big(M'_i(T_u, T_{\tilde{w}})\, v(\tilde{w})^\top \theta^u\big)\big] \Big\}$$

wherein NEG(w) is the set of negative samples drawn for the target word w, L^w(u) is the label of sample u (a positive sample has label 1 and a negative sample has label 0), θ^u is the auxiliary vector used for the sample word u during model training, v(w̃)^⊤ is the transpose of the word vector v(w̃) corresponding to the context word w̃, σ(·) is the sigmoid function, and M'_i(T_u, T_{w̃}) is the co-occurrence probability of the two parts of speech T_u and T_{w̃} when their relative position relation is i;
a word vector learning submodule, used for optimizing the new objective function so as to maximize its value, performing gradient calculation and updates on the parameters θ^u and v(w̃), and obtaining the target word vectors once the whole training corpus has been traversed.
CN201710384135.6A 2017-05-26 2017-05-26 Word vector training method and system fusing part of speech and position information Active CN107239444B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710384135.6A CN107239444B (en) 2017-05-26 2017-05-26 Word vector training method and system fusing part of speech and position information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710384135.6A CN107239444B (en) 2017-05-26 2017-05-26 Word vector training method and system fusing part of speech and position information

Publications (2)

Publication Number Publication Date
CN107239444A CN107239444A (en) 2017-10-10
CN107239444B true CN107239444B (en) 2019-10-08

Family

ID=59985183

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710384135.6A Active CN107239444B (en) 2017-05-26 2017-05-26 Word vector training method and system fusing part of speech and position information

Country Status (1)

Country Link
CN (1) CN107239444B (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108305612B (en) * 2017-11-21 2020-07-31 腾讯科技(深圳)有限公司 Text processing method, text processing device, model training method, model training device, storage medium and computer equipment
CN108229818B (en) * 2017-12-29 2021-07-13 网智天元科技集团股份有限公司 Method and device for establishing talent value measuring and calculating coordinate system
CN110348001B (en) * 2018-04-04 2022-11-25 腾讯科技(深圳)有限公司 Word vector training method and server
CN108628834B (en) * 2018-05-14 2022-04-15 国家计算机网络与信息安全管理中心 Word expression learning method based on syntactic dependency relationship
CN108733653B (en) * 2018-05-18 2020-07-10 华中科技大学 Sentiment analysis method of Skip-gram model based on fusion of part-of-speech and semantic information
CN108875810B (en) * 2018-06-01 2020-04-28 阿里巴巴集团控股有限公司 Method and device for sampling negative examples from word frequency table aiming at training corpus
CN109190126B (en) * 2018-09-17 2023-08-15 北京神州泰岳软件股份有限公司 Training method and device for word embedding model
CN109308353B (en) * 2018-09-17 2023-08-15 鼎富智能科技有限公司 Training method and device for word embedding model
CN109271636B (en) * 2018-09-17 2023-08-11 鼎富智能科技有限公司 Training method and device for word embedding model
CN109344403B (en) * 2018-09-20 2020-11-06 中南大学 Text representation method for enhancing semantic feature embedding
CN109271422B (en) * 2018-09-20 2021-10-08 华中科技大学 Social network subject matter expert searching method driven by unreal information
CN109325231B (en) * 2018-09-21 2023-07-04 中山大学 Method for generating word vector by multitasking model
CN109639452A (en) * 2018-10-31 2019-04-16 深圳大学 Social modeling training method, device, server and storage medium
CN109858024B (en) * 2019-01-04 2023-04-11 中山大学 Word2 vec-based room source word vector training method and device
CN109858031B (en) * 2019-02-14 2023-05-23 北京小米智能科技有限公司 Neural network model training and context prediction method and device
CN110276052B (en) * 2019-06-10 2021-02-12 北京科技大学 Ancient Chinese automatic word segmentation and part-of-speech tagging integrated method and device
CN110287236B (en) * 2019-06-25 2024-03-19 平安科技(深圳)有限公司 Data mining method, system and terminal equipment based on interview information
CN110534087B (en) * 2019-09-04 2022-02-15 清华大学深圳研究生院 Text prosody hierarchical structure prediction method, device, equipment and storage medium
CN111506726B (en) * 2020-03-18 2023-09-22 大箴(杭州)科技有限公司 Short text clustering method and device based on part-of-speech coding and computer equipment
CN111859910B (en) * 2020-07-15 2022-03-18 山西大学 Word feature representation method for semantic role recognition and fusing position information
CN111832282B (en) * 2020-07-16 2023-04-14 平安科技(深圳)有限公司 External knowledge fused BERT model fine adjustment method and device and computer equipment
CN113010670B (en) * 2021-02-22 2023-09-19 腾讯科技(深圳)有限公司 Account information clustering method, detection method, device and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101866337A (en) * 2009-04-14 2010-10-20 日电(中国)有限公司 Part-or-speech tagging system, and device and method thereof for training part-or-speech tagging model
CN105243129A (en) * 2015-09-30 2016-01-13 清华大学深圳研究生院 Commodity property characteristic word clustering method
CN106649275A (en) * 2016-12-28 2017-05-10 成都数联铭品科技有限公司 Relation extraction method based on part-of-speech information and convolutional neural network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101866337A (en) * 2009-04-14 2010-10-20 日电(中国)有限公司 Part-or-speech tagging system, and device and method thereof for training part-or-speech tagging model
CN105243129A (en) * 2015-09-30 2016-01-13 清华大学深圳研究生院 Commodity property characteristic word clustering method
CN106649275A (en) * 2016-12-28 2017-05-10 成都数联铭品科技有限公司 Relation extraction method based on part-of-speech information and convolutional neural network

Also Published As

Publication number Publication date
CN107239444A (en) 2017-10-10

Similar Documents

Publication Publication Date Title
CN107239444B (en) Word vector training method and system fusing part of speech and position information
CN112001187B (en) Emotion classification system based on Chinese syntax and graph convolution neural network
CN111160037B (en) Fine-grained emotion analysis method supporting cross-language migration
CN109325231B (en) Method for generating word vector by multitasking model
WO2018028077A1 (en) Deep learning based method and device for chinese semantics analysis
CN112001185A (en) Emotion classification method combining Chinese syntax and graph convolution neural network
CN111143576A (en) Event-oriented dynamic knowledge graph construction method and device
Shilpa et al. Sentiment analysis using deep learning
CN110489523B (en) Fine-grained emotion analysis method based on online shopping evaluation
CN109325112A (en) A kind of across language sentiment analysis method and apparatus based on emoji
CN112001186A (en) Emotion classification method using graph convolution neural network and Chinese syntax
CN109086265B (en) Semantic training method and multi-semantic word disambiguation method in short text
CN112836051B (en) Online self-learning court electronic file text classification method
CN109492105B (en) Text emotion classification method based on multi-feature ensemble learning
CN111339260A (en) BERT and QA thought-based fine-grained emotion analysis method
CN111400584A (en) Association word recommendation method and device, computer equipment and storage medium
CN114756681B (en) Evaluation and education text fine granularity suggestion mining method based on multi-attention fusion
CN111709225B (en) Event causal relationship discriminating method, device and computer readable storage medium
CN115906816A (en) Text emotion analysis method of two-channel Attention model based on Bert
CN115017884B (en) Text parallel sentence pair extraction method based on graphic multi-mode gating enhancement
CN115600605A (en) Method, system, equipment and storage medium for jointly extracting Chinese entity relationship
Jacob et al. Data Intelligence and Cognitive Informatics: Proceedings of ICDICI 2023
CN117056451A (en) New energy automobile complaint text aspect-viewpoint pair extraction method based on context enhancement
CN115906824A (en) Text fine-grained emotion analysis method, system, medium and computing equipment
CN116108840A (en) Text fine granularity emotion analysis method, system, medium and computing device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant