CN107526834A - Word2vec improvement method for jointly training part-of-speech and word-order correlation factors - Google Patents

Word2vec improvement method for jointly training part-of-speech and word-order correlation factors

Info

Publication number
CN107526834A
CN107526834A (application CN201710791297.1A)
Authority
CN
China
Prior art keywords
word
formula
speech
models
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710791297.1A
Other languages
Chinese (zh)
Other versions
CN107526834B (en)
Inventor
于重重
曹帅
潘博
张青川
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Technology and Business University
Original Assignee
Beijing Technology and Business University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Technology and Business University filed Critical Beijing Technology and Business University
Priority to CN201710791297.1A priority Critical patent/CN107526834B/en
Publication of CN107526834A publication Critical patent/CN107526834A/en
Application granted granted Critical
Publication of CN107526834B publication Critical patent/CN107526834B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G06F16/353 - Clustering; Classification into predefined classes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis

Abstract

The invention discloses a word2vec improvement method that jointly trains part-of-speech and word-order correlation factors, and proposes Structured word2vec on POS models, comprising the CWindow-POS (CWP) model and the Structured Skip-gram-POS (SSGP) model. Both models take part-of-speech tagging information and word order as influence factors for joint optimization and use part-of-speech relevance information to model the intrinsic syntactic relations between words within the context window; the context word sequence is weighted by part-of-speech relevance weights, vector inner products are then computed in word-position order, and a stochastic gradient descent (SGD) algorithm is used to jointly learn the relevance weights and the word embeddings. By embedding each word directionally according to its positional order, the invention realizes joint optimization of the word vectors and the part-of-speech relevance weighting matrix, and shows high efficiency on the word analogy task, the word similarity task and the qualitative analysis.

Description

Word2vec improvement method for jointly training part-of-speech and word-order correlation factors
Technical field
The invention belongs to the field of machine learning and relates to word2vec methods, in particular to a word2vec improvement method that jointly trains part-of-speech and word-order correlation factors. The method proposes Structured word2vec on POS models, which not only perceive word position order by embedding each word directionally according to its positional order, but also use part-of-speech relevance information to establish the intrinsic syntactic relations between words within the context window, realizing joint optimization of the word vectors and the part-of-speech relevance weighting matrix.
Background technology
Part of speech is a fundamental element of natural language processing, and word order carries the semantic and syntactic information being conveyed; both are key information in natural language. How to combine the two effectively in word embedding models is a current research focus. Semantic vector space models of language represent each word with a real-valued vector, and word vectors can serve as features in many applications, such as document classification, question answering, named entity recognition and dependency parsing. The quality of word vector representations is generally assessed with the word analogy task of Mikolov et al. recorded in document [1]: by examining the scalar distances between word vectors, finer structural relations in the word vector space are detected. For example, the analogy "king is to queen as man is to woman" should be encoded in the vector space by the vector equation king - queen = man - woman. Document [2] points out that this evaluation scheme favours models that produce meaningful dimensions and thus captures the clustering property of distributed representations. Researchers therefore use the word analogy task as the main evaluation method for word vectors.
With the development of deep neural network learning, the neural network language model (Neural Network Language Model, NNLM) of Bengio recorded in document [3] gradually attracted the attention of researchers. Documents [4] and [5] describe its application to the natural language processing field, for example the recurrent neural network language model (Recurrent Neural Networks language model, RNNLM). The defect of the NNLM and RNNLM models is that their structure is overly complex: the non-linear hidden layer brings a large amount of computation. For this problem, Mikolov proposed in document [6] two simplified linear models of word2vec: the Continuous Bag-of-Words model (CBOW) and Continuous Skip-gram (CSG). On the basis of the linear structure of CBOW and CSG, Mnih and Kavukcuoglu proposed in document [7] the scalable models vLBL and ivLBL. In document [8], Levy et al. proposed explicit word embedding models based on PPMI measures. In document [9], Pennington et al. proposed GloVe, a word representation model based on global information, which effectively combines local context windows with matrix factorization: it builds a word-word co-occurrence count matrix so that global optimization can be carried out on the matrix.
The disadvantages of the word2vec models and the GloVe model are, first, insensitivity to word-order information and, second, inability to use part-of-speech relevance information. Word order and part of speech are key information in natural language, and ignoring both causes a model to lose part of the semantic and syntactic information. Among models that improve on these two problems, document [10] describes the word-order-oriented embedding model Structured word2vec proposed by Ling et al., which can effectively use word-order information and greatly improves the representation quality on syntax-related tasks; document [11] describes how Liu et al. introduce part-of-speech relevance weights (POS Relevance Weights) to make full use of the part-of-speech relevance information of words in modelling. Document [10] improves the CBOW and CSG structures respectively and proposes two corresponding new structures, CWindow and Structured Skip-gram (SSG). While keeping the simple linear structure, CWindow and SSG embed words directionally according to their positional order and define a word-embedding parameter for each relative position; CWindow and SSG therefore preserve word-order information, but part-of-speech information is not preserved. Existing methods generally use part of speech for natural language processing tasks such as language modelling, dependency parsing and named entity recognition, but part-of-speech information is rarely used to train distributed word representation models. To make use of part-of-speech information, document [11] borrows the idea of language modelling and proposes using a part-of-speech relevance weighting matrix in CBOW to model each word-context pair (POS Relevance Weights for Learning Word Embeddings, PWE). This modelling effectively preserves part-of-speech information, but it does not preserve word-order information.
References:
[1]Mikolov T,Yih W T,Zweig G.Linguistic regularities in continuous space word representations[J].In HLT-NAACL,2013.
[2]Bengio Y.Learning deep architectures for AI[J].Foundations and Trends in Machine Learning,2009,2(1):1-127.
[3]Bengio Y,Schwenk H,Senécal J S,et al.Neural probabilistic language models[M]//Innovations in Machine Learning.Springer Berlin Heidelberg,2006: 137-186.
[4]Zhang X,Gu N,Ye H.Multi-GPU Based Recurrent Neural Network Language Model Training[M]//Social Computing.Springer Singapore,2016.
[5]Mikolov T,Karafiát M,Burget L,et al.Recurrent neural network based language model[C]//INTERSPEECH 2010,Conference of the International Speech Communication Association,Makuhari,Chiba,Japan,September.2010:1045-1048.
[6]Mikolov T,Chen K,Corrado G,et al.Efficient Estimation of Word Representations in Vector Space[J].arXiv preprint arXiv:1301.3781,2013.
[7]Mnih A,Kavukcuoglu K.Learning word embeddings efficiently with noise-contrastive estimation[C]//Advances in Neural Information Processing Systems.2013:2265-2273.
[8]Levy O,Goldberg Y,Ramat-Gan I.Linguistic Regularities in Sparse and Explicit Word Representations[C]//CoNLL.2014:171-180.
[9]Pennington J,Socher R,Manning C D.GloVe:Global Vectors for Word Representation[C]//EMNLP.2014,14:1532-43.
[10]Ling W,Dyer C,Black A,et al.Two/too simple adaptations of word2vec for syntax problems[C]//Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies.2015:1299-1304.
[11]Liu Q,Ling Z H,Jiang H,et al.Part-of-Speech Relevance Weights for Learning Word Embeddings[J].arXiv preprint arXiv:1603.07695,2016.
Summary of the invention
In order to overcome the above deficiencies of the prior art, the present invention provides a word2vec improvement method that jointly trains part-of-speech and word-order correlation factors. The method combines word order and part-of-speech information and proposes Structured word2vec on POS models based on word embedding models, which not only let the model perceive word position order but also use part-of-speech relevance information to establish the intrinsic syntactic relations between words in the context window. The Structured word2vec on POS models embed each word directionally according to its positional order and realize joint optimization of the word vectors and the part-of-speech relevance weighting matrix. On multiple word similarity tasks and word analogy tasks, the models represent words better than other current state-of-the-art word embedding models.
The principle of the present invention is as follows. Addressing the problem that word-order information and part-of-speech relevance information are ignored, the present invention proposes a new model, Structured word2vec on POS, which takes part-of-speech tagging information and word order as influence factors for joint optimization. On the basis of the word-oriented embedding structure, the Structured word2vec on POS models introduce a part-of-speech relevance weighting matrix as a training factor. The Structured word2vec on POS models are trained with an improved negative sampling (Negative Sampling, NS) algorithm, whose computational efficiency is higher than that of the NS algorithm recorded in document [12] and the hierarchical softmax (Hierarchical Softmax, HS) training recorded in document [13]. The Structured word2vec on POS models are not only highly sensitive to word order, but also use part-of-speech relevance information to model the intrinsic syntactic relations between words in the context window. First, the context word sequence is weighted by the part-of-speech relevance weights; then vector inner products are computed in word-position order; finally, a stochastic gradient descent (SGD) algorithm is used to jointly learn the relevance weights and the word embeddings.
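To make the negative-sampling training factor concrete, the following minimal sketch (not taken from the patent; the function names are hypothetical and NumPy is assumed) shows the per-sample log-likelihood term that such an NS objective maximizes: label 1 for the true word, label 0 for each drawn negative sample.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ns_log_likelihood(score, label):
    """Per-sample negative-sampling term.

    score : the model's inner-product score for this sample
    label : 1 if the sample is the true (centre) word, 0 for a negative sample
    Returns label*log(sigmoid(score)) + (1-label)*log(1-sigmoid(score)),
    the quantity summed over samples in the training objective.
    """
    p = sigmoid(score)
    return label * np.log(p) + (1.0 - label) * np.log(1.0 - p)
```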
For convenience of narration, the symbols used in the present invention and their corresponding meanings are shown in Table 1:
Table 1. Symbols and their meanings (bold denotes a vector or matrix)
The technical scheme provided by the present invention is as follows:
A word2vec improvement method that jointly trains part-of-speech and word-order correlation factors, proposing Structured word2vec on POS models, comprising the CWindow-POS (CWP) model and the Structured Skip-gram-POS (SSGP) model. Both models take part-of-speech tagging information and word order as influence factors for joint optimization and use part-of-speech relevance information to model the intrinsic syntactic relations between words in the context window; the context word sequence is weighted by the part-of-speech relevance weights, vector inner products are then computed in word-position order, and stochastic gradient descent (SGD) is used to jointly learn the relevance weights and the word embeddings. The method comprises the following steps:
1) Establish the CWindow-POS (CWP) model;
Combining the CWindow model and the PWE model, the present invention defines the output prediction matrix O ∈ R^{|V|×2cd} and introduces the part-of-speech relevance weighting matrix to establish the word-vector model of feature words; the resulting CWP model structure is shown in Fig. 2. For the training corpus Corp, the objective function of the CWP model is a log-likelihood function that maximizes the probability of each sample's labelled word. The present invention uses an improved NS algorithm, whose training function is given by formula 1:
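The body of formula 1 is not reproduced in the text above. Based on the symbol definitions that follow and on the analogous SSGP objective (formula 11), a plausible reconstruction (an assumption, not a verbatim quotation of the filing) of the negative-sampling log-likelihood is:

$$Q_{CBOW}=\sum_{word(t)\in Corp}\;\sum_{u\in\{word(t)\}\cup Neg(word(t))}\Big\{L^{word(t)}(u)\log\big[p(u\mid context(word(t)))\big]+\big(1-L^{word(t)}(u)\big)\log\big[1-p(u\mid context(word(t)))\big]\Big\}\quad(\text{formula 1, reconstructed})$$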
where word(t) is the local centre word and t denotes the word token index in the training corpus Corp; context(word(t)) is the context word sequence of word(t); Neg(word) denotes the set of negative samples drawn for the centre word word; and L^{word(t)}(u) denotes the label of the sampled word u with respect to word(t) (if word(t) = u, then L^{word(t)}(u) = 1; otherwise L^{word(t)}(u) = 0);
By training on this objective function, the maximum likelihood probability is computed. In formula 1, Q_CBOW is the maximum likelihood probability; p(u | context(word(t))) is the posterior probability that the sample word u co-occurs with the context context(word(t)); the computation of p(u | context(word(t))) differs from CWindow and CBOW. CWP first applies part-of-speech weighting to each word vector of the input layer; then, following the method of the CWindow model, CWP embeds the part-of-speech-weighted vectors into the projection layer directionally, in the order in which the context words appear. The cascaded vector takes the form of formula 2:
$$x_{word(t)}=\big[\Phi_{-c}(z_{t-c},z_t)\,v(word(t-c)),\;\ldots,\;\Phi_{-1}(z_{t-1},z_t)\,v(word(t-1)),\;\Phi_{1}(z_{t+1},z_t)\,v(word(t+1)),\;\ldots,\;\Phi_{c}(z_{t+c},z_t)\,v(word(t+c))\big]\quad(\text{formula 2})$$
where Φ_{-c}(z_{t-c}, z_t) denotes the relevance weight between the part-of-speech labels z_{t-c} and z_t for the position at distance c to the left of the centre word, and v(word(t-c)) denotes the vector of the word at distance c from the centre word.
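As an illustration of how the cascade of formula 2 can be assembled, here is a small sketch under the definitions above (hypothetical names, NumPy assumed; not the patent's implementation):

```python
import numpy as np

def cascade_vector(context_vecs, pos_weights):
    """Build x_word(t) of formula 2.

    context_vecs : the 2c context word vectors v(word(t+i)), ordered by
                   position -c, ..., -1, +1, ..., +c, each of dimension d
    pos_weights  : the 2c scalar POS relevance weights Phi_i(z_{t+i}, z_t),
                   aligned with context_vecs
    Returns the POS-weighted context vectors joined head-to-tail in their
    positional order, a vector of dimension 2*c*d for the projection layer.
    """
    weighted = [phi * v for phi, v in zip(pos_weights, context_vecs)]
    return np.concatenate(weighted)
```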
Substituting formula 2 into the NS algorithm yields formula 3:

$$index[i]=\begin{cases}c+i+1, & i\in[-c,-1]\\ c+i, & i\in[1,c]\end{cases}$$

$$u\in\{word(t)\}\cup Neg(word(t))$$

$$\theta^{u}(i)=O(u)\big[(index[i]-1)\times d+1\;\sim\;index[i]\times d\big]\quad(\text{formula 3})$$
where word(t) is the local centre word, t denotes the word token index in the training corpus Corp, context(word(t)) is the context word sequence of word(t), and Neg(word) denotes the set of negative samples drawn for the centre word word.
The expression under the braces in Q_CBOW of formula 1 is denoted L and used as the objective function of CWP for gradient derivation. Substituting formula 3 into L gives formula 4:

$$L=\Big\{L^{word(t)}(u)\log\Big[\sigma\Big(\sum_{-c\le i\le c,\,i\ne 0}\Phi_i(z_{t+i},z_t)\,v(word(t+i))\cdot\theta^{u}(i)\Big)\Big]+\big(1-L^{word(t)}(u)\big)\log\Big[1-\sigma\Big(\sum_{-c\le i\le c,\,i\ne 0}\Phi_i(z_{t+i},z_t)\,v(word(t+i))\cdot\theta^{u}(i)\Big)\Big]\Big\}\quad(\text{formula 4})$$
where, for convenience in the partial-derivative computations below, the vector-weighted dot product as a whole is assigned to a variable dot_product_weight; θ^u(i) denotes a segment of dimension d intercepted from O(u), which varies with the directed embedding position i of the word in the input layer; and Φ_i(z_{t+i}, z_t) denotes the relevance weight between the part-of-speech labels z_{t+i} and z_t for the position at distance i from the centre word.
For the variables Φ_i(z_{t+i}, z_t), θ^u(i) and v(word(t+i)) in the objective function (formula 4), the key of the gradient algorithm is the gradients of these three variables of the objective function. The present invention derives and solves the gradients of the three variables of L with the stochastic gradient ascent method and then continually optimizes and updates them.
Gradient update of θ^u(i)
The gradient of θ^u(i) is expressed as formula 5:
The update formula for θ^u(i) is formula 6:
where word(t) is the local centre word, t denotes the word token index in the training corpus Corp, and L^{word(t)}(u) denotes the label of the sampled word u with respect to word(t) (if word(t) = u, then L^{word(t)}(u) = 1; otherwise L^{word(t)}(u) = 0); for convenience in the partial-derivative computations below, the vector-weighted dot product as a whole is assigned to a variable dot_product_weight.
Gradient update of Φ_i(z_{t+i}, z_t)
The gradient of Φ_i(z_{t+i}, z_t) is expressed as formula 7:
The update formula for Φ_i(z_{t+i}, z_t) is formula 8:
where word(t) is the local centre word, t denotes the word token index in the training corpus Corp, and L^{word(t)}(u) denotes the label of the sampled word u with respect to word(t) (if word(t) = u, then L^{word(t)}(u) = 1; otherwise L^{word(t)}(u) = 0); for convenience in the partial-derivative computations below, the vector-weighted dot product as a whole is assigned to a variable dot_product_weight; θ^u(i) denotes a segment of dimension d intercepted from O(u), which varies with the directed embedding position i of the word in the input layer; and Φ_i(z_{t+i}, z_t) denotes the relevance weight between the part-of-speech labels z_{t+i} and z_t for the position at distance i from the centre word.
Gradient update of v(word(t+i))
The gradient of v(word(t+i)) is expressed as formula 9:
The update formula for v(word(t+i)) is formula 10:
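Formulas 5 to 10 are not reproduced in the text above. The sketch below (hypothetical names, NumPy assumed) therefore uses the standard negative-sampling gradients implied by formula 4 to illustrate one CWP update step; it is an assumption about the concrete update rules, not a transcription of the patent's formulas. With s = Σ_i Φ_i v(word(t+i)) · θ^u(i) and g = L^{word(t)}(u) - σ(s), the gradients are g·Φ_i·v for θ^u(i), g·(v·θ^u(i)) for Φ_i, and g·Φ_i·θ^u(i) for v(word(t+i)).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cwp_update(context_vecs, pos_weights, samples, lr):
    """One sketched CWP training step for a single centre word.

    context_vecs : list of 2c arrays v(word(t+i)), ordered by position
    pos_weights  : list of 2c scalars Phi_i(z_{t+i}, z_t), same order
                   (in the full model each Phi is shared by every word pair
                   carrying those POS tags; treated locally here for brevity)
    samples      : list of (theta_slices, label) pairs; theta_slices is a list
                   of 2c arrays theta_u(i) cut from O(u), label is 1 for the
                   centre word and 0 for a negative sample
    lr           : learning rate
    """
    v_grads = [np.zeros_like(v) for v in context_vecs]
    for theta_slices, label in samples:
        s = sum(w * np.dot(v, th)
                for w, v, th in zip(pos_weights, context_vecs, theta_slices))
        g = label - sigmoid(s)                        # dL/ds
        for i in range(len(context_vecs)):
            w, v, th = pos_weights[i], context_vecs[i], theta_slices[i]
            grad_theta = g * w * v                    # dL/d theta_u(i)
            grad_phi = g * np.dot(v, th)              # dL/d Phi_i
            v_grads[i] += g * w * th                  # accumulate dL/d v_i
            th += lr * grad_theta                     # ascend on theta_u(i)
            pos_weights[i] += lr * grad_phi           # ascend on Phi_i
    for v, dv in zip(context_vecs, v_grads):
        v += lr * dv                                  # ascend on v(word(t+i))
```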
2) Establish the Structured Skip-gram-POS (SSGP) model;
The SSGP model is improved on the basis of the SSG model and the PWE model: for a given centre word word(t), a single output matrix O ∈ R^{|V|×d} is used to predict each context word, and the part-of-speech relevance weighting matrix is introduced into the modelling; the resulting SSGP model structure is shown in Fig. 3.
For the training corpus Corp, the objective function of the SSGP model is a log-likelihood function that maximizes the probability of each sample's labelled word. The present invention uses the improved NS algorithm, whose training function is given by formula 11:

$$Q_{CSG}=\sum_{word(t)\in Corp}\;\sum_{-c\le i\le c,\,i\ne 0}\;\sum_{u\in\{word(t)\}\cup Neg^{word(t+i)}(word(t))}\Big\{L^{word(t)}(u)\log\big[p(word(t+i)\mid u)\big]+\big(1-L^{word(t)}(u)\big)\log\big[1-p(word(t+i)\mid u)\big]\Big\}\quad(\text{formula 11})$$
where p(word(t+i) | u) denotes the posterior probability of word(t+i) given the sample word u, and Q_CSG is the maximum likelihood probability;
The p(word(t+i) | u) in the SSGP model differs from CSG and SSG. SSGP adds the PWE part-of-speech relevance weighting matrix Φ_i to the output layer, and the part-of-speech weighting factor in the matrix is tied to the directed embedding position of the word. Therefore, after Φ_i is added to the NS algorithm of SSG, the calculation formula is expressed as formula 12:
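The body of formula 12 is not reproduced in the text above. Reading it off the structure of formula 13 below (a reconstruction, not a verbatim quotation of the filing), the posterior probability is:

$$p\big(word(t+i)\mid u\big)=\sigma\Big(\Phi_i(z_{t+i},z_t)\,v\big(word(t+i)\big)\cdot O_i(u)\Big)\quad(\text{formula 12, reconstructed})$$

where σ is the sigmoid function.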
The expression under the braces in Q_CSG of formula 11 is denoted L1 and used as the objective function of Structured Skip-gram-POS for gradient derivation. Substituting formula 12 gives formula 13:
$$L1=\Big\{L^{word(t)}(u)\log\big[\sigma\big(\Phi_i(z_{t+i},z_t)\,v(word(t+i))\cdot O_i(u)\big)\big]+\big(1-L^{word(t)}(u)\big)\log\big[1-\sigma\big(\Phi_i(z_{t+i},z_t)\,v(word(t+i))\cdot O_i(u)\big)\big]\Big\}\quad(\text{formula 13})$$
For the variable parameters Φ_i(z_{t+i}, z_t), O_i(u) and v(word(t+i)) in the objective function, the key of the gradient algorithm is the gradients of these three parameters of the objective function. The present invention solves the gradients of the three variables of L1 with the stochastic gradient ascent method and then continually optimizes and updates them. Following the derivation of step 1), the update formulas are obtained:
The update formula for O_i(u) is formula 14:
The update formula for Φ_i(z_{t+i}, z_t) is formula 15:
The update formula for v(word(t+i)) is formula 16:
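Formulas 14 to 16 are likewise not reproduced in the text. As a sketch under the same assumption (standard negative-sampling gradients implied by formula 13; hypothetical names, NumPy assumed), one SSGP update step for a single context position i could look as follows. With s = Φ_i v(word(t+i)) · O_i(u) and g = L^{word(t)}(u) - σ(s), the gradients are g·Φ_i·v for O_i(u), g·(v·O_i(u)) for Φ_i, and g·Φ_i·O_i(u) for v(word(t+i)).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ssgp_update(v_context, phi_i, samples, lr):
    """One sketched SSGP step for a single (centre word, position i) pair.

    v_context : array, the context word vector v(word(t+i))
    phi_i     : scalar POS relevance weight Phi_i(z_{t+i}, z_t)
    samples   : list of (o_i_u, label) pairs, where o_i_u is the slice O_i(u)
                of the output matrix for sample u and label is 1 for the true
                centre word, 0 for a negative sample
    Returns the updated phi_i; v_context and the o_i_u arrays are updated
    in place.
    """
    v_grad = np.zeros_like(v_context)
    for o_i_u, label in samples:
        s = phi_i * np.dot(v_context, o_i_u)
        g = label - sigmoid(s)                        # dL1/ds
        grad_o = g * phi_i * v_context                # dL1/d O_i(u)
        grad_phi = g * np.dot(v_context, o_i_u)       # dL1/d Phi_i
        v_grad += g * phi_i * o_i_u                   # accumulate dL1/d v
        o_i_u += lr * grad_o                          # ascend on O_i(u)
        phi_i += lr * grad_phi                        # ascend on Phi_i
    v_context += lr * v_grad                          # ascend on v(word(t+i))
    return phi_i
```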
Compared with the prior art, the beneficial effects of the present invention are as follows:
The present invention provides a word2vec improvement method that jointly trains part-of-speech and word-order correlation factors. The established Structured word2vec on POS models combine word-order and part-of-speech information, so the models not only perceive word position order but also use part-of-speech relevance information to establish the intrinsic syntactic relations between words in the context window. Specifically, the method of the invention has the following technical advantages:
The improvement method of the present invention fuses the part-of-speech relevance weights of PWE with the word-order-based directed embedding structure of Structured Word2Vec and proposes the Structured word2vec on POS models, comprising the CWP and SSGP models, which can perceive the relative position information of context words in the network and weight the context sequence with part-of-speech relevance weights. The training process of CWP and SSGP is as follows: within the word-oriented embedding structure, stochastic gradient descent is used to jointly learn the word embeddings and the part-of-speech relevance weighting matrix. Compared with conventional methods, the present invention embeds words directionally according to their positional order and realizes joint optimization of the word vectors and the part-of-speech relevance weighting matrix; it shows high efficiency on the word analogy task, the word similarity task and the qualitative analysis.
Brief description of the drawings
Fig. 1 is a flow block diagram of the Structured word2vec on POS method provided by the invention.
Fig. 2 is a schematic diagram of the CWindow-POS model structure proposed by the present invention.
Fig. 3 is a schematic diagram of the SSGP model structure proposed by the present invention.
Fig. 4 shows the mean word-analogy accuracy of each model under different vector dimensions and iteration counts in the embodiment of the present invention;
where panel (1) shows the mean word-analogy accuracy of each model at Dim=100, Epoch=1; panel (2) at Dim=100, Epoch=3; panel (3) at Dim=300, Epoch=3; and panel (4) at Dim=300, Epoch=5.
Fig. 5 shows the word-similarity accuracy of each model under different vector dimensions and iteration counts in the embodiment of the present invention;
where panel (1) shows the word-similarity accuracy of each model at Dim=100, Epoch=1; panel (2) at Dim=100, Epoch=3; and panel (3) at Dim=300, Epoch=3.
In Figs. 4 and 5, the abscissa shows the model categories and the ordinate shows the accuracy values.
Embodiment
The present invention is further described below through embodiments in conjunction with the accompanying drawings, without limiting the scope of the invention in any way.
The present invention provides a word2vec improvement method that jointly trains part-of-speech and word-order correlation factors, proposing Structured word2vec on POS models. The method takes part-of-speech tagging information and word order as influence factors for jointly optimizing the model and uses part-of-speech relevance information to model the intrinsic syntactic relations between words in the context window; the context word sequence is weighted by the part-of-speech relevance weights, vector inner products are then computed in word-position order, and stochastic gradient descent (SGD) is used to jointly learn the relevance weights and the word embeddings. Fig. 1 shows the flow of the method of the invention.
The training corpus, evaluation tasks, part-of-speech annotation and parameter settings of the following embodiments are as follows:
The English Wikipedia corpus from April to June 2016 is used, containing in total about 6 million articles and 3 billion tokens. The Wikipedia corpus was pre-processed using wscript.exe. After normalization, we constructed a vocabulary containing 636,900 distinct words by removing words occurring fewer than 50 times; the training corpus was part-of-speech tagged with the OpenNLP toolkit. The tag set is the Penn Treebank part-of-speech tag set, which consists of 36 common part-of-speech tags and 6 symbol tags.
The implementation environment of the embodiment is a 64-bit Windows 7 system with 16 GB memory and a 12-core CPU at 2.0 GHz; the development platform is Eclipse and the programming language is Java. The other hyper-parameters of word embedding model training are set as follows: the number of negative samples is 5; the context window size is set to 5; the initial learning rate is set to 0.025 and declines linearly during training; all initial values of the weights in the part-of-speech relevance weighting matrix are set to the same value. The final reported experimental results are averages over multiple runs.
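For reference, the stated training settings can be summarized as a configuration sketch (the field names are illustrative; only the values come from the text above):

```python
# Hyper-parameters reported in the embodiment (field names are hypothetical).
TRAINING_CONFIG = {
    "negative_samples": 5,
    "context_window": 5,
    "initial_learning_rate": 0.025,   # declines linearly during training
    "min_word_count": 50,             # rarer words are removed from the vocabulary
    "vocabulary_size": 636_900,
    "vector_dims": [100, 300],        # dimensions explored (Dim)
    "epochs": [1, 3, 5],              # iteration counts explored (Epoch)
    "pos_weight_init": "equal",       # all POS relevance weights start from the same value
}
```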
It was found in the experiments that only the vector dimension and the number of iterations have a significant effect on the results. When the vector dimension is 100 to 300 and the number of iterations is 1 to 5, the change in word-analogy results is most evident, so the experimental results within these parameter ranges are charted. To compare the results of multiple models conveniently, the experiments record results under different vector dimensions (denoted Dim) and iteration counts (denoted Epoch).
In a specific implementation, the analogy task contains questions such as "a is to b as c is to ?". Semantic questions generally concern analogies about people or places, such as "Athens is to Greece as Berlin is to ?". Syntactic questions generally concern analogies of verb tense or adjective form, such as "dance is to dancing as fly is to ?". To answer a question correctly, the model must uniquely identify the missing term; only an exact correspondence is counted as a correct match. The task requires finding the word d whose vector v(d) is closest to v(b) - v(a) + v(c) according to cosine similarity, and answering the question "a is to b as c is to ?" with it.
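A minimal sketch of how such an analogy query can be answered (assuming a dict of word vectors; hypothetical names, NumPy assumed; not the patent's implementation):

```python
import numpy as np

def solve_analogy(emb, a, b, c):
    """Answer "a is to b as c is to ?" by finding the word d whose vector
    is closest, by cosine similarity, to v(b) - v(a) + v(c)."""
    target = emb[b] - emb[a] + emb[c]
    target = target / np.linalg.norm(target)
    best_word, best_sim = None, -np.inf
    for word, vec in emb.items():
        if word in (a, b, c):                 # exclude the query words
            continue
        sim = np.dot(target, vec) / np.linalg.norm(vec)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word
```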
The syntactic evaluation data sets include a first data set named MSR containing 8000 morpho-syntactic analogy questions (recorded in document [15]), and another named SYN, proposed by Mikolov et al., containing 10675 syntactic questions; the semantic evaluation data set is the data set SEM proposed by Mikolov et al., containing 8869 semantic questions. To answer these analogy questions, we first delete all out-of-vocabulary words and then use the typical similarity multiplication method to find the correct answer from the whole vocabulary. The experimental results are shown in Table 2, where CBOW, CSG, PWE, CWindow, SSG and GloVe are existing methods, and CWP and SSGP are the models established by the method of the invention.
There are three word evaluation tasks in the experiments: 1. the word analogy task, which is our main focus because it tests the semantic and syntactic substructure of the vector space; 2. in addition to the analogy task, the word similarity task is used to assess our models; 3. word evaluation for part of speech is carried out using the qualitative analysis approach of Liu et al. [11].
Table 2. Results (%) of the word analogy task
For the word similarity task, each feature vector in the vocabulary is first normalized, and then the cosine similarities between them are computed. The experiments obtain similarity scores from the word vectors and compute Spearman's rank correlation coefficient between these scores and the human-judged scores as the result. The experimental data sets include WordSim-353 (WS353) [16], SCWS [17] and RW [18].
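A sketch of this evaluation (SciPy assumed; the data-set format shown is hypothetical): cosine similarity between normalized vectors is scored against human judgments with Spearman's rank correlation.

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_similarity(emb, pairs_with_gold):
    """pairs_with_gold: list of (word1, word2, human_score) triples.
    Returns Spearman's rank correlation between model and human scores,
    skipping pairs with out-of-vocabulary words."""
    model_scores, gold_scores = [], []
    for w1, w2, gold in pairs_with_gold:
        if w1 not in emb or w2 not in emb:
            continue
        v1 = emb[w1] / np.linalg.norm(emb[w1])
        v2 = emb[w2] / np.linalg.norm(emb[w2])
        model_scores.append(float(np.dot(v1, v2)))   # cosine similarity
        gold_scores.append(gold)
    rho, _ = spearmanr(model_scores, gold_scores)
    return rho
```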
The experimental results are shown in Table 3.
Table 3. Results (%) of the word similarity task
Analysing the above experimental results, we plot the word analogy results of Table 2 as Fig. 4, where the ordinate is the mean (avg) test accuracy of each model over the three data sets. The histogram analysis shows the following:
1. CBOW and CSG, as the original word2vec models, perform worse than the other models. This confirms the advantages brought by the improved models described in the related work, such as PWE, CWindow and SSG.
2. Clearly, word analogy accuracy is positively correlated with vector dimension and iteration count in the experiments. When the vector dimension is 300 and the iteration count is 5 (i.e. the parameter setting optimal for word analogy accuracy), the CWP and SSGP models proposed here outperform the other word embedding models in overall word-analogy performance.
3. With the vector dimension held fixed, the advantage of PWE, CWP and SSGP becomes more obvious as the iteration count increases. When the iteration count increases from 1 to 3 (panels (1) to (2)), PWE, CWP and SSGP surpass the other models and become the best; when the iteration count increases from 3 to 5 (panels (3) to (4)), the performance of PWE surpasses SSG and GloVe. The reason is that PWE, CWP and SSGP add the part-of-speech relevance weights, one more layer of network nodes to tune than the other models (which only need to tune the two variables of word vectors and prediction vectors). PWE, CWP and SSGP therefore need more computation to fit the three variable parameters well and thus better refine the part-of-speech relevance weights. Conversely, too few iterations reduce the degree of refinement, so that PWE, CWP and SSGP then represent words worse than the other models.
4. The performance of CWP and SSGP is always higher than that of PWE. The reason is that CWP and SSGP splice the context word vectors head-to-tail in order, making full use of word-order information during training.
We plot the word similarity results of Table 3 as Fig. 5, where the ordinate is the mean (avg) test accuracy of each model over the three data sets. The histogram analysis shows the following:
1. The conclusions are essentially the same as in the analysis of the word analogy task above.
2. Comparing across methods proposed by the same authors (i.e. columns of the same colour), we find that CSG and its improved models (SSG, SSGP) perform better than CBOW and its improved models (CWindow, CWP). The reason is that the former take the centre word vector as input and maximize the posterior probability (formula 12) together with the prediction vector of each context word, whereas the latter merge (sum/concatenate) all context word vectors as input and maximize the posterior probability (formula 6) together with the prediction vector of the centre word. Clearly, CSG and its improved models can better capture the relation information between the centre word and each context word, while CBOW and its improved models simply merge all context word vectors as input, which weakens the relation between each context word and the centre word.
For the models' perception of part of speech, we also perform the following qualitative analysis: based on the unsupervised k-means clustering algorithm, part-of-speech induction [19], [20] is performed on the 500 most frequent words; each cluster is mapped to its most common gold-standard part-of-speech tag; finally, cluster purity is computed as the evaluation index. The experiments found that when the vector dimension is 300 and the iteration count is 3, the cluster purity of each model essentially reaches saturation (i.e. as the dimension and iteration count increase further, the change in cluster purity is very slight). The experimental results are shown in Table 4.
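A sketch of this qualitative analysis (scikit-learn assumed; names and data structures are hypothetical): the top-frequency word vectors are clustered with k-means, each cluster is mapped to its most common gold tag, and cluster purity is the fraction of words whose gold tag matches their cluster's majority tag.

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

def pos_cluster_purity(words, emb, gold_tags, n_clusters):
    """words: top-frequency words; emb: word -> vector;
    gold_tags: word -> gold POS tag; returns cluster purity in [0, 1]."""
    X = np.stack([emb[w] for w in words])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    correct = 0
    for cluster_id in range(n_clusters):
        members = [w for w, l in zip(words, labels) if l == cluster_id]
        if not members:
            continue
        tag_counts = Counter(gold_tags[w] for w in members)
        correct += tag_counts.most_common(1)[0][1]   # size of the majority tag
    return correct / len(words)
```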
The experimental results show that CBOW, PWE and CWP, which are based on the posterior probability p(word(t) | context(word(t))), have better clustering performance. Among them CWP performs best, demonstrating that when a model both stays sensitive to word-order information and jointly trains the part-of-speech relevance weights, this helps the model better capture syntactic and word-sense relations.
Table 4. Results (%) of the qualitative analysis
The following is a comparison of training efficiency:
The experiments express training efficiency (Training efficiency) as the number of training words (Training words) divided by the training time (Training time). The experiments are carried out on the Wikipedia training corpus. The NNLM model uses 8-gram input (i.e. the number of input-layer words, denoted p, is 8), the number of hidden-layer node units (denoted h) is set to 640, and the computational complexity is O(p × d + p × d × h + h × |V|) [10]. The hyper-parameters of the other models follow the default settings described above. The experimental results are shown in Table 5.
Table 5. Training efficiency (words/s)
The experimental results show that model training efficiency is negatively correlated with computational complexity. The efficiency of NNLM is far below that of the other, linear, models, because its non-linear hidden layer brings a large amount of computation. The CWP and SSGP proposed here are less efficient than the other linear models because they not only add the part-of-speech relevance weights but also enlarge the vector dimension in order to splice vectors using word-order information. Since introducing part-of-speech and word-order information brings a gain in word-vector representation quality, the small reduction in efficiency of CWP and SSGP is acceptable (at least compared with NNLM, the reduction is very small).
It should be noted that the purpose of publishing the embodiments is to help further understand the present invention, but those skilled in the art will appreciate that various substitutions and modifications are possible without departing from the spirit and scope of the invention and the appended claims. Therefore, the invention should not be limited to what the embodiments disclose, and the scope of protection of the invention is defined by the scope of the claims.
References:
[12]Mikolov T,Sutskever I,Chen K,et al.Distributed representations of words and phrases and their compositionality[C]//Advances in neural information processing systems.2013:3111-3119
[13]Mnih A,Hinton G.A scalable hierarchical distributed language model[C]//Conference on Neural Information Processing Systems,Vancouver,British Columbia,Canada,December.2008:1081-1088.
[14]Collobert R,Weston J,Bottou L,et al.Natural language processing (almost)from scratch[J].Journal of Machine Learning Research,2011,12(Aug): 2493-2537.
[15]Marcus M P,Marcinkiewicz M A,Santorini B.Building a large annotated corpus of English:The Penn Treebank[J].Computational linguistics, 1993,19(2):313-330.
[16]Finkelstein L,Gabrilovich E,Matias Y,et al.Placing search in context:The concept revisited[C]//Proceedings of the 10th international conference on World Wide Web.ACM,2001:406-414.
[17]Huang E H,Socher R,Manning C D,et al.Improving word representations via global context and multiple word prototypes[C]// Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics:Long Papers-Volume 1.Association for Computational Linguistics, 2012:873-882.
[18]Luong T,Socher R,Manning C D.Better Word Representations with Recursive Neural Networks for Morphology[C]//CoNLL.2013:104-113.
[19]Christodoulopoulos C,Goldwater S,Steedman M.Two decades of unsupervised POS induction:How far have we come?[C]//Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing.Association for Computational Linguistics,2010:575-584.
[20]Yatbaz M A,Sert E,Yuret D.Learning syntactic categories using paradigmatic representations of word context[C]//Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.Association for Computational Linguistics,2012:940-951.

Claims (5)

1. A word2vec improvement method, characterized in that Structured word2vec on POS models jointly training the part-of-speech factor and the word-order factor are established; the Structured word2vec on POS models comprise a CWindow-POS (CWP) model and a Structured Skip-gram-POS (SSGP) model; the CWP model and the SSGP model take part-of-speech tagging information and word order as influence factors for joint optimization and use part-of-speech relevance information to model the intrinsic syntactic relations between words in the context window; the context word sequence is weighted by part-of-speech relevance weights, vector inner products are then computed in word-position order, and a stochastic gradient descent algorithm is used to jointly learn the word embeddings and the part-of-speech relevance weighting matrix, realizing joint optimization of the word vectors and the part-of-speech relevance weighting matrix; the method comprises the following steps:
1) Establishing the CWP model: defining the output prediction matrix O ∈ R^{|V|×2cd} and introducing the part-of-speech relevance weighting matrix to establish the word-vector model of feature words, i.e. the CWP model; comprising:
for the training corpus Corp, the objective function of the CWP model is a log-likelihood function that maximizes the probability of each sample's labelled word; using the training function of the improved NS algorithm, the maximum likelihood probability is computed by training on the objective function of formula 1:
in formula 1, Q_CBOW is the maximum likelihood probability; word(t) is the local centre word and t denotes the word token index in the training corpus Corp; context(word(t)) is the context word sequence of word(t); Neg(word) denotes the set of negative samples drawn for the centre word word; L^{word(t)}(u) denotes the label of the sampled word u with respect to word(t) (if word(t) = u, then L^{word(t)}(u) = 1; otherwise L^{word(t)}(u) = 0);
p(u | context(word(t))) is the posterior probability that the sample word u co-occurs with the context context(word(t)); the computation of p(u | context(word(t))) is as follows: first, part-of-speech weighting is applied to each word vector of the input layer; then the part-of-speech-weighted vectors are embedded into the projection layer directionally, in the order in which the context words appear, the cascaded vector taking the form of formula 2:
$$x_{word(t)}=\big[\Phi_{-c}(z_{t-c},z_t)\,v(word(t-c)),\;\ldots,\;\Phi_{-1}(z_{t-1},z_t)\,v(word(t-1)),\;\Phi_{1}(z_{t+1},z_t)\,v(word(t+1)),\;\ldots,\;\Phi_{c}(z_{t+c},z_t)\,v(word(t+c))\big]\quad(\text{formula 2})$$
where Φ_{-c}(z_{t-c}, z_t) denotes the relevance weight between the part-of-speech labels z_{t-c} and z_t for the position at distance c to the left of the centre word;
formula 2 is substituted into the NS algorithm, yielding formula 3:
$$index[i]=\begin{cases}c+i+1, & i\in[-c,-1]\\ c+i, & i\in[1,c]\end{cases}$$
u∈{word(t)}∪Neg(word(t))
$$\theta^{u}(i)=O(u)\big[(index[i]-1)\times d+1\;\sim\;index[i]\times d\big]$$
the expression under the braces in Q_CBOW of formula 1 is denoted L and used as the objective function of CWP for gradient derivation; substituting formula 3 into L, it is expressed as formula 4:
$$L=\Big\{L^{word(t)}(u)\log\Big[\sigma\Big(\sum_{-c\le i\le c,\,i\ne 0}\Phi_i(z_{t+i},z_t)\,v(word(t+i))\cdot\theta^{u}(i)\Big)\Big]+\big(1-L^{word(t)}(u)\big)\log\Big[1-\sigma\Big(\sum_{-c\le i\le c,\,i\ne 0}\Phi_i(z_{t+i},z_t)\,v(word(t+i))\cdot\theta^{u}(i)\Big)\Big]\Big\}\quad(\text{formula 4})$$
for the variables Φ_i(z_{t+i}, z_t), θ^u(i) and v(word(t+i)) of the objective function in formula 4, the stochastic gradient ascent method is used to derive and solve the gradients of the three variables of L, which are then continually optimized and updated, thereby realizing joint optimization of the word vectors and the part-of-speech relevance weighting matrix;
2) Establishing the SSGP model: on the basis of the SSG model and the PWE model, for a given centre word word(t), a single output matrix O ∈ R^{|V|×d} is used to predict each context word, and the part-of-speech relevance weighting matrix is introduced into the modelling; comprising:
for the training corpus Corp, the training function of the improved NS algorithm is as follows:
$$Q_{CSG}=\sum_{word(t)\in Corp}\;\sum_{-c\le i\le c,\,i\ne 0}\;\sum_{u\in\{word(t)\}\cup Neg^{word(t+i)}(word(t))}\Big\{L^{word(t)}(u)\log\big[p(word(t+i)\mid u)\big]+\big(1-L^{word(t)}(u)\big)\log\big[1-p(word(t+i)\mid u)\big]\Big\}\quad(\text{formula 11})$$
where p(word(t+i) | u) denotes the posterior probability of word(t+i) given the sample word u, and Q_CSG is the maximum likelihood probability; after the PWE part-of-speech relevance weighting matrix Φ_i is added to the output layer, the part-of-speech weighting factor in the matrix is tied to the directed embedding position of the word, and p(word(t+i) | u) is calculated by formula 12:
where σ is the sigmoid function and v(word) denotes the word vector of the word word;
the expression under the braces in Q_CSG of formula 11 is denoted L1 and used as the objective function of Structured Skip-gram-POS for gradient derivation; substituting formula 12, it is expressed as formula 13:
$$L1=\Big\{L^{word(t)}(u)\log\big[\sigma\big(\Phi_i(z_{t+i},z_t)\,v(word(t+i))\cdot O_i(u)\big)\big]+\big(1-L^{word(t)}(u)\big)\log\big[1-\sigma\big(\Phi_i(z_{t+i},z_t)\,v(word(t+i))\cdot O_i(u)\big)\big]\Big\}\quad(\text{formula 13})$$
for the variable parameters Φ_i(z_{t+i}, z_t), O_i(u) and v(word(t+i)) in the objective function of formula 13, the stochastic gradient ascent method is used to solve the gradients of the three variables of L1, which are then continually optimized and updated, thereby realizing joint optimization of the word vectors and the part-of-speech relevance weighting matrix;
thus, through the CWP model and the SSGP model, the stochastic gradient descent algorithm is used to jointly learn the word embeddings and the part-of-speech relevance weighting matrix, realizing joint optimization of the word vectors and the part-of-speech relevance weighting matrix.
2. The word2vec improvement method according to claim 1, characterized in that, for the variable θ^u(i) of the objective function in formula 4, the stochastic gradient ascent method is used to derive and solve the gradient, which is then continually optimized and updated; specifically:
the gradient of θ^u(i) is expressed as formula 5:
θ^u(i) is updated by formula 6:
3. The word2vec improvement method according to claim 1, characterized in that, for the variable Φ_i(z_{t+i}, z_t) of the objective function in formula 4, the stochastic gradient ascent method is used to derive and solve the gradient, which is then continually optimized and updated; specifically:
the gradient of Φ_i(z_{t+i}, z_t) is expressed by formula 7:
Φ_i(z_{t+i}, z_t) is updated by formula 8:
4. The word2vec improvement method according to claim 1, characterized in that, for the variable v(word(t+i)) of the objective function in formula 4, the stochastic gradient ascent method is used to derive and solve the gradient of this variable, which is then continually optimized and updated; specifically:
the gradient of v(word(t+i)) is expressed as formula 9:
v(word(t+i)) is updated by formula 10:
5. The word2vec improvement method according to claim 1, characterized in that, for the variable parameters Φ_i(z_{t+i}, z_t), O_i(u) and v(word(t+i)) in the objective function of formula 13, the stochastic gradient ascent method is used to solve the gradients of the three variables of L1, which are then continually optimized and updated; the optimization and updating are specifically:
O_i(u) is updated by formula 14:
Φ_i(z_{t+i}, z_t) is updated by formula 15:
v(word(t+i)) is updated by formula 16:
CN201710791297.1A 2017-09-05 2017-09-05 Word2vec improvement method for training correlation factors of united parts of speech and word order Active CN107526834B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710791297.1A CN107526834B (en) 2017-09-05 2017-09-05 Word2vec improvement method for training correlation factors of united parts of speech and word order

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710791297.1A CN107526834B (en) 2017-09-05 2017-09-05 Word2vec improvement method for training correlation factors of united parts of speech and word order

Publications (2)

Publication Number Publication Date
CN107526834A true CN107526834A (en) 2017-12-29
CN107526834B CN107526834B (en) 2020-10-23

Family

ID=60683557

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710791297.1A Active CN107526834B (en) 2017-09-05 2017-09-05 Word2vec improvement method for training correlation factors of united parts of speech and word order

Country Status (1)

Country Link
CN (1) CN107526834B (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170206466A1 (en) * 2016-01-20 2017-07-20 Fair Isaac Corporation Real Time Autonomous Archetype Outlier Analytics
CN106021272A (en) * 2016-04-04 2016-10-12 上海大学 Keyword automatic extraction method based on distributed expression word vector calculation
CN106649275A (en) * 2016-12-28 2017-05-10 成都数联铭品科技有限公司 Relation extraction method based on part-of-speech information and convolutional neural network
CN106980609A (en) * 2017-03-21 2017-07-25 大连理工大学 A kind of name entity recognition method of the condition random field of word-based vector representation
CN107122349A (en) * 2017-04-24 2017-09-01 无锡中科富农物联科技有限公司 A kind of feature word of text extracting method based on word2vec LDA models

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ning Jianfei et al.: "Research on keyword extraction combining Word2vec and TextRank", New Technology of Library and Information Service *

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110348001B (en) * 2018-04-04 2022-11-25 腾讯科技(深圳)有限公司 Word vector training method and server
CN110348001A (en) * 2018-04-04 2019-10-18 腾讯科技(深圳)有限公司 A kind of term vector training method and server
CN110399614A (en) * 2018-07-26 2019-11-01 北京京东尚科信息技术有限公司 System and method for the identification of true product word
CN109190126B (en) * 2018-09-17 2023-08-15 北京神州泰岳软件股份有限公司 Training method and device for word embedding model
CN109308353A (en) * 2018-09-17 2019-02-05 北京神州泰岳软件股份有限公司 The training method and device of word incorporation model
CN109190126A (en) * 2018-09-17 2019-01-11 北京神州泰岳软件股份有限公司 The training method and device of word incorporation model
CN109308353B (en) * 2018-09-17 2023-08-15 鼎富智能科技有限公司 Training method and device for word embedding model
CN109325231A (en) * 2018-09-21 2019-02-12 中山大学 A kind of method that multi task model generates term vector
CN111260021B (en) * 2018-11-30 2024-04-05 百度(美国)有限责任公司 Prediction deep learning scaling
CN111260021A (en) * 2018-11-30 2020-06-09 百度(美国)有限责任公司 Predictive deep learning scaling
CN110008480A (en) * 2018-12-05 2019-07-12 中国科学院自动化研究所 Small data vocabulary dendrography learning method and system and relevant device based on prototype memory
CN109710926B (en) * 2018-12-12 2023-08-29 内蒙古电力(集团)有限责任公司电力调度控制分公司 Method and device for extracting semantic relation of power grid dispatching professional language and electronic equipment
CN109710926A (en) * 2018-12-12 2019-05-03 内蒙古电力(集团)有限责任公司电力调度控制分公司 Dispatching of power netwoks professional language semantic relation extraction method, apparatus and electronic equipment
CN111324731A (en) * 2018-12-13 2020-06-23 百度(美国)有限责任公司 Computer implemented method for embedding words in corpus
CN111324731B (en) * 2018-12-13 2023-10-17 百度(美国)有限责任公司 Computer-implemented method for embedding words of corpus
CN109902309B (en) * 2018-12-17 2023-06-02 北京百度网讯科技有限公司 Translation method, device, equipment and storage medium
CN109902309A (en) * 2018-12-17 2019-06-18 北京百度网讯科技有限公司 Interpretation method, device, equipment and storage medium
CN109903854A (en) * 2019-01-25 2019-06-18 电子科技大学 A kind of core drug recognition methods based on TCM Literature
US11205052B2 (en) 2019-07-02 2021-12-21 Servicenow, Inc. Deriving multiple meaning representations for an utterance in a natural language understanding (NLU) framework
US11720756B2 (en) 2019-07-02 2023-08-08 Servicenow, Inc. Deriving multiple meaning representations for an utterance in a natural language understanding (NLU) framework
WO2021003313A1 (en) * 2019-07-02 2021-01-07 Servicenow, Inc. Deriving multiple meaning representations for an utterance in a natural language understanding framework
CN111274788A (en) * 2020-01-16 2020-06-12 创新工场(广州)人工智能研究有限公司 Dual-channel joint processing method and device
CN114462415A (en) * 2020-11-10 2022-05-10 国际商业机器公司 Context-aware machine language identification
CN114462415B (en) * 2020-11-10 2023-02-14 国际商业机器公司 Context-aware machine language identification
US11907678B2 (en) 2020-11-10 2024-02-20 International Business Machines Corporation Context-aware machine language identification
CN113139061B (en) * 2021-05-14 2023-07-21 东北大学 Case feature extraction method based on word vector clustering
CN113139061A (en) * 2021-05-14 2021-07-20 东北大学 Case feature extraction method based on word vector clustering

Also Published As

Publication number Publication date
CN107526834B (en) 2020-10-23

Similar Documents

Publication Publication Date Title
CN107526834A (en) Joint part of speech and the word2vec improved methods of the correlation factor of word order training
CN107239446B (en) A kind of intelligence relationship extracting method based on neural network Yu attention mechanism
Zhang et al. Shallow convolutional neural network for implicit discourse relation recognition
CN109408823B (en) A kind of specific objective sentiment analysis method based on multi-channel model
CN102662931B (en) Semantic role labeling method based on synergetic neural network
CN110222349A (en) A kind of model and method, computer of the expression of depth dynamic context word
CN106844349B (en) Comment spam recognition methods based on coorinated training
CN106294322A (en) A kind of Chinese based on LSTM zero reference resolution method
CN107133211A (en) A kind of composition methods of marking based on notice mechanism
CN110209822A (en) Sphere of learning data dependence prediction technique based on deep learning, computer
CN112256866B (en) Text fine-grained emotion analysis algorithm based on deep learning
CN111475655B (en) Power distribution network knowledge graph-based power scheduling text entity linking method
CN109214001A (en) A kind of semantic matching system of Chinese and method
CN109710769A (en) A kind of waterborne troops's comment detection system and method based on capsule network
CN105956529A (en) Chinese sign language identification method based on LSTM type RNN
CN107662617A (en) Vehicle-mounted interactive controlling algorithm based on deep learning
CN110232123A (en) The sentiment analysis method and device thereof of text calculate equipment and readable medium
CN112966525B (en) Law field event extraction method based on pre-training model and convolutional neural network algorithm
CN111222330A (en) Chinese event detection method and system
CN108875024B (en) Text classification method and system, readable storage medium and electronic equipment
CN109189820A (en) A kind of mine safety accidents Ontological concept abstracting method
Jin et al. Textual content prediction via fuzzy attention neural network model without predefined knowledge
CN115510230A (en) Mongolian emotion analysis method based on multi-dimensional feature fusion and comparative reinforcement learning mechanism
Diao et al. CRGA: Homographic pun detection with a contextualized-representation: Gated attention network
CN114357166A (en) Text classification method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant