CN107608953A - Word vector generation method based on indefinite-length context - Google Patents

Word vector generation method based on indefinite-length context

Info

Publication number
CN107608953A
Authority
CN
China
Prior art keywords
context
word
word vector
corpus
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710609471.6A
Other languages
Chinese (zh)
Other versions
CN107608953B (en)
Inventor
王俊丽
王小敏
杨亚星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University filed Critical Tongji University
Priority to CN201710609471.6A priority Critical patent/CN107608953B/en
Publication of CN107608953A publication Critical patent/CN107608953A/en
Application granted granted Critical
Publication of CN107608953B publication Critical patent/CN107608953B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

A word vector generation method based on indefinite-length contexts. The present invention relates to the field of natural language processing, and specifically to word vector generation methods based on indefinite-length contexts. The technical scheme proposes an indefinite-length context partitioning strategy and a word vector generation method built on it. The strategy uses punctuation marks to divide the corpus into contexts that vary in length but are semantically complete. Because the context length is not fixed, traditional language models cannot use such contexts to generate word vectors. To address this problem, a language model that can handle indefinite-length contexts, F-Model, is designed by combining a convolutional neural network and a recurrent neural network. Analysis of the implementation results shows that dividing the corpus into semantically complete contexts using punctuation improves the quality of the word vectors. F-Model has good learning ability, and the word vectors obtained in the implementation contain rich semantics and exhibit good linear relationships.

Description

Word vector generation method based on indefinite-length context
Technical field
The present invention relates to the field of natural language processing, and specifically to word vector generation methods based on indefinite-length contexts.
Background technology
Most common natural language processing tasks are realized on the basis of word vectors, and the final processing results often depend heavily on word vector quality. In general, the higher the quality of the word vectors, the richer and more accurate the semantics they contain, and the easier it is for a computer to understand the semantics of natural language; this fundamentally improves the results of other natural language processing tasks. How to generate high-quality word vectors is therefore a basic and important task in the field of natural language processing, one that directly and greatly influences follow-up tasks such as machine translation and part-of-speech tagging.
In conventional word vector generation methods, in order to simplify the problem and reduce computational complexity, the corpus is divided into context units of fixed length. Such fixed-length contexts, however, are not complete semantic units, which causes semantic loss and semantic confusion in the contexts. This loss and confusion are passed on to the word vectors, directly producing semantically deficient and confused word vectors.
To solve the semantic loss and confusion that fixed contexts bring to word vectors, this work makes full use of the original corpus information and uses punctuation marks to divide the corpus into semantically relatively complete context units. The length of such context units is not fixed, so traditional word vector generation methods based on fixed contexts no longer apply.
Therefore, the present invention proposes a word vector generation method based on indefinite-length contexts. The method is built on convolutional and recurrent neural networks and strengthens the long-range dependency information between words. The implementation results show that the word vectors generated by this method contain richer semantics and exhibit better linear relationships between word vectors.
Content of the invention
The technical problem to be solved by the present invention is to provide an indefinite-length context partitioning strategy and a word vector generation method based on indefinite-length contexts. The strategy uses punctuation marks to divide the corpus into contexts that vary in length but are semantically complete, solving the semantic loss and confusion caused by fixed-length contexts in traditional language models. The word vector generation method built on this partitioning exploits the characteristics and advantages of convolutional and recurrent neural networks to strengthen the long-range dependency information between words and ultimately improve the quality of the generated word vectors.
To achieve the above object, the present invention proposes a word vector generation method based on indefinite-length contexts, characterized in that it uses punctuation marks, probability statistics, and the characteristics and advantages of convolutional and recurrent neural networks to preserve the semantic integrity of contexts, strengthen the long-range dependencies between words, and improve the semantic capacity of the word vectors.
The present invention first preprocesses the corpus and then uses punctuation marks to divide it into context units of varying length that are semantically complete. A convolutional neural network then learns the weight of each word in a context, and this weight is combined with the global distribution of the corpus to generate the final weight of each word in the context. The final weights and the word vectors are used to compute the vector representation of the context, which in turn is used to build the one-to-many mapping between the context and each word in it. The model is then trained with stochastic gradient descent, finally yielding the word vectors.
The present invention is achieved through the following technical solutions:
(1) Document preprocessing, obtaining the training corpus. Given a set of documents on a certain professional domain, preprocessing techniques such as removing stop words and low-frequency words extract the useful information in the corpus, which then forms the training corpus.
(2) Word frequency statistics, counting the corpus distribution. Based on the word occurrence frequencies in the documents, a dictionary of the corpus is generated containing each word, its index, and its frequency in the corpus.
(3) Building the training set. Using the punctuation marks in the training corpus, the corpus is divided into contexts of unequal length, forming the training set.
(4) Computing the word weights in the context. The word vectors of the words in a context form the context matrix. A convolutional neural network obtains the weight of each word through a convolution over the context matrix, and this weight is combined with the word's frequency in the corpus to form the final weight.
(5) Computing the distributed representation of the context. Combining the word weights obtained in step (4) yields the distributed representation of the current context. The historical context information in the recurrent neural network is then used to generate the newest distributed representation of the context, while the historical state of the recurrent neural network is updated.
(6) Model inference. Using the distributed context information obtained in step (5), the one-to-many mapping between the context and the words in it is built, and the loss function of the model is constructed.
(7) Training the model, obtaining the word vectors. Optimization is performed on the training set according to the mapping built in step (6), using negative sampling and stochastic gradient descent.
In the above method, step (3) uses punctuation marks; the punctuation marks referred to here are those that delimit relatively complete semantic segments, such as "，", "。", "？", and "！".
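For illustration only, a minimal Python sketch of this punctuation-based segmentation, assuming a plain-text English corpus and an ASCII boundary set (the function name split_contexts and the exact boundary set are illustrative, not prescribed by the patent):

```python
import re

# Punctuation treated as boundaries of relatively complete semantic
# segments; the patent cites commas, periods, question marks, and
# exclamation marks, so an ASCII approximation is used here.
BOUNDARIES = re.compile(r"[,.?!]")

def split_contexts(text):
    """Split raw text into variable-length, semantically complete contexts."""
    segments = BOUNDARIES.split(text)
    return [seg.split() for seg in segments if seg.strip()]

contexts = split_contexts("the cat sat on the mat, then it slept. the end!")
# -> [['the', 'cat', 'sat', 'on', 'the', 'mat'],
#     ['then', 'it', 'slept'], ['the', 'end']]
```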
In the above method, step (4) uses a convolutional neural network whose kernel has size [1, 3, m, 1], where m is the dimensionality of the word vectors. With this network, a context matrix of shape [k, m] is convolved into a weight vector of shape [k, 1], where k is the number of words in the context. This weight is then combined with the corpus distribution to compute the final weight.
In the above method, step (4) specifically comprises the following steps:
A) Building the context matrix. The context matrix is built from the input of F-Model. The model's input In takes a context as its unit and comprises two parts: the index vector I of the words in the context, and the global distribution vector Wt1 of those words. Each element of vector I is the index of a word in the dictionary. The length of vector Wt1 equals the length of the context, and each of its elements is the frequency of the corresponding word of I over the whole corpus. A gather() operation looks up the word vectors in the dictionary table according to the input index vector I and arranges them in order to form the context matrix C.
Wt1 = (d_0, d_1, …, d_k)  (1)
I = (i_0, i_1, …, i_k)  (2)
In = (I, Wt1)  (3)
C = gather(D, I)  (4)
where k is the length of the context, i_k denotes the index of word W_k in the dictionary, and d_k denotes the global distribution of word W_k.
B) The context matrix C is convolved by the convolutional layer into a weight vector Wt2 of shape [k, 1].
Wt2 = f(C) + B  (5)
where D is the dictionary, f(·) is the convolution operation, the convolution kernel has shape [1, 3, m, 1], and B is the bias term of the convolution.
C) Computing the final weight. The final weight Wt is computed by combining the convolutional weight with the corpus distribution:
Wt = Wt2 · Wt1  (6)
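A minimal numpy sketch of sub-steps A)–C), under the assumption that the [1, 3, m, 1] kernel acts as a window over three consecutive word vectors producing one scalar weight per word; all names beyond those in the equations are illustrative:

```python
import numpy as np

k, m, V = 6, 50, 10_000          # context length, vector dimension, vocabulary size
rng = np.random.default_rng(0)

D = rng.normal(size=(V, m))      # dictionary table of word vectors
I = rng.integers(0, V, size=k)   # index vector of the context words           (2)
Wt1 = np.full(k, 1.0 / V)        # global distribution of those words          (1)

C = D[I]                         # C = gather(D, I), shape [k, m]              (4)

# Convolve C into a [k]-shaped weight vector: each word's weight comes
# from a window of three consecutive word vectors (zero-padded at the ends).
F = rng.normal(size=(3, m))      # convolution kernel; B is its bias
B = 0.0
Cp = np.pad(C, ((1, 1), (0, 0)))
Wt2 = np.array([np.sum(Cp[i:i + 3] * F) for i in range(k)]) + B  #             (5)

Wt = Wt2 * Wt1                   # final weight, combined with the corpus distribution (6)
```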
In the above method, step (5) computes the semantic vector Ct of the context from the context matrix C and the weight Wt obtained in step (4). The recurrent neural network then combines the current context semantic vector Ct with the historical context state it has learned to compute the newest context semantic vector, so that the context semantic vector incorporates the historical context semantics learned by the recurrent neural network.
In the above method, step (5) specifically comprises the following steps:
A) Computing the distributed representation of the current context. On top of the bag-of-words model, the weight information of each word is added.
The distributed representation of the current context is computed as:
Ct = C × Wt  (7)
where C is the context matrix and × denotes matrix multiplication.
B) Adding historical information with the recurrent neural network:
Ct = rnn(Ct)  (8)
where rnn(·) denotes a pass through the recurrent neural network.
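Continuing the numpy sketch above, step (5) might look as follows, with a single-layer tanh recurrence standing in for the recurrent neural network (the update rule and the parameter names Wx, Wh, h are assumptions, not specified by the patent):

```python
# Distributed representation of the current context: a weighted
# combination of the context word vectors on top of bag-of-words     (7)
Ct = C.T @ Wt                    # shape [m]

# Recurrent layer: fold the historical context state h into Ct       (8)
Wx = rng.normal(size=(m, m))
Wh = rng.normal(size=(m, m))
h = np.zeros(m)                  # historical context state
h = np.tanh(Wx @ Ct + Wh @ h)    # newest context semantic vector
Ct = h                           # Ct now carries the learned history
```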
In the above method, step (6) predicts each word in the context from the context semantic vector Ct, computing the conditional probability of each target word given the context:
P(W_i | C) = sigmoid(C θ_i)  (9)
where θ_i is the parameter vector of word W_i.
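Under the same assumptions, the prediction of step (6) can be sketched as (Theta is the per-word parameter table θ):

```python
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

Theta = rng.normal(size=(V, m))      # one parameter vector per word
p = sigmoid(Ct @ Theta[I].T)         # P(W_i | C) for every word of the context (9)
```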
In the above method, during the training of step (7), the loss function adopts the model perplexity, i.e., the inverse geometric mean of the predicted probabilities of the target words over the context:
loss = (∏_{i=1}^{k} P(W_i | C))^{-1/k}  (10)
where k is the length of the context and P(W_i | C) is the conditional probability of word W_i given context C.
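A direct transcription of this loss in the running sketch; the second form is the numerically stabler log-space equivalent:

```python
# Perplexity-style loss over the k context words                     (10)
loss = 1.0 / np.prod(p) ** (1.0 / k)
loss_log = np.exp(-np.mean(np.log(p)))   # same value, computed in log space
```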
The effects and advantages of the present invention are as follows: punctuation marks divide the corpus into semantically complete contexts; F-Model exploits the characteristics and advantages of convolutional and recurrent neural networks to strengthen the long-range dependencies between words and improve the quality of the word vectors.
Brief description of the drawings
Fig. 1 is the structure diagram of F-Model.
Fig. 2 is the implementation module diagram of the present invention.
Fig. 3 shows the loss curves of F-Model under different learning rates.
Fig. 4: Table 1, similarity analysis of words.
Fig. 5: Table 2, F-Model linear relationship analysis data (partial).
Fig. 6: Table 3, accuracy on the questions test set under different dimensionalities and iteration counts.
Embodiment
To make the purpose, technical scheme, and advantages of the present invention clearer, the word vector generation method according to the embodiments of the invention is further described below with reference to the accompanying drawings. It should be understood that the specific embodiments described here only explain the present invention and do not limit it; that is, the protection scope of the invention is not limited to the following embodiments. On the contrary, those of ordinary skill in the art may make appropriate changes according to the inventive concept, and such changes fall within the scope of the invention as defined by the claims.
As shown in the structure diagram of Fig. 1, the word vector generation method based on indefinite-length contexts according to an embodiment of the present invention comprises the following steps:
1) pretreatment module:
The corpus is preprocessed: stop words and low-frequency words are removed, digits are replaced with a fixed token, all words are lowercased, and so on, eventually forming an effective training corpus.
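A minimal sketch of this preprocessing (the stop-word list is illustrative; the frequency threshold of 5 and the capital-N digit token follow the implementation notes below):

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of"}      # illustrative stop-word list
MIN_FREQ = 5                          # words rarer than this are dropped

def preprocess(tokens):
    """Lowercase, drop stop words and low-frequency words, map digits to N."""
    counts = Counter(t.lower() for t in tokens)
    out = []
    for t in tokens:
        t = t.lower()
        if t in STOP_WORDS or counts[t] < MIN_FREQ:
            continue
        out.append("N" if t.isdigit() else t)
    return out
```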
2) model construction module:
The word frequency distribution of the training corpus is counted and the dictionary is generated; the dictionary content includes words, indices, and frequencies. Punctuation marks divide the corpus into indefinite-length, semantically complete context units that form the training set. The input of the F-Model takes a context as its unit and comprises two parts: the index vector I of the words in the context and the global distribution vector Wt1 of those words. The index vector I builds the context matrix C, and the convolutional neural network learns the weights Wt2 of the words in the context matrix. Wt2 and the corpus distribution Wt1 form the final weight Wt. From Wt and the context matrix C, F-Model computes the distributed representation Ct of the context. F-Model then builds the one-to-many mapping between the context and its words and constructs the loss function with negative sampling, providing the optimization target for subsequent training.
3) model training module:
The model is trained on the training set with stochastic gradient descent by continually reducing the loss function, and the quality of the obtained word vectors is improved by tuning the model hyperparameters. The hyperparameters include the vector dimensionality, the number of iterations, and the learning rate. Three vector dimensionalities were used in the implementation: 50, 100, and 200. The number of iterations, i.e., how many times each training set passes through the model, was 10 or 20. Several fixed learning rates were used: 0.1, 0.01, and 0.001.
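Continuing the earlier numpy sketch, a toy negative-sampling SGD update for one context, updating only the per-word parameters Theta (the negative-sample count and the restriction to Theta are simplifications for illustration; the full procedure also updates the convolutional and recurrent parameters):

```python
LEARNING_RATE = 0.01                 # one of the fixed rates tried: 0.1, 0.01, 0.001
NUM_NEG = 5                          # negative samples per context (assumed)

for epoch in range(10):              # 10 or 20 passes, as in the implementation
    positives = [(w, 1.0) for w in I]                         # words of the context
    negatives = [(w, 0.0) for w in rng.integers(0, V, NUM_NEG)]
    for w, label in positives + negatives:
        p_w = sigmoid(Ct @ Theta[w])
        Theta[w] -= LEARNING_RATE * (p_w - label) * Ct        # SGD step
```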
Implementation notes and results:
Four data sets were used in the implementation: the English corpus news.2011.en.shuffled from the Billion-Words corpus, the ptb.train.txt corpus, the Wordsim353 data set, and the questions-words.txt data set. The news.2011.en.shuffled corpus serves as the training set, and the ptb.train.txt corpus is used to test model functionality. Wordsim353 and questions-words.txt serve as test sets: Wordsim353 is used to compare the similarity of word vectors, and questions-words.txt is used to assess the linear relationships of word vectors.
The news.2011.en.shuffled corpus is plain-text data crawled from 2011 news; it is large, complete, and retains raw information such as punctuation. When processing this corpus, words with frequency below 5 and words in the stop-word list are ignored, all words are lowercased, and all digits are converted to a capital N. Compared with Billion-Words, ptb.train.txt is a smaller corpus and is used in the implementation to test the model. The Wordsim353 data set is a small, manually built corpus containing word pairs and their similarities; the similarity is produced by human scoring, with a minimum score of 0 and a maximum of 10. Wordsim353 contains three subsets: set1, set2, and combined. set1 contains similarity scores for word pairs from 13 different human raters, set2 contains scores from 16 raters, and combined averages the similarity scores of set1 and set2. Wordsim353 is used as a test set in the implementation, serving as the benchmark for similarity evaluation. The questions-words.txt data set is the one used by word2vec in TensorFlow; it is a manually constructed test set for probing the linear relationships of word vectors. Each test case contains 4 words that stand in a linear relationship of the form King - Man + Woman = Queen.
Analysis of the word vector similarity and linear relationship tests (see Table 1, Table 2, Table 3, and Fig. 3) shows that word vector quality is influenced simultaneously by the model, the context, and the dimensionality. Dividing indefinite-length contexts with punctuation preserves more complete semantics and improves word vector quality. F-Model has good learning ability, and the trained word vectors contain rich semantics and exhibit good linear relationships. A larger word vector dimensionality increases semantic capacity, but it also brings the curse of dimensionality and overfitting, causing the word vectors to perform worse when transferred.
It should be noted and understood that various modifications and improvements may be made to the invention described above without departing from the spirit and scope of the invention as claimed in the appended claims. The scope of the claimed technical scheme is therefore not limited by any particular exemplary teachings given.

Claims (8)

1. A word vector generation method based on indefinite-length contexts, characterized in that the corpus is first preprocessed and then divided, using punctuation marks, into context units of varying length that are semantically complete; a convolutional neural network then learns the weight of each word in a context, and this weight is combined with the global distribution of the corpus to generate the final weight of each word in the context; the final weights and the word vectors are used to compute the vector representation of the context; this vector representation is used to build the one-to-many mapping between the context and each word in it; the model is then trained by stochastic gradient descent, finally obtaining the word vectors.
2. The word vector generation method based on indefinite-length contexts according to claim 1, characterized in that it specifically comprises the following steps:
(1) Document preprocessing, obtaining the training corpus: given a set of documents on a certain professional domain, preprocessing techniques such as removing stop words and low-frequency words extract the useful information in the corpus, which then forms the training corpus.
(2) Word frequency statistics, counting the corpus distribution: based on the word occurrence frequencies in the documents, a dictionary of the corpus is generated containing each word, its index, and its frequency in the corpus.
(3) Building the training set: using the punctuation marks in the training corpus, the corpus is divided into contexts of unequal length, forming the training set.
(4) Computing the word weights in the context: the word vectors of the words in a context form the context matrix; a convolutional neural network obtains the weight of each word through a convolution over the context matrix, and this weight is combined with the word's frequency in the corpus to form the final weight.
(5) Computing the distributed representation of the context: combining the word weights obtained in step (4) yields the distributed representation of the current context; the historical context information in the recurrent neural network is then used to generate the newest distributed representation of the context, while the historical state of the recurrent neural network is updated.
(6) Model inference: using the distributed context information obtained in step (5), the one-to-many mapping between the context and the words in it is built, and the loss function of the model is constructed.
(7) Training the model, obtaining the word vectors: optimization is performed on the training set according to the mapping built in step (6), using negative sampling and stochastic gradient descent.
In the above method, step (3) uses punctuation marks; the punctuation marks referred to here are those that delimit relatively complete semantic segments, such as "，", "。", "？", and "！".
3. The word vector generation method based on indefinite-length contexts according to claim 2, characterized in that step (4) uses a convolutional neural network whose kernel has size [1, 3, m, 1], where m is the dimensionality of the word vectors; with this network, a context matrix of shape [k, m] is convolved into a weight vector of shape [k, 1], where k is the number of words in the context; this weight is combined with the corpus distribution to compute the final weight.
4. The word vector generation method based on indefinite-length contexts according to claim 2, characterized in that step (4) specifically comprises the following steps:
A) Building the context matrix. The context matrix is built from the input of F-Model. The model's input In takes a context as its unit and comprises two parts: the index vector I of the words in the context, and the global distribution vector Wt1 of those words. Each element of vector I is the index of a word in the dictionary. The length of vector Wt1 equals the length of the context, and each of its elements is the frequency of the corresponding word of I over the whole corpus. A gather() operation looks up the word vectors in the dictionary table according to the input index vector I and arranges them in order to form the context matrix C.
Wt1 = (d_0, d_1, …, d_k)  (1)
I = (i_0, i_1, …, i_k)  (2)
In = (I, Wt1)  (3)
C = gather(D, I)  (4)
where k is the length of the context, i_k denotes the index of word W_k in the dictionary, and d_k denotes the global distribution of word W_k. B) The context matrix C is convolved by the convolutional layer into a weight vector Wt2 of shape [k, 1].
Wt2 = f(C) + B  (5)
where D is the dictionary, f(·) is the convolution operation, the convolution kernel has shape [1, 3, m, 1], and B is the bias term of the convolution.
C) Computing the final weight. The final weight Wt is computed by combining the convolutional weight with the corpus distribution:
Wt = Wt2 · Wt1  (6)
5. The word vector generation method based on indefinite-length contexts according to claim 2, characterized in that step (5) computes the context semantic vector Ct from the context matrix C and the weight Wt obtained in step (4); the recurrent neural network combines the current context semantic vector Ct with the historical context state it has learned to compute the newest context semantic vector, so that the context semantic vector incorporates the historical context semantics learned by the recurrent neural network.
6. The word vector generation method based on indefinite-length contexts according to claim 5, characterized in that step (5) specifically comprises the following steps:
A) Computing the distributed representation of the current context. On top of the bag-of-words model, the weight information of each word is added.
The distributed representation of the current context is computed as:
Ct = C × Wt  (7)
where C is the context matrix and × denotes matrix multiplication.
B) Adding historical information with the recurrent neural network:
Ct = rnn(Ct)  (8)
where rnn(·) denotes a pass through the recurrent neural network.
7. The word vector generation method based on indefinite-length contexts according to claim 2, characterized in that step (6) predicts each word in the context from the context semantic vector Ct, computing the conditional probability of each target word given the context as:
P(W_i | C) = sigmoid(C θ_i)  (7)
where θ_i is the parameter vector of word W_i.
8. The word vector generation method based on indefinite-length contexts according to claim 1, characterized in that during the training of step (7) the loss function adopts the model perplexity, i.e., the inverse geometric mean of the predicted probabilities of the target words over the context:
loss = (∏_{i=1}^{k} P(W_i | C))^{-1/k}  (8)
where k is the length of the context and P(W_i | C) is the conditional probability of word W_i given context C.
CN201710609471.6A 2017-07-25 2017-07-25 Word vector generation method based on indefinite-length context Active CN107608953B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710609471.6A CN107608953B (en) 2017-07-25 2017-07-25 Word vector generation method based on indefinite-length context

Publications (2)

Publication Number Publication Date
CN107608953A true CN107608953A (en) 2018-01-19
CN107608953B CN107608953B (en) 2020-08-14

Family

ID=61059516

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710609471.6A Active CN107608953B (en) 2017-07-25 2017-07-25 Word vector generation method based on indefinite-length context

Country Status (1)

Country Link
CN (1) CN107608953B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101482860A (en) * 2008-01-09 2009-07-15 中国科学院自动化研究所 Automatic extraction and filtration method for Chinese-English phrase translation pairs
US20090210218A1 (en) * 2008-02-07 2009-08-20 Nec Laboratories America, Inc. Deep Neural Networks and Methods for Using Same
CN101685441A (en) * 2008-09-24 2010-03-31 中国科学院自动化研究所 Generalized reordering statistic translation method and device based on non-continuous phrase
CN102231278A (en) * 2011-06-10 2011-11-02 安徽科大讯飞信息科技股份有限公司 Method and system for realizing automatic addition of punctuation marks in speech recognition
CN103324612A (en) * 2012-03-22 2013-09-25 北京百度网讯科技有限公司 Method and device for segmenting word
CN102930055A (en) * 2012-11-18 2013-02-13 浙江大学 New network word discovery method in combination with internal polymerization degree and external discrete information entropy
CN103530284A (en) * 2013-09-22 2014-01-22 中国专利信息中心 Short sentence segmenting device, machine translation system and corresponding segmenting method and translation method
CN104317965A (en) * 2014-11-14 2015-01-28 南京理工大学 Establishment method of emotion dictionary based on linguistic data
JP2017059205A (en) * 2015-09-17 2017-03-23 パナソニックIpマネジメント株式会社 Subject estimation system, subject estimation method, and program
CN106682220A (en) * 2017-01-04 2017-05-17 华南理工大学 Online traditional Chinese medicine text named entity identifying method based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李琼 (Li Qiong), "利用标点符号自动识别分句" (Automatic identification of clauses using punctuation marks), 《皖西学院学报》 (Journal of West Anhui University) *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110119507A (en) * 2018-02-05 2019-08-13 阿里巴巴集团控股有限公司 Term vector generation method, device and equipment
WO2019154411A1 (en) * 2018-02-12 2019-08-15 腾讯科技(深圳)有限公司 Word vector retrofitting method and device
US11586817B2 (en) 2018-02-12 2023-02-21 Tencent Technology (Shenzhen) Company Limited Word vector retrofitting method and apparatus
CN108733647A (en) * 2018-04-13 2018-11-02 中山大学 A kind of term vector generation method based on Gaussian Profile
CN108733647B (en) * 2018-04-13 2022-03-25 中山大学 Word vector generation method based on Gaussian distribution
CN109582771A (en) * 2018-11-26 2019-04-05 国网湖南省电力有限公司 Smart client exchange method towards power domain based on mobile application
CN110096697A (en) * 2019-03-15 2019-08-06 华为技术有限公司 Term vector matrix compression method and apparatus and the method and apparatus for obtaining term vector
CN110096697B (en) * 2019-03-15 2022-04-12 华为技术有限公司 Word vector matrix compression method and device, and method and device for obtaining word vectors
CN110287337A (en) * 2019-06-19 2019-09-27 上海交通大学 The system and method for medicine synonym is obtained based on deep learning and knowledge mapping
CN110781678A (en) * 2019-10-14 2020-02-11 大连理工大学 Text representation method based on matrix form
CN110781678B (en) * 2019-10-14 2022-09-20 大连理工大学 Text representation method based on matrix form
CN111241819A (en) * 2020-01-07 2020-06-05 北京百度网讯科技有限公司 Word vector generation method and device and electronic equipment

Also Published As

Publication number Publication date
CN107608953B (en) 2020-08-14

Similar Documents

Publication Publication Date Title
CN107608953A (en) A kind of term vector generation method based on random length context
US20200012953A1 (en) Method and apparatus for generating model
CN107967255A (en) A kind of method and system for judging text similarity
CN106897559B (en) A kind of symptom and sign class entity recognition method and device towards multi-data source
CN106980609A (en) A kind of name entity recognition method of the condition random field of word-based vector representation
CN105955962B (en) The calculation method and device of topic similarity
CN110516245A (en) Fine granularity sentiment analysis method, apparatus, computer equipment and storage medium
CN107038480A (en) A kind of text sentiment classification method based on convolutional neural networks
CN108090510A (en) A kind of integrated learning approach and device based on interval optimization
CN108563703A (en) A kind of determination method of charge, device and computer equipment, storage medium
CN105893609A (en) Mobile APP recommendation method based on weighted mixing
CN109739995B (en) Information processing method and device
CN109214004B (en) Big data processing method based on machine learning
CN107203600A (en) It is a kind of to utilize the evaluation method for portraying cause and effect dependence and sequential influencing mechanism enhancing answer quality-ordered
CN104794222B (en) Network form semanteme restoration methods
CN111581364B (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field
JP7347179B2 (en) Methods, devices and computer programs for extracting web page content
CN112035629B (en) Method for implementing question-answer model based on symbolized knowledge and neural network
CN105893363A (en) A method and a system for acquiring relevant knowledge points of a knowledge point
CN109829054A (en) A kind of file classification method and system
CN114692615A (en) Small sample semantic graph recognition method for small languages
CN104361077B (en) The creation method and device of webpage scoring model
CN113254612A (en) Knowledge question-answering processing method, device, equipment and storage medium
Renner et al. Mapping common data elements to a domain model using an artificial neural network
CN105468657B (en) A kind of method and system of the important knowledge point in acquisition field

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant