CN109271632A - Supervised word vector learning method - Google Patents

Supervised word vector learning method

Info

Publication number
CN109271632A
CN109271632A
Authority
CN
China
Prior art keywords
word vector
word
vector
model
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811075603.2A
Other languages
Chinese (zh)
Other versions
CN109271632B (en)
Inventor
覃勋辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Star Cube Digital Technology Co., Ltd.
Original Assignee
Chongqing Xiezhi Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Xiezhi Technology Co., Ltd.
Priority to CN201811075603.2A
Publication of CN109271632A
Application granted
Publication of CN109271632B
Legal status: Active (current)
Anticipated expiration

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/084: Backpropagation, e.g. using gradient descent
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The present patent application discloses a supervised word vector learning method, relating to the field of natural language processing methods, and comprising the following steps: step 1, building a deep learning network model by adding a word relationship classification model on top of the word2vec neural network model; step 2, inputting several adjacent input word vectors together with a certain specified word vector into the deep learning network model and performing multi-task learning; step 3, repeating step 2 and iterating the computation to obtain the optimized word2vec neural network model and word relationship classification model. The present patent application can obtain the relationship between a word vector and the specified word vector at the same time as that word vector is computed.

Description

Supervised word vector learning method
Technical field
The present invention relates to the field of natural language processing methods, and in particular to a supervised word vector learning method.
Background art
A word vector (word embedding), the vector representation of a word, is a common operation in natural language processing and a basic technology behind Internet services such as search engines, advertising systems, and recommender systems.
A word vector can be understood simply as the vectorized expression of a word, abstracting an entity into a mathematical description. For example, the word "apple" may be expressed as [0.4, 0.5, 0.9, ...] and "banana" as [0.3, 0.8, 0.1, ...]. Different dimensions of the vector characterize different features, so the values on different dimensions represent different semantics.
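For illustration, similarity between two such vectors is typically measured with cosine similarity. A minimal sketch in Python, reusing the toy values above (which are examples, not trained embeddings):

```python
import numpy as np

# Toy word vectors from the example above, not trained embeddings.
apple = np.array([0.4, 0.5, 0.9])
banana = np.array([0.3, 0.8, 0.1])

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two vectors: u.v / (|u||v|)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(apple, banana))  # values near 1 suggest similar words
```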
Natural language processing (NLP for short) is a subdiscipline at the intersection of artificial intelligence and linguistics. The field investigates how to process and use natural language: enabling computers to "understand" human language, converting data into natural language, and converting natural language into forms that computer programs can handle more easily.
Natural language processing today includes many approaches, among which word2vec is a now-common family of models used for the task. Word2vec relies on skip-grams or the continuous bag-of-words model (CBOW) to build neural word embeddings, obtaining word vectors through a neural network model. Compared with skip-grams, CBOW better meets the requirements of everyday interchange between natural language and machine language.
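As a point of reference, a minimal sketch of training such word vectors with the third-party gensim library (gensim is not part of this patent; the corpus and parameters below are illustrative assumptions):

```python
from gensim.models import Word2Vec  # gensim >= 4.0

# A tiny illustrative corpus; a real corpus contains many sentences.
sentences = [
    ["I", "eat", "an", "apple"],
    ["I", "eat", "a", "banana"],
    ["apple", "and", "banana", "are", "fruit"],
]

# sg=0 selects CBOW; window=2 takes two context words on each side;
# vector_size is the word-vector dimensionality n.
model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=0)

print(model.wv["apple"][:5])                   # the learned word vector
print(model.wv.similarity("apple", "banana"))  # unsupervised similarity only
```

Note that this baseline yields only distributional similarity; it cannot say whether "apple" and "banana" are synonyms, appositions, or in a hypernym-hyponym relation, which is exactly the gap the method below addresses.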
Although word2vec can perform natural language processing, word ambiguity and incoherent sentences frequently occur. The root cause is word2vec's unsupervised mechanism: word2vec considers only the relationship between a word and its surrounding words, so when two synonyms appear in different contexts, the word vectors trained for them naturally differ widely. Among the word vectors learned by word2vec from a large corpus, the words close to a given word in the vector space include synonyms, appositions, hypernyms and hyponyms, related words, and so on, but word2vec cannot distinguish these relationships. Many NLP tasks need such word-to-word relationships, yet the word vectors obtained by existing learning methods do not provide them.
Summary of the invention
The invention is intended to provide a supervised word vector learning method that can obtain not only the word vectors corresponding to natural language but also predict the relationship between two word vectors.
The supervised word vector learning method in this scheme comprises the following steps:
Step 1: build a deep learning network model by adding a word relationship classification model on top of the word2vec neural network model.
Step 2: input several adjacent input word vectors together with a certain specified word vector into the deep learning network model and perform multi-task learning.
Step 3: repeat step 2 and iterate the computation to obtain the optimized word2vec neural network model and word relationship classification model.
The present invention has the following advantages:
The present invention proposes a supervised word vector generation method based on word-to-word relationships. On top of the existing word2vec, the method adds a word relationship classification model for computing word-to-word relationships and uses the multi-task learning mechanism of neural networks to learn word vectors and word-to-word relationships simultaneously. After training is complete, not only can the word vector corresponding to a word be obtained, but the word relationship of two words can also be predicted. Such word relationships play a very important role in several technical fields of natural language processing, such as text similarity computation and information retrieval.
In addition, telling the neural network the prior knowledge of words during training helps eliminate insufficient learning of low-frequency words.
Further, before step 1, the corpus text is segmented into words, and a vocabulary and initial word vectors corresponding to the vocabulary are established.
By collecting corpora and establishing the vocabulary and initial word vectors, the newly built deep learning network model can be given its initial training.
Further, before step 1, the relationships between the word vectors in the corpus text are annotated according to the vocabulary.
Through the word vectors annotated with relationships, the output vectors and word relationships of the deep learning network model can be corrected by backward learning, so that the parameters of the word2vec neural network model and the word relationship classification model inside the deep learning network model can be optimized.
Here, the corpus text is collected from the Internet using a crawler and from existing corpus repositories.
The corpus text in existing corpus repositories is relatively complete but not up to date; Internet language crawled from the web supplements the corpus text of the existing repositories, so that the established vocabulary and initial word vectors both reflect contemporary language characteristics.
Further, in step 1, the word relationship classification model includes a sequentially connected input layer, concatenation layer, fully connected layer, and probability layer. The concatenation layer concatenates the output vector Wi computed by the word2vec neural network model and the specified vector Wk input to the word relationship classification model according to the following formula: [Wi, Wk, Wi-Wk, Wi∘Wk, cos(Wi, Wk)].
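A minimal sketch of this concatenation step (NumPy; the function name is chosen here for illustration, and ∘ is read as the element-wise product):

```python
import numpy as np

def relation_features(wi: np.ndarray, wk: np.ndarray) -> np.ndarray:
    """Build the row vector [Wi, Wk, Wi-Wk, Wi∘Wk, cos(Wi, Wk)]."""
    cos = np.dot(wi, wk) / (np.linalg.norm(wi) * np.linalg.norm(wk))
    return np.concatenate([wi, wk, wi - wk, wi * wk, [cos]])

wi, wk = np.random.rand(100), np.random.rand(100)
print(relation_features(wi, wk).shape)  # (4 * 100 + 1,) = (401,)
```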
Through the word relationship classification model, the relationships between the initial word vectors are annotated accordingly, which makes it convenient to compute with these relationships during the subsequent training and computation process.
Further, in step 2, the input word vectors and the specified word vector are defined from the initial word vectors.
All word vectors are initialized as vectors of a specified, equal length.
Further, in step 2, the continuous bag-of-words model is used, and the multiple word vectors adjacent to the output word vector serve as the input word vectors of the word2vec neural network model.
The continuous bag-of-words model is currently the main model used for natural language processing in word2vec, but the word vectors within each bag are not related to one another, which means the word vectors finally computed also have difficulty establishing accurate and correct relationships with other word vectors; the present invention effectively solves this problem by adding the word relationship classification model. Moreover, superimposing the continuous bag-of-words model on the neural network model greatly reduces the number of layers to compute and the number of iterations, reducing the amount of computation, so that natural language can be quickly processed into standard word vectors for subsequent applications.
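For orientation, a minimal sketch of the CBOW forward computation (PyTorch, a third-party library not named in the patent; the flat output layer stands in for the Huffman-tree hierarchical softmax used in the embodiment, and all sizes are illustrative):

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    """Predict the center word from the average of its context word vectors."""

    def __init__(self, vocab_size: int = 10000, dim: int = 100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)  # input word vectors
        self.out = nn.Linear(dim, vocab_size)       # flat stand-in for
                                                    # hierarchical softmax

    def forward(self, context_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.embed(context_ids).mean(dim=1)  # average context vectors
        return self.out(hidden)                       # logits over vocabulary
```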
Further, in step 2, when performing multi-task learning, while the word2vec neural network model computes the output vector Wi, the word relationship classification model computes the relationship label(Wi, Wk) of Wi and Wk.
While the word2vec neural network is trained with the initial vectors, the word relationship classification model is trained at the same time; the trained deep learning network model can obtain the relationship label(Wi, Wk) of Wi and Wk while obtaining the output word vector Wi.
Further, in step 2, the word2vec neural network optimizes the neural network parameters through the error backpropagation mechanism; the error includes the classification error of the Huffman tree and the word relationship classification error.
This optimizes the computed output vector Wi and the word2vec neural network model.
Further, in step 2, the word relationship classification model optimizes the fully connected layer parameters through the neural network error backpropagation mechanism.
Using the word vectors annotated with relationships, the relationships computed by the word relationship classification model are compared against the annotations, and the fully connected layer parameters are corrected and updated accordingly, optimizing the computed label(Wi, Wk) and the word relationship classification model.
Further, in step 3, multiple randomly selected input vectors and a specified vector are respectively input into the word2vec neural network model and the word relationship classification model, and an output word vector together with the relationship between that output word vector and the specified word vector is computed.
When the word2vec neural network model and the word relationship classification model trained through many iterations are put to use, the relationship between the output word vector and the specified word vector can be obtained synchronously while the output word vector is obtained.
Brief description of the drawings
Fig. 1 is a flowchart of an embodiment of the present invention.
Fig. 2 is an operational framework diagram of an embodiment of the present invention.
Detailed description of the embodiments
The invention is further described below through a specific embodiment:
The embodiment is substantially as shown in Fig. 1. The supervised word vector learning method in this embodiment comprises the following steps:
Step 1: establish a corpus text library and segment the corpus text into words; the segmentation can use existing tools such as LTP or jieba, or even be done by hand. After segmentation, a vocabulary is established; the vocabulary is a set composed of individual words. Initial word vectors are then selected at random.
When establishing the corpus text library, corpus text is collected through existing Chinese corpus resources such as the "CCS" dictionary, "HowNet", and "Dacilin", and through a crawler on the Internet, forming multiple large corpus texts; these large corpus texts are then built into a corpus text library for retrieval.
After the corpus text library is established, the corpus text in the library is segmented. Meanwhile, the initial word vector of each word is defined as W0 = {w1, ..., wn}, where W0 is the word vector and w1 to wn are the feature values of the word vector in n different dimensions, n being the word vector feature dimensionality set by word2vec.
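A minimal sketch of this initialization (the vocabulary, the dimensionality n, and the uniform range are illustrative assumptions; word2vec implementations commonly initialize within a small uniform range):

```python
import numpy as np

n = 100                      # word-vector feature dimensionality set by word2vec
vocab = ["apple", "banana"]  # vocabulary built from the segmented corpus

rng = np.random.default_rng(seed=42)
# W0 = {w1, ..., wn}: one randomly selected initial vector per vocabulary word.
initial_vectors = {w: rng.uniform(-0.5 / n, 0.5 / n, size=n) for w in vocab}
```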
Step 2: annotate the related words of each word according to the vocabulary. The annotation can follow existing dictionaries such as Dacilin and CCS, or can be done by hand.
Related words include synonyms, appositions, hypernyms, hyponyms, unrelated words, and so on. When establishing the connections between words, first, in accordance with the prior art, the relationships between words are established using the word relationships in existing corpus resources such as the "CCS" dictionary, "HowNet", and "Dacilin". Relying only on existing resources, however, the word-to-word relationships provided are incomplete. In this embodiment, we therefore construct word-to-word relationships in the following manner: all word relationships for Wi are denoted label(Wi, Wk), where i and k belong to {1, ..., n}. The word relationships are {synonym, apposition, hypernym, hyponym, unrelated word, unknown}. Unknown word relationships are labeled "unknown", and these word relationships do not participate in training.
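A minimal sketch of this labeling scheme (the enum values mirror the relationship set above; the sample pair reuses the "Zhang San"/"Li Si" synonym example given later in this description):

```python
from enum import Enum

class WordRelation(Enum):
    SYNONYM = 0
    APPOSITION = 1
    HYPERNYM = 2
    HYPONYM = 3
    UNRELATED = 4
    UNKNOWN = 5  # excluded from training

# label(Wi, Wk): annotated relationships between word pairs.
labels = {("Zhang San", "Li Si"): WordRelation.SYNONYM}

# Pairs labeled UNKNOWN do not participate in training.
training_pairs = {p: r for p, r in labels.items() if r is not WordRelation.UNKNOWN}
```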
Step 3: build the deep learning network structure.
The output word vector is computed using the word2vec neural network model, with the initial word vectors embedded through the CBOW model during the computation. Meanwhile, while the output word vector is computed, the relationship between that output word vector and the specified word vector is computed synchronously by the word relationship classification model.
Specifically, as shown in Fig. 2, using the CBOW model of word2vec, the input word vectors Wi-m to Wi+m are input into the word2vec neural network model, and the output word vector Wi is computed. Wi is then input into the hierarchical softmax of the Huffman model to compute the probability of the output vector; in accordance with the prior art, the probability output by the Huffman model is used to correct the neural network parameters and the output vector of the neural network model, making the output word vector obtained from the neural network model more accurate.
While the output word vector is computed, the relationship between the output word vector Wi and the specified word vector Wk is computed through the word relationship classification model.
Specifically, the word relationship classification model includes a sequentially connected input layer, concatenation layer, fully connected layer, and softmax probability layer.
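A minimal sketch of such a classifier (PyTorch; the class name and layer sizes are illustrative assumptions, and softmax is deferred to the loss function, as is idiomatic):

```python
import torch
import torch.nn as nn

class WordRelationClassifier(nn.Module):
    """Concatenation layer -> fully connected layer -> softmax probability layer."""

    def __init__(self, dim: int = 100, num_relations: int = 6):
        super().__init__()
        # Input: [Wi, Wk, Wi-Wk, Wi∘Wk] plus the scalar cos(Wi, Wk).
        self.fc = nn.Linear(4 * dim + 1, num_relations)

    def forward(self, wi: torch.Tensor, wk: torch.Tensor) -> torch.Tensor:
        cos = nn.functional.cosine_similarity(wi, wk, dim=-1)
        feats = torch.cat([wi, wk, wi - wk, wi * wk, cos.unsqueeze(-1)], dim=-1)
        return self.fc(feats)  # logits; softmax is applied in the loss
```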
Step 4: multi-task learning.
While multiple word vectors are input to word2vec, the specified word vector Wk is input to the word relationship classification model through its input layer. Then, while the neural network model outputs Wi, Wi and Wk pass through the concatenation layer: the two vectors are recombined into one row vector according to the basic mathematical formula, the recombined row vector being [Wi, Wk, Wi-Wk, Wi∘Wk, cos(Wi, Wk)]. This row vector is then remapped through the fully connected layer network, and finally the softmax classifier performs the word relationship classification and error computation, yielding the relationship between the two word vectors arranged according to the predetermined dimensions.
Suppose word2vec selects CBOW and the CBOW window is 2m+1. [Wi-m, ..., Wi+m] is the vectorized corpus data of one window, excluding Wi. Wk is a related word of Wi, i.e., the specified word vector, and the relationship of the two word vectors is expressed as label(Wi, Wk); this variable represents the relationship between Wi and Wk. In this embodiment, label(Wi, Wk) equals the similarity probabilities computed for each of {synonym, apposition, hypernym, hyponym, unrelated word, unknown}, and label(Wi, Wk) is computed by the word relationship classification model.
This method jointly trains the output of the neural network model and the classifier model against the annotations; the losses of the two models, expressed in logarithmic probability form, are added to obtain the loss function of the whole network, as follows:
Loss = log P(Wi | Wi-m, ..., Wi+m) + s * log P(label(Wi, Wk) | Wi, Wk)
where s is a preset coefficient, for example s = 0.5.
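A minimal sketch of this joint loss (PyTorch; a flat softmax over the vocabulary is used as an illustrative stand-in for the Huffman-tree factorization of P(Wi | Wi-m, ..., Wi+m) described above):

```python
import torch
import torch.nn.functional as F

s = 0.5  # preset coefficient from the embodiment

def joint_loss(word_logits, target_word_id, relation_logits, relation_id):
    """log P(Wi | context) + s * log P(label(Wi, Wk) | Wi, Wk), negated for minimization."""
    log_p_word = -F.cross_entropy(word_logits, target_word_id)
    log_p_rel = -F.cross_entropy(relation_logits, relation_id)
    return -(log_p_word + s * log_p_rel)
```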
After the loss function is obtained, the network parameters are learned from it using the mechanism of neural network error backpropagation, where the network parameters are those contained in the neural network; by continuously correcting the network parameters through the loss function, the output word vectors obtained by the neural network model become more accurate. Meanwhile, the fully connected layer parameters in the classifier model are learned using the error backpropagation mechanism, so that they are continuously trained and optimized during the stepwise computation of word vector relationships, enabling the final word relationship classification model to accurately compute the relationship between two word vectors.
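A minimal joint-training step under the same illustrative assumptions (autograd supplies the error backpropagation mechanism; CBOW, WordRelationClassifier, and joint_loss are the sketches above, and the learning rate is an assumption):

```python
import torch

cbow, rel = CBOW(), WordRelationClassifier()
opt = torch.optim.SGD(list(cbow.parameters()) + list(rel.parameters()), lr=0.025)

def train_step(context_ids, target_id, wk_id, relation_id):
    wi = cbow.embed(context_ids).mean(dim=1)  # treat the hidden vector as Wi
    word_logits = cbow.out(wi)
    rel_logits = rel(wi, cbow.embed(wk_id))
    loss = joint_loss(word_logits, target_id, rel_logits, relation_id)
    opt.zero_grad()
    loss.backward()  # one backward pass updates both sub-models
    opt.step()
    return loss.item()
```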
Step 5: update the network parameters and the fully connected layer parameters to obtain the optimized deep neural network model, i.e., the updated word2vec neural network model and word relationship classification model.
In concrete application, the word vectors adjacent to a randomly input word vector are passed through the neural network model to obtain the output word vector Wi; at the same time, the specified word vector Wk is input, and the relationship label(Wi, Wk) between the word vector Wi and the word vector Wk is obtained through the iteratively trained classifier model.
Through the above steps, after training is complete, not only can the word vector corresponding to a word be obtained, but the relationship between that word vector and the specified word vector can also be computed by the classifier model.
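Putting the pieces together, a usage sketch under the same illustrative assumptions (all identifiers come from the sketches above, and the token ids are arbitrary):

```python
import torch

cbow = CBOW()
relation_model = WordRelationClassifier()

context_ids = torch.tensor([[3, 17, 42, 8]])  # [Wi-m, ..., Wi+m], excluding Wi
wi = cbow.embed(context_ids).mean(dim=1)      # output word vector Wi (illustrative)
wk = cbow.embed(torch.tensor([256]))          # specified word vector Wk

probs = torch.softmax(relation_model(wi, wk), dim=-1)
print(probs)  # P(label(Wi, Wk)) over {synonym, ..., unknown}
```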
On top of the existing word2vec, this method adds a classifier of word-to-word relationships and uses the multi-task learning mechanism of neural networks to learn word vectors and word-to-word relationships simultaneously; while the CBOW word vector model is being learned, the relationship of each word vector to other word vectors is predicted and defined through the word relationship classification model. As shown in Fig. 2, the method embodies the joint training of the outputs of two networks against the annotations: the left network is the word2vec CBOW network based on a Huffman tree, and the right network is the word relationship classifier network. The logarithmic probability forms of the losses of the two networks are added to serve as the loss function of the network. After training is complete, not only can the word vector corresponding to a word be obtained, but the word relationship of two words can also be predicted. Such word relationships play a very important role in several technical fields of natural language processing, such as text similarity computation and information retrieval.
In addition, telling the neural network the prior knowledge of words during training helps eliminate insufficient learning of low-frequency words. For example, "Zhang San" and "Li Si" are synonyms. "Li Si" appears many times in the training text and can be considered sufficiently trained, whereas "Zhang San" appears rarely and, under traditional word2vec, cannot be sufficiently trained at all. In the network of the present invention, when "Zhang San" is trained, the word-to-word classifier network and the word vector of "Li Si" can update the "Zhang San" word vector through the error backpropagation mechanism, so the network of the invention helps eliminate the insufficient learning of low-frequency words.
Similarly, by telling the neural network the prior knowledge of words during training, the word-to-word classifier network strengthens the distinction and connection between two input word vectors that have a prior relationship, overcoming the deficiency of the original word2vec network model in which word vectors rely only on the mechanism related to the surrounding text.
What has been described above is only an embodiment of the present invention; common knowledge such as well-known specific structures and characteristics is not described at excessive length in the scheme. A person of ordinary skill in the art knows all the ordinary technical knowledge in the technical field to which the invention belongs before the filing date or the priority date, can know all of the prior art in the field, and has the ability to apply routine experimental means before that date; under the enlightenment provided by this application, a person of ordinary skill in the art can improve and implement this scheme in combination with their own abilities, and some typical known structures or known methods should not become obstacles to implementing this application. It should be pointed out that, for those skilled in the art, several modifications and improvements can also be made without departing from the structure of the invention, and these should also be regarded as within the protection scope of the present invention; they will not affect the effect of implementing the invention or the practicability of the patent. The scope of protection claimed by this application shall be based on the content of the claims, and the records in the specification, such as the specific embodiments, can be used to interpret the content of the claims.

Claims (10)

1. A supervised word vector learning method, characterized by comprising the following steps:
step 1: building a deep learning network model by adding a word relationship classification model on top of a word2vec neural network model;
step 2: inputting several adjacent input word vectors together with a certain specified word vector into the deep learning network model and performing multi-task learning;
step 3: repeating step 2 and iterating the computation to obtain the optimized word2vec neural network model and word relationship classification model.
2. The supervised word vector learning method according to claim 1, characterized in that: before step 1, the corpus text is segmented into words, and a vocabulary and initial word vectors corresponding to the vocabulary are established.
3. The supervised word vector learning method according to claim 2, characterized in that: according to the vocabulary, the relationships between the word vectors in the corpus text are annotated.
4. The supervised word vector learning method according to claim 1, characterized in that: in step 1, the word relationship classification model includes a sequentially connected input layer, concatenation layer, fully connected layer, and probability layer; the concatenation layer concatenates the output vector Wi computed by the word2vec neural network model and the specified vector Wk input to the word relationship classification model according to the following formula: [Wi, Wk, Wi-Wk, Wi∘Wk, cos(Wi, Wk)].
5. The supervised word vector learning method according to claim 2, characterized in that: in step 2, the input word vectors and the specified word vector are defined from the initial word vectors.
6. The supervised word vector learning method according to claim 5, characterized in that: in step 2, the continuous bag-of-words model is used, and the multiple word vectors adjacent to the output word vector serve as the input word vectors of the word2vec neural network model.
7. The supervised word vector learning method according to claim 4, characterized in that: in step 2, when performing multi-task learning, while the word2vec neural network model computes the output vector Wi, the word relationship classification model computes the relationship label(Wi, Wk) of Wi and Wk.
8. The supervised word vector learning method according to claim 1, characterized in that: in step 2, the word2vec neural network optimizes the neural network parameters through the error backpropagation mechanism, the error including the classification error of the Huffman tree and the word relationship classification error.
9. The supervised word vector learning method according to claim 1, characterized in that: in step 2, the word relationship classification model optimizes the fully connected layer parameters through the neural network error backpropagation mechanism.
10. The supervised word vector learning method according to claim 1, characterized in that: in step 3, multiple randomly selected input vectors and a specified vector are respectively input into the word2vec neural network model and the word relationship classification model, and an output word vector together with the relationship between that output word vector and the specified word vector is computed.
CN201811075603.2A 2018-09-14 2018-09-14 Supervised word vector learning method Active CN109271632B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811075603.2A CN109271632B (en) 2018-09-14 2018-09-14 Supervised word vector learning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811075603.2A CN109271632B (en) 2018-09-14 2018-09-14 Supervised word vector learning method

Publications (2)

Publication Number Publication Date
CN109271632A 2019-01-25
CN109271632B CN109271632B (en) 2023-05-26

Family

ID=65188340

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811075603.2A Active CN109271632B (en) 2018-09-14 2018-09-14 Supervised word vector learning method

Country Status (1)

Country Link
CN (1) CN109271632B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110825875A (en) * 2019-11-01 2020-02-21 科大讯飞股份有限公司 Text entity type identification method and device, electronic equipment and storage medium
CN110852077A (en) * 2019-11-13 2020-02-28 泰康保险集团股份有限公司 Method, device, medium and electronic equipment for dynamically adjusting Word2Vec model dictionary
CN111444346A (en) * 2020-03-31 2020-07-24 广州大学 Word vector confrontation sample generation method and device for text classification
CN112989032A (en) * 2019-12-17 2021-06-18 医渡云(北京)技术有限公司 Entity relationship classification method, apparatus, medium and electronic device

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105975478A (en) * 2016-04-09 2016-09-28 北京交通大学 Word vector analysis-based online article belonging event detection method and device
CN106557462A (en) * 2016-11-02 2017-04-05 数库(上海)科技有限公司 Name entity recognition method and system
CN106682220A (en) * 2017-01-04 2017-05-17 华南理工大学 Online traditional Chinese medicine text named entity identifying method based on deep learning
US20170139899A1 (en) * 2015-11-18 2017-05-18 Le Holdings (Beijing) Co., Ltd. Keyword extraction method and electronic device
CN106933806A (en) * 2017-03-15 2017-07-07 北京大数医达科技有限公司 The determination method and apparatus of medical synonym
CN107145503A (en) * 2017-03-20 2017-09-08 中国农业大学 Remote supervision non-categorical relation extracting method and system based on word2vec
CN107247780A (en) * 2017-06-12 2017-10-13 北京理工大学 A kind of patent document method for measuring similarity of knowledge based body
CN107291693A (en) * 2017-06-15 2017-10-24 广州赫炎大数据科技有限公司 A kind of semantic computation method for improving term vector model
CN107895000A (en) * 2017-10-30 2018-04-10 昆明理工大学 A kind of cross-cutting semantic information retrieval method based on convolutional neural networks
CN107895051A (en) * 2017-12-08 2018-04-10 宏谷信息科技(珠海)有限公司 A kind of stock news quantization method and system based on artificial intelligence
CN108280058A (en) * 2018-01-02 2018-07-13 中国科学院自动化研究所 Relation extraction method and apparatus based on intensified learning
US10037362B1 (en) * 2017-07-24 2018-07-31 International Business Machines Corpoation Mining procedure dialogs from source content
CN108388654A (en) * 2018-03-01 2018-08-10 合肥工业大学 A kind of sensibility classification method based on turnover sentence semantic chunk partition mechanism
CN108388914A (en) * 2018-02-26 2018-08-10 中译语通科技股份有限公司 A kind of grader construction method, grader based on semantic computation

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170139899A1 (en) * 2015-11-18 2017-05-18 Le Holdings (Beijing) Co., Ltd. Keyword extraction method and electronic device
CN105975478A (en) * 2016-04-09 2016-09-28 北京交通大学 Word vector analysis-based online article belonging event detection method and device
CN106557462A (en) * 2016-11-02 2017-04-05 数库(上海)科技有限公司 Name entity recognition method and system
CN106682220A (en) * 2017-01-04 2017-05-17 华南理工大学 Online traditional Chinese medicine text named entity identifying method based on deep learning
CN106933806A (en) * 2017-03-15 2017-07-07 北京大数医达科技有限公司 The determination method and apparatus of medical synonym
CN107145503A (en) * 2017-03-20 2017-09-08 中国农业大学 Remote supervision non-categorical relation extracting method and system based on word2vec
CN107247780A (en) * 2017-06-12 2017-10-13 北京理工大学 A kind of patent document method for measuring similarity of knowledge based body
CN107291693A (en) * 2017-06-15 2017-10-24 广州赫炎大数据科技有限公司 A kind of semantic computation method for improving term vector model
US10037362B1 (en) * 2017-07-24 2018-07-31 International Business Machines Corpoation Mining procedure dialogs from source content
CN107895000A (en) * 2017-10-30 2018-04-10 昆明理工大学 A kind of cross-cutting semantic information retrieval method based on convolutional neural networks
CN107895051A (en) * 2017-12-08 2018-04-10 宏谷信息科技(珠海)有限公司 A kind of stock news quantization method and system based on artificial intelligence
CN108280058A (en) * 2018-01-02 2018-07-13 中国科学院自动化研究所 Relation extraction method and apparatus based on intensified learning
CN108388914A (en) * 2018-02-26 2018-08-10 中译语通科技股份有限公司 A kind of grader construction method, grader based on semantic computation
CN108388654A (en) * 2018-03-01 2018-08-10 合肥工业大学 A kind of sensibility classification method based on turnover sentence semantic chunk partition mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ALEXIS CONNEAU ET AL: "Supervised Learning of Universal Sentence Representations from Natural Language Inference Data", Computer Science > Computation and Language *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110825875A (en) * 2019-11-01 2020-02-21 科大讯飞股份有限公司 Text entity type identification method and device, electronic equipment and storage medium
CN110825875B (en) * 2019-11-01 2022-12-06 科大讯飞股份有限公司 Text entity type identification method and device, electronic equipment and storage medium
CN110852077A (en) * 2019-11-13 2020-02-28 泰康保险集团股份有限公司 Method, device, medium and electronic equipment for dynamically adjusting Word2Vec model dictionary
CN110852077B (en) * 2019-11-13 2023-03-31 泰康保险集团股份有限公司 Method, device, medium and electronic equipment for dynamically adjusting Word2Vec model dictionary
CN112989032A (en) * 2019-12-17 2021-06-18 医渡云(北京)技术有限公司 Entity relationship classification method, apparatus, medium and electronic device
CN111444346A (en) * 2020-03-31 2020-07-24 广州大学 Word vector confrontation sample generation method and device for text classification
CN111444346B (en) * 2020-03-31 2023-04-18 广州大学 Word vector confrontation sample generation method and device for text classification

Also Published As

Publication number Publication date
CN109271632B (en) 2023-05-26

Similar Documents

Publication Publication Date Title
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN109992783B (en) Chinese word vector modeling method
CN112163426B (en) Relationship extraction method based on combination of attention mechanism and graph long-time memory neural network
CN111241294B (en) Relationship extraction method of graph convolution network based on dependency analysis and keywords
CN110377903B (en) Sentence-level entity and relation combined extraction method
CN109271632A (en) A kind of term vector learning method of supervision
CN106326212B (en) A kind of implicit chapter relationship analysis method based on level deep semantic
CN109657239A (en) The Chinese name entity recognition method learnt based on attention mechanism and language model
CN110222163A (en) A kind of intelligent answer method and system merging CNN and two-way LSTM
CN109344399A (en) A kind of Text similarity computing method based on the two-way lstm neural network of stacking
CN110555084B (en) Remote supervision relation classification method based on PCNN and multi-layer attention
CN107273426B (en) A kind of short text clustering method based on deep semantic route searching
CN111177383B (en) Text entity relation automatic classification method integrating text grammar structure and semantic information
CN111522965A (en) Question-answering method and system for entity relationship extraction based on transfer learning
CN104615767A (en) Searching-ranking model training method and device and search processing method
CN109933792B (en) Viewpoint type problem reading and understanding method based on multilayer bidirectional LSTM and verification model
CN108647191A (en) It is a kind of based on have supervision emotion text and term vector sentiment dictionary construction method
CN110866542A (en) Depth representation learning method based on feature controllable fusion
CN111241303A (en) Remote supervision relation extraction method for large-scale unstructured text data
Chen et al. Deep neural networks for multi-class sentiment classification
CN114417851A (en) Emotion analysis method based on keyword weighted information
CN111428481A (en) Entity relation extraction method based on deep learning
CN103678318A (en) Multi-word unit extraction method and equipment and artificial neural network training method and equipment
CN110569355B (en) Viewpoint target extraction and target emotion classification combined method and system based on word blocks
CN114048314A (en) Natural language steganalysis method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Qin Hong Hui

Inventor after: Du Ruo

Inventor after: Xiang Hai

Inventor after: Hou Cong

Inventor after: Liu Ke

Inventor before: Qin Hong Hui

CB03 Change of inventor or designer information
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231127

Address after: Room 1208-208, 12th Floor, Building 2, Fuhai Center, Daliushu, Haidian District, Beijing, 100081

Patentee after: Beijing Zhicheng Excellence Technology Co.,Ltd.

Address before: 401120 No. 1, Floor 3, Building 11, Internet Industrial Park, No. 106, West Section of Jinkai Avenue, Yubei District, Chongqing

Patentee before: CHONGQING XIEZHI TECHNOLOGY Co.,Ltd.

TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20231220

Address after: No. 1-0286, Juhe 6th Street, Jufuyuan Industrial Park, Tongzhou Economic Development Zone, Beijing, 101127

Patentee after: Beijing Star Cube Digital Technology Co.,Ltd.

Address before: Room 1208-208, 12th Floor, Building 2, Fuhai Center, Daliushu, Haidian District, Beijing, 100081

Patentee before: Beijing Zhicheng Excellence Technology Co.,Ltd.

TR01 Transfer of patent right