CN109271632B - Supervised word vector learning method - Google Patents

Supervised word vector learning method

Info

Publication number
CN109271632B
CN109271632B CN201811075603.2A CN201811075603A CN109271632B CN 109271632 B CN109271632 B CN 109271632B CN 201811075603 A CN201811075603 A CN 201811075603A CN 109271632 B CN109271632 B CN 109271632B
Authority
CN
China
Prior art keywords
word
word vector
vector
neural network
relation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811075603.2A
Other languages
Chinese (zh)
Other versions
CN109271632A (en)
Inventor
覃勋辉
杜若
向海
侯聪
刘科
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Star Cube Digital Technology Co ltd
Original Assignee
Chongqing Xiezhi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Xiezhi Technology Co ltd filed Critical Chongqing Xiezhi Technology Co ltd
Priority to CN201811075603.2A priority Critical patent/CN109271632B/en
Publication of CN109271632A publication Critical patent/CN109271632A/en
Application granted granted Critical
Publication of CN109271632B publication Critical patent/CN109271632B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a supervised word vector learning method, which relates to the field of natural language processing methods and comprises the following steps: step one, a word relation classification model is added on the basis of a word2vec neural network model, and a deep learning network model is built; step two, a plurality of adjacent input word vectors and a certain specified word vector are input into the deep learning network model for multitask learning; and step three, step two is repeated and iterative computation is performed to obtain an optimized word2vec neural network model and word relation classification model. The method can calculate word vectors and also obtain the relationship between a word vector and a specified word vector.

Description

Supervised word vector learning method
Technical Field
The invention relates to the field of natural language processing methods, in particular to a supervised word vector learning method.
Background
Word vector (word embedding), the vector representation of words, is a common operation in natural language processing and a common basic technology behind internet services such as search engines, advertising systems and recommendation systems.
A word vector can be understood simply as a vectorized word: the word is abstracted into a mathematical description of an entity. For example, the word "apple" may be expressed as [0.4, 0.5, 0.9, ...] and "banana" as [0.3, 0.8, 0.1, ...]; different dimensions of the vector characterize different features and represent different semantics.
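As an informal illustration of this idea (the numeric values and the use of cosine similarity below are illustrative assumptions, not taken from the patent), a word vector is simply an array of feature values, and the closeness of two words can be measured directly on those arrays:

```python
# Minimal sketch: word vectors as arrays of feature values, with cosine
# similarity as one possible closeness measure. The values are made up.
import numpy as np

word_vectors = {
    "apple":  np.array([0.4, 0.5, 0.9]),
    "banana": np.array([0.3, 0.8, 0.1]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two word vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(word_vectors["apple"], word_vectors["banana"]))
```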
Natural language processing (NLP) is a branch discipline of artificial intelligence and linguistics. The field studies how to process and use natural language: letting a computer "understand" human language, converting computer data into natural language, and converting natural language into a form that is easier for a computer program to process.
Natural language processing now includes many approaches, and word2vec is a family of models that is now relatively common for natural language processing. Word2vec relies on skip-gram or continuous bag-of-words (CBOW) architectures to build neural word embeddings, obtaining word vectors with a neural network model. Compared with skip-gram, CBOW better matches the everyday need of exchanging natural language with machine language.
Although word2vec can perform natural language processing, word ambiguity and mis-learned words often occur, because word2vec has no supervision mechanism: it only considers the relationship between a word and its surrounding words, so when the surrounding words of two synonyms differ, the word vectors trained for the two synonyms also differ. Among the word vectors learned by word2vec on a large corpus, the words at a small distance from a given word in the word vector space include synonyms, co-located words, contextual words, related words and so on, but word2vec does not distinguish between these relationships. Many NLP tasks require such word-to-word relationships, yet the word vectors obtained by existing learning methods do not provide them.
Disclosure of Invention
The invention aims to provide a supervised word vector learning method which not only can obtain word vectors corresponding to natural language, but also can predict the relationship between two word vectors.
The supervised word vector learning method in the scheme comprises the following steps:
step one, a word relation classification model is added on the basis of a word2vec neural network model, and a deep learning network model is built;
step two, inputting a plurality of adjacent input word vectors and a certain specified word vector into the deep learning network model for multitask learning;
and step three, repeating step two and performing iterative computation to obtain an optimized word2vec neural network model and word relation classification model.
The invention has the advantages that:
the invention provides a supervised word vector generation method based on word and word relation. According to the method, a word relation classification model for calculating word and word relation is added on the basis of the existing word2vec, and a neural network multitask learning mechanism is adopted to learn word vectors and word relation simultaneously. After training is completed, not only the word vector corresponding to the word can be obtained, but also the word relationship of the two words can be predicted. The word relation has very important roles in a plurality of technical fields such as text similarity calculation, information retrieval and the like of natural language.
In addition, prior knowledge about words is imparted to the neural network during training, which eliminates insufficient learning of low-frequency words.
Further, before the first step, the corpus text is segmented, and a word list and an initial word vector corresponding to the word list are established.
A word list and initial word vectors are established from the collected corpus to perform initial training of the newly built deep learning network model.
Further, before the first step, the relationships between the word vectors in the corpus text are labeled for each word vector according to the word list.
Through the word vectors labeled with relationships, the output vector and the word relation of the deep learning network model can be learned and corrected by back propagation, so that the parameters of the word2vec neural network model and the word relation classification model within the deep learning network model are optimized.
Further, crawlers are adopted to collect corpus texts from the internet and from corpus books.
The corpus texts in corpus books are complete but not up to date; by using crawlers to collect web expressions from the internet as a supplement to the corpus texts in existing corpus books, the established word list and initial word vectors can reflect the language characteristics of the present day.
Further, in the first step, the word relation classification model includes an input layer, a splicing layer, a fully connected layer and a probability layer which are sequentially connected; the splicing layer splices the output vector Wi obtained through word2vec neural network model calculation and the specified vector Wk input into the word relation classification model according to the following formula: [Wi, Wk, Wi-Wk, Wi∘Wk, cos(Wi, Wk)].
Through the word relation classification model, the relationships between the initial word vectors are labeled correspondingly, which makes it convenient to calculate the relationships jointly during training.
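A minimal sketch of the splicing layer described above, assuming Wi and Wk are NumPy arrays of the same length and reading "Wi∘Wk" as the element-wise product (an assumption; the patent does not spell out the operator):

```python
import numpy as np

def splice(wi: np.ndarray, wk: np.ndarray) -> np.ndarray:
    """Build the spliced row vector [Wi, Wk, Wi-Wk, Wi∘Wk, cos(Wi, Wk)]."""
    cos = float(np.dot(wi, wk) / (np.linalg.norm(wi) * np.linalg.norm(wk)))
    return np.concatenate([wi, wk, wi - wk, wi * wk, [cos]])

wi, wk = np.random.randn(100), np.random.randn(100)
features = splice(wi, wk)   # length 4*100 + 1 = 401, fed to the fully connected layer
```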
Further, in step two, the input word vectors and the specified word vector are defined from the initial word vectors.
All word vectors are initialized as vectors of the same specified length.
Further, in the second step, the continuous bag-of-words model is adopted to input a plurality of word vectors adjacent to the output word vector into the word2vec neural network model as input word vectors.
The continuous bag-of-words model is the main model used for natural language processing in word2vec, but the word vectors within each bag have no relation correspondence with one another, so it is difficult for the finally calculated word vectors to establish accurate relationships with other word vectors. The invention effectively solves this problem by adding a word relation classification model. Because the neural network model is combined with the continuous bag-of-words model, the number of layers and the number of iterations can be greatly reduced, the amount of calculation is reduced, and natural language can be processed into standard word vectors more quickly for subsequent applications.
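The following is a minimal sketch of how a CBOW-style input can be formed; the averaging of the context vectors and all names are illustrative assumptions rather than the patent's exact implementation:

```python
import numpy as np

def cbow_context_vector(embeddings: np.ndarray, token_ids: list, i: int, m: int) -> np.ndarray:
    """Average the vectors of the m words on each side of position i (the 2m neighbours)."""
    context_ids = token_ids[max(0, i - m):i] + token_ids[i + 1:i + 1 + m]
    return embeddings[context_ids].mean(axis=0)

# embeddings: (vocab_size, n) matrix of word vectors; token_ids: one segmented sentence as ids.
```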
Further, in step two, when multitask learning is performed, the word2vec neural network model calculates the output vector Wi while the word relation classification model calculates the relation label(Wk, Wi) of Wi and Wk.
The word2vec neural network is trained using the initial vectors while the word relation classification model is trained at the same time; the trained deep learning network model can obtain the relation label(Wk, Wi) of Wi and Wk at the same time as the output word vector Wi.
Further, in the second step, the word2vec neural network optimizes the neural network parameters through an error back propagation mechanism, and the errors include the classification error of the Huffman tree and the word relation classification error.
The calculated output vector Wi and the word2vec neural network model are thereby optimized.
Further, in the second step, the word relation classification model optimizes the fully connected layer parameters through the neural network error back propagation mechanism.
The relationships calculated by the word relation classification model are compared with the word vectors whose relationships have been labeled, and the fully connected layer parameters are corrected and updated accordingly, so that the calculated label(Wk, Wi) and the word relation classification model are optimized.
In step three, a plurality of randomly selected input vectors and specified vectors are respectively input into the word2vec neural network model and the word relation classification model, and the output word vector and the relationship between the output word vector and the specified word vector are obtained through calculation.
After the word2vec neural network model and the word relation classification model have been trained through multiple iterations, the relationship between the output word vector and the specified word vector can be obtained synchronously with the output word vector when the word2vec neural network model is used.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention.
Fig. 2 is a diagram of an operation framework according to an embodiment of the present invention.
Detailed Description
The following is a further detailed description of the embodiments:
An embodiment is substantially as shown in Figure 1. The supervised word vector learning method in this embodiment comprises the following steps:
First, a corpus text library is established and word segmentation is performed on the corpus text; segmentation can use existing tools such as LTP or jieba, or even manual word segmentation. After segmentation, a word list is established, which is the set formed by the words, and initial word vectors are selected randomly.
When the corpus text library is built, corpus texts are collected from existing Chinese lexical resources such as the CCS dictionary, HowNet and the word forest (Cilin), and from the internet by crawlers, to form several large corpus texts, which are assembled into a corpus text library.
After the corpus text library is established, word segmentation is performed on the corpus texts in the library. Meanwhile, the initial word vector of each word is defined as W0 = {w1, ..., wn}, where W0 is the word vector, w1 to wn are the feature values of the word vector in n different dimensions, and n is the word vector feature dimension set by word2vec.
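A minimal sketch of this preparatory step, assuming random uniform initialization and n = 100 (both are illustrative choices; the patent only states that the initial word vectors are selected randomly):

```python
import numpy as np

def build_vocab_and_vectors(segmented_corpus, n: int = 100, seed: int = 0):
    """Build the word list and one random initial vector W0 = {w1, ..., wn} per word."""
    rng = np.random.default_rng(seed)
    vocab = sorted({word for sentence in segmented_corpus for word in sentence})
    word2id = {word: idx for idx, word in enumerate(vocab)}
    init_vectors = rng.uniform(-0.5 / n, 0.5 / n, size=(len(vocab), n))
    return word2id, init_vectors

corpus = [["苹果", "是", "水果"], ["香蕉", "是", "水果"]]   # already segmented sentences
word2id, init_vectors = build_vocab_and_vectors(corpus)
```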
In the second step, relation words are labeled for each word according to the word list; labeling can be done according to an existing dictionary, such as the word forest (Cilin) or the CCS dictionary, or manually.
The relation words include synonyms, co-located words, hypernyms, hyponyms, irrelevant words and so on. When word-word relationships are established, the word relationships in existing corpus books, such as the CCS dictionary, HowNet and the word forest (Cilin), are adopted first. If only existing corpus books are used, the word-word relationships they provide are incomplete, so in this embodiment word relationships are also constructed in the following way: all word relationships for Wi are denoted by label(Wi, Wk), with i and k belonging to {1, ..., N}. The word relations are {synonym, co-located word, hypernym, hyponym, irrelevant word, unknown}. Unknown word relationships are labeled "unknown", and these word relationships do not participate in training.
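A minimal sketch of how the labelled pairs might be stored, with pairs whose relation is "unknown" excluded from training; the dictionary layout and the example pairs are illustrative assumptions:

```python
RELATIONS = ["synonym", "co-located word", "hypernym", "hyponym", "irrelevant word", "unknown"]

# Relations gathered from corpus books (CCS dictionary, HowNet, the word forest) or manual labelling.
relation_labels = {
    ("苹果", "香蕉"): "co-located word",   # apple / banana: co-located words (co-hyponyms)
    ("苹果", "水果"): "hyponym",            # apple is a hyponym of fruit
}

def label(wi: str, wk: str) -> str:
    """label(Wi, Wk): the relation of the pair, or 'unknown' if it was never labelled."""
    return relation_labels.get((wi, wk), "unknown")

# Only labelled pairs participate in training; "unknown" pairs are skipped.
training_pairs = [(a, b, RELATIONS.index(r)) for (a, b), r in relation_labels.items() if r != "unknown"]
```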
Thirdly, building a deep learning network structure.
The output word vector is calculated by the word2vec neural network model, and the initial word vectors are embedded using the CBOW model during the calculation. Meanwhile, while the output word vector is calculated, the word relation classification model synchronously calculates the relationship between the output word vector and the specified word vector.
Specifically, as shown in Fig. 2, the input word vectors Wi-m to Wi+m are input into the word2vec neural network model using the CBOW model of word2vec, and the output word vector Wi is calculated. Wi is then input into the Huffman-tree model (hierarchical softmax) to calculate the probability of the output vector, and the neural network parameters and the output vector of the neural network model are corrected, in the existing manner, according to the probability output by the Huffman model, so that the output word vector obtained from the neural network model is more accurate.
While the output word vector is calculated, the relationship between the output word vector Wi and the specified word vector Wk is calculated by the word relation classification model.
Specifically, the word relation classification model comprises an input layer, a splicing layer, a fully connected layer (full connected layer) and a probability layer (softmax), which are sequentially connected.
Fourth, multitask learning.
While a plurality of word vectors are input into word2vec, the specified word vector Wk is input into the word relation classification model through the input layer of the word relation model. When the neural network model outputs Wi, Wi and Wk are fed into the splicing layer, where the two vectors are recombined according to a basic mathematical formula into a row vector; the recombined row vector is [Wi, Wk, Wi-Wk, Wi∘Wk, cos(Wi, Wk)]. The recombined row vector is then remapped through the network of the fully connected layer, and finally word relation classification and error calculation are realized through the softmax classifier, yielding the relationship between the two word vectors arranged according to the preset dimensions.
Assume that word2vec uses CBOW and the CBOW window is 2m+1. Wi-m, ..., Wi+m are the vectorized corpus data of the window other than Wi. Wk is a relation word of Wi, i.e. the specified word vector, and the relationship between these two word vectors is denoted label(Wi, Wk); this variable represents the relationship of Wi and Wk. In this embodiment, label(Wi, Wk) is equal to the similarity probability calculated for each feature dimension in {synonym, co-located word, hypernym, hyponym, irrelevant word, unknown}, and is calculated by the word relation classification model.
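The following PyTorch sketch shows one way the two branches could be wired together: a CBOW branch that predicts the centre word Wi from its 2m neighbours, and a relation branch that classifies the spliced pair (Wi, Wk). For simplicity it uses an ordinary softmax over the vocabulary instead of the Huffman-tree hierarchical softmax named in the patent, and all layer sizes and names are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SupervisedWord2Vec(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 100, num_relations: int = 6):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, dim)            # the word vectors being learned
        self.word_output = nn.Linear(dim, vocab_size)              # stand-in for the hierarchical softmax
        self.relation_fc = nn.Linear(4 * dim + 1, num_relations)   # fully connected layer of the classifier

    def forward(self, context_ids, center_ids, specified_ids):
        # CBOW branch: average the 2m context vectors and score every word in the vocabulary.
        context_mean = self.embeddings(context_ids).mean(dim=1)
        word_logits = self.word_output(context_mean)

        # Relation branch: splice [Wi, Wk, Wi-Wk, Wi∘Wk, cos(Wi, Wk)] and classify with softmax.
        wi = self.embeddings(center_ids)
        wk = self.embeddings(specified_ids)
        cos = F.cosine_similarity(wi, wk, dim=-1).unsqueeze(-1)
        spliced = torch.cat([wi, wk, wi - wk, wi * wk, cos], dim=-1)
        relation_logits = self.relation_fc(spliced)
        return word_logits, relation_logits
```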
The method adopts joint training of the outputs and labels of the neural network model and the classifier model; the losses of the two models are expressed in log-probability form and added to obtain the loss function of the whole network, which is as follows:
Loss=logP(Wi|Wi-m,...,Wi+m)+s*logP(label(Wi,Wk)|Wi,Wk)
where s is a predetermined coefficient, for example, s=0.5.
After the loss function is obtained, a neural network error back propagation mechanism is adopted to learn the network parameters using the loss function; the network parameters belong to the neural network itself and are continuously corrected by the loss function, so that the output word vector obtained by the neural network model becomes more accurate. Meanwhile, the fully connected layer parameters in the classifier model are learned using the error back propagation mechanism, so that they are continuously trained and optimized as the word vector relationships are gradually calculated, and the finally obtained word relation classification model can accurately calculate the relationship between two word vectors.
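Continuing the SupervisedWord2Vec sketch above, one training step under the joint loss could look as follows; cross-entropy is the negative of the log-probabilities in the formula, so minimising it maximises the stated objective, and the SGD optimiser and learning rate are illustrative assumptions (only s = 0.5 comes from the description):

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, context_ids, center_ids, specified_ids, relation_ids, s: float = 0.5):
    word_logits, relation_logits = model(context_ids, center_ids, specified_ids)
    word_loss = F.cross_entropy(word_logits, center_ids)            # -logP(Wi | Wi-m, ..., Wi+m)
    relation_loss = F.cross_entropy(relation_logits, relation_ids)  # -logP(label(Wi, Wk) | Wi, Wk)
    loss = word_loss + s * relation_loss
    optimizer.zero_grad()
    loss.backward()   # error back propagation updates the embeddings, CBOW output and fully connected layer
    optimizer.step()
    return float(loss)

# model = SupervisedWord2Vec(vocab_size=len(word2id))
# optimizer = torch.optim.SGD(model.parameters(), lr=0.025)
```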
Fifth, the network parameters and the fully connected layer parameters are updated to obtain the optimized deep neural network model, namely the updated word2vec neural network model and word relation classification model.
In a specific application, word vectors adjacent to a certain word vector are input, and the output word vector Wi is obtained through the neural network model; meanwhile, the specified word vector Wk is input, and the relationship label(Wi, Wk) between the word vector Wi and the related Wk is obtained through iterative calculation of the classifier model.
Through the above steps, after training is completed, not only can the word vector corresponding to a word be obtained, but the relationship between that word vector and a specified word vector can also be calculated by the classifier model.
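A minimal sketch of this use of the trained network, continuing the sketches above (all names are assumptions): the embedding of the word gives its word vector Wi, and the relation branch gives the predicted relation to the specified word Wk:

```python
import torch

@torch.no_grad()
def predict(model, context_ids, center_id, specified_id, relations):
    """Return the learned word vector Wi and the predicted relation label(Wi, Wk)."""
    _, relation_logits = model(context_ids, center_id, specified_id)
    wi = model.embeddings(center_id)                 # the output word vector Wi
    predicted = relations[int(relation_logits[0].argmax())]
    return wi, predicted

# wi, rel = predict(model, context_ids, center_id, specified_id, RELATIONS)
```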
According to the method, a word-word relation classifier is added on the basis of the existing word2vec, and a neural network multitask learning mechanism is adopted to learn word vectors and word relationships simultaneously; in the learning process of the CBOW word vector model, the relationship between each word vector and other word vectors is predicted and defined by the word relation classification model. As shown in Fig. 2, the method trains the outputs and labels of two networks jointly: the left network is a word2vec CBOW network based on a Huffman tree, and the right network is a word relation classification network. The losses of the left and right networks are expressed in log-probability form and added as the loss function of the whole network. After training is completed, not only can the word vector corresponding to a word be obtained, but the word relationship of two words can also be predicted. Word relationships play a very important role in many technical fields of natural language processing, such as text similarity calculation and information retrieval.
In addition, prior knowledge about words is imparted to the neural network during training, which eliminates insufficient learning of low-frequency words. For example, "Zhang San" and "Li Si" are synonyms: "Li Si" appears many times in the training text and is considered fully trained, while "Zhang San" appears rarely and, according to traditional word2vec, would not be fully trained. In the network of the invention, when "Zhang San" is trained, its word vector can also be updated through the error back propagation mechanism, based on the word-word classification network and the word vector of "Li Si", so the network of the invention helps eliminate insufficient learning of low-frequency words.
Similarly, because prior knowledge about words is imparted to the neural network during training, the word-word classification network strengthens the distinction and the connection between the word vectors of two input words that have a prior relationship, overcoming the shortcoming of the original word2vec network model that word vectors depend only on the surrounding text.
The foregoing is merely an embodiment of the present invention. Specific structures and characteristics that are common knowledge in the art are not described here in detail; a person of ordinary skill in the art knows the common general knowledge in the technical field as of the application date or the priority date, can learn all the prior art in the field, and has the ability to apply the conventional experimental means of that date, so that such a person can, in the light of this application, complete and implement the present embodiment in combination with his or her own abilities, and some typical known structures or known methods should not become an obstacle to implementing this application. It should be noted that modifications and improvements can be made by those skilled in the art without departing from the structure of the present invention; these should also be regarded as falling within the protection scope of the present invention and do not affect the effect of implementing the invention or the utility of the patent. The protection scope of this application shall be subject to the content of the claims, and the description of the specific embodiments in the specification can be used to interpret the content of the claims.

Claims (8)

1. A supervised word vector learning method is characterized in that: the method comprises the following steps:
step one, a word relation classification model is added on the basis of a word2vec neural network model, and a deep learning network model is built;
step two, inputting a plurality of adjacent input word vectors and a certain appointed word vector into a deep learning network model for multitask learning;
repeating the second step, and performing iterative computation to obtain an optimized word2vec neural network model and a word relation classification model;
after training is completed, not only can word vectors corresponding to words be obtained, but also the relation between the word vectors and the appointed word vectors can be calculated according to the classifier model;
in the second step, the word2vec neural network optimizes the neural network parameters through an error back propagation mechanism, wherein the errors comprise classification errors of Huffman trees and word relation classification errors;
in the third step, a plurality of randomly selected input vectors and specified vectors are respectively input into a word2vec neural network model and a word relation classification model, and an output word vector and the relation between the output word vector and the specified word vector are obtained through calculation.
2. The supervised word vector learning method of claim 1, wherein: before the first step, the corpus text is segmented, and a word list and an initial word vector corresponding to the word list are established.
3. The supervised word vector learning method of claim 2, wherein: and labeling the relation between each word vector and each word vector in the corpus text according to the word list.
4. The supervised word vector learning method of claim 1, wherein: in the first step, the word relation classification model comprises an input layer, a splicing layer, a full-connection layer and a probability layer which are connected in sequence; the splicing layer splices an output vector Wi obtained through word2vec neural network model calculation and a designated vector Wk input into a word relation classification model according to the following formula: [Wi, Wk, Wi-Wk, Wi ∘ Wk, cos(Wi, Wk)].
5. The supervised word vector learning method of claim 2, wherein: in step two, an input word vector and a specified word vector are defined by the initial word vector.
6. The supervised word vector learning method of claim 5, wherein: in the second step, a plurality of word vectors adjacent to the output word vector are input to the neural network model of word2vec as input word vectors by adopting the continuous bag-of-words model.
7. The supervised word vector learning method of claim 4, wherein: in the second step, when the multitask learning is performed, the word2vec neural network model calculates an output vector Wi, and simultaneously, the word relation classification model calculates a relation label(Wk, Wi) of Wi and Wk.
8. The supervised word vector learning method of claim 1, wherein: in the second step, the word relation classification model optimizes the parameters of the full-connection layer through a neural network error back propagation mechanism.
CN201811075603.2A 2018-09-14 2018-09-14 Supervised word vector learning method Active CN109271632B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811075603.2A CN109271632B (en) 2018-09-14 2018-09-14 Supervised word vector learning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811075603.2A CN109271632B (en) 2018-09-14 2018-09-14 Supervised word vector learning method

Publications (2)

Publication Number Publication Date
CN109271632A CN109271632A (en) 2019-01-25
CN109271632B true CN109271632B (en) 2023-05-26

Family

ID=65188340

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811075603.2A Active CN109271632B (en) 2018-09-14 2018-09-14 Supervised word vector learning method

Country Status (1)

Country Link
CN (1) CN109271632B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110825875B (en) * 2019-11-01 2022-12-06 科大讯飞股份有限公司 Text entity type identification method and device, electronic equipment and storage medium
CN110852077B (en) * 2019-11-13 2023-03-31 泰康保险集团股份有限公司 Method, device, medium and electronic equipment for dynamically adjusting Word2Vec model dictionary
CN112989032A (en) * 2019-12-17 2021-06-18 医渡云(北京)技术有限公司 Entity relationship classification method, apparatus, medium and electronic device
CN111444346B (en) * 2020-03-31 2023-04-18 广州大学 Word vector confrontation sample generation method and device for text classification

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108280058A (en) * 2018-01-02 2018-07-13 中国科学院自动化研究所 Relation extraction method and apparatus based on intensified learning
US10037362B1 * 2017-07-24 2018-07-31 International Business Machines Corporation Mining procedure dialogs from source content
CN108388914A (en) * 2018-02-26 2018-08-10 中译语通科技股份有限公司 A kind of grader construction method, grader based on semantic computation

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170139899A1 (en) * 2015-11-18 2017-05-18 Le Holdings (Beijing) Co., Ltd. Keyword extraction method and electronic device
CN105975478A (en) * 2016-04-09 2016-09-28 北京交通大学 Word vector analysis-based online article belonging event detection method and device
CN106557462A (en) * 2016-11-02 2017-04-05 数库(上海)科技有限公司 Name entity recognition method and system
CN106682220A (en) * 2017-01-04 2017-05-17 华南理工大学 Online traditional Chinese medicine text named entity identifying method based on deep learning
CN106933806A (en) * 2017-03-15 2017-07-07 北京大数医达科技有限公司 The determination method and apparatus of medical synonym
CN107145503A (en) * 2017-03-20 2017-09-08 中国农业大学 Remote supervision non-categorical relation extracting method and system based on word2vec
CN107247780A (en) * 2017-06-12 2017-10-13 北京理工大学 A kind of patent document method for measuring similarity of knowledge based body
CN107291693B (en) * 2017-06-15 2021-01-12 广州赫炎大数据科技有限公司 Semantic calculation method for improved word vector model
CN107895000B (en) * 2017-10-30 2021-06-18 昆明理工大学 Cross-domain semantic information retrieval method based on convolutional neural network
CN107895051A (en) * 2017-12-08 2018-04-10 宏谷信息科技(珠海)有限公司 A kind of stock news quantization method and system based on artificial intelligence
CN108388654B (en) * 2018-03-01 2020-03-17 合肥工业大学 Sentiment classification method based on turning sentence semantic block division mechanism

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10037362B1 * 2017-07-24 2018-07-31 International Business Machines Corporation Mining procedure dialogs from source content
CN108280058A (en) * 2018-01-02 2018-07-13 中国科学院自动化研究所 Relation extraction method and apparatus based on intensified learning
CN108388914A (en) * 2018-02-26 2018-08-10 中译语通科技股份有限公司 A kind of grader construction method, grader based on semantic computation

Also Published As

Publication number Publication date
CN109271632A (en) 2019-01-25

Similar Documents

Publication Publication Date Title
CN109271632B (en) Supervised word vector learning method
CN109840287B (en) Cross-modal information retrieval method and device based on neural network
CN108052512B (en) Image description generation method based on depth attention mechanism
CN110377903B (en) Sentence-level entity and relation combined extraction method
CN108733742B (en) Global normalized reader system and method
CN111914067B (en) Chinese text matching method and system
US20240177047A1 (en) Knowledge grap pre-training method based on structural context infor
CN110222163A (en) A kind of intelligent answer method and system merging CNN and two-way LSTM
CN109657239A (en) The Chinese name entity recognition method learnt based on attention mechanism and language model
CN108132931A (en) A kind of matched method and device of text semantic
CN105938485A (en) Image description method based on convolution cyclic hybrid model
CN110688489B (en) Knowledge graph deduction method and device based on interactive attention and storage medium
CN111898374B (en) Text recognition method, device, storage medium and electronic equipment
CN111274794B (en) Synonym expansion method based on transmission
CN111368870A (en) Video time sequence positioning method based on intra-modal collaborative multi-linear pooling
CN108536735B (en) Multi-mode vocabulary representation method and system based on multi-channel self-encoder
CN110399454B (en) Text coding representation method based on transformer model and multiple reference systems
CN112801762B (en) Multi-mode video highlight detection method and system based on commodity perception
CN116204674B (en) Image description method based on visual concept word association structural modeling
CN111191461B (en) Remote supervision relation extraction method based on course learning
CN114881042A (en) Chinese emotion analysis method based on graph convolution network fusion syntax dependence and part of speech
CN113779190B (en) Event causal relationship identification method, device, electronic equipment and storage medium
CN114492459A (en) Comment emotion analysis method and system based on convolution of knowledge graph and interaction graph
CN114492451A (en) Text matching method and device, electronic equipment and computer readable storage medium
CN116680407A (en) Knowledge graph construction method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
CB03 Change of inventor or designer information

Inventor after: Qin Hong Hui

Inventor after: Du Ruo

Inventor after: Xiang Hai

Inventor after: Hou Cong

Inventor after: Liu Ke

Inventor before: Qin Hong Hui

GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20231127

Address after: Room 1208-208, 12th Floor, Building 2, Fuhai Center, Daliushu, Haidian District, Beijing, 100081

Patentee after: Beijing Zhicheng Excellence Technology Co.,Ltd.

Address before: 401120 No. 1, Floor 3, Building 11, Internet Industrial Park, No. 106, West Section of Jinkai Avenue, Yubei District, Chongqing

Patentee before: CHONGQING XIEZHI TECHNOLOGY Co.,Ltd.

TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20231220

Address after: No. 1-0286, Juhe 6th Street, Jufuyuan Industrial Park, Tongzhou Economic Development Zone, Beijing, 101127

Patentee after: Beijing Star Cube Digital Technology Co.,Ltd.

Address before: Room 1208-208, 12th Floor, Building 2, Fuhai Center, Daliushu, Haidian District, Beijing, 100081

Patentee before: Beijing Zhicheng Excellence Technology Co.,Ltd.