CN110348001A - Word vector training method and server - Google Patents
Word vector training method and server
- Publication number
- CN110348001A (application CN201810299633.5A)
- Authority
- CN
- China
- Prior art keywords
- word
- vector
- context
- word vector
- input
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F18/214 — Pattern recognition; Analysing; Design or setup of recognition systems or techniques; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/22 — Pattern recognition; Analysing; Matching criteria, e.g. proximity measures
- G06F40/211 — Handling natural language data; Natural language analysis; Parsing; Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
- G06F40/289 — Handling natural language data; Natural language analysis; Recognition of textual entities; Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/30 — Handling natural language data; Semantic analysis
Abstract
The embodiments of the invention disclose a word vector training method and a server for integrating direction information into word vectors, so as to meet the requirements of semantic and syntactic tasks in natural language processing. The word vector training method provided by the embodiments of the invention comprises: obtaining a corresponding input word vector according to a word in a training sample text; obtaining a corresponding original output word vector according to the context words corresponding to the word in the training sample text; generating a target output word vector according to the original output word vector, the target output word vector carrying direction information indicating the position direction of the context word relative to the word; and training a word vector learning model using the input word vector and the target output word vector.
Description
Technical Field
The invention relates to the technical field of computers, in particular to a word vector training method and a server.
Background
The SG (Skip-Gram) model is currently a widely used word vector learning model and is deployed extensively in industrial settings. Trained on a large-scale corpus, the SG model can produce high-quality word vectors, and when combined with the negative sampling technique, word vectors can be computed efficiently and quickly, so that computational efficiency and result quality are ensured at the same time.
In the prior art, the SG model is built by establishing relationships between a word and the other words around it. Specifically, in a given corpus, for a segment of a word sequence, the SG model learns the relationship of every pair of words, i.e., it predicts the probability of outputting the other words given one word as input. The vector of each word is updated by optimizing these probability values.
Although current SG models can train word vectors effectively, the prior art still has corresponding disadvantages. For example, the SG model treats every word in the context window of a target word equally, so the context structure information around the target word cannot be reflected in its vector: all words surrounding a word are given equal importance. As a result, the word vectors learned by the SG model cannot embody context structure information and are insensitive to the position information of the target word, and therefore cannot be applied effectively to semantic and syntactic tasks in natural language processing.
Disclosure of Invention
The embodiment of the invention provides a word vector training method and a server, which are used for integrating direction information into word vectors and can meet the requirements of semantic and syntactic tasks of natural language processing.
In order to solve the above technical problems, embodiments of the present invention provide the following technical solutions:
in a first aspect, an embodiment of the present invention provides a word vector training method, including:
acquiring corresponding input word vectors according to words in the training sample text;
obtaining corresponding original output word vectors according to the context words corresponding to the words in the training sample text;
generating a target output word vector according to the original output word vector, wherein the target output word vector carries direction information for indicating the position direction of the context word relative to the word;
training a word vector learning model using the input word vectors and the target output word vectors.
In a second aspect, an embodiment of the present invention further provides a server, including:
the input word vector acquisition module is used for acquiring corresponding input word vectors according to words in the training sample text;
an output word vector obtaining module, configured to obtain a corresponding original output word vector according to a context word corresponding to the word in the training sample text;
an output word vector reconfiguration module, configured to generate a target output word vector according to the original output word vector, where the target output word vector carries direction information used to indicate a position direction of the context word relative to the word;
and the model training module is used for training a word vector learning model by using the input word vector and the target output word vector.
In the second aspect, the constituent modules of the server may further perform the steps described in the foregoing first aspect and various possible implementations, for details, see the foregoing description of the first aspect and various possible implementations.
In a third aspect, an embodiment of the present invention provides a server, where the server includes: a processor, a memory; the memory is used for storing instructions; the processor is configured to execute the instructions in the memory to cause the server to perform the method of any of the preceding first aspects.
In a fourth aspect, the present invention provides a computer-readable storage medium, which stores instructions that, when executed on a computer, cause the computer to perform the method of the above aspects.
In a fifth aspect, embodiments of the present invention provide a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of the above aspects.
According to the technical scheme, the embodiment of the invention has the following advantages:
in the embodiment of the invention, a corresponding input word vector is first obtained according to a word in a training sample text, and a corresponding original output word vector is obtained according to the context words of that word in the training sample text. A target output word vector is then generated from the original output word vector, where the target output word vector carries direction information indicating the position direction of the context word relative to the word, and a word vector learning model is trained using the input word vector and the target output word vector. Because the embodiment of the invention models the context of the input word in different position directions separately and integrates the structural information of the context words into word vector learning, the word vectors learned by the word vector model can embody the structural information of the context, and the word vectors obtained through the word vector learning model provided by the embodiment of the invention are suitable for various natural language processing tasks, especially semantic and syntax related tasks.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings described below are only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flow chart diagram of a word vector training method according to an embodiment of the present invention;
fig. 2 is a schematic view of an application scenario of the word vector training method according to the embodiment of the present invention;
fig. 3 is a schematic diagram of an SG model as a word vector learning model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of joint optimization provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram of an SSG model as a word vector learning model according to an embodiment of the present invention;
fig. 6-a is a schematic structural diagram of a server according to an embodiment of the present invention;
FIG. 6-b is a schematic diagram illustrating a structure of an output word vector reconfiguration module according to an embodiment of the present invention;
FIG. 6-c is a schematic diagram of a structure of a model training module according to an embodiment of the present invention;
FIG. 6-d is a schematic diagram illustrating a structure of another output word vector reconfiguration module according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a composition structure of a server to which the word vector training method according to the embodiment of the present invention is applied.
Detailed Description
The embodiment of the invention provides a word vector training method and a server, which are used for integrating direction information into word vectors and can meet the requirements of semantic and syntactic tasks of natural language processing.
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the embodiments described below are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments that can be derived by one skilled in the art from the embodiments given herein are intended to be within the scope of the invention.
The terms "comprises" and "comprising," and any variations thereof, in the description and claims of this invention and the above-described drawings are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The following are detailed below.
The word vector training method provided by the embodiment of the invention trains a word vector learning model using the direction information of the context. The word vector learning model may be an SG (Skip-Gram) model with context direction awareness; for convenience of description, the direction-aware SG model adopted in the embodiment of the invention is called the DSG (Directional Skip-Gram) model, and the DSG model provided by the embodiment of the invention helps to learn word vectors. The DSG model takes the view that the order information of words is a very important indicating signal in any language: for every input-output word pair, direction information is introduced into the output word vector to indicate whether the target word lies to the left or to the right (i.e., above or below) of the input word, which strengthens the guiding effect of the target word on the input word and yields better word vectors. In the embodiment of the invention, the structural information of the text is integrated into word vector learning by modeling the above text and the below text of the target word separately. Therefore, the word vectors learned by the DSG model can embody the structural information of the context; the direction information of the context enhances the semantic expression capability of the word vectors and adds syntactic capability at the same time, so the word vectors obtained in the embodiment of the invention are suitable for semantic and syntactic tasks in natural language processing.
The word vector training method provided in the embodiment of the present invention may be applied to a word vector learning scenario, and the method may be applied to a server, where the server may include a processor and a memory, where an input word vector and a target output word vector are stored by a storage device in the server, and the target output word vector carries direction information indicating a position direction of a context word with respect to a word. For example, the input word vector and the target output word vector are stored in the memory of the server, and the processor may read a program from the memory to execute the word vector training method provided by the embodiment of the present invention.
Referring to fig. 1, a word vector training method according to an embodiment of the present invention includes the following steps:
101. Acquire a corresponding input word vector according to a word in the training sample text.
In the embodiment of the present invention, a corpus stores training sample texts, and a training sample text may include a segment of vocabulary in which each item is a word, and each word has corresponding context words. For example, if the training sample text contains the continuous segment of vocabulary "A B C", then for word B, word A and word C constitute the context words of B. Words and their context words are first obtained from the training sample text, and a corresponding input word vector is obtained according to the word in the training sample text; the input word vector corresponds to the word and can be fed into the word vector learning model. The input vector is updated continuously during model training, and new words are continuously read from the corpus and written into the input vector.
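The following is a minimal illustrative sketch (not part of the claimed method; the function name, window size, and toy text are assumptions) of how (word, context word) pairs could be enumerated from a tokenized training sample text with a symmetric context window:

```python
# Illustrative sketch: enumerate (input word, context word) pairs from a tokenized
# training sample text with a symmetric context window. Names and the toy text
# are assumptions made for this example only.
def iter_word_context_pairs(tokens, window=2):
    for t, word in enumerate(tokens):
        lo = max(0, t - window)
        hi = min(len(tokens), t + window + 1)
        for i in range(lo, hi):
            if i != t:
                yield word, tokens[i]


if __name__ == "__main__":
    sample = ["A", "B", "C"]  # the "A B C" example from the description
    for word, ctx in iter_word_context_pairs(sample, window=1):
        print(word, "->", ctx)  # prints A -> B, B -> A, B -> C, C -> B
```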
102. Acquire a corresponding original output word vector according to the context words corresponding to the word in the training sample text.
In the embodiment of the invention, after a word and its context words are obtained from the training sample text, the original output word vector corresponding to the context word can be obtained; the original output word vector corresponds to the context words of the word and serves as the reference value of the prediction output of the word vector learning model. As the input vector is updated continuously during model training, new context words corresponding to the word are continuously read from the corpus and written into the original output word vector.
It should be noted that, in the embodiment of the present invention, the output word vector corresponding to the context word is described as the "original output word vector". After the input word vector and the output word vector are obtained, the word vector learning model cannot be trained directly; the original output word vector must first be reconfigured so that the output word vector carries the position direction of the context word relative to the word.
103. Generate a target output word vector according to the original output word vector, wherein the target output word vector carries direction information indicating the position direction of the context word relative to the word.
In the embodiment of the present invention, after the original output word vector is obtained, a target output word vector is generated from it. The target output word vector carries direction information indicating the position direction of the context word relative to the word; that is, the original output word vector is reconfigured so that it carries the position direction of the context word relative to the word. For convenience of distinction, the output word vector obtained after the original output word vector is reconfigured is referred to as the "target output word vector".
In the embodiment of the invention, the target output word vector carries direction information indicating the position direction of the context word relative to the word. The position direction indicates in which direction of the word the context word appears, and the direction information may be a one-dimensional array indicating that direction. For example, the position direction of the context word relative to the word may be that the context word appears above the word (i.e., in the left direction) or below the word (i.e., in the right direction); the direction information may take the value 1 if the context word appears above (to the left of) the word, and 0 if the context word appears below (to the right of) the word.
In some embodiments of the present invention, step 103 generating a target output word vector from the original output word vector comprises:
generating a direction vector according to the context word appearing above or below the word, wherein the direction vector is used for indicating that the context word appears above or below the word;
obtaining a target output word vector through the original output word vector and the direction vector, wherein the target output word vector comprises: the original output word vector and direction vector.
The order information of words is an important indicating signal in any language, and the context words of a word in the corpus indicate the order information associated with that word. The direction vector is used to indicate whether the context word appears above or below the word, and the target output word vector is obtained from the original output word vector and the direction vector. Introducing the direction vector to indicate whether the target word lies in the left or right direction of the input word strengthens the guiding effect of the target word on the input word and yields better word vectors.
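As a rough illustration of the data this variant implies (the array layout, dimension, vocabulary, and initialization below are assumptions, not the patent's prescription), each word can keep an input vector, an original output vector, and a direction vector, and the target output word vector of a context word is then the pair formed by its original output vector and its direction vector:

```python
import numpy as np

# Illustrative parameter layout for the explicit-direction variant. Every word has
# an input vector v, an original output vector v_out, and a direction vector delta;
# the target output word vector of a context word is the pair (v_out, delta).
rng = np.random.default_rng(0)
vocab = {"A": 0, "B": 1, "C": 2}   # toy vocabulary (assumption)
dim = 50                           # word vector dimension (assumption)

v = (rng.random((len(vocab), dim)) - 0.5) / dim   # input word vectors
v_out = np.zeros((len(vocab), dim))               # original output word vectors
delta = np.zeros((len(vocab), dim))               # direction vectors


def target_output_word_vector(context_word):
    """Return the reconfigured target for a context word: its original output
    vector together with its direction vector."""
    idx = vocab[context_word]
    return v_out[idx], delta[idx]
```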
In some embodiments of the present invention, step 103 generating a target output word vector from the original output word vector comprises:
acquiring an above output word vector from the original output word vector according to the context word appearing above the word;
acquiring a below output word vector from the original output word vector according to the context word appearing below the word;
obtaining a target output word vector from the above output word vector and the below output word vector, wherein the target output word vector comprises: the above output word vector and the below output word vector.
In the embodiment of the present invention, unlike the implementation in the foregoing embodiment in which the target output word vector carries a direction vector, the direction information may also be carried implicitly: two groups of output word vectors are designed, used respectively to express a word when it appears above or below any input word. Each word then has three vectors: one input word vector, one above output word vector, and one below output word vector. When computing word vectors, for any input word, the words above it use their above output word vectors, and the words below it use their below output word vectors, together with the input word vector of the input word, to compute the log-probability likelihood estimate. This implementation also actively distinguishes the above and below context of a word during word vector learning, because at any time a word's output vector can be updated only in its role as above context or as below context, so each of its two output vectors is updated with only half the probability.
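A minimal sketch of this implicit variant, under the assumption of dense numpy matrices and a simple dot-product score (names and shapes are illustrative, not the patent's specification):

```python
import numpy as np

# Illustrative sketch of the implicit variant: each word keeps one input vector and
# two output vectors, one used when the word occurs above (to the left of) the input
# word and one used when it occurs below (to the right of) it.
rng = np.random.default_rng(0)
vocab_size, dim = 1000, 100   # assumed sizes

v_in = (rng.random((vocab_size, dim)) - 0.5) / dim   # input word vectors
v_out_above = np.zeros((vocab_size, dim))            # output vectors for above context
v_out_below = np.zeros((vocab_size, dim))            # output vectors for below context


def pair_score(word_id, context_id, i):
    """Dot product between the input vector of the word and the position-appropriate
    output vector of the context word; i < 0 means the context word is above the
    input word, i > 0 means it is below."""
    out = v_out_above if i < 0 else v_out_below
    return float(v_in[word_id] @ out[context_id])
```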
104. Train the word vector learning model using the input word vector and the target output word vector.
In the embodiment of the present invention, after the input word vector and the target output word vector are obtained, the word vector learning model may be trained using them. The word vector learning model provided in the embodiment of the present invention may be an SG model with context direction awareness, referred to as the DSG model for short. Because the target output word vector carries direction information indicating the position direction of the context word relative to the word, training the word vector learning model integrates the structural information of the context words into word vector learning, so the word vectors learned by the model can embody the structural information of the context, and the word vectors obtained through the word vector learning model provided in the embodiment of the present invention are applicable to semantic and syntactic tasks in natural language processing. In the embodiment of the invention, the direction information is used to extend the existing word vector learning model, so various model variants can be derived for different usage scenarios, adapting to different tasks and producing higher-quality word vectors.
In some embodiments of the invention, when the target output word vector comprises the original output word vector and the direction vector, step 104 of training the word vector learning model using the input word vector and the target output word vector includes:
obtaining an interactive function calculation result according to the input word vector and the direction vector, and performing iterative updating on the input word vector and the direction vector according to the interactive function calculation result;
obtaining a conditional probability calculation result according to the input word vector and the original output word vector, and performing iterative updating on the input word vector and the original output word vector according to the conditional probability calculation result;
and estimating the optimal target of the word vector learning model according to the interactive function calculation result and the conditional probability calculation result.
In the embodiment of the present invention, the interaction function is computed from the input word vector and the direction vector, giving the interaction function calculation result; for example, the interaction relationship between the input word vector and the direction vector may be computed with a softmax function, so that the direction information is integrated into the final word vector. The values of the input word vector and the direction vector are updated synchronously according to the interaction function calculation result so that the result matches the expected outcome; for example, when the interaction relationship is computed with a softmax function, the value of the interaction function calculation result should tend to 1 when the context word is on the left side of the word and tend to 0 when the context word is on the right side of the word. In the embodiment of the invention, besides computing the interaction function between the input word vector and the direction vector, the conditional probability between the word and the context word is computed synchronously: a conditional probability calculation result is obtained from the input word vector and the original output word vector, for example by computing the conditional probability between words through the SG model, thereby modeling the semantic relationship between words. After the interaction function calculation result and the conditional probability calculation result are obtained through the above steps, joint optimization can be performed on them, i.e., the optimal target of the word vector learning model can be estimated, so that the optimization target of each word is updated iteratively through the interaction function calculation result and the conditional probability calculation result. After training of the word vector learning model is completed, the model yields high-quality word vectors for the input words.
Optionally, in some embodiments of the present invention, taking the softmax function as the interaction function as an example, obtaining the interaction function calculation result according to the input word vector and the direction vector includes:
the interaction function between the input word vector and the direction vector is calculated as follows:

g(ω_{t+i}, ω_t) = exp(v_{ω_t}^T · δ_{ω_{t+i}}) / Σ_{ω∈V} exp(v_{ω_t}^T · δ_ω)   (formula one)

where g(ω_{t+i}, ω_t) denotes the interaction function calculation result, δ_{ω_{t+i}} denotes the direction vector when the context word is ω_{t+i}, v_{ω_t} denotes the input vector when the word is ω_t, and V denotes the set of all words in the corpus. In the above formula, exp denotes the exponential function and T denotes transposition.
Optionally, in some embodiments of the present invention, iteratively updating the input word vector and the direction vector according to the interactive function calculation result includes:
the input word vector and the direction vector are iteratively updated in such a way that, among other things,
wherein,represents the updated word as ωtThe input vector of the time of day,represents the input vector before update, gamma represents the learning rate, deltaωt+iMeaning that the context word is ωt+iDirection vector of time, vωtRepresenting the word as omegatInput vector of time, σ (v)ωt Tδωt+i) Indicating the position direction predicted value of the context word relative to the word, D indicating the position direction mark value of the context word relative to the word,represents an updated context word of ωt+iThe direction vector of the time of flight,representing the context word before update as ωt+iThe direction vector of time.
In the above formula, a superscript (new) is used to indicate a vector after update, a superscript (old) is used to indicate a vector before update, γ is a learning rate, and the learning rate is a numerical variable that decreases as the training process progresses in the training of word vectors, for example, the learning rate may be defined as a ratio of an untrained text size to a total text size.
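The updates of formulas two and three can be written compactly as the following sketch; the sigmoid function, learning-rate handling, and function signature are assumptions consistent with the description above rather than a verbatim reproduction of the patented implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def update_direction_pair(v_w, delta_c, D, lr):
    """One update step for formulas two and three. v_w is the input vector of the
    word, delta_c is the direction vector of the context word, D is the
    position-direction marker value (1 if the context word is above the word,
    0 if it is below), and lr is the learning rate gamma. Returns the updated
    (v_w, delta_c)."""
    pred = sigmoid(v_w @ delta_c)            # predicted position direction
    err = pred - D                           # prediction error
    v_w_new = v_w - lr * err * delta_c       # formula two (input vector update)
    delta_c_new = delta_c - lr * err * v_w   # formula three (direction vector update)
    return v_w_new, delta_c_new
```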
Optionally, the position direction marker value D satisfies the following condition:

D = 1 when i < 0; D = 0 when i > 0   (formula four)

that is, when i < 0 the position direction of the context word relative to the word is the above, and when i > 0 the position direction of the context word relative to the word is the below.
For example, D is the marker information for whether the context word lies in the left or right direction of the input word and, as mentioned above, takes two values: i < 0 corresponds to vocabulary in the above text and i > 0 corresponds to vocabulary in the below text. In each training sample, the value of D is a marker obtained automatically from the position of the word during training.
Optionally, in some embodiments of the present invention, the optimal target of the word vector learning model is estimated according to the interactive function calculation result and the conditional probability calculation result:
the global log maximum likelihood estimate f (ω) is calculated as followst+i,ωt) Wherein
f(ωt+i,ωt)=p(ωt+i|ωt)+g(ωt+i,ωt) (formula five)
Wherein g (ω)t+i,ωt) Representing the result of the calculation of the interaction function, p (ω)t+i|ωt) Indicating conditionsAnd (5) calculating a result of the probability.
The joint log-likelihood estimate L_SG of the probabilities from the word to its context words is calculated as follows:

L_SG = Σ_{ω_t ∈ V} Σ_{−c ≤ i ≤ c, i ≠ 0} log f(ω_{t+i}, ω_t)   (formula six)

where V denotes the set of all words in the corpus, ω_{t+i} is the context word, ω_t is the word, and c denotes the context window size.
For example, the global log maximum likelihood estimate can be optimized through formula five, so the optimal-target estimation of the word vector learning model in the embodiment of the present invention can be converted into a joint optimization problem over two related functions, thereby realizing the optimal-target estimation of the word vector learning model.
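To make the joint optimization concrete, the sketch below combines a negative-sampling update of the conditional probability p (the description elsewhere pairs the SG model with negative sampling) with the direction update for g in a single step for one (input word, context word) pair; the exact combination, function names, and array layout are assumptions for illustration only:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def joint_pair_step(v_in, v_out, delta, w, c, i, neg_ids, lr):
    """Illustrative joint-optimization step for the pair (input word w, context
    word c) at relative position i. v_in, v_out, delta are (|V|, d) arrays of
    input vectors, output vectors, and direction vectors; neg_ids are sampled
    negative word ids; lr is the learning rate. Updates the arrays in place."""
    grad_in = np.zeros_like(v_in[w])

    # Conditional probability part, approximated with negative sampling:
    # label 1 for the true context word, label 0 for sampled negatives.
    for word_id, label in [(c, 1.0)] + [(n, 0.0) for n in neg_ids]:
        err = sigmoid(v_in[w] @ v_out[word_id]) - label
        grad_in += err * v_out[word_id]
        v_out[word_id] -= lr * err * v_in[w]

    # Direction part: formulas two to four.
    D = 1.0 if i < 0 else 0.0
    err_dir = sigmoid(v_in[w] @ delta[c]) - D
    grad_in += err_dir * delta[c]
    delta[c] -= lr * err_dir * v_in[w]

    # Finally update the input vector of w with both contributions.
    v_in[w] -= lr * grad_in
```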
As can be seen from the description of the embodiments above, a corresponding input word vector is first obtained according to a word in a training sample text, and a corresponding original output word vector is obtained according to the context words of that word in the training sample text. A target output word vector is then generated from the original output word vector, where the target output word vector carries direction information indicating the position direction of the context word relative to the word, and the word vector learning model is trained using the input word vector and the target output word vector. Because the embodiment of the invention models the context of the input word in different position directions separately, the structural information of the context words is merged into word vector learning, so the word vectors learned by the word vector model can embody the structural information of the context, and the word vectors obtained through the word vector learning model provided by the embodiment of the invention are suitable for various natural language processing tasks, especially tasks related to semantics and syntax.
In order to better understand and implement the above-mentioned schemes of the embodiments of the present invention, the following description specifically illustrates corresponding application scenarios.
The word vector learning model used in the embodiment of the present invention may be an improved SG model (hereinafter referred to as the DSG model). The SG model learns word vectors by establishing relationships between a word and the other words around it: in a given corpus, for a segment of a word sequence, the SG model learns the relationship of every pair of words, i.e., it predicts the probability of outputting the other words given one word as input, and the vector of each word is updated by optimizing these probability values. The method provided by the invention enhances the semantic capability of the SG model and adds syntactic capability at the same time.
The word vector training method provided by the embodiment of the invention is used as a basic algorithm and can be used in all natural language related application scenes and processing technologies and products required by the application scenes. The usage mode is generally to generate or update word vectors by using the word vector learning model provided by the invention, and deliver the generated vectors to be applied to subsequent natural language processing tasks. For example, the generated word vector can be used in a word segmentation and part-of-speech tagging system to improve the accuracy of word segmentation and part-of-speech tagging, thereby improving the subsequent processing capability. As another example, in a search and related scenarios, the obtained search results often need to be sorted, and the sorted results often need to calculate semantic similarity of each result to a search query statement (query). The similarity measurement can be achieved through similarity calculation of word vectors, and therefore the quality of the vectors greatly determines the effect of the semantic similarity calculation method. In addition to the above tasks, since the word vectors trained by the embodiments of the present invention effectively combine and distinguish context information of different words, it is possible to have better performance especially for tasks of semantic and syntactic types (e.g., part-of-speech tagging, chunking analysis, structural syntactic analysis, dependency syntactic analysis, etc.).
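As an illustration of the search-ranking use described above (the averaging of word vectors into a query or result representation is a common simplification and an assumption here, not something the patent prescribes):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def text_vector(words, word_vectors):
    """Average the vectors of the words present in the table (a simple, common choice)."""
    vecs = [word_vectors[w] for w in words if w in word_vectors]
    dim = len(next(iter(word_vectors.values())))
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def rank_results(query_words, candidate_results, word_vectors):
    """Sort candidate results (each a list of words) by cosine similarity of their
    averaged word vectors to the averaged query vector, highest first."""
    q = text_vector(query_words, word_vectors)
    return sorted(candidate_results,
                  key=lambda r: cosine(q, text_vector(r, word_vectors)),
                  reverse=True)
```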
Fig. 2 is a schematic view of an application scenario of the word vector training method according to the embodiment of the present invention. Human languages have a linear characteristic: the words of any language are expressed in a certain order, so word collocations form relatively fixed front-to-back ordering relationships; for example, in a sentence, one word may often appear to the left of another word, especially in a language such as Chinese that relies more heavily on word order than on syntax. Based on this analysis, the embodiment of the present invention models the above (left text) and below (right text) relations of the input word separately to reflect the word order relationship formed by the context of a word. On the basis of the SG model, the embodiment of the invention introduces an additional direction vector δ for each word, used to express and compute whether, as a context word, the word appears on the left side or the right side of an input word.
For this purpose, a softmax function g is defined, and the interaction between the direction vector of the context word and the word vector of the current input word is calculated as in formula one, so as to integrate the direction information into the final word vector.
Specifically, the interaction function computes, for the input word w_t, the relative position of the context word w_{t+i}, and the values of δ and v are updated synchronously according to the result of formula one so that the value of g tends to 1 when w_{t+i} is on the left side of w_t and tends to 0 when w_{t+i} is on the right side of w_t. To achieve this effect, δ and v can be updated as in formula two and formula three of the foregoing embodiments, where the superscript (new) denotes the vector after the update and (old) denotes the vector before the update; the learning rate is a numerical variable that decreases continuously as word vector training proceeds and is usually defined as the ratio of the untrained text size to the total text size. D is the marker information for the direction of the context word relative to the input word and, as in formula four of the foregoing embodiment, takes two values: i < 0 corresponds to the above vocabulary and i > 0 corresponds to the below vocabulary. In each training sample, the value of D distinguishes the above text from the below text and is a marker obtained automatically from the position of the word during training.
The g function defined by the above formula can be regarded as an effective means for modeling the structural information of the context, and besides the g function, in the embodiment of the present invention, an SG model is used to calculate the conditional probability between words for modeling the semantic relationship between words.
Fig. 3 is a schematic diagram of the SG model as the word vector learning model provided in the embodiment of the present invention. Here w_0 is the current word, and w_{-2}, w_{-1}, w_1 and w_2 are the context words of w_0. The SG model uses w_0 as input and maximizes the probability from w_0 to the other words, so the optimization goal of the SG model over the entire corpus is to maximize, for each word w_t, the joint log-likelihood estimate of the probability to its context, which may be estimated, for example, by the aforementioned formula six.
For convenience of explanation of the subsequent methods, formula six expresses the relation between w_t and w_{t+i} through the f function. In the SG model, f(w_{t+i}, w_t) is defined as a softmax function over word vectors, for example as shown in formula seven below:

f(w_{t+i}, w_t) = p(w_{t+i} | w_t) = exp(v'_{w_{t+i}}^T · v_{w_t}) / Σ_{w∈V} exp(v'_w^T · v_{w_t})   (formula seven)

where v_{w_t} denotes the input vector of w_t, v'_{w_{t+i}} denotes the output vector of w_{t+i}, and so on. Each word in the SG model has two vectors, one used when the word is the input word (denoted v) and the other used when it is the predicted output context word (denoted v'). The SG model therefore increases the value of the joint likelihood estimate in formula six by computing formula seven and iteratively updating the vectors of each word over the entire corpus, and outputs the vectors of all words after a specified number of iterations.
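A dense-softmax version of formula seven, for illustration only; in practice the SG model approximates this with negative sampling or hierarchical softmax, as the description notes elsewhere (names and shapes are assumptions):

```python
import numpy as np

def sg_softmax_probability(v_in_w, v_out_all, context_id):
    """Formula seven computed as a dense softmax over the vocabulary: the probability
    of the context word given the input word. v_in_w is the input vector of w_t and
    v_out_all is the (|V|, d) matrix of output vectors v'."""
    scores = v_out_all @ v_in_w
    scores -= scores.max()        # for numerical stability
    probs = np.exp(scores)
    probs /= probs.sum()
    return float(probs[context_id])
```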
Fig. 4 is a schematic diagram of joint optimization provided in the embodiment of the present invention. In the embodiment of the present invention, the optimization target of the DSG model is consistent with the function defined by formula six, and the global log maximum likelihood estimation is optimized, for example, by using formula five in the foregoing embodiment. Therefore, in the present invention, it can be considered as a joint optimization problem for two correlation functions, and the optimization target of each word can be expressed in the form shown in fig. 4, where the solid line arrow represents the prediction relationship and the dotted line arrow represents the vector update process of the input word.
In the foregoing implementation process, the method provided by the embodiment of the present invention has no special requirement on hardware, is consistent with a word vector learning model (e.g., SG model), can complete calculation by using a common processor, and can be implemented by using a single thread or multiple threads. The word vectors and direction vectors related to the present invention are stored in a Memory (RAM) during the calculation process, and are output to a disk or other carriers for storage after the calculation is completed. In the embodiment of the invention, the whole algorithm only needs to give one training corpus, and the vectors of the words contained in the corpus can be calculated according to parameters such as the size of a predefined window, the iteration times and the like.
The embodiment of the present invention further provides a Structured SG (Structured Skip-Gram, SSG) model. The SSG model considers not only the context words of the input word but also the influence of the positions of those context words on the input word, where the position of a context word refers to its relative position to the input word in the corpus, and the probability of the context word is predicted separately for each distinct position. The structure of the SSG model is similar to the SG model, as shown in FIG. 5, except that the SSG model estimates the probability of each context word at its corresponding position using different parameters: O_{-2}, O_{-1}, O_1 and O_2 in FIG. 5 indicate that different prediction relationships are used to predict the different words, as distinguished from FIG. 3, where a uniform O is used. Here O expresses the prediction relationship: the same O denotes the same prediction relationship, and O with different subscripts denotes different prediction relationships.
In the embodiment of the invention, the optimization target of the SSG model is consistent with that of the SG model: maximize the joint log-likelihood estimate over the whole corpus. The only difference is that the SSG model has multiple output vectors corresponding to the different positions, so f is defined as formula eight below:

f(w_{t+i}, w_t) = exp(v'_{r, w_{t+i}}^T · v_{w_t}) / Σ_{w∈V} exp(v'_{r, w}^T · v_{w_t})   (formula eight)

where r is the relative position, c is the context window size, and the remaining quantities have the same meaning as in the preceding formulas. The probability of a context word w_{t+i} given the input word must take into account its position relative to w_t, so the SSG model in effect defines a series of different "roles" (prototypes) for each context word to distinguish the effect a word has on the input word when it occurs at different positions. Compared with the SG model, distinguishing context words at different positions enables the SSG model to model the structural information of the context (here, information such as the arrangement and ordering of words) to a certain extent, so that it can learn richer inter-word relationships than the SG model.
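Formula eight differs from formula seven only in that the output vectors are indexed by the relative position r; a minimal sketch follows (the dict-of-matrices layout is an assumption):

```python
import numpy as np

def ssg_softmax_probability(v_in_w, v_out_by_position, context_id, r):
    """Formula eight: the SSG model keeps a separate output-vector matrix for each
    relative position r, so the probability of a context word is computed with the
    matrix belonging to its position. v_out_by_position maps r to a (|V|, d) matrix;
    the dense softmax is shown only for illustration."""
    v_out_all = v_out_by_position[r]
    scores = v_out_all @ v_in_w
    scores -= scores.max()        # for numerical stability
    probs = np.exp(scores)
    probs /= probs.sum()
    return float(probs[context_id])
```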
As can be seen from the foregoing examples of the SG, DSG, and SSG models, although each method can train word vectors effectively, they differ in several respects. For example, the SG model does not distinguish between different types of contexts and treats every word within the context window of each target (input) word equally. As a result, the context structure information around the target word cannot be reflected in its vector, all words around a word carry equal importance for that word, and much collocation information (especially fixed forward or backward collocations) cannot be reflected during vector learning. In contrast, the SSG model solves the context-distinguishing problem of the SG model and ensures that the context word at each position plays a specific and unique role; however, this significantly increases the computational complexity, and for a corpus of the same size, training an SSG model may take several times as long as training an SG model.
Table 1 lists the temporal and spatial complexity of the SG and SSG models, where d represents the dimension of a word vector (e.g., 50, 100, 200, etc.) and S is the size of the corpus (total number of tokens) used to train the word vectors. Here "token" refers to a word occurrence and differs from the notion of a vocabulary word: a corpus may contain 10,000 tokens but only 100 distinct words (i.e., the vocabulary), and the token count refers to the 10,000 occurrences. V is the set of all words in the corpus, o is the time required for one vector update, and n is the number of negative samples; negative sampling is an algorithm that effectively reduces the computational complexity of word vector computation. As can be seen from Table 1 below, when the context window grows, the spatial and temporal complexity of the SSG model exceeds that of the SG model by a factor on the order of the window size c; taking the context window size of 5 words typically used in word vector computation as an example, training an SSG model requires roughly 5 times the spatial and temporal complexity of training an SG model.
TABLE 1 Space-time complexity analysis of the models

Model | Spatial complexity | Temporal complexity
SG | 2|V|d | 2cS(n+1)o
SSG | (2c+1)|V|d | 4c^2 S(n+1)o
DSG | 3|V|d | 2cS(n+2)o
Lower space-time complexity means easier implementation and lower hardware requirements for the processor. The space-time complexity of the DSG model relative to the SG and SSG models is shown in the third data row of Table 1. Compared with the SG and SSG models, the DSG model can take into account a certain degree of learning of the structural information of the context while, compared with the SG model, not increasing the computational complexity significantly. The SSG model has higher computational complexity than the DSG model and, owing to the sparsity of word occurrences at different positions, is difficult to extend to a larger context window in practical computation, whereas the DSG model is less affected by this data sparsity problem. Considering the expression characteristics of Chinese, which is more sensitive to word order than to syntax, the DSG model is more suitable for learning Chinese word vectors and is more favourable for semantic understanding in a Chinese environment and further processing of Chinese vocabulary.
In the word vector training method provided by the invention, the DSG model introduces a group of additional direction vectors used to express the position information of context words relative to the input word, so the method can learn the structural information of the context. Compared with the SG and SSG models, the DSG model requires 1.5 times the spatial complexity of the conventional SG model, while its temporal complexity is close to that of the conventional SG model and far lower than that of the SSG model. In particular, the spatial complexity is not affected by the size of the context window, and the temporal complexity is linearly proportional to the window size, whereas that of the SSG model grows with the square of the window size.
In the embodiment of the present application, since a group of additional direction vectors is introduced, this group of direction vectors can be output separately and used directly to compute the positional relationship between one word and another, for example by directly computing the cosine between a word vector and a direction vector, where in sim-cosine(v1, d2), v1 refers to the word vector of word 1 and d2 refers to the direction vector of word 2. Computing the similarity between the direction vector of one word and the word vector of another simplifies the expression of position information, distinguishing only the above text from the below text. Meanwhile, since the order information of the context is integrated into the word vector representation, the word vectors learned by the DSG model have a certain syntactic adaptability, i.e., the final word vectors implicitly contain text structure information, which can to a certain extent help semantic and syntactic tasks of natural language processing (such as part-of-speech tagging, chunk recognition, dependency syntactic analysis, and the like).
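The sim-cosine(v1, d2) computation mentioned above amounts to a cosine between one word's word vector and another word's direction vector; a minimal sketch (the function name mirrors the notation in the text, everything else is an assumption):

```python
import numpy as np

def sim_cosine(v1, d2):
    """Cosine similarity between the word vector v1 of word 1 and the direction
    vector d2 of word 2, expressing the simplified positional relationship
    (above text vs. below text) between the two words."""
    return float(v1 @ d2 / (np.linalg.norm(v1) * np.linalg.norm(d2) + 1e-12))
```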
Owing to the context-distinguishing capability of the invention, the word vectors learned by the method acquire a more accurate ability to distinguish word classes. This is because, due to the characteristics of language structure, words of certain categories tend to follow a certain degree of ordering; for example, adjectives tend to precede nouns, and adverbs before and after verbs differ in function (consistent with the syntactic adaptability mentioned above). Therefore, when similar words are computed using word vectors learned by the DSG model, it is easier to obtain words of the same type (compared with the SG model), and word vectors with this capability can be computed more efficiently than with complex models such as the SSG model.
Without limitation, in the word vector training method provided by the embodiment of the present invention, the negative sampling algorithm may be replaced with a hierarchical softmax algorithm to calculate the probability of, and predict, the target word. Compared with negative sampling, hierarchical softmax can obtain better results when the training data is small, but the required computation space increases markedly as the training data grows.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
To facilitate a better implementation of the above-described aspects of embodiments of the present invention, the following also provides relevant means for implementing the above-described aspects.
Referring to fig. 6-a, a server 600 according to an embodiment of the present invention may include: an input word vector obtaining module 601, an output word vector obtaining module 602, an output word vector reconfiguring module 603, and a model training module 604, wherein,
an input word vector obtaining module 601, configured to obtain a corresponding input word vector according to a word in a training sample text;
an output word vector obtaining module 602, configured to obtain a corresponding original output word vector according to a context word corresponding to the word in the training sample text;
an output word vector reconfiguration module 603, configured to generate a target output word vector according to the original output word vector, where the target output word vector carries direction information used to indicate a position direction of the context word relative to the word;
a model training module 604, configured to train the word vector learning model using the input word vector and the target output word vector.
In some embodiments of the present application, referring to fig. 6-b, the output word vector reconfiguration module 603 includes:
a direction vector generation module 6031 configured to generate a direction vector according to the context word appearing above or below the word, where the direction vector is used to indicate that the context word appears above or below the word;
a first target output word vector generating module 6032, configured to obtain the target output word vector by using the original output word vector and the direction vector, where the target output word vector includes: the original output word vector and the direction vector.
In some embodiments of the present application, referring to fig. 6-c, the model training module 604 comprises:
an interactive function calculation module 6041, configured to obtain an interactive function calculation result according to the input word vector and the direction vector, and perform iterative update on the input word vector and the direction vector according to the interactive function calculation result;
a conditional probability calculation module 6042, configured to obtain a conditional probability calculation result according to the input word vector and the original output word vector, and iteratively update the input word vector and the original output word vector according to the conditional probability calculation result;
and an object estimation module 6043, configured to estimate an optimal object of the word vector learning model according to the interaction function calculation result and the conditional probability calculation result.
Further, in some embodiments of the present application, the interaction function calculation module 6041 is specifically configured to calculate the interaction function between the input word vector and the direction vector as:

g(ω_{t+i}, ω_t) = exp(v_{ω_t}^T · δ_{ω_{t+i}}) / Σ_{ω∈V} exp(v_{ω_t}^T · δ_ω)   (formula one)

where g(ω_{t+i}, ω_t) denotes the interaction function calculation result, δ_{ω_{t+i}} denotes the direction vector when the context word is ω_{t+i}, v_{ω_t} denotes the input vector when the word is ω_t, and V denotes the set of all words in the corpus.
Further, in some embodiments of the present application, the interaction function calculation module 6041 is specifically configured to iteratively update the input word vector and the direction vector as:

v_{ω_t}^(new) = v_{ω_t}^(old) − γ (σ(v_{ω_t}^T δ_{ω_{t+i}}) − D) · δ_{ω_{t+i}}   (formula two)

δ_{ω_{t+i}}^(new) = δ_{ω_{t+i}}^(old) − γ (σ(v_{ω_t}^T δ_{ω_{t+i}}) − D) · v_{ω_t}   (formula three)

where v_{ω_t}^(new) denotes the updated input vector when the word is ω_t, v_{ω_t}^(old) denotes the input vector before the update, γ denotes the learning rate, δ_{ω_{t+i}} denotes the direction vector when the context word is ω_{t+i}, σ(v_{ω_t}^T δ_{ω_{t+i}}) denotes the predicted position direction of the context word relative to the word, D denotes the position-direction marker value of the context word relative to the word, δ_{ω_{t+i}}^(new) denotes the updated direction vector when the context word is ω_{t+i}, and δ_{ω_{t+i}}^(old) denotes the direction vector before the update.
In some embodiments of the present application, the position direction flag value D satisfies the following condition:
wherein, when i < 0, the position direction of the context word with respect to the word is above, and when i > 0, the position direction of the context word with respect to the word is below.
Further, in some embodiments of the present application, the target estimation module 6043 is configured to calculate a global log maximum likelihood estimate f(ω_{t+i}, ω_t), wherein f(ω_{t+i}, ω_t) = p(ω_{t+i}|ω_t) + g(ω_{t+i}, ω_t), g(ω_{t+i}, ω_t) represents the interaction function calculation result, and p(ω_{t+i}|ω_t) represents the conditional probability calculation result; and to calculate a joint log-likelihood estimate L_SG of the probability of the word with respect to the context word, wherein V represents the set of all words in the corpus, ω_{t+i} is the context word, ω_t is the word, and c represents the context window size.
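The exact expression for L_SG is given only as a figure in the filing. The sketch below shows an assumed accumulation of f = p + g over every word position and every context offset within a window of size c; `cond_prob` and `interaction` are caller-supplied callables (for example, sketches like the ones above) and the function name is illustrative.

```python
def joint_log_likelihood(corpus_ids, cond_prob, interaction, c=2):
    """Assumed form of L_SG: sum f(w_{t+i}, w_t) over the corpus and a window of size c."""
    total = 0.0
    for t, word_id in enumerate(corpus_ids):
        for i in range(-c, c + 1):
            if i == 0 or not (0 <= t + i < len(corpus_ids)):
                continue
            context_id = corpus_ids[t + i]
            # f(w_{t+i}, w_t) = p(w_{t+i}|w_t) + g(w_{t+i}, w_t), as defined in the text
            total += cond_prob(context_id, word_id) + interaction(context_id, word_id)
    return total
```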
In some embodiments of the present application, referring to fig. 6-d, the output word vector reconfiguration module 603 includes:
an above output word vector generating module 6033, configured to obtain an above output word vector from the original output word vector when the context word appears above the word;
a below output word vector generating module 6034, configured to obtain a below output word vector from the original output word vector when the context word appears below the word;
a second target output word vector generating module 6035, configured to obtain the target output word vector by using the above output word vector and the below output word vector, where the target output word vector includes: the above output word vector and the below output word vector.
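A small sketch of this alternative embodiment follows. It is an assumption-level illustration, not the patented implementation: instead of a direction vector, each vocabulary word keeps two output vectors, one used when it occurs above the word and one when it occurs below; sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10000, 100
above_output_vectors = np.zeros((vocab_size, dim))   # used when the context word is above
below_output_vectors = np.zeros((vocab_size, dim))   # used when the context word is below

def directional_output_vector(context_word_id, i):
    """Pick the direction-specific output vector for a context word at offset i."""
    if i < 0:                                         # context word appears above the word
        return above_output_vectors[context_word_id]
    return below_output_vectors[context_word_id]     # context word appears below the word
```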
As can be seen from the above description of the embodiments of the present invention, a corresponding input word vector is first obtained according to a word in a training sample text, and a corresponding original output word vector is obtained according to a context word corresponding to the word in the training sample text. A target output word vector is then generated from the original output word vector, the target output word vector carrying direction information indicating the position direction of the context word relative to the word, and a word vector learning model is trained using the input word vector and the target output word vector. In the embodiment of the invention, the contexts of the input word in different position directions are modeled separately, and the structural information of the context words is incorporated into the word vector learning, so that the word vectors learned by the word vector learning model reflect the structural information of the context. The word vectors obtained with the word vector learning model provided by the embodiment of the invention are therefore suitable for both semantic and syntactic tasks in natural language processing.
Fig. 7 is a schematic diagram of a server 1100 according to an embodiment of the present invention. The server 1100 may vary considerably in configuration or performance, and may include one or more central processing units (CPUs) 1122 (e.g., one or more processors), a memory 1132, and one or more storage media 1130 (e.g., one or more mass storage devices) storing applications 1142 or data 1144. The memory 1132 and the storage medium 1130 may be transient storage or persistent storage. The programs stored on the storage medium 1130 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Further, the central processing unit 1122 may be configured to communicate with the storage medium 1130 and to execute, on the server 1100, the series of instruction operations stored in the storage medium 1130.
The server 1100 may also include one or more power supplies 1126, one or more wired or wireless network interfaces 1150, one or more input/output interfaces 1158, and/or one or more operating systems 1141, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
The steps of the word vector training method performed by the server in the above embodiment may be based on the server structure shown in fig. 7.
It should be noted that the above-described embodiments of the apparatus are merely schematic, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. In addition, in the drawings of the embodiment of the apparatus provided by the present invention, the connection relationship between the modules indicates that there is a communication connection between them, and may be specifically implemented as one or more communication buses or signal lines. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present invention may be implemented by software plus the necessary general-purpose hardware, and may also be implemented by dedicated hardware including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like. Generally, any function performed by a computer program can also be implemented by corresponding hardware, and the specific hardware structure used to implement the same function may take many forms, such as an analog circuit, a digital circuit, or a dedicated circuit. For the present invention, however, a software implementation is in most cases the preferred embodiment. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk of a computer, and the product includes instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.
In summary, the above embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the above embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the above embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (15)
1. A method for word vector training, comprising:
acquiring corresponding input word vectors according to words in the training sample text;
obtaining corresponding original output word vectors according to the context words corresponding to the words in the training sample text;
generating a target output word vector according to the original output word vector, wherein the target output word vector carries direction information for indicating the position direction of the context word relative to the word;
training a word vector learning model using the input word vector and the target output word vector.
2. The method of claim 1, wherein generating a target output word vector from the original output word vector comprises:
generating a direction vector according to the context word appearing above or below the word, the direction vector indicating that the context word appears above or below the word;
obtaining the target output word vector through the original output word vector and the direction vector, wherein the target output word vector comprises: the original output word vector and the direction vector.
3. The method of claim 2, wherein training a word vector learning model using the input word vector and the target output word vector comprises:
obtaining an interaction function calculation result according to the input word vector and the direction vector, and iteratively updating the input word vector and the direction vector according to the interaction function calculation result;
obtaining a conditional probability calculation result according to the input word vector and the original output word vector, and performing iterative updating on the input word vector and the original output word vector according to the conditional probability calculation result;
and estimating the optimal target of the word vector learning model according to the interaction function calculation result and the conditional probability calculation result.
4. The method of claim 3, wherein obtaining an interaction function calculation result according to the input word vector and the direction vector comprises:
calculating an interaction function between the input word vector and the direction vector, wherein g(ω_{t+i}, ω_t) represents the interaction function calculation result, δ_{ω_{t+i}} represents the direction vector when the context word is ω_{t+i}, v_{ω_t} represents the input word vector when the word is ω_t, and V represents the set of all words in the corpus.
5. The method of claim 3, wherein iteratively updating the input word vector and the direction vector according to the interaction function calculation result comprises:
iteratively updating the input word vector and the direction vector, wherein v_{ω_t}^{(new)} represents the updated input vector when the word is ω_t, v_{ω_t}^{(old)} represents the input vector before the update, γ represents the learning rate, δ_{ω_{t+i}} represents the direction vector when the context word is ω_{t+i}, v_{ω_t} represents the input vector when the word is ω_t, σ(v_{ω_t}^T δ_{ω_{t+i}}) represents the predicted position direction of the context word relative to the word, D represents the position direction flag value of the context word relative to the word, δ_{ω_{t+i}}^{(new)} represents the updated direction vector when the context word is ω_{t+i}, and δ_{ω_{t+i}}^{(old)} represents the direction vector before the update when the context word is ω_{t+i}.
6. The method according to claim 5, characterized in that the position direction flag value D satisfies the following condition:
wherein, when i < 0, the position direction of the context word with respect to the word is above, and when i > 0, the position direction of the context word with respect to the word is below.
7. The method according to any one of claims 3 to 6, wherein the estimation of the optimal target of the word vector learning model from the interaction function calculation result and the conditional probability calculation result is performed by:
the global log maximum likelihood estimate f (ω) is calculated as followst+i,ωt) Wherein
f(ωt+i,ωt)=p(ωt+iωt)+g(ωt+i,ωt),
wherein, the g (ω)t+i,ωt) Representing the result of said interaction function computation, said p (ω)t+iωt) Representing the conditional probability computation result;
calculating a joint log-likelihood estimate L of the probability of the word to the context word bySGWhich isIn (1),
wherein the V represents all word sets in the corpus, and the context word is omegat+iThe word is omegatAnd c represents a context window size.
8. The method of claim 1, wherein generating a target output word vector from the original output word vector comprises:
acquiring an above output word vector from the original output word vector when the context word appears above the word;
acquiring a below output word vector from the original output word vector when the context word appears below the word;
obtaining the target output word vector through the above output word vector and the below output word vector, where the target output word vector includes: the above output word vector and the below output word vector.
9. A server, comprising:
the input word vector acquisition module is used for acquiring corresponding input word vectors according to words in the training sample text;
an output word vector obtaining module, configured to obtain a corresponding original output word vector according to a context word corresponding to the word in the training sample text;
an output word vector reconfiguration module, configured to generate a target output word vector according to the original output word vector, where the target output word vector carries direction information used to indicate a position direction of the context word relative to the word;
and the model training module is used for training a word vector learning model by using the input word vector and the target output word vector.
10. The server according to claim 9, wherein the output word vector reconfiguration module comprises:
a direction vector generation module for generating a direction vector according to the context word appearing above or below the word, wherein the direction vector is used for indicating that the context word appears above or below the word;
a first target output word vector generation module, configured to obtain the target output word vector through the original output word vector and the direction vector, where the target output word vector includes: the original output word vector and the direction vector.
11. The server of claim 10, wherein the model training module comprises:
the interaction function calculation module is used for obtaining an interaction function calculation result according to the input word vector and the direction vector and iteratively updating the input word vector and the direction vector according to the interaction function calculation result;
the conditional probability calculation module is used for obtaining a conditional probability calculation result according to the input word vector and the original output word vector and performing iterative updating on the input word vector and the original output word vector according to the conditional probability calculation result;
and the target estimation module is used for estimating the optimal target of the word vector learning model according to the interaction function calculation result and the conditional probability calculation result.
12. The server according to claim 11, wherein the interaction function calculation module is configured to calculate the interaction function between the input word vector and the direction vector, wherein g(ω_{t+i}, ω_t) represents the interaction function calculation result, δ_{ω_{t+i}} represents the direction vector when the context word is ω_{t+i}, v_{ω_t} represents the input word vector when the word is ω_t, and V represents the set of all words in the corpus.
13. The server according to claim 11, wherein the interaction function calculation module is configured to iteratively update the input word vector and the direction vector, wherein v_{ω_t}^{(new)} represents the updated input vector when the word is ω_t, v_{ω_t}^{(old)} represents the input vector before the update, γ represents the learning rate, δ_{ω_{t+i}} represents the direction vector when the context word is ω_{t+i}, v_{ω_t} represents the input vector when the word is ω_t, σ(v_{ω_t}^T δ_{ω_{t+i}}) represents the predicted position direction of the context word relative to the word, D represents the position direction flag value of the context word relative to the word, δ_{ω_{t+i}}^{(new)} represents the updated direction vector when the context word is ω_{t+i}, and δ_{ω_{t+i}}^{(old)} represents the direction vector before the update when the context word is ω_{t+i}.
14. The server according to claim 13, wherein the position direction flag value D satisfies the following condition:
wherein, when i < 0, the position direction of the context word with respect to the word is above, and when i > 0, the position direction of the context word with respect to the word is below.
15. The server according to any one of claims 11 to 14, wherein the target estimation module is configured to calculate a global log maximum likelihood estimate f(ω_{t+i}, ω_t), wherein f(ω_{t+i}, ω_t) = p(ω_{t+i}|ω_t) + g(ω_{t+i}, ω_t), g(ω_{t+i}, ω_t) represents the interaction function calculation result, and p(ω_{t+i}|ω_t) represents the conditional probability calculation result; and to calculate a joint log-likelihood estimate L_SG of the probability of the word with respect to the context word, wherein V represents the set of all words in the corpus, ω_{t+i} is the context word, ω_t is the word, and c represents the context window size.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810299633.5A CN110348001B (en) | 2018-04-04 | 2018-04-04 | Word vector training method and server |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810299633.5A CN110348001B (en) | 2018-04-04 | 2018-04-04 | Word vector training method and server |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110348001A true CN110348001A (en) | 2019-10-18 |
CN110348001B CN110348001B (en) | 2022-11-25 |
Family
ID=68172691
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810299633.5A Active CN110348001B (en) | 2018-04-04 | 2018-04-04 | Word vector training method and server |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110348001B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115293156A (en) * | 2022-09-29 | 2022-11-04 | 四川大学华西医院 | Method and device for extracting prison short message abnormal event, computer equipment and medium |
Patent Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2010013228A1 (en) * | 2008-07-31 | 2010-02-04 | Ginger Software, Inc. | Automatic context sensitive language generation, correction and enhancement using an internet corpus |
CN104067340A (en) * | 2012-01-27 | 2014-09-24 | 三菱电机株式会社 | Method for enhancing speech in mixed signal |
CN104268200A (en) * | 2013-09-22 | 2015-01-07 | 中科嘉速(北京)并行软件有限公司 | Unsupervised named entity semantic disambiguation method based on deep learning |
US20170004208A1 (en) * | 2015-07-04 | 2017-01-05 | Accenture Global Solutions Limited | Generating a domain ontology using word embeddings |
US20180357531A1 (en) * | 2015-11-27 | 2018-12-13 | Devanathan GIRIDHARI | Method for Text Classification and Feature Selection Using Class Vectors and the System Thereof |
CN106383816A (en) * | 2016-09-26 | 2017-02-08 | 大连民族大学 | Chinese minority region name identification method based on deep learning |
CN107239443A (en) * | 2017-05-09 | 2017-10-10 | 清华大学 | The training method and server of a kind of term vector learning model |
CN107180247A (en) * | 2017-05-19 | 2017-09-19 | 中国人民解放军国防科学技术大学 | Relation grader and its method based on selective attention convolutional neural networks |
CN107239444A (en) * | 2017-05-26 | 2017-10-10 | 华中科技大学 | A kind of term vector training method and system for merging part of speech and positional information |
CN107273355A (en) * | 2017-06-12 | 2017-10-20 | 大连理工大学 | A kind of Chinese word vector generation method based on words joint training |
CN107291693A (en) * | 2017-06-15 | 2017-10-24 | 广州赫炎大数据科技有限公司 | A kind of semantic computation method for improving term vector model |
CN107526834A (en) * | 2017-09-05 | 2017-12-29 | 北京工商大学 | Joint part of speech and the word2vec improved methods of the correlation factor of word order training |
Non-Patent Citations (4)
Title |
---|
YAN SONG et al.: "Directional Skip-Gram: Explicitly Distinguishing Left and Right Context for Word Embeddings", Proceedings of NAACL-HLT 2018 *
LIU Jianping (Pinard): "Principles of word2vec (1): fundamentals of the CBOW and Skip-Gram models", HTTPS://WWW.CNBLOGS.COM/PINARD/P/7160330.HTML *
LI Bo et al.: "Research on an improved convolutional neural network method for relation classification", online publication: HTTP://KNS.CNKI.NET/KCMS/DETAIL/11.5602.TP.20170608.1424.012.HTML *
WANG Min et al.: "Research on a region-based algorithm for identifying the authenticity of traditional Chinese paintings", Computer Engineering and Applications *
Also Published As
Publication number | Publication date |
---|---|
CN110348001B (en) | 2022-11-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Stahlberg | Neural machine translation: A review | |
US20240013055A1 (en) | Adversarial pretraining of machine learning models | |
CN112115700B (en) | Aspect-level emotion analysis method based on dependency syntax tree and deep learning | |
US8566260B2 (en) | Structured prediction model learning apparatus, method, program, and recording medium | |
CN109344413B (en) | Translation processing method, translation processing device, computer equipment and computer readable storage medium | |
CN107220232B (en) | Keyword extraction method and device based on artificial intelligence, equipment and readable medium | |
Subramanya et al. | Efficient graph-based semi-supervised learning of structured tagging models | |
CN106557563B (en) | Query statement recommendation method and device based on artificial intelligence | |
Song et al. | Leveraging dependency forest for neural medical relation extraction | |
US20180365217A1 (en) | Word segmentation method based on artificial intelligence, server and storage medium | |
CN106909537B (en) | One-word polysemous analysis method based on topic model and vector space | |
CN109086265A (en) | A kind of semanteme training method, multi-semantic meaning word disambiguation method in short text | |
CN115983294B (en) | Translation model training method, translation method and translation equipment | |
WO2014073206A1 (en) | Information-processing device and information-processing method | |
CN110162594A (en) | Viewpoint generation method, device and the electronic equipment of text data | |
CN104536979A (en) | Generation method and device of topic model and acquisition method and device of topic distribution | |
CN110457471A (en) | File classification method and device based on A-BiLSTM neural network | |
Pan et al. | A content-based neural reordering model for statistical machine translation | |
CN110348001B (en) | Word vector training method and server | |
Forsati et al. | Hybrid PoS-tagging: A cooperation of evolutionary and statistical approaches | |
Zhang et al. | Multi-document extractive summarization using window-based sentence representation | |
CN113407776A (en) | Label recommendation method and device, training method and medium of label recommendation model | |
CN116011450A (en) | Word segmentation model training method, system, equipment, storage medium and word segmentation method | |
EP4064111A1 (en) | Word embedding with disentangling prior | |
Kirsch et al. | Noise reduction in distant supervision for relation extraction using probabilistic soft logic |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TG01 | Patent term adjustment |