WO2021227935A1

WO2021227935A1 - Training of word vector embedding model

Info

Publication number: WO2021227935A1
Application number: PCT/CN2021/092009
Authority: WO
Inventors: 曹绍升; 陈超超; 吴郑伟
Original assignee: 支付宝(杭州)信息技术有限公司
Priority date: 2020-05-09
Filing date: 2021-05-07
Publication date: 2021-11-18
Also published as: CN111291165A; CN111291165B

Abstract

A method for training a word vector embedding model. Said method comprises multiple iterative updating, wherein any one of the iterative updating comprises: firstly determining a central word and a plurality of corresponding context words from a word sequence corresponding to a training sentence; then determining, according to a first word vector matrix, a central word vector corresponding to the central word, and determining, according to a second word vector matrix, a plurality of context word vectors corresponding to the plurality of context words; then, determining, on the basis of the similarity between the plurality of context word vectors, a plurality of attention weights corresponding thereto; using the plurality of attention weights to perform weighted summation on the plurality of context word vectors to obtain a context representation vector of the central word; then, calculating a first similarity between the central word vector and the context representation vector; and finally, with at least increasing the first similarity as a target, updating the first word vector matrix and the second word vector matrix.

Description

Training word vector embedding model

Technical field

The embodiments of this specification relate to the application of machine learning technology to the field of text processing, and in particular to a method and device for training a word vector embedding model.

Background technique

Word vector technology solves the problem that computers cannot understand the semantics of human language by mapping words to real number vectors. For example, humans can easily determine that "猫" and "猫咪" are two words with very close semantics, but it is difficult for a computer to describe the semantic similarity of these two words. In this regard, the word vector algorithm can be used to generate a word vector for each of "cat" and "cat", and then by calculating the similarity between the word vectors, the semantic similarity between "cat" and "cat" can be determined. Therefore, the accuracy of the word vector algorithm determines the semantic comprehension ability of the computer.

However, the current word vector algorithm is relatively single, and it is difficult to meet multiple requirements. For example, while quickly generating word vectors for a large number of words, it is guaranteed that the determined word vectors have high accuracy. Therefore, a solution is needed that can quickly and accurately determine the word vectors of a large number of words.

Summary of the invention

In the method of training the word vector embedding model described in this manual, the word vector training framework CBOW is used for reference, and the self-attention mechanism is introduced, which can realize the rapid training of massive word vectors while effectively improving the accuracy of the trained word vectors. .

According to a first aspect, a method for training a word vector embedding model is provided, the word vector embedding model includes a first word vector matrix and a second word vector matrix; the method includes multiple iterations of updating, wherein any one of the iterations is updated Including: determining the central word and k context words of the central word from the word sequence corresponding to the training sentence, where k is an integer greater than 1, and determining the central word vector corresponding to the central word according to the first word vector matrix According to the second word vector matrix, determine the k context word vectors corresponding to the k context words; determine the corresponding k attention weights based on the similarity between the k context word vectors; Describe the k attention weights, perform a weighted summation on the k context word vectors to obtain the context representation vector of the central word; calculate the first similarity between the central word vector and the context representation vector; At least with the goal of increasing the first similarity, the first word vector matrix and the second word vector matrix are updated; the first word vector matrix after multiple iterations is used to query the target of the target word Word vector.

In one embodiment, before the multiple iterative updates, the method further includes: obtaining training corpus, including multiple training sentences; performing word segmentation on each training sentence, and obtaining the corresponding training corpus according to the word segmentation result According to the vocabulary list, initialize the first word vector matrix and the second word vector matrix, wherein one row or one column of each matrix corresponds to the vocabulary list A word in.

In a specific embodiment, obtaining the vocabulary list corresponding to the training corpus according to the word segmentation result includes: performing word frequency statistics according to the word segmentation result to obtain the word frequencies of a plurality of different word segmentation; from the plurality of different word segmentation The low-frequency word segmentation whose word frequency is lower than the predetermined threshold is removed to obtain the vocabulary list.

In one embodiment, determining the central word and the k context words of the central word from the word sequence corresponding to the training sentence includes: using a sliding window with a preset width, sliding along the word sequence, and changing any time Next, the word corresponding to the center position of the sliding window is determined as the central word, and words in the sliding window other than the central word are used as the k contextual words.

In one embodiment, determining the corresponding k attention weights based on the similarity between the k context word vectors includes: determining a k-order similarity square matrix based on the k context word vectors, The element in the i-th row and the j-th column represents the similarity between the i-th context word vector and the j-th context word vector, where i and j are positive integers not greater than k; for the k-th order similarity square matrix Each row of is normalized to obtain a k-order self-attention score square matrix; the average value of each column in the k-order self-attention score square matrix is respectively calculated to obtain the k self-attention weights.

In a specific embodiment, determining the k-th order similarity square matrix based on the k context word vectors includes: calculating the dot product between the i-th context word vector and the j-th context word vector as An element in the i-th row and j-th column in the k-th order similarity square matrix.

In another specific embodiment, the normalization process is performed on each row in the k-order similarity square matrix to obtain the k-order self-attention score square matrix, which includes: using a softmax function to perform the normalization on each row separately Normalization processing is performed to obtain the k-order self-attention score square matrix.

In one embodiment, calculating the first similarity between the central word vector and the context representation vector includes: calculating the dot product between the central word vector and the context representation vector as the The first degree of similarity.

In an embodiment, updating the first word vector matrix and the second word vector matrix with at least increasing the first degree of similarity includes: randomly extracting from the first word vector matrix A certain first word vector; calculate the second similarity between the certain first word vector and the context representation vector; aim to increase the first similarity and reduce the second similarity , Updating the first word vector matrix and the second word vector matrix.

According to a second aspect, an apparatus for training a word vector embedding model is provided, the word vector embedding model includes a first word vector matrix and a second word vector matrix; the device includes an iterative update unit for performing multiple iterations Update, the iterative update unit performs any one of the iterative updates through the following modules:

The word determination module is configured to determine the central word and k context words of the central word from the word sequence corresponding to the training sentence, where k is an integer greater than 1. The word vector determining module is configured to determine the central word vector corresponding to the central word according to the first word vector matrix; and to determine the k context word vectors corresponding to the k context words according to the second word vector matrix. The weight determination module is configured to determine k corresponding attention weights based on the similarity between the k context word vectors. The weighted summation module is configured to use the k attention weights to perform a weighted summation on the k context word vectors to obtain the context representation vector of the central word. The similarity calculation module is configured to calculate the first similarity between the central word vector and the context representation vector. The matrix update module is configured to update the first word vector matrix and the second word vector matrix with at least increasing the first degree of similarity; the first word vector matrix after multiple iterations update is used The target word vector for querying the target word.

According to a third aspect, there is provided a computer-readable storage medium having a computer program stored thereon, and when the computer program is executed in a computer, the computer is caused to execute the method described in the first aspect.

According to a fourth aspect, there is provided a computing device, including a memory and a processor, characterized in that executable code is stored in the memory, and when the processor executes the executable code, the implementation described in the first aspect is implemented. Methods.

In the above-mentioned method and device disclosed in the embodiments of this specification, the self-attention weight of multiple context word vectors is determined by introducing a self-attention mechanism, so as to realize the characterization of the mutual influence and internal association between multiple context word vectors. Furthermore, the weighted vector of multiple context word vectors is obtained as the context representation vector of the central word. In this way, compared with directly obtaining the average vector of multiple context word vectors as the context representation vector, the accuracy of the context representation vector can be improved, thereby Improve the accuracy of the updated word vector matrix, thereby improving the accuracy of the finally determined word embedding vector.

Description of the drawings

In order to more clearly describe the technical solutions of the multiple embodiments disclosed in this specification, the following will briefly introduce the drawings that need to be used in the description of the embodiments. Obviously, the drawings in the following description are only disclosed in this specification. For the multiple embodiments, those of ordinary skill in the art can obtain other drawings based on these drawings without creative work.

Fig. 1 shows a schematic diagram of two word vector matrices in a word embedding model according to an embodiment;

FIG. 2 shows a schematic diagram of selecting a center word and a context word based on a word sequence according to an embodiment;

FIG. 3 shows a schematic diagram of the architecture of training a word embedding model based on the selected words in FIG. 2 and the word vector matrix in FIG. 1 according to an embodiment;

FIG. 4 shows a schematic diagram of a training corpus processing flow in a method for training a word vector embedding model according to an embodiment;

Fig. 5 shows a schematic diagram of an iterative update process in a method for training a word vector embedding model according to an embodiment;

Fig. 6 shows a schematic structural diagram of an apparatus for training a word vector embedding model according to an embodiment.

Detailed ways

In the following, a number of embodiments disclosed in this specification will be described with reference to the accompanying drawings.

The embodiment of this specification discloses a method for training a word vector embedding model. Below, first introduce the inventive concept of the method proposed by the inventor, which is specifically as follows:

The word vector algorithm is used to map a word to a fixed-dimensional vector, so that the value of the vector can represent the semantic information of the word. At present, there are two common frameworks for training word vectors, namely Skigram and CBOW (Continuous Bag-of-Words Model). The accuracy of word vectors determined based on the Skigram framework is higher, but the training speed will be many times slower. In some scenarios with very large data volumes, the CBOW framework is more needed, but the accuracy of the word vectors determined based on it is limited.

Based on this, the inventor proposes a method for training word vector embedding models. By drawing on the CBOW framework and introducing a self-attention mechanism, it can quickly complete the word vector training of super-large-scale texts while effectively improving the trained The accuracy of the word vector.

Specifically, the above-mentioned word vector embedding model includes two word vector matrices established for the same multiple words, which are called the first word vector matrix and the second word vector matrix to facilitate distinguishing and description. In an embodiment, FIG. 1 shows a schematic diagram of two word vector matrices in a word embedding model according to an embodiment, and FIG. 1 shows a first word vector matrix corresponding to N (an integer greater than 1) words W _N*M and the second word vector matrix C _N*M . In the method of training the word vector embedding model, it includes multiple iterative updates. In each iterative update, the first word vector matrix and the second word vector matrix are updated, and further, after a predetermined number of iterations or until convergence Then, the N word vectors contained in the first word vector matrix updated in the last iteration are determined as the final word vectors of the N words.

In one embodiment, updating any one of the foregoing multiple iterations may include: first, from the word sequence corresponding to the training sentence, select the central word and the corresponding multiple context words, for example, the words shown in FIG. 2 Select the center word _t and 2b context words in the sequence, and the 2b context words correspond to the set {context word _i |i∈[tb,t+b],i≠t}; then, get the center from the first word vector matrix The first word vector of the word is used as the central word vector, and multiple second word vectors of multiple context words are obtained from the second word vector matrix as multiple context word vectors; then, the self-attention mechanism is introduced, based on multiple A context word vector is used to score multiple context words for self-attention, and then the weight of each context word among them is obtained, and the weight is used to perform a weighted summation on multiple context word vectors to obtain the context representation vector for the central word, for example As shown in Figure 3, based on 2b context word vectors, 2b self-attention weights are determined, and then 2b self-attention weights are used to perform a weighted summation on the 2b context word vectors to obtain the context representation vector c _{t of the central word} ′; Finally, based on the above-mentioned central word vector and context representation vector, the loss is determined and used to update the first word vector matrix and the second word vector matrix. For example, as shown in Figure 3, the central word vector w _t and the context representation vector are used c _t ′ calculates the training loss, which is used to adjust the two word vector matrices. In this way, iterative update of the two word vector matrices can be achieved.

Using the above method, by introducing a self-attention mechanism, the self-attention weights of multiple context word vectors are determined, and the mutual influence and internal association between multiple context word vectors are described, and then the value of multiple context word vectors is obtained. The weighted vector is used as the context representation vector of the central word. In this way, compared to directly obtaining the average vector of multiple context word vectors as the context representation vector, the accuracy of the context representation vector can be improved, thereby improving the accuracy of the updated word vector matrix Therefore, the accuracy of the finally determined word embedding vector is improved.

The following describes the implementation steps of the above method in conjunction with specific embodiments. Specifically, the execution subject of the foregoing method may be any device, equipment, system, server cluster, etc. that has computing and processing capabilities. In the above method, firstly, it includes processing the training corpus to establish a vocabulary list, and then initialize the corresponding first word vector matrix and the second word vector matrix, and determine multiple word sequences corresponding to multiple training sentences for subsequent use The first word vector matrix and the second word vector matrix are updated iteratively. In order to facilitate understanding, first introduce the processing of training corpus, and then introduce the process of multiple iterations and updates.

Fig. 4 shows a schematic diagram of a training corpus processing flow in a method for training a word vector embedding model according to an embodiment. As shown in Figure 4, the processing flow of the training corpus includes the following steps:

Step S410: Obtain training corpus, which includes multiple training sentences; Step S420, perform word segmentation on each training sentence, and obtain a vocabulary table corresponding to the training corpus and the word sequence corresponding to each training sentence according to the word segmentation result; Step S430: Initialize the first word vector matrix and the second word vector matrix according to the vocabulary list, wherein one row or one column of each matrix corresponds to a word in the vocabulary list.

The above steps are as follows:

First, in step S410, a training corpus is obtained, which includes a plurality of training sentences.

In one embodiment, a large amount of text can be crawled from the website as training corpus. In another embodiment, electronic texts of reference books, such as dictionaries, can be obtained as training corpus. Further, in one embodiment, sentence segmentation processing may be performed on the training corpus to obtain the above-mentioned multiple training sentences. The sentence segmentation processing can be based on common punctuation marks, such as comma, period, semicolon, exclamation point, etc., to segment the text. In another embodiment, the symbols in a text can be removed, and the retained text can be used as a training sentence. In a specific embodiment, for the user dynamics crawled from the social network platform, tags, tags, spaces and other symbols can be removed, and the remaining text can be used as the corresponding training sentence.

In this way, multiple training sentences included in the training corpus can be obtained. Next, in step S420, word segmentation is performed on multiple training sentences, and a vocabulary list and multiple word sequences are determined according to the word segmentation result.

Specifically, word segmentation is the process of recombining consecutive word sequences into word sequences according to certain specifications. The word segmentation in this step can be achieved by using existing word segmentation methods or word segmentation tools. For example, the word segmentation methods may include forward maximum matching, minimum segmentation, and word segmentation based on N-gram model (N-gram), and the word segmentation tools may include THULAC lexical tools, NLPIR word segmentation systems, and the like.

The above-mentioned word segmentation results include multiple word segmentation obtained after word segmentation of each training sentence. Based on this, a number of different word segmentation can be determined, and the above vocabulary list can be constructed. At the same time, by performing word segmentation on each training sentence, the corresponding word segmentation sequence can be obtained as a word sequence, thereby obtaining multiple word sequences corresponding to multiple training sentences.

In an embodiment, according to the multiple word segmentation contained in the word segmentation result, word frequency statistics can be performed to obtain the word frequencies of multiple different word segments, and the low-frequency segmentation words whose word frequency is lower than a predetermined threshold are removed, and the remaining word segmentation is used to construct the above Glossary. It should be noted that the above word frequency refers to the number of occurrences of words, and the predetermined threshold can be set according to actual needs, such as 10 or 20. In another embodiment, different word segments may be ranked according to word frequency, and then the last few words (such as 5 or 10 from the bottom) are discarded, and the above-mentioned vocabulary list is constructed using the remaining word segments. Correspondingly, for each of the above-mentioned word segmentation sequences, it is also necessary to remove the above-mentioned low-frequency words to obtain the corresponding word sequences. Therefore, multiple word sequences after removing low-frequency words can be obtained.

Above, the vocabulary and multiple word sequences can be determined. It needs to be understood that the words in the word sequence all exist in the vocabulary, and the words in the vocabulary may appear in several word sequences, or they may appear multiple times in a certain word sequence.

Then in step S430, the first word vector matrix and the second word vector matrix are initialized according to the vocabulary list, wherein one row or one column of each matrix corresponds to a word in the vocabulary list.

Specifically, each word in the vocabulary can be mapped to a fixed-dimensional vector as the word vector of each word. Therefore, the word vector corresponding to each word in the vocabulary can jointly form a word vector matrix. Assuming that the vocabulary includes N words, and mapping each word to an M-dimensional vector, an N*M-dimensional matrix can be obtained. Refer to Figure 1, where two word vector matrices are established, namely the first word vector matrix W _N*M and the second word vector matrix C _N*M . It needs to be understood that a word vector can be used as a row vector or a column vector of a matrix. Accordingly, a row or a column in the matrix corresponds to a word in the vocabulary.

While constructing the first word vector matrix and the second word vector matrix, the initialization can be completed. In an embodiment, a random algorithm may be used to initially assign values to the matrix elements. In another embodiment, the specific values of matrix elements can be arbitrarily designated by the staff. For example, for W _N*M and C _N*M shown in Fig. 1, the i-th (i∈[1,N ]) The i-th element of the row is set to 1, and the remaining elements are set to 0. It should be understood that the initialized first word vector matrix and the second word vector matrix may be the same or different, and there will usually be differences between the two in the subsequent iterative update process. In this way, the initialization of the two word vector matrices can be realized.

From the above, based on the steps shown in Figure 4, the word sequence corresponding to each training sentence can be determined based on the training corpus, and at the same time, the vocabulary can be determined, and then the construction of the first word vector matrix and the second word vector matrix can be completed And initialization.

Further, in the above-mentioned training method of the word vector embedding model, multiple iterative updates may be performed based on the word sequence corresponding to each training sentence and the two initialized word vector matrices. Specifically, FIG. 5 shows a schematic diagram of an iterative update process in a method for training a word vector embedding model according to an embodiment. As shown in Figure 5, any one iteration update can include the following steps:

Step S510: Determine the central word and k context words of the central word from the word sequence corresponding to the training sentence, where k is an integer greater than 1; Step S520, determine the central word corresponding to the first word vector matrix According to the second word vector matrix, determine the k context word vectors corresponding to the k context words; step S530, determine the corresponding k context word vectors based on the similarity between the k context word vectors Attention weights; step S540, use the k attention weights to perform a weighted summation on the k context word vectors to obtain the context representation vector of the central word; step S550, calculate the central word vector and The context represents the first degree of similarity between vectors; step S560, at least aiming at increasing the first degree of similarity, updating the first word vector matrix and the second word vector matrix; the multiple times The first word vector matrix after iterative update is used to query the target word vector of the target word.

The above steps are as follows:

First, in step S510, a central word and k context words of the central word are determined from the word sequence corresponding to the training sentence.

In one embodiment, this step may include: using a sliding window with a preset width, sliding along the word sequence, determining the word corresponding to the center position of the sliding window at any moment as the central word, and Words in the sliding window other than the central word are regarded as the k context words. In a specific embodiment, as shown in Figure 2, the width of the sliding window can be set to 2b+1. Accordingly, the center position of the sliding window at any time can be set to the b-th word in 2b+1. +1 words are regarded as words in the loyal shrine, and the other 2b words are regarded as k (=2b) context words. It should be noted that for the sliding window word acquisition method, each word sequence in the word sequence can be used as the central word in turn. For the default words in the sliding window, for example, the sliding window uses the first word of the word sequence as the center of the sliding window , At this time, after the first word is used as the central word, there is no previous word in the word sequence. At this time, the preset vector can be used to supplement, for example, b preset vectors are used as the b of the first word The word vector of the above words can be deduced by analogy for the rest. In this way, the central word and the corresponding k context words can be determined by the way of taking words through the sliding window.

In another embodiment, a word may be randomly selected from the word sequence as the head word, and then k words adjacent to it may be selected as k context words. For example, assuming that k is 3, at this time, after selecting the central word, one word that is adjacent to it before and two words that are adjacent to it can be selected as its three contextual words.

Above, the head word and its k context words can be determined based on the word sequence. Next, in step S520, according to the first word vector matrix, determine the central word vector corresponding to the central word; and, according to the second word vector matrix, determine k context word vectors corresponding to the k context words.

Specifically, as described above, the first word vector matrix and the second word vector matrix are initialized based on the vocabulary list, and the first mapping relationship between the words in the word list and the first word vector in the first word vector matrix, And the second mapping relationship with the second word vector in the second word vector matrix is established accordingly. The center word and context word are both words in the vocabulary. Therefore, the first word vector corresponding to the center word can be determined according to the first mapping relationship as the above center word vector, and k is determined according to the second mapping relationship. The k second word vectors corresponding to each context word are used as the above k context word vectors. According to one embodiment, the central word can be searched in the vocabulary shown in FIG. 1, and its subscript (for example, subscript 2) in the vocabulary can be determined, and then search in the first word vector matrix according to the subscript To the corresponding first word vector (eg, vector w ₂ ). At the same time, you can also search for k context words in the vocabulary shown in Figure 1, and determine the corresponding k subscripts (such as including subscript N), and then search in the second word vector matrix according to the k subscripts To the corresponding k second word vectors (for example, including the vector c _N ).

In this way, the central word vector corresponding to the central word and the k context word vectors corresponding to the k context words can be determined. Then, in step S530, based on the similarity between the k context word vectors among them, the corresponding k attention weights are determined.

In an embodiment, this step may include: firstly, based on the k context word vectors, determine a k-order similarity square matrix; then, normalize each row in the k-order similarity square matrix respectively After processing, the k-order self-attention score square matrix is obtained; then, the average value of each column in the k-order self-attention score square matrix is respectively calculated to obtain the k self-attention weights.

Further, for the above k-th order similarity square matrix, the element in the i-th row and j-th column represents the similarity between the i-th context word vector and the j-th context word vector. It needs to be understood that i and j are not greater than k Is a positive integer. In a specific embodiment, the dot product, Euclidean distance, or cosine distance between the two word vectors can be calculated as the similarity between the two, and then determined as the i-th row in the k-order similarity square matrix Element in column j.

Regarding the aforementioned normalization processing, in a specific embodiment, the softmax function may be used to perform normalization processing on each row in the k-order similarity square matrix. In this way, a square matrix of k-order self-attention scores can be obtained, and then the average value of the scores in each column can be obtained as the corresponding autonomy weight. It needs to be understood that each column in the k column corresponds to each context word vector, so that k self-attention weights corresponding to the k context word vectors can be obtained.

In another embodiment, this step may include: firstly, calculating the similarity between any two word vectors in the k context word vectors, so that

Similarity, where C is the symbol in permutation and combination,

Indicates the number of groups that can be obtained by taking any two of k different elements as a group; then, for any one of the context word vectors, calculate the sum of the similarity between other word vectors and the k sums Value; Then, normalize the k sum values to obtain the k self-attention weights. It should be noted that, for the description of the similarity calculation and normalization processing, refer to the relevant description in the foregoing, and will not be repeated.

From the above, k self-attention weights corresponding to k context word vectors can be obtained. Next, in step S540, the k attention weights are used to perform a weighted summation on the k context word vectors to obtain the context representation vector of the central word. In an embodiment, refer to Figure 3, which shows 2b context word vectors and 2b self-attention weights, from which the context representation vector of the central word can be calculated:

c _t ′=∑ _{i∈[tb,t+b]-{t}} a _i ·c _i (1)

It should be noted that the subscript i in formula (1) is based on the order of words in the sliding window in Figure 2, or in other words, based on k context words, which is different from the vocabulary in Figure 1 Word subscripts. In addition, t is the subscript of the head word, i∈[tb, t+b]-{t} means that except for the head word, i points to each context word in turn; c _{i means the i-th in} k context word vectors Context word vectors, a _i represents the corresponding k-th self-attention weight.

Thus, through weighted summation, the context representation vector of the central word can be obtained. Then in step S550, the first similarity between the central word vector and the context representation vector is calculated. In a specific embodiment, the dot product, or cosine distance or Euclidean distance, etc. between the central word vector and the context representation vector may be calculated as the first similarity. Based on this, in step S560, at least with the goal of increasing the first degree of similarity, the first word vector matrix and the second word vector matrix are updated.

In an embodiment, this step may include: randomly extracting a certain first word vector from the first word vector matrix, and calculating the second difference between the certain first word vector and the context representation vector Similarity; then, with the goal of increasing the first similarity and reducing the second similarity, the first word vector matrix and the second word vector matrix are updated. In this way, the accuracy of the updated word vector matrix can be further improved. It should be noted that for the calculation of the second similarity, the description of calculating the similarity between the vectors can be used, which will not be repeated here.

In a specific embodiment, the training loss can be calculated using the following formula:

L=-logσ(w _t ·c _t ′)-λlogσ(-w _r ·c _t ′) (2)

In the above formula (2), L represents the above training loss, w _t represents the above central word vector, c _t ′ represents the context representation vector of the above central word, and w _t ·c _t ′ represents the point between _{w t} and c _{t ′} Product, w _r represents a certain first word vector drawn randomly above, w _r · c _t ′ represents _{the dot product between w t} and c _t ′, λ is a hyperparameter, for example, it can be set to 0.01, etc., σ Represents activation functions commonly used in neural networks, such as tanh function or sigmoid function. In this way, the training loss can be determined based on the first similarity and the second similarity, and then the two word vector matrices can be updated according to the training loss.

It should be noted that updating the two word vector matrices according to the training loss may include: determining the loss gradients of related elements in the two word vector matrices according to the training loss, and then using the current values of each element in the related elements to subtract the corresponding loss gradients Multiply the product with the learning step size (super parameter, for example, set to 0.05) to obtain the updated element value, thereby realizing the update of the two word vector matrices.

Above, the update of the first word vector matrix and the second word vector matrix can be realized. It should be noted that the above steps S510 to S560 describe the process of any one iteration update. By repeating the above steps S510 to S560, multiple iterative updates can be achieved, and the first word vector matrix after multiple iterative updates is used as the word vector query matrix corresponding to the vocabulary to query the target word vector of the target word . In a specific embodiment, after a certain search engine receives a user's query instruction for a target word, it can query the target word vector corresponding to the target word from the word vector query matrix, and then according to the target word vector, from The content database determines the relevant content used for feedback to the user.

In summary, using the method of training word vector embedding models disclosed in the embodiments of this specification, by introducing a self-attention mechanism, the self-attention weights of multiple context word vectors are determined, and the mutual influence and interaction between multiple context word vectors are realized. To characterize the internal association, and then obtain the weighted vector of multiple context word vectors as the context representation vector of the central word. In this way, compared to directly obtaining the average vector of multiple context word vectors as the context representation vector, the accuracy of the context representation vector can be improved, thereby improving the accuracy of the updated word vector matrix, and thereby improving the final word embedding. The accuracy of the vector.

Corresponding to the above-mentioned training method, the embodiment of this specification also discloses a training device. Specifically, FIG. 6 shows a schematic diagram of the device structure for training a word vector embedding model according to an embodiment. The word vector embedding model includes a first word vector matrix and a second word vector matrix. Capable of computing node or server cluster implementation.

As shown in FIG. 6, the device 600 includes an iterative update unit 610, configured to perform multiple iterative updates, and the iterative update unit performs any one of the iterative updates through the following modules:

The word determination module 611 is configured to determine the central word and k context words of the central word from the word sequence corresponding to the training sentence, where k is an integer greater than 1.

The word vector determining module 612 is configured to determine the central word vector corresponding to the central word according to the first word vector matrix; and to determine the k context word vectors corresponding to the k context words according to the second word vector matrix.

The weight determination module 613 is configured to determine k corresponding attention weights based on the similarity between the k context word vectors.

The weighted summation module 614 is configured to use the k attention weights to perform a weighted summation on the k context word vectors to obtain the context representation vector of the central word.

The similarity calculation module 615 is configured to calculate the first similarity between the central word vector and the context representation vector.

The matrix update module 616 is configured to update the first word vector matrix and the second word vector matrix with at least increasing the first degree of similarity; the first word vector matrix after multiple iterations of the update The target word vector used to query the target word.

In one embodiment, the above device 600 further includes: a corpus acquisition unit 620, configured to acquire training corpus, including multiple training sentences; and a word segmentation unit 630, configured to segment each training sentence, and obtain all the training sentences according to the word segmentation result. The vocabulary table corresponding to the training corpus, and the word sequence corresponding to each training sentence; the initialization unit 640 is configured to initialize the first word vector matrix and the second word vector matrix according to the vocabulary table, wherein each matrix A row or column of corresponds to a word in the vocabulary.

In a specific embodiment, the word segmentation unit 630 is specifically configured to: perform word frequency statistics according to the word segmentation result to obtain the word frequencies of a plurality of different word segmentation; remove low-frequency word segmentation whose word frequency is lower than a predetermined threshold from the plurality of different word segmentation , Get the vocabulary list.

In one embodiment, the word determining module 611 is specifically configured to: adopt a sliding window with a preset width, slide along the word sequence, and determine the word corresponding to the center position of the sliding window at any time as the center Words, words other than the central word in the sliding window are regarded as the k contextual words.

In an embodiment, the weight determining module 613 is specifically configured to determine a k-th order similarity square matrix based on the k context word vectors, where the element in the i-th row and the j-th column represents the i-th context word vector and the j-th context word vector. The similarity between two context word vectors, where i and j are positive integers not greater than k; each row in the k-order similarity square matrix is respectively normalized to obtain the k-order self-attention score square matrix ; Calculate the average value of each column in the k-order self-attention score square matrix to obtain the k self-attention weights.

In an embodiment, the similarity calculation module 615 is specifically configured to calculate the dot product between the central word vector and the context representation vector as the first similarity.

In an embodiment, the matrix update module 616 is specifically configured to: randomly extract a certain first word vector from the first word vector matrix; calculate the relationship between the certain first word vector and the context representation vector The second degree of similarity; with the goal of increasing the first degree of similarity and reducing the second degree of similarity, the first word vector matrix and the second word vector matrix are updated.

In summary, using the device for training word vector embedding models disclosed in the embodiments of this specification, by introducing a self-attention mechanism, the self-attention weights of multiple context word vectors are determined, and the mutual influence and interaction between multiple context word vectors are realized. To characterize the internal association, and then obtain the weighted vector of multiple context word vectors as the context representation vector of the central word. In this way, compared to directly obtaining the average vector of multiple context word vectors as the context representation vector, the accuracy of the context representation vector can be improved, thereby improving the accuracy of the updated word vector matrix, and thereby improving the final word embedding. The accuracy of the vector.

As above, according to another aspect of the embodiment, there is also provided a computer-readable storage medium on which a computer program is stored. When the computer program is executed in a computer, the computer is caused to execute the description in conjunction with FIG. 4 or FIG. 5 method.

According to an embodiment of yet another aspect, there is also provided a computing device, including a memory and a processor, the memory is stored with executable code, and when the processor executes the executable code, a combination of FIG. 4 or FIG. 5 is implemented. The described method.

Those skilled in the art should be aware that, in one or more of the foregoing examples, the functions described in the multiple embodiments disclosed in this specification can be implemented by hardware, software, firmware, or any combination thereof. When implemented by software, these functions can be stored in a computer-readable medium or transmitted as one or more instructions or codes on the computer-readable medium.

The specific implementations described above further describe the objectives, technical solutions, and beneficial effects of the multiple embodiments disclosed in this specification. It should be understood that the above descriptions are only examples of the multiple embodiments disclosed in this specification. The specific implementation mode is only, and is not used to limit the protection scope of the multiple embodiments disclosed in this specification. Any modification, equivalent replacement, improvement, etc. made on the basis of the technical solutions of the multiple embodiments disclosed in this specification , Should be included in the protection scope of the multiple embodiments disclosed in this specification.

Claims

A method for training a word vector embedding model. The word vector embedding model includes a first word vector matrix and a second word vector matrix; the method includes multiple iterative updates, wherein any one iterative update includes:

From the word sequence corresponding to the training sentence, determine the central word and k context words of the central word, where k is an integer greater than 1;

Determine the central word vector corresponding to the central word according to the first word vector matrix; determine the k context word vectors corresponding to the k context words according to the second word vector matrix;

Determine the corresponding k attention weights based on the similarity between the k context word vectors;

Using the k attention weights to perform a weighted summation on the k context word vectors to obtain the context representation vector of the central word;

Calculating the first similarity between the central word vector and the context representation vector;

At least with the goal of increasing the first similarity, the first word vector matrix and the second word vector matrix are updated; the first word vector matrix after multiple iterations is used to query the target of the target word Word vector.
The method according to claim 1, before the multiple iterations of updating, the method further comprises:

Obtain training corpus, including multiple training sentences;

Perform word segmentation on each training sentence, and obtain the vocabulary table corresponding to the training corpus and the word sequence corresponding to each training sentence according to the word segmentation result;

According to the vocabulary list, the first word vector matrix and the second word vector matrix are initialized, wherein one row or one column of each matrix corresponds to a word in the vocabulary list.
The method according to claim 2, wherein obtaining a vocabulary table corresponding to the training corpus according to the word segmentation result comprises:

Perform word frequency statistics according to the word segmentation results to obtain the word frequencies of multiple different word segmentation;

The low-frequency word segmentation whose word frequency is lower than a predetermined threshold is removed from the multiple different word segmentation to obtain the vocabulary list.
The method according to claim 1, wherein determining the central word and the k context words of the central word from the word sequence corresponding to the training sentence comprises:

A sliding window with a preset width is used to slide along the word sequence, the word corresponding to the center position of the sliding window at any moment is determined as the central word, and the central word is excluded from the sliding window As the k contextual words.
The method according to claim 1, wherein, based on the similarity of the k context word vectors to each other, determining the corresponding k attention weights comprises:

Based on the k context word vectors, determine the k-th order similarity square matrix, where the element in the i-th row and j-th column represents the similarity between the i-th context word vector and the j-th context word vector, where i and j are A positive integer not greater than k;

Normalize each row in the k-order similarity square matrix to obtain the k-order self-attention score square matrix;

The average value of each column in the k-order self-attention score square matrix is respectively calculated to obtain the k self-attention weights.
The method according to claim 5, wherein, based on the k context word vectors, determining a k-th order similarity square matrix comprises:

Calculate the dot product between the i-th context word vector and the j-th context word vector as an element in the i-th row and j-th column in the k-th order similarity square matrix.
The method according to claim 5, wherein normalizing each row in the k-order similarity square matrix to obtain a k-order self-attention score square matrix includes:

The softmax function is used to normalize the rows respectively to obtain the k-order self-attention score square matrix.
The method according to claim 1, wherein calculating the first similarity between the central word vector and the context representation vector comprises:

The dot product between the central word vector and the context representation vector is calculated as the first degree of similarity.
The method according to claim 1, wherein updating the first word vector matrix and the second word vector matrix at least with the goal of increasing the first degree of similarity comprises:

Randomly extract a certain first word vector from the first word vector matrix;

Calculating a second degree of similarity between the certain first word vector and the context representation vector;

With the goal of increasing the first similarity degree and reducing the second similarity degree, the first word vector matrix and the second word vector matrix are updated.
A device for training a word vector embedding model. The word vector embedding model includes a first word vector matrix and a second word vector matrix; the device includes an iterative update unit for performing multiple iterative updates, the iterative update The unit performs any one of the iterative updates through the following modules:

The word determination module is configured to determine the central word and k context words of the central word from the word sequence corresponding to the training sentence, where k is an integer greater than 1;

The word vector determining module is configured to determine the central word vector corresponding to the central word according to the first word vector matrix; determine the k context word vectors corresponding to the k context words according to the second word vector matrix;

A weight determination module configured to determine k corresponding attention weights based on the similarity between the k context word vectors;

A weighted summation module configured to use the k attention weights to perform a weighted summation on the k context word vectors to obtain the context representation vector of the central word;

A similarity calculation module, configured to calculate a first similarity between the central word vector and the context representation vector;

The matrix update module is configured to update the first word vector matrix and the second word vector matrix with at least increasing the first degree of similarity; the first word vector matrix after multiple iterations update is used The target word vector for querying the target word.
The device according to claim 10, further comprising:

The corpus acquisition unit is configured to acquire training corpus, which includes multiple training sentences;

The word segmentation unit is configured to segment each training sentence, and obtain the vocabulary table corresponding to the training corpus and the word sequence corresponding to each training sentence according to the word segmentation result;

The initialization unit is configured to initialize the first word vector matrix and the second word vector matrix according to the vocabulary list, wherein one row or one column of each matrix corresponds to a word in the vocabulary list.
The device according to claim 10, wherein the weight determination module is specifically configured to:

Based on the k context word vectors, determine the k-th order similarity square matrix, where the element in the i-th row and j-th column represents the similarity between the i-th context word vector and the j-th context word vector, where i and j are A positive integer not greater than k;

Normalize each row in the k-order similarity square matrix to obtain the k-order self-attention score square matrix;

The average value of each column in the k-order self-attention score square matrix is respectively calculated to obtain the k self-attention weights.
The device according to claim 10, wherein the matrix update module is specifically configured as:

Randomly extract a certain first word vector from the first word vector matrix;

Calculating a second degree of similarity between the certain first word vector and the context representation vector;

With the goal of increasing the first similarity degree and reducing the second similarity degree, the first word vector matrix and the second word vector matrix are updated.
A computer-readable storage medium having a computer program stored thereon, and when the computer program is executed in a computer, the computer is caused to execute the method of any one of claims 1-9.
A computing device, comprising a memory and a processor, characterized in that executable code is stored in the memory, and when the processor executes the executable code, the device described in any one of claims 1-9 is implemented method.