CN106897265B - Word vector training method and device - Google Patents


Info

Publication number
CN106897265B
CN106897265B
Authority
CN
China
Prior art keywords
vocabulary
library
old
huffman tree
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710022458.0A
Other languages
Chinese (zh)
Other versions
CN106897265A (en)
Inventor
李建欣
刘垚鹏
彭浩
张日崇
陈汉腾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University
Priority to CN201710022458.0A
Publication of CN106897265A
Application granted
Publication of CN106897265B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/23 Updating
    • G06F 16/2365 Ensuring data consistency and integrity
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/237 Lexical tools
    • G06F 40/242 Dictionaries

Abstract

The invention provides a word vector training method and device, belonging to the technical field of machine learning. The word vector training method comprises the following steps: acquiring a newly added vocabulary library, wherein the vocabulary in the newly added vocabulary library and the vocabulary in an old vocabulary library form a new vocabulary library, and the vocabulary in the old vocabulary library corresponds to old word vectors; initializing the vocabulary in the new vocabulary library, so that the word vectors of the words in the new vocabulary library that belong to the old vocabulary library are the old word vectors, and the word vectors of the words that belong to the newly added vocabulary library are random word vectors; and updating the word vectors of the words in the new vocabulary library according to a first Huffman tree corresponding to the new vocabulary library and a second Huffman tree corresponding to the old vocabulary library. The word vector training method and device provided by the invention improve the training efficiency of word vectors.

Description

Word vector training method and device
Technical Field
The invention relates to the technical field of machine learning, in particular to a word vector training method and device.
Background
In machine learning, in order for a machine to understand the meaning of human language, a word representation tool based on a neural network language model converts each word in the human language into a word vector, so that a computer can learn the meaning of each word through its word vector.
In the prior art, when new vocabulary is added to the vocabulary library, all the words in the resulting new vocabulary library generally have to be learned again to obtain a new word vector for each word. This approach makes the training of word vectors inefficient.
Disclosure of Invention
The invention provides a word vector training method and device, which improve the training efficiency of word vectors.
The embodiment of the invention provides a word vector training method, which comprises the following steps:
acquiring a newly added vocabulary library, wherein the vocabulary in the newly added vocabulary library and the vocabulary in an old vocabulary library form a new vocabulary library, and the vocabulary in the old vocabulary library corresponds to an old word vector;
initializing the vocabulary in the new vocabulary library, so that the word vectors of the words in the new vocabulary library that belong to the old vocabulary library are the old word vectors, and the word vectors of the words in the new vocabulary library that belong to the newly added vocabulary library are random word vectors;
and respectively updating word vectors of the words in the new vocabulary library according to the first Huffman tree corresponding to the new vocabulary library and the second Huffman tree corresponding to the old vocabulary library.
In an embodiment of the present invention, the updating the word vectors of the words in the new vocabulary library according to the first huffman tree corresponding to the new vocabulary library and the second huffman tree corresponding to the old vocabulary library respectively includes:
acquiring a preset target function corresponding to the first vocabulary, wherein the first vocabulary is a vocabulary in the new vocabulary library;
and performing gradient processing on the preset target function according to the attribute of the first vocabulary in the first Huffman tree and the attribute of the first vocabulary in the second Huffman tree to obtain a word vector corresponding to the first vocabulary.
In an embodiment of the present invention, the obtaining of the preset objective function corresponding to the first vocabulary includes:
if the first vocabulary belongs to the old vocabulary library, factorizing the first vocabulary according to an original objective function of a Skip-gram model to obtain a preset objective function corresponding to the first vocabulary;
and if the first vocabulary belongs to the newly added vocabulary library, the preset objective function corresponding to the first vocabulary is the original objective function of the Skip-gram model.
In an embodiment of the present invention, the obtaining of the preset objective function corresponding to the first vocabulary includes:
if the first vocabulary belongs to the old vocabulary library, performing factorization on the first vocabulary according to an original target function of a CBOW model to obtain a preset target function corresponding to the first vocabulary;
and if the first vocabulary belongs to the newly added vocabulary library, the preset target function corresponding to the first vocabulary is the original target function of the CBOW model.
In an embodiment of the present invention, factorizing the first vocabulary according to an original objective function of a Skip-gram model to obtain a preset objective function corresponding to the first vocabulary includes:
if the first vocabulary belongs to the old vocabulary library, the first vocabulary is factorized according to the matched part and the differing part of its Huffman codes on the second Huffman tree and the first Huffman tree, so as to obtain the preset objective function corresponding to the first vocabulary;
if the first vocabulary belongs to the newly added vocabulary library, the preset objective function corresponding to the first vocabulary is the original objective function of the Skip-gram model;
wherein w represents the first vocabulary, W represents the old vocabulary library, ΔW represents the newly added vocabulary library, C(w) represents the vocabulary library composed of the words in the context of w, u represents a word in the context of w, l* represents the length of the matched Huffman codes of the non-leaf node w on the second Huffman tree and the first Huffman tree, i indicates that the first vocabulary is the i-th node on the second Huffman tree, j indicates that the first vocabulary is the j-th node on the second Huffman tree, θ'^{u}_{j-1} represents the word vector of the (j-1)-th node on the first Huffman path corresponding to u, d^{u}_{j} represents the Huffman code of the j-th node on the second Huffman path corresponding to u, σ denotes the activation function, and v(w) denotes the word vector corresponding to w.
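For reference, the original objective function of the Skip-gram model under hierarchical softmax has the following standard form in the notation defined above, where primed quantities refer to the first (new) Huffman tree; this is a reconstruction for orientation and may differ in detail from the formula image in the original filing:

\mathcal{L}(w) = \sum_{u \in C(w)} \sum_{j=2}^{L'_u} \left[ (1 - d'^{u}_{j}) \log \sigma\!\left(v(w)^{\top} \theta'^{u}_{j-1}\right) + d'^{u}_{j} \log\!\left(1 - \sigma\!\left(v(w)^{\top} \theta'^{u}_{j-1}\right)\right) \right]

The factorized objective used for words of the old vocabulary library splits the inner sum at the matched code length l*, so that the matched prefix part reuses the inherited parameters and only the differing part involves the newly initialized parameters.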
In an embodiment of the present invention, the factorizing the first vocabulary according to the original objective function of the CBOW model to obtain the preset objective function corresponding to the first vocabulary includes:
if the first vocabulary belongs to the old vocabulary library, the first vocabulary is factorized according to the matched part and the differing part of its Huffman codes on the second Huffman tree and the first Huffman tree, so as to obtain the preset objective function corresponding to the first vocabulary;
if the first vocabulary belongs to the newly added vocabulary library, the preset objective function corresponding to the first vocabulary is the original objective function of the CBOW model;
wherein d^{w}_{j} represents the Huffman code of the j-th node on the second Huffman path corresponding to w, and X_w represents the sum of the word vectors corresponding to all the words in C(w).
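Analogously, the original objective function of the CBOW model under hierarchical softmax has the following standard form (again a reconstruction in the notation above rather than the patent's own figure, with primed quantities referring to the first Huffman tree), where X_w is the sum of the word vectors of the context words of w:

\mathcal{L}(w) = \sum_{j=2}^{L'_w} \left[ (1 - d'^{w}_{j}) \log \sigma\!\left(X_w^{\top} \theta'^{w}_{j-1}\right) + d'^{w}_{j} \log\!\left(1 - \sigma\!\left(X_w^{\top} \theta'^{w}_{j-1}\right)\right) \right]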
In an embodiment of the present invention, performing gradient processing on the preset objective function according to the attribute of the first vocabulary in the first Huffman tree and the attribute of the first vocabulary in the second Huffman tree to obtain the word vector corresponding to the first vocabulary includes:
if the first vocabulary belongs to the old vocabulary library and the coding of the first vocabulary in the first Huffman tree and its coding in the second Huffman tree have the same prefix part, performing stochastic gradient ascent processing on the vectors of the nodes corresponding to the differing part of the Huffman coding of the first vocabulary on the second Huffman tree, and performing stochastic gradient descent processing on the vectors of the nodes on the second Huffman tree corresponding to the differing part of the Huffman coding of the first vocabulary on the first Huffman tree;
if the first vocabulary belongs to the newly added vocabulary library, performing stochastic gradient ascent processing on the first vocabulary to obtain the word vector corresponding to the first vocabulary;
here, η' represents the learning rate.
In an embodiment of the present invention, performing gradient processing on the preset objective function according to the attribute of the first vocabulary in the first Huffman tree and the attribute of the first vocabulary in the second Huffman tree to obtain the word vector corresponding to the first vocabulary includes:
if the first vocabulary belongs to the old vocabulary library and the coding of the first vocabulary in the first Huffman tree and its coding in the second Huffman tree have the same prefix part, performing stochastic gradient ascent processing on the vectors of the nodes corresponding to the differing part of the Huffman coding of the first vocabulary on the second Huffman tree, and performing stochastic gradient descent processing on the vectors of the nodes on the second Huffman tree corresponding to the differing part of the Huffman coding of the first vocabulary on the first Huffman tree;
if the first vocabulary belongs to the newly added vocabulary library, performing stochastic gradient ascent processing on the first vocabulary to obtain the word vector corresponding to the first vocabulary;
wherein θ'^{w}_{i-1} represents the word vector of the (i-1)-th node on the first Huffman path corresponding to w.
The embodiment of the present invention further provides a word vector training device, including:
the acquisition module is used for acquiring a newly added vocabulary library, wherein the vocabulary in the newly added vocabulary library and the vocabulary in an old vocabulary library form a new vocabulary library, and the vocabulary in the old vocabulary library corresponds to an old word vector;
the initialization module is used for initializing the vocabulary in the new vocabulary library, so that the word vectors of the words in the new vocabulary library that belong to the old vocabulary library are the old word vectors, and the word vectors of the words in the new vocabulary library that belong to the newly added vocabulary library are random word vectors;
and the updating module is used for respectively updating the word vectors of the words in the new word bank according to the first Huffman tree corresponding to the new word bank and the second Huffman tree corresponding to the old word bank.
In an embodiment of the present invention, the updating module is specifically configured to obtain a preset objective function corresponding to the first vocabulary, where the first vocabulary is a vocabulary in the new vocabulary library; and performing gradient processing on the preset target function according to the attribute of the first vocabulary in the first Huffman tree and the attribute of the first vocabulary in the second Huffman tree to obtain a word vector corresponding to the first vocabulary.
According to the word vector training method and device provided by the embodiment of the invention, a newly added word library is obtained, and words in the new word library are initialized, so that word vectors of words in the new word library, which belong to words in an old word library, are old word vectors, and word vectors of words in the new word library, which belong to words in the newly added word library, are random word vectors; and respectively updating word vectors of words in the new vocabulary library according to the first Huffman tree corresponding to the new vocabulary library and the second Huffman tree corresponding to the old vocabulary library, so that the training efficiency of the word vectors is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a schematic flow chart of a word vector training method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a process for updating word vectors of words in a new vocabulary library according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a word vector training apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be noted that the following specific embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments.
Fig. 1 is a schematic flow chart of a word vector training method according to an embodiment of the present invention, where the word vector training method may be executed by a word vector training apparatus, and the word vector training apparatus may be integrated in a processor or may be separately configured, and the present invention is not limited in particular. Specifically, referring to fig. 1, the word vector training method may include:
and S101, acquiring a newly added vocabulary library.
And the vocabulary in the newly added vocabulary library and the vocabulary in the old vocabulary library form a new vocabulary library, and the vocabulary in the old vocabulary library corresponds to the old word vector.
In the embodiment of the invention, the vocabulary in the old vocabulary library is trained into the corresponding old word vector, and the vocabulary in the newly-added vocabulary library is not trained into the corresponding word vector. For example: the old vocabulary library is the vocabulary library of the existing trained word vector, the newly added vocabulary library comprises newly added vocabularies, and at the moment, the vocabularies in the old vocabulary library of the trained word vector and the newly added vocabularies are combined into a new vocabulary library.
S102, initializing the vocabulary in the new vocabulary library, so that word vectors of the vocabulary in the old vocabulary library in the new vocabulary library are old word vectors, and word vectors of the vocabulary in the newly added vocabulary library in the new vocabulary library are random word vectors.
For example, in the embodiment of the present invention, the old vocabulary library is W, where the words in the old vocabulary library have already been trained to obtain their corresponding word vectors v(w); the newly added vocabulary library is ΔW; the new vocabulary library is W' = W + ΔW; the second Huffman tree corresponding to the old vocabulary library W is T, and the first Huffman tree corresponding to the new vocabulary library W' is T'. For a first vocabulary w in the new vocabulary library, a judgment is made: if w is in the old vocabulary library W, w has already been trained to a corresponding word vector in the old vocabulary library, so the word is not trained again and the original v(w) is inherited; if the first vocabulary w is in the newly added vocabulary library, that is, it is a newly added word, the word vector corresponding to w is initialized randomly.
For example, taking the first vocabulary as an example, the first vocabulary is any vocabulary in the new vocabulary library, and the distribution of the first vocabulary on the first huffman tree can include two cases. In the first case: the first vocabulary is a leaf node on the first Huffman tree; in the second case: the first vocabulary is the non-leaf nodes on the first Huffman tree.
In the first case: if the first vocabulary is a leaf node on the first Huffman tree, the first vocabulary can be initialized according to the following formula 1:
v'(w) = v(w), if w ∈ W;  v'(w) = a random word vector, if w ∈ ΔW    (formula 1)
wherein w represents the first vocabulary, v(w) represents the word vector of w on the second Huffman tree T, and v'(w) denotes the word vector of w on the first Huffman tree T'.
It can be seen from formula 1 that, if the first vocabulary belongs to the old vocabulary library W, the word vector of the first vocabulary is the old word vector corresponding to the first vocabulary in the old vocabulary library; if the first vocabulary does not belong to the old vocabulary library W, that is, the first vocabulary is a newly added word, the word vector of the first vocabulary is initialized randomly, that is, the word vector of the first vocabulary is a random word vector.
In the second case: if the first vocabulary is a non-leaf node on the first Huffman tree, the non-leaf node has a parameter vector. To distinguish the parameter vectors, let the parameter vector at the i-th node on the first Huffman path corresponding to w1 be θ'^{w1}_{i}, and let the parameter vector at the i-th node on the first Huffman path corresponding to w2 be θ'^{w2}_{i}. When w1 and w2 correspond to the same node on the tree, θ'^{w1}_{i} = θ'^{w2}_{i}.
Assume that the code of a word w on the second Huffman tree is "0010" and its code on the first Huffman tree is "00011". Since the two Huffman codes have the same prefix "00", the vectors on the nodes corresponding to the same prefix "00" remain unchanged. Here, the identifiers L_w and L'_w are also set to represent the coding length of the first vocabulary w on the second Huffman tree and the coding length of the first vocabulary w on the first Huffman tree, respectively. The first vocabulary may then be initialized according to formula 2 as follows:
θ'^{w}_{i} = θ^{w}_{i}, if 1 ≤ i ≤ l*_w;  θ'^{w}_{i} = 0, if l*_w < i ≤ L'_w    (formula 2)
wherein d'^{w}_{i} represents the Huffman code of the i-th node on the first Huffman path of the non-leaf node w, d^{w}_{i} represents the Huffman code of the i-th node on the second Huffman path of the non-leaf node w, and l*_w represents the length of the matched Huffman code of the non-leaf node w on the second Huffman tree and on the first Huffman tree. The Huffman code corresponding to the non-leaf node w on the first Huffman tree can accordingly be divided into a prefix-matching part and the remaining nodes.
It can be seen in combination with formula 2 that, if the first vocabulary is a non-leaf node on the first Huffman tree, the vectors corresponding to the part of its code on the first Huffman tree that matches the prefix of its code on the second Huffman tree are the existing parameter vectors θ^{w}_{i}, and the vectors corresponding to the unmatched part of the code are initialized as zero vectors.
It should be noted that, in the embodiment of the present invention, for the first vocabulary, random initialization is adopted if the first vocabulary is a leaf node on the first Huffman tree, and initialization to a zero vector is adopted if it is a non-leaf node. Specifically, each component of a randomly initialized word vector is drawn so that the initial word vector falls within the interval [-0.5/m, 0.5/m], where m refers to the length of the word vector.
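The initialization described above (formula 1 for leaf nodes and formula 2 for non-leaf nodes) can be illustrated with the following minimal Python sketch. The data layout and helper names (matched_prefix_len, init_word_vectors, init_path_params) are illustrative assumptions rather than part of the patent, and path parameters are stored per word here for clarity, whereas a real implementation would store them per inner tree node so that words sharing a node share the same vector:

import numpy as np

def matched_prefix_len(old_code, new_code):
    """Length of the common prefix of a word's Huffman codes on the old and new trees."""
    n = 0
    for a, b in zip(old_code, new_code):
        if a != b:
            break
        n += 1
    return n

def init_word_vectors(new_vocab, old_vectors, m):
    """Formula 1: words from the old vocabulary library inherit v(w); newly added words
    get a random word vector drawn from [-0.5/m, 0.5/m], where m is the vector length."""
    vectors = {}
    for w in new_vocab:
        if w in old_vectors:
            vectors[w] = old_vectors[w].copy()
        else:
            vectors[w] = (np.random.rand(m) - 0.5) / m
    return vectors

def init_path_params(new_vocab, old_codes, new_codes, old_params, m):
    """Formula 2: along each word's path in the first Huffman tree, the parameter vectors
    on the matched code prefix are inherited from the second Huffman tree, and the
    parameter vectors on the unmatched part are initialized to zero vectors."""
    params = {}
    for w in new_vocab:
        theta = np.zeros((len(new_codes[w]), m))
        if w in old_codes:
            l_star = matched_prefix_len(old_codes[w], new_codes[w])
            theta[:l_star] = old_params[w][:l_star]
        params[w] = theta
    return params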
After the vocabulary in the new vocabulary library is initialized, the word vectors corresponding to the vocabulary in the new vocabulary library can be updated.
S103, respectively updating word vectors of the words in the new word bank according to the first Huffman tree corresponding to the new word bank and the second Huffman tree corresponding to the old word bank.
The word vector training method provided by the embodiment of the invention comprises the steps of obtaining a newly-added word library and carrying out initialization processing on words in the new word library, so that word vectors of words in the new word library, which belong to words in an old word library, are old word vectors, and word vectors of words in the new word library, which belong to words in the newly-added word library, are random word vectors; and respectively updating word vectors of words in the new vocabulary library according to the first Huffman tree corresponding to the new vocabulary library and the second Huffman tree corresponding to the old vocabulary library, so that the training efficiency of the word vectors is improved.
Optionally, in the embodiment of the present invention, in step S103, updating the word vectors of the words in the new vocabulary library according to the first huffman tree corresponding to the new vocabulary library and the second huffman tree corresponding to the old vocabulary library respectively may be implemented as follows, specifically please refer to fig. 2, where fig. 2 is a schematic flow diagram of updating the word vectors of the words in the new vocabulary library according to the embodiment of the present invention.
S201, acquiring a preset objective function corresponding to the first vocabulary.
Wherein, the first vocabulary is the vocabulary in the new vocabulary library.
Optionally, in S201, the preset objective function corresponding to the first vocabulary may be obtained through the following two models:
for a first Skip-gram model, if a first vocabulary belongs to an old vocabulary library, factorizing the first vocabulary according to an original objective function of the Skip-gram model to obtain a preset objective function corresponding to the first vocabulary; and if the first vocabulary belongs to the newly added vocabulary library, the preset objective function corresponding to the first vocabulary is the original objective function of the Skip-gram model.
For example, in the embodiment of the present invention, if the first vocabulary belongs to the old vocabulary library, each word in W is factorized according to the matched part and the differing part of its Huffman codes, so as to obtain the preset objective function corresponding to the first vocabulary; that is, the first vocabulary is factorized into a term over the shared code prefix and a term over the remaining nodes.
If the first vocabulary belongs to the newly added vocabulary library, the preset objective function corresponding to the first vocabulary is the original objective function of the Skip-gram model.
Wherein w represents the first vocabulary, W represents the old vocabulary library, ΔW represents the newly added vocabulary library, C(w) represents the vocabulary library composed of the words in the context of w, u represents a word in the context of w, l* represents the length of the matched Huffman codes of the non-leaf node w on the second Huffman tree and the first Huffman tree, i indicates that the first vocabulary is the i-th node on the second Huffman tree, j indicates that the first vocabulary is the j-th node on the second Huffman tree, θ'^{u}_{j-1} represents the word vector of the (j-1)-th node on the first Huffman path corresponding to u, d^{u}_{j} represents the Huffman code of the j-th node on the second Huffman path corresponding to u, σ denotes the activation function, and v(w) denotes the word vector corresponding to w. In the factorized objective function, one summation collects the contributions of the nodes whose codes share the same prefix, and the other summation collects the contributions of the remaining nodes, namely those inherited from other words in the new vocabulary library and the non-leaf nodes initialized to zero.
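Based on the split just described (one summation over the shared code prefix and one over the remaining nodes), a plausible shape of the factorized Skip-gram objective for a word w of the old vocabulary library is sketched below; the exact expression exists only as an image in the original filing, so this should be read as an illustration of the decomposition rather than the patent's precise formula:

\mathcal{L}(w) = \sum_{u \in C(w)} \left[ \sum_{j=2}^{l^{*}_{u}} \ell\!\left(v(w), \theta^{u}_{j-1}, d^{u}_{j}\right) + \sum_{j=l^{*}_{u}+1}^{L'_u} \ell\!\left(v(w), \theta'^{u}_{j-1}, d'^{u}_{j}\right) \right], \quad \ell(v, \theta, d) = (1 - d) \log \sigma\!\left(v^{\top}\theta\right) + d \log\!\left(1 - \sigma\!\left(v^{\top}\theta\right)\right)

Here the first inner sum reuses the parameters inherited from the second Huffman tree, and the second inner sum covers the differing part of the path on the first Huffman tree.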
For the second CBOW model, if the first vocabulary belongs to the old vocabulary library, factorizing the first vocabulary according to the original target function of the CBOW model to obtain a preset target function corresponding to the first vocabulary; and if the first vocabulary belongs to the newly added vocabulary library, the preset target function corresponding to the first vocabulary is the original target function of the CBOW model.
For example, in the embodiment of the present invention, if the first vocabulary belongs to the old vocabulary library, each word in W is factorized according to the matched part and the differing part of its Huffman codes; that is, the first vocabulary is factorized to obtain the preset objective function corresponding to the first vocabulary.
If the first vocabulary belongs to the newly added vocabulary library, the preset objective function corresponding to the first vocabulary is the original objective function L(w, i) of the CBOW model.
Wherein d^{w}_{j} represents the Huffman code of the j-th node on the second Huffman path corresponding to w, and X_w represents the sum of the word vectors corresponding to all the words in C(w).
It should be noted that, in the embodiment of the present invention, by factoring each word in W according to the same part and different parts of the coding, the amount of computation in the word vector process can be saved, thereby improving the computation efficiency.
After the preset target function corresponding to the first vocabulary is obtained, gradient processing can be performed on the preset target function according to the attribute of the first vocabulary in the first Huffman tree and the attribute of the first vocabulary in the second Huffman tree, so that a word vector corresponding to the first vocabulary is obtained.
S202, gradient processing is carried out on a preset target function according to the attribute of the first vocabulary in the first Huffman tree and the attribute of the first vocabulary in the second Huffman tree, and a word vector corresponding to the first vocabulary is obtained.
Similar to step S201, this step can be implemented based on the following two models:
For the first model, the Skip-gram model: if the first vocabulary belongs to the old vocabulary library and the coding of the first vocabulary in the first Huffman tree and its coding in the second Huffman tree have the same prefix part, stochastic gradient ascent processing is performed on the vectors of the nodes corresponding to the differing part of the Huffman coding of the first vocabulary on the second Huffman tree, and stochastic gradient descent processing is performed on the vectors of the nodes on the second Huffman tree corresponding to the differing part of the Huffman coding of the first vocabulary on the first Huffman tree.
If the first vocabulary belongs to the newly added vocabulary library, stochastic gradient ascent processing is performed on the first vocabulary to obtain the word vector corresponding to the first vocabulary, where η' represents the learning rate.
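For orientation, the standard stochastic gradient ascent updates for the hierarchical-softmax Skip-gram objective, as used in word2vec, take the following form in the notation above; the patent's own update formulas (including the additional gradient descent step on the differing nodes of the old tree) appear only as images in the original filing and are not reproduced:

\theta^{u}_{j-1} \leftarrow \theta^{u}_{j-1} + \eta' \left[1 - d^{u}_{j} - \sigma\!\left(v(w)^{\top}\theta^{u}_{j-1}\right)\right] v(w)

v(w) \leftarrow v(w) + \eta' \sum_{u \in C(w)} \sum_{j=2}^{L_u} \left[1 - d^{u}_{j} - \sigma\!\left(v(w)^{\top}\theta^{u}_{j-1}\right)\right] \theta^{u}_{j-1}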
For the second model, the CBOW model: if the first vocabulary belongs to the old vocabulary library and the coding of the first vocabulary in the first Huffman tree and its coding in the second Huffman tree have the same prefix part, stochastic gradient ascent processing is performed on the vectors of the nodes corresponding to the differing part of the Huffman coding of the first vocabulary on the second Huffman tree, and stochastic gradient descent processing is performed on the vectors of the nodes on the second Huffman tree corresponding to the differing part of the Huffman coding of the first vocabulary on the first Huffman tree.
If the first vocabulary belongs to the newly added vocabulary library, stochastic gradient ascent processing is performed on the first vocabulary to obtain the word vector corresponding to the first vocabulary.
Wherein θ'^{w}_{i-1} represents the word vector of the (i-1)-th node on the first Huffman path corresponding to w.
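For the newly added vocabulary case, which the text states is trained by stochastic gradient ascent on the original CBOW objective, a minimal Python sketch of one hierarchical-softmax update step is given below. The function and parameter names are illustrative assumptions; the additional descent step applied to old-tree nodes for words of the old vocabulary library is not shown, since its exact formula is only given as an image in the original filing:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbow_hs_update(context_vectors, path_nodes, path_code, node_params, lr):
    """One stochastic-gradient-ascent step of CBOW with hierarchical softmax.

    context_vectors: word vectors v(u) of the words in C(w)
    path_nodes:      ids of the non-leaf nodes on w's path in the first Huffman tree
    path_code:       Huffman code bits of w (0/1), aligned with path_nodes
    node_params:     parameter vectors theta indexed by node id (updated in place)
    lr:              learning rate eta'
    Returns the accumulated gradient to be added to each context word vector.
    """
    x_w = np.sum(context_vectors, axis=0)        # X_w: sum of the context word vectors
    e = np.zeros_like(x_w)
    for node, d in zip(path_nodes, path_code):
        q = sigmoid(np.dot(x_w, node_params[node]))
        g = lr * (1 - d - q)                     # gradient scale at this node
        e += g * node_params[node]               # contribution to the context-vector update
        node_params[node] += g * x_w             # gradient ascent on the node parameter
    return e

Each context word vector v(u), u ∈ C(w), is then updated by adding e, mirroring the word2vec CBOW procedure.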
Here, η' denotes the learning rate. As an example, an initial learning rate η0 is set, and every 1000 words processed, the learning rate is adjusted according to the following formula:
η' = η0 × (1 - word_count_actual / (train_words + 1))
wherein word_count_actual represents the number of words currently processed, and train_words + 1 is used to prevent the denominator from being zero. A threshold η_min = 10^{-4} · η0 is introduced as the minimum value of η'. In the incremental learning process, the word counter needs to include the word count of the original corpus, and η' is calculated in combination with η_min.
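A minimal sketch of this schedule, assuming the word2vec-style linear decay suggested by the variable names word_count_actual and train_words and the floor η_min = 10^{-4} · η0 described above (for incremental learning, word_count_actual would include the word count of the original corpus):

def adjust_learning_rate(eta0, word_count_actual, train_words):
    """Linearly decay the learning rate with training progress, bounded below by eta_min."""
    eta = eta0 * (1.0 - word_count_actual / (train_words + 1.0))
    return max(eta, 1e-4 * eta0)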
Fig. 3 is a schematic structural diagram of a word vector training apparatus 30 according to an embodiment of the present invention, and it should be understood that the embodiment of the present invention is only illustrated in fig. 3, but the present invention is not limited thereto. Referring to fig. 3, the word vector training apparatus 30 may include:
the obtaining module 301 is configured to obtain a new vocabulary library, where a vocabulary in the new vocabulary library and a vocabulary in an old vocabulary library form a new vocabulary library, and the vocabulary in the old vocabulary library corresponds to an old word vector.
The initialization module 302 is configured to initialize the vocabulary in the new vocabulary library, so that the word vector in the new vocabulary library that belongs to the vocabulary in the old vocabulary library is an old word vector, and the word vector in the new vocabulary library that belongs to the vocabulary in the newly added vocabulary library is a random word vector.
And the updating module 303 is configured to update word vectors of words in the new vocabulary library according to the first huffman tree corresponding to the new vocabulary library and the second huffman tree corresponding to the old vocabulary library.
Optionally, the updating module 303 is specifically configured to obtain a preset objective function corresponding to a first vocabulary, where the first vocabulary is a vocabulary in the new vocabulary library; and carrying out gradient processing on a preset target function according to the attribute of the first vocabulary in the first Huffman tree and the attribute of the first vocabulary in the second Huffman tree to obtain a word vector corresponding to the first vocabulary.
The word vector training apparatus 30 shown in the embodiment of the present invention may implement the technical solution corresponding to the word vector training method shown in the above method embodiment, and the implementation principle and the beneficial effect are similar, which are not described herein again.
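The three modules of the apparatus map naturally onto a small class. The sketch below only wires the steps together under the same assumptions as the earlier snippets; the per-window update routine is passed in as a callable because its exact formulas are not reproduced here:

import numpy as np

class WordVectorTrainer:
    """Minimal skeleton of the word vector training apparatus of Fig. 3."""

    def __init__(self, old_vocab, old_vectors, m):
        self.old_vocab = list(old_vocab)   # old vocabulary library W
        self.old_vectors = old_vectors     # trained old word vectors v(w)
        self.m = m                         # word vector length

    def acquire(self, added_vocab):
        """Acquisition module: merge the newly added vocabulary library into a new one."""
        old = set(self.old_vocab)
        self.new_vocab = self.old_vocab + [w for w in added_vocab if w not in old]
        return self.new_vocab

    def initialize(self):
        """Initialization module: old words keep their old word vectors, new words get random ones."""
        self.vectors = {
            w: self.old_vectors[w].copy() if w in self.old_vectors
            else (np.random.rand(self.m) - 0.5) / self.m
            for w in self.new_vocab
        }
        return self.vectors

    def update(self, windows, update_fn):
        """Update module: apply the supplied per-window gradient step (e.g. the CBOW step
        sketched earlier) over the corpus, using the first and second Huffman trees."""
        for window in windows:
            update_fn(window, self.vectors)
        return self.vectors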
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (6)

1. A method for word vector training, comprising:
acquiring a newly added vocabulary library, wherein the vocabulary in the newly added vocabulary library and the vocabulary in an old vocabulary library form a new vocabulary library, and the vocabulary in the old vocabulary library corresponds to an old word vector;
initializing the vocabulary in the new vocabulary library, so that the word vectors of the words in the new vocabulary library that belong to the old vocabulary library are the old word vectors, and the word vectors of the words in the new vocabulary library that belong to the newly added vocabulary library are random word vectors;
updating word vectors of words in the new vocabulary library respectively according to a first Huffman tree corresponding to the new vocabulary library and a second Huffman tree corresponding to the old vocabulary library;
wherein, the updating the word vectors of the words in the new vocabulary library according to the first Huffman tree corresponding to the new vocabulary library and the second Huffman tree corresponding to the old vocabulary library respectively comprises:
acquiring a preset target function corresponding to a first vocabulary, wherein the first vocabulary is a vocabulary in the new vocabulary library;
performing gradient processing on the preset target function according to the attribute of the first vocabulary in the first Huffman tree and the attribute of the first vocabulary in the second Huffman tree to obtain a word vector corresponding to the first vocabulary;
the obtaining of the preset objective function corresponding to the first vocabulary includes:
if the first vocabulary belongs to the old vocabulary library, factorizing the first vocabulary according to an original objective function of a Skip-gram model to obtain a preset objective function corresponding to the first vocabulary;
if the first vocabulary belongs to the newly added vocabulary library, the preset objective function corresponding to the first vocabulary is the original objective function of the Skip-gram model;
or, the obtaining of the preset objective function corresponding to the first vocabulary includes:
if the first vocabulary belongs to the old vocabulary library, performing factorization on the first vocabulary according to an original target function of a CBOW model to obtain a preset target function corresponding to the first vocabulary;
and if the first vocabulary belongs to the newly added vocabulary library, the preset target function corresponding to the first vocabulary is the original target function of the CBOW model.
2. The method of claim 1, wherein the obtaining the predetermined objective function corresponding to the first vocabulary comprises:
if the first vocabulary belongs to the old vocabulary library, the first vocabulary is factorized according to the matched part and the differing part of its Huffman codes on the second Huffman tree and the first Huffman tree, so as to obtain the preset objective function corresponding to the first vocabulary;
if the first vocabulary belongs to the newly added vocabulary library, the preset objective function corresponding to the first vocabulary is the original objective function of the Skip-gram model;
wherein w represents the first vocabulary, W represents the old vocabulary library, ΔW represents the newly added vocabulary library, C(w) represents the vocabulary library composed of the words in the context of w, u represents a word in the context of w, l* represents the length of the matched Huffman codes on the second Huffman tree and the first Huffman tree when w is a non-leaf node, j indicates that the first vocabulary is the j-th node on the second Huffman tree, θ'^{u}_{j-1} represents the word vector of the (j-1)-th node on the first Huffman path corresponding to u, d^{u}_{j} represents the Huffman code of the j-th node on the second Huffman path corresponding to u, σ represents the activation function, v(w) represents the word vector corresponding to w, and L'_u indicates the coding length of the vocabulary u on the first Huffman tree.
3. The method of claim 1, wherein the obtaining the predetermined objective function corresponding to the first vocabulary comprises:
if the first vocabulary belongs to the old vocabulary library, the first vocabulary is factorized according to the matched part and the differing part of its Huffman codes on the second Huffman tree and the first Huffman tree, so as to obtain the preset objective function corresponding to the first vocabulary;
if the first vocabulary belongs to the newly added vocabulary library, the preset objective function corresponding to the first vocabulary is the original objective function of the CBOW model;
wherein d^{w}_{i} represents the Huffman code of the i-th node on the second Huffman path corresponding to w, and X_w represents the sum of the word vectors corresponding to all the words in C(w);
w represents the first vocabulary, W represents the old vocabulary library, ΔW represents the newly added vocabulary library, C(w) represents the vocabulary library composed of the words in the context of w, l* represents the length of the matched Huffman codes on the second Huffman tree and the first Huffman tree when w is a non-leaf node, i indicates that the first vocabulary is the i-th node on the second Huffman tree, θ'^{w}_{i-1} represents the word vector of the (i-1)-th node on the first Huffman path corresponding to w, L'_w represents the coding length of the first vocabulary w on the first Huffman tree, and σ represents the activation function.
4. The method according to claim 2, wherein performing gradient processing on the preset objective function according to the attribute of the first vocabulary in the first Huffman tree and the attribute of the first vocabulary in the second Huffman tree to obtain the word vector corresponding to the first vocabulary comprises:
if the first vocabulary belongs to the old vocabulary library and the coding of the first vocabulary in the first Huffman tree and its coding in the second Huffman tree have the same prefix part, performing stochastic gradient ascent processing on the vectors of the nodes corresponding to the differing part of the Huffman coding of the first vocabulary on the second Huffman tree, and performing stochastic gradient descent processing on the vectors of the nodes on the second Huffman tree corresponding to the differing part of the Huffman coding of the first vocabulary on the first Huffman tree;
if the first vocabulary belongs to the newly added vocabulary library, performing stochastic gradient ascent processing on the first vocabulary to obtain the word vector corresponding to the first vocabulary;
here, η' represents the learning rate.
5. The method according to claim 3, wherein performing gradient processing on the preset objective function according to the attribute of the first vocabulary in the first Huffman tree and the attribute of the first vocabulary in the second Huffman tree to obtain the word vector corresponding to the first vocabulary comprises:
if the first vocabulary belongs to the old vocabulary library and the coding of the first vocabulary in the first Huffman tree and its coding in the second Huffman tree have the same prefix part, performing stochastic gradient ascent processing on the vectors of the nodes corresponding to the differing part of the Huffman coding of the first vocabulary on the second Huffman tree, and performing stochastic gradient descent processing on the vectors of the nodes on the second Huffman tree corresponding to the differing part of the Huffman coding of the first vocabulary on the first Huffman tree;
if the first vocabulary belongs to the newly added vocabulary library, performing stochastic gradient ascent processing on the first vocabulary to obtain the word vector corresponding to the first vocabulary;
wherein η' represents the learning rate, X_w represents the word vector corresponding to the first vocabulary w, and X_w^T represents the transpose of X_w.
6. A word vector training apparatus, comprising:
the acquisition module is used for acquiring a newly added vocabulary library, wherein the vocabulary in the newly added vocabulary library and the vocabulary in an old vocabulary library form a new vocabulary library, and the vocabulary in the old vocabulary library corresponds to an old word vector;
the initialization module is used for initializing the vocabulary in the new vocabulary library, so that the word vectors of the words in the new vocabulary library that belong to the old vocabulary library are the old word vectors, and the word vectors of the words in the new vocabulary library that belong to the newly added vocabulary library are random word vectors;
the updating module is used for respectively updating word vectors of words in the new word bank according to a first Huffman tree corresponding to the new word bank and a second Huffman tree corresponding to the old word bank;
the updating module is specifically configured to obtain a preset objective function corresponding to a first vocabulary, where the first vocabulary is a vocabulary in the new vocabulary library; performing gradient processing on the preset target function according to the attribute of the first vocabulary in the first Huffman tree and the attribute of the first vocabulary in the second Huffman tree to obtain a word vector corresponding to the first vocabulary;
wherein the update module is specifically configured to:
if the first vocabulary belongs to the old vocabulary library, factorizing the first vocabulary according to an original objective function of a Skip-gram model to obtain a preset objective function corresponding to the first vocabulary;
if the first vocabulary belongs to the newly added vocabulary library, the preset objective function corresponding to the first vocabulary is the original objective function of the Skip-gram model;
or, the update module is specifically configured to:
if the first vocabulary belongs to the old vocabulary library, performing factorization on the first vocabulary according to an original target function of a CBOW model to obtain a preset target function corresponding to the first vocabulary;
and if the first vocabulary belongs to the newly added vocabulary library, the preset target function corresponding to the first vocabulary is the original target function of the CBOW model.
CN201710022458.0A 2017-01-12 2017-01-12 Word vector training method and device Active CN106897265B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710022458.0A CN106897265B (en) 2017-01-12 2017-01-12 Word vector training method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710022458.0A CN106897265B (en) 2017-01-12 2017-01-12 Word vector training method and device

Publications (2)

Publication Number Publication Date
CN106897265A CN106897265A (en) 2017-06-27
CN106897265B true CN106897265B (en) 2020-07-10

Family

ID=59198669

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710022458.0A Active CN106897265B (en) 2017-01-12 2017-01-12 Word vector training method and device

Country Status (1)

Country Link
CN (1) CN106897265B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107957989B9 (en) 2017-10-23 2021-01-12 创新先进技术有限公司 Cluster-based word vector processing method, device and equipment
CN108170663A (en) 2017-11-14 2018-06-15 阿里巴巴集团控股有限公司 Term vector processing method, device and equipment based on cluster
CN110020303A (en) * 2017-11-24 2019-07-16 腾讯科技(深圳)有限公司 Determine the alternative method, apparatus and storage medium for showing content
CN108509422B (en) * 2018-04-04 2020-01-24 广州荔支网络技术有限公司 Incremental learning method and device for word vectors and electronic equipment
CN110210557B (en) * 2019-05-31 2024-01-12 南京工程学院 Online incremental clustering method for unknown text in real-time stream processing mode
CN111325026B (en) * 2020-02-18 2023-10-10 北京声智科技有限公司 Training method and system for word vector model
US11822447B2 (en) 2020-10-06 2023-11-21 Direct Cursus Technology L.L.C Methods and servers for storing data associated with users and digital items of a recommendation system

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105930318A (en) * 2016-04-11 2016-09-07 深圳大学 Word vector training method and system
CN106055623A (en) * 2016-05-26 2016-10-26 《中国学术期刊(光盘版)》电子杂志社有限公司 Cross-language recommendation method and system

Also Published As

Publication number Publication date
CN106897265A (en) 2017-06-27

Similar Documents

Publication Publication Date Title
CN106897265B (en) Word vector training method and device
CN110263323B (en) Keyword extraction method and system based on barrier type long-time memory neural network
US10032463B1 (en) Speech processing with learned representation of user interaction history
Lhoussain et al. Adaptating the levenshtein distance to contextual spelling correction
JP2019511033A5 (en)
CN109785824A (en) A kind of training method and device of voiced translation model
US20230244704A1 (en) Sequenced data processing method and device, and text processing method and device
CN107437417B (en) Voice data enhancement method and device based on recurrent neural network voice recognition
CN106802888B (en) Word vector training method and device
JP2017059205A (en) Subject estimation system, subject estimation method, and program
CN109344242B (en) Dialogue question-answering method, device, equipment and storage medium
US10878201B1 (en) Apparatus and method for an adaptive neural machine translation system
CN110506282A (en) The more new management of RPU array
CN108509422B (en) Incremental learning method and device for word vectors and electronic equipment
CN111950275B (en) Emotion recognition method and device based on recurrent neural network and storage medium
CN110825857A (en) Multi-turn question and answer identification method and device, computer equipment and storage medium
CN110275928B (en) Iterative entity relation extraction method
CN111400494A (en) Sentiment analysis method based on GCN-Attention
CN109979461A (en) A kind of voice translation method and device
CN110069781B (en) Entity label identification method and related equipment
CN110968702A (en) Method and device for extracting matter relationship
CN115221977A (en) Text similarity calculation model training method, calculation method and related device
CN111126047B (en) Method and device for generating synonymous text
CN111241843B (en) Semantic relation inference system and method based on composite neural network
JP7359028B2 (en) Learning devices, learning methods, and learning programs

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant