CN107844472B - Word vector processing method and device and electronic equipment - Google Patents


Publication number
CN107844472B
Authority
CN
China
Prior art keywords
word
words
vector
letter
gram
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710583716.2A
Other languages
Chinese (zh)
Other versions
CN107844472A (en)
Inventor
曹绍升
周俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Original Assignee
Advanced New Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Advanced New Technologies Co Ltd filed Critical Advanced New Technologies Co Ltd
Priority to CN201710583716.2A priority Critical patent/CN107844472B/en
Publication of CN107844472A publication Critical patent/CN107844472A/en
Application granted granted Critical
Publication of CN107844472B publication Critical patent/CN107844472B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of this specification disclose a word vector processing method and apparatus, and an electronic device. The method includes: extracting a plurality of n-gram letters from each word, and training the word vector of the word based on the letter vectors of its n-gram letters, where an n-gram letter is a run of n consecutive letters of the corresponding word, and the words are words of the Germanic language family.

Description

Word vector processing method and device and electronic equipment
Technical Field
The present disclosure relates to the field of computer software technologies, and in particular, to a word vector processing method and apparatus, and an electronic device.
Background
Most current natural language processing solutions adopt neural-network-based architectures, and an important underlying technology in such architectures is the word vector. A word vector maps a word to a vector of fixed dimension that characterizes the semantic information of the word.
In the prior art, common algorithms for generating word vectors are designed specifically for English, for example Google's word vector algorithm and Microsoft's deep neural network algorithm.
Based on the prior art, a word vector generation scheme for the Germanic language family is needed.
Disclosure of Invention
The embodiments of this specification provide a word vector processing method, a word vector processing apparatus, and an electronic device, which are used to solve the following technical problem: a word vector generation scheme for the Germanic language family is needed.
In order to solve the above technical problem, the embodiments of the present specification are implemented as follows:
the word vector processing method provided by the embodiment of the present specification includes:
segmenting the corpus into words to obtain each word;
determining each n-gram letter corresponding to each word, wherein the n-gram letters represent continuous n letters of the corresponding word;
establishing and initializing word vectors of the words and letter vectors of the n-gram letters corresponding to the words;
training the word vectors and the letter vectors according to the word vectors, the letter vectors and the corpus after word segmentation;
wherein the words are words of the Germanic language family.
An embodiment of the present specification provides a word vector processing apparatus, including:
the word segmentation module is configured to segment a corpus into words to obtain the individual words;
the determining module is configured to determine the n-gram letters corresponding to each word, an n-gram letter being n consecutive letters of the corresponding word;
the initialization module is configured to establish and initialize the word vector of each word and the letter vector of each n-gram letter corresponding to the words;
the training module is configured to train the word vectors and the letter vectors according to the word vectors, the letter vectors and the segmented corpus;
wherein the words are words of the Germanic language family.
Another word vector processing method provided in an embodiment of this specification includes:
step 1, segmenting a corpus into words to obtain the individual words; go to step 2;
step 2, establishing an n-gram letter mapping table that records the mapping between each word and its n-gram letters, where an n-gram letter is n consecutive letters of the word it is mapped from; establishing and initializing the word vector of each word and the letter vector of each n-gram letter mapped from the words; go to step 3;
step 3, traversing the segmented corpus, taking each traversed word in turn as the current word w and executing step 4 on it; ending when the traversal is complete, otherwise continuing the traversal;
step 4, traversing the context words designated for the current word w in the corpus and executing step 5 on each current context word c in turn; returning to step 3 when the traversal is complete, otherwise continuing the traversal;
step 5, calculating the similarity between the current word w and the current context word c according to the following formula:
$$\mathrm{sim}(w, c) = \beta_1 \sum_{q \in S(w)} \vec{q} \odot \vec{c} + \beta_2 \, \vec{w} \odot \vec{c}$$
where S(w) denotes a set of at least some of the n-gram letters mapped from the current word w in the n-gram letter mapping table, q denotes each n-gram letter in S(w), and sim(w, c) denotes the similarity between the current word w and the current context word c; $\vec{q}$ denotes the letter vector of q, $\vec{w}$ denotes the word vector of w, and $\vec{c}$ denotes the word vector of c; $\odot$ denotes a specified operation on two vectors, which is a dot product operation, an included-angle cosine operation, or a Euclidean distance operation; $\beta_1$ and $\beta_2$ are weight parameters; go to step 6;
step 6, randomly drawing λ words as negative sample words, and calculating the corresponding loss characterization value l(w, c) according to the following loss function:
$$l(w, c) = \log \sigma(\mathrm{sim}(w, c)) + \lambda \, \mathbb{E}_{c' \in p(V)}\left[\log \sigma(-\mathrm{sim}(w, c'))\right]$$
where c' is a randomly drawn negative sample word, $\mathbb{E}_{c' \in p(V)}[x]$ denotes the expected value of the expression x when the randomly drawn negative sample word c' follows the probability distribution p(V), and σ(·) is the neural network excitation function, defined as
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
calculating the gradient corresponding to the loss function according to the calculated loss characterization value l(w, c), and updating, according to the gradient, the letter vector $\vec{q}$ of q and the word vector $\vec{c}$ of the current context word c.
An electronic device provided in an embodiment of the present specification includes:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to:
segmenting the corpus into words to obtain each word;
determining each n-gram letter corresponding to each word, wherein the n-gram letters represent continuous n letters of the corresponding word;
establishing and initializing word vectors of the words and letter vectors of the n-gram letters corresponding to the words;
training the word vectors and the letter vectors according to the word vectors, the letter vectors and the corpus after word segmentation;
wherein the words are words of the Germanic language family.
At least one of the technical solutions adopted in the embodiments of this specification can achieve the following beneficial effects: the n-gram letters corresponding to a word express the characteristics of the word at a finer granularity, which helps improve the accuracy of the word vectors generated for words of the Germanic language family and gives a better practical effect, so that the above technical problem can be partially or completely solved.
Drawings
In order to more clearly illustrate the embodiments of the present specification or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only some embodiments described in the present specification, and for those skilled in the art, other drawings can be obtained according to the drawings without any creative effort.
Fig. 1 is a schematic diagram of an overall architecture involved in a practical application scenario of the solution of the present specification;
Fig. 2 is a schematic flowchart of a word vector processing method provided in an embodiment of the present specification;
Fig. 3 is a schematic flowchart of a specific implementation of the word vector processing method in a practical application scenario provided in an embodiment of the present specification;
Fig. 4 is a schematic diagram of the processing actions applied to part of the corpus used in the flow of Fig. 3 according to an embodiment of the present disclosure;
Fig. 5 is a schematic flowchart of another word vector processing method provided in an embodiment of the present specification;
Fig. 6 is a schematic structural diagram of a word vector processing apparatus corresponding to Fig. 2 according to an embodiment of the present disclosure.
Detailed Description
The embodiment of the specification provides a word vector processing method and device and electronic equipment.
In order to make those skilled in the art better understand the technical solutions in the present specification, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any inventive step based on the embodiments of the present disclosure, shall fall within the scope of protection of the present application.
Fig. 1 is a schematic diagram of an overall architecture related to the solution of the present specification in a practical application scenario. The overall architecture mainly involves four parts: the words in the corpus, the n-gram letters corresponding to the words, the word vectors of the words together with the letter vectors of the n-gram letters, and a vector training server. The n-gram letters express the characteristics of the corresponding words at a finer granularity, and the vector training server trains the word vectors of the words and the letter vectors of the n-gram letters, so that more accurate word vectors can be obtained. In practical applications, the actions related to the first three parts can be performed by corresponding software and/or hardware functional modules.
The solution of the present specification is applicable to word vectors of words of the Germanic language family, as well as to word vectors of words of other languages that are mainly composed of letters, such as French. For words outside the Germanic language family, the letters are accordingly the letters of that language, such as French letters.
The Germanic language family may include: German, Swedish, Danish, Norwegian, Icelandic, and so on.
For convenience of description, the following embodiments mainly describe the scheme of the present specification for the scenario of words of the Germanic language family.
Fig. 2 is a schematic flowchart of a word vector processing method according to an embodiment of the present disclosure. From a program perspective, the execution subject of the flow may be a program having a word vector generation function and/or a training function; from a device perspective, the execution subject may include, but is not limited to, at least one of the following devices on which the program can be installed: personal computers, large and medium-sized computers, computer clusters, mobile phones, tablet computers, smart wearable devices, in-vehicle devices, and the like.
The flow in fig. 2 may include the following steps:
S202: segmenting the corpus into words to obtain the individual words.
In the embodiments of the present specification, the words may specifically be: at least some of the words in the corpus that have occurred at least once. For convenience of subsequent processing, each word can be stored in the vocabulary, and the word can be read from the vocabulary when the word needs to be used.
S204: determining each n-gram letter corresponding to each word, wherein the n-gram letters represent continuous n letters of the corresponding word; wherein the words are words of the Germanic language family.
For ease of understanding, a German word is used as an example:
take the German word "sonne". Its 3-gram letters include: "son" (letters 1 to 3), "onn" (letters 2 to 4), and "nne" (letters 3 to 5); its 4-gram letters include: "sonn" (letters 1 to 4) and "onne" (letters 2 to 5).
In the embodiments of the present specification, the value of n may be dynamically adjustable. For the same word, when determining the n-gram letters corresponding to the word, n may take a single value (for example, only the 3-gram letters of the word are determined) or several values (for example, both the 3-gram letters and the 4-gram letters of the word are determined). When n equals the total number of letters in a word, the n-gram letter is the word itself.
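The extraction described above can be sketched in a few lines of Python. This is a minimal illustration only; the function name `letter_ngrams` and the default sizes `(3, 4)` are assumptions made for the example and are not specified by the patent.

```python
def letter_ngrams(word, sizes=(3, 4)):
    """Return every run of n consecutive letters of `word`, for each n in `sizes`."""
    word = word.lower()
    grams = []
    for n in sizes:
        # A word of length L yields L - n + 1 n-gram letters (none when n > L).
        for start in range(len(word) - n + 1):
            grams.append(word[start:start + n])
    return grams


print(letter_ngrams("sonne"))        # ['son', 'onn', 'nne', 'sonn', 'onne']
print(letter_ngrams("sonne", (5,)))  # ['sonne'], n equals the word length
```

Passing several values in `sizes` corresponds to the case where n takes multiple values for the same word.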
In the embodiments of the present specification, n-gram letters may also be represented by numbers for convenience of computer processing. For example, each distinct letter may be represented by a number, and an n-gram letter may then be represented as a string of numbers, with a mapping kept between each number (or number string) and the letter (or n-gram letter) it stands for.
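One possible numeric encoding is sketched below. The alphabet (a to z plus the German letters ä, ö, ü, ß), the numbering from 1, and the helper names `encode_ngram` and `decode_ngram` are all assumptions chosen for illustration; the patent does not prescribe a particular coding.

```python
import string

# Assumed alphabet: a-z followed by the German letters ä, ö, ü, ß; codes start at 1.
ALPHABET = string.ascii_lowercase + "äöüß"
LETTER_TO_CODE = {letter: code for code, letter in enumerate(ALPHABET, start=1)}
CODE_TO_LETTER = {code: letter for letter, code in LETTER_TO_CODE.items()}


def encode_ngram(ngram):
    """Represent an n-gram letter as a tuple of letter codes."""
    return tuple(LETTER_TO_CODE[letter] for letter in ngram.lower())


def decode_ngram(codes):
    """Recover the n-gram letter from its tuple of letter codes."""
    return "".join(CODE_TO_LETTER[code] for code in codes)


print(encode_ngram("son"))                # (19, 15, 14)
print(decode_ngram(encode_ngram("son")))  # son
```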
In step S204, at least a portion of the n-grams corresponding to the word may be determined for subsequent processing.
S206: establishing and initializing the word vector of each word and the letter vector of each n-gram letter corresponding to the words.
In the embodiments of the present specification, a letter vector is a vector used to represent an n-gram letter. Each n-gram letter may be represented by a letter vector, just as each word is represented by a word vector.
In the embodiments of the present specification, to ensure the effect of the scheme, some restrictions may apply when initializing the word vectors and the letter vectors. For example, the word vectors and letter vectors cannot all be initialized to the same vector; as another example, the vector elements of a word vector or letter vector cannot all be 0; and so on.
In this embodiment of the present specification, the word vector of each word and the letter vector of each n-gram letter corresponding to the words may be initialized randomly or according to a specified probability distribution, where identical n-gram letters share the same letter vector. For example, the specified probability distribution may be a 0-1 distribution.
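A minimal sketch of such an initialization follows. It assumes a small uniform random distribution (a common choice, not one named by the patent), and the function name `init_vectors`, the dimension, and the seed are illustrative. An n-gram letter shared by several words receives exactly one letter vector.

```python
import random


def init_vectors(word_to_ngrams, dim=8, seed=0):
    """Create one random vector per word and one per distinct n-gram letter."""
    rng = random.Random(seed)

    def new_vector():
        # Small non-zero random values; the uniform range is an assumption.
        return [rng.uniform(-0.5, 0.5) / dim for _ in range(dim)]

    word_vecs = {word: new_vector() for word in word_to_ngrams}
    letter_vecs = {}
    for ngrams in word_to_ngrams.values():
        for gram in ngrams:
            if gram not in letter_vecs:  # identical n-gram letters share a vector
                letter_vecs[gram] = new_vector()
    return word_vecs, letter_vecs


mapping = {"sonne": ["son", "onn", "nne"], "tonne": ["ton", "onn", "nne"]}
word_vecs, letter_vecs = init_vectors(mapping, dim=4)
print(sorted(letter_vecs))  # ['nne', 'onn', 'son', 'ton']
```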
In addition, if the word vectors and letter vectors corresponding to some words have already been trained on other corpora, then when they are trained on the corpus used in the flow of Fig. 2, they need not be re-established and re-initialized; training can instead continue from the previous training result using the corpus of Fig. 2.
S208: training the word vectors and the letter vectors according to the word vectors, the letter vectors, and the segmented corpus.
In the embodiment of the present specification, the training may be implemented by a neural network, and the neural network may be a shallow neural network or a deep neural network, and the like. The specific structure of the neural network used in the present specification is not limited.
With the method of Fig. 2, the n-gram letters corresponding to a word express the characteristics of the word at a finer granularity, in particular its letter composition, which helps improve the accuracy of the word vectors of words of the Germanic language family and gives a good practical effect.
Based on the method of Fig. 2, this specification also provides some specific implementations and extensions of the method, which are described below.
In this embodiment of this specification, for step S204, determining the n-gram letters corresponding to each word may specifically include: determining the words appearing in the corpus according to the result of segmenting the corpus;
and, for each of the distinct words so determined, performing:
determining the n-gram letters corresponding to the word, where an n-gram letter corresponding to the word is n consecutive letters of the word, and n is one positive integer or several different positive integers.
In the embodiments of the present specification, the same word always has the same n-gram letters. The step in the previous paragraph is therefore executed once for each distinct word; for a repeated word, the existing result can be reused directly without repeating the computation, which saves resources.
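The reuse of results for repeated words can be illustrated with a mapping table that is filled once per distinct word. This is only a sketch; `build_mapping_table` and its `sizes` parameter are hypothetical names introduced for the example.

```python
def build_mapping_table(tokens, sizes=(3,)):
    """Map each distinct word to its n-gram letters, computing each word only once."""
    mapping = {}
    for word in tokens:
        if word in mapping:
            continue  # repeated word: reuse the existing result
        mapping[word] = [
            word[start:start + n]
            for n in sizes
            for start in range(len(word) - n + 1)
        ]
    return mapping


table = build_mapping_table(["sonne", "war", "sonne"])
print(table)  # {'sonne': ['son', 'onn', 'nne'], 'war': ['war']}
```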
Further, if a word appears too few times in a corpus, training on that corpus provides few training samples and few training iterations for the word, which may adversely affect the reliability of the training result. Such words may therefore be filtered out and left untrained for the time being; they can be trained later on other corpora.
Based on this idea, determining the words appearing in the corpus according to the result of segmenting the corpus may specifically include: determining, according to the result of segmenting the corpus, the words that appear in the corpus at least a set number of times. The set number of times can be chosen according to actual conditions.
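A frequency filter of this kind might look as follows. The name `build_vocabulary` and the threshold parameter `min_count` (standing in for the set number of times) are assumptions for illustration, and the token list is made up.

```python
from collections import Counter


def build_vocabulary(tokens, min_count=2):
    """Keep only the words that occur at least `min_count` times in the segmented corpus."""
    counts = Counter(tokens)
    return {word: count for word, count in counts.items() if count >= min_count}


tokens = ["die", "sonne", "die", "war", "sonne", "die"]
print(build_vocabulary(tokens, min_count=2))  # {'die': 3, 'sonne': 2}
```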
In this embodiment of the present specification, for step S208, there may be several specific training manners, such as training based on context words or training based on designated near-synonyms or synonyms. For ease of understanding, the former is taken as an example and described in detail.
Training the word vectors and the letter vectors according to the word vectors, the letter vectors, and the segmented corpus may specifically include: determining a designated word in the segmented corpus and one or more context words of the designated word in the segmented corpus; determining the similarity between the designated word and the context word according to the letter vectors of the n-gram letters corresponding to the designated word and the word vector of the context word; and updating the word vector of the context word and the letter vector of each n-gram letter corresponding to the designated word according to the similarity between the designated word and the context word.
The specific manner of determining the similarity is not limited in this specification. For example, the similarity may be calculated based on an angle cosine operation of the vectors, the similarity may be calculated based on a sum of squares operation of the vectors, and so on.
There may be multiple designated words, and a designated word may occur repeatedly at different positions in the corpus; the processing action in the previous paragraph may be performed for each designated word. Preferably, each word contained in the segmented corpus is taken in turn as a designated word.
In the embodiments of the present specification, the training in step S208 aims to make the similarity between a designated word and its context words relatively high and the similarity between the designated word and non-context words relatively low. Here the similarity reflects the degree of association: a word is closely associated with its context words, and words with the same or similar meanings tend to have the same or similar context words. A non-context word can serve as a negative sample word as described below, and a context word correspondingly serves as a positive sample word.
It can be seen that some negative sample words need to be determined as a contrast during training. One or more words can be randomly selected from the segmented corpus as negative sample words, or negative sample words can be strictly restricted to non-context words. Taking the former way as an example, updating the word vector of the context word and the letter vector of each n-gram letter corresponding to the designated word according to the similarity between the designated word and the context word may specifically include: selecting one or more words from the words as negative sample words; determining the similarity between the designated word and each negative sample word; determining a loss characterization value corresponding to the designated word according to a designated loss function, the similarity between the designated word and the context word, and the similarity between the designated word and each negative sample word; and updating the word vector of the context word and the letter vectors of the n-gram letters corresponding to the designated word according to the loss characterization value.
The loss characterization value measures the degree of error between the current vector values and the training target. The loss function may take the above similarities as its parameters; the specific loss function expression is not limited in this specification, and an example is given in detail later.
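The two ways of choosing negative sample words mentioned above can be sketched with one helper. Uniform sampling is an assumption made here for brevity (the patent only requires that the negative sample words follow some distribution p(V)), and `draw_negative_samples` with its `exclude` parameter is a hypothetical name.

```python
import random


def draw_negative_samples(vocabulary, num_samples, exclude=(), rng=None):
    """Draw `num_samples` words at random, optionally excluding given words.

    With `exclude` empty this is the purely random choice; passing the current
    context words in `exclude` gives the stricter non-context-only choice.
    """
    rng = rng or random.Random()
    candidates = [word for word in vocabulary if word not in exclude]
    return [rng.choice(candidates) for _ in range(num_samples)]


vocabulary = ["die", "sonne", "war", "sehr", "hell"]
negatives = draw_negative_samples(vocabulary, 3, exclude={"sonne"}, rng=random.Random(0))
print(negatives)
```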
In the embodiments of the present specification, updating the word vectors and the letter vectors is in effect correcting this degree of error. When the solution is implemented using a neural network, the correction can be implemented based on back propagation and gradient descent, in which case the gradient is the gradient corresponding to the loss function.
Updating the word vector of the context word and the letter vector of each n-gram letter corresponding to the designated word according to the loss characterization value may specifically include: determining the gradient corresponding to the loss function according to the loss characterization value; and updating the word vector of the context word and the letter vectors of the n-gram letters corresponding to the designated word according to the gradient.
In this embodiment, the training of the word vectors and letter vectors may be performed iteratively over at least some of the words in the segmented corpus, so that the word vectors and letter vectors gradually converge until training is complete.
Take training based on all words in the segmented corpus as an example. For step S208, training the word vectors and the letter vectors according to the word vectors, the letter vectors, and the segmented corpus may specifically include:
traversing the segmented corpus and performing the following for each word in it:
determining one or more context words of the word in the segmented corpus;
performing the following for each context word:
determining the similarity between the word and the context word according to the letter vectors of the n-gram letters corresponding to the word and the word vector of the context word;
and updating the word vector of the context word and the letter vector of each n-gram letter corresponding to the word according to the similarity between the word and the context word.
How the update is performed has been described in detail above and is not repeated here.
In the embodiments of the present specification, when determining the similarity between the word and the context word, the word vector of the word itself may also be taken into account, in addition to the letter vectors of the n-gram letters corresponding to the word and the word vector of the context word. Based on this idea, determining the similarity between the word and the context word according to the letter vectors of the n-gram letters corresponding to the word and the word vector of the context word may specifically include: determining the similarity between the word and the context word according to the letter vector of each n-gram letter corresponding to the word, the word vector of the word, and the word vector of the context word.
For example, the similarity between the word and the context word may be calculated according to the following formula:
$$\mathrm{sim}(w, c) = \beta_1 \sum_{q \in S(w)} \vec{q} \odot \vec{c} + \beta_2 \, \vec{w} \odot \vec{c}$$
where S(w) denotes the set of n-gram letters corresponding to the word w, q denotes each n-gram letter in S(w), and sim(w, c) denotes the similarity between the word w and the context word c; $\vec{q}$ denotes the letter vector of q, $\vec{w}$ denotes the word vector of w, and $\vec{c}$ denotes the word vector of c; $\odot$ denotes a specified operation on two vectors, which is a dot product operation, an included-angle cosine operation, or a Euclidean distance operation; $\beta_1$ and $\beta_2$ are weight parameters.
It should be noted that the formula in the above example is exemplary, and the similarity between the word and the contextual word may also be calculated according to other formulas at least including parameters such as the letter vector, the word vector of the word, and the word vector of the contextual word.
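As a hedged illustration of a similarity built from these three kinds of vectors, the sketch below uses the dot product as the vector operation and assumes the form "β1 times the sum of n-gram/context products, plus β2 times the word/context product". That form, the function names, and the default weights of 0.5 are assumptions for the example rather than the patent's definitive formula.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def similarity(ngram_vecs, word_vec, context_vec, beta1=0.5, beta2=0.5):
    """Weighted similarity between a word (via its n-gram letters) and a context word."""
    ngram_term = sum(dot(q, context_vec) for q in ngram_vecs)
    return beta1 * ngram_term + beta2 * dot(word_vec, context_vec)


# Two toy letter vectors, one word vector, one context word vector.
print(similarity([[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0], [2.0, 3.0]))  # 5.0
```

Setting `beta2=0` reduces this to a similarity that depends only on the letter vectors and the context word vector.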
Further, to facilitate computer processing, the above traversal process may be implemented on a window basis.
For example, the determining one or more context words in the corpus after the word segmentation may specifically include: in the corpus after word segmentation, establishing a window by sliding the distance of a specified number of words leftwards and/or rightwards by taking the word as a center; and determining the words except the word in the window as the contextual words of the word.
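A window of this kind can be sketched as follows. The function name `context_words`, the parameter `k` (the number of words on each side), and the sample tokens are illustrative assumptions; the window is clipped at the corpus boundaries.

```python
def context_words(tokens, position, k=2):
    """Return the words within k positions of tokens[position], excluding the word itself."""
    left = max(0, position - k)
    right = min(len(tokens), position + k + 1)
    return [tokens[i] for i in range(left, right) if i != position]


tokens = ["die", "sonne", "war", "sehr", "hell"]
print(context_words(tokens, 2, k=1))  # ['sonne', 'sehr']
print(context_words(tokens, 0, k=2))  # ['sonne', 'war']
```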
Of course, a window of a set length may also be established with the first word of the segmented corpus as the starting position, the window containing the first word and a set number of consecutive words after it; after each word in the window has been processed, the window is slid onward to process the next batch of words in the corpus, until the whole corpus has been traversed.
The word vector processing method provided in the embodiments of the present specification is described above. For convenience of understanding, based on the above description, the embodiment of this specification further provides a flowchart of a specific implementation of the word vector processing method in an actual application scenario, as shown in fig. 3.
The process in fig. 3 mainly includes the following steps:
step 1, using a word segmentation tool to segment the corpus, scanning the segmented corpus, counting all words that appear to build a vocabulary, and deleting the words that appear fewer than b times (that is, the set number of times); go to step 2;
step 2, scanning the vocabulary word by word, extracting the n-gram letters corresponding to each word, and building an n-gram letter table and a mapping table between the words and their corresponding n-gram letters; go to step 3;
step 3, establishing a word vector of dimension d for each word in the vocabulary and a letter vector of dimension d for each n-gram letter in the n-gram letter table, and randomly initializing all established vectors; go to step 4;
step 4, sliding word by word from the first word of the segmented corpus, selecting one word each time as the current word w (that is, the designated word); ending if w has traversed all words in the whole corpus, otherwise go to step 5;
step 5, taking the current word w as the center, extending k words to each side to establish a window, and selecting one word at a time, from the first word to the last word in the window (excluding the current word w), as the context word c; if c has traversed all words in the window, go to step 4, otherwise go to step 6;
step 6, for the current word w, finding at least some of the n-gram letters corresponding to the current word w according to the mapping table between words and n-gram letters built in step 2, and calculating the similarity between the current word w and the context word c according to formula (1):
$$\mathrm{sim}(w, c) = \beta_1 \sum_{q \in S(w)} \vec{q} \odot \vec{c} + \beta_2 \, \vec{w} \odot \vec{c} \tag{1}$$
where S denotes the n-gram letter table built in step 2, S(w) denotes the set of n-gram letters corresponding to the current word w in the mapping table of step 2, q denotes each n-gram letter in S(w), and sim(w, c) denotes the similarity between the current word w and the current context word c; $\vec{q}$ denotes the letter vector of q, $\vec{w}$ denotes the word vector of w, and $\vec{c}$ denotes the word vector of c; $\odot$ denotes a specified operation on two vectors, which is a dot product operation, an included-angle cosine operation, or a Euclidean distance operation; $\beta_1$ and $\beta_2$ are weight parameters that generally take values between 0 and 1, for example, $\beta_1$ and $\beta_2$ are both non-negative and $\beta_1 + \beta_2 = 1$; go to step 7;
step 7, randomly drawing λ words as negative sample words, and calculating a loss score l(w, c) according to formula (2) (that is, the loss function), where the loss score can be used as the loss characterization value:
Figure BDA0001352943900000124
where log is the logarithmic function, c' is a randomly drawn negative sample word, E_{c'∈p(V)}[x] denotes the expected value of the expression x when the randomly drawn negative sample word c' follows the probability distribution p(V), and σ(·) is a neural network excitation function, given in formula (3):
Figure BDA0001352943900000125
where, if x is a real number, σ(x) is also a real number; calculating the gradient according to the value of l(w, c), updating the letter vector $\vec{q}$ of each n-gram letter q and the word vector $\vec{c}$ of the context word c, and going to step 5.
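As a rough illustration of steps 6 and 7, the following Python sketch computes a similarity and a loss score. The exact expressions of formulas (1) to (3) appear only in the figures, so the forms used here are assumptions: the similarity is taken as β1 times the sum of dot products between each letter vector and the context vector, plus β2 times the dot product of the two word vectors; the excitation function is taken to be the logistic sigmoid; and the expectation over negative samples is approximated by a sum over the drawn words. The function names are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x):
    """Logistic sigmoid (assumed excitation function), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def similarity(letter_vecs, word_vec, ctx_vec, beta1=0.5, beta2=0.5):
    """ASSUMED form of formula (1): beta1 * sum_q (q . c) + beta2 * (w . c)."""
    ngram_term = sum(dot(q, ctx_vec) for q in letter_vecs)
    return beta1 * ngram_term + beta2 * dot(word_vec, ctx_vec)


def loss_score(letter_vecs, word_vec, ctx_vec, negative_vecs, beta1=0.5, beta2=0.5):
    """ASSUMED form of formula (2): log sigma(sim(w, c)) plus the sum of
    log sigma(-sim(w, c')) over the drawn negative sample words c'."""
    positive = math.log(sigmoid(similarity(letter_vecs, word_vec, ctx_vec, beta1, beta2)))
    negative = sum(
        math.log(sigmoid(-similarity(letter_vecs, word_vec, neg, beta1, beta2)))
        for neg in negative_vecs
    )
    return positive + negative
```

With these assumed forms, a higher score means the context word is rated as similar to the current word and the negative sample words as dissimilar, so training would move the vectors in the direction that increases the score.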
Among steps 1 to 7, steps 6 and 7 are the key steps. For ease of understanding, they are illustrated with reference to FIG. 4.
Fig. 4 is a schematic diagram of related processing actions of a part of corpora used in the flow of fig. 3 according to an embodiment of the present disclosure.
As shown in FIG. 4, assume that the corpus contains the sentence "die warme Sonne", and that word segmentation yields the three words "die", "warme", and "Sonne".
Assume that "warme" is selected as the current word w and "Sonne" as the context word c. The set S(w) of at least some of the n-gram letters to which the current word w is mapped is then extracted; for example, the 3-gram letters mapped by "warme" include "war", "arm", and "rme", and the 4-gram letters include "warm" and "arme". The loss score l(w, c) is calculated according to formula (1), formula (2), and formula (3), and the gradient is then calculated to update the word vector of c and the letter vectors corresponding to w, as represented by the gray boxes in FIG. 4.
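The extraction of n-gram letters in this example can be sketched in Python as follows. The 3-gram letters listed above ("war", "arm", "rme") overlap to spell "warme", so the sketch assumes that this is the current word; the function names and the choice of n = 3 and n = 4 are illustrative only.

```python
def ngram_letters(word, n):
    """Return every run of n consecutive letters of `word`, in order."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def build_mapping(words, sizes=(3, 4)):
    """Map each word to the n-gram letters extracted for every n in `sizes`."""
    mapping = {}
    for word in words:
        grams = []
        for n in sizes:
            grams.extend(ngram_letters(word, n))
        mapping[word] = grams
    return mapping


mapping = build_mapping(["die", "warme", "Sonne"])
print(mapping["warme"])  # ['war', 'arm', 'rme', 'warm', 'arme']
```

A word shorter than n simply contributes no n-gram letters for that n, which is why "die" maps only to the single 3-gram letter "die".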
Based on the same idea as fig. 2 and the implementation in fig. 3, another word vector processing method is provided in the embodiments of this specification.
Fig. 5 is a flowchart illustrating another word vector processing method according to an embodiment of the present disclosure.
The flow in fig. 5 may include the following steps:
step 1, performing word segmentation on a corpus to obtain words; going to step 2;
step 2, establishing an n-gram letter mapping table, wherein the mapping table contains the mapping relation between each word and its n-gram letters, and an n-gram letter represents n consecutive letters of the word to which it is mapped; establishing and initializing the word vectors of the words and the letter vectors of the n-gram letters to which the words are mapped; going to step 3;
step 3, traversing the corpus after word segmentation, taking each traversed word in turn as the current word w, and executing step 4 on the current word w; ending if the traversal is finished, and otherwise continuing the traversal;
step 4, traversing each context word specified for the current word w in the corpus (for example, the context words may be specified by the window or by other rules), and executing step 5 on the current context word c; continuing with step 3 if the traversal is finished, and otherwise continuing the traversal;
step 5, calculating the similarity between the current word w and the current context word c according to the following formula:
Figure BDA0001352943900000131
wherein S(w) denotes the set of at least some of the n-gram letters to which the current word w is mapped in the n-gram letter mapping table, q denotes each n-gram letter in S(w), and sim(w, c) denotes the similarity between the current word w and the current context word c; $\vec{q}$ denotes the letter vector of q, $\vec{w}$ denotes the word vector of w, and $\vec{c}$ denotes the word vector of c; the operator "·" denotes a specific operation on two vectors, which is a dot product operation, an included-angle cosine operation, or a Euclidean distance operation; β1 and β2 are weight parameters; going to step 6;
step 6, randomly drawing λ words as negative sample words, and calculating the corresponding loss characterization value l(w, c) according to the following loss function:
Figure BDA0001352943900000141
where c' is a randomly drawn negative sample word, E_{c'∈p(V)}[x] denotes the expected value of the expression x when the randomly drawn negative sample word c' follows the probability distribution p(V), and σ(·) is a neural network excitation function, defined as
Figure BDA0001352943900000142
calculating the gradient corresponding to the loss function according to the calculated loss characterization value l(w, c), and updating, according to the gradient, the letter vector $\vec{q}$ of each n-gram letter q and the word vector $\vec{c}$ of the current context word c.
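The gradient update at the end of this flow can be sketched as a single stochastic gradient step. This is only a sketch under stated assumptions, not the method as claimed: the similarity is assumed to be β1 times the sum of letter-vector dot products plus β2 times the word-vector dot product, the excitation function is assumed to be the logistic sigmoid, and negative sample words are assumed to be handled like context words with a target label of 0 (the usual negative-sampling convention). The name `sgd_step` and the learning rate `lr` are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x):
    """Logistic sigmoid (assumed excitation function)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sgd_step(letter_vecs, word_vec, other_vec, label, lr=0.05, beta1=0.5, beta2=0.5):
    """One gradient-ascent step on log sigma(sim) (label=1, context word)
    or log sigma(-sim) (label=0, negative sample word).

    Updates `letter_vecs` and `other_vec` in place, leaves `word_vec`
    unchanged, and returns the similarity computed before the update.
    """
    sim = beta1 * sum(dot(q, other_vec) for q in letter_vecs) + beta2 * dot(word_vec, other_vec)
    g = label - sigmoid(sim)  # derivative of the objective with respect to sim
    dim = len(other_vec)
    # Gradient of sim with respect to other_vec, taken before any update.
    grad_other = [
        beta1 * sum(q[i] for q in letter_vecs) + beta2 * word_vec[i]
        for i in range(dim)
    ]
    old_other = list(other_vec)
    for q in letter_vecs:
        for i in range(dim):
            q[i] += lr * g * beta1 * old_other[i]
    for i in range(dim):
        other_vec[i] += lr * g * grad_other[i]
    return sim
```

Calling the function with `lr=0.0` returns the current similarity without changing any vector, which is a convenient way to check that a step with label 1 raises the similarity and a step with label 0 lowers it.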
The steps of this other word vector processing method may be executed by the same module or by different modules, which is not specifically limited in this specification.
The above is the word vector processing method provided in the embodiments of this specification. Based on the same idea, the embodiments of this specification further provide a corresponding apparatus, as shown in FIG. 6.
Fig. 6 is a schematic structural diagram of a word vector processing apparatus corresponding to fig. 2 provided in an embodiment of this specification, where the apparatus may be located in an execution body of the flow in fig. 2, and includes:
a word segmentation module 601, configured to perform word segmentation on a corpus to obtain words;
a determining module 602, configured to determine the n-gram letters corresponding to each word, where an n-gram letter represents n consecutive letters of its corresponding word;
an initialization module 603, configured to establish and initialize the word vector of each word and the letter vector of each n-gram letter corresponding to each word;
a training module 604, configured to train the word vectors and the letter vectors according to the word vectors, the letter vectors, and the corpus after word segmentation;
wherein the words are words of the Hierann language family.
Optionally, the determining module 602 determines each n-gram corresponding to each word, which specifically includes:
the determining module 602 determines the words appearing in the corpus according to the result of segmenting the corpus;
performing, for the determined words, respectively:
determining each n-gram letter corresponding to the word, wherein an n-gram letter corresponding to the word represents n consecutive letters of the word, and n is one positive integer or a plurality of different positive integers.
Optionally, the determining module 602 determines, according to a result of segmenting the corpus, a word appearing in the corpus, specifically including:
the determining module 602 determines words that appear in the corpus and whose occurrence frequency is not less than a set frequency according to the result of segmenting the corpus.
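A minimal Python sketch of this frequency filter is shown below. It assumes the segmented corpus is available as a list of sentences, each a list of words; the name `build_vocabulary` and the parameter `min_count` (standing in for the set frequency) are hypothetical.

```python
from collections import Counter


def build_vocabulary(segmented_sentences, min_count):
    """Count every word in the segmented corpus and keep only the words
    whose number of occurrences is not less than `min_count`."""
    counts = Counter(word for sentence in segmented_sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}


corpus = [["die", "warme", "Sonne"], ["die", "Sonne"]]
print(build_vocabulary(corpus, 2))  # {'die': 2, 'Sonne': 2}
```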
Optionally, the initializing module 603 initializes the word vector of each word and the letter vector of each n-gram corresponding to each word, which specifically includes:
the initialization module 603 initializes the word vectors of the words and the letter vectors of the n-gram corresponding to the words in a random initialization manner or an initialization manner according to a designated probability distribution, wherein the letter vectors of the same n-gram are also the same.
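One way to make identical n-gram letters share a single letter vector is to key the letter vectors by the n-gram letter itself, as in the following sketch. The uniform initialization range, the seed, and the name `init_vectors` are assumptions made for illustration; the specification only requires random initialization or initialization according to a specified probability distribution.

```python
import random


def init_vectors(mapping, dim, seed=0):
    """Create one word vector per word and one letter vector per distinct
    n-gram letter, so that the same n-gram letter always has the same vector.

    `mapping` maps each word to the list of its n-gram letters.
    """
    rng = random.Random(seed)

    def random_vector():
        # Small uniform values; the range is an arbitrary illustrative choice.
        return [rng.uniform(-0.5, 0.5) / dim for _ in range(dim)]

    word_vecs = {word: random_vector() for word in mapping}
    letter_vecs = {}
    for grams in mapping.values():
        for gram in grams:
            if gram not in letter_vecs:
                letter_vecs[gram] = random_vector()
    return word_vecs, letter_vecs
```

Because the letter vectors are stored once per distinct n-gram letter, two words that both contain, say, "war" read and update the same vector during training.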
Optionally, the training module 604 trains the word vectors and the letter vectors according to the word vectors, the letter vectors, and the corpus after word segmentation, which specifically includes:
the training module 604 determines a specified word in the corpus after word segmentation and one or more context words of the specified word in the corpus after word segmentation;
determining the similarity between the specified word and the context word according to the letter vectors of the n-gram letters corresponding to the specified word and the word vector of the context word;
and updating the word vector of the context word and the letter vectors of the n-gram letters corresponding to the specified word according to the similarity between the specified word and the context word.
Optionally, the training module 604 updates the word vector of the context word and the letter vectors of the n-gram letters corresponding to the specified word according to the similarity between the specified word and the context word, which specifically includes:
the training module 604 selects one or more words from the words as negative sample words;
determining the similarity between the specified word and each negative sample word;
determining a loss characterization value corresponding to the specified word according to a specified loss function, the similarity between the specified word and the context word, and the similarity between the specified word and each negative sample word;
and updating the word vector of the context word and the letter vectors of the n-gram letters corresponding to the specified word according to the loss characterization value.
Optionally, the training module 604 updates the word vector of the context word and the letter vectors of the n-gram letters corresponding to the specified word according to the loss characterization value, which specifically includes:
the training module 604 determines a gradient corresponding to the loss function according to the loss characterization value;
and updating the word vector of the context word and the letter vectors of the n-gram letters corresponding to the specified word according to the gradient.
Optionally, the training module 604 selects one or more words from the words as negative sample words, which specifically includes:
the training module 604 randomly selects one or more words from the words as negative sample words.
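Random selection of negative sample words can be sketched as below. Drawing uniformly with replacement and excluding the current word and its context word are assumptions for illustration; the specification does not fix the sampling distribution. The name `draw_negative_words` is hypothetical.

```python
import random


def draw_negative_words(vocabulary, count, exclude=(), rng=None):
    """Draw `count` words uniformly at random (with replacement) from
    `vocabulary`, skipping any word listed in `exclude`."""
    rng = rng or random.Random()
    candidates = [word for word in vocabulary if word not in exclude]
    if not candidates:
        return []
    return [rng.choice(candidates) for _ in range(count)]


negatives = draw_negative_words(
    ["die", "warme", "Sonne", "und", "Mond"],
    count=3,
    exclude=("warme", "Sonne"),
    rng=random.Random(0),
)
```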
Optionally, the training module 604 trains the word vectors and the letter vectors according to the word vectors, the letter vectors, and the corpus after word segmentation, which specifically includes:
the training module 604 traverses the corpus after word segmentation and performs the following on each word in the corpus after word segmentation:
determining one or more context words of the word in the corpus after word segmentation;
performing the following for each of the context words:
determining the similarity between the word and the context word according to the letter vectors of the n-gram letters corresponding to the word and the word vector of the context word;
and updating the word vector of the context word and the letter vectors of the n-gram letters corresponding to the word according to the similarity between the word and the context word.
Optionally, the training module 604 determines the similarity between the word and the context word according to the letter vectors of the n-gram letters corresponding to the word and the word vector of the context word, which specifically includes:
the training module 604 determines the similarity between the word and the context word according to the letter vector of each n-gram corresponding to the word, the word vector of the word, and the word vector of the context word.
Optionally, the training module 604 determines the similarity between the word and the context word according to the letter vector of each n-gram corresponding to the word, the word vector of the word, and the word vector of the context word, and specifically includes:
the training module 604 calculates the similarity between the word and the contextual word according to the following formula:
Figure BDA0001352943900000171
wherein S(w) denotes the set of n-gram letters corresponding to the word w, q denotes each n-gram letter in S(w), and sim(w, c) denotes the similarity between the word w and the context word c; $\vec{q}$ denotes the letter vector of q, $\vec{w}$ denotes the word vector of w, and $\vec{c}$ denotes the word vector of c; the operator "·" denotes a specific operation on two vectors, which is a dot product operation, an included-angle cosine operation, or a Euclidean distance operation; β1 and β2 are weight parameters.
Optionally, the training module 604 determines one or more context words of the word in the corpus after word segmentation, which specifically includes:
the training module 604 establishes a window in the corpus after word segmentation by sliding a distance of a specified number of words to the left and/or right with the word as the center;
and determining the words in the window other than the word itself as the context words of the word.
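The window described above can be sketched in Python as follows, assuming a symmetric window of k words on each side that is clipped at the boundaries of the word list; the name `context_words` is hypothetical.

```python
def context_words(words, index, k):
    """Return the words within k positions of words[index] on either side,
    excluding the word at `index` itself."""
    start = max(0, index - k)
    end = min(len(words), index + k + 1)
    return [words[j] for j in range(start, end) if j != index]


sentence = ["die", "warme", "Sonne"]
print(context_words(sentence, 1, 1))  # ['die', 'Sonne']
```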
Based on the same idea, embodiments of this specification further provide a corresponding electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to:
segmenting the corpus into words to obtain each word;
determining each n-gram letter corresponding to each word, wherein the n-gram letters represent continuous n letters of the corresponding word;
establishing and initializing word vectors of the words and letter vectors of the n-gram letters corresponding to the words;
training the word vectors and the letter vectors according to the word vectors, the letter vectors and the corpus after word segmentation;
wherein the words are words of the Hierann language family.
Based on the same idea, embodiments of the present specification further provide a corresponding non-volatile computer storage medium, in which computer-executable instructions are stored, where the computer-executable instructions are configured to:
segmenting the corpus into words to obtain each word;
determining each n-gram letter corresponding to each word, wherein the n-gram letters represent continuous n letters of the corresponding word;
establishing and initializing word vectors of the words and letter vectors of the n-gram letters corresponding to the words;
training the word vectors and the letter vectors according to the word vectors, the letter vectors and the corpus after word segmentation;
wherein the words are words of the Hierann language family.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the embodiments of the apparatus, the electronic device, and the nonvolatile computer storage medium, since they are substantially similar to the embodiments of the method, the description is simple, and the relevant points can be referred to the partial description of the embodiments of the method.
The apparatus, the electronic device, the nonvolatile computer storage medium and the method provided in the embodiments of the present description correspond to each other, and therefore, the apparatus, the electronic device, and the nonvolatile computer storage medium also have similar advantageous technical effects to the corresponding method.
In the 1990s, an improvement to a technology could be clearly distinguished as either an improvement in hardware (for example, an improvement to a circuit structure such as a diode, a transistor, or a switch) or an improvement in software (an improvement to a method flow). However, as technology has advanced, many improvements to method flows can now be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming an improved method flow into a hardware circuit. Thus, it cannot be said that an improvement to a method flow cannot be realized by a hardware entity module. For example, a programmable logic device (PLD), such as a field-programmable gate array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. A designer "integrates" a digital system onto a single PLD by programming it, without asking a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Furthermore, instead of manually fabricating integrated circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must also be written in a specific programming language, called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are the most commonly used at present.
It will also be apparent to those skilled in the art that a hardware circuit implementing a logical method flow can be readily obtained merely by performing a little logic programming of the method flow in one of the above hardware description languages and programming it into an integrated circuit.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art will also appreciate that, in addition to implementing the controller as pure computer-readable program code, the same functionality can be achieved by logically programming the method steps so that the controller takes the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may thus be considered a hardware component, and the means included therein for performing the various functions may also be considered structures within the hardware component. Or the means for performing the functions may even be regarded both as software modules for performing the method and as structures within the hardware component.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above apparatus is described as being divided into various units by function, which are described separately. Of course, when implementing this specification, the functions of the units may be implemented in one or more pieces of software and/or hardware.
As will be appreciated by one skilled in the art, the present specification embodiments may be provided as a method, system, or computer program product. Accordingly, embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The description has been presented with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the description. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, the embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
This description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only an example of the present specification, and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (20)

1. A method of word vector processing, comprising:
segmenting the corpus into words to obtain each word;
determining each n-gram letter corresponding to each word, wherein the n-gram letters represent continuous n letters of the corresponding word;
establishing and initializing word vectors of the words and letter vectors of the n-gram letters corresponding to the words;
training the word vectors and the letter vectors according to the word vectors, the letter vectors and the corpus after word segmentation;
wherein the words are words of the Hieran language family;
the training of the word vector and the letter vector according to the word vector, the letter vector and the corpus after word segmentation specifically comprises:
traversing the corpus after word segmentation, and performing the following on each word in the corpus after word segmentation:
determining one or more context words of the word in the corpus after word segmentation;
performing the following for each of the context words:
determining the similarity between the word and the context word according to the letter vectors of the n-gram letters corresponding to the word and the word vector of the context word;
updating the word vector of the context word and the letter vector of each n-gram corresponding to the word according to the similarity between the word and the context word;
determining the similarity between the word and the context word according to the letter vectors of the n-gram letters corresponding to the word and the word vector of the context word, specifically comprising:
determining the similarity between the word and the context word according to the letter vector of each n-gram corresponding to the word, the word vector of the word, and the word vector of the context word;
determining the similarity between the word and the context word according to the letter vector of each n-gram corresponding to the word, the word vector of the word, and the word vector of the context word, specifically comprising:
the similarity of the word to the contextual word is calculated according to the following formula:
Figure FDA0003096397420000021
wherein S(w) denotes the set of n-gram letters corresponding to the word w, q denotes each n-gram letter in S(w), and sim(w, c) denotes the similarity between the word w and the context word c; $\vec{q}$ denotes the letter vector of q, $\vec{w}$ denotes the word vector of w, and $\vec{c}$ denotes the word vector of c; the operator "·" denotes a specific operation on two vectors, which is a dot product operation, an included-angle cosine operation, or a Euclidean distance operation; β1 and β2 are weight parameters.
2. The method according to claim 1, wherein the determining each n-gram corresponding to each word specifically comprises:
determining words appearing in the corpus according to the result of word segmentation of the corpus;
respectively aiming at the determined different words, executing:
determining each n-gram letter corresponding to the word, wherein an n-gram letter corresponding to the word represents n consecutive letters of the word, and n is one positive integer or a plurality of different positive integers.
3. The method according to claim 2, wherein said determining the words appearing in said corpus according to the result of word segmentation of said corpus specifically comprises:
and determining words which appear in the corpus and the occurrence frequency of which is not less than the set frequency according to the result of word segmentation of the corpus.
4. The method according to claim 1, wherein the initializing the word vector of each word and the letter vector of each n-gram corresponding to each word specifically comprises:
and initializing the word vector of each word and the letter vector of each n-gram corresponding to each word by adopting a random initialization mode or an initialization mode according to appointed probability distribution, wherein the letter vectors of the same n-gram are also the same.
5. The method according to claim 1, wherein the training the word vector and the letter vector according to the word vector, the letter vector, and the corpus after word segmentation comprises:
determining a specified word in the corpus after word segmentation and one or more context words of the specified word in the corpus after word segmentation;
determining the similarity between the specified word and the context word according to the letter vector of each n-gram corresponding to the specified word and the word vector of the context word;
and updating the word vector of the context word and the letter vector of each n-gram corresponding to the specified word according to the similarity between the specified word and the context word.
6. The method according to claim 5, wherein the updating the word vector of the contextual word and the letter vector of each n-gram corresponding to the specified word according to the similarity between the specified word and the contextual word specifically comprises:
selecting one or more words from the words as negative sample words;
determining the similarity between the specified word and each negative sample word;
determining a loss characterization value corresponding to the specified word according to a specified loss function, the similarity between the specified word and the context word, and the similarity between the specified word and each negative sample word;
and updating the word vector of the context word and the letter vector of each n-gram corresponding to the specified word according to the loss characterization value.
7. The method according to claim 6, wherein the updating the word vector of the context word and the letter vector of each n-gram corresponding to the specified word according to the loss characterization value specifically comprises:
determining a gradient corresponding to the loss function according to the loss characterization value;
and updating the word vector of the context word and the letter vector of each n-gram corresponding to the specified word according to the gradient.
8. The method according to claim 6, wherein the selecting one or more words from the words as negative sample words specifically comprises:
and randomly selecting one or more words from the words to serve as negative sample words.
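The update described in claims 6 to 8 can be illustrated with a small sketch. It rests on several assumptions that are not taken from the claims: the similarity is the dot product between the sum of the n-gram letter vectors and a word vector, the loss is the standard negative-sampling logistic loss, the update is plain gradient descent with a fixed learning rate, and the word vectors of the negative sample words are updated as well. All names (`train_pair`, `num_neg`, `lr`) are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_pair(grams, context, vocab, word_vecs, letter_vecs,
               num_neg=2, lr=0.05, rng=random):
    """One update for a (specified word, context word) pair; returns the loss."""
    # Randomly select negative sample words, excluding the true context word.
    candidates = [w for w in vocab if w != context]
    negatives = [rng.choice(candidates) for _ in range(num_neg)]
    dim = len(word_vecs[context])
    # ASSUMPTION: the specified word is represented by the sum of its n-gram vectors.
    h = [sum(letter_vecs[g][k] for g in grams) for k in range(dim)]
    grad_h = [0.0] * dim
    loss = 0.0
    for target, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        v = word_vecs[target]
        score = sigmoid(sum(a * b for a, b in zip(h, v)))
        loss -= math.log(score if label else 1.0 - score)
        g = score - label  # derivative of the loss with respect to the similarity
        for k in range(dim):
            grad_h[k] += g * v[k]   # accumulate gradient for the n-gram vectors
            v[k] -= lr * g * h[k]   # update the context or negative word vector
    for gram in grams:
        for k in range(dim):
            letter_vecs[gram][k] -= lr * grad_h[k]
    return loss
```

Repeating the call on the same pair should drive the returned loss down, which is a quick sanity check that the gradient signs are consistent.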
9. The method according to claim 1, wherein the determining one or more context words of the word in the corpus after word segmentation specifically comprises:
establishing a window in the corpus after word segmentation by taking the word as the center and sliding leftward and/or rightward by a distance of a specified number of words;
and determining the words except the word in the window as the contextual words of the word.
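The window in claim 9 could be sketched as below, assuming the segmented corpus is a list of word strings and the word is identified by its position. The name `context_words` and the `left` and `right` parameters (the specified numbers of words on each side) are hypothetical.

```python
def context_words(tokens, index, left=2, right=2):
    """Return the words inside the window around tokens[index], excluding the word itself."""
    start = max(0, index - left)                # clip the window at the corpus start
    end = min(len(tokens), index + right + 1)   # clip the window at the corpus end
    return tokens[start:index] + tokens[index + 1:end]
```

Setting `left=0` or `right=0` gives a window that slides in one direction only.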
10. A word vector processing apparatus comprising:
a word segmentation module, configured to segment a corpus into words to obtain each word;
a determining module, configured to determine each n-gram corresponding to each word, wherein an n-gram represents n consecutive letters of its corresponding word;
an initialization module, configured to establish and initialize the word vector of each word and the letter vector of each n-gram corresponding to each word;
a training module, configured to train the word vectors and the letter vectors according to the word vectors, the letter vectors, and the corpus after word segmentation;
wherein the words are words of the Hieran language family;
the training module trains the word vector and the letter vector according to the word vector, the letter vector and the corpus after word segmentation, and specifically comprises:
the training module traverses the corpus after word segmentation and performs the following for each word in the corpus after word segmentation:
determining one or more context words of the word in the corpus after word segmentation;
performing the following for each of the context words:
determining the similarity between the word and the context word according to the letter vector of each n-gram corresponding to the word and the word vector of the context word;
updating the word vector of the context word and the letter vector of each n-gram corresponding to the word according to the similarity between the word and the context word;
the training module determines the similarity between the word and the context word according to the letter vector of each n-gram corresponding to the word and the word vector of the context word, which specifically comprises:
the training module determines the similarity between the word and the context word according to the letter vector of each n-gram corresponding to the word, the word vector of the word, and the word vector of the context word;
the training module determines similarity between the word and the context word according to the letter vector of each n-gram corresponding to the word, the word vector of the word, and the word vector of the context word, and specifically includes:
the training module calculates the similarity between the word and the contextual word according to the following formula:
Figure FDA0003096397420000051
wherein s(w) denotes the set of n-grams corresponding to the word w, q denotes each n-gram in s(w), and sim(w, c) denotes the similarity between the word w and the context word c;
Figure FDA0003096397420000052
denotes the letter vector of q,
Figure FDA0003096397420000053
denotes the word vector of w, and
Figure FDA0003096397420000054
denotes the word vector of c; the operator in the formula denotes a specified operation on two vectors, the specified operation being a dot product operation, an included-angle cosine operation, or a Euclidean distance operation; and β1 and β2 are weight parameters.
11. The apparatus according to claim 10, wherein the determining module determines each n-gram corresponding to each word, and specifically includes:
the determining module determines words appearing in the corpus according to the result of word segmentation of the corpus;
for each of the determined distinct words, performing:
determining each n-gram corresponding to the word, wherein each n-gram corresponding to the word represents n consecutive letters of the word, and n is one positive integer or a plurality of different positive integers.
12. The apparatus according to claim 11, wherein the determining module determines the words appearing in the corpus according to the result of word segmentation of the corpus, which specifically comprises:
the determining module determines, according to the result of word segmentation of the corpus, the words that appear in the corpus with a number of occurrences not less than a set number.
13. The apparatus according to claim 10, wherein the initialization module initializes the word vector of each word and the letter vector of each n-gram corresponding to each word, which specifically comprises:
the initialization module initializes the word vector of each word and the letter vector of each n-gram corresponding to each word by random initialization or by initialization according to a specified probability distribution, wherein the same n-gram has the same letter vector.
14. The apparatus according to claim 10, wherein the training module trains the word vector and the letter vector according to the word vector, the letter vector, and the corpus after word segmentation, which specifically comprises:
the training module determines a specified word in the corpus after word segmentation and one or more context words of the specified word in the corpus after word segmentation;
determining the similarity between the specified word and the context word according to the letter vector of each n-gram corresponding to the specified word and the word vector of the context word;
and updating the word vector of the context word and the letter vector of each n-gram corresponding to the specified word according to the similarity between the specified word and the context word.
15. The apparatus according to claim 14, wherein the training module updates the word vector of the context word and the letter vector of each n-gram corresponding to the specified word according to the similarity between the specified word and the context word, which specifically comprises:
the training module selects one or more words from the words as negative sample words;
determining the similarity between the specified word and each negative sample word;
determining a loss characterization value corresponding to the specified word according to a specified loss function, the similarity between the specified word and the context word, and the similarity between the specified word and each negative sample word;
and updating the word vector of the context word and the letter vector of each n-gram corresponding to the specified word according to the loss characterization value.
16. The apparatus according to claim 15, wherein the training module updates, according to the loss characterization value, the word vector of the context word and the letter vector of each n-gram corresponding to the specified word, which specifically comprises:
the training module determines a gradient corresponding to the loss function according to the loss characterization value;
and updating the word vector of the context word and the letter vector of each n-gram corresponding to the specified word according to the gradient.
17. The apparatus according to claim 15, wherein the training module selects one or more words from the words as negative sample words, and specifically includes:
and the training module randomly selects one or more words from the words to serve as negative sample words.
18. The apparatus according to claim 10, wherein the training module determines one or more context words of the word in the corpus after word segmentation, which specifically comprises:
the training module establishes a window in the corpus after word segmentation by taking the word as the center and sliding leftward and/or rightward by a distance of a specified number of words;
and determining the words except the word in the window as the contextual words of the word.
19. A method of word vector processing, comprising:
step 1, segmenting a corpus into words to obtain each word; proceeding to step 2;
step 2, establishing an n-gram mapping table, wherein the mapping table comprises the mapping relation between each word and its n-grams, and an n-gram represents n consecutive letters of the word to which it is mapped; establishing and initializing the word vector of each word and the letter vector of each n-gram mapped by each word; proceeding to step 3;
step 3, traversing the corpus after word segmentation, respectively taking the traversed word as a current word w, and executing step 4 on the current word w, if the traversal is finished, ending, otherwise, continuing the traversal;
step 4, traversing each context word appointed for the current word w in the corpus, respectively executing step 5 on the current context word c, if the traversal is completed, continuing the execution of step 3, otherwise, continuing the traversal;
step 5, calculating the similarity between the current word w and the current contextual word c according to the following formula:
Figure FDA0003096397420000071
wherein s(w) denotes a set of at least some of the n-grams mapped by the current word w in the n-gram mapping table, q denotes each n-gram in s(w), and sim(w, c) denotes the similarity between the current word w and the current context word c;
Figure FDA0003096397420000072
denotes the letter vector of q,
Figure FDA0003096397420000073
denotes the word vector of w, and
Figure FDA0003096397420000074
denotes the word vector of c; the operator in the formula denotes a specified operation on two vectors, the specified operation being a dot product operation, an included-angle cosine operation, or a Euclidean distance operation; and β1 and β2 are weight parameters; proceeding to step 6;
step 6, randomly drawing λ words as negative sample words, and calculating the corresponding loss characterization value l(w, c) according to the following loss function:
Figure FDA0003096397420000075
where c' is a randomly drawn negative sample word, E_{c'∈p(V)}[x] denotes the expected value of the expression x when the randomly drawn negative sample word c' follows the probability distribution p(V), and σ(·) is the neural network activation function, defined as
Figure FDA0003096397420000081
calculating the gradient corresponding to the loss function according to the calculated loss characterization value l(w, c), and updating, according to the gradient, the letter vector
Figure FDA0003096397420000082
of each n-gram q and the word vector
Figure FDA0003096397420000083
of the current context word c;
wherein the method further comprises:
training the word vector and the letter vector according to the word vector, the letter vector and the corpus after word segmentation, specifically comprising:
traversing the corpus after word segmentation, and performing the following for each word in the corpus after word segmentation:
determining one or more context words of the word in the corpus after word segmentation;
performing the following for each of the context words:
determining the similarity between the word and the context word according to the letter vector of each n-gram corresponding to the word and the word vector of the context word;
updating the word vector of the context word and the letter vector of each n-gram corresponding to the word according to the similarity between the word and the context word;
determining the similarity between the word and the context word according to the letter vector of each n-gram corresponding to the word and the word vector of the context word, specifically comprising:
determining the similarity between the word and the context word according to the letter vector of each n-gram corresponding to the word, the word vector of the word and the word vector of the context word;
determining the similarity between the word and the context word according to the letter vector of each n-gram corresponding to the word, the word vector of the word, and the word vector of the context word, specifically comprising:
the similarity of the word to the contextual word is calculated according to the following formula:
Figure FDA0003096397420000084
wherein s(w) denotes the set of n-grams corresponding to the word w, q denotes each n-gram in s(w), and sim(w, c) denotes the similarity between the word w and the context word c;
Figure FDA0003096397420000091
denotes the letter vector of q,
Figure FDA0003096397420000092
denotes the word vector of w, and
Figure FDA0003096397420000093
denotes the word vector of c; the operator in the formula denotes a specified operation on two vectors, the specified operation being a dot product operation, an included-angle cosine operation, or a Euclidean distance operation; and β1 and β2 are weight parameters.
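The control flow of steps 1 to 4 in claim 19 (building the n-gram mapping table, then traversing each current word w and each of its context words c) might look like the sketch below. It assumes a single n-gram size and a symmetric window; the names `build_ngram_table`, `iter_pairs`, and `window` are hypothetical, and the per-pair similarity, loss, and update of steps 5 and 6 are left out.

```python
def build_ngram_table(tokens, n=3):
    """Map each distinct word to the list of its n-grams (n consecutive letters)."""
    table = {}
    for word in tokens:
        if word not in table:
            table[word] = [word[i:i + n] for i in range(len(word) - n + 1)]
    return table

def iter_pairs(tokens, window=1):
    """Yield (current word w, context word c) pairs in traversal order."""
    for i, w in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # the current word is not its own context word
                yield w, tokens[j]
```

A training loop would call the step 5 and step 6 computations once for each pair produced by `iter_pairs`.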
20. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to:
segmenting the corpus into words to obtain each word;
determining each n-gram letter corresponding to each word, wherein the n-gram letters represent continuous n letters of the corresponding word;
establishing and initializing the word vector of each word and the letter vector of each n-gram corresponding to each word;
training the word vectors and the letter vectors according to the word vectors, the letter vectors and the corpus after word segmentation;
wherein the words are words of the Hieran language family;
training the word vector and the letter vector according to the word vector, the letter vector and the corpus after word segmentation, specifically comprising:
traversing the corpus after word segmentation, and performing the following for each word in the corpus after word segmentation:
determining one or more context words of the word in the corpus after word segmentation;
performing the following for each of the context words:
determining the similarity between the word and the context word according to the letter vector of each n-gram corresponding to the word and the word vector of the context word;
updating the word vector of the context word and the letter vector of each n-gram corresponding to the word according to the similarity between the word and the context word;
determining the similarity between the word and the context word according to the letter vector of each n-gram corresponding to the word and the word vector of the context word, specifically comprising:
determining the similarity between the word and the context word according to the letter vector of each n-gram corresponding to the word, the word vector of the word and the word vector of the context word;
determining the similarity between the word and the context word according to the letter vector of each n-gram corresponding to the word, the word vector of the word, and the word vector of the context word, specifically comprising:
the similarity of the word to the contextual word is calculated according to the following formula:
Figure FDA0003096397420000101
wherein s(w) denotes the set of n-grams corresponding to the word w, q denotes each n-gram in s(w), and sim(w, c) denotes the similarity between the word w and the context word c;
Figure FDA0003096397420000102
denotes the letter vector of q,
Figure FDA0003096397420000103
denotes the word vector of w, and
Figure FDA0003096397420000104
denotes the word vector of c; the operator in the formula denotes a specified operation on two vectors, the specified operation being a dot product operation, an included-angle cosine operation, or a Euclidean distance operation; and β1 and β2 are weight parameters.
CN201710583716.2A 2017-07-18 2017-07-18 Word vector processing method and device and electronic equipment Active CN107844472B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710583716.2A CN107844472B (en) 2017-07-18 2017-07-18 Word vector processing method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710583716.2A CN107844472B (en) 2017-07-18 2017-07-18 Word vector processing method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN107844472A CN107844472A (en) 2018-03-27
CN107844472B true CN107844472B (en) 2021-08-24

Family

ID=61682814

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710583716.2A Active CN107844472B (en) 2017-07-18 2017-07-18 Word vector processing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN107844472B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110245353B (en) * 2019-06-20 2022-10-28 腾讯科技(深圳)有限公司 Natural language expression method, device, equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104090890A (en) * 2013-12-12 2014-10-08 深圳市腾讯计算机系统有限公司 Method, device and server for obtaining similarity of key words
CN105528411A (en) * 2015-12-03 2016-04-27 中国人民解放军海军工程大学 Full-text retrieval device and method for interactive electronic technical manual of shipping equipment
CN106569998A (en) * 2016-10-27 2017-04-19 浙江大学 Text named entity recognition method based on Bi-LSTM, CNN and CRF

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AUPR824301A0 (en) * 2001-10-15 2001-11-08 Silverbrook Research Pty. Ltd. Methods and systems (npw001)

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Enriching Word Vectors with Subword Information;Piotr Bojanowski et al.;《arXiv:1607.04606v1》;20160615;第1-7页 *
FastText分析与实践;machinelearning;《简书https://www.jianshu.com/p/9ea0d69dd55e》;20170711;第1-10页 *
Word2Vec作者Thomas Mikolov的三篇代表作解析;悟乙己;《CSDN博客 https://blog.csdn.net/sinat_26917383/article/details/52577551》;20160918;第1-9页 *

Also Published As

Publication number Publication date
CN107844472A (en) 2018-03-27

Similar Documents

Publication Publication Date Title
CN108345580B (en) Word vector processing method and device
CN108874765B (en) Word vector processing method and device
CN107957989B9 (en) Cluster-based word vector processing method, device and equipment
CN108228704B (en) Method, device and equipment for identifying risk content
WO2019084867A1 (en) Automatic answering method and apparatus, storage medium, and electronic device
TWI689831B (en) Word vector generating method, device and equipment
US11010554B2 (en) Method and device for identifying specific text information
TWI686713B (en) Word vector generating method, device and equipment
CN109117474B (en) Statement similarity calculation method and device and storage medium
CN107423269B (en) Word vector processing method and device
CN112308113A (en) Target identification method, device and medium based on semi-supervision
CN117235226A (en) Question response method and device based on large language model
CN113221555A (en) Keyword identification method, device and equipment based on multitask model
CN107247704B (en) Word vector processing method and device and electronic equipment
CN109597982B (en) Abstract text recognition method and device
CN107577658B (en) Word vector processing method and device and electronic equipment
CN107562715B (en) Word vector processing method and device and electronic equipment
CN107844472B (en) Word vector processing method and device and electronic equipment
CN111091001B (en) Method, device and equipment for generating word vector of word
JP2017032996A (en) Provision of adaptive electronic reading support
CN108681490B (en) Vector processing method, device and equipment for RPC information
CN115130621A (en) Model training method and device, storage medium and electronic equipment
CN110321433B (en) Method and device for determining text category
CN111539520A (en) Method and device for enhancing robustness of deep learning model
CN108563696B (en) Method, device and equipment for discovering potential risk words

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 1249780

Country of ref document: HK

TA01 Transfer of patent application right

Effective date of registration: 20191212

Address after: P.O. Box 31119, grand exhibition hall, hibiscus street, 802 West Bay Road, Grand Cayman, ky1-1205, Cayman Islands

Applicant after: Innovative advanced technology Co., Ltd

Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands

Applicant before: Alibaba Group Holding Co., Ltd.

GR01 Patent grant