CN111325026A - Training method and system for word vector model - Google Patents

Training method and system for word vector model

Info

Publication number
CN111325026A
Authority
CN
China
Prior art keywords
word
training
huffman tree
vector model
word vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010100333.7A
Other languages
Chinese (zh)
Other versions
CN111325026B (en)
Inventor
周思丞
陈孝良
苏少炜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing SoundAI Technology Co Ltd
Original Assignee
Beijing SoundAI Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing SoundAI Technology Co Ltd filed Critical Beijing SoundAI Technology Co Ltd
Priority to CN202010100333.7A priority Critical patent/CN111325026B/en
Publication of CN111325026A publication Critical patent/CN111325026A/en
Application granted granted Critical
Publication of CN111325026B publication Critical patent/CN111325026B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a training method and system for a word vector model: training corpora of different application scenes are acquired; a first Huffman tree corresponding to the word vector model is determined by using the training corpora, where a second Huffman tree corresponding to each application scene is determined in advance by using the training corpus of that scene, and the root node of each second Huffman tree is taken as an inner node of the first Huffman tree; and the word vector model is trained according to the first Huffman tree until convergence, with the goal of maximizing the co-occurrence probability of each word and its corresponding context. In this scheme, the second Huffman tree of each application scene is determined by using the training corpus corresponding to that scene, and the root node of each second Huffman tree is used as an inner node of the first Huffman tree. Training the word vector model according to the first Huffman tree until convergence, with the goal of maximizing the co-occurrence probability of each word and its corresponding context, exploits the influence of different application scenes on training the word vector model and improves its performance.

Description

Training method and system for word vector model
Technical Field
The invention relates to the technical field of word vector training, in particular to a training method and a training system of a word vector model.
Background
With the development of science and technology, speech recognition is applied more and more widely. When speech recognition is applied, the application scene corresponding to the user's utterance needs to be recognized, and a reply or subsequent processing is then carried out according to that application scene.
When the application scene of the speaking content is identified, the speaking content is converted into text data, and the text data is processed by utilizing a pre-trained classification model to obtain the application scene of the speaking content. When text data is processed, word vectors are generally used as input of a classification model, that is, the text data is converted into word vectors by using the word vector model, and then the converted word vectors are subjected to subsequent processing. To improve the performance of the classification model, the word vector model needs to be pre-trained.
Therefore, a training method of the word vector model is needed.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and a system for training a word vector model to improve performance of the word vector model.
In order to achieve the above purpose, the embodiments of the present invention provide the following technical solutions:
the first aspect of the embodiments of the present invention discloses a method for training a word vector model, where the method includes:
acquiring training corpora corresponding to different application scenes, wherein each training corpus comprises a plurality of words of different word types;
determining a first Huffman tree corresponding to a word vector model by using the training corpora, wherein a second Huffman tree corresponding to each application scene is determined in advance by using the training corpus corresponding to each application scene, and a root node of each second Huffman tree is used as an inner node of the first Huffman tree;
and training the word vector model according to the first Huffman tree until convergence by taking the maximum co-occurrence probability of each word and the context corresponding to the word itself as a target.
Preferably, the determining, by using a corpus corresponding to each application scenario in advance, the second huffman tree corresponding to each application scenario includes:
determining the word type of a training corpus corresponding to each application scene;
and determining a second Huffman tree corresponding to the application scene by taking the application scene as a root node, taking the word type as an inner node and taking the words as leaf nodes.
Preferably, the training of the word vector model according to the first Huffman tree, with the maximum co-occurrence probability of each word and its corresponding context as a target, until convergence includes:
for each of the words in the first Huffman tree, determining a context to which the word corresponds;
calculating the co-occurrence probability of each word and its corresponding context;
and training the word vector model until convergence with the maximum co-occurrence probability as a target.
Preferably, after the obtaining of the corpus corresponding to different application scenarios, the method further includes:
and performing word segmentation processing, special symbol removing processing and stop word removing processing on the training corpus corresponding to each application scene.
The second aspect of the embodiments of the present invention discloses a training system for a word vector model, the system comprising:
an acquisition unit, used for acquiring training corpora corresponding to different application scenes, wherein each training corpus comprises a plurality of words of different word types;
a determining unit, configured to determine a first huffman tree corresponding to a word vector model by using the corpus, where the second huffman tree corresponding to each application scenario is determined by using the corpus corresponding to each application scenario in advance, and a root node of each second huffman tree is used as an inner node of the first huffman tree;
and the training unit is used for training the word vector model according to the first Huffman tree until convergence by taking the maximum co-occurrence probability of each word and the context corresponding to the word as a target.
Preferably, the determination unit includes:
the first determining module is used for determining the word type of the training corpus corresponding to each application scene;
and the second determining module is used for determining a second Huffman tree corresponding to the application scene by taking the application scene as a root node, taking the word type as an inner node and taking the word as a leaf node.
Preferably, the training unit comprises:
a determining module, configured to determine, for each of the words in the first huffman tree, a context to which the word corresponds;
the calculation module is used for calculating the co-occurrence probability of each word and the context corresponding to the word;
and the training module is used for training the word vector model until convergence with the maximum co-occurrence probability as the target.
Preferably, the system further comprises:
and the processing unit is used for performing word segmentation processing, special symbol removing processing and stop word removing processing on the training corpus corresponding to each application scene.
The third aspect of the embodiments of the present invention discloses an electronic device, where the electronic device is configured to run a program, and when the program runs, the method for training a word vector model disclosed in the first aspect of the embodiments of the present invention is performed.
A fourth aspect of the embodiments of the present invention discloses a computer storage medium, where the computer storage medium includes a stored program, and when the program runs, the device in which the storage medium is located is controlled to execute the method for training a word vector model disclosed in the first aspect of the embodiments of the present invention.
Based on the above method and system for training a word vector model provided by the embodiments of the present invention, the method includes: acquiring training corpora of different application scenes; determining a first Huffman tree corresponding to the word vector model by using the training corpora, where a second Huffman tree corresponding to each application scene is determined in advance by using the training corpus of that scene, and the root node of each second Huffman tree is taken as an inner node of the first Huffman tree; and training the word vector model according to the first Huffman tree until convergence, with the goal of maximizing the co-occurrence probability of each word and its corresponding context. In this scheme, the second Huffman tree of each application scene is determined by using the training corpus corresponding to that scene, and the root node of each second Huffman tree is taken as an inner node of the first Huffman tree. With the goal of maximizing the co-occurrence probability of each word and its corresponding context, the word vector model is trained according to the first Huffman tree until convergence, so that the influence of different application scenes on training the word vector model is fully utilized and the performance of the word vector model is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a flowchart of a method for training a word vector model according to an embodiment of the present invention;
FIG. 2 is a block diagram of a first Huffman tree according to an embodiment of the present invention;
FIG. 3 is a flowchart of a training word vector model according to an embodiment of the present invention;
FIG. 4 is another flowchart of a method for training a word vector model according to an embodiment of the present invention;
FIG. 5 is a block diagram of a training system for a word vector model according to an embodiment of the present invention;
FIG. 6 is another block diagram of a word vector model training system according to an embodiment of the present invention;
FIG. 7 is a block diagram of another structure of a training system for a word vector model according to an embodiment of the present invention;
fig. 8 is a further structural block diagram of a training system for a word vector model according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In this application, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The word vector model referred to in the embodiments of the present invention is a general term for language modeling and feature learning techniques in natural language processing that map words and phrases to vectors of real numbers. In other words, the word vector model involves a mathematical embedding from a space with one dimension per word to a continuous vector space of much lower dimension.
The specific concept of the Huffman tree involved in the embodiments of the present invention is as follows: given n weights as n leaf nodes, a binary tree is constructed; if the weighted path length of the binary tree reaches the minimum, the binary tree may be called an optimal binary tree, or a Huffman tree. That is, a Huffman tree is the binary tree with the shortest weighted path length, in which the nodes with larger weights are closer to the root node.
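For illustration only, the following Python sketch builds such an optimal binary tree from weighted leaves; the function and data names are illustrative and do not appear in the patent.

```python
import heapq
import itertools

def build_huffman_tree(weighted_leaves):
    """Build a Huffman tree from (label, weight) pairs.

    The two lowest-weight subtrees are merged repeatedly, so nodes with
    larger weights end up closer to the root, which is the "optimal
    binary tree" property described above.
    """
    tie = itertools.count()  # tie-breaker so the heap never compares dict nodes
    heap = [(weight, next(tie), {"label": label, "weight": weight})
            for label, weight in weighted_leaves]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tie),
                              {"weight": w1 + w2, "left": left, "right": right}))
    return heap[0][2]  # the root node

# Example: four leaves weighted by word frequency.
root = build_huffman_tree([("weather", 5), ("Beijing", 3), ("play", 2), ("alarm", 1)])
```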
As known from the background art, when text data is processed, a word vector is usually used as an input of a classification model, i.e., the text data is converted into the word vector by using the word vector model, and then the converted word vector is subjected to subsequent processing. The performance of the word vector model affects the processing effect of the classification model on the text data, that is, the better the performance of the word vector model is, the higher the accuracy of recognizing the speaking content of the user is.
Therefore, the embodiments of the present invention provide a method and a system for training a word vector model, which determine the second Huffman tree of each application scene by using the training corpus corresponding to that scene, and take the root node of each second Huffman tree as an inner node of the first Huffman tree. The word vector model is then trained according to the first Huffman tree until convergence, with the goal of maximizing the co-occurrence probability of each word and its corresponding context, so that the influence of different application scenes on training the word vector model is fully utilized and the performance of the word vector model is improved.
Referring to fig. 1, a flowchart of a training method for a word vector model according to an embodiment of the present invention is shown, where the training method includes the following steps:
step S101: and acquiring training corpora corresponding to different application scenes.
In the process of specifically implementing step S101, a plurality of different application scenarios are preset, and a corpus corresponding to each application scenario is obtained. And the training corpus corresponding to each application scene comprises a plurality of words of different word types.
In a further implementation, a plurality of different application scenes may be set according to the actual requirements of speech recognition. For example, for the currently common smart speaker, a corresponding word vector model needs to be trained to assist the smart speaker in subsequent speech recognition. According to the functional requirements of the smart speaker, a plurality of different application scenes are set for it, such as weather, alarm clock and music scenes, and the corresponding training corpus is then collected for each application scene.
It should be noted that, for each word in the corpus, the word may appear in the corpus in multiple application scenarios.
Preferably, after the corpus corresponding to different application scenarios is obtained, preprocessing such as word segmentation processing, special symbol removal processing, stop word removal processing and the like is performed on the corpus corresponding to each application scenario.
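As a minimal sketch of this preprocessing step (the segmenter choice, the regular expression and the stop-word list are all assumptions for illustration, not taken from the patent):

```python
import re
import jieba  # assumed Chinese word segmenter; any segmenter could be substituted

STOP_WORDS = {"的", "了", "是", "在"}  # illustrative stop-word list

def preprocess(sentence):
    """Word segmentation, special-symbol removal and stop-word removal."""
    cleaned = re.sub(r"[^\w\u4e00-\u9fff]+", " ", sentence)  # drop special symbols
    tokens = jieba.lcut(cleaned)                              # word segmentation
    return [t for t in tokens if t.strip() and t not in STOP_WORDS]

# Illustrative corpora for two application scenes.
corpus_by_scene = {
    "weather": [preprocess("北京今天天气怎么样")],
    "music":   [preprocess("播放周杰伦的歌")],
}
```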
Step S102: and determining a first Huffman tree corresponding to the word vector model by using the training corpus.
In the process of specifically implementing step S102, for each application scenario, according to the corpus of the application scenario, a second huffman tree corresponding to the application scenario is determined. Specifically, the process of determining the second huffman tree corresponding to each application scenario is as follows:
and determining the word type of the training corpus corresponding to each application scene, and determining a second Huffman tree corresponding to the application scene by taking the application scene as a root node, the word type as an inner node and the words as leaf nodes. I.e. each of said application scenarios corresponds to one of said second huffman trees.
It should be noted that, in the process of constructing the second Huffman tree corresponding to an application scene, the words corresponding to the second Huffman tree are the words after de-duplication. For example, suppose there are 8 word occurrences in the training corpus, among which the word "weather" appears 5 times and the word "Beijing" appears 3 times. After the words of the training corpus are de-duplicated, the two words corresponding to the second Huffman tree are "weather" and "Beijing", respectively.
After the second Huffman trees corresponding to each application scene are determined, the root node of each second Huffman tree is used as the inner node of a first Huffman tree, and the first Huffman tree is constructed by combining all the second Huffman trees.
In the first Huffman tree, leaf nodes are words in the training corpus related to the previous step, and other nodes are internal nodes.
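This construction can be sketched as follows, reusing the build_huffman_tree helper from the Huffman-tree sketch above. Grouping leaves under word-type inner nodes is omitted for brevity, and weighting each scene by its number of word types is an assumption that matches the later observation that scenes with more word types sit closer to the root; all names are illustrative.

```python
from collections import Counter

def build_scene_tree(scene, tokenized_corpus):
    """Second Huffman tree for one application scene (sketch).

    The de-duplicated words of the scene's corpus are the leaves, weighted
    by frequency; the resulting subtree root is labelled with the scene so
    that it can be grafted into the first Huffman tree.
    """
    freq = Counter(w for sentence in tokenized_corpus for w in sentence)
    tree = build_huffman_tree(list(freq.items()))
    tree["label"] = scene          # the application scene acts as this subtree's root
    tree["n_types"] = len(freq)    # number of distinct word types in the scene
    return tree

def build_first_tree(scene_trees):
    """First Huffman tree: the root of every second tree becomes an inner node."""
    skeleton = build_huffman_tree([(t["label"], t["n_types"]) for t in scene_trees])
    by_label = {t["label"]: t for t in scene_trees}

    def graft(node):
        # swap a scene placeholder leaf for the corresponding second Huffman tree
        if "left" not in node and node.get("label") in by_label:
            return by_label[node["label"]]
        for side in ("left", "right"):
            if side in node:
                node[side] = graft(node[side])
        return node

    return graft(skeleton)
```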
As can be seen from the foregoing, each word may appear in a plurality of the application scenes. The probability Pr(D_i) of a word appearing in a given application scene is calculated by formula (1), where D_i denotes the i-th application scene, w denotes a word, Count_i(w_j) denotes the number of occurrences of word w_j in application scene D_i, and Count_i(w) denotes the total number of all words in application scene D_i:

$$\Pr(D_i) = \frac{\mathrm{Count}_i(w_j)}{\mathrm{Count}_i(w)} \tag{1}$$
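A direct reading of formula (1) in code (variable names are illustrative):

```python
from collections import Counter

def scene_probability(word, scene_tokens):
    """Pr(D_i) for one word in one application scene, per formula (1)."""
    counts = Counter(scene_tokens)
    return counts[word] / sum(counts.values())

# With 8 word occurrences in a scene, 5 of which are "weather":
scene_probability("weather", ["weather"] * 5 + ["Beijing"] * 3)  # -> 0.625
```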
Step S103: and training the word vector model according to the first Huffman tree until convergence by taking the maximum co-occurrence probability of each word and the context corresponding to the word vector model per se as a target.
In the process of implementing step S103 specifically, a context corresponding to each word is given in advance, and the word vector model is trained by using the first Huffman tree, with the maximum co-occurrence probability of each word and its context as a target, until the word vector model converges.
To better explain the structure of the first huffman tree referred to above, it is illustrated by fig. 2, and it should be noted that fig. 2 is only shown for illustration. Referring to fig. 2, a schematic diagram of an architecture of a first huffman tree according to an embodiment of the present invention is shown.
In FIG. 2, n_0 is the root node of the first Huffman tree. Four application scenes are preset, namely d_1, d_2, d_3 and d_4, and the training corpus of each application scene is collected separately. For each application scene, a second Huffman tree corresponding to that scene is constructed, in which the leaf nodes w_i represent words in the training corpus. The root node of each second Huffman tree is taken as an inner node of the first Huffman tree, i.e., the inner nodes n_4, n_5, n_6 and n_7 in FIG. 2.
As can be seen from the structure of the first Huffman tree shown in fig. 2, if a word occurs in several different application scenes, the word occurs multiple times in the first Huffman tree.
It should be noted that, in the first Huffman tree, for application scenes with more word types, the corresponding inner node is closer to the root node; for application scenes with fewer word types, the corresponding inner node is farther from the root node.
It should be further noted that, in fig. 2, the inner nodes corresponding to all application scenes are in the same layer of the first Huffman tree, which is only used as an example. The inner nodes corresponding to different application scenes can also be in different layers of the first Huffman tree.
In the embodiment of the invention, the second Huffman tree of each application scene is determined by using the training corpus corresponding to that scene, and the root node of each second Huffman tree is taken as an inner node of the first Huffman tree. With the goal of maximizing the co-occurrence probability of each word and its corresponding context, the word vector model is trained according to the first Huffman tree until convergence, so that the influence of different application scenes on training the word vector model is fully utilized and the performance of the word vector model is improved.
The process of training the word vector model related to step S103 in fig. 1 in the embodiment of the present invention is shown in fig. 3, which is a flowchart of training the word vector model provided in the embodiment of the present invention, and includes the following steps:
step S301: for each of the words in the first Huffman tree, determining a context to which the word corresponds.
In the process of implementing step S301 specifically, it should be noted that a context is given to each word in advance. For example, in the sentence "play a song of Zhou Jielun", the contexts of the word "Zhou Jielun" are "play" and "song", respectively.
As can be seen from the aforementioned schematic diagram of the first Huffman tree shown in fig. 2, each inner node in the first Huffman tree can be regarded as a binary classifier. Meanwhile, each word may appear in the first Huffman tree multiple times, that is, multiple paths may exist from the root node to its leaf nodes. A path from the root node to the leaf node w_i is denoted as

$$\mathrm{path}(w_i) = \left(n_1^{w_i}, n_2^{w_i}, \ldots, n_{L_i}^{w_i}\right)$$

where n_j^{w_i} is the j-th inner node on the path from the root node to the leaf node w_i. The probability Pr(w_i | C(w_i))_path of observing the leaf node along this path is calculated by formula (2), where C(w_i) is the context corresponding to the word w_i:

$$\Pr\left(w_i \mid C(w_i)\right)_{\mathrm{path}} = \prod_{j=1}^{L_i} \Pr\left(n_j^{w_i} \mid C(w_i)\right) \tag{2}$$

It should be noted that, in the first Huffman tree, each inner node n_j^{w_i} corresponds to a latent vector representation θ_j^{w_i}, which may be considered a parameter of the binary classifier. Pr(n_j^{w_i} | C(w_i)) is defined as formula (3), where σ(·) is the sigmoid function and v(C(w_i)) denotes the vector representation of the context C(w_i):

$$\Pr\left(n_j^{w_i} \mid C(w_i)\right) = \sigma\left(\theta_j^{w_i} \cdot v\left(C(w_i)\right)\right) \tag{3}$$
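A minimal numeric sketch of formulas (2) and (3); the ±1 direction encoding of which child is taken at each inner node is the usual hierarchical-softmax convention and is an assumption here, since the patent only states that each inner node acts as a binary classifier.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def path_probability(context_vec, inner_node_vecs, directions):
    """Pr(w_i | C(w_i))_path along one root-to-leaf path, per formulas (2)-(3).

    inner_node_vecs are the latent vectors theta_j of the inner nodes on the
    path; directions (+1 or -1) encode which child is taken at each node.
    """
    prob = 1.0
    for theta, d in zip(inner_node_vecs, directions):
        prob *= sigmoid(d * np.dot(theta, context_vec))
    return prob
```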
Step S302: and calculating the co-occurrence probability of each word and its corresponding context.
In the specific implementation process of step S302, since each word may appear in a plurality of application scenes, there may be a plurality of paths from the root node to each leaf node in the first Huffman tree, that is, each leaf node w_i corresponds to K paths (the word appears in K application scenes). For each word and each of its K paths path_k (k ∈ {1, …, K}), the probability Pr(path_k) of the word occurring in the k-th application scene can be calculated; Pr(path_k) is Pr(D_i) in the above formula (1). Combining the contents shown in formulas (1) to (3), the co-occurrence probability of each word and its corresponding context is calculated by formula (4):

$$\Pr\left(w_i \mid C(w_i)\right) = \sum_{k=1}^{K} \Pr(\mathrm{path}_k)\, \Pr\left(w_i \mid C(w_i)\right)_{\mathrm{path}_k} \tag{4}$$
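A sketch of formula (4) using the path_probability helper above; the weighted-sum form follows the reconstruction of formula (4) given here, and the data layout is an illustrative assumption.

```python
def co_occurrence_probability(paths, context_vec):
    """Pr(w_i | C(w_i)) over the K paths of a word, per formula (4).

    paths is a list of (scene_prob, inner_node_vecs, directions) tuples, one
    per application scene in which the word appears; scene_prob is Pr(path_k)
    from formula (1).
    """
    return sum(p_scene * path_probability(context_vec, thetas, dirs)
               for p_scene, thetas, dirs in paths)
```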
Step S303: and training the word vector model until convergence with the maximum co-occurrence probability as a target.
In the process of implementing step S303 specifically, the co-occurrence probability is calculated by the above formula (4), and with the maximum co-occurrence probability as a target, the word vector model is trained until the word vector model converges, and the converged word vector model is applied to a classification model, so as to improve the accuracy of recognizing the speech content of the user.
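A gradient-ascent sketch of one update in this convergence loop, reusing sigmoid and the paths structure from the sketches above; treating each path's gradient independently and weighting it by the scene probability is a simplification for illustration, not the patent's exact derivation, and all names are illustrative.

```python
import numpy as np

def train_step(paths, context_vec, lr=0.025):
    """One update that increases log Pr(w_i | C(w_i)) for a (word, context) pair.

    For a path term sigma(d * theta . v), the gradient of its log-probability
    with respect to (theta . v) is (1 - sigma(d * theta . v)) * d, which gives
    the two updates below.
    """
    new_context = context_vec.copy()
    for p_scene, thetas, dirs in paths:
        for theta, d in zip(thetas, dirs):
            g = p_scene * (1.0 - sigmoid(d * np.dot(theta, context_vec))) * d
            new_context += lr * g * theta        # accumulate the context-vector update
            theta += lr * g * context_vec        # update the inner-node vector in place
    return new_context
```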
In the embodiment of the invention, according to the first Huffman tree, the co-occurrence probability of each word and its corresponding context is calculated; with the maximum co-occurrence probability as the target, the word vector model is trained until it converges, and the converged word vector model is applied to the classification model, thereby improving the accuracy of recognizing the user's speech content.
To better explain the contents of the various steps shown in fig. 1 and fig. 3 of the above-described embodiment of the present invention, the contents shown in fig. 4 are exemplified. Referring to fig. 4, another flowchart of a method for training a word vector model according to an embodiment of the present invention is shown, including the following steps:
step S401: determining different application scenes, and acquiring a training corpus corresponding to each application scene.
In the process of implementing step S401 specifically, different application scenes are selected according to the functions of the speech recognition device, such as a smart speaker, and the training corpus corresponding to each application scene is collected accordingly.
Step S402: and performing word segmentation processing, special symbol removing processing and stop word removing processing on the training corpus.
Step S403: and constructing a second Huffman tree corresponding to each application scene by using the training corpus corresponding to each application scene.
In the process of specifically implementing step S403, for each application scenario, a second huffman tree of the application scenario is constructed by using the corresponding deduplicated corpus.
Step S404: and constructing a first Huffman tree corresponding to the word vector model by utilizing each second Huffman tree.
In the specific implementation process of step S404, a root node of each second huffman tree is used as an inner node of a first huffman tree, and the first huffman tree corresponding to the word vector model is constructed.
Step S405: and calculating the co-occurrence probability of each word and the corresponding context according to the first Huffman tree.
In the process of implementing step S405 specifically, a context corresponding to each word is given in advance, and the co-occurrence probability of each word and its corresponding context is calculated by using the constructed first huffman tree. For a specific calculation process, reference may be made to the content shown in step S302 in fig. 3 in the above embodiment of the present invention, and details are not described herein again.
Step S406: training the word vector model with the co-occurrence probability as a maximum target until the word vector model converges.
In the process of implementing step S406 specifically, with the maximum co-occurrence probability as a target, the word vector model is trained until the word vector model converges, and the converged word vector model is applied to a classification model, so that the classification model performs a subsequent speech recognition process, thereby improving the accuracy of speech recognition.
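Putting the sketches above together for these steps (all corpora, dimensions and paths are illustrative assumptions; mapping each word's tree paths into the (scene_prob, thetas, dirs) structure used by train_step is left schematic):

```python
import numpy as np

# Steps S401-S404: corpora -> second Huffman trees -> first Huffman tree.
second_trees = [build_scene_tree(scene, corpus) for scene, corpus in corpus_by_scene.items()]
first_tree = build_first_tree(second_trees)

# Steps S405-S406: iterate until the objective stops improving (crude convergence check).
dim = 50
context_vec = np.random.randn(dim) * 0.01
paths = [(0.6, [np.random.randn(dim) * 0.01 for _ in range(3)], [+1, -1, +1]),
         (0.4, [np.random.randn(dim) * 0.01 for _ in range(2)], [-1, +1])]
prev = 0.0
for epoch in range(100):
    context_vec = train_step(paths, context_vec)
    cur = co_occurrence_probability(paths, context_vec)
    if abs(cur - prev) < 1e-6:
        break
    prev = cur
```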
In the embodiment of the invention, the second Huffman tree of each application scene is determined by using the training corpus corresponding to that scene, and the root node of each second Huffman tree is taken as an inner node of the first Huffman tree. With the goal of maximizing the co-occurrence probability of each word and its corresponding context, the word vector model is trained according to the first Huffman tree until convergence, so that the influence of different application scenes on training the word vector model is fully utilized and the performance of the word vector model is improved; applying the trained word vector model to the classification model then improves the accuracy of speech recognition.
Corresponding to the above training method for a word vector model provided in the embodiment of the present invention, referring to fig. 5, an embodiment of the present invention further provides a structural block diagram of a training system for a word vector model, where the training system includes: an acquisition unit 501, a determination unit 502 and a training unit 503;
the obtaining unit 501 is configured to obtain corpus corresponding to different application scenarios, where each corpus includes a plurality of words of different word types.
In a specific implementation, a plurality of different application scenarios are preset, a corpus corresponding to the application scenarios is obtained, and a process of specifically obtaining the corpus may refer to the content shown in step S101 in fig. 1 in the embodiment of the present invention, which is not described herein again.
A determining unit 502, configured to determine a first huffman tree corresponding to a word vector model by using the corpus, where the second huffman tree corresponding to each application scenario is determined in advance by using the corpus corresponding to each application scenario, and a root node of each second huffman tree is used as an inner node of the first huffman tree.
In a specific implementation, the process of determining the first huffman tree may refer to the content shown in step S102 in fig. 1 in the embodiment of the present invention, and is not described herein again.
A training unit 503, configured to train the word vector model according to the first huffman tree until convergence, with a maximum co-occurrence probability of each word and its corresponding context as a target.
In the embodiment of the invention, the second Huffman tree of each application scene is determined by using the training corpus corresponding to that scene, and the root node of each second Huffman tree is taken as an inner node of the first Huffman tree. With the goal of maximizing the co-occurrence probability of each word and its corresponding context, the word vector model is trained according to the first Huffman tree until convergence, so that the influence of different application scenes on training the word vector model is fully utilized and the performance of the word vector model is improved.
Preferably, referring to fig. 6 in conjunction with fig. 5, a block diagram of a structure of a training system of a word vector model according to an embodiment of the present invention is shown, where the determining unit 502 includes:
a first determining module 5021, configured to determine, for each application scenario, a word class of a corpus corresponding to the application scenario.
A second determining module 5022, configured to determine a second huffman tree corresponding to the application scenario by using the application scenario as a root node, using the word category as an inner node, and using the word as a leaf node.
In a specific implementation, reference may be made to relevant contents of constructing the second huffman tree in step S102 in fig. 1 of the embodiment of the present invention, and details thereof are not repeated herein.
Preferably, referring to fig. 7 in conjunction with fig. 5, a block diagram of a training system of a word vector model according to an embodiment of the present invention is shown, where the training unit 503 includes: a determination module 5031, a calculation module 5032 and a training module 5033;
a determining module 5031 configured to determine, for each of the words in the first huffman tree, a context corresponding to the word.
A calculating module 5032, configured to calculate a co-occurrence probability of each word and its corresponding context.
In a specific implementation, the process of calculating the co-occurrence probability of each word and the context corresponding to the word and the context may refer to the content shown in fig. 3 in the embodiment of the present invention, and is not described herein again.
A training module 5033, configured to train the word vector model until convergence, with the co-occurrence probability as a maximum target.
In the embodiment of the invention, according to the first Huffman tree, the co-occurrence probability of each word and its corresponding context is calculated; with the maximum co-occurrence probability as the target, the word vector model is trained until it converges, and the converged word vector model is applied to the classification model, thereby improving the accuracy of recognizing the user's speech content.
Preferably, referring to fig. 8 in conjunction with fig. 5, a block diagram of a training system of a word vector model according to an embodiment of the present invention is shown, where the training system further includes:
the processing unit 504 is configured to perform word segmentation, special symbol removal, and stop word removal on the corpus corresponding to each application scenario.
Preferably, an embodiment of the present invention further provides an electronic device, where the electronic device is configured to run a program, and the program executes the above-disclosed training method for the word vector model when running.
Preferably, an embodiment of the present invention further provides a computer storage medium, where the storage medium includes a storage program, and when the program runs, the apparatus on which the storage medium is located is controlled to execute the above-disclosed training method for a word vector model.
In summary, the embodiments of the present invention provide a method and a system for training a word vector model, where the method includes: acquiring training corpora of different application scenes; determining a first Huffman tree corresponding to the word vector model by using the training corpora, where a second Huffman tree corresponding to each application scene is determined in advance by using the training corpus of that scene, and the root node of each second Huffman tree is taken as an inner node of the first Huffman tree; and training the word vector model according to the first Huffman tree until convergence, with the goal of maximizing the co-occurrence probability of each word and its corresponding context. In this scheme, the second Huffman tree of each application scene is determined by using the training corpus corresponding to that scene, and the root node of each second Huffman tree is taken as an inner node of the first Huffman tree. With the goal of maximizing the co-occurrence probability of each word and its corresponding context, the word vector model is trained according to the first Huffman tree until convergence, so that the influence of different application scenes on training the word vector model is fully utilized and the performance of the word vector model is improved.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, the system or system embodiments are substantially similar to the method embodiments and therefore are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for related points. The above-described system and system embodiments are only illustrative, wherein the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for training a word vector model, the method comprising:
acquiring training corpora corresponding to different application scenes, wherein each training corpus comprises a plurality of words of different word types;
determining a first Huffman tree corresponding to a word vector model by using the training corpora, wherein a second Huffman tree corresponding to each application scene is determined in advance by using the training corpus corresponding to each application scene, and a root node of each second Huffman tree is used as an inner node of the first Huffman tree;
and training the word vector model according to the first Huffman tree until convergence by taking the maximum co-occurrence probability of each word and the context corresponding to the word itself as a target.
2. The method according to claim 1, wherein the determining the second huffman tree corresponding to each of the application scenarios by using the corpus corresponding to each of the application scenarios in advance comprises:
determining the word type of a training corpus corresponding to each application scene;
and determining a second Huffman tree corresponding to the application scene by taking the application scene as a root node, taking the word type as an inner node and taking the words as leaf nodes.
3. The method of claim 1, wherein training the word vector model according to the first Huffman tree until convergence, by taking the maximum co-occurrence probability of each word and its corresponding context as a target, comprises:
for each of the words in the first Huffman tree, determining a context to which the word corresponds;
calculating the co-occurrence probability of each word and its corresponding context;
and training the word vector model until convergence with the maximum co-occurrence probability as a target.
4. The method according to claim 3, wherein after obtaining the corpus corresponding to different application scenarios, the method further comprises:
and performing word segmentation processing, special symbol removing processing and stop word removing processing on the training corpus corresponding to each application scene.
5. A system for training a word vector model, the system comprising:
an acquisition unit, used for acquiring training corpora corresponding to different application scenes, wherein each training corpus comprises a plurality of words of different word types;
a determining unit, configured to determine a first huffman tree corresponding to a word vector model by using the corpus, where the second huffman tree corresponding to each application scenario is determined by using the corpus corresponding to each application scenario in advance, and a root node of each second huffman tree is used as an inner node of the first huffman tree;
and the training unit is used for training the word vector model according to the first Huffman tree until convergence by taking the maximum co-occurrence probability of each word and the context corresponding to the word as a target.
6. The system of claim 5, wherein the determining unit comprises:
the first determining module is used for determining the word type of the training corpus corresponding to each application scene;
and the second determining module is used for determining a second Huffman tree corresponding to the application scene by taking the application scene as a root node, taking the word type as an inner node and taking the word as a leaf node.
7. The system of claim 5, wherein the training unit comprises:
a determining module, configured to determine, for each of the words in the first huffman tree, a context to which the word corresponds;
the calculation module is used for calculating the co-occurrence probability of each word and the context corresponding to the word;
and the training module is used for training the word vector model until convergence with the maximum co-occurrence probability as the target.
8. The system of claim 5, further comprising:
and the processing unit is used for performing word segmentation processing, special symbol removing processing and stop word removing processing on the training corpus corresponding to each application scene.
9. An electronic device, characterized in that the electronic device is configured to run a program, wherein the program when running performs the method of training a word vector model according to any one of claims 1-4.
10. A computer storage medium, characterized in that the storage medium comprises a stored program, wherein when the program is executed, the device on which the storage medium is located is controlled to perform the training method of the word vector model according to any one of claims 1-4.
CN202010100333.7A 2020-02-18 2020-02-18 Training method and system for word vector model Active CN111325026B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010100333.7A CN111325026B (en) 2020-02-18 2020-02-18 Training method and system for word vector model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010100333.7A CN111325026B (en) 2020-02-18 2020-02-18 Training method and system for word vector model

Publications (2)

Publication Number Publication Date
CN111325026A true CN111325026A (en) 2020-06-23
CN111325026B CN111325026B (en) 2023-10-10

Family

ID=71167150

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010100333.7A Active CN111325026B (en) 2020-02-18 2020-02-18 Training method and system for word vector model

Country Status (1)

Country Link
CN (1) CN111325026B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101030726B1 (en) * 2009-11-26 2011-04-26 명지대학교 산학협력단 Memory efficient multimedia huffman decoding method and apparatus for adapting huffman table based on symbol from probability table
CN106897265A (en) * 2017-01-12 2017-06-27 北京航空航天大学 Term vector training method and device
CN107102981A (en) * 2016-02-19 2017-08-29 腾讯科技(深圳)有限公司 Term vector generation method and device
US20170286914A1 (en) * 2016-04-05 2017-10-05 Facebook, Inc. Systems and methods to develop training set of data based on resume corpus
US20180357531A1 (en) * 2015-11-27 2018-12-13 Devanathan GIRIDHARI Method for Text Classification and Feature Selection Using Class Vectors and the System Thereof
CN110413779A (en) * 2019-07-16 2019-11-05 深圳供电局有限公司 It is a kind of for the term vector training method and its system of power industry, medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101030726B1 (en) * 2009-11-26 2011-04-26 명지대학교 산학협력단 Memory efficient multimedia huffman decoding method and apparatus for adapting huffman table based on symbol from probability table
US20180357531A1 (en) * 2015-11-27 2018-12-13 Devanathan GIRIDHARI Method for Text Classification and Feature Selection Using Class Vectors and the System Thereof
CN107102981A (en) * 2016-02-19 2017-08-29 腾讯科技(深圳)有限公司 Term vector generation method and device
US20170286914A1 (en) * 2016-04-05 2017-10-05 Facebook, Inc. Systems and methods to develop training set of data based on resume corpus
CN106897265A (en) * 2017-01-12 2017-06-27 北京航空航天大学 Term vector training method and device
CN110413779A (en) * 2019-07-16 2019-11-05 深圳供电局有限公司 It is a kind of for the term vector training method and its system of power industry, medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
周练: "Word2vec的工作原理及应用探究" (Zhou Lian, "The Working Principle and Applications of Word2vec") *

Also Published As

Publication number Publication date
CN111325026B (en) 2023-10-10

Similar Documents

Publication Publication Date Title
CN112528672B (en) Aspect-level emotion analysis method and device based on graph convolution neural network
CN108829822B (en) Media content recommendation method and device, storage medium and electronic device
JP6222821B2 (en) Error correction model learning device and program
CN108062954B (en) Speech recognition method and device
CN111125349A (en) Graph model text abstract generation method based on word frequency and semantics
CN110222163A (en) A kind of intelligent answer method and system merging CNN and two-way LSTM
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
JP5932869B2 (en) N-gram language model unsupervised learning method, learning apparatus, and learning program
CN111414746B (en) Method, device, equipment and storage medium for determining matching statement
CN109918663A (en) A kind of semantic matching method, device and storage medium
CN111694940A (en) User report generation method and terminal equipment
CN108960574A (en) Quality determination method, device, server and the storage medium of question and answer
CN106875936A (en) Audio recognition method and device
CN113096647B (en) Voice model training method and device and electronic equipment
CN111462751A (en) Method, apparatus, computer device and storage medium for decoding voice data
Deena et al. Semi-supervised adaptation of RNNLMs by fine-tuning with domain-specific auxiliary features
CN112233651A (en) Dialect type determining method, dialect type determining device, dialect type determining equipment and storage medium
CN114818729A (en) Method, device and medium for training semantic recognition model and searching sentence
CN112434533B (en) Entity disambiguation method, entity disambiguation device, electronic device, and computer-readable storage medium
CN110895656A (en) Text similarity calculation method and device, electronic equipment and storage medium
CN112446219A (en) Chinese request text intention analysis method
CN109783648B (en) Method for improving ASR language model by using ASR recognition result
CN111325026B (en) Training method and system for word vector model
CN111477212A (en) Content recognition, model training and data processing method, system and equipment
CN111739518B (en) Audio identification method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant