CN110246486B - Training method, device and equipment of voice recognition model - Google Patents
- Publication number
- CN110246486B (application number CN201910477604.8A)
- Authority
- CN
- China
- Prior art keywords
- speech recognition
- model
- training
- prediction result
- submodels
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G10L15/063 — Training (creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
- G10L15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L15/26 — Speech to text systems
Abstract
The invention provides a training method, apparatus and device for a speech recognition model, wherein the method comprises the following steps: acquiring a speech signal to be trained; inputting the speech signal to be trained into a plurality of speech recognition submodels to generate a plurality of prediction result vectors; generating, according to the plurality of prediction result vectors, a target translation decided jointly by the plurality of speech recognition submodels; and training the plurality of speech recognition submodels according to the target translation and the plurality of prediction result vectors generated by each speech recognition submodel. The speech recognition model is thus trained with a target translation produced by the joint decision of the plurality of speech recognition submodels, which improves the quality of speech recognition.
Description
Technical Field
The invention relates to the technical field of speech recognition, and in particular to a method, apparatus and device for training a speech recognition model.
Background
With the development of artificial intelligence, speech recognition technology has made great progress and is entering fields such as household appliances, communications, automobiles and medical treatment.
In the related art, a speech recognition model is usually trained on a single selected model structure. Because each model structure has its own strengths and weaknesses, and because the training corpus is limited in scale, such a model easily falls into a local optimum, and the quality of the speech recognition results leaves room for improvement.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, a first objective of the present invention is to provide a method for training a speech recognition model, in which a target translation generated jointly by a plurality of speech recognition submodels is used to train the model, so as to improve the quality of speech recognition.
The second objective of the present invention is to provide a training device for a speech recognition model.
A third object of the invention is to propose a computer device.
A fourth object of the invention is to propose a computer-readable storage medium.
The embodiment of the first aspect of the invention provides a method for training a speech recognition model, wherein the speech recognition model comprises a plurality of speech recognition submodels, and the method comprises the following steps:
acquiring a voice signal to be trained;
inputting the voice signal to be trained into a plurality of voice recognition submodels to generate a plurality of prediction result vectors;
generating, according to the plurality of prediction result vectors, a target translation decided jointly by the plurality of speech recognition submodels; and
training the plurality of speech recognition submodels according to the target translation and the plurality of prediction result vectors generated by each speech recognition submodel.
According to the training method of the speech recognition model of the embodiment of the invention, the speech signal to be trained is acquired and then input into a plurality of speech recognition submodels to generate a plurality of prediction result vectors. Further, a target translation decided jointly by the plurality of speech recognition submodels is generated according to the plurality of prediction result vectors, and the submodels are trained according to the target translation and the prediction result vectors generated by each submodel. Training the submodels with a jointly decided target translation reduces the tendency of a single model to fall into a local optimum, and the mutual-learning strategy among the submodels improves the quality of speech recognition.
In addition, the training method of the speech recognition model according to the above embodiment of the present invention may further have the following additional technical features:
Optionally, the plurality of speech recognition submodels comprises a plurality of models selected from a Transformer model, an RNN model, a CNN model, a CTC model and a GHMM model.
Optionally, generating, according to the plurality of prediction result vectors, the target translation decided jointly by the plurality of speech recognition submodels includes: generating a set of prediction result vectors from the plurality of prediction result vectors; and generating the target translation according to the set of prediction result vectors.
Optionally, the plurality of speech recognition submodels are trained by a loss function, where y_avg is the target translation, y_i is the prediction result vector of the i-th model, and n is the number of speech recognition submodels.
The embodiment of the second aspect of the present invention provides a training device for a speech recognition model, where the speech recognition model includes a plurality of speech recognition submodels, and the device includes:
the acquisition module is used for acquiring a voice signal to be trained;
the processing module is used for inputting the voice signal to be trained into a plurality of voice recognition submodels to generate a plurality of prediction result vectors;
the generating module is used for generating, according to the plurality of prediction result vectors, a target translation decided jointly by the plurality of speech recognition submodels; and
the training module is used for training the plurality of speech recognition submodels according to the target translation and the plurality of prediction result vectors generated by each speech recognition submodel.
According to the training device of the speech recognition model of the embodiment of the invention, the speech signal to be trained is acquired and then input into a plurality of speech recognition submodels to generate a plurality of prediction result vectors. Further, a target translation decided jointly by the plurality of speech recognition submodels is generated according to the plurality of prediction result vectors, and the submodels are trained according to the target translation and the prediction result vectors generated by each submodel. Training the submodels with a jointly decided target translation reduces the tendency of a single model to fall into a local optimum, and the mutual-learning strategy among the submodels improves the quality of speech recognition.
In addition, the training device of the speech recognition model according to the above embodiment of the present invention may further have the following additional technical features:
Optionally, the plurality of speech recognition submodels comprises a plurality of models selected from a Transformer model, an RNN model, a CNN model, a CTC model and a GHMM model.
Optionally, the generating module is specifically configured to: generating a set of predictor vectors from the plurality of predictor vectors; and generating the target translation according to the prediction result vector set.
Optionally, the plurality of speech recognition submodels are trained by a loss function, where y_avg is the target translation, y_i is the prediction result vector of the i-th model, and n is the number of speech recognition submodels.
An embodiment of a third aspect of the present invention provides a computer device, including a processor and a memory; wherein the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory, so as to implement the training method of the speech recognition model according to the embodiment of the first aspect.
A fourth aspect of the present invention provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the method for training a speech recognition model according to the first aspect.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
Fig. 1 is a schematic flowchart illustrating a method for training a speech recognition model according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a training apparatus for speech recognition models according to an embodiment of the present invention;
FIG. 3 is a block diagram of an exemplary computer device suitable for implementing embodiments of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
The following describes a method, an apparatus, and a device for training a speech recognition model according to an embodiment of the present invention with reference to the drawings.
Fig. 1 is a schematic flowchart of a method for training a speech recognition model according to an embodiment of the present invention, as shown in fig. 1, the method includes:
Step 101: acquiring a speech signal to be trained.
In this embodiment, when training the speech recognition model, the speech signal to be trained may be acquired first. For example, a speech signal may be collected by a speech receiving device such as a microphone and used as the speech signal to be trained. As another example, speech signal data may be obtained from a labelling platform and used as the speech signal to be trained.
The speech signal to be trained can be in any language, such as Chinese, English or Russian, and can be selected according to the requirements of the speech recognition model.
Step 102: inputting the speech signal to be trained into a plurality of speech recognition submodels to generate a plurality of prediction result vectors.
In this embodiment, a plurality of speech recognition submodels may be preset; the speech signal to be trained is input into each speech recognition submodel for processing, and each submodel outputs its corresponding prediction result vectors.
Speech recognition submodels with certain differences among them may be adopted to ensure the effect of their collaborative learning. The submodels may be end-to-end models; for example, a submodel may be a Transformer model, an RNN (Recurrent Neural Network) model or a CNN (Convolutional Neural Network) model. Optionally, a submodel may also be implemented with CTC (Connectionist Temporal Classification) or a GHMM (Gaussian mixture-hidden Markov model), so the submodels are not limited to end-to-end models.
As an example, consider a single speech recognition submodel. The speech signal to be trained is input into the submodel for processing, and prediction result vectors are output; the prediction result vector at time t is y_t = [e(t,0), …, e(t,j), …, e(t,V-1)], where V denotes the size of the vocabulary and e(t,j) denotes the probability of predicting the j-th word in the vocabulary at time t. That is, the prediction result vector gives the probability of each entry in the vocabulary. For English, V may be 26, representing the 26 letters, and y_t contains the probability of each letter at time t; for Chinese, V represents the number of Chinese characters, and y_t contains the probability of each character at time t. The recognition result predicted by the speech recognition submodel can thus be determined from its prediction result vectors.
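To make the prediction result vector concrete, the following sketch builds a toy y_t over the 26-letter English vocabulary described above and reads off the predicted token as the highest-probability entry. The helper names are illustrative only and do not come from the patent:

```python
# Hypothetical sketch of the prediction result vector described above:
# y_t holds one probability e(t, j) per vocabulary entry, and the token
# predicted at time t is the entry with the highest probability.

VOCAB = [chr(ord('a') + k) for k in range(26)]  # V = 26: the English letters

def predict_token(y_t, vocab=VOCAB):
    """Return the vocabulary entry with the highest probability in y_t."""
    if len(y_t) != len(vocab):
        raise ValueError("y_t must have one probability per vocabulary entry")
    j = max(range(len(y_t)), key=lambda k: y_t[k])
    return vocab[j]

# A toy y_t that puts most of its mass on the third letter 'c' (index 2):
y_t = [0.0] * 26
y_t[2] = 0.9
y_t[0] = 0.1
```

With this toy vector, `predict_token(y_t)` returns `'c'`; a real submodel would produce one such vector per decoding step.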
It should be noted that, in this embodiment, for each sub-model of a plurality of speech recognition sub-models, reference may be made to the description of a single speech recognition sub-model in the above example, which is not described herein again.
Optionally, the plurality of speech recognition submodels may be trained in advance according to the labeled speech training data, and then the speech signal to be trained is input into the plurality of speech recognition submodels to generate a plurality of prediction result vectors. For example, the speech training data labeled with the corresponding recognition text may be collected in advance, and the processing parameters of the speech recognition submodel may be trained based on a supervised training mode through the speech training data, so that the speech recognition submodel is input as a speech signal and output as a corresponding recognition text.
In practical applications, the processing parameters of a predetermined model are usually trained on labelled speech training data to generate a speech recognition model whose input is a speech signal and whose output is the corresponding text. For example, for end-to-end speech recognition, a Transformer model can be used to recognize a speech signal and obtain the recognition text. However, because different models have different strengths and weaknesses, processing by a single model tends to fall into a local optimum: limited by its decoding direction, a left-to-right decoding model tends to yield good prefixes and poor suffixes, while a right-to-left decoding model tends to yield good suffixes and poor prefixes. Collaborative training of a plurality of speech recognition submodels therefore avoids the tendency of a single model to fall into a local optimum and improves the quality of speech recognition.
Step 103: generating, according to the plurality of prediction result vectors, a target translation decided jointly by the plurality of speech recognition submodels.
In the embodiment of the invention, a target translation decided jointly by the plurality of speech recognition submodels can be generated according to the plurality of prediction result vectors, so that the submodels are trained against this jointly decided target translation.
In an embodiment of the present invention, a set of prediction result vectors may be generated according to a plurality of prediction result vectors, and then a target translation may be generated according to the set of prediction result vectors.
As an example, for a segment of the speech signal to be trained, each speech recognition submodel outputs a sequence of prediction result vectors; the set of prediction result vectors produced by the i-th submodel is Y_i = [y_0, y_1, …, y_t, …, y_n], where Y_i is the output obtained by the i-th speech recognition submodel from the speech signal to be trained and is used to determine the recognition text corresponding to that signal. The target translation is then obtained by averaging the outputs Y_i of the plurality of submodels: for example, the vectors y_t output by each submodel at time t are averaged, and the target translation at that time is determined from the vector of averaged values. It is understood that different submodels may output different prediction result vectors for the same speech signal to be trained; for example, if submodel 1 outputs y_t = [a1, b1, c1] and submodel 2 outputs y_t = [a2, b2, c2], the vector of the jointly decided target translation is determined by averaging as [(a1+a2)/2, (b1+b2)/2, (c1+c2)/2]. The specific implementation of the averaging may be selected as needed and is not limited here.
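The averaging step above can be sketched as follows (illustrative helper names, not code from the patent): the jointly decided vector at time t is the element-wise mean of the submodels' vectors at that time.

```python
# Minimal sketch of the common-decision step: the target translation vector
# at time t is the element-wise average of the prediction result vectors
# output by the individual submodels at that time.

def average_predictions(vectors):
    """Element-wise average of equal-length prediction result vectors."""
    n = len(vectors)
    length = len(vectors[0])
    if any(len(v) != length for v in vectors):
        raise ValueError("all prediction result vectors must have equal length")
    return [sum(v[j] for v in vectors) / n for j in range(length)]

# Two submodels disagree at time t, as in the [a1, b1, c1] / [a2, b2, c2]
# example from the text; the target translation vector is their pairwise mean.
y1 = [0.75, 0.25, 0.0]   # submodel 1 at time t
y2 = [0.25, 0.25, 0.5]   # submodel 2 at time t
y_avg = average_predictions([y1, y2])  # [0.5, 0.25, 0.25]
```

The same call extends unchanged to n submodels, since the mean is taken position by position over however many vectors are supplied.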
As another possible implementation, the speech signal to be trained may be labelled, and a translation determined from the labelled recognition result. For example, if the translation corresponding to y_0 is the third entry in the vocabulary, the vector of the translation corresponding to y_0 is the one-hot vector [0, 0, 1, 0, …, 0] of length V.
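The labelled-reference alternative amounts to building a one-hot vector of length V for each labelled token, as the small sketch below shows (the helper name is hypothetical):

```python
# Small sketch of the labelled-reference alternative described above: the
# translation for a labelled token is a one-hot vector of length V with a
# 1.0 at that token's index in the vocabulary.

def one_hot(index, vocab_size):
    """One-hot vector of length vocab_size with a 1.0 at position index."""
    if not 0 <= index < vocab_size:
        raise ValueError("index out of range")
    vec = [0.0] * vocab_size
    vec[index] = 1.0
    return vec

# The example from the text: y_0 labelled as the third vocabulary entry.
y0_ref = one_hot(2, 26)  # 1.0 at index 2, zeros elsewhere
```

Unlike the averaged target translation, a one-hot reference assigns all probability mass to the single labelled token.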
It should be noted that the manner of obtaining the target translation may be selected according to actual needs; for example, to address the inconsistency between decoding and training of a decoding model, the target translation may be determined from the prediction results output by the models, which is not limited here.
Step 104: training the plurality of speech recognition submodels according to the target translation and the plurality of prediction result vectors generated by each speech recognition submodel.
In this embodiment, after the jointly decided target translation is generated from the plurality of prediction result vectors, the plurality of speech recognition submodels may be trained according to that target translation and the prediction result vectors, adjusting the processing parameters of each submodel. Because the processing parameters are trained against a target translation produced by joint decision, the quality of speech recognition is improved compared with a recognition result produced by a single model.
As an example, the plurality of speech recognition submodels are trained by a loss function in which y_avg is the target translation, y_i is the prediction result vector of the i-th model, and n is the number of speech recognition submodels.
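The loss formula itself is not reproduced in this text, so the sketch below shows only one plausible form consistent with the stated definitions: each submodel i is penalised by the squared distance between its prediction result vector y_i and the jointly decided target translation y_avg, averaged over the n submodels. The exact form in the patent may differ (e.g. a cross-entropy against y_avg), and the helper name is illustrative:

```python
# Hedged sketch of a loss consistent with the definitions above: average over
# the n submodels of the squared distance between each submodel's prediction
# result vector y_i and the target translation y_avg. This is an assumed
# form, not necessarily the exact formula in the patent.

def collaborative_loss(y_avg, predictions):
    """Average over submodels of the squared distance from y_avg."""
    n = len(predictions)
    return sum(
        sum((a - p) ** 2 for a, p in zip(y_avg, y_i))
        for y_i in predictions
    ) / n

# With the two-submodel example used earlier, each submodel's vector lies at
# squared distance 0.125 from the average, so the loss is 0.125.
y_avg = [0.5, 0.25, 0.25]
preds = [[0.75, 0.25, 0.0], [0.25, 0.25, 0.5]]
loss = collaborative_loss(y_avg, preds)  # 0.125
```

A loss of zero is reached exactly when every submodel agrees with the jointly decided target, which is the behaviour the mutual-learning strategy drives toward.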
According to the training method of the speech recognition model of the embodiment of the invention, the speech signal to be trained is acquired and then input into a plurality of speech recognition submodels to generate a plurality of prediction result vectors. Further, a target translation decided jointly by the plurality of speech recognition submodels is generated according to the plurality of prediction result vectors, and the submodels are trained according to the target translation and the prediction result vectors generated by each submodel. Training the submodels with a jointly decided target translation reduces the tendency of a single model to fall into a local optimum, and the mutual-learning strategy among the submodels improves the quality of speech recognition.
In order to implement the above embodiments, the present invention further provides a training device for a speech recognition model.
Fig. 2 is a schematic structural diagram of a training apparatus for a speech recognition model according to an embodiment of the present invention, and as shown in fig. 2, the training apparatus for a speech recognition model includes: the system comprises an acquisition module 100, a processing module 200, a generation module 300 and a training module 400.
The obtaining module 100 is configured to obtain a speech signal to be trained.
The processing module 200 is configured to input a speech signal to be trained into a plurality of speech recognition submodels to generate a plurality of prediction result vectors.
And a generating module 300, configured to generate a target translation that is jointly decided by multiple speech recognition submodels according to multiple prediction result vectors.
The training module 400 is configured to train a plurality of speech recognition submodels according to the target translation and a plurality of prediction result vectors generated by each speech recognition submodel.
In one embodiment of the invention, the plurality of speech recognition submodels includes a plurality of models selected from a Transformer model, an RNN model, a CNN model, a CTC model and a GHMM model.
In an embodiment of the present invention, the generating module 300 is specifically configured to: generating a set of predictor vectors from the plurality of predictor vectors; and generating a target translation according to the vector set of the prediction results.
In one embodiment of the invention, the plurality of speech recognition submodels are trained by a loss function, where y_avg is the target translation, y_i is the prediction result vector of the i-th model, and n is the number of speech recognition submodels.
It should be noted that, the explanation of the training method of the speech recognition model in the foregoing embodiment is also applicable to the training device of the speech recognition model in this embodiment, and details are not repeated here.
According to the training device of the speech recognition model of the embodiment of the invention, the speech signal to be trained is acquired and then input into a plurality of speech recognition submodels to generate a plurality of prediction result vectors. Further, a target translation decided jointly by the plurality of speech recognition submodels is generated according to the plurality of prediction result vectors, and the submodels are trained according to the target translation and the prediction result vectors generated by each submodel. Training the submodels with a jointly decided target translation reduces the tendency of a single model to fall into a local optimum, and the mutual-learning strategy among the submodels improves the quality of speech recognition.
In order to implement the above embodiments, the present invention further provides a computer device, including a processor and a memory; wherein, the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory, so as to implement the training method of the speech recognition model according to any one of the foregoing embodiments.
In order to implement the above embodiments, the present invention further provides a computer program product, wherein when the instructions in the computer program product are executed by a processor, the method for training a speech recognition model according to any of the foregoing embodiments is implemented.
In order to implement the above embodiments, the present invention further provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the training method of the speech recognition model according to any of the foregoing embodiments.
FIG. 3 is a block diagram of an exemplary computer device suitable for implementing embodiments of the present invention. The computer device 12 shown in FIG. 3 is only an example and should not impose any limitation on the scope of use or functionality of embodiments of the present invention.
As shown in FIG. 3, computer device 12 is in the form of a general purpose computing device. The components of computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16.
Computer device 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally perform the functions and/or methodologies of the embodiments described herein.
The computer device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with the computer system/server 12, and/or with any devices (e.g., network card, modem, etc.) that enable the computer system/server 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Moreover, computer device 12 may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public Network such as the Internet) via Network adapter 20. As shown, network adapter 20 communicates with the other modules of computer device 12 via bus 18. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with computer device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processing unit 16 executes various functional applications and data processing, for example, implementing the methods mentioned in the foregoing embodiments, by executing programs stored in the system memory 28.
In the description of the present invention, it is to be understood that the terms "first", "second" and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.
Claims (8)
1. A method of training a speech recognition model, the speech recognition model comprising a plurality of speech recognition submodels, the method comprising:
acquiring a speech signal to be used for training;
inputting the speech signal into the plurality of speech recognition submodels to generate a plurality of prediction result vectors;
generating, according to the plurality of prediction result vectors, a target translation jointly decided by the plurality of speech recognition submodels; and
training the plurality of speech recognition submodels according to the target translation and the prediction result vector generated by each speech recognition submodel;
wherein the plurality of speech recognition submodels are trained by a loss function,
wherein y_avg is the target translation, y_i is the prediction result vector of the i-th speech recognition submodel, and n is the number of speech recognition submodels.
2. The method of training a speech recognition model according to claim 1, wherein the plurality of speech recognition submodels comprise a plurality of models selected from a Transformer model, an RNN model, a CNN model, a CTC model, and a GHMM model.
3. The method for training a speech recognition model according to claim 1, wherein generating, according to the plurality of prediction result vectors, the target translation jointly decided by the plurality of speech recognition submodels comprises:
generating a prediction result vector set from the plurality of prediction result vectors; and
generating the target translation according to the prediction result vector set.
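The training procedure of claims 1 and 3 can be sketched in a few lines. The NumPy sketch below rests on two loud assumptions: the "common decision" is taken to be the element-wise average of the prediction result vectors (consistent with the name y_avg, though the patent's loss formula is not reproduced in this text), and an MSE-style per-submodel loss stands in for that formula. All function names are illustrative, not from the patent.

```python
import numpy as np

def joint_decision_target(predictions):
    """Combine the submodels' prediction result vectors into a target
    translation. Assumption: the 'common decision' is the element-wise
    average of the n prediction vectors."""
    return np.mean(predictions, axis=0)

def per_submodel_losses(predictions, y_avg):
    """Assumed MSE-style loss between each submodel's prediction y_i and
    the jointly decided target y_avg (the patent's exact loss is elided)."""
    return [float(np.mean((y_i - y_avg) ** 2)) for y_i in predictions]

# Toy example: n = 3 submodels, each emitting a 4-dimensional prediction vector
preds = [np.array([0.9, 0.1, 0.0, 0.0]),
         np.array([0.7, 0.2, 0.1, 0.0]),
         np.array([0.8, 0.1, 0.1, 0.0])]
y_avg = joint_decision_target(preds)
losses = per_submodel_losses(preds, y_avg)
```

Each submodel would then be updated against its own loss term, so that all submodels are pulled toward the jointly decided target rather than toward their individual outputs alone.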
4. An apparatus for training a speech recognition model, the speech recognition model comprising a plurality of speech recognition submodels, the apparatus comprising:
an acquisition module, configured to acquire a speech signal to be used for training;
a processing module, configured to input the speech signal into the plurality of speech recognition submodels to generate a plurality of prediction result vectors;
a generating module, configured to generate, according to the plurality of prediction result vectors, a target translation jointly decided by the plurality of speech recognition submodels; and
a training module, configured to train the plurality of speech recognition submodels according to the target translation and the prediction result vector generated by each speech recognition submodel;
wherein the plurality of speech recognition submodels are trained by a loss function,
wherein y_avg is the target translation, y_i is the prediction result vector of the i-th speech recognition submodel, and n is the number of speech recognition submodels.
5. The apparatus for training a speech recognition model according to claim 4, wherein the plurality of speech recognition submodels comprise a plurality of models selected from a Transformer model, an RNN model, a CNN model, a CTC model, and a GHMM model.
6. The apparatus for training a speech recognition model according to claim 4, wherein the generating module is specifically configured to:
generating a prediction result vector set from the plurality of prediction result vectors; and
generating the target translation according to the prediction result vector set.
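The apparatus of claims 4 to 6 partitions the training procedure into four modules. The hypothetical class below mirrors that structure with one method per module; the submodels are stood in for by simple callables, and the averaging joint decision and MSE-style loss are assumptions, since the patent's exact loss formula is not reproduced in this text.

```python
import numpy as np

class SpeechRecognitionTrainer:
    """Illustrative mirror of the apparatus in claims 4-6, one method per
    module. All names are hypothetical, not from the patent."""

    def __init__(self, submodels):
        self.submodels = submodels  # the n speech recognition submodels

    def acquire(self, raw):                 # acquisition module
        return np.asarray(raw, dtype=float)

    def predict_all(self, signal):          # processing module
        return [model(signal) for model in self.submodels]

    def joint_target(self, predictions):    # generating module
        # Assumed common decision: element-wise average of the vector set
        return np.mean(predictions, axis=0)

    def train_step(self, signal):           # training module
        preds = self.predict_all(signal)
        target = self.joint_target(preds)
        # Assumed per-submodel MSE loss against the joint target
        return [float(np.mean((p - target) ** 2)) for p in preds]

# Two toy "submodels" that simply scale the input signal
trainer = SpeechRecognitionTrainer([lambda s: 0.5 * s, lambda s: 1.5 * s])
losses = trainer.train_step(trainer.acquire([1.0, 2.0]))
```

Keeping one method per claimed module makes the correspondence between the apparatus claims and the method claims explicit: `train_step` composes the other three exactly as claim 1 composes its steps.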
7. A computer device comprising a processor and a memory;
wherein the processor, by reading the executable program code stored in the memory, runs a program corresponding to the executable program code, so as to implement the method of training a speech recognition model according to any one of claims 1 to 3.
8. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out a method of training a speech recognition model according to any one of claims 1 to 3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910477604.8A CN110246486B (en) | 2019-06-03 | 2019-06-03 | Training method, device and equipment of voice recognition model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910477604.8A CN110246486B (en) | 2019-06-03 | 2019-06-03 | Training method, device and equipment of voice recognition model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110246486A CN110246486A (en) | 2019-09-17 |
CN110246486B true CN110246486B (en) | 2021-07-13 |
Family
ID=67885808
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910477604.8A Active CN110246486B (en) | 2019-06-03 | 2019-06-03 | Training method, device and equipment of voice recognition model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110246486B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210312294A1 (en) * | 2020-04-03 | 2021-10-07 | International Business Machines Corporation | Training of model for processing sequence data |
CN112885330A (en) * | 2021-01-26 | 2021-06-01 | 北京云上曲率科技有限公司 | Language identification method and system based on low-resource audio |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090138265A1 (en) * | 2007-11-26 | 2009-05-28 | Nuance Communications, Inc. | Joint Discriminative Training of Multiple Speech Recognizers |
CN103794214A (en) * | 2014-03-07 | 2014-05-14 | 联想(北京)有限公司 | Information processing method, device and electronic equipment |
US20150199960A1 (en) * | 2012-08-24 | 2015-07-16 | Microsoft Corporation | I-Vector Based Clustering Training Data in Speech Recognition |
CN108597502A (en) * | 2018-04-27 | 2018-09-28 | 上海适享文化传播有限公司 | Field speech recognition training method based on dual training |
CN109710727A (en) * | 2017-10-26 | 2019-05-03 | 哈曼国际工业有限公司 | System and method for natural language processing |
Non-Patent Citations (1)
Title |
---|
"Research on Out-of-Vocabulary Word Translation in Neural Network Translation Models for Chinese-English Patent Documents"; Zheng Xiaokang; China Master's Theses Full-text Database, Information Science and Technology; 2018-01-15 (No. 01); pp. 6-46 *
Also Published As
Publication number | Publication date |
---|---|
CN110246486A (en) | 2019-09-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP7432556B2 (en) | Methods, devices, equipment and media for man-machine interaction | |
CN109003624B (en) | Emotion recognition method and device, computer equipment and storage medium | |
CN108985358B (en) | Emotion recognition method, device, equipment and storage medium | |
CN110162800B (en) | Translation model training method and device | |
CN110275939B (en) | Method and device for determining conversation generation model, storage medium and electronic equipment | |
CN108922564B (en) | Emotion recognition method and device, computer equipment and storage medium | |
Taniguchi et al. | Spatial concept acquisition for a mobile robot that integrates self-localization and unsupervised word discovery from spoken sentences | |
CN113035311B (en) | Medical image report automatic generation method based on multi-mode attention mechanism | |
CN109961041B (en) | Video identification method and device and storage medium | |
CN108846124B (en) | Training method, training device, computer equipment and readable storage medium | |
CN106340297A (en) | Speech recognition method and system based on cloud computing and confidence calculation | |
CN111653274B (en) | Wake-up word recognition method, device and storage medium | |
CN113779310B (en) | Video understanding text generation method based on hierarchical representation network | |
CN110246486B (en) | Training method, device and equipment of voice recognition model | |
JP7178394B2 (en) | Methods, apparatus, apparatus, and media for processing audio signals | |
CN110991195A (en) | Machine translation model training method, device and storage medium | |
US12027162B2 (en) | Noisy student teacher training for robust keyword spotting | |
Sugiura et al. | Situated spoken dialogue with robots using active learning | |
WO2023155676A1 (en) | Method and apparatus for processing translation model, and computer-readable storage medium | |
CN110287999B (en) | Story generation method and device based on hidden variable model | |
CN108829896B (en) | Reply information feedback method and device | |
CN116401364A (en) | Language model training method, electronic device, storage medium and product | |
CN113781998B (en) | Speech recognition method, device, equipment and medium based on dialect correction model | |
CN112541557B (en) | Training method and device for generating countermeasure network and electronic equipment | |
CN113077535B (en) | Model training and mouth motion parameter acquisition method, device, equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||