CN110246486B - Training method, device and equipment of voice recognition model - Google Patents

Training method, device and equipment of voice recognition model

Info

Publication number
CN110246486B
CN110246486B (application CN201910477604.8A)
Authority
CN
China
Prior art keywords
speech recognition
model
training
prediction result
submodels
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910477604.8A
Other languages
Chinese (zh)
Other versions
CN110246486A (en)
Inventor
熊皓 (Xiong Hao)
张睿卿 (Zhang Ruiqing)
张传强 (Zhang Chuanqiang)
何中军 (He Zhongjun)
李芝 (Li Zhi)
吴华 (Wu Hua)
王海峰 (Wang Haifeng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201910477604.8A
Publication of CN110246486A
Application granted
Publication of CN110246486B
Legal status: Active
Anticipated expiration

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 — Speech recognition
    • G10L 15/06 — Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 — Training
    • G10L 15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26 — Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a training method, device, and equipment for a speech recognition model. The method includes: acquiring a speech signal to be trained; inputting the speech signal to be trained into a plurality of speech recognition submodels to generate a plurality of prediction result vectors; generating, from the plurality of prediction result vectors, a target translation jointly decided by the plurality of speech recognition submodels; and training the plurality of speech recognition submodels according to the target translation and the prediction result vectors generated by each submodel. Training the speech recognition model on a target translation jointly decided by the plurality of submodels improves the quality of speech recognition.

Description

Training method, device and equipment of a speech recognition model
Technical Field
The invention relates to the technical field of speech recognition, and in particular to a training method, device, and equipment for a speech recognition model.
Background
With the development of artificial intelligence technology, speech recognition has made great progress and is entering fields such as household appliances, communications, automobiles, and medical care.
In the related art, a speech recognition model is usually trained on a single selected model structure. Because every model architecture has its own strengths and weaknesses, and because the scale of the training corpus is limited, such a model easily falls into a local optimum, so the quality of the speech recognition results still needs improvement.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, a first objective of the present invention is to provide a method for training a speech recognition model, in which the model is trained on a target translation jointly decided by a plurality of speech recognition submodels, so as to improve the quality of speech recognition.
A second objective of the present invention is to provide a training device for a speech recognition model.
A third object of the invention is to propose a computer device.
A fourth object of the invention is to propose a computer-readable storage medium.
An embodiment of the first aspect of the invention provides a method for training a speech recognition model, where the speech recognition model includes a plurality of speech recognition submodels, and the method includes the following steps:
acquiring a speech signal to be trained;
inputting the speech signal to be trained into the plurality of speech recognition submodels to generate a plurality of prediction result vectors;
generating, from the plurality of prediction result vectors, a target translation jointly decided by the plurality of speech recognition submodels; and
training the plurality of speech recognition submodels according to the target translation and the prediction result vectors generated by each speech recognition submodel.
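The four steps above can be sketched end to end. Everything below — the stub submodel class, the softmax prediction, and elementwise averaging as the joint decision — is an illustrative assumption for the sketch, not the patent's implementation:

```python
import numpy as np

# Illustrative stand-in for a speech recognition submodel: it maps a
# signal to a prediction result vector (a probability distribution).
class StubSubmodel:
    def __init__(self, weights):
        self.weights = np.asarray(weights, dtype=float)

    def predict(self, signal):
        # Step 2: each submodel turns the signal into a prediction
        # result vector via a (numerically stable) softmax.
        scores = self.weights * signal
        exp = np.exp(scores - scores.max())
        return exp / exp.sum()

def joint_decision(predictions):
    # Step 3: a common decision by elementwise averaging of the
    # submodels' prediction result vectors.
    return np.mean(predictions, axis=0)

signal = np.array([1.0, 2.0, 3.0])                # Step 1: signal to be trained
submodels = [StubSubmodel([0.1, 0.2, 0.3]),
             StubSubmodel([0.3, 0.2, 0.1])]
predictions = [m.predict(signal) for m in submodels]
y_avg = joint_decision(predictions)               # Step 4 trains each submodel toward y_avg
```

Step 4 would then adjust each submodel's parameters toward y_avg, as described in the detailed description.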
In the training method of the speech recognition model according to the embodiment of the invention, a speech signal to be trained is acquired and input into a plurality of speech recognition submodels to generate a plurality of prediction result vectors. A target translation jointly decided by the plurality of submodels is then generated from the prediction result vectors, and the submodels are trained according to the target translation and the prediction result vectors generated by each submodel. Training the submodels on a jointly decided target translation mitigates the tendency of a single model to fall into a local optimum and, through this learning strategy among the submodels, improves the quality of speech recognition.
In addition, the training method of the speech recognition model according to the above embodiment of the present invention may further have the following additional technical features:
Optionally, the plurality of speech recognition submodels include two or more of: a Transformer model, an RNN model, a CNN model, a CTC model, and a GHMM model.
Optionally, generating the target translation jointly decided by the plurality of speech recognition submodels from the plurality of prediction result vectors includes: generating a set of prediction result vectors from the plurality of prediction result vectors; and generating the target translation from the set of prediction result vectors.
Optionally, the plurality of speech recognition submodels are trained with a loss function of the form

loss = (1/n) · Σ_{i=1}^{n} ‖y_avg − y_i‖²

where y_avg is the target translation, y_i is the prediction result vector of the i-th model, and n is the number of speech recognition submodels.
An embodiment of the second aspect of the present invention provides a training device for a speech recognition model, where the speech recognition model includes a plurality of speech recognition submodels, and the device includes:
an acquisition module, configured to acquire a speech signal to be trained;
a processing module, configured to input the speech signal to be trained into the plurality of speech recognition submodels to generate a plurality of prediction result vectors;
a generating module, configured to generate, from the plurality of prediction result vectors, a target translation jointly decided by the plurality of speech recognition submodels; and
a training module, configured to train the plurality of speech recognition submodels according to the target translation and the prediction result vectors generated by each speech recognition submodel.
In the training device of the speech recognition model according to the embodiment of the invention, a speech signal to be trained is acquired and input into a plurality of speech recognition submodels to generate a plurality of prediction result vectors. A target translation jointly decided by the plurality of submodels is then generated from the prediction result vectors, and the submodels are trained according to the target translation and the prediction result vectors generated by each submodel. Training the submodels on a jointly decided target translation mitigates the tendency of a single model to fall into a local optimum and, through this learning strategy among the submodels, improves the quality of speech recognition.
In addition, the training device of the speech recognition model according to the above embodiment of the present invention may further have the following additional technical features:
Optionally, the plurality of speech recognition submodels include two or more of: a Transformer model, an RNN model, a CNN model, a CTC model, and a GHMM model.
Optionally, the generating module is specifically configured to: generate a set of prediction result vectors from the plurality of prediction result vectors; and generate the target translation from the set of prediction result vectors.
Optionally, the plurality of speech recognition submodels are trained with a loss function of the form

loss = (1/n) · Σ_{i=1}^{n} ‖y_avg − y_i‖²

where y_avg is the target translation, y_i is the prediction result vector of the i-th model, and n is the number of speech recognition submodels.
An embodiment of a third aspect of the present invention provides a computer device including a processor and a memory, where the processor implements the training method of the speech recognition model according to the embodiment of the first aspect by reading executable program code stored in the memory and running a program corresponding to the executable program code.
A fourth aspect of the present invention provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the method for training a speech recognition model according to the first aspect.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
Fig. 1 is a schematic flowchart illustrating a method for training a speech recognition model according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a training apparatus for speech recognition models according to an embodiment of the present invention;
FIG. 3 is a block diagram of an exemplary computer device suitable for implementing embodiments of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, where like or similar reference numerals refer to like or similar elements throughout. The embodiments described below with reference to the drawings are illustrative and are not to be construed as limiting the invention.
The following describes a method, an apparatus, and a device for training a speech recognition model according to an embodiment of the present invention with reference to the drawings.
Fig. 1 is a schematic flowchart of a method for training a speech recognition model according to an embodiment of the present invention, as shown in fig. 1, the method includes:
step 101, obtaining a speech signal to be trained.
In this embodiment, when the speech recognition model is trained, a speech signal to be trained may first be acquired. For example, a speech signal may be collected by a receiving device such as a microphone, or speech signal data may be obtained from an annotation platform.
The speech signal to be trained can be in any language, such as Chinese, English, or Russian, selected according to the requirements of the speech recognition model.
Step 102, inputting a speech signal to be trained into a plurality of speech recognition submodels to generate a plurality of prediction result vectors.
In this embodiment, a plurality of speech recognition submodels may be preset, and the speech signal to be trained is respectively input into each speech recognition submodel for processing, and a plurality of prediction result vectors corresponding to each speech recognition submodel are respectively output.
Speech recognition submodels with certain differences among them may be adopted to ensure the effect of their collaborative learning. The submodels may be end-to-end models, for example a Transformer model, an RNN (Recurrent Neural Network) model, or a CNN (Convolutional Neural Network) model; optionally, a submodel may also be implemented with CTC (Connectionist Temporal Classification) or a GHMM (Gaussian-hidden Markov model), and is not limited to end-to-end models.
As an example, consider a single speech recognition submodel. The speech signal to be trained is input into the submodel, which outputs a prediction result vector y_t representing the prediction at time t, where y_t = [e(t,0), …, e(t,j), …, e(t,V−1)], V is the size of the vocabulary, and e(t,j) is the probability of predicting the j-th word in the vocabulary at time t. That is, the prediction result vector gives the probability of each word in the vocabulary: for English, V may be 26, representing the 26 letters, and y_t contains the probability of each letter at time t; for Chinese, V represents the number of Chinese characters, and y_t contains the probability of each character at time t. The recognition result predicted by the submodel can then be determined from the prediction result vector.
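As a concrete sketch of this example, a prediction result vector can be built from raw submodel scores with a softmax. The vocabulary size V = 26 and the random scores below are illustrative assumptions:

```python
import numpy as np

# A prediction result vector y_t is a probability distribution over a
# vocabulary of size V: entry e(t, j) is the probability of the j-th
# vocabulary word at time t. Here V = 26 (the English-letter example)
# and the scores are random stand-ins for a submodel's raw outputs.
V = 26
rng = np.random.default_rng(0)
scores = rng.normal(size=V)                  # raw submodel scores at time t

y_t = np.exp(scores) / np.exp(scores).sum()  # softmax -> probabilities e(t, j)
predicted_j = int(np.argmax(y_t))            # the letter predicted at time t

assert np.isclose(y_t.sum(), 1.0)            # a valid distribution over the vocabulary
```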
It should be noted that, in this embodiment, for each sub-model of a plurality of speech recognition sub-models, reference may be made to the description of a single speech recognition sub-model in the above example, which is not described herein again.
Optionally, the plurality of speech recognition submodels may be trained in advance according to the labeled speech training data, and then the speech signal to be trained is input into the plurality of speech recognition submodels to generate a plurality of prediction result vectors. For example, the speech training data labeled with the corresponding recognition text may be collected in advance, and the processing parameters of the speech recognition submodel may be trained based on a supervised training mode through the speech training data, so that the speech recognition submodel is input as a speech signal and output as a corresponding recognition text.
In practical applications, the processing parameters of a predetermined model are usually trained on labeled speech training data to generate a speech recognition model whose input is a speech signal and whose output is the corresponding text. For example, for end-to-end speech recognition, a Transformer model can be used to recognize a speech signal and obtain the recognition text. However, since different models have different advantages and disadvantages, a single model tends to fall into local optima: limited by its decoding direction, a left-to-right decoding model yields good prefixes but poor suffixes, while a right-to-left decoding model yields good suffixes but poor prefixes. Collaborative training across a plurality of speech recognition submodels therefore avoids the defect that a single model easily falls into a local optimum, and improves the quality of speech recognition.
Step 103, generating, from the plurality of prediction result vectors, a target translation jointly decided by the plurality of speech recognition submodels.
In the embodiment of the invention, a target translation jointly decided by the plurality of speech recognition submodels can be generated from the plurality of prediction result vectors, so that the submodels are trained on this jointly decided target translation.
In an embodiment of the present invention, a set of prediction result vectors may be generated according to a plurality of prediction result vectors, and then a target translation may be generated according to the set of prediction result vectors.
As an example, for a speech signal to be trained, each speech recognition submodel outputs a sequence of prediction result vectors; the set of prediction result vectors generated by the i-th submodel is Y_i = [y_0, y_1, …, y_t, …, y_n], where i indexes the speech recognition submodels and Y_i is the output the i-th submodel obtains from the speech signal, used to determine the corresponding recognition text. The target translation is then obtained by averaging the outputs Y_i of the plurality of submodels: for example, the vectors y_t output by the submodels at time t are averaged, and the target translation at that time is determined from the averaged vector. Note that different submodels may output different prediction result vectors for the same speech signal: if submodel 1 outputs y_t = [a1, b1, c1] and submodel 2 outputs y_t = [a2, b2, c2], the jointly decided target translation vector is [(a1+a2)/2, (b1+b2)/2, (c1+c2)/2]. The averaging scheme may be chosen as needed and is not limited here.
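The averaging in this example works out as follows; the numeric values are made up for illustration:

```python
import numpy as np

# Two submodels emit prediction result vectors for the same time step t;
# the jointly decided target vector is their elementwise mean, i.e.
# [(a1+a2)/2, (b1+b2)/2, (c1+c2)/2].
y_t_model1 = np.array([0.7, 0.2, 0.1])   # [a1, b1, c1]
y_t_model2 = np.array([0.5, 0.3, 0.2])   # [a2, b2, c2]

y_avg_t = (y_t_model1 + y_t_model2) / 2  # -> [0.6, 0.25, 0.15]
```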
As another possible implementation, the target translation may be determined from labels on the speech signal to be trained: for example, if the label corresponding to y_0 is the third word in the vocabulary, the label vector corresponding to y_0 is the V-dimensional one-hot vector [0, 0, 1, 0, …, 0].
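Such a label vector is a one-hot vector over the vocabulary. A minimal sketch, assuming V = 26 and the third vocabulary entry as the labeled word (both values are illustrative):

```python
import numpy as np

# One-hot label for y_0 when the labeled word is the third entry in a
# vocabulary of size V: [0, 0, 1, 0, ..., 0].
V = 26
label_index = 2            # third vocabulary entry, 0-based
one_hot = np.zeros(V)
one_hot[label_index] = 1.0
```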
It should be noted that the way the target translation is obtained may be chosen according to actual needs; for example, to address the inconsistency between decoding and training of a decoding model, the target translation may be determined from the prediction results output by the models, which is not limited here.
Step 104, training the speech recognition submodels according to the target translation and the plurality of prediction result vectors generated by each speech recognition submodel.
In this embodiment, after the target translation jointly decided by the plurality of speech recognition submodels is generated from the plurality of prediction result vectors, the submodels can be trained on the target translation and the prediction result vectors so as to adjust their processing parameters. Because the processing parameters are trained toward a jointly decided target translation, the quality of speech recognition is improved compared with the recognition result of a single model.
As an example, the plurality of speech recognition submodels are trained with a loss function of the form

loss = (1/n) · Σ_{i=1}^{n} ‖y_avg − y_i‖²

where y_avg is the target translation, y_i is the prediction result vector of the i-th model, and n is the number of speech recognition submodels.
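A minimal sketch of this loss, assuming it is the mean squared distance between each submodel's prediction result vector and the target translation; the exact form in the original filing may differ:

```python
import numpy as np

# Assumed form of the co-learning loss: the average, over the n
# submodels, of the squared distance between the jointly decided
# target translation y_avg and each prediction result vector y_i.
def colearning_loss(y_avg, predictions):
    n = len(predictions)   # number of speech recognition submodels
    return sum(float(np.sum((y_avg - np.asarray(y_i)) ** 2))
               for y_i in predictions) / n

preds = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
y_avg = np.mean(preds, axis=0)        # [0.5, 0.5]
loss = colearning_loss(y_avg, preds)  # -> 0.5
```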
In the training method of the speech recognition model described above, a speech signal to be trained is acquired and input into a plurality of speech recognition submodels to generate a plurality of prediction result vectors. A target translation jointly decided by the plurality of submodels is then generated from the prediction result vectors, and the submodels are trained according to the target translation and the prediction result vectors generated by each submodel. Training the submodels on a jointly decided target translation mitigates the tendency of a single model to fall into a local optimum and, through this learning strategy among the submodels, improves the quality of speech recognition.
In order to implement the above embodiments, the present invention further provides a training device for a speech recognition model.
Fig. 2 is a schematic structural diagram of a training apparatus for a speech recognition model according to an embodiment of the present invention, and as shown in fig. 2, the training apparatus for a speech recognition model includes: the system comprises an acquisition module 100, a processing module 200, a generation module 300 and a training module 400.
The obtaining module 100 is configured to obtain a speech signal to be trained.
The processing module 200 is configured to input a speech signal to be trained into a plurality of speech recognition submodels to generate a plurality of prediction result vectors.
And a generating module 300, configured to generate a target translation that is jointly decided by multiple speech recognition submodels according to multiple prediction result vectors.
The training module 400 is configured to train a plurality of speech recognition submodels according to the target translation and a plurality of prediction result vectors generated by each speech recognition submodel.
In one embodiment of the invention, the plurality of speech recognition submodels include two or more of: a Transformer model, an RNN model, a CNN model, a CTC model, and a GHMM model.
In an embodiment of the present invention, the generating module 300 is specifically configured to: generating a set of predictor vectors from the plurality of predictor vectors; and generating a target translation according to the vector set of the prediction results.
In one embodiment of the invention, the plurality of speech recognition submodels are trained with a loss function of the form

loss = (1/n) · Σ_{i=1}^{n} ‖y_avg − y_i‖²

where y_avg is the target translation, y_i is the prediction result vector of the i-th model, and n is the number of speech recognition submodels.
It should be noted that, the explanation of the training method of the speech recognition model in the foregoing embodiment is also applicable to the training device of the speech recognition model in this embodiment, and details are not repeated here.
In the training device of the speech recognition model described above, a speech signal to be trained is acquired and input into a plurality of speech recognition submodels to generate a plurality of prediction result vectors. A target translation jointly decided by the plurality of submodels is then generated from the prediction result vectors, and the submodels are trained according to the target translation and the prediction result vectors generated by each submodel. Training the submodels on a jointly decided target translation mitigates the tendency of a single model to fall into a local optimum and, through this learning strategy among the submodels, improves the quality of speech recognition.
In order to implement the above embodiments, the present invention further provides a computer device including a processor and a memory, where the processor implements the training method of the speech recognition model according to any of the foregoing embodiments by reading executable program code stored in the memory and running a program corresponding to the executable program code.
In order to implement the above embodiments, the present invention further provides a computer program product, wherein when the instructions in the computer program product are executed by a processor, the method for training a speech recognition model according to any of the foregoing embodiments is implemented.
In order to implement the above embodiments, the present invention further provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the training method of the speech recognition model according to any of the foregoing embodiments.
FIG. 3 is a block diagram of an exemplary computer device suitable for implementing embodiments of the present invention. The computer device 12 shown in FIG. 3 is only an example and should not impose any limitation on the scope of use or functionality of embodiments of the present invention.
As shown in FIG. 3, computer device 12 is in the form of a general purpose computing device. The components of computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. These architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Computer device 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 28 may include computer system readable media in the form of volatile Memory, such as Random Access Memory (RAM) 30 and/or cache Memory 32. Computer device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 3, and commonly referred to as a "hard drive"). Although not shown in FIG. 3, a disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a Compact disk Read Only Memory (CD-ROM), a Digital versatile disk Read Only Memory (DVD-ROM), or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the application.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally perform the functions and/or methodologies of the embodiments described herein.
The computer device 12 may also communicate with one or more external devices 14 (e.g., a keyboard, a pointing device, a display 24, etc.), with one or more devices that enable a user to interact with the computer device 12, and/or with any device (e.g., a network card, a modem, etc.) that enables the computer device 12 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 22. Moreover, computer device 12 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via a network adapter 20. As shown, network adapter 20 communicates with the other modules of computer device 12 via bus 18. It should be understood that, although not shown, other hardware and/or software modules may be used in conjunction with computer device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
The processing unit 16 executes various functional applications and data processing, for example, implementing the methods mentioned in the foregoing embodiments, by executing programs stored in the system memory 28.
In the description of the present invention, it is to be understood that the terms "first", "second" and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (8)

1. A method of training a speech recognition model, the speech recognition model comprising a plurality of speech recognition submodels, the method comprising:
acquiring a speech signal to be trained;
inputting the speech signal to be trained into the plurality of speech recognition submodels to generate a plurality of prediction result vectors;
generating, according to the plurality of prediction result vectors, a target translation jointly decided by the plurality of speech recognition submodels; and
training the plurality of speech recognition submodels according to the target translation and the prediction result vectors generated by the speech recognition submodels;
wherein the plurality of speech recognition submodels are trained by a loss function,
(loss-function formula reproduced only as image FDA0003080426860000011 in the original publication)
wherein y_avg is the target translation, y_i is the prediction result vector of the i-th model, and n is the number of speech recognition submodels.
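As a rough illustration of the joint-decision training in claim 1 — assuming, since the patent publishes the loss formula only as an image, that the target y_avg is the element-wise average of the n prediction result vectors y_i and that the loss sums a cross-entropy between y_avg and each y_i — a minimal sketch:

```python
import numpy as np

def joint_decision_loss(predictions, eps=1e-12):
    """Hypothetical sketch of the claimed multi-submodel loss.
    predictions: one probability vector y_i per submodel, shape (n, vocab).
    The jointly decided target y_avg is taken as the average of the y_i,
    and the loss sums, over the n submodels, the cross-entropy between
    y_avg and each y_i (the exact distance in the formula image is assumed)."""
    preds = np.asarray(predictions)      # shape: (n, vocab)
    y_avg = preds.mean(axis=0)           # joint decision by averaging
    # sum over i of CE(y_avg, y_i) = -sum_i sum_k y_avg[k] * log(y_i[k])
    return float(-np.sum(y_avg * np.log(preds + eps)))

# three submodels, each emitting a distribution over four output tokens
preds = [
    [0.70, 0.10, 0.10, 0.10],
    [0.60, 0.20, 0.10, 0.10],
    [0.80, 0.10, 0.05, 0.05],
]
loss = joint_decision_loss(preds)
```

Under these assumptions, submodels that agree with the averaged target incur a lower loss than submodels that diverge from it, which is the pull toward the common decision that the claim describes.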
2. The method of training a speech recognition model according to claim 1, wherein the plurality of speech recognition submodels comprise a plurality of models from among a Transformer model, an RNN model, a CNN model, a CTC model, and a GHMM model.
3. The method for training a speech recognition model according to claim 1, wherein the generating, according to the plurality of prediction result vectors, of the target translation jointly decided by the plurality of speech recognition submodels comprises:
generating a prediction result vector set from the plurality of prediction result vectors; and
generating the target translation according to the prediction result vector set.
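A minimal sketch of the two steps in claim 3 — collecting the submodels' outputs into a prediction result vector set, then deriving the target translation from it. Averaging followed by a per-step argmax is an assumption here; the claim itself does not fix the decision rule:

```python
import numpy as np

def decide_target(pred_vectors):
    """pred_vectors: one (steps, vocab) prediction result vector per submodel.
    Returns the token ids of the jointly decided target translation."""
    vector_set = np.stack(pred_vectors)   # the prediction result vector set
    y_avg = vector_set.mean(axis=0)       # average vote across submodels
    return y_avg.argmax(axis=-1)          # most probable token at each step

# two submodels, two decoding steps, vocabulary of three tokens
model_a = np.array([[0.9, 0.05, 0.05],
                    [0.1, 0.80, 0.10]])
model_b = np.array([[0.2, 0.70, 0.10],
                    [0.1, 0.60, 0.30]])
target = decide_target([model_a, model_b])
```

Note how the averaging lets a confident submodel outvote an uncertain one at each step: at the first step, model_a's strong preference for token 0 dominates model_b's weaker preference for token 1.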
4. An apparatus for training a speech recognition model, the speech recognition model comprising a plurality of speech recognition submodels, the apparatus comprising:
an acquisition module, configured to acquire a speech signal to be trained;
a processing module, configured to input the speech signal to be trained into the plurality of speech recognition submodels to generate a plurality of prediction result vectors;
a generating module, configured to generate, according to the plurality of prediction result vectors, a target translation jointly decided by the plurality of speech recognition submodels; and
a training module, configured to train the plurality of speech recognition submodels according to the target translation and the prediction result vectors generated by the speech recognition submodels;
wherein the plurality of speech recognition submodels are trained by a loss function,
(loss-function formula reproduced only as image FDA0003080426860000021 in the original publication)
wherein y_avg is the target translation, y_i is the prediction result vector of the i-th model, and n is the number of speech recognition submodels.
5. The apparatus for training a speech recognition model according to claim 4, wherein the plurality of speech recognition submodels comprise a plurality of models from among a Transformer model, an RNN model, a CNN model, a CTC model, and a GHMM model.
6. The apparatus for training a speech recognition model according to claim 4, wherein the generating module is specifically configured to:
generating a set of predictor vectors from the plurality of predictor vectors;
and generating the target translation according to the prediction result vector set.
7. A computer device comprising a processor and a memory;
wherein the processor, by reading the executable program code stored in the memory, runs a program corresponding to the executable program code, so as to implement the method of training a speech recognition model according to any one of claims 1 to 3.
8. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out a method of training a speech recognition model according to any one of claims 1 to 3.
CN201910477604.8A 2019-06-03 2019-06-03 Training method, device and equipment of voice recognition model Active CN110246486B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910477604.8A CN110246486B (en) 2019-06-03 2019-06-03 Training method, device and equipment of voice recognition model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910477604.8A CN110246486B (en) 2019-06-03 2019-06-03 Training method, device and equipment of voice recognition model

Publications (2)

Publication Number Publication Date
CN110246486A CN110246486A (en) 2019-09-17
CN110246486B true CN110246486B (en) 2021-07-13

Family

ID=67885808

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910477604.8A Active CN110246486B (en) 2019-06-03 2019-06-03 Training method, device and equipment of voice recognition model

Country Status (1)

Country Link
CN (1) CN110246486B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210312294A1 (en) * 2020-04-03 2021-10-07 International Business Machines Corporation Training of model for processing sequence data
CN112885330A (en) * 2021-01-26 2021-06-01 北京云上曲率科技有限公司 Language identification method and system based on low-resource audio

Citations (5)

Publication number Priority date Publication date Assignee Title
US20090138265A1 (en) * 2007-11-26 2009-05-28 Nuance Communications, Inc. Joint Discriminative Training of Multiple Speech Recognizers
CN103794214A (en) * 2014-03-07 2014-05-14 联想(北京)有限公司 Information processing method, device and electronic equipment
US20150199960A1 (en) * 2012-08-24 2015-07-16 Microsoft Corporation I-Vector Based Clustering Training Data in Speech Recognition
CN108597502A (en) * 2018-04-27 2018-09-28 上海适享文化传播有限公司 Field speech recognition training method based on dual training
CN109710727A (en) * 2017-10-26 2019-05-03 哈曼国际工业有限公司 System and method for natural language processing


Non-Patent Citations (1)

Title
"Research on Out-of-Vocabulary Word Translation for Neural Network Translation Models Oriented to Chinese-English Patent Documents"; Zheng Xiaokang; China Master's Theses Full-text Database, Information Science and Technology Series; 2018-01-15 (No. 01); pp. 6-46 *

Also Published As

Publication number Publication date
CN110246486A (en) 2019-09-17

Similar Documents

Publication Publication Date Title
JP7432556B2 (en) Methods, devices, equipment and media for man-machine interaction
CN109003624B (en) Emotion recognition method and device, computer equipment and storage medium
CN108985358B (en) Emotion recognition method, device, equipment and storage medium
CN110162800B (en) Translation model training method and device
CN110275939B (en) Method and device for determining conversation generation model, storage medium and electronic equipment
CN108922564B (en) Emotion recognition method and device, computer equipment and storage medium
Taniguchi et al. Spatial concept acquisition for a mobile robot that integrates self-localization and unsupervised word discovery from spoken sentences
CN113035311B (en) Medical image report automatic generation method based on multi-mode attention mechanism
CN109961041B (en) Video identification method and device and storage medium
CN108846124B (en) Training method, training device, computer equipment and readable storage medium
CN106340297A (en) Speech recognition method and system based on cloud computing and confidence calculation
CN111653274B (en) Wake-up word recognition method, device and storage medium
CN113779310B (en) Video understanding text generation method based on hierarchical representation network
CN110246486B (en) Training method, device and equipment of voice recognition model
JP7178394B2 (en) Methods, apparatus, apparatus, and media for processing audio signals
CN110991195A (en) Machine translation model training method, device and storage medium
US12027162B2 (en) Noisy student teacher training for robust keyword spotting
Sugiura et al. Situated spoken dialogue with robots using active learning
WO2023155676A1 (en) Method and apparatus for processing translation model, and computer-readable storage medium
CN110287999B (en) Story generation method and device based on hidden variable model
CN108829896B (en) Reply information feedback method and device
CN116401364A (en) Language model training method, electronic device, storage medium and product
CN113781998B (en) Speech recognition method, device, equipment and medium based on dialect correction model
CN112541557B (en) Training method and device for generating countermeasure network and electronic equipment
CN113077535B (en) Model training and mouth motion parameter acquisition method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant