CN110246486B

CN110246486B - Training method, device and equipment of voice recognition model

Info

Publication number: CN110246486B
Application number: CN201910477604.8A
Authority: CN
Inventors: 熊皓; 张睿卿; 张传强; 何中军; 李芝; 吴华; 王海峰
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2019-06-03
Filing date: 2019-06-03
Publication date: 2021-07-13
Anticipated expiration: 2039-06-03
Also published as: CN110246486A

Abstract

The invention provides a training method, a device and equipment of a voice recognition model, wherein the method comprises the following steps: acquiring a voice signal to be trained; inputting a voice signal to be trained into a plurality of voice recognition submodels to generate a plurality of prediction result vectors; generating a target translation which is generated by a plurality of voice recognition submodels through common decision according to a plurality of prediction result vectors; and training the plurality of speech recognition submodels according to the target translation and the plurality of prediction result vectors generated by each speech recognition submodel. Therefore, the speech recognition model is trained through the target translation generated by the common decision of the plurality of speech recognition submodels, and the quality of the speech recognition is improved.

Description

Training method, device and equipment of voice recognition model

Technical Field

The invention relates to the technical field of voice recognition, in particular to a training method, a device and equipment of a voice recognition model.

Background

With the development of artificial intelligence technology, speech recognition technology has made great progress and is beginning to enter various fields such as household appliances, communication, automobiles, medical treatment and the like.

In the related art, when a speech recognition model is trained, a corresponding model structure is usually selected for training, and because each model has advantages and defects of itself and because of the limitation of the scale of the training corpus, the speech recognition model is easy to fall into a local optimal value, and the quality of a speech recognition result needs to be improved.

Disclosure of Invention

The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.

Therefore, a first objective of the present invention is to provide a method for training a speech recognition model, in which a target translation generated by a plurality of speech recognition submodels through decision-making is used to train the speech recognition model, so as to improve the quality of speech recognition.

The second objective of the present invention is to provide a training device for speech recognition model.

A third object of the invention is to propose a computer device.

A fourth object of the invention is to propose a computer-readable storage medium.

The embodiment of the first aspect of the invention provides a method for training a speech recognition model, wherein the speech recognition model comprises a plurality of speech recognition submodels, and the method comprises the following steps:

acquiring a voice signal to be trained;

inputting the voice signal to be trained into a plurality of voice recognition submodels to generate a plurality of prediction result vectors;

generating a target translation generated by the multiple voice recognition submodels through common decision according to the multiple prediction result vectors; and

and training the plurality of speech recognition submodels according to the target translation and a plurality of prediction result vectors generated by each speech recognition submodel.

The training method of the voice recognition model of the embodiment of the invention obtains the voice signal to be trained, and then inputs the voice signal to be trained into a plurality of voice recognition submodels to generate a plurality of prediction result vectors. Further, a target translation generated by a plurality of voice models through common decision making is generated according to the plurality of prediction result vectors, and the plurality of voice recognition submodels are trained according to the target translation and the plurality of prediction result vectors generated by each voice recognition submodel. Therefore, the target translation generated by common decision of the plurality of voice recognition submodels is used for training the voice recognition submodels, the defect that a single model is easy to fall into a local optimal value can be reduced, and the quality of voice recognition is improved based on a learning strategy among the voice recognition submodels.

In addition, the training method of the speech recognition model according to the above embodiment of the present invention may further have the following additional technical features:

optionally, the plurality of speech recognizer models comprises a plurality of a Transformer model, an RNN model, a CNN model, CTCs, and GHMMs.

Optionally, the generating a target translation produced by a common decision of the plurality of speech recognizer models according to the plurality of prediction result vectors includes: generating a set of predictor vectors from the plurality of predictor vectors; and generating the target translation according to the prediction result vector set.

Optionally, the plurality of speech recognition submodels are trained by a loss function,

wherein, y_avgFor the target translation, y_iIs the prediction result vector of the ith model, and n is the number of the sub-models of the speech recognition.

The embodiment of the second aspect of the present invention provides a training device for a speech recognition model, where the speech recognition model includes a plurality of speech recognition submodels, and the device includes:

the acquisition module is used for acquiring a voice signal to be trained;

the processing module is used for inputting the voice signal to be trained into a plurality of voice recognition submodels to generate a plurality of prediction result vectors;

the generating module is used for generating a target translation generated by the multiple speech recognition submodels through common decision according to the multiple prediction result vectors; and

and the training module is used for training the plurality of voice recognition submodels according to the target translation and a plurality of prediction result vectors generated by each voice recognition submodel.

The training device of the speech recognition model of the embodiment of the invention obtains the speech signal to be trained, and then inputs the speech signal to be trained into a plurality of speech recognition submodels to generate a plurality of prediction result vectors. Further, a target translation generated by a plurality of voice models through common decision making is generated according to the plurality of prediction result vectors, and the plurality of voice recognition submodels are trained according to the target translation and the plurality of prediction result vectors generated by each voice recognition submodel. Therefore, the target translation generated by common decision of the plurality of voice recognition submodels is used for training the voice recognition submodels, the defect that a single model is easy to fall into a local optimal value can be reduced, and the quality of voice recognition is improved based on a learning strategy among the voice recognition submodels.

In addition, the training device of the speech recognition model according to the above embodiment of the present invention may further have the following additional technical features:

Optionally, the generating module is specifically configured to: generating a set of predictor vectors from the plurality of predictor vectors; and generating the target translation according to the prediction result vector set.

An embodiment of a third aspect of the present invention provides a computer device, including a processor and a memory; wherein the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory, so as to implement the training method of the speech recognition model according to the embodiment of the first aspect.

A fourth aspect of the present invention provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the method for training a speech recognition model according to the first aspect.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

Fig. 1 is a schematic flowchart illustrating a method for training a speech recognition model according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a training apparatus for speech recognition models according to an embodiment of the present invention;

FIG. 3 illustrates a block diagram of an exemplary computer device suitable for use to implement embodiments of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.

The following describes a method, an apparatus, and a device for training a speech recognition model according to an embodiment of the present invention with reference to the drawings.

Fig. 1 is a schematic flowchart of a method for training a speech recognition model according to an embodiment of the present invention, as shown in fig. 1, the method includes:

step 101, obtaining a speech signal to be trained.

In this embodiment, when training the speech recognition model, the speech signal to be trained may be obtained first. For example, a voice signal may be collected by a voice receiving device such as a microphone as a voice signal to be trained. As another example, speech signal data may be obtained from the relevant annotation platform as the speech signal to be trained.

The speech signal to be trained can be a speech signal of any language, such as chinese, english, russian, and the like, and can be specifically selected according to the requirements of the speech recognition model.

Step 102, inputting a speech signal to be trained into a plurality of speech recognition submodels to generate a plurality of prediction result vectors.

In this embodiment, a plurality of speech recognition submodels may be preset, and the speech signal to be trained is respectively input into each speech recognition submodel for processing, and a plurality of prediction result vectors corresponding to each speech recognition submodel are respectively output.

The multiple voice recognition submodels with certain differences can be adopted to ensure the effect of the collaborative learning of the multiple voice recognition submodels. The speech recognition submodel may be based on an end-to-end model, for example, the speech recognition submodel may include a transform model, an RNN (Recurrent Neural Network) model, a CNN (Convolutional Neural Network) model, and the like, and optionally, the speech recognition submodel may be implemented by a CTC (connected Temporal Classification) and a GHMM (hybrid gaussian-hidden markov model) model, and is not limited to the end-to-end model.

As an example, the following description is made with respect to a single speech recognition submodel. And inputting the voice signal to be trained into the voice recognition submodel for processing, and outputting a prediction result vector, wherein the output prediction result vector is yt and represents the prediction result at the time t. Where yt is [ e (t,0) … e (t, j) e (t, V-1) ], V denotes the size of the vocabulary, and e (t, j) denotes the probability of predicting to the jth word in the vocabulary at time t. That is, the prediction result vector is used to predict the probability of each word in the vocabulary, for example, the size of the vocabulary is denoted as V, the size of the english vocabulary V may be 26, which represents 26 letters, and the prediction result vector yt includes the probability of predicting each letter in the vocabulary at time t; for Chinese V to represent the number of Chinese characters, the prediction result vector yt includes the probability of predicting each character at time t. Thus, the recognition result predicted by the speech recognizer model can be determined according to the prediction result vector.

It should be noted that, in this embodiment, for each sub-model of a plurality of speech recognition sub-models, reference may be made to the description of a single speech recognition sub-model in the above example, which is not described herein again.

Optionally, the plurality of speech recognition submodels may be trained in advance according to the labeled speech training data, and then the speech signal to be trained is input into the plurality of speech recognition submodels to generate a plurality of prediction result vectors. For example, the speech training data labeled with the corresponding recognition text may be collected in advance, and the processing parameters of the speech recognition submodel may be trained based on a supervised training mode through the speech training data, so that the speech recognition submodel is input as a speech signal and output as a corresponding recognition text.

In practical applications, a processing parameter of a predetermined model is usually trained according to labeled speech training data to generate a speech recognition model, so that the input of the speech recognition model is a speech signal and the output is a corresponding text. For example, for an end-to-end speech recognition model, a Transformer model can be used to recognize a speech signal to obtain a recognition text. However, since different models have different advantages and disadvantages, processing by a single model tends to fall into local optima, such as being limited by the model capabilities for left-to-right decoding models, yielding good prefixes and poor suffixes, and yielding good suffixes and poor prefixes for right-to-left decoding models. Therefore, the collaborative training can be carried out through the plurality of voice recognition submodels, so that the defect that a single model is easy to fall into a local optimal value is avoided, and the quality of voice recognition is improved.

And 103, generating a target translation which is generated by a plurality of voice models through common decision according to the plurality of prediction result vectors.

In the embodiment of the invention, a target translation generated by a common decision of a plurality of speech models can be generated according to a plurality of prediction result vectors, so that the speech models are trained according to the target translation generated by the common decision.

In an embodiment of the present invention, a set of prediction result vectors may be generated according to a plurality of prediction result vectors, and then a target translation may be generated according to the set of prediction result vectors.

As an example, for a segment of a speech signal to be trained, each speech recognition submodel may output a plurality of prediction vectors y, and then a set of prediction vectors Yi generated from the plurality of prediction vectors is [ y0, y1, …, yt, …, yn ], where i is the number of speech recognition submodels, Yi is an output obtained by the ith speech recognition submodel from the speech signal to be trained, and Yi is used to determine a recognition text corresponding to the speech signal. Further, the target translation is obtained by averaging the outputs Yi obtained by the plurality of speech recognition submodels, for example, by averaging yt output from each speech recognition submodel at time t, and the target translation at that time is determined from a vector of the averaged values. It is understood that the prediction result vectors output by different speech recognition submodels for the same speech signal to be trained may be different, for example, yt ═ output by the speech recognition submodel 1 [ a1, b1, c1], yt ═ output by the speech recognition submodel 2 [ a2, b2, c2], and then the vector of the target translation generated by common decision is determined to be [ (a1+ a2)/2, (b1+ b2)/2, (c1+ c2)/2 by averaging. The implementation manner of averaging may be selected according to needs, and is not limited herein.

As another possible implementation manner, labeling may be performed according to the speech signal to be trained, and a translation may be determined according to the recognition result of the label, for example, for a translation corresponding to y0 being the third letter in the vocabulary, the vector of the translation corresponding to y0 is [0,0,1,0 …, V-1 ].

It should be noted that, the implementation manner of obtaining the target translation may be selected according to actual needs, for example, in order to solve the problem of inconsistency during decoding and training of the decoding model, the target translation may be determined according to a prediction result output by the model, and this is not limited here.

And 104, training the voice recognition submodels according to the target translation and the plurality of prediction result vectors generated by each voice recognition submodel.

In this embodiment, after the target translation generated by the common decision of the multiple speech recognition submodels is generated according to the multiple prediction result vectors, the multiple speech recognition submodels may be trained according to the target translation generated by the common decision and the prediction result vectors to adjust the processing parameters of the speech recognition submodels, so that the processing parameters of the speech recognition submodels are trained according to the target translation generated by the common decision, and the quality of speech recognition can be improved compared with the recognition result generated by a single model.

As an example, a plurality of speech recognition submodels are trained by the following loss function,

wherein, y_avgFor the target translation, y_iIs the prediction vector of the ith model, and n is the number of speech recognition submodels.

According to the training method of the voice recognition model, the voice signal to be trained is obtained, and then the voice signal to be trained is input into the plurality of voice recognition submodels to generate a plurality of prediction result vectors. Further, a target translation generated by a plurality of voice models through common decision making is generated according to the plurality of prediction result vectors, and the plurality of voice recognition submodels are trained according to the target translation and the plurality of prediction result vectors generated by each voice recognition submodel. Therefore, the target translation generated by common decision of the plurality of voice recognition submodels is used for training the voice recognition submodels, the defect that a single model is easy to fall into a local optimal value can be reduced, and the quality of voice recognition is improved based on a learning strategy among the voice recognition submodels.

In order to implement the above embodiments, the present invention further provides a training device for a speech recognition model.

Fig. 2 is a schematic structural diagram of a training apparatus for a speech recognition model according to an embodiment of the present invention, and as shown in fig. 2, the training apparatus for a speech recognition model includes: the system comprises an acquisition module 100, a processing module 200, a generation module 300 and a training module 400.

The obtaining module 100 is configured to obtain a speech signal to be trained.

The processing module 200 is configured to input a speech signal to be trained into a plurality of speech recognition submodels to generate a plurality of prediction result vectors.

And a generating module 300, configured to generate a target translation that is jointly decided by multiple speech recognition submodels according to multiple prediction result vectors.

The training module 400 is configured to train a plurality of speech recognition submodels according to the target translation and a plurality of prediction result vectors generated by each speech recognition submodel.

In one embodiment of the invention, the plurality of speech recognizer models includes a plurality of a Transformer model, an RNN model, a CNN model, a CTC, and a GHMM.

In an embodiment of the present invention, the generating module 300 is specifically configured to: generating a set of predictor vectors from the plurality of predictor vectors; and generating a target translation according to the vector set of the prediction results.

In one embodiment of the invention, a plurality of speech recognition submodels are trained by the following loss function,

It should be noted that, the explanation of the training method of the speech recognition model in the foregoing embodiment is also applicable to the training device of the speech recognition model in this embodiment, and details are not repeated here.

According to the training device of the voice recognition model, the voice signal to be trained is obtained, and then the voice signal to be trained is input into the plurality of voice recognition submodels to generate a plurality of prediction result vectors. Further, a target translation generated by a plurality of voice models through common decision making is generated according to the plurality of prediction result vectors, and the plurality of voice recognition submodels are trained according to the target translation and the plurality of prediction result vectors generated by each voice recognition submodel. Therefore, the target translation generated by common decision of the plurality of voice recognition submodels is used for training the voice recognition submodels, the defect that a single model is easy to fall into a local optimal value can be reduced, and the quality of voice recognition is improved based on a learning strategy among the voice recognition submodels.

In order to implement the above embodiments, the present invention further provides a computer device, including a processor and a memory; wherein, the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory, so as to implement the training method of the speech recognition model according to any one of the foregoing embodiments.

In order to implement the above embodiments, the present invention further provides a computer program product, wherein when the instructions in the computer program product are executed by a processor, the method for training a speech recognition model according to any of the foregoing embodiments is implemented.

In order to implement the above embodiments, the present invention further provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the training method of the speech recognition model according to any of the foregoing embodiments.

FIG. 3 illustrates a block diagram of an exemplary computer device suitable for use to implement embodiments of the present invention. The computer device 12 shown in FIG. 3 is only an example and should not impose any limitation on the scope of use or functionality of embodiments of the present invention.

As shown in FIG. 3, computer device 12 is in the form of a general purpose computing device. The components of computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16.

Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. These architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MAC) bus, enhanced ISA bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus, to name a few.

Computer device 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer device 12 and includes both volatile and nonvolatile media, removable and non-removable media.

Memory 28 may include computer system readable media in the form of volatile Memory, such as Random Access Memory (RAM) 30 and/or cache Memory 32. Computer device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 3, and commonly referred to as a "hard drive"). Although not shown in FIG. 3, a disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a Compact disk Read Only Memory (CD-ROM), a Digital versatile disk Read Only Memory (DVD-ROM), or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the application.

A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally perform the functions and/or methodologies of the embodiments described herein.

The computer device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with the computer system/server 12, and/or with any devices (e.g., network card, modem, etc.) that enable the computer system/server 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Moreover, computer device 12 may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public Network such as the Internet) via Network adapter 20. As shown, network adapter 20 communicates with the other modules of computer device 12 via bus 18. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with computer device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.

The processing unit 16 executes various functional applications and data processing, for example, implementing the methods mentioned in the foregoing embodiments, by executing programs stored in the system memory 28.

In the description of the present invention, it is to be understood that the terms "first", "second" and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.

In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.

Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims

1. A method of training a speech recognition model, the speech recognition model comprising a plurality of speech recognition submodels, the method comprising:

acquiring a voice signal to be trained;

training the plurality of speech recognition submodels according to the target translation and a plurality of prediction result vectors generated by each speech recognition submodel;

wherein the plurality of speech recognition submodels are trained by a loss function,

2. The method of training a speech recognition model according to claim 1, wherein the plurality of speech recognizer models comprises a plurality of a fransformer model, an RNN model, a CNN model, CTC, and GHMM.

3. The method for training a speech recognition model according to claim 1, wherein the generating a target translation from the plurality of prediction vectors, which is jointly decided by the plurality of speech recognition submodels, comprises:

generating a set of predictor vectors from the plurality of predictor vectors;

and generating the target translation according to the prediction result vector set.

4. An apparatus for training a speech recognition model, the speech recognition model comprising a plurality of speech recognition submodels, the apparatus comprising:

the acquisition module is used for acquiring a voice signal to be trained;

the training module is used for training the voice recognition submodels according to the target translation and a plurality of prediction result vectors generated by each voice recognition submodel;

5. The apparatus for training a speech recognition model according to claim 4, wherein the plurality of speech recognizer models comprises a plurality of a fransformer model, an RNN model, a CNN model, CTC, and GHMM.

6. The apparatus for training a speech recognition model according to claim 4, wherein the generating module is specifically configured to:

generating a set of predictor vectors from the plurality of predictor vectors;

7. A computer device comprising a processor and a memory;

wherein the processor runs a program corresponding to the executable program code by reading the executable program code stored in the memory for implementing the training method of the speech recognition model according to any one of claims 1 to 3.

8. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out a method of training a speech recognition model according to any one of claims 1 to 3.