CN110246486A - Training method, device, and equipment for a speech recognition model - Google Patents

Training method, device, and equipment for a speech recognition model

Info

Publication number
CN110246486A
Authority
CN
China
Prior art keywords
speech recognition
prediction result
submodel
trained
target translation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910477604.8A
Other languages
Chinese (zh)
Other versions
CN110246486B (en)
Inventor
熊皓
张睿卿
张传强
何中军
李芝
吴华
王海峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201910477604.8A
Publication of CN110246486A
Application granted
Publication of CN110246486B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a training method, device, and equipment for a speech recognition model. The method includes: obtaining a speech signal for training; inputting the speech signal into multiple speech recognition submodels to generate multiple prediction result vectors; generating, according to the multiple prediction result vectors, a target translation produced by the joint decision of the multiple speech recognition submodels; and training the multiple speech recognition submodels according to the target translation and the prediction result vectors generated by each submodel. By training the speech recognition model against a target translation produced by the joint decision of multiple submodels, the quality of speech recognition is improved.

Description

Training method, device, and equipment for a speech recognition model
Technical field
The present invention relates to the technical field of speech recognition, and in particular to a training method, device, and equipment for a speech recognition model.
Background art
With the development of artificial intelligence technology, speech recognition has made great progress and has begun to enter fields such as household appliances, communications, automobiles, and medical care.
In the related art, when training a speech recognition model, a single model structure is usually chosen and trained. Because each model has its own strengths and weaknesses, and because the scale of the training corpus is limited, the speech recognition model easily falls into a local optimum, and the quality of the speech recognition results remains to be improved.
Summary of the invention
The present invention aims to solve at least one of the technical problems in the related art.
To this end, a first object of the present invention is to propose a training method for a speech recognition model, which trains the model against a target translation produced by the joint decision of multiple speech recognition submodels, thereby improving the quality of speech recognition.
A second object of the present invention is to propose a training device for a speech recognition model.
A third object of the present invention is to propose a computer device.
A fourth object of the present invention is to propose a computer-readable storage medium.
An embodiment of the first aspect of the present invention proposes a training method for a speech recognition model, the speech recognition model including multiple speech recognition submodels, the method comprising:
obtaining a speech signal for training;
inputting the speech signal into the multiple speech recognition submodels to generate multiple prediction result vectors;
generating, according to the multiple prediction result vectors, a target translation produced by the joint decision of the multiple speech recognition submodels; and
training the multiple speech recognition submodels according to the target translation and the multiple prediction result vectors generated by each speech recognition submodel.
In the training method of this embodiment, a speech signal for training is obtained and input into multiple speech recognition submodels to generate multiple prediction result vectors. A target translation produced by the joint decision of the multiple submodels is then generated from these prediction result vectors, and the submodels are trained according to the target translation and the prediction result vectors each of them produced. Training against a target translation reached by joint decision reduces the tendency of a single model to fall into a local optimum and, through the mutual learning strategy among the submodels, improves the quality of speech recognition.
In addition, the training method of the speech recognition model according to the above embodiment of the present invention may have the following additional technical features:
Optionally, the multiple speech recognition submodels include two or more of a Transformer model, an RNN model, a CNN model, a CTC model, and a GHMM model.
Optionally, generating the target translation produced by the joint decision of the multiple speech recognition submodels according to the multiple prediction result vectors comprises: generating a prediction result vector set from the multiple prediction result vectors; and generating the target translation from the prediction result vector set.
Optionally, the multiple speech recognition submodels are trained with a loss function in which y_avg is the target translation, y_i is the prediction result vector of the i-th model, and n is the number of speech recognition submodels.
An embodiment of the second aspect of the present invention proposes a training device for a speech recognition model, the speech recognition model including multiple speech recognition submodels, the device comprising:
an obtaining module, configured to obtain a speech signal for training;
a processing module, configured to input the speech signal into the multiple speech recognition submodels to generate multiple prediction result vectors;
a generation module, configured to generate, according to the multiple prediction result vectors, a target translation produced by the joint decision of the multiple speech recognition submodels; and
a training module, configured to train the multiple speech recognition submodels according to the target translation and the multiple prediction result vectors generated by each speech recognition submodel.
In the training device of this embodiment, a speech signal for training is obtained and input into multiple speech recognition submodels to generate multiple prediction result vectors. A target translation produced by the joint decision of the multiple submodels is then generated from these prediction result vectors, and the submodels are trained according to the target translation and the prediction result vectors each of them produced. Training against a target translation reached by joint decision reduces the tendency of a single model to fall into a local optimum and, through the mutual learning strategy among the submodels, improves the quality of speech recognition.
In addition, the training device of the speech recognition model according to the above embodiment of the present invention may have the following additional technical features:
Optionally, the multiple speech recognition submodels include two or more of a Transformer model, an RNN model, a CNN model, a CTC model, and a GHMM model.
Optionally, the generation module is specifically configured to: generate a prediction result vector set from the multiple prediction result vectors; and generate the target translation from the prediction result vector set.
Optionally, the multiple speech recognition submodels are trained with a loss function in which y_avg is the target translation, y_i is the prediction result vector of the i-th model, and n is the number of speech recognition submodels.
An embodiment of the third aspect of the present invention proposes a computer device, including a processor and a memory, wherein the processor runs a program corresponding to executable program code stored in the memory by reading the executable program code, so as to implement the training method of the speech recognition model described in the embodiment of the first aspect.
An embodiment of the fourth aspect of the present invention proposes a computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the training method of the speech recognition model described in the embodiment of the first aspect.
Additional aspects and advantages of the present invention will be set forth in part in the following description, will in part become apparent from the description, or may be learned through practice of the invention.
Brief description of the drawings
Fig. 1 is a schematic flowchart of a training method for a speech recognition model provided by an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a training device for a speech recognition model provided by an embodiment of the present invention;
Fig. 3 is a block diagram of an exemplary computer device suitable for implementing embodiments of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which identical or similar reference numerals denote identical or similar elements or elements with identical or similar functions throughout. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the present invention and should not be construed as limiting it.
The training method, device, and equipment of the speech recognition model according to embodiments of the present invention are described below with reference to the accompanying drawings.
Fig. 1 is a schematic flowchart of a training method for a speech recognition model provided by an embodiment of the present invention. As shown in Fig. 1, the method comprises:
Step 101: obtain a speech signal for training.
In this embodiment, when training the speech recognition model, a speech signal for training is first obtained. For example, a speech signal may be collected by a sound-receiving device such as a microphone and used as the training speech signal. As another example, speech signal data may be obtained from an annotation platform and used as the training speech signal.
The training speech signal may be a speech signal in any language, such as Chinese, English, or Russian, chosen according to the speech recognition model to be trained.
Step 102: input the training speech signal into multiple speech recognition submodels to generate multiple prediction result vectors.
In this embodiment, multiple speech recognition submodels may be set up in advance, and the training speech signal is input into each of them separately for processing, so that multiple prediction result vectors, one corresponding to each submodel, are output.
The multiple speech recognition submodels may have different structures, which ensures the effectiveness of mutual learning among them. A speech recognition submodel may be based on an end-to-end model; for example, it may be a Transformer model, an RNN (Recurrent Neural Network) model, or a CNN (Convolutional Neural Network) model. Optionally, a speech recognition submodel may also be implemented with CTC (Connectionist Temporal Classification) or a GHMM (Gaussian mixture hidden Markov model); the submodels are not limited to end-to-end models.
As an example, consider a single speech recognition submodel. The training speech signal is input into the submodel for processing, and a prediction result vector is output, for example y_t, the prediction result at time step t, where y_t = [e(t, 0), ..., e(t, j), ..., e(t, V-1)], V is the vocabulary size, and e(t, j) is the probability that the word predicted at time t is the j-th entry of the vocabulary. In other words, the prediction result vector gives, for each entry in the vocabulary, the probability of predicting it at that time step. For an English letter vocabulary, V may be 26, so y_t contains the probability of each of the 26 letters at time t; for Chinese, V is the number of characters in the vocabulary, and y_t contains the probability of each character at time t. The recognition result predicted by the submodel can therefore be determined from its prediction result vectors.
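To make the structure of y_t concrete, the following sketch (an illustration only, not the patent's implementation; the toy submodel, feature dimension, and letter vocabulary are assumptions) shows one way a submodel could output a probability vector over a vocabulary at each time step:

```python
import torch

# Minimal sketch: a toy submodel mapping acoustic features to a per-time-step
# distribution over a vocabulary of size V, mirroring y_t = [e(t,0), ..., e(t,V-1)].
V = 26  # hypothetical vocabulary size, e.g. 26 English letters

class ToySubmodel(torch.nn.Module):
    def __init__(self, feat_dim: int = 80, hidden: int = 128, vocab: int = V):
        super().__init__()
        self.rnn = torch.nn.GRU(feat_dim, hidden, batch_first=True)
        self.out = torch.nn.Linear(hidden, vocab)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, T, feat_dim) acoustic features of the training speech signal
        h, _ = self.rnn(feats)
        # (batch, T, V): row t is the prediction result vector y_t, with the e(t, j) summing to 1
        return torch.softmax(self.out(h), dim=-1)

feats = torch.randn(1, 50, 80)         # 50 frames of 80-dimensional features
y = ToySubmodel()(feats)               # y[0, t] is the vector y_t
print(y.shape, float(y[0, 0].sum()))   # torch.Size([1, 50, 26]), approximately 1.0
```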
It should be noted that the explanation given for the single speech recognition submodel in the above example applies to each of the multiple speech recognition submodels and is not repeated here.
Optionally, the multiple speech recognition submodels may each be pre-trained on annotated speech training data before the training speech signal is input into them to generate the multiple prediction result vectors. For example, speech training data annotated with the corresponding recognition text may be collected in advance, and the processing parameters of each speech recognition submodel trained on this data in a supervised manner, so that the submodel takes a speech signal as input and outputs the corresponding recognition text.
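As a hedged illustration of such supervised pre-training (the optimizer, feature dimensions, and label format below are assumptions, not details from the patent), a per-frame classifier could be fitted to annotated data roughly as follows:

```python
import torch

# Sketch of supervised pre-training of one submodel on annotated speech data.
model = torch.nn.Sequential(torch.nn.Linear(80, 128), torch.nn.ReLU(),
                            torch.nn.Linear(128, 26))   # per-frame logits over the vocabulary
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

feats = torch.randn(8, 80)            # 8 frames of 80-dimensional acoustic features
labels = torch.randint(0, 26, (8,))   # annotated recognition text, as vocabulary indices

for _ in range(10):                   # adjust the processing parameters on the annotated data
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(feats), labels)
    loss.backward()
    optimizer.step()
```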
In practical applications, a speech recognition model is usually generated by training the processing parameters of one preset model on annotated speech training data, so that the model takes a speech signal as input and outputs the corresponding text. For example, an end-to-end speech recognition model may use a Transformer model to recognize the speech signal and obtain the recognition text. However, because different models have different strengths and weaknesses, processing with a single model easily falls into a local optimum; for instance, a left-to-right decoding model, limited by its capacity, tends to produce good prefixes but poor suffixes, whereas a right-to-left decoding model tends to produce good suffixes but poor prefixes. Training multiple speech recognition submodels cooperatively therefore avoids the tendency of a single model to fall into a local optimum and improves the quality of speech recognition.
Step 103: generate, according to the multiple prediction result vectors, the target translation produced by the joint decision of the multiple speech recognition submodels.
In this embodiment of the present invention, the target translation produced by the joint decision of the multiple submodels may be generated from the multiple prediction result vectors, so that the submodels can be trained against a target translation reached by joint decision.
In one embodiment of the present invention, a prediction result vector set may be generated from the multiple prediction result vectors, and the target translation is then generated from the prediction result vector set.
As an example, for a segment of training speech, each speech recognition submodel outputs multiple prediction result vectors y, from which a prediction result vector set Yi = [y0, y1, ..., yt, ..., yn] is generated, where i indexes the speech recognition submodels and Yi is the output that the i-th submodel obtains from the training speech signal; Yi is used to determine the recognition text corresponding to the speech signal. The outputs Yi obtained from the multiple submodels are then averaged to obtain the target translation: for example, the y_t output by each submodel at time t is averaged, and the target translation at that time is determined from the resulting average vector. It will be appreciated that different submodels may output different prediction result vectors for the same training speech signal; for instance, if submodel 1 outputs y_t = [a1, b1, c1] and submodel 2 outputs y_t = [a2, b2, c2], the target-translation vector produced by the averaging joint decision is [(a1+a2)/2, (b1+b2)/2, (c1+c2)/2]. The specific way of averaging may be chosen as needed and is not restricted here.
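The averaging joint decision can be sketched as follows (shapes and values are illustrative assumptions; the patent does not fix the averaging implementation):

```python
import torch

# Joint decision by averaging the prediction result vectors of two submodels at time t.
y_t_model_1 = torch.tensor([0.7, 0.2, 0.1])   # [a1, b1, c1] from submodel 1
y_t_model_2 = torch.tensor([0.5, 0.4, 0.1])   # [a2, b2, c2] from submodel 2

# [(a1+a2)/2, (b1+b2)/2, (c1+c2)/2]
y_avg = torch.stack([y_t_model_1, y_t_model_2]).mean(dim=0)
print(y_avg)                                   # tensor([0.6000, 0.3000, 0.1000])
target_index = int(y_avg.argmax())             # vocabulary entry of the target translation at time t
```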
As another possible implementation, the training speech signal may be annotated, and the translation determined from the annotated recognition result. For example, if the translation corresponding to y0 is the third word of the vocabulary, the vector of the translation corresponding to y0 is the one-hot vector [0, 0, 1, 0, ..., 0] of length V.
It should be noted that the way in which the target translation is obtained may be chosen according to actual needs; for example, to address the mismatch between decoding and training in decoder-based models, the target translation may be determined from the prediction results output by the models. This is not restricted here.
Step 104: train the multiple speech recognition submodels according to the target translation and the multiple prediction result vectors generated by each submodel.
In this embodiment, after the target translation produced by the joint decision of the multiple submodels has been generated from the multiple prediction result vectors, the submodels may be trained against this target translation and the prediction result vectors, so as to adjust the processing parameters of each speech recognition submodel. Training the processing parameters of the submodels against a target translation produced by joint decision improves the quality of speech recognition compared with the recognition result produced by a single model.
As an example, the multiple speech recognition submodels are trained with a loss function in which y_avg is the target translation, y_i is the prediction result vector of the i-th model, and n is the number of speech recognition submodels.
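Since the publication reproduces the loss function only as an image, its exact form is not restated here. A minimal sketch of one plausible consensus loss over the named quantities (y_avg, y_i, n) is given below; the squared-error form and the detached average are assumptions, not the patent's formula:

```python
import torch

# Hypothetical consensus loss: penalise each submodel's prediction y_i for
# deviating from the jointly decided target translation y_avg, averaged over n submodels.
def consensus_loss(predictions: list) -> torch.Tensor:
    n = len(predictions)                        # number of speech recognition submodels
    y_avg = torch.stack(predictions).mean(0)    # target translation by joint decision (averaging)
    return sum(((y_i - y_avg.detach()) ** 2).sum() for y_i in predictions) / n

preds = [torch.rand(50, 26, requires_grad=True) for _ in range(3)]  # 3 submodels, 50 steps, V = 26
loss = consensus_loss(preds)
loss.backward()                                 # gradients adjust each submodel's parameters
```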
In the training method of the speech recognition model according to this embodiment of the present invention, a speech signal for training is obtained and input into multiple speech recognition submodels to generate multiple prediction result vectors. A target translation produced by the joint decision of the multiple submodels is then generated from these prediction result vectors, and the submodels are trained according to the target translation and the prediction result vectors each of them produced. Training against a target translation reached by joint decision reduces the tendency of a single model to fall into a local optimum and, through the mutual learning strategy among the submodels, improves the quality of speech recognition.
To implement the above embodiments, the present invention further proposes a training device for a speech recognition model.
Fig. 2 is a schematic structural diagram of a training device for a speech recognition model provided by an embodiment of the present invention. As shown in Fig. 2, the training device includes an obtaining module 100, a processing module 200, a generation module 300, and a training module 400.
The obtaining module 100 is configured to obtain a speech signal for training.
The processing module 200 is configured to input the training speech signal into the multiple speech recognition submodels to generate multiple prediction result vectors.
The generation module 300 is configured to generate, according to the multiple prediction result vectors, the target translation produced by the joint decision of the multiple speech recognition submodels.
The training module 400 is configured to train the multiple speech recognition submodels according to the target translation and the multiple prediction result vectors generated by each submodel.
In one embodiment of the present invention, the multiple speech recognition submodels include two or more of a Transformer model, an RNN model, a CNN model, a CTC model, and a GHMM model.
In one embodiment of the present invention, the generation module 300 is specifically configured to: generate a prediction result vector set from the multiple prediction result vectors; and generate the target translation from the prediction result vector set.
In one embodiment of the present invention, the multiple speech recognition submodels are trained with a loss function in which y_avg is the target translation, y_i is the prediction result vector of the i-th model, and n is the number of speech recognition submodels.
It should be noted that the foregoing explanation of the training method of the speech recognition model also applies to the training device of this embodiment, and is not repeated here.
In the training device of the speech recognition model according to this embodiment of the present invention, a speech signal for training is obtained and input into multiple speech recognition submodels to generate multiple prediction result vectors. A target translation produced by the joint decision of the multiple submodels is then generated from these prediction result vectors, and the submodels are trained according to the target translation and the prediction result vectors each of them produced. Training against a target translation reached by joint decision reduces the tendency of a single model to fall into a local optimum and, through the mutual learning strategy among the submodels, improves the quality of speech recognition.
To implement the above embodiments, the present invention further proposes a computer device, including a processor and a memory, wherein the processor runs a program corresponding to executable program code stored in the memory by reading the executable program code, so as to implement the training method of the speech recognition model described in any of the foregoing embodiments.
To implement the above embodiments, the present invention further proposes a computer program product which, when its instructions are executed by a processor, implements the training method of the speech recognition model described in any of the foregoing embodiments.
To implement the above embodiments, the present invention further proposes a computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the training method of the speech recognition model described in any of the foregoing embodiments.
Fig. 3 shows a block diagram of an exemplary computer device suitable for implementing embodiments of the present invention. The computer device 12 shown in Fig. 3 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present invention.
As shown in Fig. 3, the computer device 12 takes the form of a general-purpose computing device. Its components may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 connecting the different system components (including the system memory 28 and the processing units 16).
The bus 18 represents one or more of several classes of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
The computer device 12 typically includes a variety of computer-system-readable media. These media may be any available media that can be accessed by the computer device 12, including volatile and non-volatile media, and removable and non-removable media.
The memory 28 may include computer-system-readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32. The computer device 12 may further include other removable/non-removable, volatile/non-volatile computer-system storage media. By way of example only, the storage system 34 may be used to read from and write to non-removable, non-volatile magnetic media (not shown in Fig. 3, commonly referred to as a "hard drive"). Although not shown in Fig. 3, a disk drive for reading from and writing to a removable non-volatile magnetic disk (such as a "floppy disk") may be provided, as may an optical disc drive for reading from and writing to a removable non-volatile optical disc (such as a compact disc read-only memory (CD-ROM), a digital versatile disc read-only memory (DVD-ROM), or other optical media). In these cases, each drive may be connected to the bus 18 through one or more data media interfaces. The memory 28 may include at least one program product having a set of (for example, at least one) program modules configured to perform the functions of the embodiments of the present application.
A program/utility 40 having a set of (at least one) program modules 42 may be stored, for example, in the memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment. The program modules 42 generally perform the functions and/or methods of the embodiments described herein.
The computer device 12 may also communicate with one or more external devices 14 (such as a keyboard, a pointing device, or a display 24), with one or more devices that enable a user to interact with the computer device 12, and/or with any device (such as a network card or a modem) that enables the computer device 12 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 22. The computer device 12 may also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 20. As shown, the network adapter 20 communicates with the other modules of the computer device 12 through the bus 18. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the computer device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
The processing unit 16 executes the programs stored in the system memory 28, thereby performing various functional applications and data processing, for example implementing the methods referred to in the foregoing embodiments.
In the description of the present invention, it should be understood that the terms "first" and "second" are used for descriptive purposes only and cannot be construed as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. Thus, a feature defined with "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "multiple" means at least two, for example two or three, unless specifically defined otherwise.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "an example", "a specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic references to these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, provided they do not contradict one another, those skilled in the art may combine the features of different embodiments or examples described in this specification.
Although embodiments of the present invention have been shown and described above, it can be understood that the above embodiments are exemplary and should not be construed as limiting the present invention; those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present invention.

Claims (10)

1. A training method for a speech recognition model, wherein the speech recognition model includes multiple speech recognition submodels, the method comprising:
obtaining a speech signal for training;
inputting the speech signal into the multiple speech recognition submodels to generate multiple prediction result vectors;
generating, according to the multiple prediction result vectors, a target translation produced by the joint decision of the multiple speech recognition submodels; and
training the multiple speech recognition submodels according to the target translation and the multiple prediction result vectors generated by each speech recognition submodel.
2. The training method of claim 1, wherein the multiple speech recognition submodels include two or more of a Transformer model, an RNN model, a CNN model, a CTC model, and a GHMM model.
3. The training method of claim 1, wherein generating, according to the multiple prediction result vectors, the target translation produced by the joint decision of the multiple speech recognition submodels comprises:
generating a prediction result vector set from the multiple prediction result vectors; and
generating the target translation from the prediction result vector set.
4. The training method of claim 1, wherein the multiple speech recognition submodels are trained with a loss function in which y_avg is the target translation, y_i is the prediction result vector of the i-th model, and n is the number of speech recognition submodels.
5. A training device for a speech recognition model, wherein the speech recognition model includes multiple speech recognition submodels, the device comprising:
an obtaining module, configured to obtain a speech signal for training;
a processing module, configured to input the speech signal into the multiple speech recognition submodels to generate multiple prediction result vectors;
a generation module, configured to generate, according to the multiple prediction result vectors, a target translation produced by the joint decision of the multiple speech recognition submodels; and
a training module, configured to train the multiple speech recognition submodels according to the target translation and the multiple prediction result vectors generated by each speech recognition submodel.
6. The training device of claim 5, wherein the multiple speech recognition submodels include two or more of a Transformer model, an RNN model, a CNN model, a CTC model, and a GHMM model.
7. The training device of claim 5, wherein the generation module is specifically configured to:
generate a prediction result vector set from the multiple prediction result vectors; and
generate the target translation from the prediction result vector set.
8. The training device of claim 5, wherein the multiple speech recognition submodels are trained with a loss function in which y_avg is the target translation, y_i is the prediction result vector of the i-th model, and n is the number of speech recognition submodels.
9. A computer device, comprising a processor and a memory;
wherein the processor runs a program corresponding to executable program code stored in the memory by reading the executable program code, so as to implement the training method of the speech recognition model of any one of claims 1 to 4.
10. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the training method of the speech recognition model of any one of claims 1 to 4.
CN201910477604.8A 2019-06-03 2019-06-03 Training method, device and equipment of voice recognition model Active CN110246486B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910477604.8A CN110246486B (en) 2019-06-03 2019-06-03 Training method, device and equipment of voice recognition model


Publications (2)

Publication Number Publication Date
CN110246486A (en) 2019-09-17
CN110246486B (en) 2021-07-13

Family

ID=67885808

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910477604.8A Active CN110246486B (en) 2019-06-03 2019-06-03 Training method, device and equipment of voice recognition model

Country Status (1)

Country Link
CN (1) CN110246486B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090138265A1 (en) * 2007-11-26 2009-05-28 Nuance Communications, Inc. Joint Discriminative Training of Multiple Speech Recognizers
US20150199960A1 (en) * 2012-08-24 2015-07-16 Microsoft Corporation I-Vector Based Clustering Training Data in Speech Recognition
CN103794214A (en) * 2014-03-07 2014-05-14 联想(北京)有限公司 Information processing method, device and electronic equipment
CN109710727A (en) * 2017-10-26 2019-05-03 哈曼国际工业有限公司 System and method for natural language processing
CN108597502A (en) * 2018-04-27 2018-09-28 上海适享文化传播有限公司 Field speech recognition training method based on dual training

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
郑晓康: "Research on the Translation of Out-of-Vocabulary Words in Neural Network Translation Models for Chinese-English Patent Documents", China Masters' Theses Full-text Database, Information Science and Technology Series *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021198838A1 (en) * 2020-04-03 2021-10-07 International Business Machines Corporation Training of model for processing sequence data
GB2609157A (en) * 2020-04-03 2023-01-25 Ibm Training of model for processing sequence data
CN112885330A (en) * 2021-01-26 2021-06-01 北京云上曲率科技有限公司 Language identification method and system based on low-resource audio

Also Published As

Publication number Publication date
CN110246486B (en) 2021-07-13

Similar Documents

Publication Publication Date Title
Afouras et al. Deep lip reading: a comparison of models and an online application
Afouras et al. ASR is all you need: Cross-modal distillation for lip reading
CN110188202B (en) Training method and device of semantic relation recognition model and terminal
JP7432556B2 (en) Methods, devices, equipment and media for man-machine interaction
US11373049B2 (en) Cross-lingual classification using multilingual neural machine translation
CN109887497A Modeling method, device, and equipment for speech recognition
CN110033760A Modeling method, device, and equipment for speech recognition
CN108985358B (en) Emotion recognition method, device, equipment and storage medium
CN113205817B (en) Speech semantic recognition method, system, device and medium
US10147438B2 (en) Role modeling in call centers and work centers
CN108986793A Translation processing method, device, and equipment
Zimmermann et al. Visual speech recognition using PCA networks and LSTMs in a tandem GMM-HMM system
US11900518B2 (en) Interactive systems and methods
CN110175335A Training method and device for a translation model
CN111816169B (en) Method and device for training Chinese and English hybrid speech recognition model
CN110211570A (en) Simultaneous interpretation processing method, device and equipment
CN110795549B (en) Short text conversation method, device, equipment and storage medium
CN108460098A (en) Information recommendation method, device and computer equipment
CN108984679A Training method and device for a dialogue generation model
US20180365208A1 (en) Method for modifying segmentation model based on artificial intelligence, device and storage medium
Yu et al. A Multistage Training Framework for Acoustic-to-Word Model.
CN113535957A (en) Conversation emotion recognition network model based on dual knowledge interaction and multitask learning, construction method, electronic device and storage medium
CN112016271A (en) Language style conversion model training method, text processing method and device
CN110209786A Display method, device, computer equipment, and storage medium for a non-class answer
KR20230158613A (en) Self-adaptive distillation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant