CN110246486A - Training method, apparatus and device for a speech recognition model - Google Patents
Training method, apparatus and device for a speech recognition model
- Publication number
- CN110246486A CN201910477604.8A CN201910477604A
- Authority
- CN
- China
- Prior art keywords
- speech recognition
- prediction result
- submodel
- trained
- target transcript
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Machine Translation (AREA)
Abstract
The invention proposes a training method, apparatus and device for a speech recognition model. The method includes: obtaining a speech signal to be used for training; inputting the training speech signal into multiple speech recognition submodels to generate multiple prediction result vectors; generating, according to the multiple prediction result vectors, a target transcript produced by the joint decision of the multiple speech recognition submodels; and training the multiple speech recognition submodels according to the target transcript and the prediction result vectors generated by each submodel. By training the speech recognition model with a target transcript produced jointly by multiple submodels, the quality of speech recognition is improved.
Description
Technical field
The present invention relates to the technical field of speech recognition, and in particular to a training method, apparatus and device for a speech recognition model.
Background technique
With the development of artificial intelligence, speech recognition technology has made great progress and has entered fields such as household appliances, communications, automobiles and medical care.
In the related art, a speech recognition model is usually trained with a single chosen model structure. Because each model structure has its own strengths and weaknesses, and because the scale of the training corpus is limited, the model easily falls into a local optimum, and the quality of the recognition results leaves room for improvement.
Summary of the invention
The present invention aims to solve at least one of the technical problems in the related art.
To this end, a first object of the invention is to propose a training method for a speech recognition model that trains the model with a target transcript produced by the joint decision of multiple speech recognition submodels, thereby improving the quality of speech recognition.
A second object of the invention is to propose a training apparatus for a speech recognition model.
A third object of the invention is to propose a computer device.
A fourth object of the invention is to propose a computer-readable storage medium.
An embodiment of the first aspect of the invention proposes a training method for a speech recognition model, the model comprising multiple speech recognition submodels, the method comprising:
obtaining a speech signal to be used for training;
inputting the training speech signal into the multiple speech recognition submodels to generate multiple prediction result vectors;
generating, according to the multiple prediction result vectors, a target transcript produced by the joint decision of the multiple speech recognition submodels; and
training the multiple speech recognition submodels according to the target transcript and the multiple prediction result vectors generated by each speech recognition submodel.
In the training method of this embodiment of the invention, a training speech signal is obtained and input into multiple speech recognition submodels to generate multiple prediction result vectors. A target transcript is then generated from the multiple prediction result vectors by the joint decision of the submodels, and the submodels are trained according to the target transcript and the prediction result vectors each of them produced. Training the submodels against a jointly decided target transcript mitigates the tendency of a single model to fall into a local optimum; the mutual learning strategy among the submodels improves the quality of speech recognition.
In addition, the training method according to the above embodiment of the invention may have the following additional technical features:
Optionally, the multiple speech recognition submodels include several of a Transformer model, an RNN model, a CNN model, CTC and a GHMM.
Optionally, generating the target transcript produced by the joint decision of the multiple speech recognition submodels according to the multiple prediction result vectors comprises: generating a set of prediction result vectors from the multiple prediction result vectors, and generating the target transcript from that set.
Optionally, the multiple speech recognition submodels are trained with the following loss function, where y_avg is the target transcript, y_i is the prediction result vector of the i-th model, and n is the number of speech recognition submodels.
An embodiment of the second aspect of the invention proposes a training apparatus for a speech recognition model, the model comprising multiple speech recognition submodels, the apparatus comprising:
an obtaining module, for obtaining a speech signal to be used for training;
a processing module, for inputting the training speech signal into the multiple speech recognition submodels to generate multiple prediction result vectors;
a generation module, for generating, according to the multiple prediction result vectors, a target transcript produced by the joint decision of the multiple speech recognition submodels; and
a training module, for training the multiple speech recognition submodels according to the target transcript and the multiple prediction result vectors generated by each speech recognition submodel.
In the training apparatus of this embodiment of the invention, a training speech signal is obtained and input into multiple speech recognition submodels to generate multiple prediction result vectors. A target transcript is then generated from the multiple prediction result vectors by the joint decision of the submodels, and the submodels are trained according to the target transcript and the prediction result vectors each of them produced. Training the submodels against a jointly decided target transcript mitigates the tendency of a single model to fall into a local optimum; the mutual learning strategy among the submodels improves the quality of speech recognition.
In addition, the training apparatus according to the above embodiment of the invention may have the following additional technical features:
Optionally, the multiple speech recognition submodels include several of a Transformer model, an RNN model, a CNN model, CTC and a GHMM.
Optionally, the generation module is specifically configured to: generate a set of prediction result vectors from the multiple prediction result vectors, and generate the target transcript from that set.
Optionally, the multiple speech recognition submodels are trained with the following loss function, where y_avg is the target transcript, y_i is the prediction result vector of the i-th model, and n is the number of speech recognition submodels.
An embodiment of the third aspect of the invention proposes a computer device comprising a processor and a memory, wherein the processor runs a program corresponding to executable program code stored in the memory by reading that code, so as to implement the training method of the speech recognition model described in the first-aspect embodiment.
An embodiment of the fourth aspect of the invention proposes a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the training method of the speech recognition model described in the first-aspect embodiment is implemented.
Additional aspects and advantages of the invention will be set forth in part in the following description, will in part become apparent from the description, or will be learned by practice of the invention.
Brief description of the drawings
Fig. 1 is a flow diagram of a training method for a speech recognition model according to an embodiment of the present invention;
Fig. 2 is a structural diagram of a training apparatus for a speech recognition model according to an embodiment of the present invention;
Fig. 3 is a block diagram of an exemplary computer device suitable for implementing embodiments of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements, or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the invention and are not to be construed as limiting it.
The training method, apparatus and device of the speech recognition model of the embodiments of the present invention are described below with reference to the drawings.
Fig. 1 is a flow diagram of a training method for a speech recognition model according to an embodiment of the present invention. As shown in Fig. 1, the method comprises:
Step 101: a speech signal to be used for training is obtained.
In this embodiment, when training the speech recognition model, a training speech signal is first obtained. For example, a speech signal may be collected through a sound-receiving device such as a microphone, or speech data may be obtained from an annotation platform and used as the training speech signal.
The training speech signal may be in any language, such as Chinese, English or Russian, chosen according to the speech recognition model.
Step 102: the training speech signal is input into the multiple speech recognition submodels to generate multiple prediction result vectors.
In this embodiment, multiple speech recognition submodels may be preset; the training speech signal is input separately into each submodel for processing, and each submodel outputs its corresponding prediction result vectors.
Submodels with different structures may be used, so as to guarantee the effect of mutual learning among them. A speech recognition submodel may be an end-to-end model, for example a Transformer model, an RNN (Recurrent Neural Network) model or a CNN (Convolutional Neural Network) model; optionally, a submodel may also be realized with CTC (Connectionist Temporal Classification) or a GHMM (Gaussian mixture hidden Markov model), and is not limited to end-to-end models.
As an example, consider a single speech recognition submodel. The training speech signal is input into the submodel for processing, and a prediction result vector is output, for instance y_t, the prediction result at time t, where y_t = [e(t,0), …, e(t,j), …, e(t,V−1)], V is the vocabulary size and e(t,j) is the probability that the output at time t is the j-th entry in the vocabulary. In other words, the prediction result vector gives the probability of each vocabulary entry: for English, V may be 26, representing the 26 letters, and y_t contains the probability that the prediction at time t is each letter; for Chinese, V is the number of characters, and y_t contains the probability of each character at time t. The recognition result predicted by the submodel can thus be determined from its prediction result vectors.
It should be noted that each submodel among the multiple speech recognition submodels behaves as described for the single submodel in the example above, which is not repeated here.
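Under the notation above, a prediction result vector is a probability distribution over the vocabulary at one time step. A minimal sketch follows; the 26-letter vocabulary and the toy scores are illustrative assumptions, not values from the patent:

```python
import math

def softmax(scores):
    """Turn raw scores into a probability vector over the vocabulary."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary of V = 26 English letters.
vocab = [chr(ord('a') + j) for j in range(26)]

# Hypothetical raw scores for one time step t emitted by a submodel.
raw_scores = [0.0] * 26
raw_scores[7] = 3.0  # favour the 8th entry, 'h'

y_t = softmax(raw_scores)          # e(t, j): probability of the j-th entry
assert abs(sum(y_t) - 1.0) < 1e-9  # a valid probability distribution

# The symbol predicted at time t is the argmax over the vocabulary.
predicted = vocab[y_t.index(max(y_t))]
print(predicted)  # 'h'
```

This is only the per-step view; a full decoder would combine such vectors across time steps into a transcript.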
Optionally, the multiple speech recognition submodels may each be pre-trained on annotated speech training data, after which the training speech signal is input into them to generate the multiple prediction result vectors. For example, speech training data annotated with the corresponding recognition text may be collected in advance, and the processing parameters of each submodel trained in a supervised manner, so that the submodel takes a speech signal as input and outputs the corresponding recognition text.
In practice, a speech recognition model is usually generated by training the processing parameters of one chosen model on annotated speech training data, so that the model takes a speech signal as input and outputs the corresponding text. For example, an end-to-end speech recognition model may use a Transformer to recognize the speech signal and obtain the text. However, since different models have different strengths and weaknesses, processing with a single model easily falls into a local optimum: a model that decodes from left to right is limited by its capability and tends to produce a correct prefix but a wrong suffix, whereas a model that decodes from right to left tends to produce a correct suffix but a wrong prefix. The multiple speech recognition submodels can therefore be trained collaboratively, avoiding the local optima a single model falls into and improving the quality of speech recognition.
Step 103: a target transcript produced by the joint decision of the multiple submodels is generated according to the multiple prediction result vectors.
In this embodiment of the invention, a target transcript produced by the joint decision of the multiple submodels can be generated from the multiple prediction result vectors, and the submodels are then trained with this jointly decided target transcript.
In one embodiment of the invention, a set of prediction result vectors may be generated from the multiple prediction result vectors, and the target transcript generated from that set.
As an example, for one segment of training speech, each submodel outputs multiple prediction result vectors y, from which a set Yi = [y0, y1, …, yt, …, yn] is generated, where i indexes the speech recognition submodels and Yi is the output obtained by the i-th submodel from the training speech signal; Yi determines the recognition text corresponding to the speech signal. The outputs Yi of the multiple submodels are then averaged to obtain the target transcript: for each time t, the yt vectors output by the submodels are averaged, and the target transcript at that time is determined from the averaged vector. It can be appreciated that different submodels may output different prediction vectors for the same training speech signal; for example, if submodel 1 outputs yt = [a1, b1, c1] and submodel 2 outputs yt = [a2, b2, c2], the target vector produced by the averaging joint decision is [(a1+a2)/2, (b1+b2)/2, (c1+c2)/2]. The averaging implementation can be chosen as needed and is not restricted here.
As another possible implementation, the training speech signal may be annotated, and the transcript determined from the annotated recognition result; for example, if the transcript corresponding to y0 is the third entry in the vocabulary, the vector of that transcript is the one-hot vector [0, 0, 1, 0, …, 0] of length V.
It should be noted that the implementation for obtaining the target transcript can be chosen according to actual needs; for example, to address the mismatch between decoding and training in a decoder model, the target transcript may be determined from the prediction results output by the models. No restriction is imposed here.
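The averaging joint decision described above can be sketched as follows; the two submodel outputs are invented toy values, since the patent does not fix the averaging implementation:

```python
def joint_target(outputs_per_model):
    """Average the per-time-step prediction vectors of several submodels.

    outputs_per_model: list over submodels; each entry is a list over time
    steps of probability vectors (one float per vocabulary entry).
    Returns the averaged sequence used as the target transcript vectors.
    """
    n = len(outputs_per_model)
    n_steps = len(outputs_per_model[0])
    averaged = []
    for t in range(n_steps):
        vec_len = len(outputs_per_model[0][t])
        avg = [sum(m[t][j] for m in outputs_per_model) / n
               for j in range(vec_len)]
        averaged.append(avg)
    return averaged

# Two submodels, one time step, a 3-entry vocabulary (toy values).
model1 = [[0.8, 0.1, 0.1]]
model2 = [[0.6, 0.3, 0.1]]
y_avg = joint_target([model1, model2])
print([round(v, 3) for v in y_avg[0]])  # [0.7, 0.2, 0.1]
```

An average of valid probability vectors is itself a valid probability vector, so the joint target can be decoded or compared against submodel outputs in the same way as any single prediction.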
Step 104: the multiple speech recognition submodels are trained according to the target transcript and the multiple prediction result vectors generated by each submodel.
In this embodiment, after the target transcript produced by the joint decision of the submodels is generated from the multiple prediction result vectors, the submodels can be trained with that target transcript and the prediction result vectors, adjusting the processing parameters of each submodel. Training the submodels' processing parameters against a jointly decided target transcript improves the quality of speech recognition compared with the recognition results produced by a single model.
As an example, the multiple speech recognition submodels are trained with the following loss function, where y_avg is the target transcript, y_i is the prediction result vector of the i-th model, and n is the number of speech recognition submodels.
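The formula itself is not reproduced in the source. Assuming, as one plausible reading of the variables just defined, a mean squared distance between each submodel's prediction and the averaged target, the loss can be computed as below; the function name and the toy vectors are illustrative assumptions:

```python
def ensemble_loss(y_avg, predictions):
    """Assumed form L = (1/n) * sum_i ||y_avg - y_i||^2.

    The patent only names y_avg, y_i and n, not the exact expression,
    so this is a sketch of one consistent choice.
    """
    n = len(predictions)
    total = 0.0
    for y_i in predictions:
        total += sum((a - b) ** 2 for a, b in zip(y_avg, y_i))
    return total / n

# Toy values: the averaged target and two submodel predictions.
y_avg = [0.7, 0.2, 0.1]
preds = [[0.8, 0.1, 0.1], [0.6, 0.3, 0.1]]
print(round(ensemble_loss(y_avg, preds), 6))  # 0.02
```

Minimizing such a loss pulls each submodel's prediction toward the joint target, which is the mutual learning effect the embodiment describes.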
In the training method of the embodiment of the present invention, a training speech signal is obtained and input into multiple speech recognition submodels to generate multiple prediction result vectors. A target transcript is then generated from these vectors by the joint decision of the submodels, and the submodels are trained according to the target transcript and the prediction result vectors each of them produced. Training against a jointly decided target transcript mitigates the tendency of a single model to fall into a local optimum; the mutual learning strategy among the submodels improves the quality of speech recognition.
To realize the above embodiments, the present invention further proposes a training apparatus for the speech recognition model.
Fig. 2 is a structural diagram of a training apparatus for a speech recognition model according to an embodiment of the present invention. As shown in Fig. 2, the training apparatus comprises: an obtaining module 100, a processing module 200, a generation module 300 and a training module 400.
The obtaining module 100 is configured to obtain a speech signal to be used for training.
The processing module 200 is configured to input the training speech signal into the multiple speech recognition submodels to generate multiple prediction result vectors.
The generation module 300 is configured to generate, according to the multiple prediction result vectors, the target transcript produced by the joint decision of the multiple submodels.
The training module 400 is configured to train the multiple submodels according to the target transcript and the multiple prediction result vectors generated by each submodel.
In one embodiment of the invention, the multiple speech recognition submodels include several of a Transformer model, an RNN model, a CNN model, CTC and a GHMM.
In one embodiment of the invention, the generation module 300 is specifically configured to generate a set of prediction result vectors from the multiple prediction result vectors, and to generate the target transcript from that set.
In one embodiment of the invention, the multiple speech recognition submodels are trained with the following loss function, where y_avg is the target transcript, y_i is the prediction result vector of the i-th model, and n is the number of speech recognition submodels.
It should be noted that the foregoing explanation of the training method of the speech recognition model applies equally to the training apparatus of this embodiment, and is not repeated here.
In the training apparatus of the embodiment of the present invention, a training speech signal is obtained and input into multiple speech recognition submodels to generate multiple prediction result vectors. A target transcript is then generated from these vectors by the joint decision of the submodels, and the submodels are trained according to the target transcript and the prediction result vectors each of them produced. Training against a jointly decided target transcript mitigates the tendency of a single model to fall into a local optimum; the mutual learning strategy among the submodels improves the quality of speech recognition.
To realize the above embodiments, the present invention further proposes a computer device comprising a processor and a memory, wherein the processor runs a program corresponding to executable program code stored in the memory by reading that code, so as to implement the training method of the speech recognition model described in any of the foregoing embodiments.
To realize the above embodiments, the present invention further proposes a computer program product; when instructions in the product are executed by a processor, the training method of the speech recognition model described in any of the foregoing embodiments is implemented.
To realize the above embodiments, the present invention further proposes a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the training method of the speech recognition model described in any of the foregoing embodiments is implemented.
Fig. 3 shows a block diagram of an exemplary computer device suitable for implementing embodiments of the present invention. The computer device 12 shown in Fig. 3 is only an example and should not impose any restriction on the functions or scope of use of the embodiments of the present invention.
As shown in Fig. 3, the computer device 12 takes the form of a general-purpose computing device. Its components may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 connecting the different system components (including the system memory 28 and the processing unit 16).
The bus 18 represents one or more of several classes of bus structures, including a memory bus or memory controller, a peripheral bus, a graphics acceleration port, a processor, or a local bus using any of a variety of bus architectures. By way of example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus and the Peripheral Component Interconnect (PCI) bus.
The computer device 12 typically comprises a variety of computer-system-readable media. These media can be any usable media accessible by the computer device 12, including volatile and non-volatile media and removable and non-removable media.
The memory 28 may include computer-system-readable media in the form of volatile memory, such as a random access memory (RAM) 30 and/or a cache memory 32. The computer device 12 may further comprise other removable/non-removable, volatile/non-volatile computer-system storage media. By way of example only, the storage system 34 can be used for reading and writing a non-removable, non-volatile magnetic medium (not shown in Fig. 3, commonly referred to as a "hard drive"). Although not shown in Fig. 3, a disk drive for reading and writing a removable non-volatile magnetic disk (such as a "floppy disk") may be provided, as well as an optical disk drive for reading and writing a removable non-volatile optical disk (such as a Compact Disc Read-Only Memory (CD-ROM), a Digital Versatile Disc Read-Only Memory (DVD-ROM) or another optical medium). In these cases, each drive can be connected to the bus 18 through one or more data media interfaces. The memory 28 may include at least one program product having a set of (for example, at least one) program modules configured to perform the functions of the embodiments of the application.
A program/utility 40 having a set of (at least one) program modules 42 may be stored, for example, in the memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules and program data; each of these examples, or some combination thereof, may include an implementation of a network environment. The program modules 42 generally perform the functions and/or methods of the embodiments described herein.
The computer device 12 can also communicate with one or more external devices 14 (such as a keyboard, a pointing device or a display 24), with one or more devices that enable a user to interact with the computer device 12, and/or with any device (such as a network card or a modem) that enables the computer device 12 to communicate with one or more other computing devices. Such communication can take place through the input/output (I/O) interfaces 22. Moreover, the computer device 12 can communicate through the network adapter 20 with one or more networks, such as a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet. As shown, the network adapter 20 communicates with the other modules of the computer device 12 through the bus 18. It should be understood that, although not shown in the drawings, other hardware and/or software modules can be used in conjunction with the computer device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives and data backup storage systems.
Processing unit 16 by the program that is stored in system storage 28 of operation, thereby executing various function application and
Data processing, such as realize the method referred in previous embodiment.
In the description of the present invention, it should be understood that the terms "first" and "second" are used for descriptive purposes only and cannot be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality of" means at least two, such as two, three, etc., unless specifically defined otherwise.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "an example", "a specific example", or "some examples" means that a particular feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, provided they do not contradict each other, those skilled in the art may combine different embodiments or examples described in this specification, as well as the features of different embodiments or examples.
Although the embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are exemplary and are not to be construed as limiting the present invention; those skilled in the art can make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present invention.
Claims (10)
1. A training method for a speech recognition model, wherein the speech recognition model comprises a plurality of speech recognition submodels, the method comprising:
obtaining a speech signal to be trained;
inputting the speech signal to be trained into the plurality of speech recognition submodels to generate a plurality of prediction result vectors;
generating, according to the plurality of prediction result vectors, a target translation produced by joint decision of the plurality of speech recognition submodels; and
training the plurality of speech recognition submodels according to the target translation and the plurality of prediction result vectors generated by each speech recognition submodel.
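The four steps of claim 1 (obtain a speech signal, get one prediction result vector per submodel, derive a jointly decided target, train every submodel against it) can be sketched as below. This is an illustrative sketch only, not the patent's implementation: `DummySubmodel`, its `predict`/`update` methods, and element-wise averaging as the "joint decision" are all assumptions.

```python
import numpy as np

class DummySubmodel:
    """Stand-in for one speech recognition submodel (a Transformer, RNN,
    CNN, etc. in the real system); predict/update are hypothetical names."""
    def __init__(self, init_vec):
        self.v = np.asarray(init_vec, dtype=float)

    def predict(self, signal):
        # A real submodel would decode the speech signal; the stand-in
        # simply returns its current prediction result vector.
        return self.v

    def update(self, signal, target, lr):
        # Nudge the prediction toward the jointly decided target.
        self.v += lr * (target - self.v)

def train_step(submodels, speech_signal, lr=0.5):
    # Step 2: obtain one prediction result vector per submodel.
    predictions = [m.predict(speech_signal) for m in submodels]
    # Step 3: joint decision -- assumed here to be the element-wise
    # average, matching the y_avg symbol used in claim 4.
    y_avg = np.mean(predictions, axis=0)
    # Step 4: train every submodel against the shared target.
    for m in submodels:
        m.update(speech_signal, y_avg, lr)
    return y_avg
```

With two toy submodels whose prediction vectors disagree, one `train_step` pulls both toward the averaged target, which is the ensemble-consistency effect the claim describes.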
2. The training method of a speech recognition model according to claim 1, wherein the plurality of speech recognition submodels comprise two or more of a Transformer model, an RNN model, a CNN model, a CTC model, and a GHMM model.
3. The training method of a speech recognition model according to claim 1, wherein generating, according to the plurality of prediction result vectors, the target translation produced by joint decision of the plurality of speech recognition submodels comprises:
generating a prediction result vector set according to the plurality of prediction result vectors; and
generating the target translation according to the prediction result vector set.
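The two sub-steps of claim 3 can be sketched as below, under stated assumptions: the toy `VOCAB`, stacking the vectors as the "set", and averaging-then-argmax as the decision rule are all hypothetical, since the published text does not specify the output units or the decision procedure.

```python
import numpy as np

# Hypothetical toy vocabulary; the real system's output units are
# not specified in the published text.
VOCAB = ["<blank>", "hello", "world"]

def joint_decision(prediction_vectors):
    """Claim-3 sketch: collect the per-submodel prediction result
    vectors into a set, then derive the target translation from it."""
    vector_set = np.stack(prediction_vectors)          # the "set"
    y_avg = vector_set.mean(axis=0)                    # shared decision
    # One token per time step: take the highest-probability entry.
    return [VOCAB[i] for i in y_avg.argmax(axis=-1)]
```

Here each prediction result vector is a (time steps x vocabulary) array of probabilities, so the averaged set yields one token sequence that all submodels jointly "voted" for.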
4. The training method of a speech recognition model according to claim 1, wherein the plurality of speech recognition submodels are trained by the following loss function,
where y_avg is the target translation, y_i is the prediction result vector of the i-th model, and n is the number of the speech recognition submodels.
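The loss formula of claim 4 appears as an image in the original publication and is absent from this text, so the form below is purely an assumption: a mean squared distance between each submodel's prediction and the jointly decided target, chosen only because it is consistent with the stated variables y_avg, y_i, and n.

```python
import numpy as np

def ensemble_loss(predictions, y_avg):
    """Hypothetical reconstruction of the claim-4 loss (assumption,
    not the patent's formula):
        L = (1/n) * sum_i || y_avg - y_i ||^2
    where n is the number of submodels."""
    n = len(predictions)
    return sum(float(np.sum((y_avg - y) ** 2)) for y in predictions) / n
```

Under this assumed form, the loss is zero exactly when every submodel already agrees with the joint decision, which matches the stated goal of training the submodels toward a shared target.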
5. A training device for a speech recognition model, wherein the speech recognition model comprises a plurality of speech recognition submodels, the device comprising:
an obtaining module, configured to obtain a speech signal to be trained;
a processing module, configured to input the speech signal to be trained into the plurality of speech recognition submodels to generate a plurality of prediction result vectors;
a generation module, configured to generate, according to the plurality of prediction result vectors, a target translation produced by joint decision of the plurality of speech recognition submodels; and
a training module, configured to train the plurality of speech recognition submodels according to the target translation and the plurality of prediction result vectors generated by each speech recognition submodel.
6. The training device of a speech recognition model according to claim 5, wherein the plurality of speech recognition submodels comprise two or more of a Transformer model, an RNN model, a CNN model, a CTC model, and a GHMM model.
7. The training device of a speech recognition model according to claim 5, wherein the generation module is specifically configured to:
generate a prediction result vector set according to the plurality of prediction result vectors; and
generate the target translation according to the prediction result vector set.
8. The training device of a speech recognition model according to claim 5, wherein the plurality of speech recognition submodels are trained by the following loss function,
where y_avg is the target translation, y_i is the prediction result vector of the i-th model, and n is the number of the speech recognition submodels.
9. A computer device, comprising a processor and a memory;
wherein the processor, by reading executable program code stored in the memory, runs a program corresponding to the executable program code, so as to implement the training method of a speech recognition model according to any one of claims 1-4.
10. A computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the training method of a speech recognition model according to any one of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910477604.8A CN110246486B (en) | 2019-06-03 | 2019-06-03 | Training method, device and equipment of voice recognition model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110246486A true CN110246486A (en) | 2019-09-17 |
CN110246486B CN110246486B (en) | 2021-07-13 |
Family
ID=67885808
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910477604.8A Active CN110246486B (en) | 2019-06-03 | 2019-06-03 | Training method, device and equipment of voice recognition model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110246486B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090138265A1 (en) * | 2007-11-26 | 2009-05-28 | Nuance Communications, Inc. | Joint Discriminative Training of Multiple Speech Recognizers |
CN103794214A (en) * | 2014-03-07 | 2014-05-14 | 联想(北京)有限公司 | Information processing method, device and electronic equipment |
US20150199960A1 (en) * | 2012-08-24 | 2015-07-16 | Microsoft Corporation | I-Vector Based Clustering Training Data in Speech Recognition |
CN108597502A (en) * | 2018-04-27 | 2018-09-28 | 上海适享文化传播有限公司 | Field speech recognition training method based on dual training |
CN109710727A (en) * | 2017-10-26 | 2019-05-03 | 哈曼国际工业有限公司 | System and method for natural language processing |
Non-Patent Citations (1)
Title |
---|
Zheng Xiaokang, "Research on Translation of Out-of-Vocabulary Words in Neural Network Translation Models for Chinese-English Patent Documents", China Master's Theses Full-Text Database, Information Science and Technology Series * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021198838A1 (en) * | 2020-04-03 | 2021-10-07 | International Business Machines Corporation | Training of model for processing sequence data |
GB2609157A (en) * | 2020-04-03 | 2023-01-25 | Ibm | Training of model for processing sequence data |
CN112885330A (en) * | 2021-01-26 | 2021-06-01 | 北京云上曲率科技有限公司 | Language identification method and system based on low-resource audio |
Also Published As
Publication number | Publication date |
---|---|
CN110246486B (en) | 2021-07-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Afouras et al. | Deep lip reading: a comparison of models and an online application | |
Afouras et al. | Asr is all you need: Cross-modal distillation for lip reading | |
CN110188202B (en) | Training method and device of semantic relation recognition model and terminal | |
JP7432556B2 (en) | Methods, devices, equipment and media for man-machine interaction | |
US11373049B2 (en) | Cross-lingual classification using multilingual neural machine translation | |
CN109887497A (en) | Modeling method, device and the equipment of speech recognition | |
CN110033760A (en) | Modeling method, device and the equipment of speech recognition | |
CN108985358B (en) | Emotion recognition method, device, equipment and storage medium | |
CN113205817B (en) | Speech semantic recognition method, system, device and medium | |
US10147438B2 (en) | Role modeling in call centers and work centers | |
CN108986793A (en) | translation processing method, device and equipment | |
Zimmermann et al. | Visual speech recognition using PCA networks and LSTMs in a tandem GMM-HMM system | |
US11900518B2 (en) | Interactive systems and methods | |
CN110175335A (en) | The training method and device of translation model | |
CN111816169B (en) | Method and device for training Chinese and English hybrid speech recognition model | |
CN110211570A (en) | Simultaneous interpretation processing method, device and equipment | |
CN110795549B (en) | Short text conversation method, device, equipment and storage medium | |
CN108460098A (en) | Information recommendation method, device and computer equipment | |
CN108984679A (en) | Dialogue generates the training method and device of model | |
US20180365208A1 (en) | Method for modifying segmentation model based on artificial intelligence, device and storage medium | |
Yu et al. | A Multistage Training Framework for Acoustic-to-Word Model. | |
CN113535957A (en) | Conversation emotion recognition network model based on dual knowledge interaction and multitask learning, construction method, electronic device and storage medium | |
CN112016271A (en) | Language style conversion model training method, text processing method and device | |
CN110209786A (en) | It is display methods, device, computer equipment and the storage medium of non-class answer | |
KR20230158613A (en) | Self-adaptive distillation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||