CN110379415A - Training method for a domain-adaptive acoustic model - Google Patents
- Publication number: CN110379415A (application CN201910670390.6A)
- Authority
- CN
- China
- Prior art keywords
- voice data
- acoustic model
- field
- data
- designated
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G — PHYSICS
- G10 — MUSICAL INSTRUMENTS; ACOUSTICS
- G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00 — Speech recognition
- G10L15/06 — Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063 — Training
- G10L15/08 — Speech classification or search
- G10L15/14 — Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/142 — Hidden Markov Models [HMMs]
- G10L15/144 — Training of HMMs
- G10L15/16 — Speech classification or search using artificial neural networks
Abstract
Embodiments of the present disclosure provide a training method, apparatus, readable storage medium, and computing device for a domain-adaptive acoustic model, for building an acoustic model with strong recognition performance in a designated field. The method includes: obtaining voice data of the near fields of multiple designated fields, together with the text data corresponding to that voice data; training an acoustic model on the near-field voice data and corresponding text data to obtain a generic acoustic model; obtaining voice data of the designated field and the text data corresponding to that voice data; and training the generic acoustic model on the designated-field voice data and corresponding text data to obtain the domain-adaptive acoustic model.
Description
Technical field
This disclosure relates to the field of speech processing, and in particular to a training method, apparatus, readable storage medium, and computing device for a domain-adaptive acoustic model.
Background technique
Automatic speech recognition (ASR) has made considerable progress in the past few years; in certain scenarios, such as intelligent voice assistants, the performance of state-of-the-art recognition systems approaches that of humans. In telephone customer-service scenarios, however, the low sampling rate, strong channel interference, and insufficient training data mean that the overall recognition rate reaches only 50%-70% of that of 16 kHz speech recognition. Moreover, in commercial customer-service applications, users often care about recognition accuracy in one particular field; when the training data of the recognition system does not match the application field, performance drops significantly, so generic telephone customer-service recognition systems are often unusable in those fields.
Summary of the invention
To this end, the present disclosure provides a training method, apparatus, readable storage medium, and computing device for a domain-adaptive acoustic model, which seeks to solve, or at least alleviate, at least one of the problems above.
According to one aspect of the embodiments of the present disclosure, a training method for a domain-adaptive acoustic model is provided, including:
obtaining voice data of the near fields of multiple designated fields and the text data corresponding to that voice data;
training an acoustic model on the near-field voice data and corresponding text data of the multiple designated fields to obtain a generic acoustic model;
obtaining voice data of a designated field and the text data corresponding to that voice data;
training the generic acoustic model on the designated-field voice data and corresponding text data to obtain a domain-adaptive acoustic model.
Optionally, training the generic acoustic model on the designated-field voice data and its corresponding text data to obtain the domain-adaptive acoustic model includes:
training the generic acoustic model on the designated-field voice data and its corresponding text data, together with near-field voice data of the designated field and its corresponding text data, to obtain the domain-adaptive acoustic model; where the ratio of the amount of near-field voice data to the amount of designated-field voice data satisfies a preset condition.
Optionally, training the generic acoustic model to obtain the domain-adaptive acoustic model includes:
training the generic acoustic model with multiple training methods;
comparing the recognition rates that the resulting acoustic models achieve on designated-field speech, and selecting the model with the highest recognition rate as the domain-adaptive acoustic model.
Optionally, when the training method of the generic acoustic model is the same as that of the domain-adaptive acoustic model, the number of training epochs for the domain-adaptive acoustic model is lower than that for the generic acoustic model.
Optionally, the multiple training methods include at least: the sequence cross-entropy objective and the state-level minimum Bayes risk (sMBR) criterion.
Optionally, training the generic acoustic model on the designated-field voice data and its corresponding text data, together with the near-field voice data and its corresponding text data, to obtain the domain-adaptive acoustic model includes:
using the generic acoustic model to perform phoneme alignment on the designated-field voice data and text data and on the near-field voice data and text data, and to generate word graphs;
training the generic acoustic model on the phoneme alignments and word graphs to obtain the domain-adaptive acoustic model.
Optionally, the preset condition includes: the ratio of the amount of near-field voice data to the amount of designated-field voice data is between 1 and 2.
Optionally, the method for training an acoustic model includes:
training a phoneme model on the voice data, and clustering the phoneme model with a decision tree;
performing phoneme alignment on the voice data using the phoneme model and the decision tree;
generating word graphs from the phoneme alignments and the corresponding text data;
training the acoustic model on the phoneme alignments and word graphs.
Optionally, the voice data includes telephone customer-service voice data; the near-field voice data of the multiple designated fields includes telephone customer-service voice data from multiple industries; and the designated-field voice data includes telephone customer-service voice data from a designated industry.
According to another aspect of the embodiments of the present disclosure, a training apparatus for a domain-adaptive acoustic model is provided, including:
a first data acquisition unit, for obtaining voice data of the near fields of multiple designated fields and the text data corresponding to that voice data;
a generic acoustic model training unit, for training an acoustic model on the near-field voice data and corresponding text data of the multiple designated fields to obtain a generic acoustic model;
a second data acquisition unit, for obtaining voice data of a designated field and the text data corresponding to that voice data;
a designated-field acoustic model training unit, for training the generic acoustic model on the designated-field voice data and corresponding text data to obtain a domain-adaptive acoustic model.
Optionally, the designated-field acoustic model training unit is specifically configured to: train the generic acoustic model on the designated-field voice data and its corresponding text data, together with near-field voice data of the designated field and its corresponding text data, to obtain the domain-adaptive acoustic model; where the ratio of the amount of near-field voice data to the amount of designated-field voice data satisfies a preset condition.
Optionally, when training the generic acoustic model to obtain the domain-adaptive acoustic model, the designated-field acoustic model training unit is specifically configured to: train the generic acoustic model with multiple training methods; compare the recognition rates that the resulting acoustic models achieve on designated-field speech; and select the model with the highest recognition rate as the domain-adaptive acoustic model.
Optionally, when the training method of the generic acoustic model is the same as that of the domain-adaptive acoustic model, the number of training epochs for the domain-adaptive acoustic model is lower than that for the generic acoustic model.
Optionally, the multiple training methods include at least: the sequence cross-entropy objective and the state-level minimum Bayes risk (sMBR) criterion.
Optionally, the designated-field acoustic model training unit is specifically configured to: use the generic acoustic model to perform phoneme alignment on the designated-field voice data and text data and on the near-field voice data and text data, and to generate word graphs; and train the generic acoustic model on the phoneme alignments and word graphs to obtain the domain-adaptive acoustic model.
Optionally, the preset condition includes: the ratio of the amount of near-field voice data to the amount of designated-field voice data is between 1 and 2.
Optionally, when training an acoustic model, the generic acoustic model training unit or the designated-field acoustic model training unit is specifically configured to:
train a phoneme model on the voice data, and cluster the phoneme model with a decision tree;
perform phoneme alignment on the voice data using the phoneme model and the decision tree;
generate word graphs from the phoneme alignments and the corresponding text data;
train the acoustic model on the phoneme alignments and word graphs.
Optionally, the voice data includes telephone customer-service voice data; the near-field voice data of the multiple designated fields includes telephone customer-service voice data from multiple industries; and the designated-field voice data includes telephone customer-service voice data from a designated industry.
According to another aspect of the embodiments of the present disclosure, a readable storage medium is provided, carrying executable instructions which, when executed, cause a computer to perform the operations of the above method.
According to another aspect of the embodiments of the present disclosure, a computing device is provided, including a processor and a memory storing executable instructions which, when executed, cause the processor to perform the operations of the above method.
In the technical solution provided by the embodiments of the present disclosure, the domain-adaptive acoustic model is trained on top of an already trained generic acoustic model, achieving strong acoustic-model recognition performance in the designated field.
Detailed description of the invention
The accompanying drawings illustrate exemplary embodiments of the disclosure and, together with the description, serve to explain its principles. They are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification.
Fig. 1 is a structural block diagram of an exemplary computing device;
Fig. 2 is a flowchart of a training method for a domain-adaptive acoustic model according to an embodiment of the present disclosure;
Fig. 3 is a flowchart of an acoustic model training method according to an embodiment of the present disclosure;
Fig. 4 is another flowchart of a training method for a domain-adaptive acoustic model according to an embodiment of the present disclosure;
Fig. 5 is a structural diagram of a training apparatus for a domain-adaptive acoustic model according to an embodiment of the present disclosure.
Specific embodiment
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be more thoroughly understood, and so that its scope may be fully conveyed to those skilled in the art.
Fig. 1 is a block diagram of an example computing device 100 arranged to implement the training method for a domain-adaptive acoustic model according to the disclosure. In a basic configuration 102, the computing device 100 typically includes a system memory 106 and one or more processors 104. A memory bus 108 may be used for communication between the processors 104 and the system memory 106.
Depending on the desired configuration, the processor 104 may be any type of processor, including but not limited to a microprocessor (μP), microcontroller (μC), digital signal processor (DSP), or any combination thereof. The processor 104 may include one or more levels of cache, such as a level-1 cache 110 and a level-2 cache 112, a processor core 114, and registers 116. An example processor core 114 may include an arithmetic logic unit (ALU), a floating-point unit (FPU), a digital signal processing core (DSP core), or any combination thereof. An example memory controller 118 may be used with the processor 104, or in some implementations the memory controller 118 may be an internal part of the processor 104.
Depending on the desired configuration, the system memory 106 may be any type of memory, including but not limited to volatile memory (such as RAM), non-volatile memory (such as ROM or flash memory), or any combination thereof. The system memory 106 may include an operating system 120, one or more programs 122, and program data 124. In some embodiments, the programs 122 may be configured to execute instructions on the operating system, using the program data 124, by means of the one or more processors 104.
The computing device 100 may also include an interface bus 140 that facilitates communication from various interface devices (for example, output devices 142, peripheral interfaces 144, and communication devices 146) to the basic configuration 102 via a bus/interface controller 130. Example output devices 142 include a graphics processing unit 148 and an audio processing unit 150, which may be configured to communicate with various external devices, such as a display terminal or loudspeaker, via one or more A/V ports 152. Example peripheral interfaces 144 may include a serial interface controller 154 and a parallel interface controller 156, which may be configured to communicate via one or more I/O ports 158 with external devices such as input devices (for example, a keyboard, mouse, pen, voice input device, or touch input device) or other peripherals (such as printers or scanners). An example communication device 146 may include a network controller 160, which may be arranged to communicate with one or more other computing devices 162 over a network communication link via one or more communication ports 164.
A network communication link may be one example of a communication medium. Communication media may typically be embodied as computer-readable instructions, data structures, or program modules in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery medium. A "modulated data signal" may be a signal in which one or more of its characteristics are set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media, such as a wired or dedicated-line network, and wireless media, such as acoustic, radio frequency (RF), microwave, infrared (IR), or other wireless media. The term computer-readable medium as used here may include both storage media and communication media.
The computing device 100 may be implemented as part of a small-form-factor portable (or mobile) electronic device, such as a cellular phone, personal digital assistant (PDA), personal media player, wireless web-browsing device, personal headset device, application-specific device, or a hybrid device including any of the above functions. The computing device 100 may also be implemented as a personal computer in a desktop or notebook configuration.
The one or more programs 122 of the computing device 100 include instructions for performing the training method for a domain-adaptive acoustic model according to the disclosure.
Fig. 2 illustrates a flowchart of a training method 200 for a domain-adaptive acoustic model according to an embodiment of the present disclosure. The method 200 starts at step S210.
In step S210, voice data of the near fields of multiple designated fields and the text data corresponding to that voice data are obtained.
According to embodiments of the present disclosure, the near-field voice data of the multiple designated fields may or may not include voice data of the designated field itself.
According to embodiments of the present disclosure, the voice data may be telephone customer-service voice data: the near-field voice data of the multiple designated fields may be telephone customer-service voice data from the customer-service departments of multiple industries, and the designated-field voice data may be telephone customer-service voice data from the customer-service department of one designated industry.
According to embodiments of the present disclosure, voice data and text data should correspond one to one.
Then, in step S220, an acoustic model is trained on the near-field voice data of the multiple designated fields and the corresponding text data, yielding a generic acoustic model.
According to embodiments of the present disclosure, the generic acoustic model is trained specifically by training a time delay neural network (TDNN) model with a sequence cross-entropy objective. An existing generic acoustic model training result can be used directly in this step to reduce the demand for computing resources.
Then, in step S230, voice data of the designated field and the corresponding text data are obtained.
Then, in step S240, the generic acoustic model is trained on the designated-field voice data and the corresponding text data, yielding the domain-adaptive acoustic model.
Because telephone customer-service voice data has a low sampling rate, strong channel interference, and insufficient training data, the recognition rate of existing training results is low. With the training method for a domain-adaptive acoustic model provided by the disclosure, existing telephone customer-service voice data from many industries can be fully exploited: on the basis of the trained generic acoustic model, further training is carried out for a particular industry to obtain a domain-adaptive acoustic model, so that the recognition rate on that industry's telephone customer-service voice data can reach a higher level.
Specifically, step S240 includes: training the generic acoustic model on the designated-field voice data and its corresponding text data, together with near-field voice data of the designated field and its corresponding text data, to obtain the domain-adaptive acoustic model; the ratio of the amount of near-field voice data to the amount of designated-field voice data should satisfy a preset condition.
According to embodiments of the present disclosure, adding a small amount of near-field voice data and text data to the training process of the domain-adaptive acoustic model improves its generalization ability, yielding a better-performing domain-adaptive acoustic model.
For example, the ratio of the amount of near-field voice data to the amount of designated-field voice data may be between 1 and 2. If the ratio is too low, the generalization ability of the domain-adaptive acoustic model decreases; if it is too high, the model's ability to recognize designated-field voice data decreases.
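As a rough illustration of the data-selection rule above, the adaptation set can mix all designated-field data with a near-field sample whose size stays within 1-2 times the designated-field amount. The function and its names are hypothetical sketches, not part of the patent:

```python
import random

def select_adaptation_data(in_domain, near_field, ratio=1.5, seed=0):
    """Mix designated-field utterances with sampled near-field utterances.

    `ratio` is the near-field / in-domain size ratio; the disclosure
    suggests keeping it between 1 and 2.
    """
    assert 1.0 <= ratio <= 2.0, "ratio outside the suggested 1-2 range"
    rng = random.Random(seed)
    k = min(len(near_field), int(ratio * len(in_domain)))
    # Keep all in-domain data; subsample the (usually larger) near-field pool.
    return in_domain + rng.sample(near_field, k)

# 100 in-domain utterances + 150 sampled near-field utterances -> 250 total.
train_set = select_adaptation_data(list(range(100)), list(range(100, 1100)))
```

A ratio of 1.5 sits in the middle of the suggested range; in practice it would be tuned against the field-specific evaluation set.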
Further, in step S240, training the generic acoustic model to obtain the domain-adaptive acoustic model includes:
training the generic acoustic model with multiple training methods;
comparing the recognition rates that the resulting acoustic models achieve on designated-field speech, and selecting the model with the highest recognition rate as the domain-adaptive acoustic model.
According to embodiments of the present disclosure, training the generic acoustic model with multiple training methods and selecting the best-verified training result as the domain-adaptive acoustic model improves the speech recognition performance of the domain-adaptive acoustic model.
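The select-the-best step can be sketched as follows; `eval_fn` stands in for decoding the designated-field evaluation set and measuring the recognition rate (the names are illustrative, not from the patent):

```python
def pick_best_model(models, eval_fn):
    """models: mapping of training-method name -> trained model.
    eval_fn(model) -> recognition rate on designated-field speech
    (higher is better). Returns the winning method name and model."""
    best = max(models, key=lambda name: eval_fn(models[name]))
    return best, models[best]

# Toy stand-in: each "model" is represented by its measured recognition rate.
method, model = pick_best_model({"cross_entropy": 0.78, "sMBR": 0.81},
                                eval_fn=lambda m: m)
```

In a real system `eval_fn` would run recognition on a held-out, scenario-specific test set, as the embodiment later notes.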
Further, when the training method of the generic acoustic model is the same as that of the domain-adaptive acoustic model, the number of training epochs for the domain-adaptive acoustic model is lower than that for the generic acoustic model.
For example, generic acoustic model training with the sequence cross-entropy objective requires training a TDNN model for 4-6 epochs; if the domain-adaptive acoustic model is likewise trained with the sequence cross-entropy objective, a single epoch is enough to reach the expected effect, saving computing resources. Alternatively, the acoustic model may be trained with the state-level minimum Bayes risk (sMBR) criterion, which consumes more computing resources.
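One way to express the epoch budgets above in code (a sketch; the numbers come from the examples in this paragraph and the embodiment later in the description):

```python
def adaptation_epochs(method, general_epochs=5):
    """Epoch budget for the domain-adaptation stage.

    When the adaptation objective matches the generic model's objective
    (sequence cross-entropy), a single pass suffices; the sMBR criterion
    needs roughly the same 4-6 passes as generic training.
    """
    if method == "cross_entropy":
        return 1
    if method == "smbr":
        return general_epochs  # typically 4-6
    raise ValueError(f"unknown training method: {method}")
```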
Further, step S240 includes:
using the generic acoustic model to perform phoneme alignment on the designated-field voice data and its corresponding text data, and on the near-field voice data and its corresponding text data, and to generate word graphs;
training the generic acoustic model on the phoneme alignments and word graphs to obtain the domain-adaptive acoustic model.
According to embodiments of the present disclosure, if the training data of the generic acoustic model does not include the training data of the domain-adaptive acoustic model, the generic acoustic model can directly process the designated-field voice data and its corresponding text data, reducing the manual labeling of speech and text and improving data-processing efficiency.
Optionally, as shown in Fig. 3, acoustic model training may use the following steps:
S310: train a phoneme model on the voice data, and cluster the phoneme model with a decision tree;
S320: perform phoneme alignment on the voice data using the phoneme model and the decision tree;
S330: generate word graphs from the phoneme alignments and the corresponding text data;
S340: train the acoustic model on the phoneme alignments and word graphs.
The above steps can be used to train both the generic acoustic model and the domain-adaptive acoustic model.
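Steps S310-S340 can be traced as one pipeline. Real systems implement each stage with dedicated tools (separate GMM training, tree building, alignment, and lattice-generation programs); the dictionaries below are hypothetical placeholders that only show the data flow between stages:

```python
def train_acoustic_model(speech_data, text_data):
    # S310: train a phoneme (e.g. triphone) model, then tie similar
    # phonemes with decision-tree clustering to shrink the phoneme space.
    phoneme_model = {"units": "triphones", "n_utts": len(speech_data)}
    tree = {"clustering": "decision tree over similar triphones"}

    # S320: phoneme-align every utterance with the model and the tree.
    alignments = [{"utt": u, "units": phoneme_model["units"]}
                  for u in speech_data]

    # S330: build one word graph per utterance from alignment + transcript.
    word_graphs = [{"utt": u, "text": t}
                   for u, t in zip(speech_data, text_data)]

    # S340: the acoustic model is trained on alignments and word graphs.
    return {"n_alignments": len(alignments),
            "n_graphs": len(word_graphs),
            "tree": tree}

model = train_acoustic_model(["utt1", "utt2", "utt3"], ["hi", "ok", "bye"])
```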
According to embodiments of the present disclosure, different speech recognition systems may be based on different acoustic features, such as Mel-frequency cepstral coefficient (MFCC) features or perceptual linear prediction (PLP) features, or may use different acoustic models, such as a hidden Markov model-Gaussian mixture model (HMM-GMM) or a neural-network acoustic model based on a dynamic Bayesian network (DBN).
Fig. 4 is another flowchart of the training method for a domain-adaptive acoustic model provided by an embodiment of the present disclosure. A specific embodiment of the disclosure is described below with reference to Fig. 4, using thousands of hours of 8 kHz telephone customer-service voice data from multiple scenarios, together with the corresponding annotated text, as training data; the data sources include common telephone customer-service scenarios in markets such as automobile sales, financial sales, and educational consulting.
In this specific embodiment, universal model training includes the following steps:
Step a, triphone HMM-GMM model training: the system first trains a triphone HMM-GMM model on half of the full data, and uses decision-tree clustering to build a decision tree that ties similar triphones, reducing the phoneme space. Using half rather than all of the data is mainly a computing-resource consideration; if computing resources are sufficient, the full data can be used to train the HMM-GMM model.
Step b, TDNN model training:
Sub-step b1: using the HMM-GMM model and decision tree generated in step a, perform phoneme alignment on the full data and generate the corresponding word graphs;
Sub-step b2: using the alignments and word graphs generated in sub-step b1, train the TDNN model with the sequence cross-entropy objective; depending on available computing resources, train for 4-6 epochs over all the data.
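The two-stage universal-training schedule of steps a and b might be captured in a configuration such as the following; the structure and field names are a hypothetical sketch, not part of the patent:

```python
universal_training = {
    # Step a: triphone HMM-GMM on half the data, decision-tree clustering.
    "hmm_gmm": {"data_fraction": 0.5, "clustering": "decision_tree"},
    # Step b: TDNN on the full data, sequence cross-entropy, 4-6 epochs.
    "tdnn": {"data_fraction": 1.0, "objective": "cross_entropy",
             "epochs": (4, 6)},
}

def stage_data(corpus, stage, config=universal_training):
    """Return the slice of the corpus a given training stage uses."""
    n = int(config[stage]["data_fraction"] * len(corpus))
    return corpus[:n]

half = stage_data(list(range(10)), "hmm_gmm")  # 5 of 10 utterances
```

Raising `data_fraction` for the HMM-GMM stage to 1.0 corresponds to the full-data option mentioned in step a.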
In this specific embodiment, domain-adaptive acoustic model training includes the following steps:
Step a, data selection: the designated-field data, plus 1 to 2 times that amount of other near-field data.
Step b, data preparation: using the trained TDNN universal model, align the data selected in step a and generate the corresponding word graphs.
Step c, domain model training: use the trained TDNN universal model parameters as initial values, and further learn the parameters with the alignments and word graphs from step b.
In this specific embodiment, two training methods are used for the domain-adaptive acoustic model:
Method one: on the basis of the universal TDNN model, continue training on the data from step b with the sequence cross-entropy objective. This method generally needs only 1 round over all the data.
Method two: on the basis of the universal TDNN model, train on the data from step b with the sMBR criterion. This method needs 4-6 training rounds.
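Both methods share the same shape: initialize from the universal model's parameters and continue training on domain data only, with method one using fewer passes. The sketch below shows that fine-tuning loop on a toy scalar model; the real objective would be sequence cross-entropy or sMBR rather than the squared error used here, and all names and data are illustrative.

```python
import copy

def fine_tune(universal_params, domain_batches, lr=0.1, epochs=1):
    """Continue training from the universal parameters on domain data only.
    epochs=1 mirrors method one; method two would use 4-6 rounds."""
    p = copy.deepcopy(universal_params)    # universal model as initialization value
    for _ in range(epochs):
        for x, y in domain_batches:
            err = p["w"] * x + p["b"] - y  # toy prediction error
            p["w"] -= lr * err * x         # squared-error gradient step
            p["b"] -= lr * err
    return p

universal = {"w": 0.0, "b": 0.0}
domain = [(1.0, 2.0), (2.0, 4.0)]          # made-up domain data following y = 2x
adapted = fine_tune(universal, domain, epochs=1)
```

Even one round moves the parameters toward the domain data, which is why the adapted model needs fewer rounds than the universal model did.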
Both the training-data ratio in step a and the choice of training method in step c affect final model performance; different scenarios require constructing scenario-relevant evaluation data sets, and the appropriate model is selected according to the recognition-rate results.
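Model selection across the candidate configurations then reduces to comparing recognition rates on the scene-specific evaluation set. The configuration names and accuracy numbers below are invented for illustration.

```python
def select_best_model(candidates, evaluate):
    """Return the candidate whose evaluation score (recognition rate on the
    scene-specific evaluation set) is highest."""
    return max(candidates, key=evaluate)

# Hypothetical recognition rates for three trained variants.
accuracies = {"ce_1round": 0.91, "smbr_4rounds": 0.93, "smbr_6rounds": 0.92}
best = select_best_model(accuracies, accuracies.get)
print(best)  # smbr_4rounds
```

In practice `evaluate` would decode the evaluation set with each candidate and score the transcripts.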
The technical solution provided by the present disclosure trains a generic acoustic model on large-scale data and then trains a domain-adaptive acoustic model with a small amount of designated-field voice data and a small amount of near-field voice data, thereby providing each field with a targeted domain-adaptive acoustic model and improving acoustic model performance in that field.
Referring to Fig. 5, an embodiment of the present disclosure provides a training device for a domain-adaptive acoustic model, comprising:
a first data acquisition unit 510, configured to obtain the voice data of the near field of multiple designated fields and the text data corresponding to the voice data of the near field of the multiple designated fields;
a generic acoustic model training unit 520, configured to train an acoustic model according to the voice data of the near field of the multiple designated fields and the text data corresponding to that voice data, to obtain a generic acoustic model;
a second data acquisition unit 530, configured to obtain the voice data of a designated field and the text data corresponding to the voice data of the designated field;
a designated-field acoustic model training unit 540, configured to train the generic acoustic model according to the voice data of the designated field and the text data corresponding to that voice data, to obtain a domain-adaptive acoustic model.
Optionally, the designated-field acoustic model training unit 540 is specifically configured to:
train the generic acoustic model according to the voice data of the designated field and its corresponding text data, together with the voice data of the near field of the designated field and its corresponding text data, to obtain a domain-adaptive acoustic model; wherein the ratio of the amount of near-field voice data of the designated field to the amount of voice data of the designated field should meet a preset condition.
Optionally, when training the generic acoustic model to obtain a domain-adaptive acoustic model, the designated-field acoustic model training unit 540 is specifically configured to:
train the generic acoustic model using multiple training methods;
compare the recognition rates that the acoustic models trained by the multiple methods achieve on the voice of the designated field, and take the acoustic model with the highest recognition rate on the voice of the designated field as the domain-adaptive acoustic model.
Optionally, when the training method of the generic acoustic model is the same as the training method of the domain-adaptive acoustic model, the number of training rounds of the domain-adaptive acoustic model is lower than the number of training rounds of the generic acoustic model.
Optionally, the designated-field acoustic model training unit 540 is specifically configured to:
according to the generic acoustic model, perform phoneme alignment processing on the voice data of the designated field and its corresponding text data, and on the voice data of the near field of the designated field and its corresponding text data, and generate word graphs;
train the generic acoustic model according to the phoneme alignment results and the word graphs, to obtain the domain-adaptive acoustic model.
Optionally, the preset condition includes:
the ratio of the amount of near-field voice data of the designated field to the amount of voice data of the designated field is 1 to 2.
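A data-selection helper enforcing this 1-2x preset condition might look like the following sketch; the function name and the list-of-utterances representation are assumptions made for illustration, not part of the disclosed device.

```python
def mix_training_data(domain_utts, near_field_utts, ratio=2.0):
    """Keep every designated-field utterance and at most `ratio` times as
    many near-field utterances, per the 1-2x preset condition."""
    n_near = min(len(near_field_utts), int(ratio * len(domain_utts)))
    return list(domain_utts) + list(near_field_utts[:n_near])

mixed = mix_training_data(["d1", "d2", "d3"], [f"n{i}" for i in range(10)])
print(len(mixed))  # 9: 3 domain utterances + 6 near-field utterances
```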
Optionally, when training an acoustic model, the generic acoustic model training unit 520 or the designated-field acoustic model training unit 540 is specifically configured to:
train a phoneme model according to the voice data, and perform clustering processing on the phoneme model using a decision tree;
perform the phoneme alignment operation on the voice data according to the phoneme model and the decision tree;
generate word graphs according to the phoneme alignment results and the text data corresponding to the voice data;
train the acoustic model according to the phoneme alignment results and the word graphs.
Optionally, the voice data includes:
telephone customer-service voice data;
the voice data of the near field of the multiple designated fields includes:
the telephone customer-service voice data of multiple industries;
the voice data of the designated field includes:
the telephone customer-service voice data of a designated industry.
For the specific limitations of the training device for the domain-adaptive acoustic model, reference may be made to the limitations of the training method for the domain-adaptive acoustic model above; details are not repeated here.
It should be appreciated that the various techniques described herein may be implemented in connection with hardware or software, or a combination thereof. Thus, the disclosed methods and devices, or certain aspects or portions thereof, may take the form of program code (that is, instructions) embodied in tangible media such as floppy disks, CD-ROMs, hard drives, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine such as a computer, the machine becomes a device for practicing the disclosure.
Where the program code executes on programmable computers, the computing device generally includes a processor, a processor-readable storage medium (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. The memory is configured to store the program code; the processor is configured to execute the various methods of the disclosure according to the instructions in the program code stored in the memory.
By way of example and not limitation, computer-readable media include computer storage media and communication media. Computer storage media store information such as computer-readable instructions, data structures, program modules, or other data. Communication media generally embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media. Combinations of any of the above are also included within the scope of computer-readable media.
It should be appreciated that, to simplify the disclosure and aid understanding of one or more of the various disclosed aspects, in the above description of exemplary embodiments of the disclosure, features of the disclosure are sometimes grouped together into a single embodiment, figure, or description thereof. However, the disclosed method should not be interpreted as reflecting an intention that the claimed disclosure requires more features than are expressly recited in each claim. Rather, as the following claims reflect, the disclosed aspects lie in less than all features of a single disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into that detailed description, with each claim standing on its own as a separate embodiment of the disclosure.
Those skilled in the art should understand that the modules, units, or components of the devices in the examples disclosed herein may be arranged in a device as described in the embodiments, or alternatively may be located in one or more devices different from the devices in the examples. The modules in the foregoing examples may be combined into one module or further divided into multiple sub-modules.
Those skilled in the art will understand that the modules in the devices of the embodiments may be adaptively changed and arranged in one or more devices different from those of the embodiments. The modules, units, or components of the embodiments may be combined into one module, unit, or component, and may furthermore be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, equivalent, or similar purpose.
In addition, those skilled in the art will appreciate that although some embodiments described herein include certain features that are included in other embodiments but not others, combinations of features of different embodiments are meant to be within the scope of the disclosure and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
In addition, some of the embodiments are described herein as methods, or combinations of method elements, that can be implemented by a processor of a computer system or by other devices carrying out the described functions. Thus, a processor having the necessary instructions for implementing such a method or method element forms a device for implementing the method or method element. Furthermore, an element of a device embodiment described herein is an example of a device for carrying out the function performed by the element for the purpose of implementing the disclosure.
As used herein, unless otherwise specified, the use of the ordinals "first", "second", "third", etc. to describe ordinary objects merely indicates that different instances of similar objects are being referred to, and is not intended to imply that the objects so described must be in a given order, whether temporally, spatially, in ranking, or in any other manner.
Although the disclosure has been described with respect to a limited number of embodiments, those skilled in the art, having the benefit of the above description, will appreciate that other embodiments can be envisioned within the scope of the disclosure thus described. Additionally, it should be noted that the language used in this specification has been chosen primarily for readability and instructional purposes, rather than to delineate or circumscribe the subject matter of the disclosure. Therefore, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. As for the scope of the disclosure, the disclosure made herein is illustrative rather than restrictive, and the scope of the disclosure is defined by the appended claims.
Claims (10)
1. A training method of a domain-adaptive acoustic model, characterized by comprising:
obtaining voice data of the near field of multiple designated fields and text data corresponding to the voice data of the near field of the multiple designated fields;
training an acoustic model according to the voice data of the near field of the multiple designated fields and the text data corresponding to the voice data of the near field of the multiple designated fields, to obtain a generic acoustic model;
obtaining voice data of a designated field and text data corresponding to the voice data of the designated field;
training the generic acoustic model according to the voice data of the designated field and the text data corresponding to the voice data of the designated field, to obtain a domain-adaptive acoustic model.
2. The method according to claim 1, characterized in that training the generic acoustic model according to the voice data of the designated field and the text data corresponding to the voice data of the designated field, to obtain a domain-adaptive acoustic model, comprises:
training the generic acoustic model according to the voice data of the designated field and the text data corresponding to the voice data of the designated field, and the voice data of the near field of the designated field and the text data corresponding to the voice data of the near field of the designated field, to obtain a domain-adaptive acoustic model; wherein the ratio of the amount of voice data of the near field of the designated field to the amount of voice data of the designated field should meet a preset condition.
3. The method according to claim 2, characterized in that training the generic acoustic model according to the voice data of the designated field and the text data corresponding to the voice data of the designated field, and the voice data of the near field of the designated field and the text data corresponding to the voice data of the near field of the designated field, to obtain a domain-adaptive acoustic model, comprises:
according to the generic acoustic model, performing phoneme alignment processing on the voice data of the designated field and the text data corresponding to the voice data of the designated field, and on the voice data of the near field of the designated field and the text data corresponding to the voice data of the near field of the designated field, and generating word graphs;
training the generic acoustic model according to the phoneme alignment results and the word graphs, to obtain the domain-adaptive acoustic model.
4. The method according to claim 2 or 3, characterized in that training the generic acoustic model to obtain a domain-adaptive acoustic model comprises:
training the generic acoustic model using multiple training methods;
comparing the recognition rates that the acoustic models trained by the multiple training methods achieve on the voice of the designated field, and taking the acoustic model with the highest recognition rate on the voice of the designated field as the domain-adaptive acoustic model.
5. The method according to claim 4, characterized in that when the training method of the generic acoustic model is the same as the training method of the domain-adaptive acoustic model, the number of training rounds of the domain-adaptive acoustic model is lower than the number of training rounds of the generic acoustic model.
6. The method according to claim 2, characterized in that the preset condition comprises:
the ratio of the amount of voice data of the near field of the designated field to the amount of voice data of the designated field is 1 to 2.
7. The method according to claim 1, characterized in that the voice data comprises:
telephone customer-service voice data;
the voice data of the near field of the multiple designated fields comprises:
the telephone customer-service voice data of multiple industries;
the voice data of the designated field comprises:
the telephone customer-service voice data of a designated industry.
8. A training device of a domain-adaptive acoustic model, characterized by comprising:
a first data acquisition unit, configured to obtain voice data of the near field of multiple designated fields and text data corresponding to the voice data of the near field of the multiple designated fields;
a generic acoustic model training unit, configured to train an acoustic model according to the voice data of the near field of the multiple designated fields and the text data corresponding to the voice data of the near field of the multiple designated fields, to obtain a generic acoustic model;
a second data acquisition unit, configured to obtain voice data of a designated field and text data corresponding to the voice data of the designated field;
a designated-field acoustic model training unit, configured to train the generic acoustic model according to the voice data of the designated field and the text data corresponding to the voice data of the designated field, to obtain a domain-adaptive acoustic model.
9. A readable storage medium having executable instructions thereon, wherein, when the executable instructions are executed, a computer performs the operations included in any one of claims 1-7.
10. A computing device, comprising:
a processor; and
a memory storing executable instructions that, when executed, cause the processor to perform the operations included in any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910670390.6A CN110379415B (en) | 2019-07-24 | 2019-07-24 | Training method of domain adaptive acoustic model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910670390.6A CN110379415B (en) | 2019-07-24 | 2019-07-24 | Training method of domain adaptive acoustic model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110379415A true CN110379415A (en) | 2019-10-25 |
CN110379415B CN110379415B (en) | 2022-02-18 |
Family
ID=68255440
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910670390.6A Active CN110379415B (en) | 2019-07-24 | 2019-07-24 | Training method of domain adaptive acoustic model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110379415B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111243574A (en) * | 2020-01-13 | 2020-06-05 | 苏州奇梦者网络科技有限公司 | Voice model adaptive training method, system, device and storage medium |
CN111477211A (en) * | 2020-04-17 | 2020-07-31 | 珠海声原智能科技有限公司 | Cross-scene fast-adaptation voice recognition method and device |
CN111508479A (en) * | 2020-04-16 | 2020-08-07 | 重庆农村商业银行股份有限公司 | Voice recognition method, device, equipment and storage medium |
CN111613209A (en) * | 2020-04-14 | 2020-09-01 | 北京三快在线科技有限公司 | Acoustic model training method and device, electronic equipment and storage medium |
CN112466294A (en) * | 2020-11-24 | 2021-03-09 | 北京百度网讯科技有限公司 | Acoustic model generation method and device and electronic equipment |
CN112596868A (en) * | 2020-11-27 | 2021-04-02 | 出门问问(武汉)信息科技有限公司 | Model training method and device |
CN113327587A (en) * | 2021-06-02 | 2021-08-31 | 云知声(上海)智能科技有限公司 | Method and device for voice recognition in specific scene, electronic equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101923854A (en) * | 2010-08-31 | 2010-12-22 | 中国科学院计算技术研究所 | Interactive speech recognition system and method |
CN102280106A (en) * | 2010-06-12 | 2011-12-14 | 三星电子株式会社 | VWS method and apparatus used for mobile communication terminal |
JP2016102820A (en) * | 2014-11-27 | 2016-06-02 | インターナショナル・ビジネス・マシーンズ・コーポレーションInternational Business Machines Corporation | Method for improving acoustic model, and computer for improving acoustic model and computer program therefor |
CN107154260A (en) * | 2017-04-11 | 2017-09-12 | 北京智能管家科技有限公司 | A kind of domain-adaptive audio recognition method and device |
US20190066662A1 (en) * | 2017-08-25 | 2019-02-28 | International Business Machines Corporation | Priors adaptation for conservative training of acoustic model |
2019-07-24: Application CN201910670390.6A filed; granted as CN110379415B (Active)
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102280106A (en) * | 2010-06-12 | 2011-12-14 | 三星电子株式会社 | VWS method and apparatus used for mobile communication terminal |
CN101923854A (en) * | 2010-08-31 | 2010-12-22 | 中国科学院计算技术研究所 | Interactive speech recognition system and method |
JP2016102820A (en) * | 2014-11-27 | 2016-06-02 | インターナショナル・ビジネス・マシーンズ・コーポレーションInternational Business Machines Corporation | Method for improving acoustic model, and computer for improving acoustic model and computer program therefor |
CN107154260A (en) * | 2017-04-11 | 2017-09-12 | 北京智能管家科技有限公司 | A kind of domain-adaptive audio recognition method and device |
US20190066662A1 (en) * | 2017-08-25 | 2019-02-28 | International Business Machines Corporation | Priors adaptation for conservative training of acoustic model |
Non-Patent Citations (1)
Title |
---|
王丽: "基于先验概率线性插值的声学模型自适应方法", 《第十四届全国人机语音通讯学术会议(NCMMSC’2017)论文集》 * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111243574A (en) * | 2020-01-13 | 2020-06-05 | 苏州奇梦者网络科技有限公司 | Voice model adaptive training method, system, device and storage medium |
CN111613209A (en) * | 2020-04-14 | 2020-09-01 | 北京三快在线科技有限公司 | Acoustic model training method and device, electronic equipment and storage medium |
CN111508479A (en) * | 2020-04-16 | 2020-08-07 | 重庆农村商业银行股份有限公司 | Voice recognition method, device, equipment and storage medium |
CN111477211A (en) * | 2020-04-17 | 2020-07-31 | 珠海声原智能科技有限公司 | Cross-scene fast-adaptation voice recognition method and device |
CN112466294A (en) * | 2020-11-24 | 2021-03-09 | 北京百度网讯科技有限公司 | Acoustic model generation method and device and electronic equipment |
CN112466294B (en) * | 2020-11-24 | 2021-12-14 | 北京百度网讯科技有限公司 | Acoustic model generation method and device and electronic equipment |
CN112596868A (en) * | 2020-11-27 | 2021-04-02 | 出门问问(武汉)信息科技有限公司 | Model training method and device |
CN113327587A (en) * | 2021-06-02 | 2021-08-31 | 云知声(上海)智能科技有限公司 | Method and device for voice recognition in specific scene, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN110379415B (en) | 2022-02-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110379415A (en) | The training method of domain-adaptive acoustic model | |
US10971142B2 (en) | Systems and methods for robust speech recognition using generative adversarial networks | |
CN110246487B (en) | Optimization method and system for single-channel speech recognition model | |
US11403345B2 (en) | Method and system for processing unclear intent query in conversation system | |
CN108475505B (en) | Generating a target sequence from an input sequence using partial conditions | |
US9818409B2 (en) | Context-dependent modeling of phonemes | |
CN110197658B (en) | Voice processing method and device and electronic equipment | |
CN110473566A (en) | Audio separation method, device, electronic equipment and computer readable storage medium | |
CN109637546A (en) | Knowledge distillating method and device | |
CN109074820A (en) | Audio processing is carried out using neural network | |
CN107564513A (en) | Audio recognition method and device | |
CN110232907A (en) | A kind of phoneme synthesizing method, device, readable storage medium storing program for executing and calculate equipment | |
CN110379407A (en) | Adaptive voice synthetic method, device, readable storage medium storing program for executing and calculating equipment | |
CN110189748A (en) | Model building method and device | |
AU2021236965B2 (en) | Automatically generating diverse text | |
CN111666416A (en) | Method and apparatus for generating semantic matching model | |
CN110277088A (en) | Intelligent voice recognition method, device and computer readable storage medium | |
CN109840052A (en) | A kind of audio-frequency processing method, device, electronic equipment and storage medium | |
CN111081230A (en) | Speech recognition method and apparatus | |
CN110569908B (en) | Speaker counting method and system | |
CN115376495A (en) | Speech recognition model training method, speech recognition method and device | |
US20220254351A1 (en) | Method and system for correcting speaker diarization using speaker change detection based on text | |
CN107910005A (en) | The target service localization method and device of interaction text | |
CN111462755A (en) | Information prompting method and device, electronic equipment and medium | |
US20220327356A1 (en) | Transformer-Based Model Knowledge Graph Link Prediction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |