CN110379415A - Training method for a domain-adaptive acoustic model - Google Patents



Publication number
CN110379415A
Authority
CN
China
Prior art keywords
voice data
acoustic model
field
data
designated
Legal status
Granted
Application number
CN201910670390.6A
Other languages
Chinese (zh)
Other versions
CN110379415B (en)
Inventor
钟利民
Current Assignee
Go Out And Ask (suzhou) Information Technology Co Ltd
Original Assignee
Go Out And Ask (suzhou) Information Technology Co Ltd
Application filed by Go Out And Ask (suzhou) Information Technology Co Ltd
Priority to CN201910670390.6A
Publication of CN110379415A
Application granted
Publication of CN110379415B
Legal status: Active


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 Hidden Markov Models [HMMs]
    • G10L15/144 Training of HMMs
    • G10L15/16 Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

Embodiments of the present disclosure provide a training method and apparatus, a readable storage medium, and a computing device for a domain-adaptive acoustic model, for building an acoustic model with good recognition performance in a designated domain. The method includes: obtaining speech data from multiple domains related to the designated domain, together with the text data corresponding to that speech data; training an acoustic model on the related-domain speech data and its corresponding text data to obtain a general acoustic model; obtaining speech data of the designated domain and the text data corresponding to it; and training the general acoustic model on the designated-domain speech data and its corresponding text data to obtain a domain-adaptive acoustic model.

Description

Training method for a domain-adaptive acoustic model
Technical field
This disclosure relates to the field of speech processing technology, and in particular to a training method, an apparatus, a readable storage medium, and a computing device for a domain-adaptive acoustic model.
Background
Automatic speech recognition (ASR) technology has made considerable progress in the past few years, and in certain scenarios, such as intelligent voice assistants, the performance of state-of-the-art recognition systems approaches that of humans. In telephone customer-service scenarios, however, the low sampling rate, heavy channel interference, and insufficient training data mean that overall recognition accuracy only reaches 50%-70% of the level of 16 kHz speech recognition. In addition, in telephone customer-service business applications, users usually care about recognition accuracy in one particular domain. When the training data of the recognition system does not match the application domain, performance drops significantly, so a general telephone customer-service speech recognition system often cannot be used in these domains.
Summary of the invention
For this purpose, present disclose provides a kind of training method of domain-adaptive acoustic model, device, readable storage medium storing program for executing and Calculate equipment, with try hard to solve the problems, such as or at least alleviate above existing at least one.
According to one aspect of the embodiments of the present disclosure, a training method for a domain-adaptive acoustic model is provided, including:
obtaining speech data from multiple domains related to a designated domain, and the text data corresponding to that speech data;
training an acoustic model on the related-domain speech data and its corresponding text data to obtain a general acoustic model;
obtaining speech data of the designated domain and the text data corresponding to it;
training the general acoustic model on the designated-domain speech data and its corresponding text data to obtain a domain-adaptive acoustic model.
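The two-stage flow above can be sketched as a toy example. This is an illustration under stated assumptions, not the disclosure's implementation: a linear model trained by SGD on squared loss stands in for the acoustic model, and (feature vector, target) pairs stand in for speech/transcript pairs. What it shows is that stage two starts from the stage-one parameters instead of from scratch. The names `sgd_train`, `mse`, and `train_domain_adaptive`, and the epoch and learning-rate defaults, are hypothetical.

```python
from typing import List, Tuple

# Toy stand-in for a (speech features, transcript label) pair: (feature vector, target).
Sample = Tuple[List[float], float]


def sgd_train(init: List[float], data: List[Sample], epochs: int, lr: float) -> List[float]:
    """Train a linear model with squared loss by plain SGD, starting from `init`."""
    w = list(init)  # copy, so the caller's parameters (e.g. the general model) stay untouched
    for _ in range(epochs):
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for i, xi in enumerate(x):
                w[i] -= lr * err * xi
    return w


def mse(w: List[float], data: List[Sample]) -> float:
    """Mean squared error of the linear model `w` on `data`."""
    return sum((sum(wi * xi for wi, xi in zip(w, x)) - y) ** 2 for x, y in data) / len(data)


def train_domain_adaptive(related: List[Sample], designated: List[Sample], dim: int,
                          general_epochs: int = 5, adapt_epochs: int = 1,
                          lr: float = 0.05) -> Tuple[List[float], List[float]]:
    # Stage 1: general model trained from scratch on pooled related-domain data.
    general = sgd_train([0.0] * dim, related, general_epochs, lr)
    # Stage 2: initialise from the general model and keep training on designated-domain data.
    adapted = sgd_train(general, designated, adapt_epochs, lr)
    return general, adapted


if __name__ == "__main__":
    # Related domains follow y = 1 + 2f; the designated domain is shifted to y = 3 + 2f.
    related = [([1.0, f], 1.0 + 2.0 * f) for f in (0.0, 1.0, 2.0, 3.0)]
    designated = [([1.0, f], 3.0 + 2.0 * f) for f in (0.0, 1.0, 2.0, 3.0)]
    general, adapted = train_domain_adaptive(related, designated, dim=2)
    print("designated-domain MSE, general:", mse(general, designated))
    print("designated-domain MSE, adapted:", mse(adapted, designated))
```

Passing the general weights as the starting point of the second `sgd_train` call is the whole adaptation step in this toy; a real system would do the same with the network parameters.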
Optionally, training the general acoustic model on the designated-domain speech data and its corresponding text data to obtain the domain-adaptive acoustic model includes:
training the general acoustic model on the designated-domain speech data and its corresponding text data, together with speech data of domains related to the designated domain and its corresponding text data, to obtain the domain-adaptive acoustic model, where the ratio of the amount of related-domain speech data to the amount of designated-domain speech data satisfies a preset condition.
Optionally, training the general acoustic model to obtain the domain-adaptive acoustic model includes:
training the general acoustic model with multiple training methods;
comparing the recognition accuracy on designated-domain speech of the acoustic models produced by each training method, and taking the model with the highest accuracy as the domain-adaptive acoustic model.
Optionally, when the general acoustic model and the domain-adaptive acoustic model are trained with the same method, the domain-adaptive acoustic model is trained for fewer epochs than the general acoustic model.
Optionally, the multiple training methods include at least:
the sequence cross-entropy objective and the state-level minimum Bayes risk (sMBR) criterion.
Optionally, training the general acoustic model on the designated-domain speech data and its corresponding text data, together with the related-domain speech data and its corresponding text data, to obtain the domain-adaptive acoustic model includes:
using the general acoustic model to perform phoneme alignment on the designated-domain speech data and its corresponding text data and on the related-domain speech data and its corresponding text data, and to generate word lattices;
training the general acoustic model on the phoneme alignment results and the word lattices to obtain the domain-adaptive acoustic model.
Optionally, the preset condition includes:
the amount of related-domain speech data is 1 to 2 times the amount of designated-domain speech data.
Optionally, the method for training an acoustic model includes:
training phoneme models on the speech data and clustering the phoneme models with a decision tree;
performing phoneme alignment on the speech data using the phoneme models and the decision tree;
generating word lattices from the phoneme alignment results and the text data corresponding to the speech data;
training the acoustic model on the phoneme alignment results and the word lattices.
Optionally, the speech data includes:
telephone customer-service speech data;
the speech data of the multiple related domains includes:
telephone customer-service speech data from multiple industries;
the speech data of the designated domain includes:
telephone customer-service speech data from a designated industry.
According to another aspect of the embodiments of the present disclosure, a training apparatus for a domain-adaptive acoustic model is provided, including:
a first data acquisition unit, configured to obtain speech data from multiple domains related to a designated domain, and the text data corresponding to that speech data;
a general acoustic model training unit, configured to train an acoustic model on the related-domain speech data and its corresponding text data to obtain a general acoustic model;
a second data acquisition unit, configured to obtain speech data of the designated domain and the text data corresponding to it;
a designated-domain acoustic model training unit, configured to train the general acoustic model on the designated-domain speech data and its corresponding text data to obtain a domain-adaptive acoustic model.
Optionally, the designated-domain acoustic model training unit is specifically configured to:
train the general acoustic model on the designated-domain speech data and its corresponding text data, together with speech data of domains related to the designated domain and its corresponding text data, to obtain the domain-adaptive acoustic model, where the ratio of the amount of related-domain speech data to the amount of designated-domain speech data satisfies a preset condition.
Optionally, when training the general acoustic model to obtain the domain-adaptive acoustic model, the designated-domain acoustic model training unit is specifically configured to:
train the general acoustic model with multiple training methods;
compare the recognition accuracy on designated-domain speech of the acoustic models produced by each training method, and take the model with the highest accuracy as the domain-adaptive acoustic model.
Optionally, when the general acoustic model and the domain-adaptive acoustic model are trained with the same method, the domain-adaptive acoustic model is trained for fewer epochs than the general acoustic model.
Optionally, the multiple training methods include at least:
the sequence cross-entropy objective and the state-level minimum Bayes risk (sMBR) criterion.
Optionally, the designated-domain acoustic model training unit is specifically configured to:
use the general acoustic model to perform phoneme alignment on the designated-domain speech data and its corresponding text data and on the related-domain speech data and its corresponding text data, and to generate word lattices;
train the general acoustic model on the phoneme alignment results and the word lattices to obtain the domain-adaptive acoustic model.
Optionally, the preset condition includes:
the amount of related-domain speech data is 1 to 2 times the amount of designated-domain speech data.
Optionally, when training an acoustic model, the general acoustic model training unit or the designated-domain acoustic model training unit is specifically configured to:
train phoneme models on the speech data and cluster the phoneme models with a decision tree;
perform phoneme alignment on the speech data using the phoneme models and the decision tree;
generate word lattices from the phoneme alignment results and the text data corresponding to the speech data;
train the acoustic model on the phoneme alignment results and the word lattices.
Optionally, the speech data includes:
telephone customer-service speech data;
the speech data of the multiple related domains includes:
telephone customer-service speech data from multiple industries;
the speech data of the designated domain includes:
telephone customer-service speech data from a designated industry.
According to another aspect of the embodiments of the present disclosure, a readable storage medium is provided, storing executable instructions that, when executed, cause a computer to perform the operations included in the method above.
According to another aspect of the embodiments of the present disclosure, a computing device is provided, including a processor and a memory storing executable instructions that, when executed, cause the processor to perform the operations included in the method above.
In the technical solution provided by the embodiments of the present disclosure, a domain-adaptive acoustic model is further trained on top of an already trained general acoustic model, which yields better acoustic model recognition performance in the designated domain.
Description of the drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, show exemplary embodiments of the disclosure and, together with the description, serve to explain its principles.
Fig. 1 is a structural block diagram of an exemplary computing device;
Fig. 2 is a flowchart of a training method for a domain-adaptive acoustic model according to an embodiment of the present disclosure;
Fig. 3 is a flowchart of an acoustic model training method according to an embodiment of the present disclosure;
Fig. 4 is another flowchart of a training method for a domain-adaptive acoustic model according to an embodiment of the present disclosure;
Fig. 5 is a structural diagram of a training apparatus for a domain-adaptive acoustic model according to an embodiment of the present disclosure.
Detailed description of embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the invention is understood more thoroughly and its scope is fully conveyed to those skilled in the art.
Fig. 1 is a block diagram of an example computing device 100 arranged to implement the training method for a domain-adaptive acoustic model according to the present disclosure. In a basic configuration 102, the computing device 100 typically includes a system memory 106 and one or more processors 104. A memory bus 108 can be used for communication between the processors 104 and the system memory 106.
Depending on the desired configuration, the processor 104 can be any type of processor, including but not limited to a microprocessor (µP), a microcontroller (µC), a digital signal processor (DSP), or any combination thereof. The processor 104 may include one or more levels of cache, such as a level-1 cache 110 and a level-2 cache 112, a processor core 114, and registers 116. An example processor core 114 may include an arithmetic logic unit (ALU), a floating-point unit (FPU), a digital signal processing core (DSP core), or any combination thereof. An example memory controller 118 can be used with the processor 104, or in some implementations the memory controller 118 can be an internal part of the processor 104.
Depending on the desired configuration, the system memory 106 can be any type of memory, including but not limited to volatile memory (RAM), non-volatile memory (ROM, flash memory, etc.), or any combination thereof. The system memory 106 may include an operating system 120, one or more programs 122, and program data 124. In some embodiments, the programs 122 can be configured to be executed on the operating system by the one or more processors 104 using the program data 124.
The computing device 100 may also include an interface bus 140 that facilitates communication from various interface devices (for example, output devices 142, peripheral interfaces 144, and communication devices 146) to the basic configuration 102 via a bus/interface controller 130. Example output devices 142 include a graphics processing unit 148 and an audio processing unit 150, which can be configured to communicate with various external devices, such as a display or speakers, via one or more A/V ports 152. Example peripheral interfaces 144 may include a serial interface controller 154 and a parallel interface controller 156, which can be configured to communicate via one or more I/O ports 158 with external devices such as input devices (for example, a keyboard, mouse, pen, voice input device, or touch input device) or other peripherals (for example, a printer or scanner). An example communication device 146 may include a network controller 160, which can be arranged to communicate with one or more other computing devices 162 over a network communication link via one or more communication ports 164.
A network communication link is one example of a communication medium. Communication media can typically be embodied as computer-readable instructions, data structures, or program modules in a modulated data signal, such as a carrier wave or other transport mechanism, and can include any information delivery media. A "modulated data signal" is a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media can include wired media, such as a wired network or a dedicated-line network, and various wireless media, such as acoustic, radio frequency (RF), microwave, infrared (IR), or other wireless media. The term computer-readable medium as used herein can include both storage media and communication media.
The computing device 100 can be implemented as part of a small-form-factor portable (or mobile) electronic device, such as a cellular phone, a personal digital assistant (PDA), a personal media player, a wireless web-browsing device, a personal headset device, an application-specific device, or a hybrid device that includes any of the above functions. The computing device 100 can also be implemented as a personal computer, including desktop and notebook computer configurations.
The one or more programs 122 of the computing device 100 include instructions for performing the training method for a domain-adaptive acoustic model according to the present disclosure.
Fig. 2 shows a flowchart of a training method 200 for a domain-adaptive acoustic model according to an embodiment of the present disclosure. The method 200 starts at step S210.
In step S210, speech data from multiple domains related to the designated domain, and the text data corresponding to that speech data, are obtained.
According to an embodiment of the present disclosure, the speech data of the multiple related domains may or may not include speech data of the designated domain.
According to an embodiment of the present disclosure, the speech data can be telephone customer-service speech data: the related-domain speech data can be telephone customer-service speech data from the customer-service departments of multiple industries, and the designated-domain speech data can be telephone customer-service speech data from the customer-service department of one designated industry.
According to an embodiment of the present disclosure, the speech data and the text data should be in one-to-one correspondence.
Then, in step S220, an acoustic model is trained on the related-domain speech data and its corresponding text data to obtain a general acoustic model.
According to an embodiment of the present disclosure, the general acoustic model is obtained by training a time delay neural network (TDNN) model with the sequence cross-entropy objective. This step can directly reuse the result of an existing general acoustic model training run, reducing the demand for computing resources.
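A TDNN layer can be read as an affine transform applied to a window of frames spliced at fixed time offsets. The sketch below shows one such layer in pure Python to make that concrete; the offsets, weights, ReLU activation, and the function name `tdnn_layer` are illustrative assumptions, since the disclosure does not specify the network's topology.

```python
from typing import List, Sequence

Frame = List[float]


def tdnn_layer(frames: Sequence[Frame], offsets: Sequence[int],
               weight: Sequence[Sequence[float]], bias: Sequence[float]) -> List[Frame]:
    """One TDNN layer: splice input frames at the given time offsets, then affine + ReLU.

    An output frame is produced only at times t where every t + offset lies inside
    the input, so the output is shorter than the input by the layer's context width.
    Assumes the offsets include 0 (e.g. (-1, 0, 1)), as is typical.
    """
    dim = len(frames[0])
    for row in weight:
        if len(row) != dim * len(offsets):
            raise ValueError("each weight row must match the spliced input size")
    left, right = -min(offsets), max(offsets)
    out: List[Frame] = []
    for t in range(left, len(frames) - right):
        # Concatenate the context frames, e.g. [x(t-1), x(t), x(t+1)].
        spliced = [v for o in offsets for v in frames[t + o]]
        out.append([max(0.0, sum(w * x for w, x in zip(row, spliced)) + b)
                    for row, b in zip(weight, bias)])
    return out


if __name__ == "__main__":
    frames = [[1.0], [2.0], [3.0], [4.0]]  # four 1-dimensional frames
    print(tdnn_layer(frames, offsets=(-1, 0, 1), weight=[[1.0, 1.0, 1.0]], bias=[0.0]))
    # prints [[6.0], [9.0]]
```

Stacking several such layers, each with its own offsets, widens the total time context the network sees without making any single layer large.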
Then, in step S230, speech data of the designated domain and the text data corresponding to it are obtained.
Then, in step S240, the general acoustic model is trained on the designated-domain speech data and its corresponding text data to obtain the domain-adaptive acoustic model.
Because telephone customer-service speech data has a low sampling rate, heavy channel interference, and insufficient training data, models trained with existing approaches have low recognition accuracy. The training method for a domain-adaptive acoustic model provided by this disclosure makes full use of existing telephone customer-service speech data from many industries: after a general acoustic model has been trained on it, the model is further trained for one particular industry to obtain a domain-adaptive acoustic model, so that recognition accuracy on that industry's telephone customer-service speech reaches a higher level.
Specifically, step S240 includes:
training the general acoustic model on the designated-domain speech data and its corresponding text data, together with speech data of domains related to the designated domain and its corresponding text data, to obtain the domain-adaptive acoustic model, where the ratio of the amount of related-domain speech data to the amount of designated-domain speech data satisfies a preset condition.
According to an embodiment of the present disclosure, adding a small amount of related-domain speech and text data when training the domain-adaptive acoustic model improves its generalization ability and yields a better-performing domain-adaptive acoustic model.
For example, the amount of related-domain speech data can be 1 to 2 times the amount of designated-domain speech data. If the ratio is too low, the generalization ability of the domain-adaptive acoustic model decreases; if it is too high, the model's ability to recognize designated-domain speech decreases.
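One way to respect this 1-to-2 ratio when assembling adaptation data is to measure both sets by audio duration and sample related-domain utterances until the target amount is reached. In the sketch below, measuring "amount" in seconds, the seeded random sampling, and the function name `mix_adaptation_data` are assumptions; the disclosure only states the ratio.

```python
import random
from typing import List, Sequence, Tuple

Utterance = Tuple[str, float]  # (utterance id, duration in seconds)


def mix_adaptation_data(designated: Sequence[Utterance], related: Sequence[Utterance],
                        ratio: float = 1.5, seed: int = 0) -> List[Utterance]:
    """Return all designated-domain utterances plus randomly chosen related-domain
    utterances whose total duration reaches `ratio` times the designated-domain
    duration (or all related-domain data, if there is not enough of it)."""
    if not 1.0 <= ratio <= 2.0:
        raise ValueError("related-to-designated ratio should be between 1 and 2")
    target = ratio * sum(dur for _, dur in designated)
    pool = list(related)
    random.Random(seed).shuffle(pool)  # seeded so the selection is reproducible
    chosen: List[Utterance] = []
    total = 0.0
    for utt in pool:
        if total >= target:
            break
        chosen.append(utt)
        total += utt[1]
    return list(designated) + chosen


if __name__ == "__main__":
    designated = [("d1", 10.0), ("d2", 10.0)]
    related = [(f"r{i}", 5.0) for i in range(20)]
    mixed = mix_adaptation_data(designated, related, ratio=1.5)
    print(len(mixed), sum(d for _, d in mixed))
```

Because whole utterances are added, the related-domain total can overshoot the target by at most one utterance; that granularity is usually acceptable for a ratio stated as a range.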
Further, in step S240, training the general acoustic model to obtain the domain-adaptive acoustic model includes:
training the general acoustic model with multiple training methods;
comparing the recognition accuracy on designated-domain speech of the acoustic models produced by each training method, and taking the model with the highest accuracy as the domain-adaptive acoustic model.
According to an embodiment of the present disclosure, training the general acoustic model with multiple training methods and selecting the result that performs best on validation as the domain-adaptive acoustic model improves its speech recognition performance.
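Selecting the best candidate comes down to scoring each model on a held-out designated-domain test set. The sketch below assumes that recognition accuracy is measured as a token error rate (word error rate, or character error rate for Chinese) based on Levenshtein distance, so the highest accuracy corresponds to the lowest error rate; the candidate names, the sample utterance, and the function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Tokens = Sequence[str]


def edit_distance(ref: Tokens, hyp: Tokens) -> int:
    """Levenshtein distance (substitutions + deletions + insertions) between token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cur[j] = min(prev[j] + 1,             # deletion
                         cur[j - 1] + 1,          # insertion
                         prev[j - 1] + (r != h))  # substitution, or match at no cost
        prev = cur
    return prev[-1]


def error_rate(refs: Sequence[Tokens], hyps: Sequence[Tokens]) -> float:
    """Total edit errors divided by the total number of reference tokens."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return errors / sum(len(r) for r in refs)


def pick_best(candidates: Dict[str, List[Tokens]], refs: Sequence[Tokens]) -> Tuple[str, float]:
    """Return the candidate whose hypotheses have the lowest error rate, and that rate."""
    scores = {name: error_rate(refs, hyps) for name, hyps in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]


if __name__ == "__main__":
    refs = [list("我要办信用卡")]  # character-level reference for one Chinese utterance
    candidates = {
        "adapted_ce": [list("我要办信用卡")],
        "adapted_smbr": [list("我要半信用卡")],
    }
    print(pick_best(candidates, refs))  # ('adapted_ce', 0.0)
```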
Further, when the general acoustic model and the domain-adaptive acoustic model are trained with the same method, the domain-adaptive acoustic model is trained for fewer epochs than the general acoustic model.
For example, training the general acoustic model as a TDNN with the sequence cross-entropy objective takes 4-6 epochs. If the domain-adaptive acoustic model is likewise a TDNN trained with the sequence cross-entropy objective, a single epoch of training is enough to reach the expected effect, which saves computing resources. In addition, the acoustic model can also be trained with the state-level minimum Bayes risk (sMBR) method, which consumes more computing resources.
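For reference, the two criteria named above are commonly written as follows. This assumes that "sequence cross-entropy" refers to frame-level cross-entropy accumulated over each utterance's aligned state sequence, which the text does not spell out; the sMBR expression is the standard form of that objective.

```latex
\mathcal{L}_{\mathrm{CE}} = -\sum_{u}\sum_{t=1}^{T_u} \log p\left(s_{u,t} \mid \mathbf{o}_{u,t}\right)

\mathcal{F}_{\mathrm{sMBR}} = \sum_{u}
  \frac{\sum_{W} p\left(\mathbf{O}_u \mid S_W\right)^{\kappa} P(W)\, A\left(W, W_u\right)}
       {\sum_{W'} p\left(\mathbf{O}_u \mid S_{W'}\right)^{\kappa} P(W')}
```

Here $u$ indexes utterances with $T_u$ frames, $s_{u,t}$ is the state label of frame $t$ from the alignment, $\mathbf{o}_{u,t}$ its feature vector, $\mathbf{O}_u$ the whole observation sequence, $S_W$ the state sequence of hypothesis $W$, $\kappa$ an acoustic scale, $P(W)$ the language-model probability, and $A(W, W_u)$ the number of correctly labelled states relative to the reference $W_u$. $\mathcal{L}_{\mathrm{CE}}$ is minimized, while $\mathcal{F}_{\mathrm{sMBR}}$ is maximized over competing hypotheses, usually taken from word lattices; summing over those hypotheses is a large part of why sMBR training costs more computation.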
Further, step S240 includes:
using the general acoustic model to perform phoneme alignment on the designated-domain speech data and its corresponding text data and on the related-domain speech data and its corresponding text data, and to generate word lattices;
training the general acoustic model on the phoneme alignment results and the word lattices to obtain the domain-adaptive acoustic model.
According to an embodiment of the present disclosure, if the training data of the general acoustic model does not include the training data of the domain-adaptive acoustic model, the general acoustic model can be used directly to process the designated-domain speech data and its corresponding text data. This reduces the manual annotation needed for that speech and text and improves data-processing efficiency.
Optionally, as shown in Fig. 3, acoustic model training can use the following steps:
S310: train phoneme models on the speech data, and cluster the phoneme models with a decision tree;
S320: perform phoneme alignment on the speech data using the phoneme models and the decision tree;
S330: generate word lattices from the phoneme alignment results and the text data corresponding to the speech data;
S340: train the acoustic model on the phoneme alignment results and the word lattices.
These steps can be used both to train the general acoustic model and to train the domain-adaptive acoustic model.
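The phoneme alignment in step S320 is commonly implemented as Viterbi forced alignment: the transcript is expanded into a left-to-right sequence of HMM states, and each frame is assigned to one state so that states appear in order and the total log-likelihood is maximal. The sketch below assumes the per-frame state log-likelihoods have already been computed (for example by the HMM-GMM or TDNN model), ignores transition probabilities, and leaves out lattice generation (S330); `forced_align` is a hypothetical name.

```python
import math
from typing import List, Sequence


def forced_align(loglik: Sequence[Sequence[float]]) -> List[int]:
    """Viterbi forced alignment over a left-to-right state sequence.

    loglik[t][j] is the log-likelihood of frame t under the j-th state of the
    transcript's expanded state sequence. Each frame either stays in the current
    state or advances to the next one; the path must start in state 0 and end in
    the last state. Returns the state index assigned to every frame.
    """
    T, N = len(loglik), len(loglik[0])
    if T < N:
        raise ValueError("need at least one frame per state")
    score = [[-math.inf] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    score[0][0] = loglik[0][0]
    for t in range(1, T):
        for j in range(N):
            stay = score[t - 1][j]
            advance = score[t - 1][j - 1] if j > 0 else -math.inf
            back[t][j] = j if stay >= advance else j - 1
            score[t][j] = max(stay, advance) + loglik[t][j]
    # Trace back from the last state at the last frame.
    path = [N - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


if __name__ == "__main__":
    # Two states, four frames: the first two frames fit state 0, the last two state 1.
    frame_logliks = [[0.0, -5.0], [0.0, -5.0], [-5.0, 0.0], [-5.0, 0.0]]
    print(forced_align(frame_logliks))  # [0, 0, 1, 1]
```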
According to an embodiment of the present disclosure, different speech recognition systems can be based on different acoustic features, for example acoustic models based on MFCC (Mel-Frequency Cepstral Coefficients) features or on PLP (Perceptual Linear Prediction) features, or can use different acoustic models, such as hidden Markov model-Gaussian mixture models (HMM-GMM) or neural-network acoustic models based on dynamic Bayesian networks (DBN).
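MFCC extraction pools the power spectrum through triangular filters spaced evenly on the mel scale. As a small illustration relevant to the 8 kHz telephone data in this disclosure, the sketch below computes the filter edge frequencies, which can only extend up to the 4 kHz Nyquist limit. It uses the common HTK-style mapping mel(f) = 2595 log10(1 + f/700); the filter count of 23 and the function names are assumptions, not values from the disclosure.

```python
import math
from typing import List


def hz_to_mel(f_hz: float) -> float:
    """HTK-style mel scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(num_filters: int, sample_rate: float, low_hz: float = 0.0) -> List[float]:
    """Edge frequencies (Hz) of `num_filters` triangular filters spaced evenly in mel.

    Filter k spans edges[k]..edges[k + 2] and peaks at edges[k + 1], so there are
    num_filters + 2 edges; the top edge is the Nyquist frequency sample_rate / 2.
    """
    high_hz = sample_rate / 2.0
    if not 0.0 <= low_hz < high_hz:
        raise ValueError("low_hz must lie in [0, Nyquist)")
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (num_filters + 1)
    return [mel_to_hz(low_mel + k * step) for k in range(num_filters + 2)]


if __name__ == "__main__":
    edges = mel_band_edges(num_filters=23, sample_rate=8000)
    print([round(e) for e in edges])  # filters crowd together at low frequencies
```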
Fig. 4 is another flowchart of the training method for a domain-adaptive acoustic model provided by an embodiment of the present disclosure. A specific embodiment of the disclosure is described below with reference to the flow shown in Fig. 4. It uses as training data several thousand hours of 8 kHz telephone customer-service speech from multiple scenarios, together with the corresponding annotated transcripts; the data sources include common telephone customer-service scenarios in markets such as automobile sales, financial sales, and education consulting.
In this specific embodiment, general model training includes the following steps:
Step a, triphone HMM-GMM model training:
The system first trains a triphone HMM-GMM model on half of the full data set, and uses decision-tree clustering to build a decision tree that ties similar triphones, reducing the phoneme space. Half of the data is used instead of the full set mainly to save computing resources; if computing resources are sufficient, the HMM-GMM model can be trained on the full data set.
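The tying step can be illustrated with the likelihood-gain criterion typically used to grow phonetic decision trees: each candidate question splits a pool of triphones in two, each side is modeled by a single maximum-likelihood Gaussian, and the question that raises the total log-likelihood the most is chosen. The sketch below handles one-dimensional statistics at a single tree node; the triphone naming `left-center+right`, the question set, the variance floor, and all function names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class Stats:
    """Sufficient statistics of 1-D observations: count, sum, sum of squares."""
    n: float = 0.0
    s: float = 0.0
    ss: float = 0.0

    def __add__(self, other: "Stats") -> "Stats":
        return Stats(self.n + other.n, self.s + other.s, self.ss + other.ss)


def stats_of(xs: Iterable[float]) -> Stats:
    xs = list(xs)
    return Stats(float(len(xs)), sum(xs), sum(x * x for x in xs))


def loglik(st: Stats, var_floor: float = 1e-3) -> float:
    """Log-likelihood of the pooled data under one ML-fitted Gaussian."""
    if st.n == 0:
        return 0.0
    mean = st.s / st.n
    var = max(st.ss / st.n - mean * mean, var_floor)
    return -0.5 * st.n * (math.log(2.0 * math.pi * var) + 1.0)


def best_split(stats: Dict[str, Stats],
               questions: Dict[str, Callable[[str], bool]]) -> Tuple[Optional[str], float]:
    """Pick the question with the largest log-likelihood gain when splitting `stats`."""
    parent = loglik(sum(stats.values(), Stats()))
    best_q, best_gain = None, 0.0
    for name, q in questions.items():
        yes = sum((st for tri, st in stats.items() if q(tri)), Stats())
        no = sum((st for tri, st in stats.items() if not q(tri)), Stats())
        if yes.n == 0 or no.n == 0:
            continue  # this question does not split the node
        gain = loglik(yes) + loglik(no) - parent
        if gain > best_gain:
            best_q, best_gain = name, gain
    return best_q, best_gain


if __name__ == "__main__":
    stats = {
        "b-a+t": stats_of([0.0, 0.2]), "p-a+t": stats_of([0.1, -0.1]),
        "i-a+t": stats_of([5.0, 5.2]), "u-a+t": stats_of([4.9, 5.1]),
    }
    questions = {
        "left_is_vowel": lambda tri: tri.split("-")[0] in {"i", "u"},
        "left_is_b": lambda tri: tri.startswith("b-"),
    }
    print(best_split(stats, questions))
```

A full tree builder would apply `best_split` recursively and stop when the gain or the occupancy falls below a threshold; the leaves then become the tied states.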
Step b, TDNN model training:
Sub-step b1: using the HMM-GMM model and the decision tree generated in step a, perform phoneme alignment on the full data set and generate the corresponding word lattices;
Sub-step b2: using the alignments and word lattices generated in sub-step b1, train the TDNN model with the sequence cross-entropy objective. Depending on the available computing resources, all of the data can be trained for 4-6 epochs.
In this specific embodiment, domain-adaptive acoustic model training includes the following steps:
Step a, data selection:
Designated-domain data, plus data from other related domains amounting to 1 to 2 times the designated-domain quantity.
Step b, data preparation:
Using the trained general TDNN model, align the data selected in step a and generate the corresponding word lattices.
Step c, domain model training:
Use the parameters of the trained general TDNN model as initial values, and further learn the parameters with the alignments and word lattices from step b.
In this specific embodiment, two methods are used for domain-adaptive acoustic model training:
Method one: starting from the general TDNN model, continue training on the data from step b with the sequence cross-entropy objective. This method generally needs only 1 epoch over all of the data.
Method two: starting from the general TDNN model, train on the data from step b with the sMBR criterion. This method needs 4-6 epochs.
Both the training data ratio in step a and the choice of training method in step c affect final model performance. Each scenario needs its own scenario-specific evaluation data set, and a suitable model is selected according to the recognition accuracy results.
Technical solution provided by the present disclosure uses minority on the basis of large-scale data training generic acoustic model Designated field voice data and a small amount of near field voice data train domain-adaptive acoustic model, thus realize for Every field, which provides, has targetedly domain-adaptive acoustic model, improves the acoustic model performance in the field.
Referring to Fig. 5, an embodiment of the present disclosure provides a training device for a domain-adaptive acoustic model, comprising:
First data acquisition unit 510, configured to obtain voice data of related domains of multiple designated domains and text data corresponding to the voice data of the related domains of the multiple designated domains;
General acoustic model training unit 520, configured to train an acoustic model according to the voice data of the related domains of the multiple designated domains and the corresponding text data, to obtain a general acoustic model;
Second data acquisition unit 530, configured to obtain voice data of a designated domain and text data corresponding to the voice data of the designated domain;
Designated-domain acoustic model training unit 540, configured to train the general acoustic model according to the voice data of the designated domain and the corresponding text data, to obtain a domain-adaptive acoustic model.
Optionally, the designated-domain acoustic model training unit 540 is specifically configured to:
train the general acoustic model according to the voice data of the designated domain and its corresponding text data, and the voice data of the related domains of the designated domain and its corresponding text data, to obtain the domain-adaptive acoustic model; wherein the ratio of the amount of voice data of the related domains of the designated domain to the amount of voice data of the designated domain satisfies a preset condition.
Optionally, when training the general acoustic model to obtain the domain-adaptive acoustic model, the designated-domain acoustic model training unit 540 is specifically configured to:
train the general acoustic model using multiple training methods;
compare the recognition rates, on speech of the designated domain, of the acoustic models trained by the respective training methods, and take the acoustic model with the highest recognition rate as the domain-adaptive acoustic model.
Optionally, when the general acoustic model and the domain-adaptive acoustic model are trained with the same training method, the domain-adaptive acoustic model is trained for fewer epochs than the general acoustic model.
Optionally, the designated-domain acoustic model training unit 540 is specifically configured to:
perform, according to the general acoustic model, phoneme alignment on the voice data of the designated domain and its corresponding text data, and on the voice data of the related domains of the designated domain and its corresponding text data, and generate lattices;
train the general acoustic model according to the phoneme alignment results and the lattices, to obtain the domain-adaptive acoustic model.
Optionally, the preset condition includes:
the ratio of the amount of voice data of the related domains of the designated domain to the amount of voice data of the designated domain is 1 to 2.
Optionally, when training an acoustic model, the general acoustic model training unit 520 or the designated-domain acoustic model training unit 540 is specifically configured to:
train a phoneme model on the voice data, and cluster the phoneme model using a decision tree;
perform phoneme alignment on the voice data according to the phoneme model and the decision tree;
generate lattices according to the phoneme alignment results and the text data corresponding to the voice data;
train the acoustic model according to the phoneme alignment results and the lattices.
Optionally, the voice data includes:
telephone customer-service voice data;
the voice data of the related domains of the multiple designated domains includes:
telephone customer-service voice data of multiple industries;
the voice data of the designated domain includes:
telephone customer-service voice data of a designated industry.
For specific limitations of the training device for the domain-adaptive acoustic model, refer to the limitations of the training method for the domain-adaptive acoustic model described above; details are not repeated here.
It should be understood that the various techniques described herein may be implemented in hardware, in software, or in a combination of both. Thus, the methods and apparatus of the present disclosure, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media such as floppy disks, CD-ROMs, hard drives, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine such as a computer, the machine becomes an apparatus for practicing the disclosure.
Where the program code executes on a programmable computer, the computing device generally includes a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. The memory is configured to store program code, and the processor is configured to perform the various methods of the disclosure according to the instructions in the program code stored in the memory.
By way of example and not limitation, computer-readable media include computer storage media and communication media. Computer storage media store information such as computer-readable instructions, data structures, program modules, or other data. Communication media generally embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media. Combinations of any of the above are also included within the scope of computer-readable media.
It should be appreciated that, in the above description of exemplary embodiments of the disclosure, various features of the disclosure are sometimes grouped together in a single embodiment, figure, or description thereof in order to streamline the disclosure and aid in understanding one or more of the various disclosed aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed disclosure requires more features than are expressly recited in each claim. Rather, as the following claims reflect, disclosed aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of this disclosure.
Those skilled in the art should understand that the modules, units, or components of the devices in the examples disclosed herein may be arranged in a device as described in the embodiments, or alternatively may be located in one or more devices different from the devices in the examples. The modules in the foregoing examples may be combined into one module or further divided into multiple sub-modules.
Those skilled in the art will understand that the modules in the devices of the embodiments can be adaptively changed and arranged in one or more devices different from those of the embodiments. The modules, units, or components of the embodiments may be combined into one module, unit, or component, and may furthermore be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, equivalent, or similar purpose.
Furthermore, those skilled in the art will appreciate that although some embodiments described herein include certain features included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the disclosure and to form different embodiments. For example, in the following claims, any one of the claimed embodiments may be used in any combination.
Furthermore, some of the embodiments are described herein as a method or a combination of method elements that can be implemented by a processor of a computer system or by other means of carrying out the function. A processor having the necessary instructions for carrying out such a method or method element therefore forms a means for carrying out the method or method element. Furthermore, an element of an apparatus embodiment described herein is an example of a means for carrying out the function performed by the element for the purpose of carrying out the disclosure.
As used herein, unless otherwise specified, the use of the ordinal adjectives "first", "second", "third", etc. to describe a common object merely indicates that different instances of like objects are being referred to, and is not intended to imply that the objects so described must be in a given sequence, whether temporally, spatially, in ranking, or in any other manner.
While the disclosure has been described in terms of a limited number of embodiments, it will be apparent to those skilled in the art, having the benefit of the above description, that other embodiments can be envisaged within the scope of the disclosure thus described. In addition, it should be noted that the language used in this specification has been chosen principally for readability and instructional purposes, and not to delineate or circumscribe the subject matter of the disclosure. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. With respect to the scope of the disclosure, the disclosure made herein is illustrative and not restrictive, and the scope of the disclosure is defined by the appended claims.

Claims (10)

1. A training method for a domain-adaptive acoustic model, characterized by comprising:
obtaining voice data of related domains of multiple designated domains and text data corresponding to the voice data of the related domains of the multiple designated domains;
training an acoustic model according to the voice data of the related domains of the multiple designated domains and the text data corresponding to that voice data, to obtain a general acoustic model;
obtaining voice data of a designated domain and text data corresponding to the voice data of the designated domain;
training the general acoustic model according to the voice data of the designated domain and the text data corresponding to the voice data of the designated domain, to obtain a domain-adaptive acoustic model.
2. The method according to claim 1, characterized in that training the general acoustic model according to the voice data of the designated domain and the text data corresponding to the voice data of the designated domain, to obtain a domain-adaptive acoustic model, comprises:
training the general acoustic model according to the voice data of the designated domain and its corresponding text data, and the voice data of the related domains of the designated domain and its corresponding text data, to obtain the domain-adaptive acoustic model; wherein the ratio of the amount of voice data of the related domains of the designated domain to the amount of voice data of the designated domain satisfies a preset condition.
3. The method according to claim 2, characterized in that training the general acoustic model according to the voice data of the designated domain and its corresponding text data, and the voice data of the related domains of the designated domain and its corresponding text data, to obtain the domain-adaptive acoustic model, comprises:
performing, according to the general acoustic model, phoneme alignment on the voice data of the designated domain and its corresponding text data, and on the voice data of the related domains of the designated domain and its corresponding text data, and generating lattices;
training the general acoustic model according to the phoneme alignment results and the lattices, to obtain the domain-adaptive acoustic model.
4. The method according to claim 2 or 3, characterized in that training the general acoustic model to obtain the domain-adaptive acoustic model comprises:
training the general acoustic model using multiple training methods;
comparing the recognition rates, on speech of the designated domain, of the acoustic models trained by the respective training methods, and taking the acoustic model with the highest recognition rate on speech of the designated domain as the domain-adaptive acoustic model.
5. The method according to claim 4, characterized in that, when the training method of the general acoustic model is the same as the training method of the domain-adaptive acoustic model, the number of training epochs of the domain-adaptive acoustic model is lower than the number of training epochs of the general acoustic model.
6. The method according to claim 2, characterized in that the preset condition comprises:
the ratio of the amount of voice data of the related domains of the designated domain to the amount of voice data of the designated domain is 1 to 2.
7. The method according to claim 1, characterized in that the voice data comprises:
telephone customer-service voice data;
the voice data of the related domains of the multiple designated domains comprises:
telephone customer-service voice data of multiple industries;
the voice data of the designated domain comprises:
telephone customer-service voice data of a designated industry.
8. A training device for a domain-adaptive acoustic model, characterized by comprising:
a first data acquisition unit, configured to obtain voice data of related domains of multiple designated domains and text data corresponding to the voice data of the related domains of the multiple designated domains;
a general acoustic model training unit, configured to train an acoustic model according to the voice data of the related domains of the multiple designated domains and the text data corresponding to that voice data, to obtain a general acoustic model;
a second data acquisition unit, configured to obtain voice data of a designated domain and text data corresponding to the voice data of the designated domain;
a designated-domain acoustic model training unit, configured to train the general acoustic model according to the voice data of the designated domain and the text data corresponding to the voice data of the designated domain, to obtain a domain-adaptive acoustic model.
9. A readable storage medium having executable instructions stored thereon which, when executed, cause a computer to perform the operations included in any one of claims 1-7.
10. A computing device, comprising:
a processor; and
a memory storing executable instructions which, when executed, cause the processor to perform the operations included in any one of claims 1-7.
CN201910670390.6A 2019-07-24 2019-07-24 Training method of domain adaptive acoustic model Active CN110379415B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910670390.6A CN110379415B (en) 2019-07-24 2019-07-24 Training method of domain adaptive acoustic model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910670390.6A CN110379415B (en) 2019-07-24 2019-07-24 Training method of domain adaptive acoustic model

Publications (2)

Publication Number Publication Date
CN110379415A true CN110379415A (en) 2019-10-25
CN110379415B CN110379415B (en) 2022-02-18

Family

ID=68255440

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910670390.6A Active CN110379415B (en) 2019-07-24 2019-07-24 Training method of domain adaptive acoustic model

Country Status (1)

Country Link
CN (1) CN110379415B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101923854A (en) * 2010-08-31 2010-12-22 中国科学院计算技术研究所 Interactive speech recognition system and method
CN102280106A (en) * 2010-06-12 2011-12-14 三星电子株式会社 VWS method and apparatus used for mobile communication terminal
JP2016102820A (en) * 2014-11-27 2016-06-02 インターナショナル・ビジネス・マシーンズ・コーポレーションInternational Business Machines Corporation Method for improving acoustic model, and computer for improving acoustic model and computer program therefor
CN107154260A (en) * 2017-04-11 2017-09-12 北京智能管家科技有限公司 Domain-adaptive speech recognition method and device
US20190066662A1 (en) * 2017-08-25 2019-02-28 International Business Machines Corporation Priors adaptation for conservative training of acoustic model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
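Wang Li (王丽): "Acoustic model adaptation method based on linear interpolation of prior probabilities", Proceedings of the 14th National Conference on Man-Machine Speech Communication (NCMMSC'2017) *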

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111243574A (en) * 2020-01-13 2020-06-05 苏州奇梦者网络科技有限公司 Voice model adaptive training method, system, device and storage medium
CN111613209A (en) * 2020-04-14 2020-09-01 北京三快在线科技有限公司 Acoustic model training method and device, electronic equipment and storage medium
CN111508479A (en) * 2020-04-16 2020-08-07 重庆农村商业银行股份有限公司 Voice recognition method, device, equipment and storage medium
CN111477211A (en) * 2020-04-17 2020-07-31 珠海声原智能科技有限公司 Cross-scene fast-adaptation voice recognition method and device
CN112466294A (en) * 2020-11-24 2021-03-09 北京百度网讯科技有限公司 Acoustic model generation method and device and electronic equipment
CN112466294B (en) * 2020-11-24 2021-12-14 北京百度网讯科技有限公司 Acoustic model generation method and device and electronic equipment
CN112596868A (en) * 2020-11-27 2021-04-02 出门问问(武汉)信息科技有限公司 Model training method and device
CN113327587A (en) * 2021-06-02 2021-08-31 云知声(上海)智能科技有限公司 Method and device for voice recognition in specific scene, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN110379415B (en) 2022-02-18

Similar Documents

Publication Publication Date Title
CN110379415A (en) The training method of domain-adaptive acoustic model
US10971142B2 (en) Systems and methods for robust speech recognition using generative adversarial networks
CN110246487B (en) Optimization method and system for single-channel speech recognition model
US11403345B2 (en) Method and system for processing unclear intent query in conversation system
CN108475505B (en) Generating a target sequence from an input sequence using partial conditions
US9818409B2 (en) Context-dependent modeling of phonemes
CN110197658B (en) Voice processing method and device and electronic equipment
CN110473566A (en) Audio separation method, device, electronic equipment and computer readable storage medium
CN109637546A (en) Knowledge distillating method and device
CN109074820A (en) Audio processing is carried out using neural network
CN107564513A (en) Audio recognition method and device
CN110232907A (en) Speech synthesis method and device, readable storage medium and computing device
CN110379407A (en) Adaptive speech synthesis method and device, readable storage medium and computing device
CN110189748A (en) Model building method and device
AU2021236965B2 (en) Automatically generating diverse text
CN111666416A (en) Method and apparatus for generating semantic matching model
CN110277088A (en) Intelligent voice recognition method, device and computer readable storage medium
CN109840052A (en) A kind of audio-frequency processing method, device, electronic equipment and storage medium
CN111081230A (en) Speech recognition method and apparatus
CN110569908B (en) Speaker counting method and system
CN115376495A (en) Speech recognition model training method, speech recognition method and device
US20220254351A1 (en) Method and system for correcting speaker diarization using speaker change detection based on text
CN107910005A (en) The target service localization method and device of interaction text
CN111462755A (en) Information prompting method and device, electronic equipment and medium
US20220327356A1 (en) Transformer-Based Model Knowledge Graph Link Prediction

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant