CN110379415B - Training method of domain adaptive acoustic model - Google Patents

Training method of domain adaptive acoustic model

Info

Publication number
CN110379415B
CN110379415B (publication) · CN201910670390.6A (application)
Authority
CN
China
Prior art keywords
voice data
acoustic model
field
specified
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910670390.6A
Other languages
Chinese (zh)
Other versions
CN110379415A (en)
Inventor
钟利民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Go Out And Ask Suzhou Information Technology Co ltd
Original Assignee
Go Out And Ask Suzhou Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Go Out And Ask Suzhou Information Technology Co ltd filed Critical Go Out And Ask Suzhou Information Technology Co ltd
Priority to CN201910670390.6A
Publication of CN110379415A
Application granted
Publication of CN110379415B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 Hidden Markov Models [HMMs]
    • G10L15/144 Training of HMMs
    • G10L15/16 Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

Embodiments of the present disclosure provide a training method and apparatus for a domain adaptive acoustic model, a readable storage medium, and a computing device, used to construct an acoustic model with good recognition performance in a specified domain. The method includes: acquiring voice data of near domains of a plurality of specified domains, together with the text data corresponding to that voice data; training an acoustic model on this near-domain voice data and text data to obtain a general acoustic model; acquiring voice data of the specified domain and the corresponding text data; and training the general acoustic model on the specified-domain voice data and text data to obtain a domain adaptive acoustic model.

Description

Training method of domain adaptive acoustic model
Technical Field
The present disclosure relates to the field of speech processing technologies, and in particular to a training method and apparatus for a domain adaptive acoustic model, a readable storage medium, and a computing device.
Background
Automatic Speech Recognition (ASR) technology has advanced considerably in recent years; in some scenarios, such as intelligent voice assistants, the performance of state-of-the-art recognition systems approaches that of humans. In telephone customer service scenarios, however, the low sampling rate, heavy channel interference, and insufficient training data limit the overall recognition rate to only 50%-70% of the level achieved by 16 kHz speech recognition. Moreover, in telephone customer service applications, users typically care about the recognition rate in a specific domain, and performance degrades markedly when the recognition system's training data does not match the application domain, so a general telephone customer service speech recognition system cannot be used in these domains.
Disclosure of Invention
To this end, the present disclosure provides a training method, apparatus, readable storage medium, and computing device for a domain adaptive acoustic model, in an effort to solve, or at least mitigate, at least one of the problems identified above.
According to one aspect of the embodiments of the present disclosure, there is provided a training method for a domain adaptive acoustic model, including:
acquiring voice data of near domains of a plurality of specified domains and text data corresponding to that voice data;
training an acoustic model on the voice data of the near domains of the plurality of specified domains and the corresponding text data to obtain a general acoustic model;
acquiring voice data of a specified domain and text data corresponding to that voice data; and
training the general acoustic model on the voice data of the specified domain and the corresponding text data to obtain a domain adaptive acoustic model.
Optionally, training the general acoustic model on the specified-domain voice data and the corresponding text data to obtain the domain adaptive acoustic model includes:
training the general acoustic model on the specified-domain voice data and its corresponding text data, together with voice data of a near domain of the specified domain and its corresponding text data, to obtain the domain adaptive acoustic model, wherein the ratio of the amount of near-domain voice data to the amount of specified-domain voice data satisfies a preset condition.
Optionally, training the general acoustic model to obtain the domain adaptive acoustic model includes:
training the general acoustic model using a plurality of training methods; and
comparing the recognition rates on specified-domain speech of the acoustic models trained by the respective methods, and taking the model with the highest recognition rate as the domain adaptive acoustic model.
Optionally, when the training method of the general acoustic model is the same as that of the domain adaptive acoustic model, the number of training rounds of the domain adaptive acoustic model is lower than that of the general acoustic model.
Optionally, the plurality of training methods at least include:
a sequence cross-entropy objective, and the state-level minimum Bayes risk (sMBR) criterion.
Optionally, training the general acoustic model on the specified-domain voice data and its corresponding text data together with the near-domain voice data and its corresponding text data to obtain the domain adaptive acoustic model includes:
performing phoneme alignment on the specified-domain voice data and its corresponding text data, and on the near-domain voice data and its corresponding text data, according to the general acoustic model, and generating a word graph; and
training the general acoustic model according to the phoneme alignment result and the word graph to obtain the domain adaptive acoustic model.
Optionally, the preset condition includes:
the ratio of the amount of near-domain voice data to the amount of specified-domain voice data being between 1 and 2.
Optionally, the method of training an acoustic model includes:
training a triphone model on the voice data, and clustering the triphones with a decision tree;
performing phoneme alignment on the voice data according to the triphone model and the decision tree;
generating a word graph from the phoneme alignment result and the text data corresponding to the voice data; and
training the acoustic model according to the phoneme alignment result and the word graph.
Optionally, the voice data includes:
telephone customer service voice data;
the voice data of the near domains of the plurality of specified domains includes:
telephone customer service voice data of a plurality of industries; and
the voice data of the specified domain includes:
telephone customer service voice data of a specified industry.
According to another aspect of the embodiments of the present disclosure, there is provided a training apparatus for a domain adaptive acoustic model, including:
a first data acquisition unit, configured to acquire voice data of near domains of a plurality of specified domains and text data corresponding to that voice data;
a general acoustic model training unit, configured to train an acoustic model on the near-domain voice data and the corresponding text data to obtain a general acoustic model;
a second data acquisition unit, configured to acquire voice data of a specified domain and the corresponding text data; and
a specified-domain acoustic model training unit, configured to train the general acoustic model on the specified-domain voice data and the corresponding text data to obtain a domain adaptive acoustic model.
Optionally, the specified-domain acoustic model training unit is specifically configured to:
train the general acoustic model on the specified-domain voice data and its corresponding text data, together with voice data of a near domain of the specified domain and its corresponding text data, to obtain the domain adaptive acoustic model, wherein the ratio of the amount of near-domain voice data to the amount of specified-domain voice data satisfies a preset condition.
Optionally, when training the general acoustic model to obtain the domain adaptive acoustic model, the specified-domain acoustic model training unit is specifically configured to:
train the general acoustic model using a plurality of training methods; and
compare the recognition rates on specified-domain speech of the acoustic models trained by the respective methods, and take the model with the highest recognition rate as the domain adaptive acoustic model.
Optionally, when the training method of the general acoustic model is the same as that of the domain adaptive acoustic model, the number of training rounds of the domain adaptive acoustic model is lower than that of the general acoustic model.
Optionally, the plurality of training methods at least include:
a sequence cross-entropy objective, and the state-level minimum Bayes risk (sMBR) criterion.
Optionally, the specified-domain acoustic model training unit is specifically configured to:
perform phoneme alignment on the specified-domain voice data and its corresponding text data, and on the near-domain voice data and its corresponding text data, according to the general acoustic model, and generate a word graph; and
train the general acoustic model according to the phoneme alignment result and the word graph to obtain the domain adaptive acoustic model.
Optionally, the preset condition includes:
the ratio of the amount of near-domain voice data to the amount of specified-domain voice data being between 1 and 2.
Optionally, when used to train an acoustic model, the general acoustic model training unit or the specified-domain acoustic model training unit is specifically configured to:
train a triphone model on the voice data, and cluster the triphones with a decision tree;
perform phoneme alignment on the voice data according to the triphone model and the decision tree;
generate a word graph from the phoneme alignment result and the text data corresponding to the voice data; and
train the acoustic model according to the phoneme alignment result and the word graph.
Optionally, the voice data includes:
telephone customer service voice data;
the voice data of the near domains of the plurality of specified domains includes:
telephone customer service voice data of a plurality of industries; and
the voice data of the specified domain includes:
telephone customer service voice data of a specified industry.
According to yet another aspect of the embodiments of the present disclosure, there is provided a readable storage medium having executable instructions stored thereon which, when executed, cause a computer to perform the operations included in the above method.
According to yet another aspect of the embodiments of the present disclosure, there is provided a computing device, comprising: a processor; and a memory storing executable instructions that, when executed, cause the processor to perform the operations included in the above method.
According to the technical solution provided by the embodiments of the present disclosure, the domain adaptive acoustic model is obtained by further training on the basis of a trained general acoustic model, which achieves better acoustic model recognition performance in the specified domain.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the disclosure and together with the description serve to explain the principles of the disclosure.
FIG. 1 is a block diagram of an exemplary computing device;
FIG. 2 is a flow chart of a method of training a domain adaptive acoustic model according to an embodiment of the present disclosure;
FIG. 3 is a flow chart of an acoustic model training method according to an embodiment of the present disclosure;
FIG. 4 is yet another flow chart of a method of training a domain adaptive acoustic model according to an embodiment of the present disclosure;
FIG. 5 is a block diagram of a training apparatus for a domain adaptive acoustic model according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Fig. 1 is a block diagram of an example computing device 100 arranged to implement a training method of a domain adaptive acoustic model according to the present disclosure. In a basic configuration 102, computing device 100 typically includes system memory 106 and one or more processors 104. A memory bus 108 may be used for communication between the processor 104 and the system memory 106.
Depending on the desired configuration, the processor 104 may be any type of processor. The processor 104 may include one or more levels of cache, such as a level-one cache 110 and a level-two cache 112, a processor core 114, and registers 116. The example processor core 114 may include an arithmetic logic unit (ALU), a floating-point unit (FPU), a digital signal processing core (DSP core), or any combination thereof.
Depending on the desired configuration, system memory 106 may be any type of memory, including but not limited to volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. System memory 106 may include an operating system 120, one or more programs 122, and program data 124. In some implementations, the program 122 may be arranged to be executed on the operating system by the one or more processors 104 using the program data 124.
Computing device 100 may also include an interface bus 140 that facilitates communication from various interface devices (e.g., output devices 142, peripheral interfaces 144, and communication devices 146) to the basic configuration 102 via the bus/interface controller 130. The example output devices 142 include a graphics processing unit 148 and an audio processing unit 150, which may be configured to communicate with various external devices, such as a display terminal or speakers, via one or more A/V ports 152. Example peripheral interfaces 144 may include a serial interface controller 154 and a parallel interface controller 156, which may be configured to communicate with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner, etc.) via one or more I/O ports 158. An example communication device 146 may include a network controller 160, which may be arranged to facilitate communication with one or more other computing devices 162 over a network communication link via one or more communication ports 164.
A network communication link may be one example of a communication medium. Communication media may typically be embodied by computer-readable instructions, data structures, or program modules in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery media. A "modulated data signal" may be a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency (RF), microwave, infrared (IR), or other wireless media. The term computer-readable media as used herein may include both storage media and communication media.
Computing device 100 may be implemented as part of a small-form-factor portable (or mobile) electronic device, such as a cellular telephone, a personal digital assistant (PDA), a personal media player device, a wireless web-watch device, a personal headset device, an application-specific device, or a hybrid device that includes any of the above functions. Computing device 100 may also be implemented as a personal computer, including both desktop and notebook computer configurations.
The one or more programs 122 of the computing device 100 include instructions for performing the training method of a domain adaptive acoustic model according to the present disclosure.
Fig. 2 illustrates a flow chart of a method 200 for training a domain adaptive acoustic model according to an embodiment of the present disclosure. The method 200 begins at step S210.
In step S210, voice data of near domains of a plurality of specified domains, and text data corresponding to that voice data, are acquired.
According to embodiments of the present disclosure, the voice data of the near domains of the plurality of specified domains may or may not include the voice data of the specified domain.
According to embodiments of the present disclosure, the voice data may be telephone customer service voice data; the near-domain voice data may be telephone customer service voice data from the customer service departments of a plurality of industries; and the specified-domain voice data may be telephone customer service voice data from the customer service department of a given industry.
According to an embodiment of the present disclosure, the voice data and the text data should correspond one-to-one.
Subsequently, in step S220, an acoustic model is trained on the voice data of the near domains of the plurality of specified domains and the corresponding text data to obtain a general acoustic model.
According to an embodiment of the present disclosure, the general acoustic model training specifically trains a Time Delay Neural Network (TDNN) model with a sequence cross-entropy objective. This step can directly reuse the result of existing general acoustic model training, reducing the demand for computing resources.
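For illustration only, a TDNN can be viewed as a stack of dilated 1-D convolutions over frame-level acoustic features, so that deeper layers see wider temporal context. The PyTorch sketch below shows a minimal model of this kind; the layer widths, dilations, 40-dim feature input, and 3000-senone output are assumptions for the example, not values taken from this disclosure.

```python
import torch
import torch.nn as nn

# Minimal TDNN sketch: each layer is a dilated 1-D convolution over frame
# features. All dimensions here are illustrative assumptions.
class TDNN(nn.Module):
    def __init__(self, feat_dim=40, hidden=512, num_senones=3000):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=3, dilation=1),  # context +-1
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2),    # context +-2
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=3),    # context +-3
            nn.ReLU(),
        )
        self.output = nn.Conv1d(hidden, num_senones, kernel_size=1)

    def forward(self, feats):                    # feats: (batch, feat_dim, frames)
        return self.output(self.layers(feats))   # (batch, num_senones, frames - 12)

model = TDNN()
x = torch.randn(8, 40, 200)   # 8 utterances, 40-dim features, 200 frames
print(model(x).shape)         # torch.Size([8, 3000, 188])
```

The growing dilation is what gives the TDNN its progressively wider context window without a recurrent structure.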
Subsequently, in step S230, the voice data of the specified domain and the text data corresponding to the voice data of the specified domain are acquired.
Subsequently, in step S240, the general acoustic model is trained on the specified-domain voice data and the corresponding text data to obtain a domain adaptive acoustic model.
Because telephone customer service voice data has a low sampling rate, heavy channel interference, and insufficient training data, existing training results yield a low recognition rate. With this domain adaptive training method, existing telephone customer service voice data from many industries can be fully utilized: starting from the trained general acoustic model, further training is performed for a given industry to obtain a domain adaptive acoustic model, so that the recognition rate on that industry's telephone customer service voice data reaches a higher level.
Specifically, step S240 includes:
training the general acoustic model on the specified-domain voice data and its corresponding text data, together with voice data of a near domain of the specified domain and its corresponding text data, to obtain the domain adaptive acoustic model, wherein the ratio of the amount of near-domain voice data to the amount of specified-domain voice data satisfies a preset condition.
According to embodiments of the present disclosure, adding a small amount of near-domain voice and text data during the training of the domain adaptive acoustic model improves its generalization ability and yields a better-performing model.
For example, the ratio of the amount of near-domain voice data to the amount of specified-domain voice data may be between 1 and 2. If the ratio is too low, the generalization ability of the domain adaptive acoustic model decreases; if it is too high, the model's ability to recognize specified-domain voice data decreases.
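As a minimal sketch of assembling such a training set, assuming utterances are held as (audio path, transcript) pairs; the helper name and the 1.5 default ratio are illustrative assumptions, not prescribed by the disclosure:

```python
import random

def mix_training_sets(domain_utts, near_domain_utts, ratio=1.5, seed=0):
    """Combine specified-domain utterances with near-domain utterances
    sampled at `ratio` times the specified-domain amount (the disclosure
    suggests a ratio between 1 and 2)."""
    rng = random.Random(seed)
    n_near = min(len(near_domain_utts), int(ratio * len(domain_utts)))
    mixed = domain_utts + rng.sample(near_domain_utts, n_near)
    rng.shuffle(mixed)
    return mixed

# e.g. 1000 specified-domain utterances + 1500 sampled near-domain utterances
train_set = mix_training_sets([("a.wav", "hi")] * 1000,
                              [("b.wav", "ok")] * 5000)
print(len(train_set))  # 2500
```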
Further, in step S240, training the general acoustic model to obtain the domain adaptive acoustic model includes:
training the general acoustic model using a plurality of training methods; and
comparing the recognition rates on specified-domain speech of the acoustic models trained by the respective methods, and taking the model with the highest recognition rate as the domain adaptive acoustic model.
According to embodiments of the present disclosure, training the general acoustic model with several methods and selecting the result that validates best as the domain adaptive acoustic model improves the speech recognition performance of the final model.
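This "recognition rate" comparison is equivalent to picking the candidate with the lowest word error rate (WER) on a specified-domain evaluation set. A self-contained sketch follows, in which `decode` is a hypothetical stand-in for running a candidate model's recognizer on one utterance (not an API from this disclosure):

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance (single-row dynamic program)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution
    return d[-1]

def wer(refs, hyps):
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    return errors / max(sum(len(r.split()) for r in refs), 1)

def select_best(models, eval_wavs, eval_refs, decode):
    """Return the candidate with the lowest WER, i.e. the highest
    recognition rate, on the specified-domain evaluation set."""
    return min(models,
               key=lambda m: wer(eval_refs, [decode(m, w) for w in eval_wavs]))
```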
Further, when the training method of the general acoustic model is the same as that of the domain adaptive acoustic model, the number of training rounds of the domain adaptive acoustic model is lower than that of the general acoustic model.
For example, training the general acoustic model with the sequence cross-entropy objective requires 4-6 rounds over the TDNN model, whereas training the domain adaptive acoustic model with the same method typically achieves the expected effect in a single round, saving computing resources. Alternatively, the state-level Minimum Bayes Risk (sMBR) criterion can be used to train the acoustic model during this process, which consumes more computing resources.
Further, step S240 includes:
performing phoneme alignment on the specified-domain voice data and its corresponding text data, and on the near-domain voice data and its corresponding text data, according to the general acoustic model, and generating a word graph; and
training the general acoustic model according to the phoneme alignment result and the word graph to obtain the domain adaptive acoustic model.
According to embodiments of the present disclosure, if the training data of the general acoustic model does not include the training data of the domain adaptive acoustic model, the general acoustic model can be used directly to process the specified-domain voice data and its corresponding text data, which reduces manual labeling of speech and text and improves data-processing efficiency.
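The core of such phoneme alignment is a monotonic frame-to-phone assignment. The sketch below illustrates the idea with a plain Viterbi search over frame log-posteriors produced by the general model; real toolkits align against HMM state sequences and produce word graphs (lattices), so this is a simplified illustration under stated assumptions, not this disclosure's exact procedure:

```python
import numpy as np

def force_align(log_post, phone_seq):
    """Viterbi forced alignment: assign every frame to one phone of
    phone_seq, in order, maximizing the summed log-posteriors.
    log_post: (T, P) frame-by-phone log-posteriors from the general model.
    phone_seq: phone indices the transcript expands to (requires T >= len).
    Returns, per frame, the position within phone_seq."""
    T, S = log_post.shape[0], len(phone_seq)
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = log_post[0, phone_seq[0]]
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1, s]                            # same phone
            move = score[t - 1, s - 1] if s > 0 else -np.inf  # advance a phone
            best = s if stay >= move else s - 1
            back[t, s] = best
            score[t, s] = score[t - 1, best] + log_post[t, phone_seq[s]]
    path = [S - 1]                        # backtrace from the final phone
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]

rng = np.random.default_rng(0)
log_post = np.log(rng.dirichlet(np.ones(5), size=20))  # 20 frames, 5 phones
print(force_align(log_post, [2, 0, 3]))                # monotonic 0/1/2 labels
```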
Alternatively, as shown in fig. 3, acoustic model training may proceed as follows:
S310, train a triphone model on the voice data, and cluster the triphones with a decision tree;
S320, perform phoneme alignment on the voice data according to the triphone model and the decision tree;
S330, generate a word graph from the phoneme alignment result and the text data corresponding to the voice data; and
S340, train the acoustic model according to the phoneme alignment result and the word graph.
These steps can be used to train the general acoustic model as well as the domain adaptive acoustic model.
According to embodiments of the present disclosure, different speech recognition systems may be based on different acoustic features, such as acoustic models built on MFCC (Mel-Frequency Cepstral Coefficients) features or on PLP (Perceptual Linear Prediction) features, and may employ different acoustic models, such as the Hidden Markov Model-Gaussian Mixture Model (HMM-GMM) or a neural network acoustic model based on a Dynamic Bayesian Network (DBN).
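For instance, an MFCC front end for 8 kHz telephone audio might look like the following sketch (librosa); the 25 ms window and 10 ms hop are common front-end choices, not values prescribed by the disclosure:

```python
import numpy as np
import librosa

# MFCC extraction for 8 kHz telephone audio. n_fft=200 is a 25 ms window
# and hop_length=80 a 10 ms hop at this sampling rate; real use would load
# audio with librosa.load(path, sr=8000) instead of random samples.
sr = 8000
y = np.random.randn(2 * sr).astype(np.float32)   # 2 seconds of dummy audio
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=200, hop_length=80)
print(mfcc.shape)                                # (13, number_of_frames)
```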
Fig. 4 is a further flowchart of a training method of a domain adaptive acoustic model according to an embodiment of the present disclosure. In conjunction with the flow chart shown in fig. 4, a specific embodiment of the present disclosure is given below, using thousands of hours of 8 kHz telephone customer service voice data from multiple scenarios, together with the corresponding labeled text, as training data; the data sources include telephone customer service scenarios common in the market, such as automobile sales, financial sales, and educational consultation.
In this specific embodiment, training the general model includes the following steps.
Step a, train a triphone HMM-GMM model:
The system first trains a triphone HMM-GMM model using half of the total data, and builds a decision tree that ties similar triphones by decision-tree clustering, thereby reducing the phoneme space. Using half rather than all of the data is primarily a computational-resource consideration; if sufficient computing resources are available, the HMM-GMM model may be trained on the full data.
Step b, train the TDNN model:
Sub-step b1: using the HMM-GMM and decision tree generated in step a, perform phoneme alignment on the full data and generate the corresponding word graphs;
Sub-step b2: train the TDNN model with the sequence cross-entropy objective using the alignment data and corresponding word graphs generated in sub-step b1. Depending on computing resources, all data may be trained for 4-6 rounds.
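A toy sketch of sub-step b2 in PyTorch: frame-level cross-entropy against the aligned senone labels stands in for the sequence cross-entropy objective here, and the network, 40-dim features, 3000 senones, and dummy batch are assumptions for the example:

```python
import torch
import torch.nn as nn

# Train on aligned senone labels for 4-6 rounds; all shapes are illustrative.
model = nn.Sequential(nn.Conv1d(40, 512, kernel_size=3, dilation=2),
                      nn.ReLU(),
                      nn.Conv1d(512, 3000, kernel_size=1))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

feats = torch.randn(8, 40, 200)             # 8 utterances, 200 frames each
labels = torch.randint(0, 3000, (8, 196))   # aligned senone label per output frame

for epoch in range(5):                      # 4-6 rounds in the embodiment
    optimizer.zero_grad()
    logits = model(feats)                   # (8, 3000, 196)
    loss = criterion(logits, labels)        # CE over (batch, classes, frames)
    loss.backward()
    optimizer.step()
    print(f"round {epoch + 1}: loss {loss.item():.3f}")
```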
In this embodiment, training the domain adaptive acoustic model includes the following steps.
Step a, data selection:
the specified-domain data, plus other near-domain data amounting to 1-2 times the specified-domain data.
Step b, data preparation:
using the trained TDNN general model, align the data selected in step a and generate the corresponding word graphs.
Step c, train the domain model:
using the trained TDNN general model parameters as initialization, further learn the parameters on the alignment data and word graphs from step b.
In this embodiment, the following two methods of training the domain adaptive acoustic model are used:
Method one: starting from the TDNN general model, continue training on the data from step b with the sequence cross-entropy objective. This approach typically requires only one round over all the data.
Method two: starting from the TDNN general model, train on the data from step b with the sMBR criterion. This method requires 4-6 rounds of training.
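Method one amounts to copying the general model's parameters and making one further cross-entropy pass over the mixed domain data. A self-contained PyTorch sketch under the same dummy shapes as above (in practice `general_model` would already be trained, not freshly constructed):

```python
import copy
import torch
import torch.nn as nn

# Sketch of method one: general parameters as initialization, then a
# single further round on mixed specified-/near-domain data.
general_model = nn.Sequential(nn.Conv1d(40, 512, kernel_size=3, dilation=2),
                              nn.ReLU(),
                              nn.Conv1d(512, 3000, kernel_size=1))

domain_model = copy.deepcopy(general_model)        # general params as init
optimizer = torch.optim.SGD(domain_model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

feats = torch.randn(8, 40, 200)                    # mixed-domain dummy batch
labels = torch.randint(0, 3000, (8, 196))

for _ in range(1):                                 # one round typically suffices
    optimizer.zero_grad()
    loss = criterion(domain_model(feats), labels)
    loss.backward()
    optimizer.step()
```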
The final model performance is affected by the training-data proportion in step a and by the choice of training method in step c, so scenario-relevant evaluation data sets should be constructed for each scenario and a suitable model selected according to the recognition-rate results.
In this technical solution, on the basis of a general acoustic model trained on large-scale data, a domain adaptive acoustic model is trained with a small amount of specified-domain voice data and a small amount of near-domain voice data, providing a targeted domain adaptive acoustic model for each domain and improving the acoustic model's performance in that domain.
Referring to fig. 5, an embodiment of the present disclosure provides a training apparatus for a domain adaptive acoustic model, including:
a first data acquisition unit 510, configured to acquire voice data of near domains of a plurality of specified domains and the corresponding text data;
a general acoustic model training unit 520, configured to train an acoustic model on the near-domain voice data and the corresponding text data to obtain a general acoustic model;
a second data acquisition unit 530, configured to acquire voice data of a specified domain and the corresponding text data; and
a specified-domain acoustic model training unit 540, configured to train the general acoustic model on the specified-domain voice data and the corresponding text data to obtain a domain adaptive acoustic model.
Optionally, the specified-domain acoustic model training unit 540 is specifically configured to:
train the general acoustic model on the specified-domain voice data and its corresponding text data, together with voice data of a near domain of the specified domain and its corresponding text data, to obtain the domain adaptive acoustic model, wherein the ratio of the amount of near-domain voice data to the amount of specified-domain voice data satisfies a preset condition.
Optionally, when training the general acoustic model to obtain the domain adaptive acoustic model, the specified-domain acoustic model training unit 540 is specifically configured to:
train the general acoustic model using a plurality of training methods; and
compare the recognition rates on specified-domain speech of the acoustic models trained by the respective methods, and take the model with the highest recognition rate as the domain adaptive acoustic model.
Optionally, when the training method of the general acoustic model is the same as that of the domain adaptive acoustic model, the number of training rounds of the domain adaptive acoustic model is lower than that of the general acoustic model.
Optionally, the specified-domain acoustic model training unit 540 is specifically configured to:
perform phoneme alignment on the specified-domain voice data and its corresponding text data, and on the near-domain voice data and its corresponding text data, according to the general acoustic model, and generate a word graph; and
train the general acoustic model according to the phoneme alignment result and the word graph to obtain the domain adaptive acoustic model.
Optionally, the preset condition includes:
the ratio of the amount of near-domain voice data to the amount of specified-domain voice data being between 1 and 2.
Optionally, when used to train an acoustic model, the general acoustic model training unit 520 or the specified-domain acoustic model training unit 540 is specifically configured to:
train a triphone model on the voice data, and cluster the triphones with a decision tree;
perform phoneme alignment on the voice data according to the triphone model and the decision tree;
generate a word graph from the phoneme alignment result and the text data corresponding to the voice data; and
train the acoustic model according to the phoneme alignment result and the word graph.
Optionally, the voice data includes:
telephone customer service voice data;
the voice data of the near domains of the plurality of specified domains includes:
telephone customer service voice data of a plurality of industries; and
the voice data of the specified domain includes:
telephone customer service voice data of a specified industry.
For the specific definition of the training apparatus for the domain adaptive acoustic model, refer to the definition of the training method above; details are not repeated here.
It should be understood that the various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present disclosure, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the disclosure.
In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. The memory is configured to store program code; the processor is configured to perform the various methods of the present disclosure according to the instructions in the program code stored in the memory.
By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media store information such as computer-readable instructions, data structures, program modules, or other data. Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media. Combinations of any of the above are also included within the scope of computer-readable media.
It should be appreciated that in the foregoing description of exemplary embodiments of the disclosure, various features of the disclosure are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various disclosed aspects.
Those skilled in the art will appreciate that the modules or units or components of the devices in the examples disclosed herein may be arranged in a device as described in this embodiment or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into multiple sub-modules.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification, and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification may be replaced by an alternative feature serving the same, equivalent, or similar purpose, unless expressly stated otherwise.
Moreover, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the disclosure and form different embodiments.
Furthermore, some of the described embodiments are described herein as a method or combination of method elements that can be performed by a processor of a computer system or by other means of performing the described functions. A processor having the necessary instructions for carrying out the method or method elements thus forms a means for carrying out the method or method elements. Further, the elements of the apparatus embodiments described herein are examples of the following apparatus: the apparatus is used to implement the functions performed by the elements for the purposes of this disclosure.
As used herein, unless otherwise specified, the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object merely indicates that different instances of like objects are being referred to, and is not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.
While the disclosure has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the disclosure as described herein. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the disclosed subject matter.

Claims (8)

1. A training method for a domain adaptive acoustic model, comprising:
acquiring voice data of near domains of a plurality of specified domains and text data corresponding to that voice data;
training an acoustic model on the voice data of the near domains of the plurality of specified domains and the corresponding text data to obtain a general acoustic model;
acquiring voice data of a specified domain and text data corresponding to that voice data; and
training the general acoustic model on the voice data of the specified domain and its corresponding text data, together with voice data of a near domain of the specified domain and its corresponding text data, to obtain a domain adaptive acoustic model, wherein the ratio of the amount of near-domain voice data to the amount of specified-domain voice data satisfies a preset condition;
wherein training the general acoustic model on the specified-domain voice data and its corresponding text data together with the near-domain voice data and its corresponding text data to obtain the domain adaptive acoustic model comprises:
performing phoneme alignment on the specified-domain voice data and its corresponding text data, and on the near-domain voice data and its corresponding text data, according to the general acoustic model, and generating a word graph; and
training the general acoustic model according to the phoneme alignment result and the word graph to obtain the domain adaptive acoustic model.
2. The method of claim 1, wherein training the general acoustic model to obtain the domain adaptive acoustic model comprises:
training the general acoustic model using a plurality of training methods; and
comparing the recognition rates on specified-domain speech of the acoustic models trained by the respective methods, and taking the acoustic model with the highest recognition rate as the domain adaptive acoustic model.
3. The method of claim 2, wherein, when the training method of the general acoustic model and the training method of the domain adaptive acoustic model are the same, the number of training rounds of the domain adaptive acoustic model is lower than the number of training rounds of the general acoustic model.
4. The method of claim 1, wherein the preset condition comprises:
the ratio of the amount of near-domain voice data to the amount of specified-domain voice data being between 1 and 2.
5. The method of claim 1, wherein the voice data comprises:
telephone customer service voice data;
the voice data of the near domains of the plurality of specified domains comprises:
telephone customer service voice data of a plurality of industries; and
the voice data of the specified domain comprises:
telephone customer service voice data of a specified industry.
6. A training apparatus for a domain adaptive acoustic model, comprising:
a first data acquisition unit, configured to acquire voice data of near domains of a plurality of specified domains and text data corresponding to that voice data;
a general acoustic model training unit, configured to train an acoustic model on the near-domain voice data and the corresponding text data to obtain a general acoustic model;
a second data acquisition unit, configured to acquire voice data of a specified domain and the corresponding text data; and
a specified-domain acoustic model training unit, configured to train the general acoustic model on the specified-domain voice data and its corresponding text data, together with voice data of a near domain of the specified domain and its corresponding text data, to obtain a domain adaptive acoustic model, wherein the ratio of the amount of near-domain voice data to the amount of specified-domain voice data satisfies a preset condition, and wherein said training comprises: performing phoneme alignment on the specified-domain voice data and its corresponding text data, and on the near-domain voice data and its corresponding text data, according to the general acoustic model, and generating a word graph; and training the general acoustic model according to the phoneme alignment result and the word graph to obtain the domain adaptive acoustic model.
7. A readable storage medium having executable instructions stored thereon which, when executed, cause a computer to perform the operations included in the method of any one of claims 1-5.
8. A computing device, comprising:
a processor; and
a memory storing executable instructions that, when executed, cause the processor to perform the operations included in the method of any one of claims 1-5.
CN201910670390.6A 2019-07-24 2019-07-24 Training method of domain adaptive acoustic model Active CN110379415B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910670390.6A CN110379415B (en) 2019-07-24 2019-07-24 Training method of domain adaptive acoustic model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910670390.6A CN110379415B (en) 2019-07-24 2019-07-24 Training method of domain adaptive acoustic model

Publications (2)

Publication Number Publication Date
CN110379415A CN110379415A (en) 2019-10-25
CN110379415B true CN110379415B (en) 2022-02-18

Family

ID=68255440

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910670390.6A Active CN110379415B (en) 2019-07-24 2019-07-24 Training method of domain adaptive acoustic model

Country Status (1)

Country Link
CN (1) CN110379415B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111243574B (en) * 2020-01-13 2023-01-03 苏州奇梦者网络科技有限公司 Voice model adaptive training method, system, device and storage medium
CN111613209B (en) * 2020-04-14 2023-05-26 北京三快在线科技有限公司 Acoustic model training method and device, electronic equipment and storage medium
CN111508479B (en) * 2020-04-16 2022-11-22 重庆农村商业银行股份有限公司 Voice recognition method, device, equipment and storage medium
CN111477211A (en) * 2020-04-17 2020-07-31 珠海声原智能科技有限公司 Cross-scene fast-adaptation voice recognition method and device
CN112466294B (en) * 2020-11-24 2021-12-14 北京百度网讯科技有限公司 Acoustic model generation method and device and electronic equipment
CN112596868A (en) * 2020-11-27 2021-04-02 出门问问(武汉)信息科技有限公司 Model training method and device
CN113327587A (en) * 2021-06-02 2021-08-31 云知声(上海)智能科技有限公司 Method and device for voice recognition in specific scene, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101923854A (en) * 2010-08-31 2010-12-22 中国科学院计算技术研究所 Interactive speech recognition system and method
CN102280106A (en) * 2010-06-12 2011-12-14 三星电子株式会社 VWS method and apparatus used for mobile communication terminal
JP2016102820A (en) * 2014-11-27 2016-06-02 インターナショナル・ビジネス・マシーンズ・コーポレーションInternational Business Machines Corporation Method for improving acoustic model, and computer for improving acoustic model and computer program therefor
CN107154260A (en) * 2017-04-11 2017-09-12 北京智能管家科技有限公司 A kind of domain-adaptive audio recognition method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10923110B2 (en) * 2017-08-25 2021-02-16 International Business Machines Corporation Priors adaptation for conservative training of acoustic model

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102280106A (en) * 2010-06-12 2011-12-14 三星电子株式会社 VWS method and apparatus used for mobile communication terminal
CN101923854A (en) * 2010-08-31 2010-12-22 中国科学院计算技术研究所 Interactive speech recognition system and method
JP2016102820A (en) * 2014-11-27 2016-06-02 インターナショナル・ビジネス・マシーンズ・コーポレーションInternational Business Machines Corporation Method for improving acoustic model, and computer for improving acoustic model and computer program therefor
CN107154260A (en) * 2017-04-11 2017-09-12 北京智能管家科技有限公司 A kind of domain-adaptive audio recognition method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Acoustic model adaptation method based on linear interpolation of prior probabilities; Wang Li; Proceedings of the 14th National Conference on Man-Machine Speech Communication (NCMMSC'2017); 2017-10-30; pp. 392-396 *

Also Published As

Publication number Publication date
CN110379415A (en) 2019-10-25

Similar Documents

Publication Publication Date Title
CN110379415B (en) Training method of domain adaptive acoustic model
US10971142B2 (en) Systems and methods for robust speech recognition using generative adversarial networks
US10657955B2 (en) Systems and methods for principled bias reduction in production speech models
CN108475505B (en) Generating a target sequence from an input sequence using partial conditions
CN107610709B (en) Method and system for training voiceprint recognition model
KR101255402B1 (en) Redictation 0f misrecognized words using a list of alternatives
CN110379407B (en) Adaptive speech synthesis method, device, readable storage medium and computing equipment
CN103280216B (en) Improve the speech recognition device the relying on context robustness to environmental change
CN107481717B (en) Acoustic model training method and system
US20140156575A1 (en) Method and Apparatus of Processing Data Using Deep Belief Networks Employing Low-Rank Matrix Factorization
CN110232907B (en) Voice synthesis method and device, readable storage medium and computing equipment
CN111581375A (en) Dialog intention type identification method, multi-turn dialog method, device and computing equipment
CN113299282B (en) Voice recognition method, device, equipment and storage medium
CN110675863A (en) Voice corpus generation method and device and voice recognition method and device
CN111444719A (en) Entity identification method and device and computing equipment
JP7329393B2 (en) Audio signal processing device, audio signal processing method, audio signal processing program, learning device, learning method and learning program
CN112989041A (en) Text data processing method and device based on BERT
CN111243604B (en) Training method for speaker recognition neural network model supporting multiple awakening words, speaker recognition method and system
US11295732B2 (en) Dynamic interpolation for hybrid language models
CN113674733A (en) Method and apparatus for speaking time estimation
CN111681661B (en) Speech recognition method, apparatus, electronic device and computer readable medium
US20120173240A1 (en) Subspace Speech Adaptation
JP7112348B2 (en) SIGNAL PROCESSING DEVICE, SIGNAL PROCESSING METHOD AND SIGNAL PROCESSING PROGRAM
US20220254351A1 (en) Method and system for correcting speaker diarization using speaker change detection based on text
Khemakhem et al. Towards A Distributed Arabic OCR Based on the DTW Algorithm: Performance Analysis.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant