CN110379415B - Training method of domain adaptive acoustic model - Google Patents

Training method of domain adaptive acoustic model

Info

Publication number
CN110379415B
CN110379415B (publication) · CN201910670390.6A (application)
Authority
CN
China
Prior art keywords
voice data
acoustic model
field
specified
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910670390.6A
Other languages
Chinese (zh)
Other versions
CN110379415A (en)
Inventor
钟利民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Go Out And Ask Suzhou Information Technology Co ltd
Original Assignee
Go Out And Ask Suzhou Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Go Out And Ask Suzhou Information Technology Co ltd filed Critical Go Out And Ask Suzhou Information Technology Co ltd
Priority to CN201910670390.6A
Publication of CN110379415A
Application granted
Publication of CN110379415B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 Hidden Markov Models [HMMs]
    • G10L15/144 Training of HMMs
    • G10L15/16 Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

Embodiments of the present disclosure provide a training method and apparatus for a domain adaptive acoustic model, a readable storage medium, and a computing device, used to construct an acoustic model with good recognition performance in a specified domain. The method includes: acquiring voice data of near domains of a plurality of specified domains, together with the text data corresponding to that voice data; training an acoustic model on this near-domain voice data and text data to obtain a general acoustic model; acquiring voice data of the specified domain and the corresponding text data; and training the general acoustic model on the specified-domain voice data and text data to obtain a domain adaptive acoustic model.

Description

Training method of domain adaptive acoustic model
Technical Field
The present disclosure relates to the field of speech processing technologies, and in particular to a training method and apparatus for a domain adaptive acoustic model, a readable storage medium, and a computing device.
Background
Automatic Speech Recognition (ASR) technology has advanced considerably in recent years; in some scenarios, such as intelligent voice assistants, the performance of state-of-the-art recognition systems approaches that of humans. In telephone customer service scenarios, however, the low sampling rate, heavy channel interference, and insufficient training data limit the overall recognition rate to only 50%-70% of the level achieved by 16 kHz speech recognition. Moreover, in telephone customer service applications, users typically care about the recognition rate in a specific domain, and performance degrades markedly when the recognition system's training data does not match the application domain, so a general telephone customer service speech recognition system cannot be used in these domains.
Disclosure of Invention
To this end, the present disclosure provides a training method, apparatus, readable storage medium, and computing device for a domain adaptive acoustic model, in an effort to solve, or at least mitigate, at least one of the problems identified above.
According to one aspect of the embodiments of the present disclosure, there is provided a training method for a domain adaptive acoustic model, including:
acquiring voice data of near domains of a plurality of specified domains and text data corresponding to that voice data;
training an acoustic model on the voice data of the near domains of the plurality of specified domains and the corresponding text data to obtain a general acoustic model;
acquiring voice data of a specified domain and text data corresponding to that voice data; and
training the general acoustic model on the voice data of the specified domain and the corresponding text data to obtain a domain adaptive acoustic model.
Optionally, training the general acoustic model on the specified-domain voice data and the corresponding text data to obtain the domain adaptive acoustic model includes:
training the general acoustic model on the specified-domain voice data and its corresponding text data, together with voice data of a near domain of the specified domain and its corresponding text data, to obtain the domain adaptive acoustic model, wherein the ratio of the amount of near-domain voice data to the amount of specified-domain voice data satisfies a preset condition.
Optionally, training the general acoustic model to obtain the domain adaptive acoustic model includes:
training the general acoustic model using a plurality of training methods; and
comparing the recognition rates on specified-domain speech of the acoustic models trained by the respective methods, and taking the model with the highest recognition rate as the domain adaptive acoustic model.
Optionally, when the training method of the general acoustic model is the same as that of the domain adaptive acoustic model, the number of training rounds of the domain adaptive acoustic model is lower than that of the general acoustic model.
Optionally, the plurality of training methods at least include:
a sequence cross-entropy objective, and the state-level minimum Bayes risk (sMBR) criterion.
Optionally, training the general acoustic model on the specified-domain voice data and its corresponding text data together with the near-domain voice data and its corresponding text data to obtain the domain adaptive acoustic model includes:
performing phoneme alignment on the specified-domain voice data and its corresponding text data, and on the near-domain voice data and its corresponding text data, according to the general acoustic model, and generating a word graph; and
training the general acoustic model according to the phoneme alignment result and the word graph to obtain the domain adaptive acoustic model.
Optionally, the preset condition includes:
the ratio of the amount of near-domain voice data to the amount of specified-domain voice data being between 1 and 2.
Optionally, the method of training an acoustic model includes:
training a triphone model on the voice data, and clustering the triphones with a decision tree;
performing phoneme alignment on the voice data according to the triphone model and the decision tree;
generating a word graph from the phoneme alignment result and the text data corresponding to the voice data; and
training the acoustic model according to the phoneme alignment result and the word graph.
Optionally, the voice data includes:
telephone customer service voice data;
the voice data of the near domains of the plurality of specified domains includes:
telephone customer service voice data of a plurality of industries; and
the voice data of the specified domain includes:
telephone customer service voice data of a specified industry.
According to another aspect of the embodiments of the present disclosure, there is provided a training apparatus for a domain adaptive acoustic model, including:
a first data acquisition unit, configured to acquire voice data of near domains of a plurality of specified domains and text data corresponding to that voice data;
a general acoustic model training unit, configured to train an acoustic model on the near-domain voice data and the corresponding text data to obtain a general acoustic model;
a second data acquisition unit, configured to acquire voice data of a specified domain and the corresponding text data; and
a specified-domain acoustic model training unit, configured to train the general acoustic model on the specified-domain voice data and the corresponding text data to obtain a domain adaptive acoustic model.
Optionally, the specified-domain acoustic model training unit is specifically configured to:
train the general acoustic model on the specified-domain voice data and its corresponding text data, together with voice data of a near domain of the specified domain and its corresponding text data, to obtain the domain adaptive acoustic model, wherein the ratio of the amount of near-domain voice data to the amount of specified-domain voice data satisfies a preset condition.
Optionally, when training the general acoustic model to obtain the domain adaptive acoustic model, the specified-domain acoustic model training unit is specifically configured to:
train the general acoustic model using a plurality of training methods; and
compare the recognition rates on specified-domain speech of the acoustic models trained by the respective methods, and take the model with the highest recognition rate as the domain adaptive acoustic model.
Optionally, when the training method of the general acoustic model is the same as that of the domain adaptive acoustic model, the number of training rounds of the domain adaptive acoustic model is lower than that of the general acoustic model.
Optionally, the plurality of training methods at least include:
a sequence cross-entropy objective, and the state-level minimum Bayes risk (sMBR) criterion.
Optionally, the specified-domain acoustic model training unit is specifically configured to:
perform phoneme alignment on the specified-domain voice data and its corresponding text data, and on the near-domain voice data and its corresponding text data, according to the general acoustic model, and generate a word graph; and
train the general acoustic model according to the phoneme alignment result and the word graph to obtain the domain adaptive acoustic model.
Optionally, the preset condition includes:
the ratio of the amount of near-domain voice data to the amount of specified-domain voice data being between 1 and 2.
Optionally, when used to train an acoustic model, the general acoustic model training unit or the specified-domain acoustic model training unit is specifically configured to:
train a triphone model on the voice data, and cluster the triphones with a decision tree;
perform phoneme alignment on the voice data according to the triphone model and the decision tree;
generate a word graph from the phoneme alignment result and the text data corresponding to the voice data; and
train the acoustic model according to the phoneme alignment result and the word graph.
Optionally, the voice data includes:
telephone customer service voice data;
the voice data of the near domains of the plurality of specified domains includes:
telephone customer service voice data of a plurality of industries; and
the voice data of the specified domain includes:
telephone customer service voice data of a specified industry.
According to yet another aspect of the embodiments of the present disclosure, there is provided a readable storage medium having executable instructions stored thereon which, when executed, cause a computer to perform the operations included in the above method.
According to yet another aspect of the embodiments of the present disclosure, there is provided a computing device, comprising: a processor; and a memory storing executable instructions that, when executed, cause the processor to perform the operations included in the above method.
According to the technical solution provided by the embodiments of the present disclosure, the domain adaptive acoustic model is obtained by further training on the basis of a trained general acoustic model, which achieves better acoustic model recognition performance in the specified domain.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the disclosure and together with the description serve to explain the principles of the disclosure.
FIG. 1 is a block diagram of an exemplary computing device;
FIG. 2 is a flow chart of a method of training a domain adaptive acoustic model according to an embodiment of the present disclosure;
FIG. 3 is a flow chart of an acoustic model training method according to an embodiment of the present disclosure;
FIG. 4 is yet another flow chart of a method of training a domain adaptive acoustic model according to an embodiment of the present disclosure;
FIG. 5 is a block diagram of a training apparatus for a domain adaptive acoustic model according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Fig. 1 is a block diagram of an example computing device 100 arranged to implement a training method of a domain adaptive acoustic model according to the present disclosure. In a basic configuration 102, computing device 100 typically includes system memory 106 and one or more processors 104. A memory bus 108 may be used for communication between the processor 104 and the system memory 106.
Depending on the desired configuration, the processor 104 may be any type of processor. The processor 104 may include one or more levels of cache, such as a level-one cache 110 and a level-two cache 112, a processor core 114, and registers 116. The example processor core 114 may include an arithmetic logic unit (ALU), a floating-point unit (FPU), a digital signal processing core (DSP core), or any combination thereof.
Depending on the desired configuration, system memory 106 may be any type of memory, including but not limited to volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. System memory 106 may include an operating system 120, one or more programs 122, and program data 124. In some implementations, the program 122 may be arranged to be executed on the operating system by the one or more processors 104 using the program data 124.
Computing device 100 may also include an interface bus 140 that facilitates communication from various interface devices (e.g., output devices 142, peripheral interfaces 144, and communication devices 146) to the basic configuration 102 via the bus/interface controller 130. The example output devices 142 include a graphics processing unit 148 and an audio processing unit 150, which may be configured to communicate with various external devices, such as a display terminal or speakers, via one or more A/V ports 152. Example peripheral interfaces 144 may include a serial interface controller 154 and a parallel interface controller 156, which may be configured to communicate with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner, etc.) via one or more I/O ports 158. An example communication device 146 may include a network controller 160, which may be arranged to facilitate communication with one or more other computing devices 162 over a network communication link via one or more communication ports 164.
A network communication link may be one example of a communication medium. Communication media may typically be embodied by computer-readable instructions, data structures, or program modules in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery media. A "modulated data signal" may be a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency (RF), microwave, infrared (IR), or other wireless media. The term computer-readable media as used herein may include both storage media and communication media.
Computing device 100 may be implemented as part of a small-form-factor portable (or mobile) electronic device, such as a cellular telephone, a personal digital assistant (PDA), a personal media player device, a wireless web-watch device, a personal headset device, an application-specific device, or a hybrid device that includes any of the above functions. Computing device 100 may also be implemented as a personal computer, including both desktop and notebook computer configurations.
The one or more programs 122 of the computing device 100 include instructions for performing the training method of a domain adaptive acoustic model according to the present disclosure.
Fig. 2 illustrates a flow chart of a method 200 for training a domain adaptive acoustic model according to an embodiment of the present disclosure. The method 200 begins at step S210.
In step S210, voice data of near domains of a plurality of specified domains, and text data corresponding to that voice data, are acquired.
According to embodiments of the present disclosure, the voice data of the near domains of the plurality of specified domains may or may not include the voice data of the specified domain.
According to embodiments of the present disclosure, the voice data may be telephone customer service voice data; the near-domain voice data may be telephone customer service voice data from the customer service departments of a plurality of industries; and the specified-domain voice data may be telephone customer service voice data from the customer service department of a given industry.
According to an embodiment of the present disclosure, the voice data and the text data should correspond one-to-one.
Subsequently, in step S220, an acoustic model is trained on the voice data of the near domains of the plurality of specified domains and the corresponding text data to obtain a general acoustic model.
According to an embodiment of the present disclosure, the general acoustic model training specifically trains a Time Delay Neural Network (TDNN) model with a sequence cross-entropy objective. This step can directly reuse the result of existing general acoustic model training, reducing the demand for computing resources.
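For illustration only, a TDNN can be viewed as a stack of dilated 1-D convolutions over frame-level acoustic features, so that deeper layers see wider temporal context. The PyTorch sketch below shows a minimal model of this kind; the layer widths, dilations, 40-dim feature input, and 3000-senone output are assumptions for the example, not values taken from this disclosure.

```python
import torch
import torch.nn as nn

# Minimal TDNN sketch: each layer is a dilated 1-D convolution over frame
# features. All dimensions here are illustrative assumptions.
class TDNN(nn.Module):
    def __init__(self, feat_dim=40, hidden=512, num_senones=3000):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=3, dilation=1),  # context +-1
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2),    # context +-2
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=3),    # context +-3
            nn.ReLU(),
        )
        self.output = nn.Conv1d(hidden, num_senones, kernel_size=1)

    def forward(self, feats):                    # feats: (batch, feat_dim, frames)
        return self.output(self.layers(feats))   # (batch, num_senones, frames - 12)

model = TDNN()
x = torch.randn(8, 40, 200)   # 8 utterances, 40-dim features, 200 frames
print(model(x).shape)         # torch.Size([8, 3000, 188])
```

The growing dilation is what gives the TDNN its progressively wider context window without a recurrent structure.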
Subsequently, in step S230, the voice data of the specified domain and the text data corresponding to the voice data of the specified domain are acquired.
Subsequently, in step S240, the general acoustic model is trained on the specified-domain voice data and the corresponding text data to obtain a domain adaptive acoustic model.
Because telephone customer service voice data has a low sampling rate, heavy channel interference, and insufficient training data, existing training results yield a low recognition rate. With this domain adaptive training method, existing telephone customer service voice data from many industries can be fully utilized: starting from the trained general acoustic model, further training is performed for a given industry to obtain a domain adaptive acoustic model, so that the recognition rate on that industry's telephone customer service voice data reaches a higher level.
Specifically, step S240 includes:
training the general acoustic model on the specified-domain voice data and its corresponding text data, together with voice data of a near domain of the specified domain and its corresponding text data, to obtain the domain adaptive acoustic model, wherein the ratio of the amount of near-domain voice data to the amount of specified-domain voice data satisfies a preset condition.
According to embodiments of the present disclosure, adding a small amount of near-domain voice and text data during the training of the domain adaptive acoustic model improves its generalization ability and yields a better-performing model.
For example, the ratio of the amount of near-domain voice data to the amount of specified-domain voice data may be between 1 and 2. If the ratio is too low, the generalization ability of the domain adaptive acoustic model decreases; if it is too high, the model's ability to recognize specified-domain voice data decreases.
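As a minimal sketch of assembling such a training set, assuming utterances are held as (audio path, transcript) pairs; the helper name and the 1.5 default ratio are illustrative assumptions, not prescribed by the disclosure:

```python
import random

def mix_training_sets(domain_utts, near_domain_utts, ratio=1.5, seed=0):
    """Combine specified-domain utterances with near-domain utterances
    sampled at `ratio` times the specified-domain amount (the disclosure
    suggests a ratio between 1 and 2)."""
    rng = random.Random(seed)
    n_near = min(len(near_domain_utts), int(ratio * len(domain_utts)))
    mixed = domain_utts + rng.sample(near_domain_utts, n_near)
    rng.shuffle(mixed)
    return mixed

# e.g. 1000 specified-domain utterances + 1500 sampled near-domain utterances
train_set = mix_training_sets([("a.wav", "hi")] * 1000,
                              [("b.wav", "ok")] * 5000)
print(len(train_set))  # 2500
```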
Further, in step S240, training the general acoustic model to obtain the domain adaptive acoustic model includes:
training the general acoustic model using a plurality of training methods; and
comparing the recognition rates on specified-domain speech of the acoustic models trained by the respective methods, and taking the model with the highest recognition rate as the domain adaptive acoustic model.
According to embodiments of the present disclosure, training the general acoustic model with several methods and selecting the result that validates best as the domain adaptive acoustic model improves the speech recognition performance of the final model.
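This "recognition rate" comparison is equivalent to picking the candidate with the lowest word error rate (WER) on a specified-domain evaluation set. A self-contained sketch follows, in which `decode` is a hypothetical stand-in for running a candidate model's recognizer on one utterance (not an API from this disclosure):

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance (single-row dynamic program)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution
    return d[-1]

def wer(refs, hyps):
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    return errors / max(sum(len(r.split()) for r in refs), 1)

def select_best(models, eval_wavs, eval_refs, decode):
    """Return the candidate with the lowest WER, i.e. the highest
    recognition rate, on the specified-domain evaluation set."""
    return min(models,
               key=lambda m: wer(eval_refs, [decode(m, w) for w in eval_wavs]))
```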
Further, when the training method of the general acoustic model is the same as that of the domain adaptive acoustic model, the number of training rounds of the domain adaptive acoustic model is lower than that of the general acoustic model.
For example, training the general acoustic model with the sequence cross-entropy objective requires 4-6 rounds over the TDNN model, whereas training the domain adaptive acoustic model with the same method typically achieves the expected effect in a single round, saving computing resources. Alternatively, the state-level Minimum Bayes Risk (sMBR) criterion can be used to train the acoustic model during this process, which consumes more computing resources.
Further, step S240 includes:
performing phoneme alignment on the specified-domain voice data and its corresponding text data, and on the near-domain voice data and its corresponding text data, according to the general acoustic model, and generating a word graph; and
training the general acoustic model according to the phoneme alignment result and the word graph to obtain the domain adaptive acoustic model.
According to embodiments of the present disclosure, if the training data of the general acoustic model does not include the training data of the domain adaptive acoustic model, the general acoustic model can be used directly to process the specified-domain voice data and its corresponding text data, which reduces manual labeling of speech and text and improves data-processing efficiency.
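The core of such phoneme alignment is a monotonic frame-to-phone assignment. The sketch below illustrates the idea with a plain Viterbi search over frame log-posteriors produced by the general model; real toolkits align against HMM state sequences and produce word graphs (lattices), so this is a simplified illustration under stated assumptions, not this disclosure's exact procedure:

```python
import numpy as np

def force_align(log_post, phone_seq):
    """Viterbi forced alignment: assign every frame to one phone of
    phone_seq, in order, maximizing the summed log-posteriors.
    log_post: (T, P) frame-by-phone log-posteriors from the general model.
    phone_seq: phone indices the transcript expands to (requires T >= len).
    Returns, per frame, the position within phone_seq."""
    T, S = log_post.shape[0], len(phone_seq)
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = log_post[0, phone_seq[0]]
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1, s]                            # same phone
            move = score[t - 1, s - 1] if s > 0 else -np.inf  # advance a phone
            best = s if stay >= move else s - 1
            back[t, s] = best
            score[t, s] = score[t - 1, best] + log_post[t, phone_seq[s]]
    path = [S - 1]                        # backtrace from the final phone
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]

rng = np.random.default_rng(0)
log_post = np.log(rng.dirichlet(np.ones(5), size=20))  # 20 frames, 5 phones
print(force_align(log_post, [2, 0, 3]))                # monotonic 0/1/2 labels
```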
Alternatively, as shown in fig. 3, acoustic model training may proceed as follows:
S310, train a triphone model on the voice data, and cluster the triphones with a decision tree;
S320, perform phoneme alignment on the voice data according to the triphone model and the decision tree;
S330, generate a word graph from the phoneme alignment result and the text data corresponding to the voice data; and
S340, train the acoustic model according to the phoneme alignment result and the word graph.
These steps can be used to train the general acoustic model as well as the domain adaptive acoustic model.
According to embodiments of the present disclosure, different speech recognition systems may be based on different acoustic features, such as acoustic models built on MFCC (Mel-Frequency Cepstral Coefficients) features or on PLP (Perceptual Linear Prediction) features, and may employ different acoustic models, such as the Hidden Markov Model-Gaussian Mixture Model (HMM-GMM) or a neural network acoustic model based on a Dynamic Bayesian Network (DBN).
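For instance, an MFCC front end for 8 kHz telephone audio might look like the following sketch (librosa); the 25 ms window and 10 ms hop are common front-end choices, not values prescribed by the disclosure:

```python
import numpy as np
import librosa

# MFCC extraction for 8 kHz telephone audio. n_fft=200 is a 25 ms window
# and hop_length=80 a 10 ms hop at this sampling rate; real use would load
# audio with librosa.load(path, sr=8000) instead of random samples.
sr = 8000
y = np.random.randn(2 * sr).astype(np.float32)   # 2 seconds of dummy audio
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=200, hop_length=80)
print(mfcc.shape)                                # (13, number_of_frames)
```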
Fig. 4 is a further flowchart of a training method of a domain adaptive acoustic model according to an embodiment of the present disclosure. In conjunction with the flow chart shown in fig. 4, a specific embodiment of the present disclosure is given below, using thousands of hours of 8 kHz telephone customer service voice data from multiple scenarios, together with the corresponding labeled text, as training data; the data sources include telephone customer service scenarios common in the market, such as automobile sales, financial sales, and educational consultation.
In this specific embodiment, training the general model includes the following steps.
Step a, train a triphone HMM-GMM model:
The system first trains a triphone HMM-GMM model using half of the total data, and builds a decision tree that ties similar triphones by decision-tree clustering, thereby reducing the phoneme space. Using half rather than all of the data is primarily a computational-resource consideration; if sufficient computing resources are available, the HMM-GMM model may be trained on the full data.
Step b, train the TDNN model:
Sub-step b1: using the HMM-GMM and decision tree generated in step a, perform phoneme alignment on the full data and generate the corresponding word graphs;
Sub-step b2: train the TDNN model with the sequence cross-entropy objective using the alignment data and corresponding word graphs generated in sub-step b1. Depending on computing resources, all data may be trained for 4-6 rounds.
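A toy sketch of sub-step b2 in PyTorch: frame-level cross-entropy against the aligned senone labels stands in for the sequence cross-entropy objective here, and the network, 40-dim features, 3000 senones, and dummy batch are assumptions for the example:

```python
import torch
import torch.nn as nn

# Train on aligned senone labels for 4-6 rounds; all shapes are illustrative.
model = nn.Sequential(nn.Conv1d(40, 512, kernel_size=3, dilation=2),
                      nn.ReLU(),
                      nn.Conv1d(512, 3000, kernel_size=1))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

feats = torch.randn(8, 40, 200)             # 8 utterances, 200 frames each
labels = torch.randint(0, 3000, (8, 196))   # aligned senone label per output frame

for epoch in range(5):                      # 4-6 rounds in the embodiment
    optimizer.zero_grad()
    logits = model(feats)                   # (8, 3000, 196)
    loss = criterion(logits, labels)        # CE over (batch, classes, frames)
    loss.backward()
    optimizer.step()
    print(f"round {epoch + 1}: loss {loss.item():.3f}")
```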
In this embodiment, training the domain adaptive acoustic model includes the following steps.
Step a, data selection:
the specified-domain data, plus other near-domain data amounting to 1-2 times the specified-domain data.
Step b, data preparation:
using the trained TDNN general model, align the data selected in step a and generate the corresponding word graphs.
Step c, train the domain model:
using the trained TDNN general model parameters as initialization, further learn the parameters on the alignment data and word graphs from step b.
In this embodiment, the following two methods of training the domain adaptive acoustic model are used:
Method one: starting from the TDNN general model, continue training on the data from step b with the sequence cross-entropy objective. This approach typically requires only one round over all the data.
Method two: starting from the TDNN general model, train on the data from step b with the sMBR criterion. This method requires 4-6 rounds of training.
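Method one amounts to copying the general model's parameters and making one further cross-entropy pass over the mixed domain data. A self-contained PyTorch sketch under the same dummy shapes as above (in practice `general_model` would already be trained, not freshly constructed):

```python
import copy
import torch
import torch.nn as nn

# Sketch of method one: general parameters as initialization, then a
# single further round on mixed specified-/near-domain data.
general_model = nn.Sequential(nn.Conv1d(40, 512, kernel_size=3, dilation=2),
                              nn.ReLU(),
                              nn.Conv1d(512, 3000, kernel_size=1))

domain_model = copy.deepcopy(general_model)        # general params as init
optimizer = torch.optim.SGD(domain_model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

feats = torch.randn(8, 40, 200)                    # mixed-domain dummy batch
labels = torch.randint(0, 3000, (8, 196))

for _ in range(1):                                 # one round typically suffices
    optimizer.zero_grad()
    loss = criterion(domain_model(feats), labels)
    loss.backward()
    optimizer.step()
```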
The final model performance is affected by the training-data proportion in step a and by the choice of training method in step c, so scenario-relevant evaluation data sets should be constructed for each scenario and a suitable model selected according to the recognition-rate results.
In this technical solution, on the basis of a general acoustic model trained on large-scale data, a domain adaptive acoustic model is trained with a small amount of specified-domain voice data and a small amount of near-domain voice data, providing a targeted domain adaptive acoustic model for each domain and improving the acoustic model's performance in that domain.
Referring to fig. 5, an embodiment of the present disclosure provides a training apparatus for a domain adaptive acoustic model, including:
a first data acquisition unit 510, configured to acquire voice data of near domains of a plurality of specified domains and the corresponding text data;
a general acoustic model training unit 520, configured to train an acoustic model on the near-domain voice data and the corresponding text data to obtain a general acoustic model;
a second data acquisition unit 530, configured to acquire voice data of a specified domain and the corresponding text data; and
a specified-domain acoustic model training unit 540, configured to train the general acoustic model on the specified-domain voice data and the corresponding text data to obtain a domain adaptive acoustic model.
Optionally, the specified-domain acoustic model training unit 540 is specifically configured to:
train the general acoustic model on the specified-domain voice data and its corresponding text data, together with voice data of a near domain of the specified domain and its corresponding text data, to obtain the domain adaptive acoustic model, wherein the ratio of the amount of near-domain voice data to the amount of specified-domain voice data satisfies a preset condition.
Optionally, when training the general acoustic model to obtain the domain adaptive acoustic model, the specified-domain acoustic model training unit 540 is specifically configured to:
train the general acoustic model using a plurality of training methods; and
compare the recognition rates on specified-domain speech of the acoustic models trained by the respective methods, and take the model with the highest recognition rate as the domain adaptive acoustic model.
Optionally, when the training method of the general acoustic model is the same as that of the domain adaptive acoustic model, the number of training rounds of the domain adaptive acoustic model is lower than that of the general acoustic model.
Optionally, the specified-domain acoustic model training unit 540 is specifically configured to:
perform phoneme alignment on the specified-domain voice data and its corresponding text data, and on the near-domain voice data and its corresponding text data, according to the general acoustic model, and generate a word graph; and
train the general acoustic model according to the phoneme alignment result and the word graph to obtain the domain adaptive acoustic model.
Optionally, the preset condition includes:
the ratio of the amount of near-domain voice data to the amount of specified-domain voice data being between 1 and 2.
Optionally, when used to train an acoustic model, the general acoustic model training unit 520 or the specified-domain acoustic model training unit 540 is specifically configured to:
train a triphone model on the voice data, and cluster the triphones with a decision tree;
perform phoneme alignment on the voice data according to the triphone model and the decision tree;
generate a word graph from the phoneme alignment result and the text data corresponding to the voice data; and
train the acoustic model according to the phoneme alignment result and the word graph.
Optionally, the voice data includes:
telephone customer service voice data;
the voice data of the near domains of the plurality of specified domains includes:
telephone customer service voice data of a plurality of industries; and
the voice data of the specified domain includes:
telephone customer service voice data of a specified industry.
For the specific definition of the training apparatus for the domain adaptive acoustic model, refer to the definition of the training method above; details are not repeated here.
It should be understood that the various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present disclosure, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the disclosure.
In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. The memory is configured to store program code; the processor is configured to perform the various methods of the present disclosure according to the instructions in the program code stored in the memory.
By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media store information such as computer-readable instructions, data structures, program modules, or other data. Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media. Combinations of any of the above are also included within the scope of computer-readable media.
It should be appreciated that in the foregoing description of exemplary embodiments of the disclosure, various features of the disclosure are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various disclosed aspects.
Those skilled in the art will appreciate that the modules or units or components of the devices in the examples disclosed herein may be arranged in a device as described in this embodiment or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into multiple sub-modules.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification, and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification may be replaced by an alternative feature serving the same, equivalent, or similar purpose, unless expressly stated otherwise.
Moreover, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the disclosure and form different embodiments.
Furthermore, some of the described embodiments are described herein as a method or combination of method elements that can be performed by a processor of a computer system or by other means of performing the described functions. A processor having the necessary instructions for carrying out the method or method elements thus forms a means for carrying out the method or method elements. Further, the elements of the apparatus embodiments described herein are examples of the following apparatus: the apparatus is used to implement the functions performed by the elements for the purposes of this disclosure.
As used herein, unless otherwise specified, the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object merely indicates that different instances of like objects are being referred to, and is not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.
While the disclosure has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the disclosure as described herein. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the disclosed subject matter.

Claims (8)

1. A training method for a domain adaptive acoustic model, comprising:
acquiring voice data of near domains of a plurality of specified domains and text data corresponding to that voice data;
training an acoustic model on the voice data of the near domains of the plurality of specified domains and the corresponding text data to obtain a general acoustic model;
acquiring voice data of a specified domain and text data corresponding to that voice data; and
training the general acoustic model on the voice data of the specified domain and its corresponding text data, together with voice data of a near domain of the specified domain and its corresponding text data, to obtain a domain adaptive acoustic model, wherein the ratio of the amount of near-domain voice data to the amount of specified-domain voice data satisfies a preset condition;
wherein training the general acoustic model on the specified-domain voice data and its corresponding text data together with the near-domain voice data and its corresponding text data to obtain the domain adaptive acoustic model comprises:
performing phoneme alignment on the specified-domain voice data and its corresponding text data, and on the near-domain voice data and its corresponding text data, according to the general acoustic model, and generating a word graph; and
training the general acoustic model according to the phoneme alignment result and the word graph to obtain the domain adaptive acoustic model.
2. The method of claim 1, wherein training the general acoustic model to obtain the domain adaptive acoustic model comprises:
training the general acoustic model using a plurality of training methods; and
comparing the recognition rates on specified-domain speech of the acoustic models trained by the respective methods, and taking the acoustic model with the highest recognition rate as the domain adaptive acoustic model.
3. The method of claim 2, wherein, when the training method of the general acoustic model and the training method of the domain adaptive acoustic model are the same, the number of training rounds of the domain adaptive acoustic model is lower than the number of training rounds of the general acoustic model.
4. The method of claim 1, wherein the preset condition comprises:
the ratio of the amount of near-domain voice data to the amount of specified-domain voice data being between 1 and 2.
5. The method of claim 1, wherein the voice data comprises:
telephone customer service voice data;
the voice data of the near domains of the plurality of specified domains comprises:
telephone customer service voice data of a plurality of industries; and
the voice data of the specified domain comprises:
telephone customer service voice data of a specified industry.
6. A training apparatus for a domain adaptive acoustic model, comprising:
a first data acquisition unit, configured to acquire voice data of near domains of a plurality of specified domains and text data corresponding to that voice data;
a general acoustic model training unit, configured to train an acoustic model on the near-domain voice data and the corresponding text data to obtain a general acoustic model;
a second data acquisition unit, configured to acquire voice data of a specified domain and the corresponding text data; and
a specified-domain acoustic model training unit, configured to train the general acoustic model on the specified-domain voice data and its corresponding text data, together with voice data of a near domain of the specified domain and its corresponding text data, to obtain a domain adaptive acoustic model, wherein the ratio of the amount of near-domain voice data to the amount of specified-domain voice data satisfies a preset condition, and wherein said training comprises: performing phoneme alignment on the specified-domain voice data and its corresponding text data, and on the near-domain voice data and its corresponding text data, according to the general acoustic model, and generating a word graph; and training the general acoustic model according to the phoneme alignment result and the word graph to obtain the domain adaptive acoustic model.
7. A readable storage medium having executable instructions stored thereon which, when executed, cause a computer to perform the operations included in the method of any one of claims 1-5.
8. A computing device, comprising:
a processor; and
a memory storing executable instructions that, when executed, cause the processor to perform the operations included in the method of any one of claims 1-5.
CN201910670390.6A 2019-07-24 2019-07-24 Training method of domain adaptive acoustic model Active CN110379415B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910670390.6A CN110379415B (en) 2019-07-24 2019-07-24 Training method of domain adaptive acoustic model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910670390.6A CN110379415B (en) 2019-07-24 2019-07-24 Training method of domain adaptive acoustic model

Publications (2)

Publication Number Publication Date
CN110379415A CN110379415A (en) 2019-10-25
CN110379415B true CN110379415B (en) 2022-02-18

Family

ID=68255440

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910670390.6A Active CN110379415B (en) 2019-07-24 2019-07-24 Training method of domain adaptive acoustic model

Country Status (1)

Country Link
CN (1) CN110379415B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111243574B (en) * 2020-01-13 2023-01-03 苏州奇梦者网络科技有限公司 Voice model adaptive training method, system, device and storage medium
CN111613209B (en) * 2020-04-14 2023-05-26 北京三快在线科技有限公司 Acoustic model training method and device, electronic equipment and storage medium
CN111508479B (en) * 2020-04-16 2022-11-22 重庆农村商业银行股份有限公司 Voice recognition method, device, equipment and storage medium
CN111477211A (en) * 2020-04-17 2020-07-31 珠海声原智能科技有限公司 Cross-scene fast-adaptation voice recognition method and device
CN112466294B (en) * 2020-11-24 2021-12-14 北京百度网讯科技有限公司 Acoustic model generation method and device and electronic equipment
CN112596868A (en) * 2020-11-27 2021-04-02 出门问问(武汉)信息科技有限公司 Model training method and device
CN113327587A (en) * 2021-06-02 2021-08-31 云知声(上海)智能科技有限公司 Method and device for voice recognition in specific scene, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101923854A (en) * 2010-08-31 2010-12-22 中国科学院计算技术研究所 Interactive speech recognition system and method
CN102280106A (en) * 2010-06-12 2011-12-14 三星电子株式会社 VWS method and apparatus used for mobile communication terminal
JP2016102820A (en) * 2014-11-27 2016-06-02 インターナショナル・ビジネス・マシーンズ・コーポレーションInternational Business Machines Corporation Method for improving acoustic model, and computer for improving acoustic model and computer program therefor
CN107154260A (en) * 2017-04-11 2017-09-12 北京智能管家科技有限公司 A kind of domain-adaptive audio recognition method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10923110B2 (en) * 2017-08-25 2021-02-16 International Business Machines Corporation Priors adaptation for conservative training of acoustic model

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102280106A (en) * 2010-06-12 2011-12-14 三星电子株式会社 VWS method and apparatus used for mobile communication terminal
CN101923854A (en) * 2010-08-31 2010-12-22 中国科学院计算技术研究所 Interactive speech recognition system and method
JP2016102820A (en) * 2014-11-27 2016-06-02 インターナショナル・ビジネス・マシーンズ・コーポレーションInternational Business Machines Corporation Method for improving acoustic model, and computer for improving acoustic model and computer program therefor
CN107154260A (en) * 2017-04-11 2017-09-12 北京智能管家科技有限公司 A kind of domain-adaptive audio recognition method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Acoustic model adaptation method based on linear interpolation of prior probabilities; Wang Li; Proceedings of the 14th National Conference on Man-Machine Speech Communication (NCMMSC'2017); 2017-10-30; pp. 392-396 *

Also Published As

Publication number Publication date
CN110379415A (en) 2019-10-25

Similar Documents

Publication Publication Date Title
CN110379415B (en) Training method of domain adaptive acoustic model
US10971142B2 (en) Systems and methods for robust speech recognition using generative adversarial networks
US10657955B2 (en) Systems and methods for principled bias reduction in production speech models
CN108475505B (en) Generating a target sequence from an input sequence using partial conditions
CN107610709B (en) Method and system for training voiceprint recognition model
KR101255402B1 (en) Redictation 0f misrecognized words using a list of alternatives
CN110379407B (en) Adaptive speech synthesis method, device, readable storage medium and computing equipment
CN103280216B (en) Improve the speech recognition device the relying on context robustness to environmental change
CN107481717B (en) Acoustic model training method and system
US20140156575A1 (en) Method and Apparatus of Processing Data Using Deep Belief Networks Employing Low-Rank Matrix Factorization
CN110232907B (en) Voice synthesis method and device, readable storage medium and computing equipment
CN111581375A (en) Dialog intention type identification method, multi-turn dialog method, device and computing equipment
CN113299282B (en) Voice recognition method, device, equipment and storage medium
CN110675863A (en) Voice corpus generation method and device and voice recognition method and device
CN111444719A (en) Entity identification method and device and computing equipment
JP7329393B2 (en) Audio signal processing device, audio signal processing method, audio signal processing program, learning device, learning method and learning program
CN112989041A (en) Text data processing method and device based on BERT
CN111243604B (en) Training method for speaker recognition neural network model supporting multiple awakening words, speaker recognition method and system
US11295732B2 (en) Dynamic interpolation for hybrid language models
CN113674733A (en) Method and apparatus for speaking time estimation
CN111681661B (en) Speech recognition method, apparatus, electronic device and computer readable medium
US20120173240A1 (en) Subspace Speech Adaptation
JP7112348B2 (en) SIGNAL PROCESSING DEVICE, SIGNAL PROCESSING METHOD AND SIGNAL PROCESSING PROGRAM
US20220254351A1 (en) Method and system for correcting speaker diarization using speaker change detection based on text
Khemakhem et al. Towards A Distributed Arabic OCR Based on the DTW Algorithm: Performance Analysis.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant