CN111833844A - Training method and system of mixed model for speech recognition and language classification - Google Patents


Info

Publication number
CN111833844A
CN111833844A (application CN202010739233.9A)
Authority
CN
China
Prior art keywords
training
language
layer
speech recognition
language classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010739233.9A
Other languages
Chinese (zh)
Inventor
陆一帆
钱彦旻
朱森
陈梦姣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AI Speech Ltd
Original Assignee
AI Speech Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by AI Speech Ltd filed Critical AI Speech Ltd
Priority to CN202010739233.9A
Publication of CN111833844A
Legal status: Withdrawn


Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 — Speech recognition
    • G10L 15/005 — Language recognition
    • G10L 15/02 — Feature extraction for speech recognition; selection of recognition unit
    • G10L 15/06 — Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 — Training
    • G10L 15/08 — Speech classification or search
    • G10L 15/16 — Speech classification or search using artificial neural networks
    • G10L 15/26 — Speech-to-text systems
    • G10L 25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00–G10L 21/00
    • G10L 25/03 — characterised by the type of extracted parameters
    • G10L 25/18 — the extracted parameters being spectral information of each sub-band
    • G10L 25/24 — the extracted parameters being the cepstrum
    • G10L 2015/025 — Phonemes, fenemes or fenones being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the invention provide a training method for a hybrid model for speech recognition and language classification. The method comprises the following steps: performing feature extraction and data alignment on mixed training audio data carrying text labels and language labels, to determine input data for training; inputting the input data into the N intermediate layers, performing speech recognition training based on the speech recognition results output by the speech recognition layer and the text labels, and training the neural network parameters of the N intermediate layers and the speech recognition layer; and, after the speech recognition training is completed, training only the neural network parameters of the language classification layer based on the language classification results output by the language classification layer and the language labels, thereby completing the language classification training. The embodiments of the invention also provide a training system for the hybrid model. By combining speech recognition and language classification, the embodiments simplify the system structure, save training cost, and improve the overall performance of the hybrid model.

Description

Training method and system of mixed model for speech recognition and language classification
Technical Field
The invention relates to the field of speech recognition, and in particular to a training method and training system for a hybrid model for speech recognition and language classification.
Background
For multilingual speech recognition, separate dialect and Mandarin speech recognition (ASR) modules and a language identification module are typically trained with neural networks on existing dialect and Mandarin audio. Audio fed into such a system must first pass through the language identification module to determine which language it belongs to; the corresponding ASR module is then called to convert the sound into text, which in turn interacts with other modules (such as semantic understanding and speech synthesis).
In the course of implementing the invention, the inventors found at least the following problems in the related art:
(1) High training and deployment costs
Multiple dialect and Mandarin speech recognition (ASR) modules and language identification modules must be prepared separately; training several models is time-consuming, and deploying several ASR resources online occupies considerable resources, so training and deployment costs are high.
(2) Interdependence between modules and mutual influence on performance
The accuracy of language identification affects the performance of the subsequent speech recognition, which places high demands on the language module: if the language is identified incorrectly, speech recognition performance is very likely to be poor, which in turn degrades the accuracy of the modules downstream of recognition.
(3) Poor integrability
An ASR module or language module on its own is hard to turn into a genuinely usable product; in most cases it must be combined with other models (such as semantic understanding, speech synthesis, and dialogue systems). These models often need the recognized text and the language information at the same time, but the serial structure above cannot output both simultaneously and therefore cannot meet the requirement, so integrability is poor.
Disclosure of Invention
Embodiments of the invention at least solve the problems in the prior art of high training and deployment cost, interdependent modules whose performance affects one another, and poor integrability.
In a first aspect, an embodiment of the present invention provides a training method for a hybrid model for speech recognition and language classification, where the hybrid model is a deep neural network having N intermediate layers, and the Nth intermediate layer branches into a speech recognition layer and a language classification layer, the speech recognition layer outputting a speech recognition result and the language classification layer outputting a language classification result. The training method includes:
performing feature extraction and data alignment on mixed training audio data with text labels and language labels to determine input data for training;
inputting the input data for training into the N intermediate layers, performing speech recognition training based on the speech recognition result output by the speech recognition layer and the text label, and training the neural network parameters of the N intermediate layers and the speech recognition layer;
and, after the speech recognition training is finished, training only the neural network parameters of the language classification layer based on the language classification result output by the language classification layer and the language label, thereby completing the language classification training.
In a second aspect, an embodiment of the present invention provides a training system for a hybrid model for speech recognition and language classification, where the hybrid model is a deep neural network having N intermediate layers, and the Nth intermediate layer branches into a speech recognition layer and a language classification layer, the speech recognition layer outputting a speech recognition result and the language classification layer outputting a language classification result. The training system includes:
an input data determination program module, configured to perform feature extraction and data alignment on mixed training audio data with text labels and language labels to determine input data for training;
an output program module, configured to input the input data for training into the N intermediate layers, perform speech recognition training based on the speech recognition result output by the speech recognition layer and the text label, and train the neural network parameters of the N intermediate layers and the speech recognition layer;
and a training program module, configured to train, after the speech recognition training is finished, only the neural network parameters of the language classification layer based on the language classification result output by the language classification layer and the language label, thereby completing the language classification training.
In a third aspect, an embodiment of the present invention provides an electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the training method for a hybrid model for speech recognition and language classification according to any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention provides a storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the steps of the training method for a hybrid model for speech recognition and language classification according to any embodiment of the present invention.
The embodiments of the invention have the following beneficial effects: speech recognition and language classification are combined, and the parameters of the speech recognition part are not affected during language classification training, so the language classification information can additionally be output while the speech recognition performance remains unchanged. This achieves the effect of merging the two tasks, simplifies the system structure, saves training cost, and at the same time improves the overall performance of the hybrid model.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flow chart of a method for training a hybrid model for speech recognition and language classification according to an embodiment of the present invention;
FIG. 2 is a flow chart of a training phase of a method for training a hybrid model for speech recognition and language classification according to an embodiment of the present invention;
FIG. 3 is a network structure diagram of a training method for a hybrid model of speech recognition and language classification according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a training system for a hybrid model of speech recognition and language classification according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
FIG. 1 is a flowchart of a training method for a hybrid model for speech recognition and language classification according to an embodiment of the present invention, where the hybrid model is a deep neural network having N intermediate layers, and the Nth intermediate layer branches into a speech recognition layer and a language classification layer, the speech recognition layer outputting a speech recognition result and the language classification layer outputting a language classification result. The method includes the following steps:
S11: performing feature extraction and data alignment on mixed training audio data with text labels and language labels to determine input data for training;
S12: inputting the input data for training into the N intermediate layers, performing speech recognition training based on the speech recognition result output by the speech recognition layer and the text label, and training the neural network parameters of the N intermediate layers and the speech recognition layer;
S13: after the speech recognition training is finished, training only the neural network parameters of the language classification layer based on the language classification result output by the language classification layer and the language label, thereby completing the language classification training.
In this embodiment, the speech recognition module and the language classification module are combined, and the resulting hybrid module can output the recognized text and the language information at the same time. The focus here is on training the language part; how the speech recognition model is trained is not restricted. As shown in FIG. 2, training comprises four stages: data preparation, feature extraction, data alignment, and model training.
For step S11, training the hybrid module requires data preparation. Audio with text labels and language labels must be prepared; it may be labeled manually or by other means. What must be determined is the correct transcript of each utterance and whether it is dialect or Mandarin. The higher the labeling accuracy, the better, as this benefits the later model training.
After a large amount of labeled audio data has been collected and the wav files and their corresponding text labels organized, features are extracted from the audio; FBANK features are adopted. More specifically, performing feature extraction on the mixed training audio data with text labels and language labels includes:
framing the mixed training audio data with a window of 25 ms frame length and 10 ms frame shift, and determining the m-dimensional FBANK features and the Mel-frequency cepstral coefficient features of each frame in the mixed training audio data. (The parameters given here are well-established values in the speech recognition field and may be generalized: the features may be FBANK (filter bank), MFCC (Mel-frequency cepstral coefficients), or PLP (perceptual linear prediction cepstral coefficients); the frame length may be 20-40 ms and the frame shift 10-20 ms.) The data alignment is then determined from the Mel-frequency cepstral coefficient features.
As an embodiment, performing feature extraction and data alignment on the mixed training audio data with text labels and language labels to determine input data for training includes:
performing feature extraction on the mixed training audio with text labels and language labels, and determining the m-dimensional FBANK features and the Mel-frequency cepstral coefficient features of each frame in the mixed training audio, where the mixed training audio contains audio in multiple languages, the languages including Mandarin and dialects;
and performing supervised training with the mixed training audio and the n-dimensional Mel-frequency cepstral coefficient features of each frame, to determine the data alignment of each frame.
In this embodiment, supervised training requires knowing the phoneme and the language information on each frame of each audio clip. This step may train a Gaussian mixture model (GMM; a neural network model could also be trained to generate the alignment, and the method is not limited). Mel-frequency cepstral coefficient (MFCC) features are extracted from the audio: the audio is framed with a window of 25 ms frame length and 10 ms frame shift, the n-dimensional MFCC features of each frame are extracted, and a pronunciation dictionary is prepared (the pronunciation dictionary is a pre-prepared union of the phoneme sets of the dialect speech and the Mandarin speech). The corresponding GMM model is trained from the MFCC features and the pronunciation dictionary and is then used to generate the corresponding frame-level data alignment. For speech recognition, the alignment of each frame is a phoneme; for language classification, the alignment of each frame is a language class or silence.
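A minimal sketch of how such a frame-level alignment yields the two per-frame target streams follows; the phoneme names and the phoneme-to-language table are hypothetical stand-ins for the union pronunciation dictionary:

```python
# Sketch (hypothetical phoneme inventory): derive per-frame ASR and language
# targets from a frame-level phoneme alignment produced by a forced aligner.
SIL = "sil"

# Hypothetical mapping from each phoneme in the union dictionary to its
# language class; silence maps to its own class, as described above.
phone2lang = {
    "a_man": "mandarin", "b_man": "mandarin",   # Mandarin phonemes
    "a_dia": "dialect",  "b_dia": "dialect",    # dialect phonemes
    SIL: "silence",
}

def frame_targets(phone_alignment):
    """phone_alignment: one phoneme label per 10 ms frame."""
    asr_targets = list(phone_alignment)                       # phoneme per frame
    lang_targets = [phone2lang[p] for p in phone_alignment]   # language/silence per frame
    return asr_targets, lang_targets

asr_t, lang_t = frame_targets(["sil", "a_man", "b_man", "sil"])
# lang_t -> ['silence', 'mandarin', 'mandarin', 'silence']
```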
For step S12, the m-dimensional FBANK features of each frame and the corresponding per-frame alignments prepared in step S11 are input to the N intermediate layers, whose structure is shown in FIG. 3; the intermediate layers of the neural network may adopt multiple layers of DNN (deep neural network), LSTM (long short-term memory network), FSMN (feedforward sequential memory network), and the like. The model has two outputs: one is the ASR output and the other is the language output, so the speech recognition result output by the speech recognition layer can be obtained. Because the data preparation provides the text label corresponding to each utterance, the text labels can be used as training targets from which the speech recognition output is learned, thereby improving the recognition accuracy. The speech recognition method here is only one training example, and the training scheme is not limited. In this way, the speech recognition parameters in the N intermediate layers are trained continuously.
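A minimal PyTorch sketch of the branched topology of FIG. 3 follows. The plain-DNN trunk, the layer sizes, and the output dimensions are assumptions of the sketch (the embodiment equally allows LSTM or FSMN intermediate layers):

```python
# Sketch (illustrative sizes): N shared intermediate layers branching into a
# speech recognition (ASR) head and a language classification head.
import torch
import torch.nn as nn

class HybridASRLangModel(nn.Module):
    def __init__(self, feat_dim=80, hidden=512, n_layers=5,
                 n_phones=200, n_langs=3):  # e.g. mandarin / dialect / silence
        super().__init__()
        layers, dim = [], feat_dim
        for _ in range(n_layers):            # the N shared intermediate layers
            layers += [nn.Linear(dim, hidden), nn.ReLU()]
            dim = hidden
        self.trunk = nn.Sequential(*layers)
        self.asr_head = nn.Linear(hidden, n_phones)   # speech recognition layer
        self.lang_head = nn.Linear(hidden, n_langs)   # language classification layer

    def forward(self, frames):               # frames: (batch, feat_dim)
        h = self.trunk(frames)                # output of the Nth intermediate layer
        return self.asr_head(h), self.lang_head(h)
```

A single forward pass thus yields the phoneme posteriors and the language posteriors at once, which is exactly the simultaneous text-plus-language output that the serial prior-art structure cannot provide.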
For step S13, after the speech recognition part has been trained, as one implementation, training based on the language classification result output by the language classification layer and the language label includes: performing classification optimization against the per-frame data alignment using maximum likelihood estimation under a cross-entropy training criterion, fitting the language classification result to the language label.
After the speech recognition performance meets the requirements, one more output, for the language, is added. When the language branch of the network is trained, the cross-entropy training criterion is adopted and MLE (maximum likelihood estimation) is used to optimize the classification of each frame, minimizing the per-frame classification error rate. The gradient propagates to and updates only the parameters of the branched language NN layer; the neural network layers of the speech recognition part receive no gradient updates, i.e., the parameters of that part of the network are unchanged. Since only the neural network parameters of the language part are trained, the language information can be output while the speech recognition performance remains unchanged.
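Continuing the sketch above (and reusing the hypothetical HybridASRLangModel), the two-stage schedule can be expressed by freezing everything except the language head in stage 2, so that gradients reach only the branched language layer:

```python
# Sketch: stage-2 training updates only the language classification layer.
import torch.nn.functional as F
import torch.optim as optim

model = HybridASRLangModel()

# Stage 1 (not shown): train trunk + asr_head on per-frame phoneme targets
# with a standard cross-entropy loss.

# Stage 2: freeze the trunk and the ASR head; only the language head learns.
for p in model.parameters():
    p.requires_grad = False
for p in model.lang_head.parameters():
    p.requires_grad = True

optimizer = optim.Adam(model.lang_head.parameters(), lr=1e-4)

def lang_step(frames, lang_targets):
    """One per-frame cross-entropy step on the language branch only.
    frames: (batch, feat_dim); lang_targets: (batch,) class indices."""
    _, lang_logits = model(frames)
    loss = F.cross_entropy(lang_logits, lang_targets)  # per-frame CE criterion
    optimizer.zero_grad()
    loss.backward()   # gradients flow only into lang_head parameters
    optimizer.step()
    return loss.item()
```

Because the shared trunk and the ASR head are frozen, the ASR output is identical before and after stage 2, matching the requirement that speech recognition performance stay unchanged.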
According to this embodiment, speech recognition and language classification are combined, and the parameters of the speech recognition part are not affected during language classification training, so the language classification information can additionally be output while the speech recognition performance remains unchanged. This merges the two tasks, simplifies the system structure, saves training cost, and improves the overall performance of the hybrid model.
FIG. 4 is a schematic structural diagram of a training system for a hybrid model for speech recognition and language classification according to an embodiment of the present invention; the system can execute the training method for a hybrid model for speech recognition and language classification of any of the above embodiments and is configured in a terminal.
The embodiment provides a training system of a hybrid model for speech recognition and language classification, which includes: an input data determination program module 11, an output program module 12 and a training program module 13.
The input data determination program module 11 is configured to perform feature extraction and data alignment on mixed training audio data with text labels and language labels, and determine input data for training; the output program module 12 is configured to input the input data for training into the N intermediate layers, perform speech recognition training based on the speech recognition result output by the speech recognition layer and the text label, and train the neural network parameters of the N intermediate layers and the speech recognition layer; the training program module 13 is configured to train, after the speech recognition training is completed, only the neural network parameters of the language classification layer based on the language classification result output by the language classification layer and the language label, thereby completing the language classification training.
Further, the input data determination program module is for:
performing feature extraction on the mixed training audio with text labels and language labels, and determining the m-dimensional FBANK features and the Mel-frequency cepstral coefficient features of each frame in the mixed training audio, where the mixed training audio contains audio in multiple languages, the languages including Mandarin and dialects;
and performing supervised training with the mixed training audio and the n-dimensional Mel-frequency cepstral coefficient features of each frame, to determine the data alignment of each frame.
Further, the training program module is to:
performing classification optimization against the per-frame data alignment using maximum likelihood estimation under a cross-entropy training criterion, fitting the language classification result to the language label.
Further, the input data determination program module is for:
framing the mixed training audio data with a window of 25 ms frame length and 10 ms frame shift, and determining the m-dimensional FBANK features and the Mel-frequency cepstral coefficient features of each frame in the mixed training audio data.
Further, the structure of the N intermediate layers comprises at least: a deep neural network, a long short-term memory network, and a feedforward sequential memory network.
An embodiment of the present invention also provides a non-volatile computer storage medium storing computer-executable instructions that can execute the training method for a hybrid model for speech recognition and language classification in any of the above method embodiments;
as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
performing feature extraction and data alignment on mixed training audio data with text labels and language labels to determine input data for training;
inputting the input data for training into the N intermediate layers, performing speech recognition training based on the speech recognition result output by the speech recognition layer and the text label, and training the neural network parameters of the N intermediate layers and the speech recognition layer;
and, after the speech recognition training is finished, training only the neural network parameters of the language classification layer based on the language classification result output by the language classification layer and the language label, thereby completing the language classification training.
As a non-volatile computer-readable storage medium, it may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the methods in the embodiments of the present invention. One or more program instructions are stored in the non-volatile computer-readable storage medium and, when executed by a processor, perform the training method for a hybrid model for speech recognition and language classification in any of the method embodiments described above.
The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the device, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory located remotely from the processor, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
An embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method for training a hybrid model for speech recognition and language classification according to any embodiment of the present invention.
The client of the embodiment of the present application exists in various forms, including but not limited to:
(1) Mobile communication devices: such devices are characterized by mobile communication capability and are primarily aimed at providing voice and data communication. Such terminals include smart phones, multimedia phones, feature phones, and low-end phones.
(2) Ultra-mobile personal computer devices: these belong to the category of personal computers, have computing and processing functions, and generally also support mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as tablet computers.
(3) Portable entertainment devices: such devices can display and play multimedia content. They include audio and video players, handheld game consoles, e-book readers, smart toys, and portable in-vehicle navigation devices.
(4) Other electronic devices with data processing capabilities.
In this document, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A training method for a hybrid model for speech recognition and language classification, wherein the hybrid model is a deep neural network having N intermediate layers, and the Nth intermediate layer branches into a speech recognition layer outputting a speech recognition result and a language classification layer outputting a language classification result, the training method comprising:
performing feature extraction and data alignment on mixed training audio data with text labels and language labels to determine input data for training;
inputting the input data for training into the N intermediate layers, performing speech recognition training based on the speech recognition result output by the speech recognition layer and the text label, and training the neural network parameters of the N intermediate layers and the speech recognition layer;
and, after the speech recognition training is finished, training only the neural network parameters of the language classification layer based on the language classification result output by the language classification layer and the language label, thereby completing the language classification training.
2. The method of claim 1, wherein the performing feature extraction and data alignment on the mixed training audio data with text labels and language labels comprises:
performing feature extraction on the mixed training audio with text labels and language labels, and determining the m-dimensional FBANK features and the Mel-frequency cepstral coefficient features of each frame in the mixed training audio, where the mixed training audio contains audio in multiple languages, the languages including Mandarin and dialects;
and performing supervised training with the mixed training audio and the n-dimensional Mel-frequency cepstral coefficient features of each frame, to determine the data alignment of each frame.
3. The method of claim 1, wherein training only the neural network parameters of the language classification layer based on the language classification result output by the language classification layer and the language label comprises:
performing classification optimization against the per-frame data alignment using maximum likelihood estimation under a cross-entropy training criterion, fitting the language classification result to the language label.
4. The method of claim 1, wherein the feature extraction of the mixed training audio data with text labels and language labels comprises:
framing the mixed training audio data with a window of 25 ms frame length and 10 ms frame shift, and determining the m-dimensional FBANK features and the Mel-frequency cepstral coefficient features of each frame in the mixed training audio data.
5. The method of claim 1, wherein the structure of the N intermediate layers comprises at least: a deep neural network, a long short-term memory network, and a feedforward sequential memory network.
6. A training system for a hybrid model for speech recognition and language classification, wherein the hybrid model is a deep neural network having N intermediate layers, and the Nth intermediate layer branches into a speech recognition layer and a language classification layer, the speech recognition layer outputting a speech recognition result and the language classification layer outputting a language classification result, the training system comprising:
an input data determination program module, configured to perform feature extraction and data alignment on mixed training audio data with text labels and language labels to determine input data for training;
an output program module, configured to input the input data for training into the N intermediate layers, perform speech recognition training based on the speech recognition result output by the speech recognition layer and the text label, and train the neural network parameters of the N intermediate layers and the speech recognition layer;
and a training program module, configured to train, after the speech recognition training is finished, only the neural network parameters of the language classification layer based on the language classification result output by the language classification layer and the language label, thereby completing the language classification training.
7. The system of claim 6, wherein the input data determination program module is to:
performing feature extraction on the mixed training audio with text labels and language labels, and determining the m-dimensional FBANK features and the Mel-frequency cepstral coefficient features of each frame in the mixed training audio, where the mixed training audio contains audio in multiple languages, the languages including Mandarin and dialects;
and performing supervised training with the mixed training audio and the n-dimensional Mel-frequency cepstral coefficient features of each frame, to determine the data alignment of each frame.
8. The system of claim 6, wherein the training program module is to:
performing classification optimization against the per-frame data alignment using maximum likelihood estimation under a cross-entropy training criterion, fitting the language classification result to the language label.
9. The system of claim 6, wherein the input data determination program module is to:
framing the mixed training audio data with a window of 25 ms frame length and 10 ms frame shift, and determining the m-dimensional FBANK features and the Mel-frequency cepstral coefficient features of each frame in the mixed training audio data.
10. The system of claim 6, wherein the structure of the N intermediate layers comprises at least: a deep neural network, a long short-term memory network, and a feedforward sequential memory network.
CN202010739233.9A 2020-07-28 2020-07-28 Training method and system of mixed model for speech recognition and language classification Withdrawn CN111833844A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010739233.9A CN111833844A (en) 2020-07-28 2020-07-28 Training method and system of mixed model for speech recognition and language classification


Publications (1)

Publication Number Publication Date
CN111833844A true CN111833844A (en) 2020-10-27

Family

ID=72919152

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010739233.9A Withdrawn CN111833844A (en) 2020-07-28 2020-07-28 Training method and system of mixed model for speech recognition and language classification

Country Status (1)

Country Link
CN (1) CN111833844A (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104143327A (en) * 2013-07-10 2014-11-12 腾讯科技(深圳)有限公司 Acoustic model training method and device
CN109326277A (en) * 2018-12-05 2019-02-12 四川长虹电器股份有限公司 Semi-supervised phoneme forces alignment model method for building up and system
CN110033760A (en) * 2019-04-15 2019-07-19 北京百度网讯科技有限公司 Modeling method, device and the equipment of speech recognition
CN110349564A (en) * 2019-07-22 2019-10-18 苏州思必驰信息科技有限公司 Across the language voice recognition methods of one kind and device
CN110517664A (en) * 2019-09-10 2019-11-29 科大讯飞股份有限公司 Multi-party speech recognition methods, device, equipment and readable storage medium storing program for executing
CN110930980A (en) * 2019-12-12 2020-03-27 苏州思必驰信息科技有限公司 Acoustic recognition model, method and system for Chinese and English mixed speech


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113129925A (en) * 2021-04-20 2021-07-16 深圳追一科技有限公司 Mouth action driving model training method and assembly based on VC model
CN113077781A (en) * 2021-06-04 2021-07-06 北京世纪好未来教育科技有限公司 Voice recognition method and device, electronic equipment and storage medium
CN113077781B (en) * 2021-06-04 2021-09-07 北京世纪好未来教育科技有限公司 Voice recognition method and device, electronic equipment and storage medium
CN113327596A (en) * 2021-06-17 2021-08-31 北京百度网讯科技有限公司 Training method of voice recognition model, voice recognition method and device
CN114596845A (en) * 2022-04-13 2022-06-07 马上消费金融股份有限公司 Training method of voice recognition model, voice recognition method and device
WO2023231576A1 (en) * 2022-05-30 2023-12-07 京东科技信息技术有限公司 Generation method and apparatus for mixed language speech recognition model


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province
Applicant after: Sipic Technology Co.,Ltd.
Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province
Applicant before: AI SPEECH Co.,Ltd.
WW01 Invention patent application withdrawn after publication
Application publication date: 20201027