WO2022105472A1 - Speech recognition method, apparatus and electronic device - Google Patents

Speech recognition method, apparatus and electronic device Download PDF

Info

Publication number
WO2022105472A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
acoustic
data
text data
acoustic representation
Prior art date
Application number
PCT/CN2021/122961
Other languages
English (en)
French (fr)
Inventor
易中华
Original Assignee
北京帝派智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京帝派智能科技有限公司 filed Critical 北京帝派智能科技有限公司
Priority to JP2021577529A priority Critical patent/JP7335569B2/ja
Publication of WO2022105472A1 publication Critical patent/WO2022105472A1/zh

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/142 Hidden Markov Models [HMMs]
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks

Definitions

  • the present application relates to the technical field of natural language processing, and in particular, to a speech recognition method, apparatus and electronic device.
  • Speech recognition technology, also known as automatic speech recognition (ASR), computer speech recognition, or speech-to-text (STT), aims to have a computer automatically convert human speech content into the corresponding text.
  • speech recognition technology can be applied in many fields, including voice dialing, voice navigation, indoor device control, voice document retrieval, dictation data entry, and so on. If speech recognition technology is combined with other natural language processing technologies (such as machine translation and speech synthesis), more complex applications can be built, such as speech-to-speech translation.
  • current speech recognition systems are usually trained with the acoustic model and the language model completely separated, and speech recognition is then performed in a loosely coupled manner.
  • the acoustic model contains only the most basic linguistic information, while the language model contains only language-related information that is independent of the acoustic data; that is, the language model represents word collocation relationships at the text level only.
  • the drawback of this scheme is that the acoustic model and the language model are trained and optimized independently, so the pipeline scheme cannot be optimized end-to-end as a whole and a globally optimal recognition result cannot be obtained; the recognition accuracy of the pipeline scheme is therefore difficult to improve.
  • to overcome the above drawback, the prior art also adopts a solution in which all components of the speech recognition system form a single end-to-end network model.
  • however, this end-to-end network model is trained on audio-text sample pairs, and the number of audio-text samples currently available is usually sufficient only for the training requirements of the acoustic model, not those of the language model; as a result, the model cannot be widely used in large-vocabulary continuous speech recognition applications and can only be used in small, special-purpose speech recognition systems, and its accuracy and scalability are inferior to traditional pipeline schemes such as an acoustic model combined with an N-gram language model.
  • Embodiments of the present application provide a speech recognition method, apparatus, and electronic device, so as to improve the recognition accuracy of the speech recognition system.
  • in a first aspect, an embodiment of the present application provides a speech recognition method, the method including: using an acoustic model to generate a first acoustic representation corresponding to the first speech data; training a data generator model using the first text data corresponding to the first speech data and the first acoustic representation, so that the data generator model is used to generate a corresponding acoustic representation from arbitrary text data; using the data generator model to generate a second acoustic representation corresponding to the second text data, the scale of the second text data being larger than that of the first text data; and training the language model using the second text data and the second acoustic representation, so that the language model is used to generate a corresponding text sequence from the acoustic representation output by the acoustic model.
  • in an optional implementation, training the data generator model using the first text data corresponding to the first speech data and the first acoustic representation includes: generating a first phonetic symbol sequence corresponding to the first text data; using the first phonetic symbol sequence as the input of the data generator model and the first acoustic representation as the output of the data generator model; and training the data generator model using the output of the acoustic model as the supervision signal of the data generator model.
  • in an optional implementation, using the data generator model to generate the second acoustic representation corresponding to the second text data includes: generating a second phonetic symbol sequence corresponding to the second text data; and inputting the second phonetic symbol sequence into the data generator model to generate the second acoustic representation.
  • in an optional implementation, the acoustic model includes a Gaussian mixture model combined with a hidden Markov model GMM-HMM, or a neural network model combined with a hidden Markov model NN-HMM; the neural network model includes a long short-term memory network model LSTM.
  • the acoustic representation includes the output probabilities over all HMM states output by the GMM-HMM; alternatively, the acoustic representation includes a pronunciation-unit sequence lattice with posterior probabilities PDF, obtained by passing the normalized probabilities over all HMM states output by the neural network model through a softmax layer and then through a connectionist temporal classification model CTC or the Viterbi algorithm.
  • the pronunciation unit may be a state, a phoneme, an initial or a final, a syllable, a character, or a word, which is not limited in this embodiment.
  • the data generator model includes a generative adversarial network GANNet.
  • in an optional implementation, training the language model using the second text data and the second acoustic representation includes: using the second acoustic representation as the input of the language model and the second text data as the output of the language model, and training the language model.
  • in another optional implementation, training the language model using the second text data and the second acoustic representation includes: using the first acoustic representation and the second acoustic representation as the input of the language model and the first text data and the second text data as the output of the language model, and training the language model.
  • the language model includes an attention mechanism-based sequence-to-sequence encoder and decoder; the encoder includes a recurrent neural network structure or a convolutional neural network structure; and the decoder includes a recurrent neural network structure.
  • in a second aspect, an embodiment of the present application provides a speech recognition apparatus, the apparatus including: a first training unit, configured to generate a first acoustic representation corresponding to the first speech data by using an acoustic model; a second training unit, configured to train a data generator model using the first text data corresponding to the first speech data and the first acoustic representation, so that the data generator model is used to generate a corresponding acoustic representation from arbitrary text data; a first generation unit, configured to use the data generator model to generate a second acoustic representation corresponding to the second text data, the scale of the second text data being larger than that of the first text data; and a second generation unit, configured to train the language model using the second text data and the second acoustic representation, so that the language model is used to generate a corresponding text sequence from the acoustic representation output by the acoustic model.
  • in a third aspect, an embodiment of the present application provides an electronic device, the electronic device including: a processor and a memory, the memory storing computer program instructions that, when executed by the processor, cause the processor to perform the following steps: using the acoustic model to generate a first acoustic representation corresponding to the first speech data; training the data generator model using the first text data corresponding to the first speech data and the first acoustic representation, so that the data generator model is used to generate a corresponding acoustic representation from arbitrary text data; using the data generator model to generate a second acoustic representation corresponding to the second text data, the scale of the second text data being larger than that of the first text data; and training the language model using the second text data and the second acoustic representation, so that the language model is used to generate a corresponding text sequence from the acoustic representation output by the acoustic model.
  • the technical solutions of the embodiments of the present application are based on the input-output relationships among the acoustic model AM, the language model LM, and the data generator model.
  • in general, an acoustic model is first trained on speech-text pair data; the acoustic model is then used to produce acoustic representations on the speech-text pair data, which serve as the target while the text serves as the input for training the data generator model, so that a corresponding acoustic representation can be generated from arbitrary text; the data generator model is then used to generate acoustic representation-text data pairs on ultra-large-scale text for training the language model, and after training the acoustic model and the language model are cascaded to realize the conversion from speech to text.
  • according to the input-output relationships of the models, the three models can be jointly trained, partially or as a whole, at some stages of implementation.
  • FIG. 1 is a flowchart of a speech recognition method provided by an embodiment of the present application.
  • FIG. 2 is a structural diagram of an achievable acoustic model provided by an embodiment of the present application.
  • FIG. 3 is a structural diagram of an achievable data generator model provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a GANNet framework provided by an embodiment of the present application.
  • FIG. 5 is a flowchart of step S102 of a speech recognition method provided by an embodiment of the present application.
  • FIG. 6 is a flowchart of step S103 of a speech recognition method provided by an embodiment of the present application.
  • FIG. 7 is a structural diagram of an achievable language model provided by an embodiment of the present application.
  • FIG. 8 is a frame diagram of a speech recognition system provided by an embodiment of the present application.
  • FIG. 9 is a structural diagram of a speech recognition apparatus provided by an embodiment of the present application.
  • Speech recognition technology, also known as automatic speech recognition (ASR), computer speech recognition, or speech-to-text (STT), aims to have a computer automatically convert human speech content into the corresponding text.
  • speech recognition technology can be applied in many fields, including voice dialing, voice navigation, indoor device control, voice document retrieval, dictation data entry, and so on. If speech recognition technology is combined with other natural language processing technologies (such as machine translation and speech synthesis), more complex applications can be built, such as speech-to-speech translation.
  • state-of-the-art speech recognition systems are usually trained with the acoustic model and the language model completely separated, and speech recognition is then performed in a loosely coupled manner.
  • the acoustic model contains only the most basic linguistic information, while the language model contains only language-related information that is independent of the acoustic data; that is, the language model represents word collocation relationships at the text level only.
  • a speech recognition system in a traditional pipeline scheme usually includes an acoustic model AM, a language model LM and a pronunciation model PM.
  • the acoustic model AM is used to represent the relationship between the acoustic features and the pronunciation units.
  • the acoustic model AM generally takes the acoustic features extracted from the audio data as the input, and the output is usually the pronunciation unit sequence corresponding to each acoustic feature.
  • the model can use a phoneme sequence lattice or matrix with posterior probabilities PDF, which characterizes the pronunciation unit sequence, as the acoustic representation of the intermediate output.
  • the language model LM is used to represent the mapping relationship between the pronunciation unit sequence and the final recognized text sequence.
  • the language model can take the acoustic representation of the intermediate output of the acoustic model as the input and the text sequence as the output.
  • the pronunciation model PM is used to output text sequences as sounds.
  • based on the pipeline scheme, traditional speech recognition is implemented as follows: first, the acoustic model AM extracts acoustic features and predicts a set of subword units, usually context-dependent or context-independent phoneme sequences; then the phoneme sequence generated by the acoustic model is mapped to a word sequence through a manually designed lexicon; finally, the language model LM assigns probabilities to word sequences, and the word sequence with the largest overall joint probability is sought as the recognition result.
  • the above three models can be constructed with traditional methods such as the hidden Markov model (HMM) and the N-gram, or with methods such as deep neural networks.
  • there are also schemes that merge two of the above models so that, externally, only two models (the acoustic model AM and the language model LM) are exposed.
  • however, no matter how the pipeline scheme varies, it does not depart from the technical conception that the acoustic model AM and the language model LM are separate and independent of each other.
  • the drawback of this scheme is that the acoustic model and the language model are trained and optimized independently, so the pipeline scheme cannot be optimized end-to-end as a whole and a globally optimal recognition result cannot be obtained; the recognition accuracy of the pipeline scheme is therefore difficult to improve.
  • the prior art also adopts a scheme that treats all components of the speech recognition system as a single end-to-end network model.
  • unlike the traditional pipeline scheme, which trains the acoustic model AM and the language model LM as separate modules, the end-to-end scheme jointly trains all components as a single end-to-end neural network, which makes training simpler.
  • the fusion of the acoustic representation and the language representation (the features of the language model LM) is carried out well, giving theoretical support for obtaining the optimal recognition result.
  • because the end-to-end model is entirely a neural network, no external, hand-designed components such as finite state transducers, lexicons, or text normalization modules are required.
  • training an end-to-end model does not require decision trees or time-alignment bootstrapping generated by a separate system, and the model can be trained on given pairs of text and corresponding acoustic features.
  • however, this end-to-end model does not perform well when evaluated on production data, because the model is learned on tens of thousands of audio-text sample pairs; although these samples satisfy the training requirements of the acoustic model AM, their data scale is not comparable to the scale of text or speech content required for traditional language model training.
  • as a result, the model cannot be applied to large-vocabulary continuous speech recognition systems and can only be used in small speech recognition systems for specific purposes; its general speech recognition capability and range of applications are far below those of the traditional pipeline scheme.
  • an embodiment of the present application provides a speech recognition method, as shown in FIG. 1 , the method includes the following steps:
  • Step S101 using an acoustic model to generate a first acoustic representation corresponding to the first speech data.
  • the acoustic model may, for example, be composed of a neural network model combined with a hidden Markov model NN-HMM, wherein the neural network part of the acoustic model may be a long short-term memory network (LSTM), a recurrent neural network (recurrent neural network, RNN), gated recurrent unit (gate recurrent unit, GRU), convolutional neural network (convolutional neural networks, CNN), etc., which are not limited in the embodiments of the present application.
  • the acoustic model can also be a Gaussian mixture model combined with a hidden Markov model GMM-HMM. This application does not specifically limit which form of acoustic model to use.
  • to obtain acoustic features, the embodiment of the present application may introduce a first training data set consisting of speech data and its corresponding text data, denoted (a1, T1), where a1 represents the first speech data and T1 represents the first text data corresponding to the first speech data.
  • the first training data set may be a data set commonly used in the industry, or may be collected and created independently, which is not limited in this embodiment of the present application.
  • in general, the data scale of the first training data set may range from several thousand hours to several hundred thousand hours; the larger speech-text pair data sets currently used in the industry to train a speech recognizer are on the order of 100,000 hours, and the corresponding text data is generally smaller than 200 MB, which meets the training scale of the acoustic model but falls far short of the training scale of a language model.
  • in a specific implementation, when the acoustic model is composed of a neural network model combined with a hidden Markov model NN-HMM, the acoustic representation may include a pronunciation-unit sequence lattice with posterior probabilities (probability density function, PDF), obtained by passing the normalized probabilities over all HMM states output by the neural network model through a softmax layer and then through a connectionist temporal classification model CTC or the Viterbi algorithm.
  • when the acoustic model is a Gaussian mixture model combined with a hidden Markov model GMM-HMM, the acoustic representation may include the output probabilities over all HMM states output by the GMM-HMM.
  • for example, taking an acoustic model built from a long short-term memory network combined with an HMM, the acoustic representation may be the pronunciation-unit sequence lattice with posterior probabilities PDF obtained by passing the normalized probabilities over all HMM states output by the LSTM model through a softmax layer and then through a CTC model or the Viterbi algorithm.
  • Figure 2 shows a structural diagram of an achievable acoustic model.
  • as shown in FIG. 2, the acoustic model includes a feature frame layer AM Feature Frames, a pre-network layer AMPreNet, an encoder layer AMEncoder, and a post-processing layer AMPostNet.
  • the feature frame layer AM Feature Frames is used to perform spectral conversion on the waveform data of the input speech to obtain the frequency-domain features of the speech, which are the actual input data of the acoustic model and the speech recognition model.
  • the frequency-domain features may be, for example, mel-frequency cepstral coefficients (MFCC), a mel-frequency cepstrum (MFC), or a linear spectrum, which is not limited in the embodiment of the present application.
  • the pre-network layer AMPreNet is used to pre-process the frequency-domain features of speech, such as converting them into high-dimensional input vectors to facilitate computational processing.
  • the encoder layer AMEncoder can be a long short-term memory network LSTM, a recurrent neural network RNN, a gated recurrent unit GRU, a convolutional neural network CNN, or the like, which is not limited in the embodiments of this application, and is used to map the input vectors of the speech to a feature representation.
  • the post-processing layer AMPostNet can be a multi-layer convolutional neural network CNN, which is used to convolve the output of the encoder layer to achieve dimensionality reduction processing, and obtain the pronunciation unit sequence grid of the posterior probability PDF corresponding to the input speech frame.
  • during training, the acoustic model targets the pronunciation token sequence (Pronunciation Token Sequence) and uses the CTC model to compute the loss, which supervises the output of the PDF pronunciation-unit sequence lattice.
  • a pronunciation symbol refers to information used to characterize how text is pronounced, such as the International Phonetic Alphabet or Chinese Pinyin.
  • its unit can be a phoneme, a syllable, a word, or a Chinese character; any information capable of characterizing the pronunciation of text can serve as a pronunciation symbol, which is not limited in the embodiments of the present application.
  • after the acoustic model has been trained, the first speech data a1 is input into the acoustic model, and the corresponding first acoustic representation A1 can be obtained.
  • Step S102 using the first text data corresponding to the first speech data and the first acoustic representation to train the data generator model, so that the data generator model is used to generate the corresponding acoustic representation according to any text data.
  • the first acoustic representation A1 and the first textual data T1 constitute a second training data set used to train the generator model.
  • the data generator model is used to generate larger-scale acoustic representations from more text data, so as to meet the quantity of acoustic representations required for training the language model.
  • since the data scale of text data is unlimited, once a data generator model is obtained, an unlimited number of acoustic representations can be generated, which is sufficient for training a language model.
  • the data generator model can be built using generative adversarial networks (GANNet).
  • for example, as shown in FIG. 3, the data generator model can be a pronunciation-unit posterior probability generation model Text2Pdf GenModel, which includes: a character embedding layer Char Embedding, a GANNet layer, and a GAN post-processing layer GenPostNet.
  • the character embedding layer Char Embedding is used to perform word-embedding encoding on the ultra-large-scale text symbols corresponding to the ultra-large-scale text data, producing a vector form that is convenient for computation.
  • the GANNet layer is used to generate a representation of acoustic features from the text data.
  • the GANNet layer can be composed of a deep neural network or of other generator and discriminator functions.
  • the GAN post-processing layer GenPostNet is used to convolve the output of the GANNet layer for dimensionality reduction, yielding the ultra-large-scale acoustic representation PDF By GenNet corresponding to the final ultra-large-scale text data.
  • during training, a cross-entropy loss function (CrossEntropyLoss) between the PDF output by the acoustic model and the acoustic representation PDF By GenNet, or some other loss function, can be constructed to supervise the training direction.
  • FIG. 4 is a schematic diagram of a GANNet framework provided by an embodiment of the present application.
  • as shown in FIG. 4, GANNet can be composed of a generative model (Generative Model) and a discriminative model (Discriminative Model); through adversarial learning against each other, the generative model and the discriminative model enable GANNet to produce good outputs.
  • the generative model and the discriminative model can be neural networks or other functions capable of fitting the corresponding generation and discrimination.
  • in the present application, the pronunciation-unit posterior probability generation model Text2Pdf GenModel only needs the generative model (Generative Model) part at the usage stage (which includes the training stage in which the language model LM is jointly trained).
  • the generative model and the discriminant model can be any one or a combination of models such as long short-term memory network LSTM, recurrent neural network RNN, gated recurrent unit GRU, convolutional neural network CNN and Transformer.
  • step S102 is shown in FIG. 5 , which can be specifically implemented in the following manner:
  • Step S201 generating a first phonetic symbol sequence corresponding to the first text data.
  • Step S201 can preferably be applied to a pictographic language such as Chinese and a scenario in which the scale of the first text data is small.
  • for example, when the first text data is a Chinese character string, the first phonetic symbol sequence may be the pinyin string corresponding to the Chinese character string.
  • Step S202: with the first phonetic symbol sequence as the input of the data generator model, the first acoustic representation A1 as the output of the data generator model, and the output of the acoustic model as the supervision signal of the data generator model, train the data generator model.
  • a cross-entropy loss function can be constructed between the output PDF of the acoustic model and the output PDF By GenNet of the data generator model to supervise the training direction and improve the quality of the model.
  • Step S103 using a data generator model to generate a second acoustic representation corresponding to the second text data, where the scale of the second text data is larger than that of the first text data.
  • step S103 is shown in FIG. 6 , which can be specifically implemented by the following steps:
  • Step S301 generating a second phonetic symbol sequence corresponding to the second text data.
  • Step S301 can preferably be applied to scenarios of pictographic languages such as Chinese.
  • for example, when the second text data T2 is a Chinese character string, the second phonetic symbol sequence may be the pinyin string corresponding to the Chinese character string.
  • the scale of the second text data can be much larger than the scale of the first text data.
  • Step S302 the second phonetic symbol sequence is input into the data generator model to generate the second acoustic representation.
  • the second acoustic feature A2 and the second text data T2 may constitute a training data set for training a language model.
  • Step S104 using the second text data and the second acoustic representation to train the language model, so that the language model is used to generate a corresponding text sequence according to the acoustic representation output by the acoustic model.
  • FIG. 7 is a schematic structural diagram of a language model LM provided by an embodiment of the present application.
  • the language model LM includes a pre-network layer LMPreNet, an encoding and decoding layer LMNet, and a SoftMax layer.
  • the pre-network layer LMPreNet is used to pre-process the input acoustic representation, such as converting it into a vector form that is convenient for computation.
  • the encoder-decoder layer LMNet can be constructed using a sequence-to-sequence encoder-decoder deep neural network algorithm based on the attention mechanism.
  • the encoder can generally be built with a long short-term memory network LSTM, a recurrent neural network RNN, a gated recurrent unit GRU, a convolutional neural network CNN, or the like.
  • the decoder can generally be built with a recurrent neural network RNN, and the attention mechanism can be a location-sensitive attention mechanism.
  • the SoftMax layer is used to calculate the normalized probability of the data output by the encoder and decoder layer LMNet, so as to determine the result of the maximum probability according to the normalized probability as the final output text sequence Final Token Sequence.
  • the cross entropy loss function Cross Entropy Loss can be constructed between the final output text sequence Final Token Sequence and the SoftMax layer to supervise the generation direction of the text sequence Final Token Sequence.
  • the language model may be trained using the second acoustic representation as the input of the language model and the second text data as the output of the language model.
  • alternatively, the language model can be trained by using the first acoustic representation and the second acoustic representation as the input of the language model and the first text data and the second text data as the output of the language model, thereby increasing the scale of the language model's training data and improving model quality.
  • FIG. 8 in an embodiment of the present application shows a schematic structural diagram of a speech recognition system.
  • the speech recognition system includes an acoustic model AM, a language model LM, and a pronunciation-unit posterior probability generation model Text2Pdf GenModel.
  • the language model LM takes as input the acoustic representation PDF output by the acoustic model AM and the acoustic representation PDF By GenNet output by the pronunciation-unit posterior probability generation model, and outputs the text sequence as the final result.
  • in general, an acoustic model is first trained on speech-text pair data; the acoustic model is then used to produce acoustic representations on the speech-text pair data, which serve as the target while the text serves as the input for training the data generator model, so that a corresponding acoustic representation can be generated from arbitrary text.
  • the data generator model is then used to generate acoustic representation-text data pairs on ultra-large-scale text for training the language model.
  • after training is complete, the acoustic model and the language model are cascaded to realize the conversion process from speech to text.
  • according to the input-output relationships of the models, the three models can be jointly trained, partially or as a whole, at some stages of implementation; because the data generator model can in theory enlarge the scale of acoustic representation-text pair data without limit, a large-vocabulary continuous speech recognition system with high accuracy in a given domain can be built without obtaining speech data for that domain in advance, and if data generation and language model training are carried out on a sufficiently large text scale, a system with high accuracy in all domains can be built.
  • the embodiment of the present application also provides a speech recognition device, and as shown in FIG. 9 , the speech recognition device may include:
  • a first training unit 401 configured to generate a first acoustic representation corresponding to the first speech data by using an acoustic model
  • the second training unit 402 is used for training the data generator model using the first text data and the first acoustic representation corresponding to the first speech data, so that the data generator model is used to generate the corresponding acoustic representation according to any text data;
  • a first generating unit 403 configured to use a data generator model to generate a second acoustic representation corresponding to the second text data, where the scale of the second text data is larger than that of the first text data;
  • the third training unit 404 is configured to train the language model using the second text data and the second acoustic representation, so that the language model is used to generate a corresponding text sequence according to the acoustic representation output by the acoustic model.
  • the second training unit 402 is specifically configured to generate a first phonetic symbol sequence corresponding to the first text data; and, with the first phonetic symbol sequence as the input of the data generator model, the first acoustic representation as the output of the data generator model, and the output of the acoustic model as the supervision signal of the data generator model, to train the data generator model.
  • the first generating unit 403 is specifically configured to generate a second phonetic symbol sequence corresponding to the second text data; and input the second phonetic symbol sequence into a data generator model to generate a second acoustic representation.
  • the third training unit 404 is specifically configured to use the second acoustic representation as the input of the language model, and use the second text data as the output of the language model to train the language model.
  • the third training unit 404 is specifically configured to use the first acoustic representation and the second acoustic representation as the input of the language model, and use the first text data and the second text data as the output of the language model to train the language model .
  • the technical solution of the embodiment of the present application is based on the input-output relationships among the acoustic model AM, the language model LM, and the pronunciation-unit posterior probability generation model Text2Pdf GenModel; these three models can be jointly trained at some stages of implementation, and because the pronunciation-unit posterior probability generation model Text2Pdf GenModel enlarges the scale of the acoustic representations, the trained speech recognition system can be applied to large-vocabulary continuous speech recognition scenarios with high accuracy.
  • the embodiment of the present application also provides an electronic device, which may include, for example, a mobile phone, a tablet computer, a personal computer, a server, a workstation device, a large-screen device (e.g., a smart screen or smart TV), a smart speaker, a handheld game console, a home game console, a virtual reality device, an augmented reality device, a mixed reality device, a vehicle-mounted smart terminal, a self-driving car, customer-premises equipment (CPE), or the like, which is not limited in the embodiments of the present application.
  • the electronic device may include a processor 501 and a memory 502, the memory 502 storing computer program instructions that, when executed by the processor 501, cause the processor 501 to perform the following steps: using the acoustic model to generate a first acoustic representation corresponding to the first speech data; training the data generator model using the first text data corresponding to the first speech data and the first acoustic representation, so that the data generator model is used to generate a corresponding acoustic representation from arbitrary text data; using the data generator model to generate a second acoustic representation corresponding to the second text data, the scale of the second text data being larger than that of the first text data; and training the language model using the second text data and the second acoustic representation, so that the language model is used to generate a corresponding text sequence from the acoustic representation output by the acoustic model.
  • the technical solution of the embodiment of the present application is based on the input-output relationships among the acoustic model AM, the language model LM, and the pronunciation-unit posterior probability generation model Text2Pdf GenModel; these three models can be jointly trained at some stages of implementation, and because the pronunciation-unit posterior probability generation model Text2Pdf GenModel enlarges the scale of the acoustic representations, the terminal device is able to perform speech recognition in large-vocabulary continuous speech recognition scenarios with high accuracy.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the present application provide a speech recognition method, apparatus, and electronic device, capable of: using an acoustic model to generate a first acoustic representation corresponding to first speech data; training a data generator model using first text data corresponding to the first speech data and the first acoustic representation, so that the data generator model is used to generate a corresponding acoustic representation from arbitrary text data; using the data generator model to generate a second acoustic representation corresponding to second text data, the scale of the second text data being larger than that of the first text data; and training a language model using the second text data and the second acoustic representation, so that the language model is used to generate a corresponding text sequence from the acoustic representation output by the acoustic model. By enlarging the scale of the acoustic representations through the data generator model, the technical solution of the embodiments of the present application allows the trained speech recognition system to be applied to large-vocabulary continuous speech recognition scenarios with high accuracy.

Description

Speech recognition method, apparatus and electronic device
This application claims priority to the Chinese patent application filed with the China National Intellectual Property Administration on November 18, 2020, with application number 202011294806.8 and entitled "Speech recognition method, apparatus and electronic device", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of natural language processing, and in particular to a speech recognition method, apparatus, and electronic device.
Background
Speech recognition technology, also known as automatic speech recognition (ASR), computer speech recognition, or speech-to-text (STT), aims to have a computer automatically convert human speech content into the corresponding text. Speech recognition can be applied in many fields, including voice dialing, voice navigation, indoor device control, voice document retrieval, and dictation data entry. If speech recognition is combined with other natural language processing technologies (such as machine translation and speech synthesis), more complex applications can be built, such as speech-to-speech translation.
Current speech recognition systems are usually trained with the acoustic model and the language model completely separated, and speech recognition is then performed in a loosely coupled manner: the acoustic model contains only the most basic linguistic information, while the language model contains only language-related information that is independent of the acoustic data, i.e. the language model represents word collocation relationships at the text level only. The drawback of this scheme is that the acoustic model and the language model are trained and optimized independently, so the pipeline scheme cannot be optimized end-to-end as a whole and a globally optimal recognition result cannot be obtained; the recognition accuracy of the pipeline scheme is therefore difficult to improve.
To overcome the above drawback, the prior art has also adopted a solution in which all components of the speech recognition system form a single end-to-end network model. However, this end-to-end network model is trained on audio-text sample pairs, and the number of audio-text samples currently available is usually sufficient only for the training requirements of the acoustic model, not those of the language model. As a result, such a model cannot be widely applied to large-vocabulary continuous speech recognition and can only be used in small, special-purpose speech recognition systems; its accuracy and scalability are inferior to traditional pipeline schemes such as an acoustic model combined with an N-gram language model.
Summary
Embodiments of the present application provide a speech recognition method, apparatus, and electronic device, so as to improve the recognition accuracy of a speech recognition system.
In a first aspect, an embodiment of the present application provides a speech recognition method, including: using an acoustic model to generate a first acoustic representation corresponding to first speech data; training a data generator model using first text data corresponding to the first speech data and the first acoustic representation, so that the data generator model is used to generate a corresponding acoustic representation from arbitrary text data; using the data generator model to generate a second acoustic representation corresponding to second text data, the scale of the second text data being larger than that of the first text data; and training a language model using the second text data and the second acoustic representation, so that the language model is used to generate a corresponding text sequence from the acoustic representation output by the acoustic model.
In an optional implementation, training the data generator model using the first text data corresponding to the first speech data and the first acoustic representation includes: generating a first phonetic symbol sequence corresponding to the first text data; and, with the first phonetic symbol sequence as the input of the data generator model and the first acoustic representation as the output of the data generator model, and using the output of the acoustic model as the supervision signal of the data generator model, training the data generator model.
In an optional implementation, using the data generator model to generate the second acoustic representation corresponding to the second text data includes: generating a second phonetic symbol sequence corresponding to the second text data; and inputting the second phonetic symbol sequence into the data generator model to generate the second acoustic representation.
In an optional implementation, the acoustic model includes a Gaussian mixture model combined with a hidden Markov model (GMM-HMM), or a neural network model combined with a hidden Markov model (NN-HMM); the neural network model includes a long short-term memory network (LSTM); the acoustic representation includes the output probabilities over all HMM states output by the GMM-HMM; or the acoustic representation includes a pronunciation-unit sequence lattice with posterior probabilities (PDF), obtained by passing the normalized probabilities over all HMM states output by the neural network model through a softmax layer and then through a connectionist temporal classification (CTC) model or the Viterbi algorithm. The pronunciation unit may be a state, a phoneme, an initial or a final, a syllable, a character, or a word, which is not limited in this embodiment.
In an optional implementation, the data generator model includes a generative adversarial network (GANNet).
In an optional implementation, training the language model using the second text data and the second acoustic representation includes: training the language model with the second acoustic representation as the input of the language model and the second text data as the output of the language model.
In an optional implementation, training the language model using the second text data and the second acoustic representation includes: training the language model with the first acoustic representation and the second acoustic representation as the input of the language model and the first text data and the second text data as the output of the language model.
In an optional implementation, the language model includes an attention-based sequence-to-sequence encoder and decoder; the encoder includes a recurrent neural network structure or a convolutional neural network structure; the decoder includes a recurrent neural network structure.
In a second aspect, an embodiment of the present application provides a speech recognition apparatus, including: a first training unit, configured to generate a first acoustic representation corresponding to first speech data by using an acoustic model; a second training unit, configured to train a data generator model using first text data corresponding to the first speech data and the first acoustic representation, so that the data generator model is used to generate a corresponding acoustic representation from arbitrary text data; a first generation unit, configured to use the data generator model to generate a second acoustic representation corresponding to second text data, the scale of the second text data being larger than that of the first text data; and a second generation unit, configured to train a language model using the second text data and the second acoustic representation, so that the language model is used to generate a corresponding text sequence from the acoustic representation output by the acoustic model.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor and a memory, the memory storing computer program instructions that, when executed by the processor, cause the processor to perform the following steps: using an acoustic model to generate a first acoustic representation corresponding to first speech data; training a data generator model using first text data corresponding to the first speech data and the first acoustic representation, so that the data generator model is used to generate a corresponding acoustic representation from arbitrary text data; using the data generator model to generate a second acoustic representation corresponding to second text data, the scale of the second text data being larger than that of the first text data; and training a language model using the second text data and the second acoustic representation, so that the language model is used to generate a corresponding text sequence from the acoustic representation output by the acoustic model.
The technical solutions of the embodiments of the present application are based on the input-output relationships among the acoustic model AM, the language model LM, and the data generator model. In general, an acoustic model is first trained on speech-text pair data; the acoustic model is then used to produce acoustic representations on the speech-text pair data, which serve as the target while the text serves as the input for training the data generator model, so that a corresponding acoustic representation can be generated from arbitrary text; the data generator model is then used to generate acoustic representation-text data pairs on ultra-large-scale text for training the language model. After training, the acoustic model and the language model are cascaded to realize the conversion from speech to text. According to the input-output relationships of the models, the three models can be jointly trained, partially or as a whole, at some stages of implementation. Because the data generator model can in theory enlarge the scale of acoustic representation-text pair data without limit, a large-vocabulary continuous speech recognition system with high accuracy in a given domain can be built without obtaining speech data for that domain in advance; if data generation and language model training are carried out on a sufficiently large text scale, a system with high accuracy in all domains can be built.
Brief Description of the Drawings
FIG. 1 is a flowchart of a speech recognition method provided by an embodiment of the present application;
FIG. 2 is a structural diagram of an implementable acoustic model provided by an embodiment of the present application;
FIG. 3 is a structural diagram of an implementable data generator model provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of the GANNet framework provided by an embodiment of the present application;
FIG. 5 is a flowchart of step S102 of a speech recognition method provided by an embodiment of the present application;
FIG. 6 is a flowchart of step S103 of a speech recognition method provided by an embodiment of the present application;
FIG. 7 is a structural diagram of an implementable language model provided by an embodiment of the present application;
FIG. 8 is a framework diagram of a speech recognition system provided by an embodiment of the present application;
FIG. 9 is a structural diagram of a speech recognition apparatus provided by an embodiment of the present application.
Detailed Description
Speech recognition technology, also known as automatic speech recognition (ASR), computer speech recognition, or speech-to-text (STT), aims to have a computer automatically convert human speech content into the corresponding text. Speech recognition can be applied in many fields, including voice dialing, voice navigation, indoor device control, voice document retrieval, and dictation data entry. If speech recognition is combined with other natural language processing technologies (such as machine translation and speech synthesis), more complex applications can be built, such as speech-to-speech translation.
State-of-the-art speech recognition systems are usually trained with the acoustic model and the language model completely separated, and recognition is then performed in a loosely coupled manner: the acoustic model contains only the most basic linguistic information, while the language model contains only language-related information that is independent of the acoustic data, i.e. the language model represents word collocation relationships at the text level only. For example, a speech recognition system built on the traditional pipeline scheme usually contains an acoustic model AM, a language model LM, and a pronunciation model PM. The acoustic model AM represents the relationship between acoustic features and pronunciation units; it generally takes acoustic features extracted from audio data as input, and its output is usually the pronunciation unit sequence corresponding to each acoustic feature, where the model can use a phoneme sequence lattice or matrix with posterior probabilities (PDF), which characterizes the pronunciation unit sequence, as an intermediate acoustic representation. The language model LM represents the mapping from the pronunciation unit sequence to the finally recognized text sequence; it can take the intermediate acoustic representation output by the acoustic model as input and the text sequence as output. The pronunciation model PM is used to render the text sequence as sound. Based on the pipeline scheme, traditional speech recognition is implemented as follows: first, the acoustic model AM extracts acoustic features and predicts a set of subword units, usually context-dependent or context-independent phoneme sequences; then the phoneme sequence generated by the acoustic model is mapped to a word sequence through a manually designed lexicon; finally, the language model LM assigns probabilities to word sequences, and the word sequence with the largest overall joint probability is sought as the recognition result. The above three models can be constructed with traditional methods such as the hidden Markov model (HMM) and the N-gram, or with methods such as deep neural networks; there are also schemes that merge two of the above models so that, externally, only two models (the acoustic model AM and the language model LM) are exposed. However, no matter how the pipeline scheme varies, it does not depart from the technical conception that the acoustic model AM and the language model LM are separate and independent of each other. The drawback of this scheme is that the acoustic model and the language model are trained and optimized independently, so the pipeline scheme cannot be optimized end-to-end as a whole and a globally optimal recognition result cannot be obtained; the recognition accuracy of the pipeline scheme is therefore difficult to improve.
To overcome the drawbacks of the pipeline scheme, the prior art has also adopted a solution in which all components of the speech recognition system form a single end-to-end network model. Unlike the traditional pipeline scheme, which trains the acoustic model AM and the language model LM as separate modules, the end-to-end scheme jointly trains all components as a single end-to-end neural network, which makes training simpler and fuses the acoustic representation with the language representation (the features of the language model LM) well, giving theoretical support for obtaining the optimal recognition result. In addition, because the end-to-end model is entirely a neural network, no external, hand-designed components such as finite state transducers, lexicons, or text normalization modules are required. Finally, unlike traditional models, training an end-to-end model does not require decision trees or time-alignment bootstrapping generated by a separate system, and the model can be trained on given pairs of text and corresponding acoustic features. However, such an end-to-end model does not perform well when evaluated on production data, because it is learned from tens of thousands of audio-text sample pairs; although these samples satisfy the training requirements of the acoustic model AM, their scale is not comparable to the amount of text or speech content required to train a traditional language model. As a result, the model cannot be applied to large-vocabulary continuous speech recognition systems and can only be used in small, special-purpose speech recognition systems; its general speech recognition capability and range of applications are far below those of the traditional pipeline scheme.
To address the problem that the training data scale of the end-to-end model scheme is insufficient, an embodiment of the present application provides a speech recognition method. As shown in FIG. 1, the method includes the following steps:
Step S101: use an acoustic model to generate a first acoustic representation corresponding to the first speech data.
Optionally, the acoustic model may, for example, be composed of a neural network model combined with a hidden Markov model (NN-HMM), where the neural network part of the acoustic model may be a long short-term memory network (LSTM), a recurrent neural network (RNN), a gated recurrent unit (GRU), a convolutional neural network (CNN), or the like, which is not limited in the embodiments of the present application. Alternatively, the acoustic model may be a Gaussian mixture model combined with a hidden Markov model (GMM-HMM). The present application does not specifically limit which form of acoustic model is used.
To obtain acoustic features, the embodiment of the present application may introduce a first training data set consisting of speech data and its corresponding text data, denoted (a1, T1), where a1 represents the first speech data and T1 represents the first text data corresponding to the first speech data. The first training data set may be a data set commonly used in the industry, or may be collected and created independently, which is not limited in this embodiment of the present application. In general, the data scale of the first training data set may range from several thousand hours to several hundred thousand hours; the larger speech-text pair data sets currently used in the industry to train a speech recognizer are on the order of 100,000 hours, and the corresponding text data is generally smaller than 200 MB, which meets the training scale of the acoustic model but falls far short of the training scale of a language model.
In a specific implementation, when the acoustic model is composed of a neural network model combined with a hidden Markov model (NN-HMM), the acoustic representation may include a pronunciation-unit sequence lattice with posterior probabilities (probability density function, PDF), obtained by passing the normalized probabilities over all HMM states output by the neural network model through a softmax layer and then through a connectionist temporal classification (CTC) model or the Viterbi algorithm. When the acoustic model is a Gaussian mixture model combined with a hidden Markov model (GMM-HMM), the acoustic representation may include the output probabilities over all HMM states output by the GMM-HMM.
For example, taking an acoustic model built from a long short-term memory network combined with a hidden Markov model (HMM), the acoustic representation may be the pronunciation-unit sequence lattice with posterior probabilities (PDF) obtained by passing the normalized probabilities over all HMM states output by the LSTM model through a softmax layer and then through a CTC model or the Viterbi algorithm.
FIG. 2 shows the structure of an implementable acoustic model. As shown in FIG. 2, the acoustic model includes a feature frame layer AM Feature Frames, a pre-network layer AMPreNet, an encoder layer AMEncoder, and a post-processing layer AMPostNet. The feature frame layer AM Feature Frames performs spectral conversion on the waveform data of the input speech to obtain the frequency-domain features of the speech, which are the actual input data of the acoustic model and the speech recognition model; the frequency-domain features may be, for example, mel-frequency cepstral coefficients (MFCC), a mel-frequency cepstrum (MFC), or a linear spectrum, which is not limited in the embodiments of the present application. The pre-network layer AMPreNet pre-processes the frequency-domain features of the speech, for example converting them into high-dimensional input vectors to facilitate computation. The encoder layer AMEncoder may be a long short-term memory network LSTM, a recurrent neural network RNN, a gated recurrent unit GRU, a convolutional neural network CNN, or the like, which is not limited in the embodiments of the present application, and maps the input vectors of the speech to a feature representation. The post-processing layer AMPostNet may be a multi-layer convolutional neural network CNN, which convolves the output of the encoder layer for dimensionality reduction, yielding the pronunciation-unit sequence lattice with posterior probabilities (PDF) corresponding to the input speech frames. In addition, during training the acoustic model targets the pronunciation token sequence (Pronunciation Token Sequence) and uses the CTC model to compute the loss, which supervises the output of the PDF pronunciation-unit sequence lattice. A pronunciation symbol refers to information used to characterize how text is pronounced, such as the International Phonetic Alphabet or Chinese Pinyin; its unit may be a phoneme, a syllable, a word, or a Chinese character, and any information capable of characterizing the pronunciation of text can serve as a pronunciation symbol, which is not limited in the embodiments of the present application.
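As a rough, non-authoritative illustration of the structure just described, the following PyTorch-style sketch shows one way such an acoustic model could be assembled; the layer sizes, the use of torchaudio's MFCC transform for the feature frame layer, the choice of an LSTM encoder, and the output dimension n_pdf are all illustrative assumptions rather than details fixed by this application.

```python
import torch
import torch.nn as nn
import torchaudio

class AcousticModel(nn.Module):
    """Sketch of the AM in FIG. 2: feature frames -> AMPreNet -> AMEncoder -> AMPostNet -> per-frame PDF."""
    def __init__(self, n_mfcc=40, hidden=512, n_pdf=218):  # n_pdf = number of pronunciation units / HMM states (assumed)
        super().__init__()
        self.feature_frames = torchaudio.transforms.MFCC(sample_rate=16000, n_mfcc=n_mfcc)  # AM Feature Frames
        self.pre_net = nn.Linear(n_mfcc, hidden)        # AMPreNet: lift frequency-domain features to a higher-dimensional vector
        self.encoder = nn.LSTM(hidden, hidden, num_layers=3, batch_first=True)  # AMEncoder (an LSTM is one allowed choice)
        self.post_net = nn.Conv1d(hidden, n_pdf, kernel_size=3, padding=1)      # AMPostNet: convolutional dimensionality reduction

    def forward(self, waveform):                        # waveform: (batch, samples)
        feats = self.feature_frames(waveform)           # (batch, n_mfcc, frames)
        x = self.pre_net(feats.transpose(1, 2))         # (batch, frames, hidden)
        x, _ = self.encoder(x)
        logits = self.post_net(x.transpose(1, 2))       # (batch, n_pdf, frames)
        return logits.transpose(1, 2)                   # per-frame scores over pronunciation units (the "PDF" lattice before softmax)

# CTC supervision against the Pronunciation Token Sequence, as in FIG. 2
ctc_loss = nn.CTCLoss(blank=0)

def am_loss(logits, targets, input_lengths, target_lengths):
    log_probs = logits.log_softmax(-1).transpose(0, 1)  # (frames, batch, n_pdf), the layout nn.CTCLoss expects
    return ctc_loss(log_probs, targets, input_lengths, target_lengths)
```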
After the acoustic model has been trained, the first speech data a1 is input into the acoustic model to obtain the corresponding first acoustic representation A1.
Step S102: train a data generator model using the first text data corresponding to the first speech data and the first acoustic representation, so that the data generator model is used to generate a corresponding acoustic representation from arbitrary text data.
The first acoustic representation A1 and the first text data T1 constitute the second training data set used to train the generator model.
In the embodiment of the present application, the data generator model is used to generate larger-scale acoustic representations from more text data, so as to meet the quantity of acoustic representations required for training the language model. Generally speaking, since the scale of text data is unlimited, once the data generator model is obtained, an unlimited number of acoustic representations can be produced, which is sufficient for training the language model.
In one implementation, the data generator model can be built with a generative adversarial network (GANNet). For example, as shown in FIG. 3, the data generator model may be a pronunciation-unit posterior probability generation model Text2Pdf GenModel, which includes a character embedding layer Char Embedding, a GANNet layer, and a GAN post-processing layer GenPostNet. The character embedding layer Char Embedding performs word-embedding encoding on the ultra-large-scale text symbols corresponding to the ultra-large-scale text data, producing a vector form that is convenient for computation. The GANNet layer generates a representation of acoustic features from the text data; it can be composed of a deep neural network or of other generator and discriminator functions. The GAN post-processing layer GenPostNet convolves the output of the GANNet layer for dimensionality reduction, yielding the ultra-large-scale acoustic representation PDF By GenNet corresponding to the final ultra-large-scale text data. During training, a cross-entropy loss (CrossEntropyLoss) between the PDF output by the acoustic model and the acoustic representation PDF By GenNet, or some other loss function, can be constructed to supervise the training direction.
FIG. 4 is a schematic diagram of the GANNet framework provided by an embodiment of the present application. As shown in FIG. 4, GANNet can be composed of a generative model (Generative Model) and a discriminative model (Discriminative Model); through adversarial learning against each other, the generative model and the discriminative model enable GANNet to produce good outputs. The generative model and the discriminative model may be neural networks or other functions capable of fitting the corresponding generation and discrimination. In the present application, the pronunciation-unit posterior probability generation model Text2Pdf GenModel only needs the generative model (Generative Model) part at the usage stage (which includes the training stage in which the language model LM is jointly trained). The generative model and the discriminative model may be any one or a combination of models such as a long short-term memory network LSTM, a recurrent neural network RNN, a gated recurrent unit GRU, a convolutional neural network CNN, and a Transformer.
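As a sketch only, and under the same PyTorch assumption as above, the Text2Pdf GenModel and its discriminator could take roughly the following shape; the GRU generator, the convolutional discriminator, and all dimensions are illustrative choices among the options (LSTM, RNN, GRU, CNN, Transformer) that the application allows.

```python
import torch.nn as nn

class Text2PdfGenerator(nn.Module):
    """Char Embedding -> GANNet (generative part) -> GenPostNet: phonetic symbols to a PDF-like acoustic representation."""
    def __init__(self, n_symbols, emb=256, hidden=512, n_pdf=218):
        super().__init__()
        self.char_embedding = nn.Embedding(n_symbols, emb)                       # Char Embedding layer
        self.gan_net = nn.GRU(emb, hidden, num_layers=2, batch_first=True)       # generative model of the GANNet layer
        self.gen_post_net = nn.Conv1d(hidden, n_pdf, kernel_size=3, padding=1)   # GenPostNet: reduce to the PDF dimension

    def forward(self, symbol_ids):                                    # (batch, seq_len) phonetic-symbol ids, e.g. pinyin tokens
        x = self.char_embedding(symbol_ids)
        x, _ = self.gan_net(x)
        return self.gen_post_net(x.transpose(1, 2)).transpose(1, 2)   # "PDF By GenNet": (batch, seq_len, n_pdf)

class PdfDiscriminator(nn.Module):
    """Discriminative model of GANNet: judges whether a PDF sequence comes from the acoustic model or from the generator."""
    def __init__(self, n_pdf=218, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_pdf, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(hidden, 1))

    def forward(self, pdf_seq):                                       # (batch, seq_len, n_pdf)
        return self.net(pdf_seq.transpose(1, 2))                      # real/fake logit
```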
In one implementation, based on the above data generator model, step S102 can be implemented as shown in FIG. 5, specifically in the following manner:
Step S201: generate a first phonetic symbol sequence corresponding to the first text data.
Step S201 is preferably applicable to pictographic languages such as Chinese and to scenarios where the scale of the first text data is small. For example, when the first text data is a Chinese character string, the first phonetic symbol sequence may be the pinyin string corresponding to the Chinese character string.
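For the pinyin case mentioned here, one possible (assumed, not mandated by this application) way to obtain the phonetic symbol sequence is the open-source pypinyin package:

```python
from pypinyin import lazy_pinyin

def text_to_phonetic_symbols(text: str) -> list[str]:
    """e.g. '语音识别' -> ['yu', 'yin', 'shi', 'bie']; tone marks are omitted in this default style."""
    return lazy_pinyin(text)
```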
Step S202: with the first phonetic symbol sequence as the input of the data generator model, the first acoustic representation A1 as the output of the data generator model, and the output of the acoustic model as the supervision signal of the data generator model, train the data generator model.
As described above, a cross-entropy loss function (CrossEntropyLoss), or some other loss function, can be constructed between the output PDF of the acoustic model and the output PDF By GenNet of the data generator model to supervise the training direction and improve model quality.
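A minimal sketch of this supervision, reusing the AcousticModel and Text2PdfGenerator sketches above; treating the acoustic model's per-frame posteriors as soft cross-entropy targets, and assuming the generator output has already been brought to the same number of frames as the acoustic model output (length matching is omitted), are simplifying assumptions of the sketch.

```python
import torch.nn.functional as F

def generator_training_step(generator, optimizer, symbol_ids, am_pdf_logits):
    """One step of training the data generator: the AM output PDF supervises PDF By GenNet."""
    gen_logits = generator(symbol_ids)                   # (batch, frames, n_pdf), assumed time-aligned with the AM output
    soft_targets = am_pdf_logits.softmax(-1).detach()    # acoustic model posteriors used as the supervision signal
    loss = F.cross_entropy(gen_logits.transpose(1, 2), soft_targets.transpose(1, 2))
    # an adversarial term from PdfDiscriminator could be added here to complete the GAN game
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```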
It can be understood that, once the data generator model has been trained, it is able to take arbitrary text data as input and output the corresponding acoustic representation; since the scale of text data is theoretically unlimited, large-scale acoustic features can be generated.
Step S103: use the data generator model to generate a second acoustic representation corresponding to second text data, the scale of the second text data being larger than that of the first text data.
In a specific implementation, step S103 can be implemented as shown in FIG. 6, specifically through the following steps:
Step S301: generate a second phonetic symbol sequence corresponding to the second text data.
Step S301 is preferably applicable to scenarios involving pictographic languages such as Chinese. For example, when the second text data T2 is a Chinese character string, the second phonetic symbol sequence may be the pinyin string corresponding to the Chinese character string. To obtain enough second acoustic representations to meet the training requirements of the language model, the scale of the second text data can be much larger than that of the first text data.
Step S302: input the second phonetic symbol sequence into the data generator model to generate the second acoustic representation.
The second acoustic representation A2 and the second text data T2 may constitute the training data set used to train the language model.
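Sketch of the corresponding offline generation pass over large-scale text, reusing text_to_phonetic_symbols from the pinyin sketch above; the encode_symbols lookup from symbols to ids is a hypothetical helper, and the batching is deliberately simplified.

```python
import torch

@torch.no_grad()
def build_lm_training_set(generator, text_corpus, encode_symbols):
    """Run the trained generator over the second text data T2 to produce (A2, T2) pairs for language model training."""
    pairs = []
    for text in text_corpus:                                          # text_corpus: iterable of strings (T2)
        symbol_ids = encode_symbols(text_to_phonetic_symbols(text))   # (seq_len,) LongTensor of phonetic-symbol ids (assumed helper)
        pdf_by_gennet = generator(symbol_ids.unsqueeze(0)).squeeze(0) # second acoustic representation A2 for this text
        pairs.append((pdf_by_gennet, text))
    return pairs
```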
Step S104: train the language model using the second text data and the second acoustic representation, so that the language model is used to generate a corresponding text sequence from the acoustic representation output by the acoustic model.
FIG. 7 is a schematic structural diagram of the language model LM provided by an embodiment of the present application. As shown in FIG. 7, the language model LM includes a pre-network layer LMPreNet, an encoding-decoding layer LMNet, and a SoftMax layer. The pre-network layer LMPreNet pre-processes the input acoustic representation, for example converting it into a vector form convenient for computation. The encoding-decoding layer LMNet can be built with an attention-based sequence-to-sequence encoder-decoder deep neural network algorithm, where the encoder can generally be built with a long short-term memory network LSTM, a recurrent neural network RNN, a gated recurrent unit GRU, a convolutional neural network CNN, or the like, the decoder can generally be built with a recurrent neural network RNN, and the attention mechanism can be a location-sensitive attention mechanism. The SoftMax layer computes the normalized probabilities of the data output by the encoding-decoding layer LMNet, so that the maximum-probability result determined from the normalized probabilities is taken as the finally output text sequence (Final Token Sequence). A cross-entropy loss function (Cross Entropy Loss) can be constructed between the finally output text sequence Final Token Sequence and the SoftMax layer to supervise the generation direction of the text sequence Final Token Sequence.
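As a rough sketch only: one way the language model could be shaped in PyTorch. A plain dot-product attention module stands in for the location-sensitive attention mentioned above, and the teacher-forced interface, layer sizes, and vocabulary size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LanguageModel(nn.Module):
    """LMPreNet -> encoder -> attention -> decoder -> (SoftMax outside), mapping acoustic representations to text tokens."""
    def __init__(self, n_pdf=218, n_tokens=6000, hidden=512):
        super().__init__()
        self.pre_net = nn.Linear(n_pdf, hidden)                        # LMPreNet
        self.encoder = nn.LSTM(hidden, hidden, batch_first=True)       # encoder part of LMNet
        self.tok_emb = nn.Embedding(n_tokens, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)        # decoder part of LMNet (an RNN)
        self.attn = nn.MultiheadAttention(hidden, num_heads=1, batch_first=True)  # stand-in for the attention mechanism
        self.out = nn.Linear(hidden * 2, n_tokens)                     # projects to token scores fed to the SoftMax layer

    def forward(self, pdf_seq, prev_tokens):
        # pdf_seq: (batch, frames, n_pdf) acoustic representation; prev_tokens: (batch, out_len) teacher-forced token ids
        enc, _ = self.encoder(self.pre_net(pdf_seq))
        dec, _ = self.decoder(self.tok_emb(prev_tokens))
        ctx, _ = self.attn(dec, enc, enc)                              # decoder states attend over encoder outputs
        return self.out(torch.cat([dec, ctx], dim=-1))                 # logits for the Final Token Sequence

# training uses cross-entropy between these logits and the reference text tokens, e.g.
# loss = nn.CrossEntropyLoss()(logits.flatten(0, 1), target_tokens.flatten())
```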
Optionally, the language model may be trained with the second acoustic representation as the input of the language model and the second text data as the output of the language model. Alternatively, the language model may be trained with the first acoustic representation and the second acoustic representation as the input of the language model and the first text data and the second text data as the output of the language model, thereby enlarging the scale of the language model's training data and improving model quality.
Based on the above technical solution, FIG. 8 of the embodiments of the present application shows a schematic structural diagram of a speech recognition system. The speech recognition system includes an acoustic model AM, a language model LM, and a pronunciation-unit posterior probability generation model Text2Pdf GenModel. The language model LM takes as input the acoustic representation PDF output by the acoustic model AM and the acoustic representation PDF By GenNet output by the pronunciation-unit posterior probability generation model, and outputs the text sequence as the final result.
The technical solutions of the embodiments of the present application are based on the input-output relationships among the acoustic model AM, the language model LM, and the data generator model. In general, an acoustic model is first trained on speech-text pair data; the acoustic model is then used to produce acoustic representations on the speech-text pair data, which serve as the target while the text serves as the input for training the data generator model, so that a corresponding acoustic representation can be generated from arbitrary text; the data generator model is then used to generate acoustic representation-text data pairs on ultra-large-scale text for training the language model. After training, the acoustic model and the language model are cascaded to realize the conversion from speech to text. According to the input-output relationships of the models, the three models can be jointly trained, partially or as a whole, at some stages of implementation. Because the data generator model can in theory enlarge the scale of acoustic representation-text pair data without limit, a large-vocabulary continuous speech recognition system with high accuracy in a given domain can be built without obtaining speech data for that domain in advance; if data generation and language model training are carried out on a sufficiently large text scale, a system with high accuracy in all domains can be built.
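For completeness, a minimal sketch of the cascaded recognition pass described in this paragraph; the greedy decoding loop and the special bos/eos token ids are assumptions made only for illustration.

```python
import torch

@torch.no_grad()
def recognize(acoustic_model, language_model, waveform, bos_id=1, eos_id=2, max_len=100):
    """Cascade at recognition time: speech -> AM acoustic representation (PDF) -> LM -> text token sequence."""
    pdf = acoustic_model(waveform.unsqueeze(0))                   # first stage: acoustic representation of the utterance
    tokens = [bos_id]
    for _ in range(max_len):                                      # simple greedy decode; beam search could be used instead
        logits = language_model(pdf, torch.tensor([tokens]))
        next_id = int(logits[0, -1].argmax())
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]                                             # recognized token ids; a vocabulary maps them back to text
```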
An embodiment of the present application further provides a speech recognition apparatus. As shown in FIG. 9, the speech recognition apparatus may include:
a first training unit 401, configured to generate a first acoustic representation corresponding to first speech data by using an acoustic model;
a second training unit 402, configured to train a data generator model using the first text data corresponding to the first speech data and the first acoustic representation, so that the data generator model is used to generate a corresponding acoustic representation from arbitrary text data;
a first generation unit 403, configured to use the data generator model to generate a second acoustic representation corresponding to second text data, the scale of the second text data being larger than that of the first text data;
a third training unit 404, configured to train the language model using the second text data and the second acoustic representation, so that the language model is used to generate a corresponding text sequence from the acoustic representation output by the acoustic model.
In one embodiment, the second training unit 402 is specifically configured to generate a first phonetic symbol sequence corresponding to the first text data; and, with the first phonetic symbol sequence as the input of the data generator model, the first acoustic representation as the output of the data generator model, and the output of the acoustic model as the supervision signal of the data generator model, to train the data generator model.
In one embodiment, the first generation unit 403 is specifically configured to generate a second phonetic symbol sequence corresponding to the second text data, and to input the second phonetic symbol sequence into the data generator model to generate the second acoustic representation.
In one embodiment, the third training unit 404 is specifically configured to train the language model with the second acoustic representation as the input of the language model and the second text data as the output of the language model.
In one embodiment, the third training unit 404 is specifically configured to train the language model with the first acoustic representation and the second acoustic representation as the input of the language model and the first text data and the second text data as the output of the language model.
The technical solutions of the embodiments of the present application are based on the input-output relationships among the acoustic model AM, the language model LM, and the pronunciation-unit posterior probability generation model Text2Pdf GenModel; these three models can be jointly trained at some stages of implementation, and because the pronunciation-unit posterior probability generation model Text2Pdf GenModel enlarges the scale of the acoustic representations, the trained speech recognition system can be applied to large-vocabulary continuous speech recognition scenarios with high accuracy.
An embodiment of the present application further provides an electronic device, which may include, for example, a mobile phone, a tablet computer, a personal computer, a server, a workstation device, a large-screen device (e.g., a smart screen or smart TV), a smart speaker, a handheld game console, a home game console, a virtual reality device, an augmented reality device, a mixed reality device, a vehicle-mounted smart terminal, a self-driving car, customer-premises equipment (CPE), or the like, which is not limited in the embodiments of the present application.
The electronic device may include a processor 501 and a memory 502, the memory 502 storing computer program instructions that, when executed by the processor 501, cause the processor 501 to perform the following steps: using an acoustic model to generate a first acoustic representation corresponding to the first speech data; training a data generator model using the first text data corresponding to the first speech data and the first acoustic representation, so that the data generator model is used to generate a corresponding acoustic representation from arbitrary text data; using the data generator model to generate a second acoustic representation corresponding to the second text data, the scale of the second text data being larger than that of the first text data; and training a language model using the second text data and the second acoustic representation, so that the language model is used to generate a corresponding text sequence from the acoustic representation output by the acoustic model.
The technical solutions of the embodiments of the present application are based on the input-output relationships among the acoustic model AM, the language model LM, and the pronunciation-unit posterior probability generation model Text2Pdf GenModel; these three models can be jointly trained at some stages of implementation, and because the pronunciation-unit posterior probability generation model Text2Pdf GenModel enlarges the scale of the acoustic representations, the terminal device is able to perform speech recognition in large-vocabulary continuous speech recognition scenarios with high accuracy.

Claims (9)

  1. A speech recognition method, characterized by comprising:
    using an acoustic model to generate a first acoustic representation corresponding to first speech data;
    generating a first phonetic symbol sequence corresponding to first text data;
    with the first phonetic symbol sequence as the input of a data generator model and the first acoustic representation as the output of the data generator model, training the data generator model, so that the data generator model is used to generate a corresponding acoustic representation from arbitrary text data;
    using the data generator model to generate a second acoustic representation corresponding to second text data, the scale of the second text data being larger than that of the first text data;
    training a language model using the second text data and the second acoustic representation, so that the language model is used to generate a corresponding text sequence from the acoustic representation output by the acoustic model.
  2. The method according to claim 1, characterized in that using the data generator model to generate the second acoustic representation corresponding to the second text data comprises:
    generating a second phonetic symbol sequence corresponding to the second text data;
    inputting the second phonetic symbol sequence into the data generator model to generate the second acoustic representation.
  3. The method according to claim 1 or 2, characterized in that:
    the acoustic model comprises a Gaussian mixture model combined with a hidden Markov model (GMM-HMM), or a neural network model combined with a hidden Markov model (NN-HMM); the neural network model comprises a long short-term memory network (LSTM);
    the acoustic representation comprises the output probabilities over all HMM states output by the GMM-HMM;
    or, the acoustic representation comprises a pronunciation-unit sequence lattice with posterior probabilities (PDF), obtained by passing the normalized probabilities over all HMM states output by the neural network model through a softmax layer and then through a connectionist temporal classification (CTC) model or the Viterbi algorithm.
  4. The method according to claim 1 or 2, characterized in that the data generator model comprises a generative adversarial network (GANNet).
  5. The method according to claim 1, characterized in that training the language model using the second text data and the second acoustic representation comprises: training the language model with the second acoustic representation as the input of the language model and the second text data as the output of the language model.
  6. The method according to claim 1, characterized in that training the language model using the second text data and the second acoustic representation comprises: training the language model with the first acoustic representation and the second acoustic representation as the input of the language model and the first text data and the second text data as the output of the language model.
  7. The method according to any one of claims 1, 5 and 6, wherein the language model comprises an attention-based sequence-to-sequence encoder and decoder; the encoder comprises a recurrent neural network structure or a convolutional neural network structure; and the decoder comprises a recurrent neural network structure.
  8. A speech recognition apparatus, characterized by comprising:
    a first training unit, configured to generate a first acoustic representation corresponding to first speech data by using an acoustic model;
    a second training unit, configured to generate a first phonetic symbol sequence corresponding to first text data and, with the first phonetic symbol sequence as the input of a data generator model and the first acoustic representation as the output of the data generator model, to train the data generator model, so that the data generator model is used to generate a corresponding acoustic representation from arbitrary text data;
    a first generation unit, configured to use the data generator model to generate a second acoustic representation corresponding to second text data, the scale of the second text data being larger than that of the first text data;
    a second generation unit, configured to train a language model using the second text data and the second acoustic representation, so that the language model is used to generate a corresponding text sequence from the acoustic representation output by the acoustic model.
  9. An electronic device, characterized by comprising a processor and a memory, the memory storing computer program instructions that, when executed by the processor, cause the processor to perform the following steps:
    using an acoustic model to generate a first acoustic representation corresponding to first speech data;
    generating a first phonetic symbol sequence corresponding to first text data;
    with the first phonetic symbol sequence as the input of a data generator model and the first acoustic representation as the output of the data generator model, training the data generator model, so that the data generator model is used to generate a corresponding acoustic representation from arbitrary text data;
    using the data generator model to generate a second acoustic representation corresponding to second text data, the scale of the second text data being larger than that of the first text data;
    training a language model using the second text data and the second acoustic representation, so that the language model is used to generate a corresponding text sequence from the acoustic representation output by the acoustic model.
PCT/CN2021/122961 2020-11-18 2021-10-11 一种语音识别方法、装置和电子设备 WO2022105472A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2021577529A JP7335569B2 (ja) 2020-11-18 2021-10-11 音声認識方法、装置及び電子機器

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011294806.8 2020-11-18
CN202011294806.8A CN112420050B (zh) 2020-11-18 2020-11-18 一种语音识别方法、装置和电子设备

Publications (1)

Publication Number Publication Date
WO2022105472A1 true WO2022105472A1 (zh) 2022-05-27

Family

ID=74774269

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/122961 WO2022105472A1 (zh) 2020-11-18 2021-10-11 一种语音识别方法、装置和电子设备

Country Status (3)

Country Link
JP (1) JP7335569B2 (zh)
CN (1) CN112420050B (zh)
WO (1) WO2022105472A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112420050B (zh) * 2020-11-18 2021-06-18 北京帝派智能科技有限公司 一种语音识别方法、装置和电子设备
CN113643694A (zh) * 2021-08-17 2021-11-12 科大讯飞股份有限公司 语音识别方法、装置、电子设备和存储介质
CN116013256B (zh) * 2022-12-19 2024-01-30 镁佳(北京)科技有限公司 一种语音识别模型构建及语音识别方法、装置及存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003140685A (ja) * 2001-10-30 2003-05-16 Nippon Hoso Kyokai <Nhk> 連続音声認識装置およびそのプログラム
US20160232892A1 (en) * 2015-02-11 2016-08-11 Electronics And Telecommunications Research Institute Method and apparatus of expanding speech recognition database
CN109739370A (zh) * 2019-01-10 2019-05-10 北京帝派智能科技有限公司 一种语言模型训练方法、汉语拼音输入方法及装置
CN111179917A (zh) * 2020-01-17 2020-05-19 厦门快商通科技股份有限公司 语音识别模型训练方法、系统、移动终端及存储介质
CN112420050A (zh) * 2020-11-18 2021-02-26 北京帝派智能科技有限公司 一种语音识别方法、装置和电子设备

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPWO2017037830A1 (ja) * 2015-08-31 2017-11-24 三菱電機株式会社 音声認識装置および音声認識処理方法
KR102423302B1 (ko) * 2015-10-06 2022-07-19 삼성전자주식회사 음성 인식에서의 음향 점수 계산 장치 및 방법과, 음향 모델 학습 장치 및 방법
KR102399535B1 (ko) * 2017-03-23 2022-05-19 삼성전자주식회사 음성 인식을 위한 학습 방법 및 장치
US11318373B2 (en) * 2017-10-04 2022-05-03 Ford Global Technologies, Llc Natural speech data generation systems and methods
CN110085215B (zh) * 2018-01-23 2021-06-08 中国科学院声学研究所 一种基于生成对抗网络的语言模型数据增强方法
CN108922518B (zh) * 2018-07-18 2020-10-23 苏州思必驰信息科技有限公司 语音数据扩增方法和系统
CN109117484B (zh) * 2018-08-13 2019-08-06 北京帝派智能科技有限公司 一种语音翻译方法和语音翻译设备
US10573296B1 (en) * 2018-12-10 2020-02-25 Apprente Llc Reconciliation between simulator and speech recognition output using sequence-to-sequence mapping
US11417322B2 (en) 2018-12-12 2022-08-16 Google Llc Transliteration for speech recognition training and scoring

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003140685A (ja) * 2001-10-30 2003-05-16 Nippon Hoso Kyokai <Nhk> 連続音声認識装置およびそのプログラム
US20160232892A1 (en) * 2015-02-11 2016-08-11 Electronics And Telecommunications Research Institute Method and apparatus of expanding speech recognition database
CN109739370A (zh) * 2019-01-10 2019-05-10 北京帝派智能科技有限公司 一种语言模型训练方法、汉语拼音输入方法及装置
CN111179917A (zh) * 2020-01-17 2020-05-19 厦门快商通科技股份有限公司 语音识别模型训练方法、系统、移动终端及存储介质
CN112420050A (zh) * 2020-11-18 2021-02-26 北京帝派智能科技有限公司 一种语音识别方法、装置和电子设备

Also Published As

Publication number Publication date
CN112420050B (zh) 2021-06-18
CN112420050A (zh) 2021-02-26
JP2022551678A (ja) 2022-12-13
JP7335569B2 (ja) 2023-08-30

Similar Documents

Publication Publication Date Title
Xiong Fundamentals of speech recognition
US11837216B2 (en) Speech recognition using unspoken text and speech synthesis
KR102386854B1 (ko) 통합 모델 기반의 음성 인식 장치 및 방법
Le et al. Deep shallow fusion for RNN-T personalization
WO2022105472A1 (zh) 一种语音识别方法、装置和电子设备
JP7436760B1 (ja) サブワードエンドツーエンド自動音声認識のための学習ワードレベルコンフィデンス
US20160147740A1 (en) Adapting machine translation data using damaging channel model
CN110870004B (zh) 基于音节的自动语音识别
US20220122622A1 (en) Cascaded Encoders for Simplified Streaming and Non-Streaming ASR
JP2023545988A (ja) トランスフォーマトランスデューサ:ストリーミング音声認識と非ストリーミング音声認識を統合する1つのモデル
CN111243599A (zh) 语音识别模型构建方法、装置、介质及电子设备
Garg et al. Streaming On-Device End-to-End ASR System for Privacy-Sensitive Voice-Typing.
JP2023175029A (ja) アテンションベースのジョイント音響およびテキストのオンデバイス・エンド・ツー・エンドモデル
CN117063228A (zh) 用于灵活流式和非流式自动语音识别的混合模型注意力
US11715458B2 (en) Efficient streaming non-recurrent on-device end-to-end model
JP2024511176A (ja) エンドツーエンド自動音声認識コンフィデンスおよび削除推定のためのマルチタスク学習
US20220310081A1 (en) Multilingual Re-Scoring Models for Automatic Speech Recognition
US20230017892A1 (en) Injecting Text in Self-Supervised Speech Pre-training
Effendi et al. Weakly-Supervised Speech-to-Text Mapping with Visually Connected Non-Parallel Speech-Text Data Using Cyclic Partially-Aligned Transformer.
WO2024020154A1 (en) Using aligned text and speech representations to train automatic speech recognition models without transcribed speech data
WO2023059978A1 (en) Deliberation of streaming rnn-transducer by non-autoregressive decoding
KR20240068755A (ko) 비-자기회귀 디코딩에 의한 스트리밍 rnn-변환기의 심의
Pandey et al. Towards bootstrapping Acoustic Models for resource poor Indian languages
CN113439301A (zh) 使用序列到序列映射在模拟数据与语音识别输出之间进行协调

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2021577529

Country of ref document: JP

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21893622

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 11/09/2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21893622

Country of ref document: EP

Kind code of ref document: A1