WO2018153213A1 - Multi-language mixed speech recognition method - Google Patents

Multi-language mixed speech recognition method Download PDF

Info

Publication number
WO2018153213A1
WO2018153213A1 · PCT/CN2018/074314 · CN2018074314W
Authority
WO
WIPO (PCT)
Prior art keywords
language
represent
voice data
speech
speech recognition
Prior art date
Application number
PCT/CN2018/074314
Other languages
English (en)
French (fr)
Inventor
范利春
孟猛
高鹏
Original Assignee
芋头科技(杭州)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 芋头科技(杭州)有限公司 filed Critical 芋头科技(杭州)有限公司
Priority to US16/487,279 priority Critical patent/US11151984B2/en
Publication of WO2018153213A1 publication Critical patent/WO2018153213A1/zh

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/263 Language identification
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/022 Demisyllables, biphones or triphones being the recognition units
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0631 Creating reference templates; Clustering
    • G10L2015/0633 Creating reference templates; Clustering using lexical or orthographic knowledge sources
    • G10L2015/0638 Interactive procedures
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 Hidden Markov Models [HMMs]
    • G10L15/144 Training of HMMs
    • G10L15/148 Duration modelling in HMMs, e.g. semi HMM, segmental models or transition probabilities
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/19 Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
    • G10L15/197 Probabilistic grammars, e.g. word n-grams
    • G10L15/26 Speech to text systems
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G10L25/78 Detection of presence or absence of voice signals

Definitions

  • the present invention relates to the field of speech recognition technology, and in particular, to a multi-language hybrid speech recognition method.
  • the recognition principle of early multi-language mixed speech recognition systems is to establish a separate speech recognition system for each language, cut the mixed speech into segments, send the speech segments of the different languages into the corresponding speech recognition systems for recognition, and finally combine the recognition results of the individual speech segments to form the recognition result of the mixed speech.
  • on the one hand, this recognition method makes it difficult to ensure the accuracy of segmenting the mixed speech by language.
  • on the other hand, the context information of each segment formed after segmentation is too short, which degrades the recognition accuracy.
  • a technical solution of a multi-language mixed speech recognition method is therefore provided, which aims to support the recognition of mixed speech in multiple languages and to improve the accuracy and efficiency of recognition, thereby improving the performance of the speech recognition system.
  • a multi-language hybrid speech recognition method wherein a speech recognition system for recognizing a multi-language mixed speech is first formed, and the method for forming the speech recognition system includes:
  • Step S1 configuring a multi-language hybrid dictionary including a plurality of different languages
  • Step S2 forming an acoustic recognition model according to the multi-language hybrid dictionary and multi-language speech data including a plurality of different languages;
  • Step S3 forming a language recognition model according to multi-language text corpus training including a plurality of different languages
  • Step S4 forming the speech recognition system by using the multi-language hybrid dictionary, the acoustic recognition model, and the language recognition model;
  • the mixed speech is recognized by the speech recognition system, and a corresponding recognition result is output.
  • the multi-language hybrid speech recognition method, wherein in the step S1, the multi-language hybrid dictionary is configured by means of triphone modeling according to a single-language dictionary corresponding to each different language.
  • the multi-language mixed speech recognition method, wherein in the step S1, the multi-language hybrid dictionary is configured by means of triphone modeling;
  • a corresponding language tag is respectively added in front of the phones of each language included in the multi-language hybrid dictionary, so as to distinguish the phones of the plurality of different languages.
  • the multi-lingual mixed speech recognition method, wherein the step S2 specifically includes:
  • Step S21, forming an acoustic model by training according to multi-language speech data mixed from a plurality of different languages and the multi-language mixed dictionary;
  • Step S22, extracting a speech feature from the multi-language speech data, and performing a frame alignment operation on the speech feature by using the acoustic model, so as to obtain an output label corresponding to the speech feature in each frame;
  • Step S23, using the speech feature as input data of the acoustic recognition model, and using the output label corresponding to the speech feature as an output label in an output layer of the acoustic recognition model, so as to train and form the acoustic recognition model.
  • the multi-language hybrid speech recognition method, wherein the acoustic model is a hidden Markov-Gaussian mixture model.
  • the multi-language hybrid speech recognition method, wherein, in the step S23, after the acoustic recognition model is trained, the output layer of the acoustic recognition model is adjusted, which specifically includes:
  • Step S231, respectively calculating a prior probability of each language, and calculating a prior probability of the silence shared by all languages;
  • Step S232, respectively calculating a posterior probability of each language, and calculating a posterior probability of the silence;
  • Step S233, adjusting the output layer of the acoustic recognition model according to the prior probability and the posterior probability of each language, and the prior probability and the posterior probability of the silence.
  • the multi-language mixed speech recognition method, wherein, in the step S231, the prior probability of each language is respectively calculated according to the following formula:
    $$P(q_i^j)=\frac{Count(q_i^j)}{\sum_{m=1}^{M_j}Count(q_m^j)+\sum_{m=1}^{M_{sil}}Count(q_m^{sil})}$$
    where $q_i^j$ is the output label of the i-th state of the j-th language in the multilingual voice data, $P(q_i^j)$ is the prior probability that the output label is $q_i^j$, $Count(q_i^j)$ is the total number of output labels equal to $q_i^j$, $q_i^{sil}$ is the output label of the i-th state of the silence, $Count(q_i^{sil})$ is the total number of output labels equal to $q_i^{sil}$, $M_j$ is the total number of states of the j-th language, and $M_{sil}$ is the total number of states of the silence in the multi-lingual voice data.
  • the multi-lingual mixed speech recognition method, wherein, in the step S231, the prior probability of the silence is calculated according to the following formula:
    $$P(q_i^{sil})=\frac{Count(q_i^{sil})}{\sum_{l\in L}\sum_{m=1}^{M_l}Count(q_m^{l})+\sum_{m=1}^{M_{sil}}Count(q_m^{sil})}$$
    where $L$ denotes all languages in the multi-lingual voice data and the remaining symbols are as defined above.
  • the multi-language mixed speech recognition method, wherein, in the step S232, the posterior probability of each language is separately calculated according to the following formula:
    $$P(q_i^j\mid x)=\frac{\exp(y_i^j)}{\sum_{m=1}^{M_j}\exp(y_m^j)+\sum_{m=1}^{M_{sil}}\exp(y_m^{sil})}$$
    where $q_i^j$ is the output label of the i-th state of the j-th language in the multilingual voice data, x is the speech feature, $P(q_i^j\mid x)$ is the posterior probability that the output label is $q_i^j$, $y_i^j$ is the input data of the i-th state of the j-th language, $y_i^{sil}$ is the input data of the i-th state of the silence, $M_j$ is the total number of states of the j-th language, $M_{sil}$ is the total number of states of the silence, and exp denotes the exponential function.
  • the multi-lingual mixed speech recognition method, wherein, in the step S232, the posterior probability of the silence is calculated according to the following formula:
    $$P(q_i^{sil}\mid x)=\frac{\exp(y_i^{sil})}{\sum_{l\in L}\sum_{m=1}^{M_l}\exp(y_m^{l})+\sum_{m=1}^{M_{sil}}\exp(y_m^{sil})}$$
    where $q_i^{sil}$ is the output label of the i-th state of the silence, L denotes all languages in the multi-lingual voice data, and the remaining symbols are as defined above.
  • the multi-lingual mixed speech recognition method wherein in the step S2, the acoustic recognition model is an acoustic model of a deep neural network.
  • the multi-lingual mixed speech recognition method wherein in the step S3, the language recognition model is formed by using an n-Gram model training, or the language recognition model is formed by using a recurrent neural network training.
  • the multi-language hybrid speech recognition method after forming the speech recognition system, first performing weight adjustment on different kinds of languages in the speech recognition system;
  • the steps of performing the weight adjustment include:
  • Step A1 respectively determining a posteriori probability weight value of each language according to the real voice data
  • step A2 the posterior probability of each language is separately adjusted according to the posterior probability weight value to complete the weight adjustment.
  • the multi-language mixed speech recognition method, wherein, in the step A2, the weight adjustment is performed according to the following formula:
    $$\hat{P}(q_i^j\mid x)=a_j\,P(q_i^j\mid x)$$
    where $q_i^j$ is the output label of the i-th state of the j-th language in the multilingual voice data, x is the speech feature, $P(q_i^j\mid x)$ is the posterior probability that the output label is $q_i^j$, $a_j$ is the posterior probability weight value of the j-th language in the multilingual speech data, and $\hat{P}(q_i^j\mid x)$ is the posterior probability, adjusted by the weight, that the output label is $q_i^j$.
  • the beneficial effects of the above technical solution are: providing a multi-language hybrid speech recognition method capable of supporting multi-language mixed speech recognition, improving the accuracy and efficiency of recognition, and thus improving the performance of the speech recognition system.
  • FIG. 1 is a schematic diagram showing the overall flow of forming a voice recognition system in a multi-language mixed voice recognition method according to a preferred embodiment of the present invention
  • FIG. 2 is a schematic diagram of a multi-language hybrid dictionary in a preferred embodiment of the present invention.
  • FIG. 3 is a flow chart showing the process of forming an acoustic recognition model on the basis of FIG. 1 in a preferred embodiment of the present invention
  • FIG. 4 is a schematic structural view of an acoustic recognition model in a preferred embodiment of the present invention.
  • Figure 5 is a flow chart showing the adjustment of the output layer of the acoustic recognition model on the basis of Figure 2 in a preferred embodiment of the present invention
  • FIG. 6 is a flow chart showing weight adjustment of a voice recognition system in a preferred embodiment of the present invention.
  • the present invention provides a multi-language mixed speech recognition method.
  • the so-called mixed speech refers to a mixture of voice data of a plurality of different languages; for example, a user inputs the voice "I need a USB interface", which includes both Chinese speech and the English proper noun "USB", so this speech is mixed speech.
  • in other embodiments, the mixed voice may also be a mixture of more than two languages, which is not limited herein.
  • the method for forming the voice recognition system is specifically as shown in FIG. 1 and includes:
  • Step S1 configuring a multi-language hybrid dictionary including a plurality of different languages
  • Step S2 forming an acoustic recognition model according to the multi-language hybrid dictionary and multi-language speech data training including a plurality of different languages;
  • Step S3 forming a language recognition model according to multi-language text corpus training including a plurality of different languages
  • step S4 a speech recognition system is formed by using a multi-language hybrid dictionary, an acoustic recognition model, and a language recognition model.
  • the speech recognition system can be used to recognize the mixed speech and output the corresponding recognition result.
  • the multi-language hybrid dictionary is a hybrid dictionary including a plurality of different languages, and the hybrid dictionary is configured down to the phone level.
  • the above-described hybrid dictionary is configured by means of triphone modeling, which yields a dictionary model that is more stable than word-level modeling.
  • since the dictionaries of different languages may contain phones with the same character representations, it is necessary, when configuring the hybrid dictionary, to add a corresponding language tag in front of the phones of each language included in the multi-language hybrid dictionary, so that the phones of the different languages are distinguished.
  • for example, the phone sets of both Chinese and English include phones such as "b" and "d".
  • language tags are therefore added in front of all English phones (for example, the prefix "en") to distinguish the English phone set from the Chinese phone set, as shown in Figure 2.
  • the above-mentioned language tag can be empty. For example, if there are two languages in the mixed dictionary, a language tag only needs to be added to one of the languages to tell the two languages apart. Similarly, if there are three languages in the mixed dictionary, language tags only need to be added to two of the languages to distinguish the three languages, and so on.
  • if a mixed dictionary includes Chinese, English and other languages, and only the Chinese and English phone sets may be confused, a language tag only needs to be added in front of the English phone set.
  • an acoustic recognition model is formed according to the hybrid dictionary and multi-language speech data including multiple languages.
  • the multi-language speech data described above is mixed training speech data, prepared in advance, that includes a plurality of different languages, and the hybrid dictionary provides the phones of the different languages in the process of forming the acoustic recognition model. Therefore, in the process of training the multi-language mixed acoustic recognition model, in order to obtain the triphone relations of the mixed-language phones, the training needs to be carried out on the basis of the multi-language speech data mixed from the above plurality of languages and the multi-language hybrid dictionary formed as described above.
  • a language recognition model is then formed by training on a multi-language text corpus comprising a plurality of languages, and finally the multi-language hybrid dictionary, the acoustic recognition model and the language recognition model are included in a speech recognition system; the speech recognition system recognizes the mixed voice, containing multiple languages, input by the user and outputs the recognition result.
  • the recognition process of the mixed voice is similar to the recognition process of single-language voice in the prior art: the acoustic recognition model recognizes the voice features in a piece of voice data as a corresponding phone or word sequence, and the language recognition model recognizes the word sequence as a complete sentence, thereby completing the process of recognizing the mixed speech.
  • the above recognition process will not be described in detail herein.
  • a multi-language hybrid dictionary including a plurality of languages is first formed according to a plurality of monolingual dictionaries, and in it the phones of different languages are marked with language tags to distinguish them. Then, an acoustic recognition model is formed by training on the multi-language mixed speech data and the multi-language mixed dictionary, and a language recognition model is formed by training on the multi-language mixed text corpus. Then, a complete speech recognition system is formed from the multi-language hybrid dictionary, the acoustic recognition model and the language recognition model to recognize the multi-language mixed speech input by the user.
  • step S2 specifically includes:
  • Step S21 forming an acoustic model according to multi-language speech data mixed in a plurality of different languages and multi-language mixed dictionary training;
  • Step S22, extracting a speech feature from the multi-language speech data, and performing a frame alignment operation on the speech feature by using an acoustic model, so as to obtain an output label corresponding to each frame of the speech feature;
  • Step S23 the speech feature is taken as the input data of the acoustic recognition model, and the output tag corresponding to the speech feature is used as an output tag in the output layer of the acoustic recognition model to train to form the acoustic recognition model.
  • an acoustic model is first formed according to multi-language speech data mixed in a plurality of different languages.
  • the acoustic model can be a Hidden Markov Model-Gaussian Mixture Model (HMM-GMM) model.
  • the parameter sharing technique can be selected in the process of training the acoustic model to reduce the parameter size.
  • the modeling technology of the acoustic model based on HMM-GMM is currently considered to be mature and will not be described here.
  • the multi-language speech data needs to be frame-aligned by using the acoustic model, so that the speech features extracted from each frame of the multi-language speech data correspond to an output label.
  • after the frame alignment, each frame of speech features corresponds to a GMM number.
  • the output labels in the output layer of the acoustic recognition model are the labels corresponding to each frame of the voice features, so the number of output labels in the output layer of the acoustic recognition model is the number of GMMs in the HMM-GMM model, and each output node corresponds to one GMM.
  • the speech feature is used as the input data of the acoustic recognition model
  • the output tag corresponding to the speech feature is used as an output tag in the output layer of the acoustic recognition model to train to form the acoustic recognition model.
  • FIG. 4 shows a schematic structure of an acoustic recognition model in an embodiment of the present invention; the acoustic recognition model is a deep neural network model built from a fully connected neural network structure, the neural network comprising a total of 7 fully connected neural network units, each layer having 2048 nodes, with a sigmoid nonlinear unit between every two layers.
  • the output layer is implemented using a softmax nonlinear unit.
  • S51 in Fig. 4 is used to represent the output layer of the acoustic recognition model, and L1, L2, and L3 respectively represent output tags on the output layer associated with different kinds of languages.
  • in step S23, after the acoustic recognition model is trained, the output layer of the acoustic recognition model needs to be adjusted, and prior-related operations need to be performed for the multiple languages; as shown in FIG. 5, this includes:
  • Step S231, respectively calculating a prior probability of each language, and calculating a prior probability of the silence shared by all languages;
  • Step S232, respectively calculating a posterior probability of each language, and calculating a posterior probability of the silence;
  • Step S233, adjusting the output layer of the acoustic recognition model according to the prior probability and the posterior probability of each language, and the prior probability and posterior probability of the silence.
  • the character string of the output result for a given speech feature is usually determined by the following formula:
    $$\hat{w}=\mathop{\arg\max}_{w}\,P(w)\,P(x\mid w) \qquad (1)$$
    where $\hat{w}$ is the character string of the output result, w represents a possible character string, x represents the input speech feature, P(w) is the probability given by the above-mentioned language recognition model, and $P(x\mid w)$ is the probability given by the above acoustic recognition model.
  • $P(x\mid w)$ can be further expanded as
    $$P(x\mid w)=\sum_{q_0,\ldots,q_T}\pi(q_0)\prod_{t=1}^{T}P(q_t\mid q_{t-1})\,P(x_t\mid q_t) \qquad (2)$$
    where $x_t$ is the speech feature input at time t, $q_t$ is the triphone state bound at time t, $\pi(q_0)$ is the probability distribution of the initial state $q_0$, $P(q_t\mid q_{t-1})$ is the HMM state transition probability, and $P(x_t\mid q_t)$ is the probability that the speech feature is $x_t$ in the $q_t$ state.
  • $P(x_t\mid q_t)$ can be further expanded as
    $$P(x_t\mid q_t)=P(q_t\mid x_t)\,P(x_t)/P(q_t) \qquad (3)$$
    where $P(q_t\mid x_t)$ is the posterior probability of the output layer of the above acoustic recognition model, $P(q_t)$ is the prior probability of the above acoustic recognition model, and $P(x_t)$ is the probability of $x_t$.
  • $P(x_t)$ is not related to the string sequence and can therefore be ignored.
  • according to formula (3), the character string of the output result can be adjusted by calculating the prior probability and the posterior probability of the output layer of the acoustic recognition model.
  • the prior probability P(q) of the neural network is typically calculated by the following formula:
    $$P(q_i)=\frac{Count(q_i)}{N} \qquad (4)$$
    where $Count(q_i)$ is the total number of labels $q_i$ in the multi-language voice data and N is the total number of all output labels.
  • since the amount of training speech data may differ between languages, the prior probability cannot be calculated uniformly and needs to be calculated separately for the different languages.
  • in step S231, the prior probability of each language is therefore first calculated, together with the prior probability of the silence shared by all languages:
    $$P(q_i^j)=\frac{Count(q_i^j)}{\sum_{m=1}^{M_j}Count(q_m^j)+\sum_{m=1}^{M_{sil}}Count(q_m^{sil})} \qquad (5)$$
    $$P(q_i^{sil})=\frac{Count(q_i^{sil})}{\sum_{l\in L}\sum_{m=1}^{M_l}Count(q_m^{l})+\sum_{m=1}^{M_{sil}}Count(q_m^{sil})} \qquad (6)$$
    where $q_i^{sil}$ is the output label of the i-th state of the silence in the multi-language voice data, $M_j$ is the total number of states of the j-th language, $M_{sil}$ is the total number of states of the silence, and L represents all languages in the multilingual speech data.
  • the posterior probability of the acoustic recognition model is then calculated.
  • the posterior probability $P(q_i\mid x)$ of the neural network output is usually calculated by the output layer.
  • when the output layer is implemented as a softmax nonlinear unit, the posterior probability is usually calculated according to the following formula:
    $$P(q_i\mid x)=\frac{\exp(y_i)}{\sum_{n=1}^{N}\exp(y_n)} \qquad (7)$$
    where $y_i$ is the input value of the i-th state and N is the number of all states.
  • the imbalance in the amount of training data between different languages may result in an imbalance in the distribution of the state value calculation results for the different languages, so the posterior probability still needs to be calculated separately for the different languages.
  • the posterior probability of each language is separately calculated according to the following formula:
    $$P(q_i^j\mid x)=\frac{\exp(y_i^j)}{\sum_{m=1}^{M_j}\exp(y_m^j)+\sum_{m=1}^{M_{sil}}\exp(y_m^{sil})} \qquad (8)$$
    where x is the speech feature, $y_i^j$ is the input data of the i-th state of the j-th language in the multi-language speech data, $y_i^{sil}$ is the input data of the i-th state of the silence, and exp denotes the exponential function.
  • in step S232, the posterior probability of the silence is calculated according to the following formula:
    $$P(q_i^{sil}\mid x)=\frac{\exp(y_i^{sil})}{\sum_{l\in L}\sum_{m=1}^{M_l}\exp(y_m^{l})+\sum_{m=1}^{M_{sil}}\exp(y_m^{sil})} \qquad (9)$$
    where $P(q_i^{sil}\mid x)$ is the posterior probability that the output label in the multi-language voice data is $q_i^{sil}$.
  • the prior probability and the posterior probability in each language and in the mute state can be calculated by using the above improved formulas (6)-(9), so that the acoustic recognition model can meet the output requirements of the multi-language hybrid modeling. It is able to describe each language and mute state more accurately. It should be noted that after the above formula is adjusted, the sum of the prior probability and the posterior probability is no longer 1.
  • the n-Gram model training may be used to form a language recognition model, or the recurrent neural network training may be used to form a language recognition model.
  • the above multi-language text corpus needs to include a multi-language separate text corpus, as well as multi-language mixed text data.
  • weight adjustment is first performed on different kinds of languages in the speech recognition system
  • the steps for performing the weight adjustment are as shown in FIG. 6, and include:
  • Step A1 respectively determining a posteriori probability weight value of each language according to the real voice data
  • step A2 the posterior probability of each language is separately adjusted according to the posterior probability weight value to complete the weight adjustment.
  • the amount of training data may be unbalanced during the training process, so a language with a relatively large amount of data obtains a relatively large prior probability.
  • since the final recognition probability is the posterior probability divided by the prior probability, the actual recognition probability of the language with more training data is instead smaller, which may cause the recognition results of the recognition system to tend to recognize one language while failing to recognize another, causing a bias in the recognition results.
  • the weight adjustment is therefore applied to the posterior probabilities output by the acoustic recognition model:
    $$\hat{P}(q_i^j\mid x)=a_j\,P(q_i^j\mid x) \qquad (10)$$
    where $q_i^j$ is the output label of the i-th state of the j-th language in the multilingual voice data, x is the speech feature, and $P(q_i^j\mid x)$ is the corresponding posterior probability;
  • $a_j$ is the posterior probability weight value of the j-th language in the multilingual speech data, and this weight value is determined by measuring the acoustic recognition model on a development set composed of the above real data;
  • $\hat{P}(q_i^j\mid x)$ is the posterior probability, adjusted by the weight, that the output label in the multi-language voice data is $q_i^j$.
  • the adjustment can make the speech recognition system get a good recognition effect in different application scenarios.
  • for example, the posterior probability weight value of Chinese can be set to 1.0, the posterior probability weight value of English to 0.3, and the posterior probability weight value of the silence to 1.0.
  • the a posteriori probability weight value may be repeatedly adjusted by using a development set composed of different real data multiple times to finally determine an optimal value.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Probability & Statistics with Applications (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

A multi-language mixed speech recognition method, belonging to the technical field of speech recognition. The method comprises: step S1, configuring a multi-language mixed dictionary comprising a plurality of different languages; step S2, training an acoustic recognition model according to the multi-language mixed dictionary and multi-language speech data comprising a plurality of different languages; step S3, training a language recognition model according to a multi-language text corpus comprising a plurality of different languages; and step S4, forming a speech recognition system using the multi-language mixed dictionary, the acoustic recognition model and the language recognition model. The speech recognition system is then used to recognize mixed speech and output a corresponding recognition result. The beneficial effects of the method are that it supports the recognition of mixed speech in multiple languages and improves the accuracy and efficiency of recognition, thereby improving the performance of the speech recognition system.

Description

Multi-language mixed speech recognition method
Technical Field
The present invention relates to the field of speech recognition technology, and in particular to a multi-language mixed speech recognition method.
Background Art
In everyday speech, people often unconsciously mix one language with expressions from one or several other languages. For example, some English words are used directly under their original names in Chinese, such as the proper nouns "ipad", "iphone" and "USB", which leads to utterances mixing Chinese and English. This phenomenon brings certain difficulties and challenges to speech recognition.
The recognition principle of early multi-language mixed speech recognition systems was to build a separate speech recognition system for each language, cut the mixed speech apart, send the speech segments of the different languages into the corresponding speech recognition systems for recognition, and finally merge the recognition results of the individual speech segments to form the recognition result of the mixed speech. On the one hand, this recognition method makes it difficult to guarantee the accuracy of segmenting the mixed speech by language; on the other hand, the context information of each speech segment formed by the segmentation is too short, which degrades the recognition accuracy.
In recent years, the approach to recognizing multi-language mixed speech has begun to change. Specifically, the dictionary of a single-language speech recognition system is extended, i.e. the phone set of one language is used to piece together another language; for example, the pronunciation of the English word "iphone" in a Chinese dictionary is pieced together as "爱疯". Although such a recognition method can recognize individual words of other languages, on the one hand it requires the speaker's pronunciation to be rather peculiar (for example, "iphone" must be pronounced exactly as "爱疯"), and on the other hand the accuracy of recognizing a whole sentence of mixed speech drops sharply.
Summary of the Invention
In view of the above problems in the prior art, a technical solution of a multi-language mixed speech recognition method is now provided, which aims to support the recognition of mixed speech in multiple languages and to improve the accuracy and efficiency of recognition, thereby improving the performance of the speech recognition system.
The above technical solution specifically comprises:
A multi-language mixed speech recognition method, wherein a speech recognition system for recognizing multi-language mixed speech is first formed, the method of forming the speech recognition system comprising:
Step S1, configuring a multi-language mixed dictionary comprising a plurality of different languages;
Step S2, training an acoustic recognition model according to the multi-language mixed dictionary and multi-language speech data comprising a plurality of different languages;
Step S3, training a language recognition model according to a multi-language text corpus comprising a plurality of different languages;
Step S4, forming the speech recognition system using the multi-language mixed dictionary, the acoustic recognition model and the language recognition model;
subsequently, recognizing the mixed speech with the speech recognition system, and outputting a corresponding recognition result.
Preferably, in the multi-language mixed speech recognition method, in the step S1, the multi-language mixed dictionary is configured by means of triphone modeling according to single-language dictionaries respectively corresponding to each different language.
Preferably, in the multi-language mixed speech recognition method, in the step S1, the multi-language mixed dictionary is configured by means of triphone modeling;
when configuring the multi-language mixed dictionary, a corresponding language tag is added in front of the phones of each language included in the multi-language mixed dictionary, so as to distinguish the phones of the plurality of different languages.
Preferably, in the multi-language mixed speech recognition method, the step S2 specifically comprises:
Step S21, training an acoustic model according to the multi-language speech data mixed from a plurality of different languages and the multi-language mixed dictionary;
Step S22, extracting speech features from the multi-language speech data, and performing a frame alignment operation on the speech features using the acoustic model, so as to obtain an output label corresponding to each frame of the speech features;
Step S23, using the speech features as input data of the acoustic recognition model, and using the output labels corresponding to the speech features as the output labels in the output layer of the acoustic recognition model, so as to train and form the acoustic recognition model.
Preferably, in the multi-language mixed speech recognition method, the acoustic model is a hidden Markov-Gaussian mixture model.
Preferably, in the multi-language mixed speech recognition method, in the step S23, after the acoustic recognition model is trained, the output layer of the acoustic recognition model is adjusted, which specifically comprises:
Step S231, respectively calculating a prior probability of each language, and calculating a prior probability of silence shared by all languages;
Step S232, respectively calculating a posterior probability of each language, and calculating a posterior probability of the silence;
Step S233, adjusting the output layer of the acoustic recognition model according to the prior probability and the posterior probability of each language and the prior probability and the posterior probability of the silence.
Preferably, in the multi-language mixed speech recognition method, in the step S231, the prior probability of each language is respectively calculated according to the following formula:
$$P(q_i^j)=\frac{Count(q_i^j)}{\sum_{m=1}^{M_j}Count(q_m^j)+\sum_{m=1}^{M_{sil}}Count(q_m^{sil})} \qquad (5)$$
wherein,
$q_i^j$ denotes the output label of the i-th state of the j-th language in the multi-language speech data;
$P(q_i^j)$ denotes the prior probability that the output label in the multi-language speech data is $q_i^j$;
$Count(q_i^j)$ denotes the total number of output labels equal to $q_i^j$ in the multi-language speech data;
$q_i^{sil}$ denotes the output label of the i-th state of the silence in the multi-language speech data;
$Count(q_i^{sil})$ denotes the total number of output labels equal to $q_i^{sil}$ in the multi-language speech data;
$M_j$ denotes the total number of states of the j-th language in the multi-language speech data;
$M_{sil}$ denotes the total number of states of the silence in the multi-language speech data.
Preferably, in the multi-language mixed speech recognition method, in the step S231, the prior probability of the silence is calculated according to the following formula:
$$P(q_i^{sil})=\frac{Count(q_i^{sil})}{\sum_{l\in L}\sum_{m=1}^{M_l}Count(q_m^{l})+\sum_{m=1}^{M_{sil}}Count(q_m^{sil})} \qquad (6)$$
wherein,
$q_i^{sil}$ denotes the output label of the i-th state of the silence in the multi-language speech data;
$P(q_i^{sil})$ denotes the prior probability that the output label in the multi-language speech data is $q_i^{sil}$;
$Count(q_i^{sil})$ denotes the total number of output labels equal to $q_i^{sil}$ in the multi-language speech data;
$q_i^j$ denotes the output label of the i-th state of the j-th language in the multi-language speech data;
$Count(q_i^j)$ denotes the total number of output labels equal to $q_i^j$ in the multi-language speech data;
$M_j$ denotes the total number of states of the j-th language in the multi-language speech data;
$M_{sil}$ denotes the total number of states of the silence in the multi-language speech data;
$L$ denotes all languages in the multi-language speech data.
Preferably, in the multi-language mixed speech recognition method, in the step S232, the posterior probability of each language is respectively calculated according to the following formula:
$$P(q_i^j\mid x)=\frac{\exp(y_i^j)}{\sum_{m=1}^{M_j}\exp(y_m^j)+\sum_{m=1}^{M_{sil}}\exp(y_m^{sil})} \qquad (8)$$
wherein,
$q_i^j$ denotes the output label of the i-th state of the j-th language in the multi-language speech data;
$x$ denotes the speech feature;
$P(q_i^j\mid x)$ denotes the posterior probability that the output label in the multi-language speech data is $q_i^j$;
$y_i^j$ denotes the input data of the i-th state of the j-th language in the multi-language speech data;
$y_i^{sil}$ denotes the input data of the i-th state of the silence;
$M_j$ denotes the total number of states of the j-th language in the multi-language speech data;
$M_{sil}$ denotes the total number of states of the silence in the multi-language speech data;
exp denotes the exponential function.
Preferably, in the multi-language mixed speech recognition method, in the step S232, the posterior probability of the silence is calculated according to the following formula:
$$P(q_i^{sil}\mid x)=\frac{\exp(y_i^{sil})}{\sum_{l\in L}\sum_{m=1}^{M_l}\exp(y_m^{l})+\sum_{m=1}^{M_{sil}}\exp(y_m^{sil})} \qquad (9)$$
wherein,
$q_i^{sil}$ denotes the output label of the i-th state of the silence in the multi-language speech data;
$x$ denotes the speech feature;
$P(q_i^{sil}\mid x)$ denotes the posterior probability that the output label in the multi-language speech data is $q_i^{sil}$;
$y_i^j$ denotes the input data of the i-th state of the j-th language in the multi-language speech data;
$y_i^{sil}$ denotes the input data of the i-th state of the silence;
$M_j$ denotes the total number of states of the j-th language in the multi-language speech data;
$M_{sil}$ denotes the total number of states of the silence in the multi-language speech data;
$L$ denotes all languages in the multi-language speech data;
exp denotes the exponential function.
Preferably, in the multi-language mixed speech recognition method, in the step S2, the acoustic recognition model is a deep neural network acoustic model.
Preferably, in the multi-language mixed speech recognition method, in the step S3, the language recognition model is formed by training an n-Gram model, or the language recognition model is formed by training a recurrent neural network.
Preferably, in the multi-language mixed speech recognition method, after the speech recognition system is formed, weight adjustment is first performed on the different languages in the speech recognition system;
the step of performing the weight adjustment comprises:
Step A1, respectively determining a posterior probability weight value of each language according to real speech data;
Step A2, respectively adjusting the posterior probability of each language according to the posterior probability weight values, so as to complete the weight adjustment.
Preferably, in the multi-language mixed speech recognition method, in the step A2, the weight adjustment is performed according to the following formula:
$$\hat{P}(q_i^j\mid x)=a_j\,P(q_i^j\mid x) \qquad (10)$$
wherein,
$q_i^j$ denotes the output label of the i-th state of the j-th language in the multi-language speech data;
$x$ denotes the speech feature;
$P(q_i^j\mid x)$ denotes the posterior probability that the output label in the multi-language speech data is $q_i^j$;
$a_j$ denotes the posterior probability weight value of the j-th language in the multi-language speech data;
$\hat{P}(q_i^j\mid x)$ denotes the posterior probability, after the weight adjustment, that the output label in the multi-language speech data is $q_i^j$.
The beneficial effects of the above technical solution are: a multi-language mixed speech recognition method is provided, which can support the recognition of mixed speech in multiple languages and improve the accuracy and efficiency of recognition, thereby improving the performance of the speech recognition system.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of the overall flow of forming a speech recognition system in a multi-language mixed speech recognition method in a preferred embodiment of the present invention;
FIG. 2 is a schematic diagram of a multi-language mixed dictionary in a preferred embodiment of the present invention;
FIG. 3 is a schematic flowchart of training and forming an acoustic recognition model on the basis of FIG. 1 in a preferred embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an acoustic recognition model in a preferred embodiment of the present invention;
FIG. 5 is a schematic flowchart of adjusting the output layer of the acoustic recognition model on the basis of FIG. 2 in a preferred embodiment of the present invention;
FIG. 6 is a schematic flowchart of performing weight adjustment on the speech recognition system in a preferred embodiment of the present invention.
Detailed Description of the Embodiments
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the scope of protection of the present invention.
It should be noted that, in the case of no conflict, the embodiments of the present invention and the features in the embodiments may be combined with each other.
The present invention is further described below with reference to the accompanying drawings and specific embodiments, which, however, are not to be taken as limiting the present invention.
In view of the above problems in the prior art, the present invention provides a multi-language mixed speech recognition method. The so-called mixed speech refers to speech data in which a plurality of different languages are mixed. For example, a user inputs the speech "我需要一个USB接口" ("I need a USB interface"); this utterance contains both Chinese speech and the English proper noun "USB", so it is mixed speech. In other embodiments of the present invention, the mixed speech may also be a mixture of more than two languages, which is not limited herein.
In the above multi-language mixed speech recognition method, a speech recognition system for recognizing the mixed speech needs to be formed first. The method of forming the speech recognition system is specifically shown in FIG. 1 and comprises:
Step S1, configuring a multi-language mixed dictionary comprising a plurality of different languages;
Step S2, training an acoustic recognition model according to the multi-language mixed dictionary and multi-language speech data comprising a plurality of different languages;
Step S3, training a language recognition model according to a multi-language text corpus comprising a plurality of different languages;
Step S4, forming a speech recognition system using the multi-language mixed dictionary, the acoustic recognition model and the language recognition model.
After the speech recognition system is formed, it can be used to recognize the mixed speech and output the corresponding recognition result.
Specifically, in this embodiment, the multi-language mixed dictionary is a mixed dictionary comprising a plurality of different languages, and the dictionary is configured down to the phone level. In a preferred embodiment of the present invention, the mixed dictionary is configured by means of triphone modeling, which yields a dictionary model that is more stable than word-level modeling. In addition, since the dictionaries of different languages may contain phones represented by the same characters, when configuring the mixed dictionary it is necessary to add a corresponding language tag in front of the phones of each language included in the multi-language mixed dictionary, so as to distinguish the phones of the different languages.
For example, the phone sets of both Chinese and English contain phones such as "b" and "d". To distinguish them, a language tag is added in front of all English phones (for example, the prefix "en") so as to distinguish the English phone set from the Chinese phone set, as shown in FIG. 2.
The above language tag may be empty. For example, if two languages exist in the mixed dictionary, a language tag only needs to be added to one of them to tell the two languages apart. Similarly, if three languages exist in the mixed dictionary, language tags only need to be added to two of them to tell the three languages apart, and so on.
In the above mixed dictionary, language tags may also be added only between the phone sets of languages that may be confused. For example, if a mixed dictionary contains Chinese, English and other languages, and only the Chinese and English phone sets may be confused, a language tag only needs to be added in front of the English phone set.
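To make the tagging scheme concrete, the following minimal Python sketch merges two hypothetical monolingual lexicons into a phone-level mixed dictionary, prefixing only the English phone set with an "en" tag; the example words, phone inventories and helper names are illustrative assumptions rather than the patent's actual data.

```python
# Illustrative sketch: merging monolingual lexicons into a phone-level mixed
# dictionary, tagging English phones with an "en" prefix (assumed convention).

# Hypothetical monolingual lexicons: word -> list of phones
chinese_lexicon = {
    "接口": ["j", "ie", "k", "ou"],
    "需要": ["x", "v", "y", "ao"],
}
english_lexicon = {
    "USB": ["y", "uw", "eh", "s", "b", "iy"],
    "iphone": ["ay", "f", "ow", "n"],
}

def tag_phones(lexicon, tag):
    """Prefix every phone with a language tag; an empty tag leaves phones unchanged."""
    prefix = f"{tag}_" if tag else ""
    return {word: [prefix + p for p in phones] for word, phones in lexicon.items()}

def merge_lexicons(tagged_lexicons):
    """Merge several already-tagged lexicons into one mixed dictionary."""
    mixed = {}
    for lex in tagged_lexicons:
        mixed.update(lex)
    return mixed

# Only the English phone set is tagged; the Chinese tag is left empty,
# which is enough to keep the two phone sets apart (cf. FIG. 2).
mixed_dictionary = merge_lexicons([
    tag_phones(chinese_lexicon, ""),      # Chinese phones keep their original names
    tag_phones(english_lexicon, "en"),    # English phones become en_b, en_d, ...
])

for word, phones in mixed_dictionary.items():
    print(word, " ".join(phones))
```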
In this embodiment, after the multi-language mixed dictionary is formed, an acoustic recognition model is trained according to the mixed dictionary and multi-language speech data comprising a plurality of languages. Specifically, the multi-language speech data is training speech data, prepared in advance, that mixes a plurality of different languages, and the mixed dictionary provides the phones of the different languages in the process of forming the acoustic recognition model. Therefore, in the process of training the multi-language mixed acoustic recognition model, in order to obtain the triphone relations of the mixed-language phones, the training needs to be carried out on the basis of the multi-language speech data mixed from the above plurality of languages and the multi-language mixed dictionary formed as described above.
In this embodiment, a language recognition model is then trained according to a multi-language text corpus mixing a plurality of languages, and finally the multi-language mixed dictionary, the acoustic recognition model and the language recognition model are included in a speech recognition system; the speech recognition system recognizes the mixed speech, containing a plurality of languages, input by the user and outputs the recognition result.
In this embodiment, after the above processing, the recognition process for the mixed speech is similar to the prior-art recognition process for single-language speech: the acoustic recognition model recognizes the speech features in a piece of speech data as the corresponding phone or word sequence, and the language recognition model recognizes the word sequence as a complete sentence, thereby completing the recognition of the mixed speech. The recognition process itself is not described in detail herein.
In summary, in the technical solution of the present invention, a multi-language mixed dictionary comprising a plurality of languages is first formed from a plurality of single-language dictionaries, in which the phones of different languages are marked with language tags to distinguish them. An acoustic recognition model is then trained from the multi-language mixed speech data and the multi-language mixed dictionary, and a language recognition model is trained from the multi-language mixed text corpus. Finally, a complete speech recognition system is formed from the multi-language mixed dictionary, the acoustic recognition model and the language recognition model to recognize the multi-language mixed speech input by the user.
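As a structural sketch of how steps S1-S4 fit together, the skeleton below composes the three trained components into one recognizer; every class, function and field name here is a hypothetical placeholder, and the training and decoding internals are deliberately stubbed out.

```python
# Hypothetical skeleton of steps S1-S4; not an actual implementation of the patent.
from dataclasses import dataclass
from typing import Dict, List


def configure_mixed_dictionary(monolingual_lexicons: Dict[str, Dict[str, List[str]]]):
    """Step S1: merge per-language lexicons into one phone-level mixed dictionary (stub)."""
    return {w: p for lex in monolingual_lexicons.values() for w, p in lex.items()}


def train_acoustic_model(speech_data, mixed_dictionary):
    """Step S2: HMM-GMM alignment followed by DNN training (stub)."""
    return "acoustic-recognition-model"


def train_language_model(text_corpus):
    """Step S3: n-gram or recurrent-network language model training (stub)."""
    return "language-recognition-model"


@dataclass
class SpeechRecognitionSystem:
    mixed_dictionary: dict
    acoustic_model: object
    language_model: object

    def recognize(self, mixed_speech):
        # A real system would decode with the acoustic and language models;
        # decoding is outside the scope of this sketch.
        raise NotImplementedError


def build_system(speech_data, text_corpus, monolingual_lexicons):
    """Step S4: assemble the dictionary, acoustic model and language model."""
    mixed_dictionary = configure_mixed_dictionary(monolingual_lexicons)
    return SpeechRecognitionSystem(
        mixed_dictionary=mixed_dictionary,
        acoustic_model=train_acoustic_model(speech_data, mixed_dictionary),
        language_model=train_language_model(text_corpus),
    )
```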
In a preferred embodiment of the present invention, as shown in FIG. 3, the above step S2 specifically comprises:
Step S21, training an acoustic model according to the multi-language speech data mixed from a plurality of different languages and the multi-language mixed dictionary;
Step S22, extracting speech features from the multi-language speech data, and performing a frame alignment operation on the speech features using the acoustic model, so as to obtain the output label corresponding to each frame of the speech features;
Step S23, using the speech features as the input data of the acoustic recognition model, and using the output labels corresponding to the speech features as the output labels in the output layer of the acoustic recognition model, so as to train and form the acoustic recognition model.
Specifically, in this embodiment, before the acoustic recognition model is trained, an acoustic model is first trained from the multi-language speech data mixed from a plurality of different languages. This acoustic model may be a Hidden Markov Model-Gaussian Mixture Model (HMM-GMM). To address the robustness problem of parameter re-estimation faced in triphone modeling, a parameter sharing technique may be used during the training of the acoustic model to reduce the parameter scale. The modeling technology of HMM-GMM acoustic models is by now quite mature and is not described here.
In this embodiment, after the above acoustic model is formed, the acoustic model is used to perform a frame alignment operation on the multi-language speech data, so that the speech features extracted from each frame of the multi-language speech data correspond to one output label. Specifically, after the frame alignment, each frame of speech features corresponds to one GMM number. The output labels in the output layer of the acoustic recognition model are the labels corresponding to each frame of speech features, so the number of output labels in the output layer of the acoustic recognition model equals the number of GMMs in the HMM-GMM model, and each output node corresponds to one GMM.
In this embodiment, the speech features are used as the input data of the acoustic recognition model, and the output labels corresponding to the speech features are used as the output labels in the output layer of the acoustic recognition model, so as to train and form the acoustic recognition model.
FIG. 4 shows the general structure of an acoustic recognition model in an embodiment of the present invention. The acoustic recognition model is a deep neural network model built from a fully connected neural network structure; the network contains 7 fully connected neural network units in total, each layer has 2048 nodes, and a sigmoid non-linear unit is placed between every two layers. Its output layer is implemented with a softmax non-linear unit. s51 in FIG. 4 denotes the output layer of the acoustic recognition model, and L1, L2 and L3 respectively denote the output labels on the output layer associated with different languages.
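The network described for FIG. 4 can be sketched as follows; this is only an illustrative reading of that description (7 fully connected layers of 2048 nodes with sigmoid units and a softmax output layer), and the feature dimension and the number of output labels below are arbitrary example values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def init_dnn(feature_dim, num_output_labels, hidden_dim=2048, num_hidden=7, seed=0):
    """Weights for 7 fully connected sigmoid layers of 2048 nodes plus a softmax output."""
    rng = np.random.default_rng(seed)
    dims = [feature_dim] + [hidden_dim] * num_hidden + [num_output_labels]
    return [(rng.normal(scale=0.01, size=(d_in, d_out)).astype(np.float32),
             np.zeros(d_out, dtype=np.float32))
            for d_in, d_out in zip(dims[:-1], dims[1:])]

def forward(layers, x):
    """x: (batch, feature_dim) -> posterior over output labels (one per GMM/state)."""
    h = x
    for W, b in layers[:-1]:
        h = sigmoid(h @ W + b)
    W, b = layers[-1]
    return softmax(h @ W + b)

# Toy dimensions: the real feature dimension and label count depend on the
# front-end and on the number of GMMs obtained from the HMM-GMM alignment.
# (With the 7 x 2048 layout this allocates roughly 100 MB of float32 weights.)
layers = init_dnn(feature_dim=40, num_output_labels=3000)
posteriors = forward(layers, np.zeros((2, 40)))
print(posteriors.shape)   # (2, 3000)
```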
In a preferred embodiment of the present invention, in the above step S23, after the acoustic recognition model is trained, the output layer of the acoustic recognition model needs to be adjusted, and prior-related operations need to be performed for the multiple languages; as shown in FIG. 5, this specifically comprises:
Step S231, respectively calculating the prior probability of each language, and calculating the prior probability of silence shared by all languages;
Step S232, respectively calculating the posterior probability of each language, and calculating the posterior probability of the silence;
Step S233, adjusting the output layer of the acoustic recognition model according to the prior probability and posterior probability of each language and the prior probability and posterior probability of the silence.
Specifically, in a preferred embodiment of the present invention, when the acoustic recognition model is used for speech recognition, for a given speech feature, the character string of the output result is usually determined by the following formula:
$$\hat{w}=\mathop{\arg\max}_{w}\,P(w)\,P(x\mid w) \qquad (1)$$
where $\hat{w}$ denotes the character string of the output result, $w$ denotes a possible character string, $x$ denotes the input speech feature, $P(w)$ denotes the probability given by the above language recognition model, and $P(x\mid w)$ denotes the probability given by the above acoustic recognition model.
$P(x\mid w)$ can be further expanded as:
$$P(x\mid w)=\sum_{q_0,\ldots,q_T}\pi(q_0)\prod_{t=1}^{T}P(q_t\mid q_{t-1})\,P(x_t\mid q_t) \qquad (2)$$
where $x_t$ denotes the speech feature input at time t, $q_t$ denotes the bound triphone state at time t, $\pi(q_0)$ denotes the probability distribution of the initial state $q_0$, $P(q_t\mid q_{t-1})$ denotes the HMM state transition probability, and $P(x_t\mid q_t)$ denotes the probability that the speech feature is $x_t$ in the state $q_t$.
$P(x_t\mid q_t)$ can be further expanded as:
$$P(x_t\mid q_t)=P(q_t\mid x_t)\,P(x_t)/P(q_t) \qquad (3)$$
where $P(q_t\mid x_t)$ is the posterior probability of the output layer of the above acoustic recognition model, $P(q_t)$ is the prior probability of the above acoustic recognition model, and $P(x_t)$ denotes the probability of $x_t$. $P(x_t)$ is unrelated to the character string sequence and can therefore be ignored.
It can therefore be concluded from formula (3) that the character string of the output result can be adjusted by calculating the prior probability and the posterior probability of the output layer of the acoustic recognition model.
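A minimal numeric sketch of formula (3): during decoding, the DNN posterior P(q|x) is divided by the state prior P(q) to recover a quantity proportional to the acoustic likelihood P(x|q), which a decoder would then combine with the language model score. The numbers below are arbitrary.

```python
import numpy as np

def acoustic_scores(posteriors, priors, eps=1e-10):
    """Pseudo-likelihoods log P(x_t|q), proportional to log(P(q|x_t) / P(q)) per
    formula (3); the constant log P(x_t) is dropped since it does not affect the argmax."""
    return np.log(posteriors + eps) - np.log(priors + eps)

# Toy example: 3 frames, 4 states (arbitrary numbers).
posteriors = np.array([[0.7, 0.1, 0.1, 0.1],
                       [0.2, 0.5, 0.2, 0.1],
                       [0.1, 0.1, 0.2, 0.6]])
priors = np.array([0.4, 0.3, 0.2, 0.1])

log_likes = acoustic_scores(posteriors, priors)

# In a real decoder these frame scores would be combined with HMM transition
# probabilities and the language model probability P(w) when searching for the
# best string w (formula (1)); here we only show the per-frame best state.
print(log_likes.argmax(axis=1))
```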
In a preferred embodiment of the present invention, the prior probability P(q) of the neural network is usually calculated by the following formula:
$$P(q_i)=\frac{Count(q_i)}{N} \qquad (4)$$
where $Count(q_i)$ denotes the total number of labels $q_i$ in the multi-language speech data, and $N$ denotes the total number of all output labels.
In a preferred embodiment of the present invention, since the amounts of training speech data of the different languages may differ, the above prior probability cannot be calculated uniformly and needs to be calculated separately for the different languages.
Therefore, in a preferred embodiment of the present invention, in the above step S231, the prior probability of each language is first calculated respectively, and the prior probability of the silence shared by all languages is calculated.
The prior probability of each language is first calculated respectively according to the following formula:
$$P(q_i^j)=\frac{Count(q_i^j)}{\sum_{m=1}^{M_j}Count(q_m^j)+\sum_{m=1}^{M_{sil}}Count(q_m^{sil})} \qquad (5)$$
where
$q_i^j$ denotes the output label of the i-th state of the j-th language in the multi-language speech data;
$P(q_i^j)$ denotes the prior probability that the output label in the multi-language speech data is $q_i^j$;
$Count(q_i^j)$ denotes the total number of output labels equal to $q_i^j$ in the multi-language speech data;
$q_i^{sil}$ denotes the output label of the i-th state of the silence in the multi-language speech data;
$Count(q_i^{sil})$ denotes the total number of output labels equal to $q_i^{sil}$ in the multi-language speech data;
$M_j$ denotes the total number of states of the j-th language in the multi-language speech data;
$M_{sil}$ denotes the total number of states of the silence in the multi-language speech data.
Subsequently, the prior probability of the silence is calculated according to the following formula:
$$P(q_i^{sil})=\frac{Count(q_i^{sil})}{\sum_{l\in L}\sum_{m=1}^{M_l}Count(q_m^{l})+\sum_{m=1}^{M_{sil}}Count(q_m^{sil})} \qquad (6)$$
where
$P(q_i^{sil})$ denotes the prior probability that the output label in the multi-language speech data is $q_i^{sil}$;
$L$ denotes all languages in the multi-language speech data.
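To illustrate formulas (5) and (6), the sketch below derives the per-language and silence priors from hypothetical frame-alignment counts; the grouping of labels into a per-language dict plus a silence group is an assumption made only for the example.

```python
import numpy as np

def language_priors(counts_per_language, silence_counts):
    """Formulas (5) and (6): each language is normalised over its own states plus
    the shared silence states; silence is normalised over all states."""
    silence_counts = np.asarray(silence_counts, dtype=float)
    sil_total = silence_counts.sum()
    all_lang_total = sum(np.asarray(c, dtype=float).sum()
                         for c in counts_per_language.values())

    priors = {}
    for lang, counts in counts_per_language.items():
        counts = np.asarray(counts, dtype=float)
        priors[lang] = counts / (counts.sum() + sil_total)          # formula (5)
    priors["sil"] = silence_counts / (all_lang_total + sil_total)   # formula (6)
    return priors

# Toy alignment counts: Chinese has far more data than English (arbitrary numbers).
counts = {"zh": [9000, 7000, 8000], "en": [300, 200]}
sil = [1500, 500]

for lang, p in language_priors(counts, sil).items():
    # As noted in the text, each group's priors intentionally sum to less than 1.
    print(lang, np.round(p, 4), round(float(p.sum()), 4))
```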
In a preferred embodiment of the present invention, after the prior probability of each language and the prior probability of the silence are obtained, the posterior probability of the acoustic recognition model is then calculated. The posterior probability $P(q_i\mid x)$ output by the neural network is usually calculated by the output layer; when the output layer is implemented as a softmax nonlinear unit, the posterior probability is usually calculated according to the following formula:
$$P(q_i\mid x)=\frac{\exp(y_i)}{\sum_{n=1}^{N}\exp(y_n)} \qquad (7)$$
where $y_i$ denotes the input value of the i-th state, and $N$ is the number of all states.
Likewise, in the acoustic recognition model, the imbalance in the amounts of training data of the different languages causes an imbalance in the distribution of the state value calculation results of the different languages, so the posterior probability still needs to be calculated separately for the different languages.
Therefore, in a preferred embodiment of the present invention, in the above step S232, the posterior probability of each language is respectively calculated according to the following formula:
$$P(q_i^j\mid x)=\frac{\exp(y_i^j)}{\sum_{m=1}^{M_j}\exp(y_m^j)+\sum_{m=1}^{M_{sil}}\exp(y_m^{sil})} \qquad (8)$$
where
$x$ denotes the speech feature;
$P(q_i^j\mid x)$ denotes the posterior probability that the output label in the multi-language speech data is $q_i^j$;
$y_i^j$ denotes the input data of the i-th state of the j-th language in the multi-language speech data;
$y_i^{sil}$ denotes the input data of the i-th state of the silence;
exp denotes the exponential function.
In a preferred embodiment of the present invention, in step S232, the posterior probability of the silence is calculated according to the following formula:
$$P(q_i^{sil}\mid x)=\frac{\exp(y_i^{sil})}{\sum_{l\in L}\sum_{m=1}^{M_l}\exp(y_m^{l})+\sum_{m=1}^{M_{sil}}\exp(y_m^{sil})} \qquad (9)$$
where
$P(q_i^{sil}\mid x)$ denotes the posterior probability that the output label in the multi-language speech data is $q_i^{sil}$.
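Formulas (8) and (9) replace the global softmax of formula (7) with a normalisation restricted to one language plus the shared silence states; the sketch below applies them to hypothetical output-layer activations grouped by language.

```python
import numpy as np

def grouped_posteriors(activations_per_language, silence_activations):
    """Formulas (8) and (9): each language's states are normalised against that
    language plus the shared silence states; silence against all states."""
    exp_sil = np.exp(np.asarray(silence_activations, dtype=float))
    exp_langs = {lang: np.exp(np.asarray(a, dtype=float))
                 for lang, a in activations_per_language.items()}
    all_lang_sum = sum(e.sum() for e in exp_langs.values())

    post = {}
    for lang, e in exp_langs.items():
        post[lang] = e / (e.sum() + exp_sil.sum())              # formula (8)
    post["sil"] = exp_sil / (all_lang_sum + exp_sil.sum())      # formula (9)
    return post

# Toy output-layer activations y (arbitrary numbers).
y = {"zh": [2.0, 0.5, 1.0], "en": [0.2, 1.5]}
y_sil = [0.3, 0.1]

for lang, p in grouped_posteriors(y, y_sil).items():
    print(lang, np.round(p, 3))
```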
In the present invention, the prior probability and the posterior probability of each language and of the silence state can be calculated using the improved formulas (6)-(9) above, so that the acoustic recognition model meets the output requirements of multi-language mixed modeling and can describe each language and the silence state more accurately. It should be noted that, after the adjustment by the above formulas, the sums of the prior probabilities and of the posterior probabilities are no longer equal to 1.
In a preferred embodiment of the present invention, in the above step S3, the language recognition model may be formed by training an n-Gram model, or by training a recurrent neural network. The above multi-language text corpus needs to include separate text corpora of the individual languages as well as multi-language mixed text data.
In a preferred embodiment of the present invention, after the speech recognition system is formed, weight adjustment is first performed on the different languages in the speech recognition system;
the steps of performing the weight adjustment are shown in FIG. 6 and comprise:
Step A1, respectively determining a posterior probability weight value of each language according to real speech data;
Step A2, respectively adjusting the posterior probability of each language according to the posterior probability weight values, so as to complete the weight adjustment.
Specifically, in this embodiment, after the above speech recognition system is formed, the amount of training data may be unbalanced during the training process, so a language with a relatively large amount of data obtains a relatively large prior probability. Since the final recognition probability is the posterior probability divided by the prior probability, the actual recognition probability of the language with more training data is instead smaller, which may cause the recognition results of the recognition system to tend to recognize one language while failing to recognize another, causing a bias in the recognition results.
To solve this problem, before the above speech recognition system is put into practical use, it needs to be measured on a development set composed of real data so as to adjust the weight of each language. The above weight adjustment is usually applied to the posterior probabilities output by the acoustic recognition model, so its formula is as follows:
$$\hat{P}(q_i^j\mid x)=a_j\,P(q_i^j\mid x) \qquad (10)$$
where
$q_i^j$ denotes the output label of the i-th state of the j-th language in the multi-language speech data;
$x$ denotes the speech feature;
$P(q_i^j\mid x)$ denotes the posterior probability that the output label in the multi-language speech data is $q_i^j$;
$a_j$ denotes the posterior probability weight value of the j-th language in the multi-language speech data, and this posterior probability weight value is determined by measuring the acoustic recognition model on a development set composed of the above real data;
$\hat{P}(q_i^j\mid x)$ denotes the posterior probability, after the weight adjustment, that the output label in the multi-language speech data is $q_i^j$.
Through the above weight adjustment, the speech recognition system can obtain good recognition results in different application scenarios.
In a preferred embodiment of the present invention, for a speech recognition system mixing Chinese and English, after measurement on real data the posterior probability weight value of Chinese may be set to 1.0, the posterior probability weight value of English to 0.3, and the posterior probability weight value of the silence to 1.0.
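Formula (10) is a per-language scaling of the grouped posteriors; the sketch below applies the example weights mentioned above (Chinese 1.0, English 0.3, silence 1.0), which in practice would be re-tuned on development sets of real recordings.

```python
def apply_posterior_weights(posteriors_per_language, weights):
    """Formula (10): scale each language's posteriors by its weight a_j."""
    return {lang: weights.get(lang, 1.0) * p
            for lang, p in posteriors_per_language.items()}

# Example weights from the embodiment above; other values would be found by
# re-measuring on development sets composed of real data.
weights = {"zh": 1.0, "en": 0.3, "sil": 1.0}

# Hypothetical grouped posteriors (plain floats here for brevity).
posteriors = {"zh": 0.62, "en": 0.25, "sil": 0.08}
print(apply_posterior_weights(posteriors, weights))
```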
In other embodiments of the present invention, the above posterior probability weight values may be adjusted repeatedly by using development sets composed of different real data several times, so as to finally determine the optimal values.
The above are only preferred embodiments of the present invention and are not intended to limit the embodiments or the scope of protection of the present invention. Those skilled in the art should appreciate that all solutions obtained by equivalent substitutions and obvious variations made on the basis of the description and drawings of the present invention shall fall within the scope of protection of the present invention.

Claims (14)

  1. A multi-language mixed speech recognition method, characterized in that a speech recognition system for recognizing multi-language mixed speech is first formed, the method of forming the speech recognition system comprising:
    Step S1, configuring a multi-language mixed dictionary comprising a plurality of different languages;
    Step S2, training an acoustic recognition model according to the multi-language mixed dictionary and multi-language speech data comprising a plurality of different languages;
    Step S3, training a language recognition model according to a multi-language text corpus comprising a plurality of different languages;
    Step S4, forming the speech recognition system using the multi-language mixed dictionary, the acoustic recognition model and the language recognition model;
    subsequently, recognizing the mixed speech with the speech recognition system, and outputting a corresponding recognition result.
  2. The multi-language mixed speech recognition method according to claim 1, characterized in that, in the step S1, the multi-language mixed dictionary is configured by means of triphone modeling according to single-language dictionaries respectively corresponding to each different language.
  3. The multi-language mixed speech recognition method according to claim 1, characterized in that, in the step S1, the multi-language mixed dictionary is configured by means of triphone modeling;
    when configuring the multi-language mixed dictionary, a corresponding language tag is added in front of the phones of each language included in the multi-language mixed dictionary, so as to distinguish the phones of the plurality of different languages.
  4. The multi-language mixed speech recognition method according to claim 1, characterized in that the step S2 specifically comprises:
    Step S21, training an acoustic model according to the multi-language speech data and the multi-language mixed dictionary;
    Step S22, extracting speech features from the multi-language speech data, and performing a frame alignment operation on the speech features using the acoustic model, so as to obtain an output label corresponding to each frame of the speech features;
    Step S23, using the speech features as input data of the acoustic recognition model, and using the output labels corresponding to the speech features as the output labels in the output layer of the acoustic recognition model, so as to train and form the acoustic recognition model.
  5. The multi-language mixed speech recognition method according to claim 4, characterized in that the acoustic model is a hidden Markov-Gaussian mixture model.
  6. The multi-language mixed speech recognition method according to claim 4, characterized in that, in the step S23, after the acoustic recognition model is trained, the output layer of the acoustic recognition model is adjusted, which specifically comprises:
    Step S231, respectively calculating a prior probability of each language, and calculating a prior probability of silence shared by all languages;
    Step S232, respectively calculating a posterior probability of each language, and calculating a posterior probability of the silence;
    Step S233, adjusting the output layer of the acoustic recognition model according to the prior probability and the posterior probability of each language and the prior probability and the posterior probability of the silence.
  7. The multi-language mixed speech recognition method according to claim 6, characterized in that, in the step S231, the prior probability of each language is respectively calculated according to the following formula:
    $$P(q_i^j)=\frac{Count(q_i^j)}{\sum_{m=1}^{M_j}Count(q_m^j)+\sum_{m=1}^{M_{sil}}Count(q_m^{sil})}$$
    wherein,
    $q_i^j$ denotes the output label of the i-th state of the j-th language in the multi-language speech data;
    $P(q_i^j)$ denotes the prior probability that the output label in the multi-language speech data is $q_i^j$;
    $Count(q_i^j)$ denotes the total number of output labels equal to $q_i^j$ in the multi-language speech data;
    $q_i^{sil}$ denotes the output label of the i-th state of the silence in the multi-language speech data;
    $Count(q_i^{sil})$ denotes the total number of output labels equal to $q_i^{sil}$ in the multi-language speech data;
    $M_j$ denotes the total number of states of the j-th language in the multi-language speech data;
    $M_{sil}$ denotes the total number of states of the silence in the multi-language speech data.
  8. The multi-language mixed speech recognition method according to claim 6, characterized in that, in the step S231, the prior probability of the silence is calculated according to the following formula:
    $$P(q_i^{sil})=\frac{Count(q_i^{sil})}{\sum_{l\in L}\sum_{m=1}^{M_l}Count(q_m^{l})+\sum_{m=1}^{M_{sil}}Count(q_m^{sil})}$$
    wherein,
    $q_i^{sil}$ denotes the output label of the i-th state of the silence in the multi-language speech data;
    $P(q_i^{sil})$ denotes the prior probability that the output label in the multi-language speech data is $q_i^{sil}$;
    $Count(q_i^{sil})$ denotes the total number of output labels equal to $q_i^{sil}$ in the multi-language speech data;
    $q_i^j$ denotes the output label of the i-th state of the j-th language in the multi-language speech data;
    $Count(q_i^j)$ denotes the total number of output labels equal to $q_i^j$ in the multi-language speech data;
    $M_j$ denotes the total number of states of the j-th language in the multi-language speech data;
    $M_{sil}$ denotes the total number of states of the silence in the multi-language speech data;
    $L$ denotes all languages in the multi-language speech data.
  9. The multi-language mixed speech recognition method according to claim 6, characterized in that, in the step S232, the posterior probability of each language is respectively calculated according to the following formula:
    $$P(q_i^j\mid x)=\frac{\exp(y_i^j)}{\sum_{m=1}^{M_j}\exp(y_m^j)+\sum_{m=1}^{M_{sil}}\exp(y_m^{sil})}$$
    wherein,
    $q_i^j$ denotes the output label of the i-th state of the j-th language in the multi-language speech data;
    $x$ denotes the speech feature;
    $P(q_i^j\mid x)$ denotes the posterior probability that the output label in the multi-language speech data is $q_i^j$;
    $y_i^j$ denotes the input data of the i-th state of the j-th language in the multi-language speech data;
    $y_i^{sil}$ denotes the input data of the i-th state of the silence;
    $M_j$ denotes the total number of states of the j-th language in the multi-language speech data;
    $M_{sil}$ denotes the total number of states of the silence in the multi-language speech data;
    exp denotes the exponential function.
  10. The multi-language mixed speech recognition method according to claim 6, characterized in that, in the step S232, the posterior probability of the silence is calculated according to the following formula:
    $$P(q_i^{sil}\mid x)=\frac{\exp(y_i^{sil})}{\sum_{l\in L}\sum_{m=1}^{M_l}\exp(y_m^{l})+\sum_{m=1}^{M_{sil}}\exp(y_m^{sil})}$$
    wherein,
    $q_i^{sil}$ denotes the output label of the i-th state of the silence in the multi-language speech data;
    $x$ denotes the speech feature;
    $P(q_i^{sil}\mid x)$ denotes the posterior probability that the output label in the multi-language speech data is $q_i^{sil}$;
    $y_i^j$ denotes the input data of the i-th state of the j-th language in the multi-language speech data;
    $y_i^{sil}$ denotes the input data of the i-th state of the silence;
    $M_j$ denotes the total number of states of the j-th language in the multi-language speech data;
    $M_{sil}$ denotes the total number of states of the silence in the multi-language speech data;
    $L$ denotes all languages in the multi-language speech data;
    exp denotes the exponential function.
  11. The multi-language mixed speech recognition method according to claim 1, characterized in that, in the step S2, the acoustic recognition model is a deep neural network acoustic model.
  12. The multi-language mixed speech recognition method according to claim 1, characterized in that, in the step S3, the language recognition model is formed by training an n-Gram model, or the language recognition model is formed by training a recurrent neural network.
  13. The multi-language mixed speech recognition method according to claim 4, characterized in that, after the speech recognition system is formed, weight adjustment is first performed on the different languages in the speech recognition system;
    the step of performing the weight adjustment comprises:
    Step A1, respectively determining a posterior probability weight value of each language according to real speech data;
    Step A2, respectively adjusting the posterior probability of each language according to the posterior probability weight values, so as to complete the weight adjustment.
  14. The multi-language mixed speech recognition method according to claim 13, characterized in that, in the step A2, the weight adjustment is performed according to the following formula:
    $$\hat{P}(q_i^j\mid x)=a_j\,P(q_i^j\mid x)$$
    wherein,
    $q_i^j$ denotes the output label of the i-th state of the j-th language in the multi-language speech data;
    $x$ denotes the speech feature;
    $P(q_i^j\mid x)$ denotes the posterior probability that the output label in the multi-language speech data is $q_i^j$;
    $a_j$ denotes the posterior probability weight value of the j-th language in the multi-language speech data;
    $\hat{P}(q_i^j\mid x)$ denotes the posterior probability, after the weight adjustment, that the output label in the multi-language speech data is $q_i^j$.
PCT/CN2018/074314 2017-02-24 2018-01-26 Multi-language mixed speech recognition method WO2018153213A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/487,279 US11151984B2 (en) 2017-02-24 2018-01-26 Multi-language mixed speech recognition method

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710103972.7 2017-02-24
CN201710103972.7A CN108510976B (zh) 2017-02-24 2017-02-24 Multi-language mixed speech recognition method

Publications (1)

Publication Number Publication Date
WO2018153213A1 true WO2018153213A1 (zh) 2018-08-30

Family

ID=63254098

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/074314 WO2018153213A1 (zh) 2017-02-24 2018-01-26 一种多语言混合语音识别方法

Country Status (3)

Country Link
US (1) US11151984B2 (zh)
CN (1) CN108510976B (zh)
WO (1) WO2018153213A1 (zh)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110797016A (zh) * 2019-02-26 2020-02-14 北京嘀嘀无限科技发展有限公司 一种语音识别方法、装置、电子设备及存储介质
CN111369978A (zh) * 2018-12-26 2020-07-03 北京搜狗科技发展有限公司 一种数据处理方法、装置和用于数据处理的装置
WO2020211350A1 (zh) * 2019-04-19 2020-10-22 平安科技(深圳)有限公司 语音语料训练方法、装置、计算机设备和存储介质
CN113205795A (zh) * 2020-01-15 2021-08-03 普天信息技术有限公司 多语种混说语音的语种识别方法及装置
CN113782000A (zh) * 2021-09-29 2021-12-10 北京中科智加科技有限公司 一种基于多任务的语种识别方法
CN114398468A (zh) * 2021-12-09 2022-04-26 广东外语外贸大学 一种多语种识别方法和系统

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110970018B (zh) * 2018-09-28 2022-05-27 珠海格力电器股份有限公司 语音识别方法和装置
CN109493846B (zh) * 2018-11-18 2021-06-08 深圳市声希科技有限公司 一种英语口音识别系统
CN110491382B (zh) 2019-03-11 2020-12-04 腾讯科技(深圳)有限公司 基于人工智能的语音识别方法、装置及语音交互设备
CN111862961A (zh) * 2019-04-29 2020-10-30 京东数字科技控股有限公司 识别语音的方法和装置
CN111916062B (zh) * 2019-05-07 2024-07-26 阿里巴巴集团控股有限公司 语音识别方法、装置和系统
CN112364658B (zh) * 2019-07-24 2024-07-26 阿里巴巴集团控股有限公司 翻译以及语音识别方法、装置、设备
CN110517664B (zh) * 2019-09-10 2022-08-05 科大讯飞股份有限公司 多方言识别方法、装置、设备及可读存储介质
CN110580908A (zh) * 2019-09-29 2019-12-17 出门问问信息科技有限公司 一种支持不同语种的命令词检测方法及设备
CN112837674B (zh) * 2019-11-22 2024-06-11 阿里巴巴集团控股有限公司 语音识别方法、装置及相关系统和设备
CN111508505B (zh) * 2020-04-28 2023-11-03 讯飞智元信息科技有限公司 一种说话人识别方法、装置、设备及存储介质
CN113014854B (zh) * 2020-04-30 2022-11-11 北京字节跳动网络技术有限公司 互动记录的生成方法、装置、设备及介质
CN111968646B (zh) * 2020-08-25 2023-10-13 腾讯科技(深圳)有限公司 一种语音识别方法及装置
CN112652311B (zh) * 2020-12-01 2021-09-03 北京百度网讯科技有限公司 中英文混合语音识别方法、装置、电子设备和存储介质
CN112652300B (zh) * 2020-12-24 2024-05-17 百果园技术(新加坡)有限公司 多方言语音识别方法、装置、设备和存储介质
CN114078475B (zh) * 2021-11-08 2023-07-25 北京百度网讯科技有限公司 语音识别和更新方法、装置、设备和存储介质
US12106753B2 (en) * 2022-03-08 2024-10-01 Microsoft Technology Licensing, Llc Code-mixed speech recognition using attention and language-specific joint analysis
CN116386609A (zh) * 2023-04-14 2023-07-04 南通大学 一种中英混合语音识别方法

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW200421263A (en) * 2003-04-10 2004-10-16 Delta Electronics Inc Speech recognition device and method using di-phone model to realize the mixed-multi-lingual global phoneme
CN101604522A (zh) * 2009-07-16 2009-12-16 北京森博克智能科技有限公司 非特定人的嵌入式中英文混合语音识别方法及系统
CN101826325A (zh) * 2010-03-10 2010-09-08 华为终端有限公司 对中英文语音信号进行识别的方法和装置

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103578464B (zh) * 2013-10-18 2017-01-11 威盛电子股份有限公司 语言模型的建立方法、语音辨识方法及电子装置
US10235994B2 (en) * 2016-03-04 2019-03-19 Microsoft Technology Licensing, Llc Modular deep learning model
CN106228976B (zh) * 2016-07-22 2019-05-31 百度在线网络技术(北京)有限公司 语音识别方法和装置
CN107633842B (zh) * 2017-06-12 2018-08-31 平安科技(深圳)有限公司 语音识别方法、装置、计算机设备及存储介质

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW200421263A (en) * 2003-04-10 2004-10-16 Delta Electronics Inc Speech recognition device and method using di-phone model to realize the mixed-multi-lingual global phoneme
CN101604522A (zh) * 2009-07-16 2009-12-16 北京森博克智能科技有限公司 非特定人的嵌入式中英文混合语音识别方法及系统
CN101826325A (zh) * 2010-03-10 2010-09-08 华为终端有限公司 对中英文语音信号进行识别的方法和装置

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
ADDA-DECKER, M.: "Towards Multilingual Interoperability in Automatic Speech Recognition", SPEECH COMMUNICATION, vol. 35, no. 1-2, 30 August 2001 (2001-08-30), pages 5 - 20, XP055342504, Retrieved from the Internet <URL:https://doi.org/10.1016/S0167-6393(00)00092-3> *
SHIH, P.Y. ET AL.: "Acoustic and Phoneme Modeling Based on Confusion Matrix for Ubiquitous Mixed-Language Speech Recognition", IEEE INTERNATIONAL CONFERENCE ON SENSOR NETWORKS, vol. 36, no. 11, 30 November 2008 (2008-11-30), pages 500 - 506 *
WANG, SHIJIN ET AL.: "Multilingual-Based PRLM for Language Identification", JOURNAL OF TSINGHUA UNIVERSITY ( SCIENCE AND TECHNOLOGY, vol. 48, no. S1, 15 April 2008 (2008-04-15), pages 678 - 682, ISSN: 1000-0054 *
YAO, HAITAO ET AL.: "Multilingual Acoustic Modelling for Automatic Speech Recognition", PROCEEDINGS OF THE 11TH YOUTH ACADEMIC CONFERENCE OF THE ACOUSTICAL SOCIETY OF CHINA, 15 October 2015 (2015-10-15), pages 404 - 406 *
YU , SHENGMIN ET AL.: "Research of Chinese -English Bilingual Acoustic Modelling", JOURNAL OF CHINESE INFORMATION PROCESSING, vol. 18, no. 5, 25 September 2004 (2004-09-25), pages 78 - 83, ISSN: 1003-0077 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111369978A (zh) * 2018-12-26 2020-07-03 北京搜狗科技发展有限公司 一种数据处理方法、装置和用于数据处理的装置
CN111369978B (zh) * 2018-12-26 2024-05-17 北京搜狗科技发展有限公司 一种数据处理方法、装置和用于数据处理的装置
CN110797016A (zh) * 2019-02-26 2020-02-14 北京嘀嘀无限科技发展有限公司 一种语音识别方法、装置、电子设备及存储介质
WO2020211350A1 (zh) * 2019-04-19 2020-10-22 平安科技(深圳)有限公司 语音语料训练方法、装置、计算机设备和存储介质
CN113205795A (zh) * 2020-01-15 2021-08-03 普天信息技术有限公司 多语种混说语音的语种识别方法及装置
CN113782000A (zh) * 2021-09-29 2021-12-10 北京中科智加科技有限公司 一种基于多任务的语种识别方法
CN113782000B (zh) * 2021-09-29 2022-04-12 北京中科智加科技有限公司 一种基于多任务的语种识别方法
CN114398468A (zh) * 2021-12-09 2022-04-26 广东外语外贸大学 一种多语种识别方法和系统

Also Published As

Publication number Publication date
CN108510976B (zh) 2021-03-19
US11151984B2 (en) 2021-10-19
CN108510976A (zh) 2018-09-07
US20190378497A1 (en) 2019-12-12

Similar Documents

Publication Publication Date Title
WO2018153213A1 (zh) Multi-language mixed speech recognition method
AU2019395322B2 (en) Reconciliation between simulated data and speech recognition output using sequence-to-sequence mapping
US9711139B2 (en) Method for building language model, speech recognition method and electronic apparatus
Besacier et al. Automatic speech recognition for under-resourced languages: A survey
EP3723084A1 (en) Facilitating end-to-end communications with automated assistants in multiple languages
US9613621B2 (en) Speech recognition method and electronic apparatus
US20150112674A1 (en) Method for building acoustic model, speech recognition method and electronic apparatus
CN110852075B (zh) 自动添加标点符号的语音转写方法、装置及可读存储介质
CN103632663B (zh) 一种基于hmm的蒙古语语音合成前端处理的方法
TWI659411B (zh) 一種多語言混合語音識別方法
CN111489746A (zh) 一种基于bert的电网调度语音识别语言模型构建方法
TWI467566B (zh) 多語言語音合成方法
CN105895076B (zh) 一种语音合成方法及系统
CN114254649A (zh) 一种语言模型的训练方法、装置、存储介质及设备
Lin et al. Hierarchical prosody modeling for Mandarin spontaneous speech
Liu et al. A maximum entropy based hierarchical model for automatic prosodic boundary labeling in mandarin
Mustafa et al. Developing an HMM-based speech synthesis system for Malay: a comparison of iterative and isolated unit training
Saychum et al. Efficient Thai Grapheme-to-Phoneme Conversion Using CRF-Based Joint Sequence Modeling.
Abudubiyaz et al. The acoustical and language modeling issues on Uyghur speech recognition
Elfahal Automatic recognition and identification for mixed sudanese arabic–english languages speech
TW201909165A (zh) 藉由標點符號所啟發之語言特徵並運用於國語韻律生成之方法及系統
Li et al. Education of Recognition Training Combined with Hidden Markov Model to Explore English Speaking
JP2023006055A (ja) プログラム、情報処理装置、方法
Kuo et al. Some studies on Min-nan speech processing
Silamu et al. HMM-based uyghur continuous speech recognition system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18756783

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18756783

Country of ref document: EP

Kind code of ref document: A1