CN113160804B - Hybrid voice recognition method and device, storage medium and electronic device - Google Patents


Info

Publication number
CN113160804B
CN113160804B
Authority
CN
China
Prior art keywords
english
model
chinese
phoneme
word
Prior art date
Legal status
Active
Application number
CN202110219826.7A
Other languages
Chinese (zh)
Other versions
CN113160804A (en)
Inventor
黄石磊
王昕
程刚
Current Assignee
Shenzhen Raixun Information Technology Co., Ltd.
Original Assignee
Shenzhen Raixun Information Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Shenzhen Raixun Information Technology Co., Ltd.
Priority to CN202110219826.7A
Publication of CN113160804A
Application granted
Publication of CN113160804B
Legal status: Active


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/005 Language recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/08 Speech classification or search
    • G10L 2015/025 Phonemes, fenemes or fenones being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention provides a hybrid speech recognition method and device, a storage medium, and an electronic device. The method comprises the following steps: acquiring mixed speech to be subjected to phoneme recognition, wherein the mixed speech comprises Chinese words and English words; extracting English non-abbreviated words from the mixed speech; and recognizing first phoneme information of the English non-abbreviated words by using a first preset grapheme-to-phoneme (G2P) model, wherein the first preset G2P model is trained on the decoding results of Chinese phonemes and comprises mapping sequences between English words and Chinese phonemes. The invention saves labor cost while seeking mapping labels with high acoustic similarity, thereby producing English pronunciations of reliable quality, and solves the technical problem in the related art of low efficiency in phoneme-based recognition of mixed speech.

Description

Hybrid voice recognition method and device, storage medium and electronic device
Technical Field
The invention relates to the field of speech recognition, and in particular to a hybrid speech recognition method and device, a storage medium, and an electronic device.
Background
In the related art, hybrid Chinese-English speech recognition refers to Automatic Speech Recognition (ASR) of utterances that contain both Chinese and English. As English becomes increasingly widespread, mixed Chinese-English communication has gradually become a common phenomenon among Chinese speakers. In the Chinese-English conversation of Chinese speakers, Chinese remains the main language. According to the type of code switching, such speech can be divided into "intra-sentence switching", i.e. inserting English words into Chinese sentences, and "inter-sentence switching", i.e. switching between whole Chinese and English sentences. Compared with traditional single-language automatic speech recognition, the challenge of mixed-language speech recognition, particularly recognition with intra-sentence switching, is the lack of sufficient speech/text data to train acoustic/language models for the scenario. In addition, for the intra-sentence switching type, on the premise of having a Chinese acoustic model with relatively sufficient training data, it is desirable to extend the English recognition capability of the Chinese acoustic model. The key technology is to obtain high-quality pronunciations of English words, i.e. to represent the pronunciation of an English word by a reliable Chinese phoneme sequence, so as to construct a mixed Chinese-English pronunciation dictionary. Meanwhile, English content is retained in the n-gram language model, so that mixed Chinese-English speech recognition is realized to a certain extent. Over the past decades, efforts have been made to study acoustic-data-driven learning of pronunciation dictionaries for speech recognition, i.e. automatic phonetic transcription of words for which audio is available but labels are not.
In practical applications, a set of phoneme units and a pronunciation dictionary based on expert knowledge are generally available, but they cannot cover the pronunciations of many out-of-vocabulary (OOV) words. The most straightforward approach is to train a G2P (grapheme-to-phoneme) model using a seed dictionary based on expert knowledge, and then generate pronunciations of OOV words with that model. However, for some proper nouns, G2P may not give suitable labels, whereas decoding the phonemes acoustically gives a pronunciation of the word close to the real scene. In practice, the phoneme decoding method often works together with the G2P tool to give multiple candidate pronunciations of a word.
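The combination of dictionary lookup and G2P fallback for OOV words can be sketched as below; the seed entries and the letter-level G2P stub are toy assumptions standing in for a trained model, not the patent's actual components:

```python
# Sketch: dictionary-first lookup with a G2P fallback for OOV words.
# SEED_DICT entries and the toy letter map are hypothetical examples.
SEED_DICT = {
    "office": [["ao", "f", "i", "s"]],  # hypothetical Chinese-phoneme mapping
}

def toy_g2p(word):
    """Stand-in for a trained G2P model: emits one phoneme per letter."""
    letter_map = {"a": "ei", "p": "p", "h": "h", "o": "ou", "n": "n", "e": "e"}
    return [[letter_map.get(ch, ch) for ch in word]]

def candidate_pronunciations(word):
    # Expert-knowledge dictionary entries take priority;
    # fall back to the G2P model only for OOV words.
    if word in SEED_DICT:
        return SEED_DICT[word]
    return toy_g2p(word)
```

In a real system both sources would typically be kept, yielding multiple candidates per word for later acoustic rescoring.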
In the related art, for a mixed Chinese-English speech recognition task, the current mainstream ASR framework trains a hybrid acoustic model and a hybrid language model. A traditional mixed Chinese-English acoustic model requires a large amount of mixed Chinese-English speech and label data as training material. However, compared with Chinese (single-language) training data, mixed Chinese-English data are scarce, and the cost of retraining a dedicated acoustic model for mixed recognition is high. Merging and processing the Chinese and English phoneme sets that serve as the modeling units of the acoustic model is also a difficult problem. Moreover, if a conventional English acoustic model (mapping English words to English phonemes) is used for recognition, model switching is required, which affects recognition time. In addition, the English pronunciation of a native Chinese speaker differs from that of a native English speaker.
In view of the above problems in the related art, no effective solution has been found at present.
Disclosure of Invention
Embodiments of the invention provide a hybrid speech recognition method and device, a storage medium, and an electronic device.
According to an embodiment of the present invention, there is provided a hybrid speech recognition method comprising: acquiring mixed speech to be subjected to phoneme recognition, wherein the mixed speech comprises Chinese words and English words; extracting English non-abbreviated words from the mixed speech; and recognizing first phoneme information of the English non-abbreviated words by using a first preset grapheme-to-phoneme (G2P) model, wherein the first preset G2P model is trained on the decoding results of Chinese phonemes and comprises mapping sequences between English words and Chinese phonemes.
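The extraction step can be illustrated with a minimal token classifier. The splitting rule below is an illustrative assumption (the patent does not specify one): an all-uppercase English token is treated as an abbreviation to be spelled letter by letter, and any other alphabetic token as a non-abbreviated English word:

```python
import re

def classify_tokens(text):
    """Split a mixed Chinese/English string and tag each token.

    Heuristic (an assumption, not the patent's exact rule): all-uppercase
    English tokens are abbreviations; other alphabetic tokens are words.
    """
    tokens = re.findall(r"[\u4e00-\u9fff]+|[A-Za-z]+", text)
    tagged = []
    for tok in tokens:
        if re.fullmatch(r"[\u4e00-\u9fff]+", tok):
            tagged.append((tok, "chinese"))        # CJK run -> Chinese words
        elif tok.isupper() and len(tok) > 1:
            tagged.append((tok, "en_abbrev"))      # e.g. "CBD", "APP"
        else:
            tagged.append((tok, "en_word"))        # non-abbreviated English
    return tagged
```

Each tag then routes the token to the corresponding phoneme source: the second preset G2P model, the letter pronunciation table, or the first preset G2P model.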
Optionally, before the first phoneme information of the English words is recognized by using the first preset G2P model, the method further includes: generating a seed dictionary for a specified word set through phoneme decoding and a preference algorithm, wherein the specified word set consists of the English words in a mixed Chinese-English sample corpus; and training with the seed dictionary to generate the first preset G2P model, wherein the first preset G2P model is a G2P model based on a Seq2Seq network.
Optionally, generating the seed dictionary for the specified word set through phoneme decoding and the preference algorithm includes: for each English word in the specified word set, acquiring the designated speech segment of the English word from the mixed audio of the Chinese-English sample corpus; decoding the designated segment into a Chinese phoneme sequence using a Chinese phoneme-level decoding network, wherein the Chinese phoneme-level decoding network comprises a Chinese acoustic model and a phoneme-level language model; and generating the seed dictionary corresponding to the specified word set from the Chinese phoneme sequence.
Optionally, generating the seed dictionary corresponding to the specified word set from the Chinese phoneme sequence includes: for each English word in the specified word set, embedding the candidate pronunciations into the corresponding mixed audio and calculating the average pronunciation posterior probability; and ranking the candidate pronunciations by posterior probability to obtain several optimal pronunciation results, which are integrated to generate the seed dictionary corresponding to the specified word set.
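The ranking step can be sketched as follows; the candidate lists and score table are hypothetical stand-ins for the per-utterance posteriors produced by embedding each candidate pronunciation into the audio:

```python
def select_best_pronunciations(candidates, scores, top_n=2):
    """Rank candidate pronunciations by average posterior and keep the top n.

    `scores` maps each candidate (as a tuple of phonemes) to a list of
    per-utterance posterior probabilities; both inputs are hypothetical
    stand-ins for the acoustic rescoring step described above.
    """
    def avg_posterior(cand):
        vals = scores[tuple(cand)]
        return sum(vals) / len(vals)

    return sorted(candidates, key=avg_posterior, reverse=True)[:top_n]
```

Averaging over all utterances containing the word dampens the effect of a single badly pronounced or noisy occurrence.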
Optionally, obtaining the designated speech segment of the English word from the mixed audio of the Chinese-English sample corpus includes: training a Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) using the Chinese-English sample corpus, a Chinese pronunciation dictionary, and an English pronunciation dictionary; obtaining the segment timestamps of the English word by forced alignment with the trained mixed GMM-HMM model; and extracting the designated speech segment of the English word from the mixed audio based on the segment timestamps.
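Once forced alignment yields start/end timestamps for a word, cutting the corresponding samples out of the mixed audio is simple index arithmetic. A minimal sketch (16 kHz sample rate and the timestamp values are assumptions for illustration):

```python
def extract_segment(samples, start_sec, end_sec, sample_rate=16000):
    """Cut a word's audio segment out of the mixed utterance using the
    start/end timestamps produced by forced alignment."""
    start = int(start_sec * sample_rate)
    end = int(end_sec * sample_rate)
    return samples[start:end]
```

The extracted segment is what the Chinese phoneme-level decoding network then transcribes into candidate Chinese phoneme sequences.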
Optionally, generating the first preset G2P model by training with the seed dictionary includes: training an initial G2P model with the seed dictionary to obtain a first G2P model; re-decoding the specified word set with the first G2P model to generate n + i prediction results for each English word, wherein the mapping sequence of each English word in the seed dictionary comprises n candidate pronunciations, n is an integer greater than 1, and i is an integer greater than 0; and re-selecting the best n candidate pronunciations from the n candidate pronunciations and the n + i prediction results as training data, continuing to iteratively train the first G2P model until the error rate of the prediction results on the test set meets a preset condition, and determining the latest G2P model as the first preset G2P model.
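The iterative train-predict-reselect loop can be sketched as a generic skeleton; every callback (training, prediction, candidate selection, error-rate evaluation) is a hypothetical stand-in for the components described above:

```python
def iterative_g2p_training(train_fn, predict_fn, select_fn, seed_dict,
                           error_rate_fn, threshold, max_rounds=10):
    """Skeleton of the iterative self-training loop: train on the current
    dictionary, predict extra candidates (n + i results), re-select the best
    n, and stop once the test-set error rate meets the preset condition."""
    dictionary = dict(seed_dict)
    model = None
    for _ in range(max_rounds):
        model = train_fn(dictionary)
        for word, cands in dictionary.items():
            extra = predict_fn(model, word)               # n + i predictions
            dictionary[word] = select_fn(cands + extra, len(cands))
        if error_rate_fn(model) <= threshold:
            break
    return model, dictionary
```

In the toy test below the "model" is just a copy of the dictionary and selection sorts candidates, which is enough to exercise the control flow.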
Optionally, after the mixed speech to be phoneme-recognized is obtained, the method further includes: extracting Chinese words and English abbreviated words from the mixed speech; recognizing the Chinese words with a second preset G2P model to obtain second phoneme information of the Chinese words, and spelling the English abbreviated words with a letter pronunciation table to obtain third phoneme information of the English abbreviated words, wherein the second preset G2P model comprises mapping sequences between Chinese words and Chinese phonemes; and combining the first phoneme information, the second phoneme information, and the third phoneme information to obtain the mixed phonemes of the mixed speech.
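Spelling an abbreviation with a letter pronunciation table amounts to concatenating per-letter phoneme sequences. The table entries below are hypothetical examples, not the patent's actual letter pronunciations:

```python
# Hypothetical letter-to-Chinese-phoneme table; real entries would come
# from the letter pronunciation table described above.
LETTER_TABLE = {
    "A": ["ei"],
    "P": ["p", "i"],
    "C": ["x", "i"],
}

def spell_abbreviation(word):
    """Spell an English abbreviation letter by letter using the table."""
    phonemes = []
    for letter in word.upper():
        phonemes.extend(LETTER_TABLE[letter])
    return phonemes
```

The resulting sequence is the third phoneme information, merged with the Chinese-word and non-abbreviated-word phonemes to form the mixed phonemes.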
According to another embodiment of the present invention, there is provided a hybrid speech recognition apparatus comprising: an acquisition module, configured to acquire mixed speech to be subjected to phoneme recognition, wherein the mixed speech comprises Chinese words and English words; an extraction module, configured to extract English non-abbreviated words from the mixed speech; and a recognition module, configured to recognize first phoneme information of the English non-abbreviated words by using a first preset grapheme-to-phoneme (G2P) model, wherein the first preset G2P model is trained on the decoding results of Chinese phonemes and comprises mapping sequences between English words and Chinese phonemes.
Optionally, the apparatus further comprises: a first generation module, configured to generate a seed dictionary for a specified word set through phoneme decoding and a preference algorithm before the recognition module recognizes the first phoneme information of the English words with the first preset G2P model, wherein the specified word set consists of the English words in a mixed Chinese-English sample corpus; and a second generation module, configured to generate the first preset G2P model by training with the seed dictionary, wherein the first preset G2P model is a G2P model based on a Seq2Seq network.
Optionally, the first generation module includes: an acquisition unit, configured to acquire, for each English word in the specified word set, the designated speech segment of the English word from the mixed audio of the Chinese-English sample corpus; a decoding unit, configured to decode the designated segment into a Chinese phoneme sequence using a Chinese phoneme-level decoding network, wherein the Chinese phoneme-level decoding network comprises a Chinese acoustic model and a phoneme-level language model; and a generating unit, configured to generate the seed dictionary corresponding to the specified word set from the Chinese phoneme sequence.
Optionally, the Chinese phoneme sequence includes a plurality of candidate pronunciations, and the generating unit includes: a computing subunit, configured to embed, for each English word in the specified word set, the candidate pronunciations into the corresponding mixed audio and calculate the average pronunciation posterior probability; and a generating subunit, configured to rank the candidate pronunciations by posterior probability to obtain several optimal pronunciation results and integrate them into the seed dictionary corresponding to the specified word set.
Optionally, the acquisition unit includes: a training subunit, configured to train a Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) using the Chinese-English sample corpus, a Chinese pronunciation dictionary, and an English pronunciation dictionary; a processing subunit, configured to obtain the segment timestamps of the English word by forced alignment with the trained mixed GMM-HMM model; and an extracting subunit, configured to extract the designated speech segment of the English word from the mixed audio based on the segment timestamps.
Optionally, the second generation module includes: a training unit, configured to train an initial G2P model with the seed dictionary to obtain a first G2P model; a generating unit, configured to re-decode the specified word set with the first G2P model and generate n + i prediction results for each English word, wherein the mapping sequence of each English word in the seed dictionary comprises n candidate pronunciations, n is an integer greater than 1, and i is an integer greater than 0; and the training unit is further configured to re-select the best n candidate pronunciations from the n candidate pronunciations and the n + i prediction results as training data, continue to iteratively train the first G2P model until the error rate of the prediction results on the test set meets a preset condition, and determine the latest G2P model as the first preset G2P model.
Optionally, the apparatus further comprises: a second extraction module, configured to extract Chinese words and English abbreviated words from the mixed speech after the acquisition module acquires the mixed speech to be phoneme-recognized; a second recognition module, configured to recognize the Chinese words with a second preset G2P model to obtain second phoneme information of the Chinese words, and to spell the English abbreviated words with a letter pronunciation table to obtain third phoneme information of the English abbreviated words, wherein the second preset G2P model comprises mapping sequences between Chinese words and Chinese phonemes; and a combination module, configured to combine the first phoneme information, the second phoneme information, and the third phoneme information to obtain the mixed phonemes of the mixed speech.
According to a further embodiment of the present invention, there is also provided a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
According to yet another embodiment of the present invention, there is also provided an electronic device, including a memory in which a computer program is stored and a processor configured to execute the computer program to perform the steps in any of the above method embodiments.
According to the invention, the mixed speech to be subjected to phoneme recognition, comprising Chinese words and English words, is obtained; English non-abbreviated words are then extracted from the mixed speech, and the first phoneme information of the English non-abbreviated words is recognized with a first preset grapheme-to-phoneme (G2P) model. The first preset G2P model is trained on the decoding results of Chinese phonemes and comprises mapping sequences between English words and Chinese phonemes. No acoustic model needs to be retrained: using the existing Chinese acoustic model, the mapped pronunciations between Chinese phonemes and English words are generated automatically, and the English part of the language model is retained, so the system's capability to recognize English words can be realized quickly. This is a fast, low-cost construction scheme that saves labor cost while seeking mapping labels with high acoustic similarity, realizes an English pronunciation scheme of reliable quality, and solves the technical problem in the related art of low efficiency in phoneme-based recognition of mixed speech.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a block diagram of a hardware structure of a mobile phone according to an embodiment of the present invention;
FIG. 2 is a flowchart of a hybrid speech recognition method according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating obtaining the mapped pronunciation of the English word OFFICE according to an embodiment of the present invention;
FIG. 4 is a flowchart of the acquisition of an initial pronunciation set in an embodiment of the present invention;
FIG. 5 is a flowchart of G2P iterative training in an embodiment of the present invention;
FIG. 6 is a flowchart of the prediction of the pronunciation of any non-abbreviated English word according to an embodiment of the present invention;
FIG. 7 is a schematic diagram and an example diagram of splitting a word according to an embodiment of the present invention;
FIG. 8 is a block diagram of a hybrid speech recognition apparatus according to an embodiment of the present invention;
FIG. 9 is a block diagram of an electronic device according to an embodiment of the invention.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
The method provided by the first embodiment of the present application may be executed in a server, a computer, a mobile phone, a voice recognition device, a recording pen, or a similar computing device. Taking the operation on a mobile phone as an example, fig. 1 is a block diagram of a hardware structure of a mobile phone according to an embodiment of the present invention. As shown in fig. 1, the handset may include one or more (only one shown in fig. 1) processors 102 (the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA) and a memory 104 for storing data, and optionally may also include a transmission device 106 for communication functions and an input-output device 108. It will be understood by those skilled in the art that the structure shown in fig. 1 is merely illustrative and not limiting to the structure of the mobile phone. For example, a cell phone may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
The memory 104 may be used to store software programs and modules of application software, such as the program instructions/modules corresponding to the hybrid speech recognition method in an embodiment of the present invention; the processor 102 executes various functional applications and data processing by running the programs stored in the memory 104, thereby implementing the method described above. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the mobile phone over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of a cellular phone. In one example, the transmission device 106 includes a Network adapter (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used to communicate with the internet in a wireless manner.
In the present embodiment, a hybrid speech recognition method is provided. FIG. 2 is a flowchart of a hybrid speech recognition method according to an embodiment of the present invention; as shown in FIG. 2, the flow includes the following steps:
step S202, acquiring mixed voice to be subjected to phoneme recognition, wherein the mixed voice comprises Chinese words and English words;
the mixed speech of the embodiment includes text information and speech information, and the words in the text can be labeled with pronunciation by speech recognition of phonemes. In this embodiment, for chinese, the pronunciation unit Phoneme (Phoneme) is a form, and may also be Initial/Final, etc.
Step S204, extracting English non-abbreviated words from the mixed voice;
step S206, recognizing first phoneme information of the English non-abbreviated words by using a first preset grapheme-to-phoneme (G2P) model, wherein the first preset G2P model is trained on the decoding results of Chinese phonemes and comprises mapping sequences between English words and Chinese phonemes;
It should be noted that this embodiment can be applied to mixed Chinese-English speech recognition scenarios, and also to speech recognition of any main language mixed with another language, for example, but not limited to, mixed Chinese-Japanese speech recognition and mixed Cantonese-English speech recognition.
Through the above steps, the mixed speech to be subjected to phoneme recognition, comprising Chinese words and English words, is obtained; English non-abbreviated words are then extracted from the mixed speech, and the first phoneme information of the English non-abbreviated words is recognized with a first preset grapheme-to-phoneme (G2P) model. The first preset G2P model is trained on the decoding results of Chinese phonemes and comprises mapping sequences between English words and Chinese phonemes. No acoustic model needs to be retrained: using the existing Chinese acoustic model, the mapped pronunciations between Chinese phonemes and English words are generated automatically, and the English part of the language model is retained, so the system's capability to recognize English words can be realized quickly. This is a fast, low-cost construction scheme that saves labor cost while seeking mapping labels with high acoustic similarity, realizes an English pronunciation scheme of reliable quality, and solves the technical problem in the related art of low efficiency in phoneme-based recognition of mixed speech.
In an implementation of this embodiment, before the first phoneme information of the English words is recognized with the first preset G2P model, the method further includes:
s11, generating a seed dictionary of an appointed word set through phoneme decoding and a preferred algorithm, wherein the appointed word set is an English word in a Chinese and English sample corpus;
optionally, the preferred algorithm of the present embodiment is implemented by using a confusion network like a rotor (rotator Output voltage Error reduction).
The seed dictionary and the trained pronunciation dictionary of this embodiment are mapping tables from words to pronunciation units, composed of a plurality of rows, where each row maps a word to its corresponding pronunciation-unit sequence. The words are Chinese words or English words, and the pronunciation units are Chinese phonemes.
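Such a row-per-entry mapping table can be parsed with a few lines of code; the sample rows in the test are hypothetical:

```python
def parse_dictionary(text):
    """Parse dictionary rows of the form '<word> <phoneme> <phoneme> ...'.
    A word may appear on several rows, one per candidate pronunciation."""
    entries = {}
    for line in text.strip().splitlines():
        word, *phones = line.split()
        entries.setdefault(word, []).append(phones)
    return entries
```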
In some embodiments, generating the seed dictionary for the specified word set through phoneme decoding and the preference algorithm includes: for each English word in the specified word set, acquiring the designated speech segment of the English word from the mixed audio of the Chinese-English sample corpus; decoding the designated segment into a Chinese phoneme sequence using a Chinese phoneme-level decoding network, wherein the Chinese phoneme-level decoding network comprises a Chinese acoustic model and a phoneme-level language model; and generating the seed dictionary corresponding to the specified word set from the Chinese phoneme sequence.
In some examples, the Chinese phoneme sequence includes a plurality of candidate pronunciations, and generating the seed dictionary corresponding to the specified word set from the Chinese phoneme sequence includes: for each English word in the specified word set, embedding the candidate pronunciations into the corresponding mixed audio and calculating the average pronunciation posterior probability; and ranking the candidate pronunciations by posterior probability to obtain several optimal pronunciation results, which are integrated to generate the seed dictionary corresponding to the specified word set.
In some embodiments, obtaining the designated speech segment of the English word from the mixed audio of the Chinese-English sample corpus includes: training a mixed GMM-HMM (Gaussian Mixture Model-Hidden Markov Model) using the Chinese-English sample corpus, a Chinese pronunciation dictionary, and an English pronunciation dictionary; obtaining the segment timestamps of the English word by forced alignment with the trained mixed GMM-HMM model; and extracting the designated speech segment of the English word from the mixed audio based on the segment timestamps.
S12, generating a first preset G2P model by adopting seed dictionary training, wherein the first preset G2P model is a G2P model based on a Seq2Seq network.
In some embodiments, generating the first preset G2P model using seed dictionary training comprises: training an initial G2P model with the seed dictionary to obtain a first G2P model; re-decoding the specified word set with the first G2P model and generating n + i prediction results for each English word, wherein the mapping sequence of each English word in the seed dictionary comprises n candidate pronunciations, n is an integer greater than 1, and i is an integer greater than 0; and preferentially re-selecting n candidate pronunciations from the n candidate pronunciations and the n + i prediction results as training data, continuing to iteratively train the first G2P model until the mixed error rate of the prediction results on the test set meets a preset condition, and determining the latest G2P model as the first preset G2P model.
In the preference process, even when the number of candidates before and after preference is the same, the ordering within the currently selected n candidate pronunciations may change, so the optimal candidate can differ from the optimal candidate selected in the previous round.
In the present embodiment, pronunciations of English non-abbreviated words are acquired in a data-driven manner. First, pronunciation results for a specified word set are obtained through phoneme decoding and the preference method. To enable prediction of the pronunciation of arbitrary words, the decoding results are then used to train a G2P model based on the seq2seq method. The flow comprises the following steps:
step a, mapping pronunciation generation of English words in a set;
in order to obtain a Chinese phoneme mapping sequence with high acoustic similarity of English words, a data driving mode is adopted, and the method comprises the following steps:
training a mixed GMM-HMM model by using the Chinese and English corpus, a Chinese pronunciation dictionary and an English pronunciation dictionary, and obtaining a segment timestamp of each English word through alignment, so as to extract the pronunciation segment of the English word;
establishing a Chinese phoneme level decoding network by using a Chinese acoustic model and a phoneme level language model, and sending English word segments into a decoder to obtain a Chinese phoneme sequence;
Because mapping differences or inaccurate timestamps introduce a large amount of noise into the decoding result, a preference algorithm is needed to obtain the optimal pronunciations: all candidate pronunciations are embedded back into the original text, the average pronunciation posterior probability is calculated, and the word pronunciation set is ranked according to this statistic to obtain the optimal pronunciations.
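The ranking step described above can be sketched as follows. The acoustic scoring itself (per-phoneme posteriors obtained by embedding a candidate back into the utterance) is assumed to be supplied by the recognizer, so it appears here only as a stand-in callable.

```python
def rank_pronunciations(candidates, posterior_fn):
    """Rank candidate pronunciations by average per-phoneme posterior.

    candidates: list of phoneme lists for one word.
    posterior_fn(phones) -> list of posterior probabilities, one per phoneme
    (a stand-in for the acoustic rescoring step described in the text).
    """
    scored = []
    for phones in candidates:
        posts = posterior_fn(phones)
        avg = sum(posts) / len(posts) if posts else 0.0
        scored.append((avg, phones))
    scored.sort(key=lambda x: x[0], reverse=True)  # best average posterior first
    return [phones for _, phones in scored]
```

Keeping only the top n entries of the returned list gives the n-best candidate pronunciations used later in the iterative G2P training.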
Fig. 3 is a schematic diagram of acquiring the mapped pronunciation of the English word OFFICE according to the embodiment of the present invention; several optimal mapped pronunciations of OFFICE can be obtained from the ranking table. Similarly, for a given English vocabulary E, a pronunciation set can be obtained by this method; some examples are shown in Table 1:
TABLE 1
ABNORMAL ee er3 b u2 n ao3 m ou3
ABNORMAL aa a1 b u4 n ao3 m ou3
ABNORMAL ee er3 b u4 n ao3 m ou3
ABNORMAL aa ai4 b u4 n ao3 m ou3
ABOARD ee er3 b ao4
ABOARD ee er2 b u4 ee er3 d e5
ABOARD ee e4 b o1 ee er3 d e5
ABOARD ee er2 b o1 ee er3 d e5
ABORTION ee e4 b ao1 sh en3
ABORTION ee er2 b ao1 sh en3
ABORTION ee er3 b o1 sh en3
ABORTION ee er3 b ao1 sh en3
The mapped pronunciations are Chinese phoneme sequences, so they can be added directly into the pronunciation dictionary of the Chinese ASR system, enabling the Chinese ASR system to recognize English words.
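Because the mapped pronunciations use only Chinese phonemes, extending the Chinese ASR lexicon is a plain merge of two mapping tables, with no acoustic-model change. The data structures below are illustrative; the patent does not prescribe an in-memory representation.

```python
def merge_into_lexicon(chinese_lexicon, english_mappings):
    """Add {english_word: [phoneme lists]} entries to an existing lexicon dict,
    skipping duplicate candidate pronunciations."""
    merged = {w: [list(p) for p in prons] for w, prons in chinese_lexicon.items()}
    for word, prons in english_mappings.items():
        merged.setdefault(word, [])
        for phones in prons:
            if phones not in merged[word]:  # avoid duplicate candidates
                merged[word].append(list(phones))
    return merged
```

The merged table can then be written back out in the same row format used by the seed dictionary.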
B, predicting pronunciation of any English word;
In step a, a first-round initial pronunciation set (the seed dictionary) is obtained through phoneme decoding and preference using certain Chinese-English mixed speech data (containing the English seed word set E). To better predict pronunciations of arbitrary English words, a seq2seq-based G2P model needs to be trained to learn the pronunciation rules. However, the pronunciation set preliminarily acquired in the data-driven way still contains a certain amount of noise, so an iterative G2P training method is proposed to clean the seed pronunciation set while training the G2P model. FIG. 4 is a flowchart illustrating the acquisition of the initial pronunciation set, i.e. the seed dictionary, according to an embodiment of the present invention.
To extract good pronunciation rules from the noisy initial seed dictionary, the G2P model trained on the training set (the n-best candidate pronunciations of each word) re-decodes the word set E and returns n + i (i > 0) prediction results for each word; the aim is to learn additional pronunciation rules beyond those already in the training set. The preference algorithm then re-selects the n-best pronunciations from the union of the initial pronunciation set and the G2P prediction results, which serve as training data for the next round of G2P.
Table 2 below shows a sample of iterative G2P training, in which each word in the seed vocabulary E is assigned 4 candidate pronunciations during training and 6 candidate pronunciations during prediction, so that more pronunciation rules can participate in the preference; the preference method is consistent with step a.
TABLE 2
(Table 2 is rendered as an image in the original publication.)
Wherein: E is the English seed word set; e is the English test word set; Te is the Chinese-English mixed speech test set containing the English word set e; G2P_i is the G2P model of the i-th round; P_E^i is the set of n-best pronunciations of word set E predicted by G2P_i, where P_E^0 is empty; S_E^i is the i-th round preference result of word set E, in which each word has n-best candidate pronunciations; P_e^i is the set of n-best pronunciations of word set e predicted by G2P_i; and MER(test, P) is the mixed error rate on test set test under pronunciation set P.
FIG. 5 is a flowchart of G2P iterative training in an embodiment of the present invention, in which each round of the G2P model is trained on the preferred set data. For the update termination condition of G2P, a test set Te containing the vocabulary e is selected, the n-best pronunciations of e given by the current G2P model are added to the pronunciation dictionary, and the mixed error rate of the test set Te is computed under the current pronunciations. The mixed error rate of the current iteration round is compared with that of the previous round; if the mixed error rate of the current round is greater than or equal to that of the previous round, the mixed error rate no longer decreases, iteration stops, the preset condition is judged to be met, and updating of G2P stops, the G2P model at that point being the optimal model.
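The iterative loop with its MER-based stopping condition can be sketched as follows. `train`, `predict`, `prefer`, and `mer` are stand-ins for the components described in the text (G2P training, n+i-best prediction, the preference algorithm of step a, and mixed-error-rate evaluation on Te); only the control flow is asserted here.

```python
def iterative_g2p(seed_nbest, n, i, train, predict, prefer, mer, max_rounds=10):
    """Iteratively train a G2P model, stopping when MER on the test set
    no longer decreases; returns the model of the last improving round."""
    current = seed_nbest                 # {word: n candidate pronunciations}
    best_model, prev_mer = None, float("inf")
    for _ in range(max_rounds):
        model = train(current)
        preds = predict(model, list(current), n + i)   # n+i candidates per word
        # re-run preference over old candidates plus new predictions, keep n
        current = {w: prefer(current[w] + preds[w])[:n] for w in current}
        cur_mer = mer(model)
        if cur_mer >= prev_mer:          # MER stopped decreasing: terminate
            break
        best_model, prev_mer = model, cur_mer
    return best_model
```

Note that the model of the previous round, not the round that failed to improve, is what gets returned, matching the termination rule above.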
FIG. 6 is a flow chart of the prediction of pronunciation of any non-abbreviated English word in accordance with an embodiment of the present invention.
In some other scenarios of this embodiment, after acquiring the mixed speech to be phoneme recognized, the method further includes: extracting Chinese words and English abbreviated words from the mixed voice; recognizing the Chinese words by adopting a second preset G2P model to obtain second phoneme information of the Chinese words, and spelling the English abbreviated words by adopting a letter pronunciation table to obtain third phoneme information of the English abbreviated words, wherein the second preset G2P model comprises a mapping sequence between the Chinese words and Chinese phonemes; and combining the first phoneme information, the second phoneme information and the third phoneme information to obtain a mixed phoneme of the mixed voice.
In this embodiment, the mixed word pronunciation can be divided into chinese and english parts, pronunciations are generated for the respective parts, and finally the pronunciations are combined to form the final pronunciation of the word.
FIG. 7 is a schematic diagram and an example of word splitting according to an embodiment of the present invention, in which the mixed word "papi酱" in the mixed speech is split into "papi" and "酱" (sauce), each part is recognized separately, and the combined mixed phoneme is p a1 p i5 j iang4.
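The splitting step shown in Fig. 7 amounts to cutting a mixed word into runs of Latin letters, digits, and Chinese characters, each run then being sent to its own pronunciation generator. A minimal sketch, with illustrative labels:

```python
import re

def split_mixed_word(word):
    """Split a mixed word into runs of (kind, text), kind in {'en', 'digit', 'zh'}."""
    parts = []
    for m in re.finditer(r"[A-Za-z]+|\d+|[\u4e00-\u9fff]+", word):
        text = m.group()
        if text[0].isdigit():
            kind = "digit"
        elif text[0].isascii():
            kind = "en"
        else:
            kind = "zh"
        parts.append((kind, text))
    return parts
```

Each part's phoneme sequence is then produced by the generator for its kind and concatenated into the whole word's pronunciation.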
The scheme for rapidly generating a Chinese-English mixed pronunciation dictionary provided by this embodiment is a rapid construction scheme for Chinese-English mixed-recognition ASR based on an existing Chinese acoustic model. By automatically generating high-quality mapped pronunciations between Chinese phonemes and English words, recognition of English words in an existing Chinese ASR system can be achieved quickly, without retraining the acoustic model and with only the English part of the language model retained, making this a fast and low-cost construction scheme. Compared with manual pronunciation annotation, the data-driven method here saves labor cost while pursuing acoustically highly similar mapping annotations for English word pronunciations, providing English pronunciations of reliable quality. Meanwhile, it can cope with the problem that the actual pronunciation of non-native speakers differs from the standard pronunciation.
In ASR (automatic speech recognition) construction, the pronunciation vocabulary (from the language model) may contain Chinese, English, and mixed (Chinese-English, letter-digit mixed) forms. The word types are divided first, and the generation method is then selected according to the pronunciation mode.
Chinese word pronunciation belongs to the spelling type, although polyphone phenomena exist in some contexts, so either the spelling method or a seq2seq method can be used. English words are divided into abbreviated and non-abbreviated types. Abbreviation pronunciation belongs to the spelling type and can adopt the spelling method; abbreviations such as "APP" and "CCTV" generally have short character strings, so abbreviation pronunciation candidates can be added based on a length judgment or the length of a run of consecutive capital letters. Because the pronunciation of non-abbreviated English words is irregular, it is predicted by a seq2seq method, with the seq2seq-based model trained on the phoneme decoding results. A mixed word such as "papi酱" is first split into its Chinese, English and digit parts; pronunciations are then given by the corresponding generation methods and combined into the pronunciation of the whole word.
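The abbreviation branch of the dispatch above can be sketched as follows. The uppercase-run detection and length threshold follow the text; the letter-pronunciation table entries are hypothetical illustrations, not the patent's actual table.

```python
# Hypothetical letter-pronunciation table mapping letters to Chinese phonemes.
LETTER_TABLE = {"A": ["ee", "ei1"], "P": ["p", "i1"], "C": ["s", "ei1"],
                "T": ["t", "i1"], "V": ["w", "ei1"]}

def is_abbreviation(word, max_len=5):
    """Treat short all-uppercase alphabetic words (e.g. 'APP') as abbreviations."""
    return word.isupper() and word.isalpha() and len(word) <= max_len

def spell_abbreviation(word):
    """Spell an abbreviation letter by letter from the letter-pronunciation table."""
    phones = []
    for letter in word:
        phones.extend(LETTER_TABLE.get(letter, [letter.lower()]))
    return phones
```

Words failing the abbreviation test fall through to the seq2seq G2P path for non-abbreviated English words.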
Compared with constructing a Chinese-English hybrid acoustic model, the method provided by this embodiment obtains high-quality English word pronunciations based on a mapping strategy, realizes recognition of English words under a Chinese acoustic model, meets the requirements of certain Chinese-English mixed recognition scenarios, and avoids the high cost of retraining and constructing a Chinese-English ASR system. In pronunciation dictionary generation, the proposed method can automatically divide word types and automatically generate single or multiple candidate pronunciations for a word, providing reliable and fast phonetic transcription while dispensing with manual annotation.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Example 2
In this embodiment, a mixed speech recognition apparatus is further provided, which is used to implement the foregoing embodiments and preferred embodiments; descriptions already given are not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
Fig. 8 is a block diagram of a mixed speech recognition apparatus according to an embodiment of the present invention, as shown in fig. 8, the apparatus including: an acquisition module 80, a first extraction module 82, a first identification module 84, wherein,
an obtaining module 80, configured to obtain a mixed speech to be subjected to phoneme recognition, where the mixed speech includes a chinese word and an english word;
a first extraction module 82, configured to extract an english non-abbreviated word from the mixed speech;
the first recognition module 84 is configured to recognize first phoneme information of the English non-abbreviated word by using a first preset grapheme-sequence-to-phoneme-sequence (G2P) model, where the first preset G2P model is obtained by training on Chinese phoneme decoding results and comprises a mapping sequence between English words and Chinese phonemes.
Optionally, the apparatus further comprises: the first generation module is used for generating a seed dictionary of an appointed word set through phoneme decoding and a preferred algorithm before the recognition module recognizes first phoneme information of the English word by adopting a first preset G2P model, wherein the appointed word set is an English word in a Chinese and English sample corpus; a second generating module, configured to generate the first preset G2P model by using the seed dictionary training, where the first preset G2P model is a G2P model based on a Seq2Seq network.
Optionally, the first generating module includes: the acquisition unit is used for acquiring the designated sound segment of the English word from the mixed audio of the Chinese and English sample corpus aiming at each English word in the designated word set; a decoding unit for decoding the specified segment into a Chinese phoneme sequence using a Chinese phoneme level decoding network, wherein the Chinese phoneme level decoding network comprises a Chinese acoustic model and a phoneme level language model; and the generating unit is used for generating a seed dictionary corresponding to the specified word set according to the Chinese phoneme sequence.
Optionally, the chinese phoneme sequence includes a plurality of candidate pronunciations, and the generating unit includes: the computing subunit is used for embedding the candidate pronunciations into corresponding mixed audios respectively aiming at each English word in the appointed word set and calculating the posterior probability of the average pronunciation; and the generating subunit is used for sequencing the candidate pronunciations according to the posterior probability to obtain a plurality of optimal pronunciation results and integrating to generate a seed dictionary corresponding to the specified word set.
Optionally, the obtaining unit includes: the training subunit is used for training a Gaussian mixture model-hidden Markov model GMM-HMM model by adopting the Chinese and English sample corpus, the Chinese pronunciation dictionary and the English pronunciation dictionary; the processing subunit is used for obtaining a segment timestamp of the English word by aligning the trained mixed GMM-HMM model; an extracting subunit, configured to extract the specified sound segment of the english word in the mixed audio based on the segment timestamp.
Optionally, the second generating module includes: the training unit is used for training an initial G2P model by adopting the seed dictionary to obtain a first G2P model; a generating unit, configured to re-decode the specified word set by using the first G2P model, and generate n + i prediction results for each english word, where a mapping sequence of each english word in the seed dictionary includes n candidate pronunciations, n is an integer greater than 1, and i is an integer greater than 0; and the training unit is used for preferentially reselecting n candidate pronunciations from the n candidate pronunciations and the n + i prediction results to serve as training data, continuing to iteratively train the first G2P model until the error mixing rate of the prediction results and the test set meets a preset condition, and determining the latest G2P model as the first preset G2P model.
Optionally, the apparatus further comprises: the second extraction module is used for extracting Chinese words and English abbreviation words from the mixed voice after the acquisition module acquires the mixed voice to be subjected to phoneme recognition; the second recognition module is used for recognizing the Chinese words by adopting a second preset G2P model to obtain second phoneme information of the Chinese words, and spelling the English abbreviated words by adopting a letter pronunciation table to obtain third phoneme information of the English abbreviated words, wherein the second preset G2P model comprises a mapping sequence between the Chinese words and Chinese phonemes; and the combination module is used for combining the first phoneme information, the second phoneme information and the third phoneme information to obtain a mixed phoneme of the mixed voice.
It should be noted that, the above modules may be implemented by software or hardware, and for the latter, the following may be implemented, but not limited to: the modules are all positioned in the same processor; alternatively, the modules are respectively located in different processors in any combination.
Example 3
Fig. 9 is a structural diagram of an electronic device according to an embodiment of the present invention, and as shown in fig. 9, the electronic device includes a processor 91, a communication interface 92, a memory 93, and a communication bus 94, where the processor 91, the communication interface 92, and the memory 93 complete communication with each other through the communication bus 94, and the memory 93 is used for storing a computer program;
the processor 91, when executing the program stored in the memory 93, implements the following steps: acquiring mixed voice to be subjected to phoneme recognition, wherein the mixed voice comprises Chinese words and English words; extracting English non-abbreviated words from the mixed voice; and identifying first phoneme information of the English non-abbreviated word by adopting a G2P model from a first preset grapheme sequence to phoneme sequence, wherein the first preset G2P model is obtained by training a decoding result of Chinese phonemes and comprises a mapping sequence between the English word and the Chinese phonemes.
Optionally, before the first phoneme information of the english word is identified by using the first preset G2P model, the method further includes: generating a seed dictionary of an appointed word set through a phoneme decoding and preferential algorithm, wherein the appointed word set is an English word in a Chinese and English sample corpus; and training by adopting the seed dictionary to generate the first preset G2P model, wherein the first preset G2P model is a G2P model based on a Seq2Seq network.
Optionally, the generating the seed dictionary of the specified word set through the phoneme decoding and the preferential algorithm includes: aiming at each English word in the appointed word set, acquiring an appointed sound segment of the English word from mixed audio of the Chinese and English sample corpus; decoding the specified segment into a Chinese phoneme sequence using a Chinese phoneme level decoding network, wherein the Chinese phoneme level decoding network comprises a Chinese acoustic model and a phoneme level language model; and generating a seed dictionary corresponding to the specified word set according to the Chinese phoneme sequence.
Optionally, the generating a seed dictionary corresponding to the specified word set according to the chinese phoneme sequence includes: for each English word in the appointed word set, respectively embedding the candidate pronunciations into corresponding mixed audio, and calculating the posterior probability of the average pronunciation; and sequencing the candidate pronunciations according to the posterior probability to obtain a plurality of optimal pronunciation results, and integrating to generate a seed dictionary corresponding to the appointed word set.
Optionally, the obtaining of the specified voice segment of the english word from the mixed audio of the chinese and english sample corpus includes: training a Gaussian mixture model-hidden Markov model GMM-HMM model by adopting the Chinese and English sample corpus, the Chinese pronunciation dictionary and the English pronunciation dictionary; obtaining a fragment timestamp of the English word by aligning the trained mixed GMM-HMM model; extracting a specified voice segment of the English word in the mixed audio based on the segment time stamp.
Optionally, the generating the first preset G2P model by using the seed dictionary training includes: training an initial G2P model by using the seed dictionary to obtain a first G2P model; re-decoding the specified word set by adopting the first G2P model, and generating n + i prediction results for each English word, wherein the mapping sequence of each English word in the seed dictionary comprises n candidate pronunciations, n is an integer greater than 1, and i is an integer greater than 0; and preferentially reselecting n candidate pronunciations from the n candidate pronunciations and the n + i prediction results to serve as training data, continuing to iteratively train the first G2P model until the error mixing rate of the prediction results and the test set meets a preset condition, and determining the latest G2P model as the first preset G2P model.
Optionally, after obtaining the mixed speech to be phoneme recognized, the method further includes: extracting Chinese words and English abbreviated words from the mixed voice; recognizing the Chinese words by adopting a second preset G2P model to obtain second phoneme information of the Chinese words, and spelling the English abbreviated words by adopting a letter pronunciation table to obtain third phoneme information of the English abbreviated words, wherein the second preset G2P model comprises a mapping sequence between the Chinese words and Chinese phonemes; and combining the first phoneme information, the second phoneme information and the third phoneme information to obtain a mixed phoneme of the mixed voice.
The communication bus mentioned in the above terminal may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the terminal and other equipment.
The Memory may include a Random Access Memory (RAM) or a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In yet another embodiment provided by the present application, a computer-readable storage medium is further provided, which stores instructions that, when executed on a computer, cause the computer to execute the method for recognizing mixed speech in any one of the above embodiments.
In yet another embodiment provided by the present application, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the method for mixed speech recognition described in any of the above embodiments.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, cause the processes or functions described in accordance with the embodiments of the application to occur, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present application, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The foregoing is only a preferred embodiment of the present application and it should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be considered as the protection scope of the present application.

Claims (8)

1. A method for recognizing mixed speech, comprising:
acquiring mixed voice to be subjected to phoneme recognition, wherein the mixed voice comprises Chinese words and English words;
extracting English non-abbreviated words from the mixed voice;
identifying first phoneme information of the English non-abbreviated word by adopting a G2P model from a first preset grapheme sequence to a phoneme sequence, wherein the first preset G2P model is obtained by training a decoding result of Chinese phonemes and comprises a mapping sequence between the English word and the Chinese phonemes;
before recognizing the first phoneme information of the English word by using the first preset G2P model, the method further comprises: generating a seed dictionary of an appointed word set through a phoneme decoding and preferential algorithm, wherein the appointed word set is an English word in a Chinese and English sample corpus; generating the first preset G2P model by using the seed dictionary training, wherein the first preset G2P model is a G2P model based on a Seq2Seq network, and the generating the first preset G2P model by using the seed dictionary training comprises: training an initial G2P model by using the seed dictionary to obtain a first G2P model; re-decoding the specified word set by adopting the first G2P model, and generating n + i prediction results for each English word, wherein the mapping sequence of each English word in the seed dictionary comprises n candidate pronunciations, n is an integer greater than 1, and i is an integer greater than 0; and preferentially reselecting n candidate pronunciations from the n candidate pronunciations and the n + i prediction results to serve as training data, continuing to iteratively train the first G2P model until the error mixing rate of the prediction results and the test set meets a preset condition, and determining the latest G2P model as the first preset G2P model.
2. The method of claim 1, wherein generating a seed dictionary of the specified word set by a phoneme decoding and preferential algorithm comprises:
for each English word in the specified word set, acquiring a specified voice segment of the English word from the mixed audio of the Chinese-English sample corpus;
decoding the specified segment into a Chinese phoneme sequence using a Chinese phoneme level decoding network, wherein the Chinese phoneme level decoding network comprises a Chinese acoustic model and a phoneme level language model;
and generating a seed dictionary corresponding to the specified word set according to the Chinese phoneme sequence.
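A rough, heavily hedged sketch of the decoding idea in claim 2: an English segment is scored against Chinese phoneme hypotheses by combining an acoustic score with a phoneme-level language-model score, and the best-scoring sequence is kept. Both score tables below are toy stand-ins for the Chinese acoustic model and phoneme-level language model named in the claim; the hypotheses and weights are illustrative assumptions.

```python
# Toy stand-ins: log-likelihoods from an acoustic model and log-probs
# from a phoneme-level language model (assumed values, not real output).
ACOUSTIC = {"g u g e": -4.0, "k u k e": -6.5}
PHONE_LM = {"g u g e": -2.0, "k u k e": -2.5}

def decode(segment_hyps, lm_weight=1.0):
    """Pick the Chinese phoneme hypothesis with the best combined score,
    mirroring the acoustic-model + phoneme-LM combination of the claim."""
    return max(segment_hyps,
               key=lambda h: ACOUSTIC[h] + lm_weight * PHONE_LM[h])

print(decode(["g u g e", "k u k e"]))  # → g u g e
```

A real decoder searches a lattice of phoneme sequences rather than scoring a fixed list, but the scoring principle is the same.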
3. The method of claim 2, wherein the Chinese phoneme sequence comprises a plurality of candidate pronunciations, and wherein generating the seed dictionary corresponding to the specified word set from the Chinese phoneme sequence comprises:
for each English word in the specified word set, embedding each candidate pronunciation into the corresponding mixed audio and calculating its average pronunciation posterior probability;
and ranking the candidate pronunciations by posterior probability to obtain a plurality of optimal pronunciation results, which are combined to generate the seed dictionary corresponding to the specified word set.
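Claim 3's ranking step can be sketched as follows: each candidate pronunciation is scored by the average per-phoneme posterior obtained when it is aligned against the audio, and candidates are sorted by that score. The posterior values here are hard-coded for illustration; a real system obtains them from the acoustic model during forced alignment.

```python
def average_posterior(phone_posteriors):
    """Average the per-phoneme posteriors of one aligned pronunciation."""
    return sum(phone_posteriors) / len(phone_posteriors)

def rank_pronunciations(candidates):
    """candidates: {pronunciation: [per-phoneme posteriors from alignment]}.
    Return (pronunciation, score) pairs sorted best-first."""
    scored = [(p, average_posterior(post)) for p, post in candidates.items()]
    return sorted(scored, key=lambda x: x[1], reverse=True)

candidates = {
    "g u g e":  [0.90, 0.80, 0.85, 0.90],  # plausible rendering, high posteriors
    "g ou g l": [0.40, 0.50, 0.30, 0.45],  # poor fit to the audio
}
best = rank_pronunciations(candidates)
print(best[0][0])  # → g u g e
```

The top-ranked results per word are what get merged into the seed dictionary.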
4. The method of claim 2, wherein acquiring the specified voice segment of the English word from the mixed audio of the Chinese-English sample corpus comprises:
training a Gaussian mixture model-hidden Markov model (GMM-HMM) using the Chinese-English sample corpus, a Chinese pronunciation dictionary and an English pronunciation dictionary;
obtaining a segment timestamp of the English word through alignment with the trained mixed GMM-HMM model;
and extracting the specified voice segment of the English word from the mixed audio based on the segment timestamp.
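Once alignment has produced a start/end timestamp for each English word, the final step of claim 4 is a simple cut of the corresponding samples from the mixed audio. A minimal sketch, assuming a 16 kHz sampling rate and dummy sample data:

```python
SAMPLE_RATE = 16000  # assumed sampling rate in Hz

def extract_segment(audio, start_sec, end_sec, rate=SAMPLE_RATE):
    """Return the slice of `audio` (a list of samples) covered by the
    aligned [start_sec, end_sec) timestamp."""
    return audio[int(start_sec * rate):int(end_sec * rate)]

audio = list(range(SAMPLE_RATE * 2))          # 2 seconds of dummy samples
segment = extract_segment(audio, 0.5, 0.75)   # word aligned to [0.5 s, 0.75 s)
print(len(segment))  # → 4000 samples (0.25 s at 16 kHz)
```

The extracted segments are what the phoneme-level decoding network of claim 2 then consumes.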
5. The method of claim 1, wherein after acquiring the mixed speech to be subjected to phoneme recognition, the method further comprises:
extracting Chinese words and English abbreviated words from the mixed voice;
recognizing the Chinese words with a second preset G2P model to obtain second phoneme information of the Chinese words, and spelling out the English abbreviated words with a letter pronunciation table to obtain third phoneme information of the English abbreviated words, wherein the second preset G2P model comprises a mapping sequence between Chinese words and Chinese phonemes;
and combining the first phoneme information, the second phoneme information and the third phoneme information to obtain a mixed phoneme of the mixed voice.
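Claim 5's handling of abbreviations can be sketched as a letter-by-letter lookup: each letter of the abbreviation is replaced by its entry in the letter pronunciation table and the results are concatenated. The table entries below are illustrative assumptions, not the patent's actual table.

```python
# Assumed letter pronunciation table (illustrative entries only).
LETTER_TABLE = {"U": "j u", "S": "e s", "B": "b i"}

def spell_abbreviation(abbr):
    """Spell an English abbreviated word letter by letter, concatenating
    the per-letter pronunciations from the table."""
    return " ".join(LETTER_TABLE[ch] for ch in abbr)

print(spell_abbreviation("USB"))  # → j u e s b i
```

The resulting third phoneme stream is then merged, in utterance order, with the first and second phoneme information to form the mixed phoneme sequence.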
6. A hybrid speech recognition apparatus, comprising:
an acquisition module, used for acquiring mixed voice to be subjected to phoneme recognition, wherein the mixed voice comprises Chinese words and English words;
the first extraction module is used for extracting English non-abbreviated words from the mixed voice;
a first recognition module, used for recognizing first phoneme information of the English non-abbreviated word by using a first preset grapheme-to-phoneme (G2P) model, wherein the first preset G2P model is obtained by training on Chinese phoneme decoding results and comprises a mapping sequence between English words and Chinese phonemes;
wherein the apparatus further comprises: a first generation module, used for generating a seed dictionary for a specified word set through phoneme decoding and a preference algorithm before the recognition module recognizes the first phoneme information of the English word with the first preset G2P model, wherein the specified word set consists of the English words in a Chinese-English sample corpus; and a second generation module, used for generating the first preset G2P model by training with the seed dictionary, wherein the first preset G2P model is a G2P model based on a Seq2Seq network, and the second generation module comprises: a training unit, used for training an initial G2P model with the seed dictionary to obtain a first G2P model; and a generating unit, used for re-decoding the specified word set with the first G2P model to generate n + i prediction results for each English word, wherein the mapping sequence of each English word in the seed dictionary comprises n candidate pronunciations, n is an integer greater than 1, and i is an integer greater than 0; the training unit is further used for re-selecting the n best candidate pronunciations from the n candidate pronunciations and the n + i prediction results as training data, continuing to iteratively train the first G2P model until the error rate of the prediction results on the test set meets a preset condition, and determining the latest G2P model as the first preset G2P model.
7. A storage medium, in which a computer program is stored, wherein the computer program is arranged to perform the method of any of claims 1 to 5 when executed.
8. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 5.
CN202110219826.7A 2021-02-26 2021-02-26 Hybrid voice recognition method and device, storage medium and electronic device Active CN113160804B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110219826.7A CN113160804B (en) 2021-02-26 2021-02-26 Hybrid voice recognition method and device, storage medium and electronic device


Publications (2)

Publication Number Publication Date
CN113160804A CN113160804A (en) 2021-07-23
CN113160804B true CN113160804B (en) 2022-03-08

Family

ID=76883683

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110219826.7A Active CN113160804B (en) 2021-02-26 2021-02-26 Hybrid voice recognition method and device, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN113160804B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116894427B (en) * 2023-09-08 2024-02-27 联通在线信息科技有限公司 Data classification method, server and storage medium for Chinese and English information fusion

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
DE10034235C1 (en) * 2000-07-14 2001-08-09 Siemens Ag Speech recognition method and speech recognizer
CN105679308A (en) * 2016-03-03 2016-06-15 百度在线网络技术(北京)有限公司 Method and device for generating g2p model based on artificial intelligence and method and device for synthesizing English speech based on artificial intelligence
CN107195296B (en) * 2016-03-15 2021-05-04 阿里巴巴集团控股有限公司 Voice recognition method, device, terminal and system
CN107731228B (en) * 2017-09-20 2020-11-03 百度在线网络技术(北京)有限公司 Text conversion method and device for English voice information

Non-Patent Citations (1)

Title
Design of a phoneme-based automatic English pronunciation assessment system; Hu Yuanling et al.; Automation & Instrumentation; 2018-11-25 (No. 11); pp. 166-169, 173 *


Similar Documents

Publication Publication Date Title
CN107301860B (en) Voice recognition method and device based on Chinese-English mixed dictionary
CN107154260B (en) Domain-adaptive speech recognition method and device
CN110675855B (en) Voice recognition method, electronic equipment and computer readable storage medium
US10586533B2 (en) Method and device for recognizing speech based on Chinese-English mixed dictionary
US7840399B2 (en) Method, device, and computer program product for multi-lingual speech recognition
ES2291440T3 (en) PROCEDURE, MODULE, DEVICE AND SERVER FOR VOICE RECOGNITION.
CN111402862B (en) Speech recognition method, device, storage medium and equipment
US20020198715A1 (en) Artificial language generation
CN111369974B (en) Dialect pronunciation marking method, language identification method and related device
US20080294433A1 (en) Automatic Text-Speech Mapping Tool
CN112397056B (en) Voice evaluation method and computer storage medium
CN109461438B (en) Voice recognition method, device, equipment and storage medium
CN112216284B (en) Training data updating method and system, voice recognition method and system and equipment
CN111833844A (en) Training method and system of mixed model for speech recognition and language classification
EP1398758B1 (en) Method and apparatus for generating decision tree questions for speech processing
CN113160804B (en) Hybrid voice recognition method and device, storage medium and electronic device
CN112580335B (en) Method and device for disambiguating polyphone
CN112133285B (en) Speech recognition method, device, storage medium and electronic equipment
CN111916063A (en) Sequencing method, training method, system and storage medium based on BPE (Business Process Engineer) coding
US20060074924A1 (en) Optimization of text-based training set selection for language processing modules
CN115691503A (en) Voice recognition method and device, electronic equipment and storage medium
CN114299930A (en) End-to-end speech recognition model processing method, speech recognition method and related device
CN114708854A (en) Voice recognition method and device, electronic equipment and storage medium
CN111354339B (en) Vocabulary phoneme list construction method, device, equipment and storage medium
CN113515586A (en) Data processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant