CN101826325A

CN101826325A - Method and device for identifying Chinese and English speech signal

Info

Publication number: CN101826325A
Application number: CN 201010123191
Authority: CN
Inventors: 刘轶; 詹五洲; 王东琦
Original assignee: Huawei Device Co Ltd
Current assignee: Global Innovation Polymerization LLC; Tanous Co
Priority date: 2010-03-10
Filing date: 2010-03-10
Publication date: 2010-09-08
Anticipated expiration: 2030-03-10
Also published as: CN101826325B

Abstract

The invention provides a method and a device for identifying a Chinese and English speech signal. The method mainly comprises the following steps of: carrying out feature extraction on a Chinese and English speech signal to be identified by a searching algorithm to acquire the feature information of the speech signal to be identified; and comparing the feature information with an acoustic model corresponding to each phoneme sequence in a mixed speech database, determining a phoneme sequence corresponding to the feature information based on the comparative result, and acquiring a Chinese and English mixed phrase corresponding to the phoneme sequence, wherein the Chinese and English mixed phrase is taken as an identification result of the Chinese and English speech signal to be identified. The invention can establish the acoustic model with less confusion, and does not need a large amount of labeled speech training data, thereby saving system resources. The invention can effectively raise the identification rate of the Chinese and English speech signal.

Description

The method and apparatus that Chinese and English speech signal is discerned

Technical field

The present invention relates to field of speech recognition, relate in particular to a kind of method and apparatus that Chinese and English speech signal is discerned.

Background technology

Along with the development of information globalization, multi-lingual and multi-lingual communication becomes more and more general phenomenon.Single speech recognition system can not be carried out effective recognition to multi-lingual communication, and the speech recognition system that foundation can be discerned multilingual, voice signal is a new task of speech recognition technology.

Chinese is the maximum language of present user, the English person of the being to use the widest language that distributes, and therefore setting up a bilingual recognition system of Chinese and English has good application prospects.

The implementation of first kind of Chinese and English double-language voice recognition system is in the prior art: Chinese speech recognizer and English speech recognition device are integrated, speech data to input carries out languages identification earlier, then according to the corresponding speech recognition device of call by result of languages identification, thereby realize the task that Chinese and English double-language voice are discerned.

The implementation of second kind of Chinese and English double-language voice recognition system is in the prior art: the way according to linguistic knowledge or data-driven realizes that the parameter of Chinese and English model is shared, reduce model degree of obscuring, train acoustic model and language model that Chinese and English is shared on this basis.So only use a recognizer just can discern Chinese, English and Chinese and English hybrid voice.

In the process that realizes the embodiment of the invention, the inventor finds such scheme, and there are the following problems at least: the implementation of first kind of Chinese and English double-language voice recognition system needs a large amount of good speech datas of mark to be used for the training of acoustic model, and takies more system resources.The implementation of above-mentioned second kind of Chinese and English double-language voice recognition system only utilizes linguistic knowledge or data-driven to carry out model parameter sharing on model hierarchy, it is insufficient to cause parameter to share, Chinese and English model degree of obscuring is bigger, and then causes the recognition performance of Chinese and English double-language voice recognition system not accurate enough.

Summary of the invention

Embodiments of the invention provide a kind of method and apparatus that Chinese and English speech signal is discerned, and to improve the discrimination of Chinese and English speech signal effectively, reduce the system resource that takies.

A kind of method that Chinese and English speech signal is discerned comprises:

By searching algorithm Chinese and English speech signal to be identified is carried out feature extraction, obtain the characteristic information of described voice signal to be identified;

The acoustic model that described characteristic information is corresponding with each aligned phoneme sequence in the confluent articulation database compares;

Determine the aligned phoneme sequence of described characteristic information correspondence according to comparative result, obtain the Chinese and English mixing phrase of this aligned phoneme sequence correspondence, with the recognition result of this Chinese and English mixing phrase as described Chinese and English speech signal to be identified.

A kind of device that Chinese and English speech signal is discerned comprises:

Characteristic information extracting module is used for by searching algorithm Chinese and English speech signal to be identified being carried out feature extraction, obtains the characteristic information of described voice signal to be identified;

The identification comparison module is used for the acoustic model that described characteristic information is corresponding with each aligned phoneme sequence of the confluent articulation database of presetting and compares,

Processing module is used for determining according to comparative result the aligned phoneme sequence of described characteristic information correspondence, obtains the Chinese and English mixing phrase of this aligned phoneme sequence correspondence, with the recognition result of this Chinese and English mixing phrase as described Chinese and English speech signal to be identified.

The technical scheme that is provided by the embodiment of the invention described above as can be seen, the embodiment of the invention compares by the acoustic model that the characteristic information of voice signal to be identified is corresponding with each aligned phoneme sequence in the confluent articulation database, can improve the discrimination of Chinese and English speech signal effectively.

Description of drawings

In order to be illustrated more clearly in the technical scheme of the embodiment of the invention, the accompanying drawing of required use is done to introduce simply in will describing embodiment below, apparently, accompanying drawing in describing below only is some embodiments of the present invention, for those of ordinary skills, under the prerequisite of not paying creative work, can also obtain other accompanying drawing according to these accompanying drawings.

The processing flow chart of a kind of method that Chinese and English speech signal is discerned that Fig. 1 provides for the embodiment of the invention one;

A kind of Chinese sound pattern master of sharing based on Chinese and English parameter that Fig. 2 provides for the embodiment of the invention two and the processing flow chart of English phoneme model way;

The structured flowchart of a kind of Chinese and English speech signal recognition device that Fig. 3 provides for the embodiment of the invention three;

The structural representation of a kind of confluent articulation database that Fig. 4 provides for the embodiment of the invention three;

Data strip purpose structural representation in a kind of confluent articulation database that Fig. 5 provides for the embodiment of the invention three;

The structural representation of a kind of context-sensitive hidden Markov model corresponding to an aligned phoneme sequence that Fig. 6 provides for the embodiment of the invention three;

The processing flow chart that carries out speech recognition of a kind of Chinese and English speech signal recognition device that Fig. 7 provides for the embodiment of the invention three.

Embodiment

For ease of understanding, be that example is further explained explanation below in conjunction with accompanying drawing with several specific embodiments, and each embodiment does not constitute the qualification to the embodiment of the invention to the embodiment of the invention.

Embodiment one

Treatment scheme such as Fig. 1 of a kind of method that Chinese and English speech signal is discerned that this embodiment provides comprise following treatment step:

Step 11, Chinese and English speech signal to be identified is carried out feature extraction, obtain the characteristic information of described voice signal to be identified by searching algorithm.

After obtaining Chinese and English speech signal to be identified, by searching algorithm Chinese and English speech signal to be identified is carried out feature extraction, obtain the characteristic information of described voice signal to be identified.Above-mentioned searching algorithm can for frame synchronization beam search algorithm, based on the back to the N-best storehouse of the ternary syntax (3-gram) decoding searching algorithm etc.

Step 12, the acoustic model that described characteristic information is corresponding with each aligned phoneme sequence in the confluent articulation database of presetting compare.

This embodiment need be according to the correspondence mappings method of Chinese and English pronunciation phonemes, utilize Chinese sound pattern master and English phoneme model to set up and comprise a plurality of data strip purpose confluent articulation databases, comprise in each data clauses and subclauses that a Chinese and English mixes phrase and represents described Chinese and English to mix the aligned phoneme sequence of the acoustic feature of phrase.

Then, the characteristic information acoustic model corresponding with each aligned phoneme sequence in the above-mentioned confluent articulation database with described voice signal to be identified compares.

Step 13, determine the aligned phoneme sequence of described characteristic information correspondence, obtain the Chinese and English mixing phrase of this aligned phoneme sequence correspondence, with the recognition result of this Chinese and English mixing phrase as described Chinese and English speech signal to be identified according to comparative result.

By the Chinese speech training data each aligned phoneme sequence in the described confluent articulation database is trained, obtain the acoustic model of the Chinese of each aligned phoneme sequence correspondence.

The acoustic model of the Chinese that the characteristic information of described voice signal to be identified is corresponding with described each aligned phoneme sequence compares, and obtains corresponding similarity respectively, obtains the acoustic model of the highest Chinese of similarity.The Chinese and English mixing phrase of the aligned phoneme sequence correspondence of the acoustic model correspondence of the Chinese that described similarity is the highest is as the recognition result of described Chinese and English speech signal to be identified.

The acoustic model of described Chinese comprises: context-sensitive hidden Markov model.

Embodiment two

The embodiment of the invention two provides a kind of Chinese sound pattern master and English phoneme model method of sharing based on Chinese and English parameter, and this method treatment scheme comprises the steps: as shown in Figure 2

Step 21: single state model table of comparisons of setting up Chinese sound mother and English phoneme.

For fear of female identical the obscuring of causing of Chinese sound with English phoneme symbol, before the female symbol of Chinese sound, add prefix ch_, before English phoneme, add prefix eng_, be written as ch_f as Chinese initial consonant f, English phoneme f is written as eng_f: Chinese sound female 64 (comprising zero initial), English 45 of phonemes (selecting the British English phoneme for use), totally 109 of Chinese sound mother and English phonemes.

Each Chinese sound mother and English phoneme are split into a plurality of (such as 3) single state model, and for example Chinese initial consonant ch_f is split as ch_f1, ch_f2, and ch_f3, English phoneme eng_f is split as eng_f1, eng_f2, eng_f3.Make single state model table of comparisons of Chinese sound mother and English phoneme, the line number of this list state model table of comparisons is 109, and columns is 2, and following table 1 is the part of this list state model table of comparisons.

Table 1:

Female and the English phoneme (109) of Chinese sound	Single state model after the fractionation
Female and the English phoneme (109) of Chinese sound	Single state model after the fractionation	??ch_f	??ch_f1，ch_f2，ch_f3
??eng_f	??eng_f1，eng_f2，??eng_f3	??ch_f	??ch_f1，ch_f2，ch_f3

Step 22: produce single state model mark file of Chinese, English training data and the test data of single state model mark, extract characteristic parameter.

Make the sound female unit mark of Chinese training data and the phoneme mark of English training data.

Described phoneme distance pole refers to the initial end position that identifies the female unit of sound (or English phoneme) in the voice data file, if for example the content of one section Chinese speech data is " Beijing ", distance pole in the following way:

b??????100????120

ei?????120????350

j??????350????410

ing????410????620

The unit of top numerical value is a millisecond.

Single state model table of comparisons according to above-mentioned Chinese sound mother and English phoneme, the female mark of the sound of Chinese training data, the phoneme mark of English training data, single state model of producing Chinese, English training data and the test data of single state model mark marks file.

The female unit of sound (or English phoneme) comprises a plurality of states, for example simple or compound vowel of a Chinese syllable ei comprises ei1, ei2, three states of ei3, in front for example in ei from 120 milliseconds still to 350 milliseconds of end, can adopt manual or automatic method to distribute to this three states this one piece of data, the form of the single state model mark after distributing can be as follows:

ei1????120????180

ei2????180????280

ei3????280????350

Extract the characteristic parameter of Chinese and English training data, as MFCC ((Mel FrequencyCepstral Coefficient, Mei Er frequency cepstral coefficient) parameter.

MFCC Parameter Extraction method is, voice signal is divided into some frames, it is wonderful wonderful to 30 millis that the length of frame is generally 10 millis, then every frame signal is carried out time-frequency conversion, the frequency-region signal after the conversion is divided into some groups according to people's hearing mechanism, obtain every group energy, again the energy of these groups is asked logarithm and carried out surplus profound conversion, coefficient after the surplus profound conversion is the MFCC parameter, because this Parameter Extraction is a known technology, the present invention does not introduce in more detail.

Step 23: to Chinese and English single state model training.

Chinese training data with HTK (Hidden Markov Model Toolkit hides the Markov model kit) instrument and arrangement is trained Chinese single state model.Certainly, also can adopt other method training Chinese and English single state model in embodiments of the present invention, present embodiment includes but not limited to aforesaid way.

Train English single state model with the English training data of HTK instrument and arrangement.

Step 24: carry out TCM (Two-pass phone clustering method based onConfusion Matrix, twice phoneme clustering method based on confusion matrix) first pass of twice searching algorithm search, Chinese is as target language, and is English as source language.

With Chinese list state model Chinese test data is alignd, obtain the zero-time that Chinese single state model sequence and each Chinese single state model occur.

With English list state model Chinese test data is discerned, obtained the zero-time that English single state model sequence and each English single state model occur.

Definition is with existing rule, when in same Chinese test data file, the time that the single state model of Chinese and English single state model occur overlaps when partly reaching the certain proportion of Chinese single state model duration, defines this Chinese list state model and this English list state model and occurs once with existing.According to showing the same existing matrix ch_eng_co_current_matri that rule obtains Chinese single state model and English single state model together, this is 192 with showing the matrix line number, and columns is 135.

Obtain the confusion matrix ch_eng_confusion_matix of Chinese single state model and English single state model by the same existing matrix of Chinese list state model and English single state model.The i of this confusion matrix is capable, and j element is that i is capable in the same matrix now, and the value of j element is divided by j column element sum.

ch_eng_confusion_matix [i, j] = \frac{ch_eng_co_current_matri [i, j]}{Σ_{i = 1}^{192} ch_eng_co_current_matri [i, j]}

I=1 wherein, 2,,, 192; J=1,2,,, 135.

Step 25: carry out second time search of twice searching algorithm of TCM, English as target language, Chinese is as source language.

With English list state model English test data is alignd, obtain the zero-time that English single state model sequence and each English single state model occur.

With Chinese list state model English test data is discerned, obtained the zero-time that Chinese single state model sequence and each Chinese single state model occur.

Definition is with existing rule, when in same English test data file, the time that English single state model and Chinese single state model occur overlaps when partly reaching the certain proportion of English single state model duration, defines this English list state model and this Chinese list state model and occurs once with existing.According to showing the same existing matrix eng_ch_co_current_matri that rule obtains English single state model and Chinese single state model together, this is 135 with showing the matrix line number, and columns is 192.

Same existing matrix by the single state model of above-mentioned English and Chinese single state model obtains English single state model and Chinese single state model confusion matrix.The i of this confusion matrix is capable, and j element is that i is capable in the same matrix now, and the value of j element is divided by j column element sum.

eng_ch_confusion_matix [i, j] = \frac{eng_ch_co_current_matri [i, j]}{Σ_{i = 1}^{135} eng_ch_co_current_matri [i, j]}

I=1 wherein, 2,,, 135; J=1,2,,, 192.

Step 26: comprehensive TCM first pass and second time Search Results, obtain final confusion matrix final_confusion_matrix, according to this final hybrid matrix Chinese list state model and English single state model are carried out cluster.

After the Chinese single state model that step 24 is obtained and the confusion matrix of English single state model carry out transposition, the corresponding element summation of the English single state model that obtains with step 25 and the confusion matrix of Chinese single state model, be averaged and obtain final confusion matrix final_confusion_matrix.

final_confusion_matrix = \frac{1}{2} ({ch_eng_confusion_matrix + exg_ch_confusion_matrix}^{T})

Search the element of degree of obscuring maximum from above-mentioned final confusion matrix, if it is capable that this element is arranged in the i of final confusion matrix, the j row, then that English single state model of the capable correspondence of i is corresponding with the j row Chinese single state model is poly-to be a class, i row element and j column element among the deletion matrix final_confusion_matrix, this moment, final confusion matrix final_confusion_matrix reduced delegation, reduced by row.

Upgrade above-mentioned final confusion matrix, the value of the capable k column element of n is updated to the minimum value of all the phoneme confusion degree in the class that all phonemes Chinese phoneme corresponding with the k row in the class that the English phoneme of the capable correspondence of n belongs to belong in this final confusion matrix.

Repeat the above-mentioned element that searches degree of obscuring maximum, upgrade the processing procedure of above-mentioned final confusion matrix, till cluster reaches requirement.

Step 27: with Chinese single state model and the English single state model clustering result and the female and English phoneme list state model table of comparisons through the Chinese sound that splits, synthetic Chinese sound pattern master and English phoneme model, calculate Bhattacharyya (refined in Ba Ta is proper, the name) distance between Chinese sound pattern master and the English phoneme model.

According to the Chinese after the cluster and English single state model, synthetic Chinese sound pattern master and English phoneme model.

Calculate the Bhattacharyya distance between Chinese sound pattern master and the English phoneme model, Bhattacharyya distance definition between Chinese sound pattern master and the English phoneme model is, Bhattacharyya is apart from sum between Chinese sound pattern master and the English phoneme model corresponding states, obtain Bhattacharyya distance matrix distance_matrix, between for example Chinese initial consonant ch_f and the English phoneme eng_b is (establishing ch_f is that i is individual among the Chinese sound mother, and eng_b is j in the English phoneme) apart from Bhattacharyya:

dis \tan ce_matrix [i, j] = Σ_{n = 1}^{3} dis \tan ce (i_{n}, j_{n})

I wherein _nBe n the state of initial consonant ch_f, j _nN state for phoneme eng_b.

If i _nOutput probability Gaussian distributed N (μ _{I, n}, σ _{I, n}), j _nOutput probability Gaussian distributed N (μ _{J, n}, σ _{J, n})

Then:

dis \tan ce (i_{n}, j_{n}) = \frac{1}{8} {(μ_{i, n} - μ_{j, n})}^{2} \times {(\frac{σ_{i, n} + σ_{j, n}}{2})}^{- 1} + \frac{1}{2} \ln \frac{(σ_{i, n} + σ_{j, n}) / 2}{{σ_{i, n}}^{\frac{1}{2}} \times {σ_{j, n}}^{\frac{1}{2}}}

Step 28: utilize the Bhattacharyya distance matrix and the clustering distance threshold value k of Chinese sound mother and English phoneme model, Chinese sound pattern master and English phoneme are carried out cluster.

Utilize the Bhattacharyya distance matrix of the female and English phoneme model of Chinese sound that above-mentioned steps 27 obtains, and predefined clustering distance threshold value k, Chinese sound pattern master and English phoneme model are carried out cluster.

Search the element of numerical value minimum from above-mentioned Bhattacharyya distance matrix, if it is capable that this element is arranged in the i of Bhattacharyya distance matrix, the j row, then that the English phoneme model of the capable correspondence of i is corresponding with the j row Chinese sound pattern master is poly-to be a class.

Upgrade above-mentioned Bhattacharyya distance matrix, the value of the capable j column element of i in the Bhattacharyya distance matrix is updated to the maximal value of all the phoneme confusion degree in the class that all phonemes Chinese sound mother corresponding with the j row in the class that the English phoneme of the capable correspondence of i belongs to belong to.

Repeat in the above-mentioned Bhattacharyya distance matrix search numerical value minimum element, upgrade the processing procedure of above-mentioned final Bhattacharyya distance matrix, greater than threshold value k, then cluster process finishes up to the element of the numerical value minimum that searches out.

This embodiment splits into a plurality of (such as 3) single state model by each Chinese sound is female and English phoneme, sets up single state model table of comparisons of Chinese sound mother and English phoneme, synthetic Chinese sound pattern master and English phoneme model.Take into account the characteristics of Chinese and English equivalents, make Chinese and English model parameter fully shared in the state aspect, in the distance that has kept on the model hierarchy between the Chinese and English bilingual model.

Embodiment three

Based on above-mentioned Chinese sound pattern master and English phoneme model, the structured flowchart of a kind of Chinese and English speech signal recognition device that this embodiment provides comprises as shown in Figure 3 as lower module:

Characteristic information extracting module 33 is used for by searching algorithm Chinese and English speech signal to be identified being carried out feature extraction, obtains the characteristic information of described voice signal to be identified.Above-mentioned characteristic information extracting module can be by realizing with the speech recognition decoder device of language independent,

Identification comparison module 35 is used for the acoustic model that described characteristic information is corresponding with each aligned phoneme sequence of the confluent articulation database of presetting and compares;

Processing module 36, be used for determining the aligned phoneme sequence of described characteristic information correspondence according to the comparative result that described identification comparison module 35 is obtained, obtain the Chinese and English mixing phrase of this aligned phoneme sequence correspondence, with the recognition result of this Chinese and English mixing phrase as described Chinese and English speech signal to be identified.

Described device can also comprise 31:

Chinese sound pattern master and English phoneme model are set up module, are used for each Chinese sound mother and English phoneme are split into a plurality of single state models, make the contrast relationship of single state model of female single state model of Chinese sound and English phoneme; With single state model that Chinese training data is trained described Chinese sound mother, train single state model of described English phoneme with English training data;

Utilize the contrast relationship of single state model of female single state model of described Chinese sound and English phoneme, twice searching algorithm of single state model execution TCM female to the described Chinese sound after the training and English phoneme obtains confusion matrix, according to described hybrid matrix single state model of single state model of described Chinese sound mother, English phoneme carried out cluster;

According to the synthetic Chinese sound pattern master of the single state model of the described Chinese sound mother after the cluster,, and described Chinese sound pattern master and English phoneme model carried out cluster according to the synthetic English phoneme model of single state model of the described English phoneme after the cluster.

Language model administration module 34, be used for correspondence mappings method according to Chinese and English pronunciation phonemes, utilize Chinese sound pattern master and English phoneme model to set up the confluent articulation database, described confluent articulation database comprises a plurality of data clauses and subclauses, comprises in each data clauses and subclauses that a Chinese and English mixes phrase and the described Chinese and English aligned phoneme sequence of mixing the acoustic feature of phrase of representative.

Described confluent articulation database has taken into full account Chinese and English separately pronunciation characteristic, and the phoneme of English has been mapped on the Chinese sound mother.The structural representation of a kind of above-mentioned confluent articulation database that this embodiment provides as shown in Figure 4, the confluent articulation database can comprise a plurality of data clauses and subclauses, such as comprising that data clauses and subclauses 1 are to data clauses and subclauses N.A kind of above-mentioned data strip purpose structural representation that this embodiment provides as shown in Figure 5, the data clauses and subclauses among Fig. 5 comprise an entry sequence number, a Chinese and English mixes phrase and an aligned phoneme sequence.The entry sequence number is used to organize all entries.

Above-mentioned Chinese and English mixing phrase can be the language or the voice of any kind, comprises sentence, phrase or the word represented according to suitable scheme.Such as, can be the order phrase (opening recreation PK as Peter) that Chinese phrase, English phrase and Chinese and English mix, the Chinese and English mixing phrase voice that its corresponding voice are unspecified person, standard mandarin or slightly accent supported in Chinese, English support standard English or slightly accent.Chinese and English mixes the entry number individual layer of phrase can dynamically be adjusted, from tens of to the hundreds of bar, and can expand the entry sum by the layering method for building up; The content of identification entry can need not to train again by the external data automatic software updating.

Data clauses and subclauses class in the above-mentioned confluent articulation database can include, but is not limited to the Chinese mandarin-english class of Chinese mandarin class, English class, mixing.Wherein, Chinese mandarin class can comprise any proper word or the phrase of selecting from standard mandarin.Similarly, the English class can comprise any proper word or the phrase of selecting from standard English.Yet, for pronunciation of English being changed into corresponding Chinese mandarin pronunciation, need utilize the quick correspondence mappings method of Chinese and English pronunciation phonemes to finish the conversion of pronunciation of English to Chinese mandarin pronunciation from the word and expression of English class.The Chinese mandarin-english class of mixing can comprise any proper word or the phrase of selecting from standard Chinese mandarin language and standard English language.The embodiment of the invention can make up these word and expressions from two kinds of different languages, so that create the combined type phrase in the confluent articulation database.The combined type phrase can be the combination in any of Chinese phrase and English phrase, and for example Chinese phrase is preceding, English phrase in the back or English phrase preceding, Chinese phrase after array mode.Yet, as discussed above, for pronunciation of English being changed into corresponding Chinese mandarin pronunciation, need utilize the quick correspondence mappings method of Chinese and English pronunciation phonemes to finish the conversion of pronunciation of English to Chinese mandarin pronunciation from the word and expression of English class.

Above-mentioned aligned phoneme sequence is corresponding with Chinese and English mixing phrase, is used to represent that Chinese and English mixes the acoustic characteristic of phrase, and aligned phoneme sequence can be made of a series of phonemes in the predetermined set of phonemes, and set of phonemes is wherein used for the speech recognition decoder device.In Fig. 5, above-mentioned aligned phoneme sequence can comprise phoneme 1 to phoneme N.

For purposes of illustration, provide following table 1 as specific embodiment of above-mentioned data strip purpose.

Table 1:

??1	[peter opens]	??p?i?t?e?zh?ang
??1	[peter opens]	??p?i?t?e?zh?ang	??2	[book room]	??d?u?k?f?ang?j?ian
??3	[your mailbox of check]	??ch?ai?k?n?i?d?e_i?iu?x?iang	??2	[book room]	??d?u?k?f?ang?j?ian
??3	[your mailbox of check]	??ch?ai?k?n?i?d?e_i?iu?x?iang	??4	[mood of high]	??g?uai?d?e?x?in?q?ing
??5	[Beijing airport]	??b?ei?j?ing?ei?p?o?t	??4	[mood of high]	??g?uai?d?e?x?in?q?ing
??5	[Beijing airport]	??b?ei?j?ing?ei?p?o?t	??6	[hello]	??n?i?h?ao
??7	[point in afternoons two]	??x?ia_u?u?l?iang?d?ian	??6	[hello]	??n?i?h?ao
??7	[point in afternoons two]	??x?ia_u?u?l?iang?d?ian	??8	[point in afternoons three]	??x?ia_u?u?s?an?d?ian
??9	[weather how about]	??t?ian?q?i?h?ui?z?en_i?iang	??8	[point in afternoons three]	??x?ia_u?u?s?an?d?ian
??9	[weather how about]	??t?ian?q?i?h?ui?z?en_i?iang	??10	[weather how]	??t?ian?q?i?z?en?m?e_i?iang
??11	[night]	??n?ai?t	??10	[weather how]	??t?ian?q?i?z?en?m?e_i?iang
??11	[night]	??n?ai?t	??12	[super?star]	??s?ui?p?er?s?t?a
??13	[my?god]	??m?ai?g?o?d	??12	[super?star]	??s?ui?p?er?s?t?a
??13	[my?god]	??m?ai?g?o?d	??14	[shopping?mall]	??x?ve?o?p?i?eng?m?ao
??15	[I?dont′s?care]	??ai?d?o?eng?t?ch?ei?er	??14	[shopping?mall]	??x?ve?o?p?i?eng?m?ao

In above-mentioned table 1, the entry sequence number is represented on first hurdle longitudinally; Second hurdle represents that Chinese and English mixes phrase longitudinally; Third column is represented the aligned phoneme sequence corresponding to the acoustic characteristic of Chinese and English mixing phrase longitudinally.1 to 5 horizontal hurdle is the embodiment of the combined type phrase selected from standard Chinese mandarin language and standard English language, belongs to the Chinese mandarin-english class of mixing; 6 to 10 horizontal hurdles are embodiment of the word or expression selected from standard mandarin, belong to Chinese mandarin class; 11 to 15 horizontal hurdles are embodiment of the word or expression selected from standard English, belong to the English class.Wherein the quick correspondence mappings method that need utilize Chinese and English pronunciation phonemes from the described word and the described phrase of English class is finished the conversion of pronunciation of English to Chinese mandarin pronunciation.

For purposes of illustration, provide following table 2 a embodiment as the quick correspondence mappings method of Chinese and English pronunciation phonemes.Man represents on one hurdle the sound mother of Chinese in the table 2, and Eng represents on one hurdle English phoneme.English phoneme can be mapped on the sound mother in nearest Man one hurdle, the left side.

Table 2:

??Man	??Eng	??Man	??Eng	??Man	??Eng	??Man	??Eng	??Man	??Eng
??Man	??Eng	??Man	??Eng	??Man	??Eng	??Man	??Eng	??Man	??Eng	??a	??aa	??eng	??aa?ng	??ing	??iy?ng	??q	??ch	??uen	??er?n
??ai	??ae	??er	??aa	??iong	??uw?ng	??r	??y	??ueng	??aa?ng	??a	??aa	??eng	??aa?ng	??ing	??iy?ng	??q	??ch	??uen	??er?n
??ai	??ae	??er	??aa	??iong	??uw?ng	??r	??y	??ueng	??aa?ng	??an	??ae?ng	??f	??f	??iou	??ow	??s	??s	??uo	??ao
??ang	??aa?ng	??g	??g	??j	??y	??sh	??sh	??ü	??iy	??an	??ae?ng	??f	??f	??iou	??ow	??s	??s	??uo	??ao
??ang	??aa?ng	??g	??g	??j	??y	??sh	??sh	??ü	??iy	??ao	??aw	??g	??hh	??k	??k	??t	??hh	??üan	??ae?ng
??b	??b	??i	??iy	??l	??y	??u	??uw	??üe	??ey	??ao	??aw	??g	??hh	??k	??k	??t	??hh	??üan	??ae?ng
??b	??b	??i	??iy	??l	??y	??u	??uw	??üe	??ey	??c	??th	??ia	??aa	??m	??m	??ua	??aa	??ün	??ey?ng
??ch	??ch	??ian	??ae?ng	??n	??y	??uai	??ay	??w	??w	??c	??th	??ia	??aa	??m	??m	??ua	??aa	??ün	??ey?ng
??ch	??ch	??ian	??ae?ng	??n	??y	??uai	??ay	??w	??w	??d	??b	??iang	??aa?ng	??o	??ao	??uan	??ay?ng	??x	??s
??e	??aa	??iao	??aw	??ong	??ow?ng	??uang	??aa?ng	??y	??y	??d	??b	??iang	??aa?ng	??o	??ao	??uan	??ay?ng	??x	??s
??e	??aa	??iao	??aw	??ong	??ow?ng	??uang	??aa?ng	??y	??y	??ei	??ey	??ie	??ey	??ou	??ow	??uei	??ey	??z	??th

??Man	??Eng	??Man	??Eng	??Man	??Eng	??Man	??Eng	??Man	??Eng
??Man	??Eng	??Man	??Eng	??Man	??Eng	??Man	??Eng	??Man	??Eng	??en	??ae?n	??in	??iy?ng	??p	??p	??ue	??er?n	??zh	??jh

The Chinese pronunciations phoneme is represented on Man hurdle in the described table 2, and the English equivalents phoneme is represented on the Eng hurdle in the described table 1.An embodiment of the quick correspondence mappings method of the Chinese and English pronunciation phonemes that provides in the table 2 is on the basis of data-driven and statistical method and linguistic rules, set up based on above-mentioned Chinese sound pattern master and English phoneme model, it has taken into full account Chinese and English pronunciation characteristic separately, the phoneme of English has been mapped on the Chinese sound mother, formed of the credible combination of Chinese and English mixing voice, provide the assurance on the method for making up the confluent articulation database at phoneme and dictionary layer.

Acoustic model administration module 32 is used for by the Chinese speech training data each aligned phoneme sequence of confluent articulation database being trained, and obtains the acoustic model of the Chinese of each aligned phoneme sequence correspondence.The acoustic model of described Chinese can be context-sensitive hidden Markov (HMM) model that obtains after utilizing a large amount of Chinese speech data that above-mentioned confluent articulation database is trained.This hidden Markov model can comprise to be described the acoustics of existing any one phoneme, and can effectively utilize a specific aligned phoneme sequence and represent that each Chinese and English in the confluent articulation database mixes phrase.The structural representation of this embodiment provides a kind of context-sensitive hidden Markov model corresponding to an aligned phoneme sequence as shown in Figure 6.In Fig. 6, context-sensitive hidden Markov model is corresponding to the acoustic model of Chinese, and it is to be come by a large amount of Chinese speech data training.The Chinese training data can be the part of 863 speech databases, and can comprise the speech data that has Chongqing, Guangzhou, Xiamen, the four kinds of accents in Shanghai.

Described characteristic information extracting module 33 specifically can comprise:

The first search processing module 331 is used to utilize described confluent articulation database, by the frame synchronization beam search based on the forward direction bi-gram Chinese and English speech signal to be identified is carried out characteristic information and extracts, and obtains first Search Results;

The second search processing module 332 is used to utilize described confluent articulation database, by separating code searching to the storehouse of the ternary syntax described first Search Results is further searched for based on the back, obtains the characteristic information of described Chinese and English speech signal to be identified.

Described identification comparison module 35, the acoustic model that is used for the Chinese that described characteristic information is corresponding with described each aligned phoneme sequence compares, and obtains corresponding similarity respectively, determines the acoustic model of the Chinese that similarity is the highest;

Described processing module 36, be used to obtain the aligned phoneme sequence of the acoustic model correspondence of the highest Chinese of described identification comparison module 35 determined similarities, by inquiring about the Chinese and English mixing phrase that described confluent articulation database obtains described aligned phoneme sequence correspondence, with the recognition result of this Chinese and English mixing phrase as described Chinese and English speech signal to be identified.

Above-mentioned characteristic information extracting module can be by realizing with the speech recognition decoder device of language independent, above-mentioned speech recognition decoder device is used to utilize Chinese acoustic model and confluent articulation database to come succinctly, finish accurately, efficiently and in real time the identification of Chinese and English speech signal.

The treatment scheme that a kind of above-mentioned Chinese and English speech signal recognition device that this embodiment provides carries out speech recognition as shown in Figure 7, concrete processing procedure is as follows:

The Chinese and English speech signal recognition device at first will be finished initialization, could receive Chinese and English speech signal and finish identification mission.Want loading system configuration file and parameters,acoustic configuration file before the initialization, and analytically dependent configuration file, ready for being written into acoustic model and confluent articulation database.

Be written into the acoustic model (context-sensitive hidden Markov model) of above-mentioned confluent articulation database and Chinese, then, system can set up the HMM lexical tree, and analyzes the parameter of all models.After these preliminary works were finished, system began the initialization input equipment, received Chinese and English speech signal to be identified, by utilizing the speech recognition decoder device to carry out twin-channel search, exported recognition result at last.

In Chinese and English speech signal recognition device shown in Figure 3, can handle by the search that the speech recognition decoder device utilizes the confluent articulation database that the Chinese and English speech signal of input is carried out two passages, obtain the characteristic information of the Chinese and English speech signal of input.First passage (Pass 1) is based on the frame synchronization beam search of forward direction bi-gram (2-gram), and it is a kind of proximity search method of high speed; Second channel (Pass 2) is based on the back and separates code searching to the N-best storehouse of the ternary syntax (3-gram).The processing of initial input is carried out at first passage, and last search is handled at second channel, and it has used the Search Results of first passage, has dwindled the search volume, has guaranteed the accuracy of high search.This and speech recognition decoder device language independent does not need to carry out languages identification, does not have the risk that the languages identification error occurs, and can expand and be applied to other languages.

Then, the context-sensitive hidden Markov model that the characteristic information of the Chinese and English speech signal of the input that the speech recognition decoder device will obtain is corresponding with each aligned phoneme sequence in the confluent articulation database compares, obtain corresponding similarity respectively, determine the hidden Markov model that similarity is the highest.The Chinese and English mixing phrase of the aligned phoneme sequence correspondence of the hidden Markov model correspondence that described similarity is the highest is as the recognition result of the Chinese and English speech signal of described input.

Cooperate the acoustic model of Chinese and Chinese and English mixings phrase recognition system that the speech recognition decoder device makes up can finish search mission by the confluent articulation database among this embodiment, and keep in real time, at a high speed and the accuracy of discerning up to 65535 speech.For the writing task of listening of 20000 speech of single languages, discrimination is up to more than 95%; For the identification mission that 500 Chinese and English mix the order speech, discrimination is more than 88%.Chinese and English mixing phrase recognition system is low to the demand of internal memory, can not take a lot of computer resources.General 20000 speech are with interior search mission, and the internal memory that needs can not surpass 64MBytes.

This embodiment sets up the confluent articulation database by the quick correspondence mappings method of utilizing Chinese and English pronunciation phonemes.Described confluent articulation database provides the accurate expression of Chinese mandarin language spoken language, and can be by the various word and expressions of quick correspondence mappings method merging from English language.Speech recognition decoder device among the present invention can only utilize the confluent articulation database to cooperate the acoustic model of Chinese to finish the identification mission of Chinese and English speech signal succinctly, accurately, efficiently and in real time.

One of ordinary skill in the art will appreciate that all or part of flow process that realizes in the foregoing description method, be to instruct relevant hardware to finish by computer program, described program can be stored in the computer read/write memory medium, this program can comprise the flow process as the embodiment of above-mentioned each side method when carrying out.Wherein, described storage medium can be magnetic disc, CD, read-only storage memory body (Read-Only Memory, ROM) or at random store memory body (Random Access Memory, RAM) etc.

In sum, the embodiment of the invention makes Chinese and English model parameter fully shared in the state aspect, in the distance that has kept on the model hierarchy between the Chinese and English bilingual model.The method that employing state hierarchical clustering is female at Chinese sound then and English phoneme model level compares can be set up the littler acoustic model of degree of obscuring, and not need the good voice training data of a large amount of marks, conserve system resources.

The embodiment of the invention is utilized Chinese sound pattern master and English phoneme model to set up and is comprised a plurality of data strip purpose confluent articulation databases according to the correspondence mappings method of Chinese and English pronunciation phonemes, can improve the discrimination of Chinese and English speech signal effectively.

The embodiment of the invention comprises Chinese and English is mixed the support that recognition system is transplanted to be provided on the other Languages on the theory and technology for striding speech recognition.

The above; only for the preferable embodiment of the present invention, but protection scope of the present invention is not limited thereto, and anyly is familiar with those skilled in the art in the technical scope that the present invention discloses; the variation that can expect easily or replacement all should be encompassed within protection scope of the present invention.Therefore, protection scope of the present invention should be as the criterion with the protection domain of claim.

Claims

1. the method that Chinese and English speech signal is discerned is characterized in that, comprising:

The acoustic model that described characteristic information is corresponding with each aligned phoneme sequence in the confluent articulation database of presetting compares;

2. the method that Chinese and English speech signal is discerned according to claim 1 is characterized in that, before by searching algorithm Chinese and English speech signal to be identified being carried out feature extraction, described method also comprises:

Correspondence mappings method according to Chinese and English pronunciation phonemes, utilize Chinese sound pattern master and English phoneme model to set up and comprise a plurality of data strip purpose confluent articulation databases, comprise in each data clauses and subclauses that a Chinese and English mixes phrase and represents described Chinese and English to mix the aligned phoneme sequence of the acoustic feature of phrase.

3. the method that Chinese and English speech signal is discerned according to claim 2 is characterized in that, the correspondence mappings method of described Chinese and English pronunciation phonemes comprises following table 1:

Table 1:

??Man ??Eng ??Man ??Eng ??Man ??Eng ??Man ??Eng ??Man ??Eng ??a ??aa ??eng ??aa?ng ??ing ??iy?ng ??q ??ch ??uen ??er?n ??ai ??ae ??er ??aa ??iong ??uw?ng ??r ??y ??ueng ??aa?ng ??an ??ae?ng ??f ??f ??iou ??ow ??s ??s ??uo ??ao ??ang ??aa?ng ??g ??g ??j ??y ??sh ??sh ??ü ??iy ??ao ??aw ??g ??hh ??k ??k ??t ??hh ??üan ??ae?ng ??b ??b ??i ??iy ??l ??y ??u ??uw ??üe ??ey ??c ??th ??ia ??aa ??m ??m ??ua ??aa ??ün ??ey?ng ??ch ??ch ??ian ??ae?ng ??n ??y ??uai ??ay ??w ??w

??Man ??Eng ??Man ??Eng ??Man ??Eng ??Man ??Eng ??Man ??Eng ??d ??b ??iang ??aa?ng ??o ??ao ??uan ??ay?ng ??x ??s ??e ??aa ??iao ??aw ??ong ??ow?ng ??uang ??aa?ng ??y ??y ??ei ??ey ??ie ??ey ??ou ??ow ??uei ??ey ??z ??th ??en ??ae?n ??in ??iy?ng ??p ??p ??ue ??er?n ??zh ??jh

The Chinese pronunciations phoneme is represented on Man hurdle in the described table 1, and the English equivalents phoneme is represented on the Eng hurdle in the described table 1.

4. the method that Chinese and English speech signal is discerned according to claim 2 is characterized in that, Chinese sound pattern master of described utilization and English phoneme model also comprise before setting up and comprising a plurality of data strip purpose confluent articulation databases:

Each Chinese sound mother and English phoneme are split into a plurality of single state models, make the contrast relationship of single state model of female single state model of Chinese sound and English phoneme;

With single state model that Chinese training data is trained described Chinese sound mother, train single state model of described English phoneme with English training data;

Utilize the contrast relationship of single state model of female single state model of described Chinese sound and English phoneme, single state model execution of and English phoneme female to the described Chinese sound after the training obtains confusion matrix based on twice searching algorithm of twice phoneme clustering method TCM of confusion matrix, according to described hybrid matrix single state model of single state model of described Chinese sound mother, English phoneme is carried out cluster;

5. the method that Chinese and English speech signal is discerned according to claim 2, it is characterized in that, describedly Chinese and English speech signal to be identified is carried out feature extraction, obtains the characteristic information of described Chinese and English speech signal to be identified, comprising by searching algorithm:

Utilize language model, by frame synchronization beam search Chinese and English speech signal to be identified is carried out characteristic information and extract, obtain first Search Results based on the forward direction bi-gram;

Utilize language model,, obtain the characteristic information of described Chinese and English speech signal to be identified by separating code searching to the storehouse of the ternary syntax described first Search Results is further searched for based on the back.

6. according to each described method that Chinese and English speech signal is discerned of claim 1 to 5, it is characterized in that, the described acoustic model that described characteristic information is corresponding with each aligned phoneme sequence in the confluent articulation database compares, determine the aligned phoneme sequence of described characteristic information correspondence according to comparative result, obtain the Chinese and English mixing phrase of this aligned phoneme sequence correspondence, with the recognition result of this Chinese and English mixing phrase, comprising as described Chinese and English speech signal to be identified:

By the Chinese speech training data each aligned phoneme sequence in the described confluent articulation database is trained, obtain the acoustic model of the Chinese of each aligned phoneme sequence correspondence;

The acoustic model of the Chinese that described characteristic information is corresponding with described each aligned phoneme sequence compares, and obtains corresponding similarity respectively, obtains the acoustic model of the highest Chinese of similarity;

The Chinese and English mixing phrase of the aligned phoneme sequence correspondence of the acoustic model correspondence of the Chinese that described similarity is the highest is as the recognition result of described Chinese and English speech signal to be identified.

7. the method that Chinese and English speech signal is discerned according to claim 6 is characterized in that, the acoustic model of described Chinese comprises: context-sensitive hidden Markov model.

8. the device that Chinese and English speech signal is discerned is characterized in that, comprising:

9. the device that Chinese and English speech signal is discerned according to claim 8 is characterized in that, described device also comprises:

10. the device that Chinese and English speech signal is discerned according to claim 9 is characterized in that, described device also comprises:

The language model administration module, be used for correspondence mappings method according to Chinese and English pronunciation phonemes, utilize Chinese sound pattern master and English phoneme model to set up the confluent articulation database, described confluent articulation database comprises a plurality of data clauses and subclauses, comprises in each data clauses and subclauses that a Chinese and English mixes phrase and the described Chinese and English aligned phoneme sequence of mixing the acoustic feature of phrase of representative;

The acoustic model administration module is used for by the Chinese speech training data each aligned phoneme sequence of confluent articulation database being trained, and obtains the acoustic model of the Chinese of each aligned phoneme sequence correspondence.

11. the device that Chinese and English speech signal is discerned according to claim 8 is characterized in that, described characteristic information extracting module comprises:

The first search processing module is used to utilize described confluent articulation database, by the frame synchronization beam search based on the forward direction bi-gram Chinese and English speech signal to be identified is carried out characteristic information and extracts, and obtains first Search Results;

The second search processing module is used to utilize described confluent articulation database, by separating code searching to the storehouse of the ternary syntax described first Search Results is further searched for based on the back, obtains the characteristic information of described Chinese and English speech signal to be identified.

12. according to Claim 8,9, the 10 or 11 described devices that Chinese and English speech signal is discerned, it is characterized in that:

Described identification comparison module, the acoustic model that is used for the Chinese that described characteristic information is corresponding with described each aligned phoneme sequence compares, and obtains corresponding similarity respectively, determines the acoustic model of the Chinese that similarity is the highest;

Described processing module, be used to obtain the aligned phoneme sequence of the acoustic model correspondence of the highest Chinese of the determined similarity of described identification comparison module, by inquiring about the Chinese and English mixing phrase that described confluent articulation database obtains described aligned phoneme sequence correspondence, with the recognition result of this Chinese and English mixing phrase as described Chinese and English speech signal to be identified.