CN104143330A - Voice recognizing method and voice recognizing system - Google Patents

Voice recognizing method and voice recognizing system

Info

Publication number
CN104143330A
Authority
CN
China
Prior art keywords
speech recognition
engine
result
entry
recognition engine
Legal status
Pending
Application number
CN201310163355.8A
Other languages
Chinese (zh)
Inventor
刘贺飞
郭莉莉
Current Assignee
Canon Inc
Original Assignee
Canon Inc
Priority date
2013-05-07
Filing date
2013-05-07
Application filed by Canon Inc
Priority to CN201310163355.8A
Publication of CN104143330A

Abstract

The invention discloses a speech recognition method and a speech recognition system. The method comprises: generating, for each entry in the vocabulary of a first speech recognition engine, a counterpart entry for a second speech recognition engine; adding the generated counterpart entries to the vocabulary of the second engine so that they form a combined vocabulary together with the original entries of the second engine; recognizing input speech with the first engine using its own vocabulary; recognizing the same input speech with the second engine using the combined vocabulary, thereby producing recognition results related to the original entries and recognition results related to the counterpart entries; and, using the counterpart-entry results output by the second engine, comparing the recognition results of the first engine with the original-entry results of the second engine and outputting the comparison result. Each recognition result of the first and second engines comprises a recognized word and a corresponding recognition score.

Description

Speech recognition method and speech recognition system
Technical field
The present invention relates to a speech recognition method and a speech recognition system, and more particularly to a speech recognition method and a speech recognition system that perform speech recognition using two speech recognition engines.
Background art
Speech recognition is a key technology for human-machine interaction that lets a machine recognize a user's voice commands. It can substantially improve the way humans interact with machines and allows users to accomplish more tasks while speaking commands. Speech recognition is realized by a speech recognition engine obtained through online or offline training. The speech recognition process can usually be divided into a training phase and a recognition phase. In the training phase, an acoustic model (AM) and a vocabulary (lexicon) are statistically derived from training data according to the mathematical model on which the speech recognition engine is based. In the recognition phase, the engine processes input speech using the acoustic model and the vocabulary to obtain a recognition result. For example, features are extracted from the spectrogram of the input sound to obtain feature vectors, a phoneme sequence (such as [i], [o], etc.) is then obtained according to the acoustic model, and finally the word, or even sentence, that best matches the phoneme sequence is located in the vocabulary.
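As a rough, non-authoritative illustration of this recognition phase (not of the patented method itself), the following Python sketch stubs out feature extraction and acoustic decoding and shows only the final vocabulary lookup; all names and the similarity measure are illustrative assumptions.
```python
# A toy sketch of the recognition phase: feature extraction and acoustic
# decoding are stubbed out; only the vocabulary lookup is concrete.
from difflib import SequenceMatcher

def extract_features(audio):
    return audio  # a real engine would compute e.g. MFCC feature vectors

def decode_phonemes(features):
    return ["t", "e", "n", "i", "s", "u"]  # stub for HMM/Viterbi decoding

def lookup(phonemes, lexicon):
    # Locate the entry whose phoneme sequence best matches the decoding.
    def similarity(entry):
        return SequenceMatcher(None, phonemes, entry["phonemes"]).ratio()
    best = max(lexicon, key=similarity)
    return best["text"], similarity(best)

lexicon = [{"text": "テニス", "phonemes": ["t", "e", "n", "i", "s", "u"]},
           {"text": "コーヒー", "phonemes": ["k", "o", "o", "h", "i", "i"]}]
word, score = lookup(decode_phonemes(extract_features(b"...")), lexicon)
```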
A speech recognition system may load more than one speech recognition engine to recognize the same speech simultaneously. For example, the first speech recognition engine may be a speaker-dependent automatic speech recognition (SD-ASR) engine, which is trained to recognize speech from a specific speaker and to output recognition results that include corresponding scores. The second speech recognition engine may be a speaker-independent automatic speech recognition (SI-ASR) engine, which can recognize speech from any user and output recognition results that include corresponding scores. It is known that, because an SD-ASR engine is trained on a specific speaker's speech, its acoustic model represents that speech more accurately, so SD-ASR usually provides better recognition accuracy than SI-ASR. On the other hand, because an SI-ASR engine is trained on speech from non-specific speakers, it can recognize the speech of many speakers, so SI-ASR provides more compatible recognition than SD-ASR. Combining the recognition results of SD-ASR and SI-ASR can therefore improve recognition accuracy while retaining the advantage of better compatibility.
One known combination method is to re-rank all output candidates, i.e., recognition results (each comprising a recognized word and a corresponding recognition score), from the two engines according to their recognition scores. However, because the two engines usually have different vocabularies and/or acoustic models, the distributions of their recognition scores also differ (for example, SI-ASR scores may lie mainly in the range 0-0.5 while SD-ASR scores lie mainly in the range 0.5-1), so the recognition scores from the two engines are difficult to compare directly.
U.S. Patent US6836758B2 discloses a system and method for speech recognition using multiple speech recognition engines. The method mainly comprises presetting a different weight for the recognition score of each engine so as to maximize the weighted sum for correct recognition results and minimize the weighted sum for incorrect ones, then comparing the weighted recognition scores of the engines, and outputting the recognition result with the best weighted score. In this method, however, if the weights are set inappropriately, the recognition result may be even worse than when any single engine is used alone. It is clearly difficult to set an accurate weight for each engine, and therefore difficult to guarantee that the method performs better than using each engine separately.
U.S. Patent US7149689 discloses a two-engine speech recognition method that uses confusion matrices to compare the recognition scores of two speech recognition engines. In this method, a confusion matrix generated by statistics for each engine is converted into an alternative matrix, and a program loop is established that cross-compares the recognition results of each engine with the alternative matrices, each column of an alternative matrix being sorted from the highest to the lowest probability. In this method, however, the vocabularies of the two engines must be identical; otherwise the confusion matrices have different entries, the alternative vectors cannot be compared, and the program loop cannot find the correct recognition result.
Summary of the invention
An object of the present invention is to provide a speech recognition method and a speech recognition system that can merge recognition results from multiple speech recognition engines simply and with fewer restrictions.
One aspect of the present invention relates to a speech recognition method comprising: a counterpart entry generating step of generating, for each entry in the vocabulary of a first speech recognition engine, a counterpart entry for a second speech recognition engine; a combined vocabulary generating step of adding the generated counterpart entries to the vocabulary of the second speech recognition engine so as to generate a combined vocabulary together with the original entries in the vocabulary of the second speech recognition engine; a first speech recognition step of recognizing input speech with the first speech recognition engine using the vocabulary of the first speech recognition engine; a second speech recognition step of recognizing the input speech with the second speech recognition engine using the combined vocabulary so as to generate recognition results related to the original entries and recognition results related to the counterpart entries; and a recognition result comparing and outputting step of comparing, by using the counterpart-entry recognition results output from the second speech recognition engine, the recognition results of the first speech recognition engine with the original-entry recognition results of the second speech recognition engine and outputting a comparison result. Each recognition result of the first and second speech recognition engines comprises a recognized word and a corresponding recognition score.
Another aspect of the present invention relates to a speech recognition system comprising: a counterpart entry generating device configured to generate, for each entry in the vocabulary of a first speech recognition engine, a counterpart entry for a second speech recognition engine; a combined vocabulary generating device configured to add the generated counterpart entries to the vocabulary of the second speech recognition engine so as to generate a combined vocabulary together with the original entries in the vocabulary of the second speech recognition engine; a first speech recognition device configured to recognize input speech with the first speech recognition engine using the vocabulary of the first speech recognition engine; a second speech recognition device configured to recognize the input speech with the second speech recognition engine using the combined vocabulary so as to generate recognition results related to the original entries and recognition results related to the counterpart entries; and a recognition result comparison and output device configured to compare, by using the counterpart-entry recognition results output from the second speech recognition engine, the recognition results of the first speech recognition engine with the original-entry recognition results of the second speech recognition engine and to output a comparison result. Each recognition result of the first and second speech recognition engines comprises a recognized word and a corresponding recognition score.
Therefore, according to the speech recognition method and speech recognition system of these aspects of the present invention, by using counterpart entries across different speech recognition engines, recognition results from different engines can be compared simply, without setting any weights as in the prior art and without placing any restriction on the vocabularies.
Brief description of the drawings
The above and other objects and advantages of embodiments of the present invention are further described below in conjunction with specific embodiments and with reference to the accompanying drawings, in which identical or corresponding technical features or components are denoted by identical or corresponding reference numerals.
Fig. 1 is a flowchart of a speech recognition method according to an embodiment of the present invention;
Fig. 2 is a block diagram of a process of generating counterpart entries according to an embodiment of the present invention;
Fig. 3 is a block diagram of a process of generating counterpart entries according to another embodiment of the present invention;
Fig. 4 is a flowchart of comparing and outputting recognition results according to an embodiment of the present invention;
Fig. 5 is a flowchart of comparing and outputting recognition results according to another embodiment of the present invention;
Fig. 6 is a block diagram of an exemplary configuration of a speech recognition system according to an embodiment of the present invention; and
Fig. 7 is a block diagram of a hardware configuration of a computer system in which embodiments of the present invention can be implemented.
Detailed description of embodiments
Exemplary embodiments of the present invention are described below in conjunction with the accompanying drawings. For clarity and conciseness, not all features of an actual implementation are described in this specification. It should be understood, however, that many implementation-specific decisions must be made in developing any such embodiment in order to achieve the developer's specific goals, for example compliance with system-related and business-related constraints, and that these constraints may vary from one implementation to another. Moreover, it should be understood that, although such development work might be complex and time-consuming, it would nevertheless be a routine undertaking for those skilled in the art having the benefit of this disclosure.
It should also be noted here that, to avoid obscuring the present invention with unnecessary detail, the drawings show only the processing steps and/or system structures closely related to the solution according to the present invention, and other details of little relevance to the present invention are omitted.
A flowchart of a speech recognition method according to an embodiment of the present invention is first described with reference to Fig. 1.
In step S101, a counterpart entry for the second speech recognition engine is generated for each entry in the vocabulary of the first speech recognition engine.
According to one embodiment of present invention, the first speech recognition engine can be above-mentioned speaker's related voice identification (SD-ASR) engine, its acoustic model is for example set up based on word, for example, realize based on word hidden Markov model (whole-word based HMM); In addition, the acoustic model of SD-ASR also can be set up based on phoneme, for example, realize based on phoneme hidden Markov model (phoneme based HMM).The second speech recognition engine can be above-mentioned irrelevant speech recognition (SI-ASR) engine of speaker, and its acoustic model is set up based on phoneme, for example, realize based on phoneme hidden Markov model.Although enumerated this two kinds of concrete speech recognition engines here, but be to be understood that the first speech recognition engine and the second speech recognition engine in the present invention are not limited to SD-ASR and the SI-ASR for different user, and can be other different phonetic identification engine with similar demand.For example, the first and second speech recognition engines can be for for example English of different types of language and Chinese and the engine of training.Or the first and second speech recognition engines can be for for example quiet environment of different environments for use and noisy environment and the engine of training.In addition, the acoustic model of the first and second speech recognition engines is also not limited to the model realization based on HMM, also can be based on other model realization, for example, based on dynamic time warping (dynamic time wrapping, DTW) model realization, as long as applicable principle of the present invention.In relevant portion below, for convenience of explanation, will be taking the first speech recognition engine as SD-ASR and the second speech recognition engine be described as SI-ASR as example, and referred to as the first engine SD-ASR and the second engine SI-ASR.
Fig. 2 is a block diagram specifically illustrating a process 200 of generating counterpart entries according to an embodiment of the present invention. In this embodiment, the acoustic model of the first engine SD-ASR is built on a word basis.
In 201, the input enrollment speech is a speech sample used to train the first engine SD-ASR, for example the speech of a specific user. The input enrollment speech (i.e., the speech sample) is captured by, for example, a microphone, converted into an analog electrical signal, digitized, and then subjected to spectral analysis to generate a sequence of feature vectors containing characteristic parameters.
In 202, these characteristic parameters can be extracted to represent the input enrollment speech.
The upper half of Fig. 2 (203-204) illustrates that, for the enrollment speech, the acoustic model of the first engine SD-ASR is trained with the feature vectors extracted from the enrollment speech, and the acoustic models obtained by training are used to form the vocabulary of the first engine SD-ASR. Here, because the acoustic model of the first engine SD-ASR is built on a word basis, the training result is an acoustic model of a word. Each entry of the vocabulary of the first engine SD-ASR comprises, for example, an identification number (ID), the word (text) corresponding to the enrollment speech, and the phoneme sequence corresponding to the enrollment speech.
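A minimal sketch of such an entry layout follows; the field names are illustrative assumptions, since the patent does not prescribe a concrete data structure.
```python
# One vocabulary entry: identification number, word text, phoneme sequence.
from dataclasses import dataclass

@dataclass
class VocabEntry:
    entry_id: int        # identification number (ID)
    text: str            # word (text) corresponding to the enrollment speech
    phonemes: list[str]  # phoneme sequence corresponding to the speech

# Example SD-ASR vocabulary with a single enrolled word.
sd_vocabulary = [VocabEntry(1, "テニス", ["t", "e", "n", "i", "s", "u"])]
```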
As the training method, a general training method may be used, for example maximum likelihood (ML) estimation or discriminative training (DT).
In addition, in order to improve recognition efficiency, the training process may be carried out offline in advance. On the other hand, considering that the vocabulary of an SD-ASR engine is usually small, training may also be carried out online.
In parallel with the upper half of Fig. 2, in the lower half (205-206), for each entry in the vocabulary of the first engine SD-ASR, a phoneme segmenter matched to the second engine SI-ASR is used to perform phoneme segmentation on the extracted feature vectors of the speech sample corresponding to that entry (205). The phoneme sequence resulting from the phoneme segmentation is then used to generate a counterpart entry for the second engine SI-ASR (206).
Here, phoneme segmentation means using the phoneme segmenter to re-divide the sequence of feature vectors extracted from the input speech and generate a phoneme sequence; that is, the speech is re-processed by the phoneme segmenter. A typical phoneme segmenter is, for example, a phonetic typewriter. The phoneme segmenter used in the present invention should match the second engine SI-ASR, because the phoneme sequence obtained from the segmentation will be used to form entries of the second engine SI-ASR. That is, the acoustic model of the phoneme segmenter is likewise built on a phoneme basis, e.g., implemented as phoneme-based HMMs, although its grammar rules may differ from those of the second engine SI-ASR; consequently, the phoneme sequences it generates have the same form as the phoneme sequences of the original entries of the second engine SI-ASR, though their content may differ.
According to one embodiment of present invention, if there is more than one speech samples corresponding with this entry, for example can obtain the aligned phoneme sequence as the result of phoneme cutting by following two kinds of modes.Mode is that the each speech samples corresponding with this entry carried out to phoneme cutting to obtain a corresponding aligned phoneme sequence, then generates corresponding multiple corresponding entry by these aligned phoneme sequence.Another kind of mode is that different aligned phoneme sequence that obtain are like this merged into an aligned phoneme sequence, to use this aligned phoneme sequence to become corresponding entry next life.For example, the straightforward procedure that merges different aligned phoneme sequence is voting algorithm or dynamic time warping method.
It should be noted that, as described above, the acoustic model of the second engine SI-ASR is built on a phoneme basis, so for input speech (for example, general training utterances) the training result of the second engine SI-ASR is acoustic models of phonemes. The training method for the acoustic model of the second engine SI-ASR is similar to that of the first engine SD-ASR and is therefore not described again here. Each phoneme sequence obtained by training, together with at least the word it represents (text, such as "open" or "stop") and an identification number (ID), forms an entry of the second engine SI-ASR, and the set of entries forms the vocabulary of the second engine SI-ASR. If the word (text) used to train an acoustic model of the second engine SI-ASR is the same as a word (text) used to train an acoustic model of SD-ASR, its entry corresponds to the entry with the same word in the vocabulary of the first engine SD-ASR.
According to one embodiment of present invention, consider the form of the above-mentioned entry that the training of the second engine SI-ASR obtains, being used as the aligned phoneme sequence of the result of phoneme cutting to generate for the corresponding entry of the second engine SI-ASR specifically comprises: first, be this aligned phoneme sequence set with the vocabulary of the second engine SI-ASR in the unduplicated ID of identification number of entry.Then, obtain the word (text) of this aligned phoneme sequence, for example, from the vocabulary of the first engine SD-ASR, obtain.If certain word (text) in the vocabulary of this word (text) and the second engine SI-ASR is identical, can adds and identify for example prefix or suffix to distinguish.Finally, use ID, word (text) and aligned phoneme sequence to become corresponding entry next life.
Fig. 3 is a block diagram of a process 300 of generating counterpart entries according to another embodiment of the present invention.
The main difference between the processes of Fig. 3 and Fig. 2 is that in the lower half of Fig. 3 (305-307) no phoneme segmenter is used; instead, a letter-to-sound (LTS) conversion device matched to the second engine SI-ASR converts the letters or words of the word (text) corresponding to the enrollment speech into sounds, thereby generating a phoneme sequence, and the generated phoneme sequence is then used to generate the counterpart entry for the second engine SI-ASR.
The letter-to-sound conversion device converts input letters or text into corresponding phonemes according to predefined rules. Such devices are widely used in speech recognition systems for various languages; for example, in various phonetic-notation software, entering characters or letters yields a phonetic transcription as output. Its working principle is a technique known in the art and is not described in further detail here. The letter-to-sound conversion device used in the present invention should also match the second engine SI-ASR; that is, its predefined rules are set such that the phoneme sequences generated by conversion have the same form as the phoneme sequences of the original entries of the second engine SI-ASR, though their content may differ.
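As a rough illustration of such rule-based letter-to-sound conversion, the following sketch uses a toy rule table; real LTS rules are language-specific and far more elaborate, so everything here is an assumption for illustration only.
```python
# Toy letter-to-sound rules: map spelling units to phonemes.
LTS_RULES = {"te": ["t", "e"], "ni": ["n", "i"], "su": ["s", "u"]}

def letter_to_sound(text: str) -> list[str]:
    phonemes, i = [], 0
    while i < len(text):
        for length in (2, 1):  # greedily match the longest known unit
            unit = text[i:i + length]
            if unit in LTS_RULES:
                phonemes += LTS_RULES[unit]
                i += length
                break
        else:
            i += 1  # skip characters not covered by the rules
    return phonemes

print(letter_to_sound("tenisu"))  # -> ['t', 'e', 'n', 'i', 's', 'u']
```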
The detailed process of generating the counterpart entry for the second engine SI-ASR from the generated phoneme sequence may be similar to that described above for Fig. 2 and is not described again here.
It should be noted that, although the description here takes as an example the case where the acoustic model of the second engine SI-ASR is built on a phoneme basis and the generated phoneme sequences are used to generate the counterpart entries, this does not limit the present invention. The acoustic model of the second engine SI-ASR may also be built on a syllable or character basis, in which case the segmentation operation of the phoneme segmenter and the conversion rules of the letter-to-sound conversion device should be adapted accordingly to match the second engine SI-ASR.
Returning to the flowchart of Fig. 1: next, in step S102, the generated counterpart entries are added to the vocabulary of the second engine SI-ASR so as to generate a combined vocabulary together with the original entries in the vocabulary obtained by training the second engine SI-ASR.
As described above, in the combined vocabulary, if a counterpart entry has the same word (text) as an original entry, they can be distinguished by adding an identifier such as a prefix or suffix. Furthermore, an identifier such as a prefix or suffix may also be added to distinguish the original entries from the newly added counterpart entries in the combined vocabulary of the second engine SI-ASR, so as to clearly differentiate original entries and counterpart entries in form.
In addition, because a counterpart entry in the combined vocabulary of the second engine SI-ASR corresponds, as described above, to the entry with the same word in the vocabulary of the first engine SD-ASR, a mapping table may be established to record this correspondence between counterpart entries in the combined vocabulary of the second engine SI-ASR and entries in the vocabulary of the first engine SD-ASR. Of course, the correspondence is not limited to being indicated in this way; for example, it may also be indicated directly in the combined vocabulary by adding an identifier such as a prefix or suffix.
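Continuing the same sketches, the mapping table might be built alongside the combined vocabulary as follows; keying by entry ID is an assumption, and in practice the counterpart phonemes would come from the phoneme segmenter or the letter-to-sound device rather than from the SD-ASR entry itself.
```python
# Build the combined vocabulary and record the counterpart correspondence
# (SI-ASR counterpart entry ID -> SD-ASR entry ID).
si_vocabulary = [VocabEntry(1, "ありがとう",
                            ["a", "r", "i", "g", "a", "t", "o", "o"])]
counterpart_to_sd = {}

for sd_entry in sd_vocabulary:
    cp = make_counterpart(sd_entry.phonemes, sd_entry.text, si_vocabulary)
    si_vocabulary.append(cp)                      # combined vocabulary
    counterpart_to_sd[cp.entry_id] = sd_entry.entry_id
```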
Next, in step S103 of the flowchart of Fig. 1, for arbitrary input speech, the input speech is recognized with the first engine SD-ASR using the vocabulary of the first engine SD-ASR.
Next, in step S104 of the flowchart of Fig. 1, for the same input speech, the input speech is recognized with the second engine SI-ASR using the combined vocabulary, so as to generate recognition results related to the original entries and recognition results related to the counterpart entries.
Each of the above recognition results comprises a recognized word and a corresponding recognition score. Of course, to facilitate other operations, a recognition result may also comprise other items, for example an ID.
It should be understood here that a counterpart-entry recognition result of the second engine SI-ASR corresponds to the recognition result of the first engine SD-ASR having the same word.
Next, in step S105 of the flowchart of Fig. 1, using the counterpart-entry recognition results output from the second engine SI-ASR, the recognition results of the first engine SD-ASR are compared with the original-entry recognition results of the second engine SI-ASR, and a comparison result is output.
Fig. 4 illustrates an embodiment of the comparison and output operations performed in step S105. As shown in Fig. 4, in step 400 speech is input and feature extraction may also be performed. For the current input speech, the first engine SD-ASR outputs recognition results including a 1st-best recognition result, which comprises, for example, the word "テニス" and its corresponding recognition score (step 403). The second engine SI-ASR outputs recognition results including a 1st-best recognition result related to the original entries, which comprises, for example, the word "ありがとう" and its corresponding recognition score (step 401), and also outputs a recognition result related to the counterpart entries, i.e., the word "teniis" and its corresponding recognition score (step 402). The processes of speech input, feature extraction, and the like have been described above and are not repeated here.
Next these recognition results are carried out relatively.In one aspect, whether the word " あ り Ga と う " that judges 1st optimal identification result relevant with original entry of the second engine SI-ASR is the word (step 404) in the vocabulary of the first engine SD-ASR, that is to say, whether be the outer word (OOV, out of vocabulary) of collection of the vocabulary of the first engine SD-ASR.Word outside collection, directly exports the 1st optimal identification result of the first engine SD-ASR, i.e. " テ ニ ス " (step 407).
Here, because the acoustic model of SD-ASR has higher recognition accuracy than the acoustic model of SI-ASR, the confidence of the recognition result of the first engine SD-ASR is considered in this case to be higher than that of the recognition result of the second engine SI-ASR.
In another aspect, if the word "ありがとう" of the 1st-best original-entry recognition result of the second engine SI-ASR is an out-of-vocabulary word with respect to the vocabulary of the first engine SD-ASR, the recognition score of that 1st-best original-entry result of the second engine SI-ASR is compared with the recognition score of the counterpart-entry recognition result of the second engine SI-ASR, that is, the score of "teniis" is compared with the score of "ありがとう" (step 405).
It should be noted here that this counterpart-entry recognition result of the second engine SI-ASR corresponds to the 1st-best recognition result of the first engine SD-ASR.
If the recognition score of this counterpart-entry recognition result, i.e., the score of "teniis", is larger, the 1st-best recognition result of the first engine SD-ASR, i.e., "テニス", is output (step 407); otherwise the 1st-best original-entry recognition result of the second engine SI-ASR, i.e., "ありがとう", is output (step 406).
In the above comparison process, by using the scores of the counterpart entries, the recognition results of the first engine SD-ASR and the second engine SI-ASR are compared and the preferable result is output. As can be seen, the speech recognition method according to this embodiment neither sets any weights for the recognition results nor places any restriction on the vocabularies, and therefore needs no complicated decision logic, making the speech recognition process simpler and more efficient.
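The decision logic of Fig. 4 can be summarized in a short sketch; representing results as (word, score) pairs and all names here are illustrative assumptions.
```python
def compare_and_output(sd_best, si_original_best, si_counterpart_best,
                       sd_vocabulary_words):
    # Step 404: is the SI-ASR original-entry word in the SD-ASR vocabulary?
    if si_original_best[0] in sd_vocabulary_words:
        return sd_best                  # step 407: trust the SD-ASR result
    # OOV case (step 405): compare the two SI-ASR scores, which come from
    # one engine and are therefore directly comparable.
    if si_counterpart_best[1] > si_original_best[1]:
        return sd_best                  # step 407
    return si_original_best             # step 406
```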
In another embodiment, in the case where more than one recognition result needs to be output, the foregoing comparison and output operations for the 1st-best recognition results may simply be continued in subsequent comparisons for the N-best recognition results among all recognition results of the first engine SD-ASR and the N-best recognition results among all original-entry recognition results of the second engine SI-ASR, until a predetermined number of recognition results have been output, where N is a predetermined integer greater than 1 and the predetermined number can be determined according to the requirements of the practical application.
In the case where multiple recognition results need to be output, however, directly continuing to compare the N-best recognition results of the two engines in this way may be flawed. For example, suppose the 1st-best recognition result "テニス" of the first engine SD-ASR was output in the first comparison, i.e., in the operation for the 1st-best results, but the 1st-best result of the second engine SI-ASR is actually better than the 2nd-best result of the first engine SD-ASR. In that case, if the 2nd-best results of the two engines are compared directly in the second comparison, then whichever 2nd-best result is output in that comparison, whether that of the first engine SD-ASR or that of the second engine SI-ASR, it is inferior to outputting the 1st-best result of the second engine SI-ASR as the result of that round.
In view of the above problem, Fig. 5 illustrates another embodiment of the operations performed in step S105.
As shown in the upper half of Fig. 5, for the current input speech the recognition results of the first engine SD-ASR include, in addition to the word "テニス" of the above 1st-best recognition result and its corresponding score, the N-best recognition results, N being an integer greater than 1. For simplicity, only the first three best results "テニス", "コーヒー", and "ニュース" and their corresponding scores are shown here, but N is not limited to this. Similarly, for the recognition results of SI-ASR, the first three best original-entry results "ありがとう", "かい", and "コーヒー" and their corresponding scores are shown, as well as the first three counterpart-entry results "teniis", "koohii", and "niuusu" and their corresponding scores.
These recognition results are then compared. The first comparison (steps 501-504) is identical to the comparison and output operations for the 1st-best recognition results described with reference to Fig. 4 and is not described again here.
A judgment is made after the first comparison and output. On the one hand, if what was output is the 1st-best recognition result D1 of the first engine SD-ASR, the output result D1 is removed from the recognition results of the first engine SD-ASR; and if the original-entry recognition results of the second engine SI-ASR contain a result, say Ax, having the same word as the output result D1, that result Ax having the same word is removed from the original-entry recognition results (step 505).
Otherwise, if the original-entry recognition results of the second engine SI-ASR contain no result having the same word as the output result D1, nothing is removed from the original-entry recognition results.
On the other hand, if what was output in the first round is the 1st-best original-entry recognition result A1 of the second engine SI-ASR, the output result A1 is removed from the original-entry recognition results of the second engine SI-ASR (step 506). Note that nothing needs to be deleted from the recognition results of the first engine SD-ASR in this case, because the very precondition for outputting A1 is that A1 is an out-of-vocabulary word for the first engine SD-ASR, i.e., A1 is not a word in the vocabulary of the first engine SD-ASR, and so A1 naturally does not appear in the recognition results of the first engine SD-ASR either.
For the remaining recognition results of the first engine SD-ASR after removal and the remaining original-entry recognition results of the second engine SI-ASR, the above comparison, output, and removal operations for the 1st-best recognition results are repeated (steps 500-506) until the predetermined number of recognition results have been output (step 507).
In addition, while repeating the comparison, output, and removal operations, if either the remaining recognition results of the first engine SD-ASR or the remaining original-entry recognition results of the second engine SI-ASR become empty, the recognition results in the other set of remaining results are output directly, until the predetermined number of recognition results have been output (step 507).
Further, while repeating the comparison, output, and removal operations, if in some round the counterpart-entry recognition results of the second engine SI-ASR contain no result corresponding to the current recognition result of the first engine SD-ASR being compared, for example no result having the same word and no counterpart-entry word recorded in the mapping table, then of the recognition results of the first engine SD-ASR and the second engine SI-ASR in that comparison, the one with the larger recognition score is output directly; alternatively, the recognition results of both engines are output and the final recognition result is confirmed by the user.
Still further, while repeating the comparison, output, and removal operations, if the remaining recognition results of the first engine SD-ASR and the remaining original-entry recognition results of the second engine SI-ASR are both empty (step 508), or the predetermined number of recognition results have been output (step 507), step S105 stops and the recognition operation for the current input speech ends (step 509).
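Under the same assumptions, the repeated compare/output/remove procedure of Fig. 5 might be sketched as follows, reusing compare_and_output from the sketch above and assuming the counterpart list is kept aligned with the SD-ASR results.
```python
def output_n_best(sd_results, si_original, si_counterpart,
                  sd_vocabulary_words, count):
    outputs = []
    while len(outputs) < count and (sd_results or si_original):
        if not sd_results:                 # one side empty: flush the other
            outputs.append(si_original.pop(0))
            continue
        if not si_original:
            outputs.append(sd_results.pop(0))
            continue
        chosen = compare_and_output(sd_results[0], si_original[0],
                                    si_counterpart[0], sd_vocabulary_words)
        outputs.append(chosen)
        if chosen is sd_results[0]:        # step 505: remove from SD side
            sd_results.pop(0)
            si_counterpart.pop(0)          # keep counterparts aligned
            si_original[:] = [r for r in si_original if r[0] != chosen[0]]
        else:                              # step 506: remove from SI side
            si_original.pop(0)
    return outputs
```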
The speech recognition method according to this embodiment likewise neither sets any weights for the recognition results nor places any restriction on the vocabularies, and therefore needs no complicated decision logic, making the speech recognition process simpler and more efficient. Moreover, when multiple recognition results are needed, by removing from the recognition results of one engine the result output in each comparison, in some cases also removing from the recognition results of the other engine the result having the same word as the output result, and then performing on the remaining results of both engines operations similar to the comparison, output, and removal operations based on the 1st-best results, the required recognition results can be obtained more accurately.
Next, an exemplary configuration of a speech recognition system according to an embodiment of the present invention is described with reference to the block diagram of Fig. 6. The speech recognition system comprises a counterpart entry generating device 601, a combined vocabulary generating device 602, a first speech recognition device 603, a second speech recognition device 604, and a recognition result comparison and output device 605.
The counterpart entry generating device 601 is coupled to the vocabulary 6031 of the first speech recognition engine and generates, for each entry in the vocabulary 6031 of the first speech recognition engine, a counterpart entry for the second speech recognition engine.
The combined vocabulary generating device 602 is coupled to the counterpart entry generating device 601, receives the counterpart entries generated by the counterpart entry generating device 601, and adds the generated counterpart entries to the vocabulary of the second speech recognition engine so as to generate a combined vocabulary 6041 together with the original entries in the vocabulary of the second speech recognition engine.
The first speech recognition device 603 receives input speech 600 and recognizes the input speech 600 with the first speech recognition engine using the vocabulary 6031 of the first speech recognition engine.
The second speech recognition device 604 receives the same input speech 600 and recognizes the input speech 600 with the second speech recognition engine using the combined vocabulary 6041, so as to generate recognition results related to the original entries and recognition results related to the counterpart entries.
The recognition result comparison and output device 605 is coupled to the first speech recognition device 603 and the second speech recognition device 604, receives the recognition results of the first speech recognition device 603 and the second speech recognition device 604, and, using the counterpart-entry recognition results output from the second speech recognition device 604, compares the recognition results of the first speech recognition device 603 with the original-entry recognition results of the second speech recognition device 604 and outputs a comparison result.
Each recognition result of the first speech recognition device 603 and the second speech recognition device 604 comprises a recognized word and a corresponding recognition score.
In operation, first, the counterpart entry generating device 601 and the combined vocabulary generating device 602 can generate the counterpart entries and generate the combined vocabulary 6041 during the training phase.
Next, during the recognition phase, the first speech recognition device 603 and the second speech recognition device 604 can receive the same input speech 600 and recognize the input speech 600 using their respective vocabularies to generate their respective recognition results. Finally, the recognition result comparison and output device 605 can compare the recognition results and output the required comparison result as described for the speech recognition method above.
The devices described above are exemplary and/or preferred devices for implementing the processing described in this disclosure. These devices may be hardware units (such as field-programmable gate arrays, digital signal processors, application-specific integrated circuits, computers, etc.) and/or software devices (such as computer-readable programs). The devices for implementing the individual steps are not exhaustively described above. However, whenever there is a step of performing a certain processing, there may be a corresponding device (implemented by hardware and/or software) for implementing the same processing. Technical solutions defined by all combinations of the described steps and the devices corresponding to these steps are included in the disclosure of the present application, as long as the technical solutions they constitute are complete and applicable.
Fig. 7 is a block diagram showing a hardware configuration of a computer system in which embodiments of the present invention can be implemented.
As shown in Fig. 7, the computer system comprises a central processing unit (CPU) 701, a read-only memory (ROM) 702, a random access memory (RAM) 703, and an input/output interface 705 connected via a system bus 704, as well as an input unit 706, an output unit 707, a storage unit 708, a communication unit 709, and a drive 710 connected via the input/output interface 705. A program may be pre-recorded in the ROM 702 or the storage unit 708, i.e., recording media built into the computer. Alternatively, the program may be stored (recorded) in a removable medium 711. Herein, the removable medium 711 includes, for example, a floppy disk, a CD-ROM (compact disc read-only memory), an MO (magneto-optical) disc, a DVD (digital versatile disc), a magnetic disk, a semiconductor memory, and the like.
The input unit 706 is configured with a keyboard, a mouse, a microphone, and the like. The output unit 707 is configured with an LCD (liquid crystal display), a loudspeaker, and the like.
In addition to the configuration in which the program is installed on the computer from the above-mentioned removable medium 711 through the drive 710, the program may be downloaded to the computer through a communication network or a broadcast network to be installed in the built-in storage unit 708. In other words, the program may be transmitted to the computer, for example, wirelessly via a satellite for digital satellite broadcasting from a download site, or by wire through a network such as a LAN (local area network) or the Internet.
When a command is input to the computer system via the input/output interface 705, for example by a user operating the input unit 706, the CPU 701 executes the program stored in the ROM 702 according to the command. Alternatively, the CPU 701 loads the program stored in the storage unit 708 into the RAM 703 and executes it.
The CPU 701 thereby performs the processing according to the above-described flowcharts or the processing performed by the configurations of the above-described block diagrams. Then, as needed, the CPU 701 allows the processing results, for example, to be output from the output unit 707 via the input/output interface 705, transmitted from the communication unit 709, recorded in the storage unit 708, and so on.
Moreover, the program may be executed by one computer (processor), or may be processed by multiple computers in a distributed manner. Furthermore, the program may be transferred to a remote computer for execution.
The computer system shown in Fig. 7 is merely illustrative and is in no way intended to limit the invention, its application, or its uses.
The computer system shown in Fig. 7 may be incorporated in any embodiment, may be used as a stand-alone computer or as a processing system within a device, and one or more unnecessary components may be removed from it or one or more additional components may be added to it.
The method and system of the present invention can be implemented in many ways. For example, the method and system of the present invention can be implemented by software, hardware, firmware, or any combination thereof. The above-described order of the method steps is merely illustrative, and the method steps of the present invention are not limited to the order specifically described above unless otherwise explicitly stated. Furthermore, in some embodiments the present invention may also be implemented as a program recorded in a recording medium, including machine-readable instructions for implementing the method according to the present invention. Thus, the present invention also covers a recording medium storing a program for implementing the method according to the present invention.
Although some specific embodiments of the present invention have been described in detail by way of example, it should be understood by those skilled in the art that the above examples are merely illustrative and do not limit the scope of the present invention. It should be appreciated by those skilled in the art that the above embodiments can be modified without departing from the scope and spirit of the present invention. The scope of the present invention is defined by the appended claims.

Claims (19)

1. A speech recognition method, comprising:
a counterpart entry generating step of generating, for each entry in the vocabulary of a first speech recognition engine, a counterpart entry for a second speech recognition engine;
a combined vocabulary generating step of adding the generated counterpart entries to the vocabulary of the second speech recognition engine so as to generate a combined vocabulary together with the original entries in the vocabulary of the second speech recognition engine;
a first speech recognition step of recognizing input speech with the first speech recognition engine using the vocabulary of the first speech recognition engine;
a second speech recognition step of recognizing the input speech with the second speech recognition engine using the combined vocabulary so as to generate recognition results related to the original entries and recognition results related to the counterpart entries; and
a recognition result comparing and outputting step of comparing, by using the counterpart-entry recognition results output from the second speech recognition engine, the recognition results of the first speech recognition engine with the original-entry recognition results of the second speech recognition engine, and outputting a comparison result;
wherein each recognition result of the first speech recognition engine and the second speech recognition engine comprises a recognized word and a corresponding recognition score.
2. The speech recognition method according to claim 1, wherein, for each entry in the vocabulary of the first speech recognition engine, the counterpart entry generating step comprises:
obtaining a speech sample of the entry;
performing phoneme segmentation on the speech sample with a phoneme segmenter matched to the second speech recognition engine; and
generating the counterpart entry from the phoneme sequence resulting from the phoneme segmentation.
3. The speech recognition method according to claim 1, wherein, for each entry in the vocabulary of the first speech recognition engine, the counterpart entry generating step comprises:
obtaining the text of the entry;
generating a phoneme sequence for the text of the entry with a letter-to-sound conversion device matched to the second speech recognition engine; and
generating the counterpart entry from the generated phoneme sequence.
4. The speech recognition method according to any one of claims 1-3, wherein the recognition result comparing and outputting step comprises:
if the word of the 1st-best original-entry recognition result of the second speech recognition engine is a word in the vocabulary of the first speech recognition engine, directly outputting the 1st-best recognition result of the first speech recognition engine.
5. The speech recognition method according to claim 4, wherein the recognition result comparing and outputting step further comprises:
if the word of the 1st-best original-entry recognition result of the second speech recognition engine is not a word in the vocabulary of the first speech recognition engine, comparing the recognition score of the 1st-best original-entry recognition result of the second speech recognition engine with the recognition score of the counterpart-entry recognition result of the second speech recognition engine that corresponds to the 1st-best recognition result of the first speech recognition engine; and
if the recognition score of the counterpart-entry recognition result is larger, outputting the 1st-best recognition result of the first speech recognition engine, and otherwise outputting the 1st-best original-entry recognition result of the second speech recognition engine.
6. The speech recognition method according to claim 5, wherein the recognition result comparing and outputting step further comprises:
for the N-best recognition results of the first speech recognition engine and the N-best original-entry recognition results of the second speech recognition engine, repeating the aforementioned comparison and output operations for the 1st-best recognition results until a predetermined number of recognition results have been output, where N is a predetermined integer greater than 1.
7. The speech recognition method according to claim 5, wherein the recognition result comparing and outputting step further comprises:
in the case where the 1st-best recognition result of the first speech recognition engine is output, removing the output recognition result from the recognition results of the first speech recognition engine, and, if the original-entry recognition results of the second speech recognition engine contain a recognition result having the same word as the output recognition result, removing that recognition result having the same word from the original-entry recognition results;
in the case where the 1st-best original-entry recognition result of the second speech recognition engine is output, removing the output recognition result from the original-entry recognition results of the second speech recognition engine; and
for the remaining recognition results of the first speech recognition engine and the remaining original-entry recognition results of the second speech recognition engine, repeating the aforementioned comparison, output, and removal operations for the 1st-best recognition results until the predetermined number of recognition results have been output.
8. The speech recognition method according to claim 7, wherein the recognition result comparing and outputting step further comprises:
if either the remaining recognition results of the first speech recognition engine or the remaining original-entry recognition results of the second speech recognition engine are empty, directly outputting the recognition results among the other remaining recognition results, until the predetermined number of recognition results have been output.
9. The speech recognition method according to claim 7, wherein the recognition result comparing and outputting step further comprises:
if the remaining recognition results of the first speech recognition engine and the remaining original-entry recognition results of the second speech recognition engine are both empty, stopping the execution of the recognition result comparing and outputting step.
10. The speech recognition method according to any one of claims 1-3, wherein the acoustic model of the second speech recognition engine is built on a phoneme basis.
11. A speech recognition system, comprising:
a counterpart entry generating device configured to generate, for each entry in the vocabulary of a first speech recognition engine, a counterpart entry for a second speech recognition engine;
a combined vocabulary generating device configured to add the generated counterpart entries to the vocabulary of the second speech recognition engine so as to generate a combined vocabulary together with the original entries in the vocabulary of the second speech recognition engine;
a first speech recognition device configured to recognize input speech with the first speech recognition engine using the vocabulary of the first speech recognition engine;
a second speech recognition device configured to recognize the input speech with the second speech recognition engine using the combined vocabulary so as to generate recognition results related to the original entries and recognition results related to the counterpart entries; and
a recognition result comparison and output device configured to compare, by using the counterpart-entry recognition results output from the second speech recognition engine, the recognition results of the first speech recognition engine with the original-entry recognition results of the second speech recognition engine, and to output a comparison result;
wherein each recognition result of the first speech recognition engine and the second speech recognition engine comprises a recognized word and a corresponding recognition score.
12. The speech recognition system according to claim 11, wherein the counterpart entry generating device comprises:
a device configured to obtain a speech sample of each entry in the vocabulary of the first speech recognition engine;
a device configured to perform phoneme segmentation on the speech sample using a phoneme segmenter matched to the second speech recognition engine; and
a device configured to generate the counterpart entry from the phoneme sequence resulting from the phoneme segmentation.
13. The speech recognition system according to claim 11, wherein the counterpart entry generating device comprises:
a device configured to obtain the text of each entry in the vocabulary of the first speech recognition engine;
a device configured to generate a phoneme sequence for the text of the entry using a letter-to-sound conversion device matched to the second speech recognition engine; and
a device configured to generate the counterpart entry from the generated phoneme sequence.
14. The speech recognition system according to any one of claims 11-13, wherein the recognition result comparison and output device comprises:
a device configured to, if the word of the 1st-best original-entry recognition result of the second speech recognition engine is a word in the vocabulary of the first speech recognition engine, directly output the 1st-best recognition result of the first speech recognition engine.
15. The speech recognition system according to claim 14, wherein the recognition result comparison and output device further comprises:
a device configured, if the word of the 1st-best recognition result of the second speech recognition engine related to the original entries is not a word in the vocabulary of the first speech recognition engine, to compare the recognition score of that 1st-best recognition result with the recognition score of the recognition result of the second speech recognition engine related to the corresponding entry that corresponds to the 1st-best recognition result of the first speech recognition engine; and
a device configured to output the 1st-best recognition result of the first speech recognition engine if the recognition score of the recognition result related to the corresponding entry is larger, and otherwise to output the 1st-best recognition result of the second speech recognition engine related to the original entries.
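The decision rule of claims 14 and 15 can be sketched as below. Here `corr_scores` is an assumed precomputed lookup from each engine 1 word to the score engine 2 assigned to that word's corresponding entry, and results are (word, score) tuples sorted best first; these conventions are illustrative, not taken from the patent.

```python
def pick_first_best(results1, results2_original, corr_scores, vocab1):
    """Sketch of claims 14-15; returns the single result to output."""
    word2, score2 = results2_original[0]
    if word2 in vocab1:
        # Claim 14: engine 2's best original-entry word is also an
        # engine 1 word, so output engine 1's best result directly.
        return results1[0]
    word1, score1 = results1[0]
    # Claim 15: compare engine 2's best original-entry score against
    # the score engine 2 gave to the corresponding entry of engine 1's
    # best word; a larger corresponding-entry score favours engine 1.
    if corr_scores.get(word1, float("-inf")) > score2:
        return (word1, score1)
    return (word2, score2)
```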
16. The speech recognition system according to claim 15, wherein the recognition result comparison and output device further comprises:
a device configured to repeat, for the N-best recognition results of the first speech recognition engine and the N-best recognition results of the second speech recognition engine related to the original entries, the comparison and output operations performed for the 1st-best recognition result until a predetermined number of recognition results have been output, where N is a predetermined integer greater than 1.
17. The speech recognition system according to claim 15, wherein the recognition result comparison and output device further comprises:
a device configured, in the case where the 1st-best recognition result of the first speech recognition engine is output, to remove the output recognition result from the recognition results of the first speech recognition engine, and, if the recognition results of the second speech recognition engine related to the original entries contain a recognition result having the same word as the output recognition result, to remove that recognition result from the recognition results related to the original entries;
a device configured, in the case where the 1st-best recognition result of the second speech recognition engine related to the original entries is output, to remove the output recognition result from the recognition results of the second speech recognition engine related to the original entries; and
a device configured to repeat, for the remaining recognition results of the first speech recognition engine and the remaining recognition results of the second speech recognition engine related to the original entries, the comparison, output, and removal operations performed for the 1st-best recognition result until a predetermined number of recognition results have been output.
18. The speech recognition system according to claim 17, wherein the recognition result comparison and output device further comprises:
a device configured, if either the remaining recognition results of the first speech recognition engine or the remaining recognition results of the second speech recognition engine related to the original entries are empty, to directly output recognition results from the other, non-empty set of remaining recognition results until a predetermined number of recognition results have been output.
19. The speech recognition system according to claim 17, wherein the recognition result comparison and output device further comprises:
a device configured, if the remaining recognition results of the first speech recognition engine and the remaining recognition results of the second speech recognition engine related to the original entries are both empty, to cause the recognition result comparison and output device to stop execution.
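Claims 16-19 together define an N-best merging loop around the claim 14/15 rule. A sketch under the same assumed conventions, reusing the hypothetical `pick_first_best` above:

```python
def merge_n_best(results1, results2_original, corr_scores, vocab1, k):
    """Sketch of claims 16-19: repeat the claim 14/15 comparison on the
    remaining N-best lists, removing each output result, until k results
    have been output (claims 16-17), draining the other list when one
    side is empty (claim 18) and stopping when both are (claim 19)."""
    r1, r2 = list(results1), list(results2_original)
    output = []
    while len(output) < k:
        if not r1 and not r2:
            break                                  # claim 19: both empty
        if not r1 or not r2:
            rest = r1 or r2                        # claim 18: drain the
            output.extend(rest[:k - len(output)])  # non-empty side
            break
        chosen = pick_first_best(r1, r2, corr_scores, vocab1)
        output.append(chosen)
        if chosen == r1[0]:
            r1.pop(0)                              # claim 17: remove it...
            # ...and any same-word result on engine 2's original list
            r2 = [r for r in r2 if r[0] != chosen[0]]
        else:
            r2.pop(0)
    return output
```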
CN201310163355.8A 2013-05-07 2013-05-07 Voice recognizing method and voice recognizing system Pending CN104143330A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310163355.8A CN104143330A (en) 2013-05-07 2013-05-07 Voice recognizing method and voice recognizing system

Publications (1)

Publication Number Publication Date
CN104143330A (en) 2014-11-12

Family

ID=51852488

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310163355.8A Pending CN104143330A (en) 2013-05-07 2013-05-07 Voice recognizing method and voice recognizing system

Country Status (1)

Country Link
CN (1) CN104143330A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1264888A * 1998-12-17 2000-08-30 Sony International (Europe) GmbH Semi-supervised speaker adaptation
CN1454380A * 2000-07-18 2003-11-05 Qualcomm Incorporated System and method for voice recognition with a plurality of voice recognition engines
CN1454381A * 2000-09-08 2003-11-05 Qualcomm Incorporated Combining DTW and HMM in speaker dependent and independent modes for speech recognition
US6836758B2 * 2001-01-09 2004-12-28 Qualcomm Incorporated System and method for hybrid voice recognition
CN1633679A * 2001-12-29 2005-06-29 Motorola Inc. Method and apparatus for multi-level distributed speech recognition
US7149689B2 * 2003-01-30 2006-12-12 Hewlett-Packard Development Company, L.P. Two-engine speech recognition
CN1856820A * 2003-07-28 2006-11-01 Siemens AG Speech recognition method, and communication device
US20100169094A1 (en) * 2008-12-25 2010-07-01 Kabushiki Kaisha Toshiba Speaker adaptation apparatus and program thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JIEMING ZHU et al.: "Developing a voice control system for Zigbee-based home automation networks", 2010 2nd IEEE International Conference on *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105632487A * 2015-12-31 2016-06-01 Beijing QIYI Century Science & Technology Co., Ltd. Voice recognition method and device
CN105632487B * 2015-12-31 2020-04-21 Beijing QIYI Century Science & Technology Co., Ltd. Voice recognition method and device
CN109923608A * 2016-11-17 2019-06-21 Robert Bosch GmbH System and method for ranking hybrid speech recognition results using neural networks
CN108630200A * 2017-03-17 2018-10-09 Kabushiki Kaisha Toshiba Voice keyword detection device and voice keyword detection method
CN108630200B * 2017-03-17 2022-01-07 Kabushiki Kaisha Toshiba Voice keyword detection device and voice keyword detection method
CN109273000A * 2018-10-11 2019-01-25 Henan Institute of Technology Speech recognition method

Similar Documents

Publication Publication Date Title
CN102723080B (en) Voice recognition test system and voice recognition test method
US6910012B2 (en) Method and system for speech recognition using phonetically similar word alternatives
US20190005961A1 (en) Method and device for processing voice message, terminal and storage medium
US8498857B2 (en) System and method for rapid prototyping of existing speech recognition solutions in different languages
US10176809B1 (en) Customized compression and decompression of audio data
JP6556575B2 (en) Audio processing apparatus, audio processing method, and audio processing program
US10650810B2 (en) Determining phonetic relationships
CN113924619A (en) Large-scale multi-language speech recognition through streaming end-to-end model
JP2017062475A (en) Systems and methods for name pronunciation
JP5062171B2 (en) Speech recognition system, speech recognition method, and speech recognition program
CN101154380B (en) Method and device for registration and validation of speaker's authentication
WO2017061027A1 (en) Language model generation device, language model generation method and program therefor, voice recognition device, and voice recognition method and program therefor
JP7485858B2 (en) Speech individuation and association training using real-world noise
US6990445B2 (en) System and method for speech recognition and transcription
JP6806662B2 (en) Speech synthesis system, statistical model generator, speech synthesizer, speech synthesis method
CN104143330A (en) Voice recognizing method and voice recognizing system
Le et al. G2G: TTS-driven pronunciation learning for graphemic hybrid ASR
US7302381B2 (en) Specifying arbitrary words in rule-based grammars
JP4996156B2 (en) Audio signal converter
EP1899955A1 (en) Speech dialog method and system
KR101945190B1 (en) Voice recognition operating system and method
JP2004348552A (en) Voice document search device, method, and program
CN110895938B (en) Voice correction system and voice correction method
JP6179884B2 (en) WFST creation device, speech recognition device, speech translation device, WFST creation method, and program
CN112071299A (en) Neural network model training method, audio generation method and device and electronic equipment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned
AD01 Patent right deemed abandoned

Effective date of abandoning: 20180202