CN1150452C

CN1150452C - Speech recognition correction for equipment wiht limited or no displays

Info

Publication number: CN1150452C
Application number: CNB011217235A
Authority: CN
Inventors: B・E・巴拉德; B·E·巴拉德; 刘易斯; J·R·刘易斯; 奥特加; K·A·奥特加; 范布斯柯克; R·E·范布斯柯克; 王慧芳
Original assignee: International Business Machines Corp
Current assignee: Nuance Communications Inc
Priority date: 2000-07-05
Filing date: 2001-07-04
Publication date: 2004-05-19
Anticipated expiration: 2021-07-04
Also published as: CN1356628A; US7200555B1; EP1170726A1

Abstract

A novel apparatus and method for correcting speech recognized text, the method comprising the following steps: Audio speech input can be received and speech- to-text converted to speech recognized text; a first speech correction command for performing a correction operation on speech recognized text stored in a text buffer can be detected in the speech recognized text; if a speech correction command is not detected in the speech recognized text, the speech recognized text can be added to the text buffer; if a speech command is detected in the speech recognized text, the detected correction speech command can be performed on speech recognized text stored in the text buffer.

Description

The speech recognition correction method and apparatus

Technical field

The present invention relates to the speech recognition computer application program, relate more specifically to such as the apparatus and method of having only the positive text string of the prevailing environment lieutenant colonel of voice by phone dictation messages etc.

Background technology

Best, when the author prepared electronic information for the recipient who expects, this author enjoyed use standard qwerty keyboard and the intrinsic whole convenience of visual surveillance device.Particularly, keyboard makes things convenient for the efficient input of electronic information and author that the visual surveillance device provides visual feedback to make electronic information can guarantee electronic information quilt record correctly before transmission.Yet the author can not use keyboard or monitor often efficiently.For example, in the situation of truck-mounted computer, when steering vehicle, may take author's hand and eye, thereby can not utilize the standard qwerty keyboard.

Similarly, when using " wearable computer ", qwerty keyboard might can not be utilized.Wearable computer comprises battery powered band in spokesman's computer system on one's body, for example with on spokesman's waistband, knapsack, underwear etc.Wearable computer is for automobile and mainly is to design with the computer operation of hand that wearable computer comprises the display of wearing usually and has the device that receives with the processed voice input.Yet wearable computer does not comprise complete exercisable qwerty keyboard usually.

At last, in the use of cellular phone, pager, personal digital assistant or other portable computing device, can not utilize traditional alphanumeric keyboard.Particularly, the author may want to write electronic information with portable computing device, even it does not comprise qwerty keyboard.The recipient that the example of this occasion is included as expection makes pager message or dictation for the information of using on standardized tabular such as the order of doing shopping such as shipping tag or business to business.

Yet the voice signal that modern speech recognition application programming interface can utilize computing machine that microphone is received converts the data set that can work to and help that need not qwerty keyboard.Subsequently, this data set can be used in other computer program miscellaneous widely, wherein include file preparation, data input, order and control, information receiving and transmitting are got in touch and other program is used.Thereby speech recognition is to be suitable for very much not having keyboard to import the technology of using in the equipment of the superiority of feeding back with monitor.

Yet, because most speakers' pronunciation miscellaneous, individual stress and various phonetic feature, even effective speech recognition still can be a difficult problem in traditional calculating.Neighbourhood noise also often makes speech recognition process complicated, because computing machine may be attempted ground unrest identification and be construed to voice.Like this, speech recognition system forces the speaker to miss the correction of recognizing voice through regular meeting's mistake recognizing voice input.

Usually, in such as traditional computers such as desktop PC, can under the assistance of visual displays and keyboard, miss the correction of the voice of identification.Yet, also be complicated even the correction of the mistake recognizing voice in having only equipment limited or that do not have display can not be worked.As a result, exist demand to the bearing calibration of the speech recognition application programming interface in having equipment limited or that do not have display, operated.This system is being used for having only the speech recognition system of dictating Email, phone and other message on the equipment limited or that do not have display channel to have special purposes.

Summary of the invention

Provide a kind of and provide the method and apparatus of speech recognition correction for having equipment limited or that do not have display channel.The most handy machine-readable storage mechanism that stores computer program above it of this method realizes that this method comprises the steps.At first receive the audio speech input and convert its speech-to-text to the speech recognition text.The second, detecting the first voice corrective command of carrying out correct operation on the speech recognition text be used for being stored in text buffer on the speech recognition text.The 3rd, if in this speech recognition text, do not detect the voice corrective command, just the speech recognition text can be added in the text buffer.The 4th, if in the speech recognition text, detect voice command, just can carry out this detected correction voice command on the speech recognition text in being stored in text buffer.

In addition, receiving step can further comprise the speech-to-text conversion that can confirm the speech recognition text with listening.Listened to the ground of speech recognition text confirm the step of speech-to-text conversion can comprise can play the speech recognition text that writes down with listening in case the speech recognition text that can determine to be write down whether in switch process by wrong identification.The first voice corrective command can be indicated and be wished the terminated speech bearing calibration.Detect the first voice corrective command of this type in the voice responsive identification text, whether the speech recognition text that can determine to be stored in the text buffer combines into syllables out.If the speech recognition text that is stored in the text buffer combines into syllables out, just the speech recognition text can be added in the speech recognition vocabulary table of the discernible word of voice.But terminated speech bearing calibration after this.

The first voice corrective command can further be indicated the text of wishing to proofread and correct mistake identification in the text buffer.Response detects the first voice corrective command of this type in the speech recognition text, can play voice with listening and proofread and correct candidate list, and candidate proofreaied and correct in each voice in wherein should showing is to alternative identification text on the statistics of audio speech input.Subsequently, select among the candidate a kind of proofreaied and correct in the voice that can receive in the table; And can proofread and correct the text that the candidate replaces the mistake identification in the text buffer with the voice of selecting.

Replace receiving and select, can receive the text that the second voice corrective command is represented preferably to replace text and wish to replace with preferable replacement text mistake identification in text buffer.Response receives this second voice corrective command, can discern text with the mistake that preferable replacement text is replaced in the text buffer.In addition, the second voice corrective command can be indicated the text of wishing to replace with the replacement text that combines into syllables out the mistake identification in the text buffer.Respond the reception of this second voice corrective command, can accept the replacement text that can combine into syllables out with listening, the replacement text that can combine into syllables out comprises a series of mouthfuls of alphanumeric characters of saying with listening.This alphanumeric character series of saying is flatly carried out the speech-to-text conversion and the alphanumeric character after each speech-to-text conversion is stored in the temporary buffer.Alphanumeric character set after the speech-to-text conversion can be closed in the replacement text that into combines into syllables out, and replace the text of the mistake identification in the text buffer with the replacement text that combines into syllables out.In preferred embodiment, before accepting the replacement text that can combine into syllables out, can play the instruction group of the pre-stored that is used to provide the replacement text that combines into syllables out with listening with listening.

Particularly, can in the replacement text that can combine into syllables out, detect the 3rd voice corrective command with listening.The 3rd voice corrective command can be indicated and be wished that deletion is stored in the particular letter numerical character in the temporary buffer, and response detects this 3rd voice corrective command, can delete this particular letter numerical character from temporary buffer.In addition, the 3rd voice corrective command can be indicated and be wished preferably to replace alphanumeric character and replace specific alphanumeric character with this preferable replacement alphanumeric character in the temporary buffer.Response detects this 3rd voice corrective command, and the preferable alphanumeric character in the available temporary buffer is replaced this particular letter numerical character.

According to another aspect of the present invention, provide a kind of speech recognition application programming interface voice means for correcting of the text of identification by mistake that is used for proofreading and correct, comprise

Receive the audio speech input and the audio speech of described reception is imported the device that speech-to-text converts the speech recognition text to;

In described speech recognition text, detect the device of carrying out the first voice corrective command of correct operation on the speech recognition text be used for being stored in text buffer;

If in described speech recognition text, do not detect the first voice corrective command, just described speech recognition text is increased to device in the described text buffer; And

If in described speech recognition text, detect the first voice corrective command, just carry out the device of described detected voice corrective command on the speech recognition text in being stored in described text buffer.

From the following description, above-mentioned and other purpose, advantage and characteristic of the present invention will be conspicuous.With reference to constituting its a part of accompanying drawing, wherein show preferred embodiment of the present invention in the instructions with by way of example.Yet this embodiment might not represent gamut of the present invention, thereby for scope of the present invention correctly is described, also must be with reference to the claim here.

Description of drawings

Fig. 1 illustrates the computer installation that can be used to put into practice method of the present invention;

Fig. 2 is the block diagram that illustrates for the typical Advanced Computer Architecture of using in the computer installation of Fig. 1;

Fig. 3 is for showing the process flow diagram of dictating the method for text body according to the present invention;

Fig. 4 is used to realize the process flow diagram of the method that the STOP (stopping) of Fig. 3 orders for displaying;

Fig. 5 is used to realize the process flow diagram of the method that the CORRECT (correction) of Fig. 3 orders for displaying; And

Fig. 6 is used to realize the process flow diagram of the method that the SPELL (combining into syllables) of Fig. 5 orders for displaying.

Embodiment

The present invention is the apparatus and method of proofreading and correct the voice of the mistake identification in the speech recognition application programming interface of operating in having computer equipment limited or that do not have display.In order to compensate the input of limited keyboard and display fan-out capability of computer equipment, method of the present invention can provide audible feedback to the speaker so that speaker identification's mistake identification error.In addition, method of the present invention can be provided for proofreading and correct the voice command and the control function of mistake identification.This function comprises " deletion " and " replacement " voice command.Moreover this function comprises " combining word into syllables " function that the accurate spelling that is used for word that will mistake identification offers speech recognition application programming interface.

Fig. 1 illustrates that to can be used to put into practice having of method of the present invention limited or do not have a computer equipment 10 of display.Computer equipment 10 can be embedded in the means of transport, for example computer equipment can be included in the means of transport navigational system.In addition, computer equipment 10 can be used as the part of portable computing device or wearable computer.At last, computer equipment 10 can be included in the telephone system.Still, the invention is not restricted to form or use according to computer equipment 10.But spirit of the present invention and scope comprise and have all computer equipments limited or that do not have display and use causes computer equipment limited or that do not have display.

Computer equipment 10 preferably comprises CPU (central processing unit) (CPU) 12, such as internal memory device such as random-access memory (ram) 14, and such as fixing storage medium 16 such as internal storage or hard disk drive.Storage operating system 18 and can be used for putting into practice the speech recognition application programming interface 20 of method of the present invention in the fixing storage medium 16.Computer phoneme frequency circuit (CAC) 28 also is also it being included in the computer equipment 10 so that provide the Audio Processing ability to computer equipment 10 of wanting.Can be provided with like this, just that both receive for audio input signal of handling and the audio output signal that provides machine voicefrequency circuit 28 as calculated to handle such as voice input devices such as microphones and such as audio output devices such as loudspeakers in computer phoneme frequency circuit 28.Particularly, when computer equipment 10 during, voice input device 6 and audio output device 8 can be included in the speaker and be used in the telephone receiver with telephone system communications as telephone system a part of.

Alternatively, computer equipment 10 can add comprise for be operatively coupled on computer equipment 10 reciprocation purposes on it the keyboard (not shown) with such as at least one speaker interface display units such as VDT (not shown).Yet, present invention is not limited in this respect, and according to this inventive apparatus, for operational computations machine 10 suitably neither needs keyboard also not need VDT.In fact, method of the present invention is intended to provide the voice calibration capability to having equipment limited or that do not have display and keyboard.Thereby in preferred embodiment, computer equipment 10 does not comprise keyboard or VDT.

Fig. 2 illustrates the preferable architecture of the computer equipment 10 of Fig. 1., operating system 18 can be stored in the read-only storage 16 as shown in both as Fig. 1 and 2.Operating system 18 is preferably such as the embedded OSs such as Vxwork  of QNX Neutrino  or Wind River system.Yet operating system 18 is not limited to these, and the present invention also can be used for the computer operating system of any other type, for example can be from Washington, and Windows CE  that the Microsoft of Redmond buys or Windows NT .

In addition, speech recognition application programming interface 20 can be stored in the read-only storage 16.Best, speech recognition application programming interface 20 comprises the voice correct application program 26 that speech recognition engine 22, speech synthesis engine 24 reach according to this inventive apparatus.Though the application program of separating of speech recognition application programming interface 20 has been shown among Fig. 2, the present invention also is not intended to be limited to like this and does, and these various application programs can realize as more complicated single application program equally, during the guiding of computer equipment 10, operating system 18 can be loaded in the internal memory device 14 and also be carried out.Subsequently, operating system 18 can load speech recognition application programming interface 20 of the present invention advance in the internal memory device 14.During loading, speech recognition application programming interface 20 can be carried out in internal memory device 14.Particularly, speech recognition application programming interface 20 can comprise a plurality of code segments that are used to carry out speech recognition, phonetic synthesis and bearing calibration of the present invention, and each code segment can comprise the executable instruction of CPU12.During execution, in order to carry out method of the present invention, CPU12 can load and carry out the instruction that is included in the speech recognition application programming interface.

In the operation, receive the analogue audio frequency input signal in the microphone 6 on the computer phoneme frequency circuit 28 in being operably connected to computer equipment 10.Computer phoneme frequency circuit 28 can convert the analogue audio frequency input signal to digital audio-frequency data and cross over the communication bus (not shown) it is passed to computer equipment 10.Subsequently, operating system 18 can obtain digital audio-frequency data and can provide it to speech recognition engine 22 so that carry out the speech identifying function of being carried out by well-known speech recognition engine in the present technique usually thereon with traditional approach.The voice dictation session of the voice of dictation to computer equipment 10 is provided the speaker, and computer phoneme frequency circuit 28 can convert the simulated audio signal of representative dictation voice to digital audio-frequency data.In preferred embodiment,, simulated audio signal can be transformed into digital audio-frequency data by at fixed sample rate up-sampling simulated audio signal such as every 10-20 millisecond.Digital audio-frequency data passes to speech recognition engine 22 the most at last.Speech recognition engine 22 can carry out the speech-to-text conversion to speaker's voice with well-known speech-to-text switch technology in the present technique.Particularly, as traditional speech recognition system, for word of saying and phrase represented in the discriminating digit voice data, speech recognition engine 22 can be handled digital audio-frequency data.

As in typical speech recognition system, speech recognition engine 22 can miss recognizing voice sometimes.This is, though the speaker has dictated a word, speech recognition engine 22 might convert this word to the text of the word of not representing dictation.For example, though the speaker has dictated word " milk ", speech recognition engine 22 may convert the word " milk " of dictation to text " mill ".Yet, there are not enough visual feedback, the speaker can not know that speech recognition engine 22 misses the word of having discerned dictation.

For what compensate that limited display device by computer equipment 10 causes the speaker is lacked visual feedback, the present invention can provide audible feedback with speech Compositing Engine 24, particularly, execution contexts adopts the speech synthesis engine 24 of well-known Text To Speech (TTS) technology in the present technique to play the text that draws to the speaker after speech conversion in speech recognition engine 22.The speaker can discern the dictation text of by mistake having been discerned.

When providing the mode of audible feedback mistake to be described when discerning with speech synthesis engine 24 to the speaker, the speaker just can use according to the voice correct application program 26 of this inventive apparatus and proofread and correct the text of mistake identification.Voice correct application program 26 disclosed herein can be realized as computer program with the developing instrument of buying of selected operating system 18 with reference to the process flow diagram shown in Fig. 3 to 6 by the programmer, the creative embodiment of the whole expression of these process flow diagrams voice correct application program 26.In preferred embodiments, the speaker can dictate voice and can check the mistake identification that is wherein comprised from the text and the correction of dictation speech conversion one-tenth subsequently in having computer equipment limited or that do not have display.In preferred embodiment, having computer equipment limited or that do not have display is telephone system, and the speaker can be therein by telephone bandset and computer equipment interactive dialogue.

Specifically referring to Fig. 3, this method is preferably from by being connected to speaker (speaker) traditional approach of computer equipment 10 communicatedly.Particularly, connection can start with the voice activation telephone system by the button on the promotion portable equipment or by the magnetic rate service device that wherein has induction (telepathy) card.Particularly, magnetic rate service device is a well-known equipment and can be used for connecting multiple phone line communicatedly in the present technique, and each telephone wire has speaking separately.In case connect, the speaker just can provide audio frequency to input to computer equipment 10 and can receive from the audio frequency of computer equipment 10 by magnetic rate service device by magnetic rate service device and export.

Here operate on the single or multiple vocal resonance incident that the method that will describe can receive in speech recognition engine 22, each speech events comprises in a plurality of alphanumeric characters, word, phrase, sentence or its combination any.Utilize conventional art, speech recognition engine 22 can be programmed for and on step 30, detect the vocal resonance incident, preferably as discussed above passing through at predefined fixed sample rate up-sampling audio input device 6.(promptly do not have therein after the time delay that speech events takes place by method being programmed for after pre-determined voiceless sound signal time slot, such as the silent amount that prolongs), perhaps by promoting same button or different button, perhaps use known any other alternative method of person skilled in the art person, temporarily stop to listen to speech events and stop sample window automatically.

Detect on step 30 after the vocal resonance incident, speech recognition engine 22 can be handled this speech events, so that convert speech events to the speech recognition text.Subsequently, on step 32, can the interim memory buffer unit of text entry in the internal memory device 14 of computer equipment 10 with speech recognition in.In step 34, in order to play the text of the speech recognition that is comprised in the interim memory buffer unit to the speaker, the text of speech synthesis engine 24 energy processed voice identifications.But the speech that speech synthesis engine 24 using systems generate plays to the speaker with the text of speech recognition, and this preferably transmits by audio output apparatus 8, and in the situation of telephone system, it is to be operatively coupled on speaker's the audio input device 6.Thereby whether step 34 allows the speech events of record in speaker's determining step 32 correctly to be discerned.

In step 34, with other is the same Anywhere in this method, speech synthesis system 24 preferably can be distinguished such as " to ", " too " and unisonance different antonyms such as " two " by the theme word being combined into syllables with listening to the speaker.Moreover, can be with such as speech synthesis system 24 being programmed for during playing with word " Alpha " the alphabetical A of expression or for the clarification of any necessity in detail traditional play-back technologies such as " A among the Apple " being described in detail and improving the speaker's of the broadcast of forming by single letter understanding.

If speech events is not correctly discerned, control is delivered to step 38 from step 36, in this case can be from the interim memory buffer unit that writes down it other text of deletion misrecognition.Otherwise, if speech events is correct identification, control is delivered to step 40 from step 36, can check whether the text of speech recognition is to comprise the voice corrective command in the text of determining speech recognition by voice correct application program 26 therein, this the most handy preferable key word indication will be described in detail as following.

In step 40-46, voice correct application program 26 can detect preferable key word in the text of speech recognition.Each preferable key word can be to be used for indicating the speaker to want to proofread and correct the voice corrective command of the text of the speech recognition that is stored in impact damper.In preferred embodiment, four preferable key words are arranged, be listed as not according to certain order: delete, replace, stop and proofreading and correct.For the purpose of this instructions, will in whole description, use this four key words.Yet, should be understood that available be intended to easily with (preferably including the unlikely word of saying of speaker or the phrase of dictation speech events) other word of the actual content difference of speech events itself or phrase (SCRATCH THAT) etc. realize other preferable key word or phrase similarly such as dashing out it.

The speaker can not dictate other a plurality of key words and not vectoring computer equipment 10 go to carry out the order of related with it regulation.Yet, this description is not limited to this on the one hand, listen to technology because also can imagine other, for example, dictation " SCRATCH (dashing out) ", allow sample window subsequently then, dictation " THAT " more subsequently, or alternatively use such as word such as " COMPUTER SCRATCHTHAT " attention, so that signal to that computer equipment 10 goes to carry out indicated voice corrective command and the speech events of back is not stored in the text buffer of internal storage 14 of computer equipment 10, discuss with reference to step 48 as following.

When screening the speech recognition text of speech events for the existence of preferable key word, i.e. check word DELETE (deletion) in step 40, REPLACE in step 42 (replacement), STOP in step 44 (stopping), and CORRECT (corrections) exists phonetic entry and broadcast five kinds possible situations after the speech events of identification correctly in step 46.Particularly, can in step 48, speech events be added in the text buffer, or as an alternative, the speech events that writes down previously that is stored in the text buffer be carried out one of four kinds of orders of pointing out that triggered by suitable preferable key word.Be described in more detail below respectively these five kinds of situations.

Situation 1: speech events is added in the text buffer

Show after the open period that is used for phonetic entry when input equipment during sample window, just can dictate the speech events that comprises message.For example, suppose that the initial speech incident that will dictate and write down comprises message " Stop on your way home to buy milk ".In when prompting, speaker or state whole message or in a plurality of message sections, dictate it.For the purpose supposition of discussing is back one situation, initially dictation can be " Stop on your way home; " in step 30, detect this first message section in this case, record in step 32, in step 34, play to then the speaker borrow the whether correct identification of decision-making system it.Suppose correct identification, the step 40-46 of Fig. 3 is walked around in control, because potential voice corrective command asked in the preferable key word of statement of no use.On step 48, first message section is added in the text buffer of internal memory device 14 or other memory devices then, and voice correct application program 26 turns back to the speech events that step 30 goes to continue to listen to the back.

When pointing out the phonetic entry of back, the speaker can dictate second message section according to the hypothesis of front, i.e. " to buy milk ".Because method of the present invention is repetition, detects second message section on step 30, record in step 32 plays to the speaker then in step 34, discusses with reference to first message section as the front.Yet, in step 34, preferably only state nearest speech events, i.e. " to buy milk " supposes that on step 36 suitably software is discerned second speech events, because the potential order of mode request of the preferable key word of statement of no use and software advances once again by the step 40-46 of Fig. 3.On step 48, this second message section is added in the first message section back in the text buffer then, these two sections constitute synthetic desired word message, " Stop on your way home to buy milk " now.

Person skilled in the art person will appreciate that the speaker can provide the message of dictation in above-mentioned one or more speech events.Advise that as top when utilizing a plurality of speech events to make message, single voice segments sequentially merges well on step 48, so that finish whole message in text buffer for record.In case successfully dictated whole message, just can order the write phase of listening that withdraws from this method with STOP, finally start the transmission of the requirement of electronic information.With reference to situation 4 the STOP order is described more fully below.

Situation 2:DELETE (deletion) order

Turn back to the example of front, suppose that the speaker becomes the first message section misstatement and desired statement " Stop on your way home ".Opposite " Stop on your to work ". in this situation, after speech synthesis system 24 is carried out the broadcast of text of speech recognitions in step 34, even speech recognition engine 22 is actually recognizing voice incident correctly, the speaker can recognize the mistake dictation.Therefore, suppose that the speaker wants to eliminate the mistake dictation with the DELETE order.More specifically, suppose that the speaker correctly states that " DELETE " instructs the last increase of elimination to interim memory buffer unit after the speech recognition speech events in step 36.In addition, the speaker can state " DELETE " immediately in phrase " the Stop on your way home " back of mistake dictation.In any case, in step 34,, order the message of dictating together with mistake no matter it is independent DELETE order or DELETE always speech events is played to the speaker.Suppose at this moment correct speech recognition in step 36, in step 40, detect the DELETE order and method is transferred to step 38, in this case, from originally it is placed with the text buffer that is stored in wherein this speech events of deletion.After this, method shifts back step 30, and method can recover to listen to next speech events therein.

Situation 3:REPLACE (replacement) order

The example of continuation situation 2 is supposed that the speaker correctly states to lay one's finger on " Stop on your way home " by first message section it successfully adds in the text buffer in step 48.After supposing that in step 30 speech events detects next time, second message section mistake listened write as " To buy juice " different with the statement that requires " To buy milk ".After step 34, the speaker hears mistake dictation, even speech events is at next step, promptly is actually correct identification in the step 36.Yet, with the multistep method of discussing in the situation 2 (promptly, at first eliminate the text of mistake dictation with the DELETE order, on its position, import proper messages then) difference, because voice correct application program 26 shifts back step 30, the speaker wishes the speech events of deletion mistake dictation in one step and replaces it with the proper messages section.

More specifically, detect after the mistake dictation, on step 30, detect the statement " Replace to buy milk " of back.This speech events of record and in step 34, it is played to that the speaker is for confirmation correctly to have discerned order according to step 36 in step 32 again.Suppose correctly to discern the statement of back, voice correct application program 26 is ordered owing to first word of importing in a minute detects REPLACE in step 42 then, and causes transferring to new instruction group.Here with the last input of replacing immediately following the replacement message section of command keyword REPLACE in the text buffer, i.e. dictation " to buy juice " by mistake.Thereby, in one step, cause replacing the phrase " To buy juice " of mistake dictation with the phrase " To buy milk " after proofreading and correct.In step 34, speech synthesis engine 24 is play again and is replaced text for its correctness of speaker verification.Suppose that text is replaced in correct identification in step 36, in step 48, make the message after text buffer comprises the correction of being made up of the message " Stop on your way home to buy milk " of polymerization.

Situation 4:STOP order

In step 48, successfully the message that requires is added before or after the text buffer, just can on step 44, withdraw from voice correct application program 26 well with the STOP order.Mate with the voice segments of each expection of speaker verification and corresponding broadcast and to finish desired message with repetitive mode and correctly be stored in judgement in the text buffer.In case made this judgement, such as the ending of the example in situation 3, the speaker can state the STOP order, as previously mentioned, on step 34 it is played to the speaker.Suppose correct this order of identification on step 36 again, voice correct application program 26 detects the STOP order on step 44, at this moment control the process " A " shown in the step 50 that is delivered to Fig. 4.

Referring now to Fig. 4,, step 50 is used for judging whether added new word well in uncompleted dictation.Here " newly " word is included in the word in the uncompleted dictation but the word that do not find in the vocabulary words database that speech recognition engine 22 is safeguarded well.If comprise new word in vocabulary, available " combining into syllables out " process adds them, and this process is the part of the following process that will discuss with reference to situation 5.

In case by combining out process into syllables new word is added in the uncompleted dictation, preferably finishes the detection of new word by the set sign related with new word.Perhaps, in order to define new word, the word in the uncompleted dictation can be to the lexical data base crosscheck, because any word in the dictation of not finding in vocabulary all is new.Also can realize other conventional art of new word identification, as in the field of speech recognition personnel understood.

If combine the word that makes new advances into syllables according to step 50, be preferably in the step 52 they are added in the software vocabulary for using in dictation dialogue in the future.Otherwise, if in step 50, do not detect new word, perhaps in step 52, be added to new word in the software vocabulary after, method of the present invention finishes, as shown in FIG. 4, expression dictation dialogue completes successfully whereby.After this, can realize that traditional technology handles successfully the electronic information of dictation.

Situation 5:CORRECT (correction) order

The dictation dialogue that turns back to Fig. 3 and suppose previously supposes correctly to dictate and discerned first message section from phrase " Stop on your way home " successfully is added in the text buffer step 48.Suppose that during the message section of dictation back the speaker has missed last word " milk " and lost " k " sound, its consequence is the broadcast statement " Tobuy mill " of software in step 34 rather than the statement " To buy milk " that requires.Without DELETE or REPLACE order, the speaker selects to use CORRECT (correction) order.

More specifically, the part of the subsequent voice incident of the first speech events back of proofreading and correct as needs supposes that the speaker has stated order " CORRECT mill ".At this moment in step 34, in order to confirm correct speech recognition, speaker's order is played to the speaker, and supposition correct speech recognition on step 36, the speaker can issue CORRECT voice corrective command.Subsequently, voice correct application program 26 detects the CORRECT order in step 46.After this, control is delivered to the process " B " of step 54 among Fig. 5.

Referring now to Fig. 5,, in step 54, voice correct application program 26 can play to the speaker audibly with " proofread and correct candidate " table, i.e. the table of determining traditionally near speech recognition engine 22 on pronunciation or phonetic for letter, word or the phrase of proofreading and correct letter separately, word or the phrase selected.In the example of this hypothesis, for word " mill " has been selected in correction.Thereby, proofread and correct the word that can be included in the candidate list on pronunciation or the phonetic near " mill ".Moreover each element on the candidate list can comprise an identifier (such as number) so that the speaker selects the correction that requires.For example, in the situation of this hypothesis, candidate list can comprise " 1.milk; 2.meal; 3.mark ".

Proofreading and correct candidate list can generate traditionally with any way that person skilled in the art person is known.For example, when the speaker dictated theme message " To buy milk ", the table of the possible candidate of each word in the statement is set up and considered to the speech recognition engine 22 of just carrying out traditional voice identification on theme message.Subsequently, for the accurate speech-to-text conversion of each word is provided, speech recognition engine 22 is selected statistics for each word select and is gone up most probable candidate.The candidate's table of corrections that can therefrom select word " mill " also is the candidate list that will utilize in the step 60, will describe in detail as following.

When speech synthesis engine 24 can be play the correction candidate list of " mill " with listening in step 54, the speech recognition engine 22 of cooperating with voice correct application program 26 can detect the speech events of selecting such as the speaker such as correction candidate simultaneously with well-known technology in the present technique from proofread and correct candidate list.For example, " swarming into (barge-in) " technology permission speech recognition software is spoken simultaneously and is listened to, and the speaker can " swarm into " by word of mouth or by promoting button.

Receive swarming into or other input of speaker in step 56, and control is delivered to step 58-62, at this moment the speaker can enumerate one of three voice corrective commands, preferably the SELECT that here lists not according to certain order, SPELL or REPLACE.Each processing of finishing in these three voice corrective commands is discussed respectively below.

Situation 5A:SELECT (selection) order

Turn back to the dictation dialogue of the hypothesis of situation 5, recall first message section and correctly dictate and discern, therefore phrase " Stop on your way home " correctly is added in the text buffer in step 48.Recall simultaneously during dictation second voice segments, speaker's mistake has been pronounced word " milk ", causes message section " to buy mill " is joined in the interim memory buffer unit.Then, further recall the speaker and state " CORRECT mill " in next speech events, at this moment speech synthesis engine 24 is play the correction candidate list of " mill " audibly in step 54.

In CORRECT order context, the SELECT order can be the key word input that receives on step 56.If like this, when candidate list has comprised the timing of requirement, the SELECT order is best.For example, if the candidate list of " mill " be: 1.milk; 2.meal; 3.mark can hear that etc., speaker the correction of requirement is that first in the broadcast of step 54 selected.Therefore, speaker's statement " SELECT ONE (selecting 1) ", this will and be handled by voice correct application program 26 subsequently by speech recognition engine 22 identifications.

Therefore speech recognition engine 22 generates text from speaker's phonetic entry, can discern the SELECT order on step 58.The transfer to step 64 that draws causes selecting number 1, promptly desired word " milk " from the broadcast of hearing of candidate list.Being similar to from candidate list other may select if select a sound with obscuring, and voice correct application program 26 can cause speech synthesis engine 24 to combine out the correction of selection into syllables selectively, but this is not the situation in this hypothesis and not providing in the drawings.

In step 70, whether the selection that voice correct application program 26 causes speech synthesis engine 24 can require the speaker verification to play in step 64 with listening is correct or satisfied.If the speaker answers " no (no) ", then voice correct application program 26 turns back to step 54 so that play the candidate list of " mill " again, provides the meeting of another inspection machine to the speaker.

Otherwise if answer " yes (being) " to the inquiry speaker of step 70, then " milk " of speaker's selection replaces for the theme word of proofreading and correct, i.e. " mill ".In addition,, control the step 30 of transmitting back Fig. 3 from step 70, because the dictation that now will require successfully adds text buffer if selection is correct.From this moment, can make the speaker withdraw from voice correct application program 26 with STOP (stopping) order as discussed earlier.In addition, the speaker can be according to methodology dictation additional text of the present invention and order.

Situation 5B:REPLACE (replacement) order

When theme word or phrase do not comprise the correction wanted but the speaker knows or believes that comprising the timing of wanting in the vocabulary preferably uses the REPLACE order.Example for from situation 5A this means that the candidate list of " mill " does not comprise word " milk ".Yet, supposing that the speaker knows or believe that word " milk " is among vocabulary, the speaker can state " REPLACE milk ", and this is the input that receives on step 56.

Thereby in step 62, software can detect the REPLACE voice corrective command that is converted to text by speech recognition engine 22 as previously discussed traditionally.The result transfers to step 68, supposes and really found word " milk " in speech recognition vocabulary, and speech synthesis engine 24 just can be play the word of wanting " milk " with listening.If do not find, voice correct application program 26 preferably can be transmitted the audio message of acquiescence.

In step 74, whether voice correct application program 26 can make speech synthesis engine 24 require the replacement of broadcast in speaker verification's step 68 correct.If the speaker replies " Yes (being) ".Then use speaker's replacement word " milk " and the theme word that will proofread and correct, i.e. " mill ", exchange.In addition, control the step 30 of transmitting back Fig. 3 from step 74, because the dictation that requires is now in text buffer.At this moment, the speaker can withdraw from voice correct application program 26, discusses with the STOP order as the front.In addition, the speaker can dictate other text and order, as previously discussed.

Yet, if the speaker replies "No" in step 74, voice correct application program 26 preferably turns back to step 54, again play the candidate list of " mill " in this case, though the correction of using SELECT or REPLACE to order unlikely generation to be wanted, the former since the selection wanted not in candidate list, and the latter was owing to before can not find the replacement word of wanting in vocabulary.Like this, SPELL (combining into syllables) order is logical selection in this case, will describe in detail below.

Situation 5C:SPELL (combining into syllables) order

When the speaker know or the correction finding to want neither in the theme candidate list also not in vocabulary, then can adopt the SPELL order.For the purpose of discussing, suppose to combine into syllables out the correction of wanting " milk " and do not proofread and correct on the candidate list because it is not included in.Thereby when the speaker stated " SPELL " subsequently, speech recognition engine 22 received these inputs and it is passed to voice correct application program 26 in step 56.The speaker need not be in order to identify " mill " statement " SPELL mill " clearly for proofreading and correct, because selected word " mill " for proofreading and correct by statement " CORRECT mill " in the step 46 of Fig. 3.In addition, clearly statement " SPELL milk " does not have any effect, because " milk " be not in vocabulary; If it, then the speaker can preferably use SELECT or REPLACE order rather than SPELL order.

Speech recognition engine 22 usefulness classic methods generate text from speaker's audio frequency input, and correspondingly identification SPELL order in step 60.After this, control is transferred to process " C " Fig. 6 from step 60, more specifically arrives step 78, starts the correction process that combines into syllables with this.

Voice bearing calibration 24 is play the SPELL order with speech synthesis engine 22 binding energy in step 78, though it also can be repeated for theme letter, word or the phrase proofreaied and correct.Here, in step 46, identified word " mill " for proofreading and correct.Thereby in this hypothesis, voice correct application program 26 can cause statement " SPELL " or " SPELL mill " simply.In step 78, voice correct application program 26 can be stated for the speaker and combine the instruction of observing in the correction into syllables in dictation simultaneously.

For example, these instructions can be instructed the speaker: 1) wait for software prompt; 2) character string of statement requirement; Or 3) keep the predetermined reticent period so that the current dialogue that combines into syllables of indication software is finished.It can be the routine of identifier word interbody spacer with software programming that these instructions also can provide, and allows to combine into syllables the word with the order of separating at interval whereby.Person skilled in the art person will appreciate that in step 78 also can realize the instruction that other is traditional.

After predetermined software prompt, at first reach the audio speech input that in step 80, receives the speaker subsequently by voice correct application program 26 by speech recognition engine 22.Best, phonetic entry can comprise one of following four kinds of possibilities: a series of characters, DELETE, REPLACE or FINISH order.Each of these situations is discussed below.

Situation 5C-1: character statement

About receive input in step 80, speech recognition engine 22 is preferably in the definable input of speaking that receives the speaker in the period of listening to.In preferred embodiment, respectively to listen to the period after the message that the prompting speaker that voice correct application program 26 starts speaks, to begin, the reticent period of appointment then can stop listening to the period.Yet person skilled in the art person can realize defining other scheme of listening to the period.

Thereby when software prompt or other showed for this phonetic entry of reception after the open period of phonetic entry and in step 80, the speaker combined the correction of expection into syllables, i.e. word in this example " milk ".When the predetermined reticent period finishes, above-mentionedly listen to the period and finish.Thereby, if never call order, can walk around step 82-86 so that arrive step 88, will import " milk " whereby and be added to the word " mill " of replacing front mistake dictation in the text buffer.In step 100, the synthesized phonetic sound that speech synthesis engine 24 generates can be repeated the input " milk " of increase so that the speaker can confirm to have made suitable correction then.Referring to step 80, the speaker can alternatively withdraw from the calibration phase that combines into syllables of voice correct application program 26 with FINISH (end) order again.Order below with reference to situation 5C-4 full-time instruction FINISH.

Situation 5C-2:DELETE (deletion) order

Referring to the hypothesis among the situation 5C-1, suppose that broadcast in the step 100 discloses speaker's mistake and combined the correction that combines into syllables that correction or speech recognition engine 22 mistakes have been discerned the speaker into syllables.If like this, the speaker can state simply next time by step 80 time that " DELETE " calls the DELETE order.In step 82, voice correct application program 26 can detect the DELETE order and can bring out the broadcast of DELETE order in step 90 then, and affirmation speech recognition engine 22 has suitably been discerned speaker's voice corrective command.In the step 96 of back, delete the character group that is added to the mistake in the text buffer at last, receive other audio speech input and further provide another time chance to supply successfully input word " milk " of speaker so that speech recognition engine 22 can be cooperated with voice correct application program 26 thereby can make this moment voice correct application program 26 return step 80.

Situation 5C-3:REPLACE (replacement) order

The REPLACE order can be used for deleting the text that mistake combines into syllables in one step from text buffer, replaces the text with the text string that correctly combines into syllables, and these are different with multistep method discussed above.Referring to the hypothesis of situation 5C-1, suppose that the voice playing announcement speaker mistake in the step 78 has combined correction into syllables, or speech recognition engine 22 mistakes have been discerned speaker's the correction that combines into syllables again.At this moment call REPLACE voice corrective command, the speaker can statement " REPLACE milk " next time by step 80 time.Yet in this case, statement letter " m-i-l-k " defines to replace and combines into syllables.Preferably statement can be included in the minibreak that occurs between the pronunciation of each single letter, so that speech recognition engine 22 can be discerned each single letter.

Then in step 84, cooperate with speech recognition engine 22, voice correct application program 26 can detect REPLACE voice corrective command, and preferably can in step 92, play this REPLACE voice corrective command together with the relative substitute character of REPLACE voice corrective command, be the suitable speech recognition of speaker verification whereby.In step 98, replace the last error character group that adds in the text buffer then with substitute character.Subsequently, voice correct application program 26 can turn back to step 80 make the speaker can with the FINISH voice corrective command of discussing below selectively combining into syllables in the correction process from voice correct application program 26 withdraw from.

Situation: 5C-4:FINISH (end) order

In case proofreaied and correct the errors in text of selecting, verify from any way discussed above as the speaker, state that FINISH voice corrective command is so that the speaker can withdraw from the correction process that combines into syllables by

step

86 and 94 in the time of can be afterwards by step 80.After the step 94, the step 30 of main speech list entries is returned in the control redirect, and allowing increases new text, or termination messages generates process, as above discusses in the face of the STOP order.

Spirit of the present invention is not limited to above-described any embodiment.Therefore, do not depart from the scope of the present invention, other are revised person skilled in the art person is conspicuous.Thereby, it must be understood that the detailed description of the present invention and accompanying drawing is for example, but not be used for restriction.

Claims

1. one kind is used for proofreading and correct the speech recognition application programming interface voice bearing calibration of the text of identification by mistake, comprises the steps:

Receive the audio speech input and the audio speech of described reception is imported speech-to-text and convert the speech recognition text to;

In described speech recognition text, detect the first voice corrective command of carrying out correct operation on the speech recognition text be used for being stored in text buffer;

If in described speech recognition text, do not detect the first voice corrective command, just described speech recognition text is added in the described text buffer; And

If in described speech recognition text, detect the first voice corrective command, just carry out described detected voice corrective command on the speech recognition text in being stored in described text buffer.

2. according to the process of claim 1 wherein that described receiving step further comprises the steps:

Can confirm the described speech-to-text conversion of described speech recognition text with listening.

3. according to the method for claim 2, can confirm that wherein the described step of the described speech-to-text conversion of described speech recognition text comprises the steps: with listening

Described speech recognition text can be play with listening, thereby the speech recognition text of described record can be determined in described speech-to-text switch process, whether to discern by mistake.

4. according to the method for claim 1, also comprise the steps:

Response detects the described first voice corrective command in described speech recognition text, described first voice corrective command indication is wished to stop described voice bearing calibration, and whether the described speech recognition text of determining to be stored in the described text buffer combines into syllables out;

The described speech recognition text that is defined as having combined into syllables out is increased in the speech recognition vocabulary of the discernible word of voice; And

Stop described voice bearing calibration.

5. according to the method for claim 1, also comprise the steps:

Response detects the described first voice corrective command in described speech recognition text, described first voice corrective command indication wishes to proofread and correct the mistake identification text in the described text buffer, can play voice with listening and proofread and correct candidate list, the candidate proofreaied and correct in each voice in the wherein said table is to selectable identification text on the statistics of described audio speech input;

Receive voice described in the described table and proofread and correct one of candidate's selection; And

Proofread and correct the text that the candidate replaces the described mistake identification in the described text buffer with the voice of described selection.

6. according to the method for claim 1, also comprise the steps:

Receive the second voice corrective command, the described second voice corrective command is not only indicated preferable replacement text but also indicate selection to replace described mistake with described preferable replacement text in described text buffer and discern text; And

Response receives the described second voice corrective command, replaces the text of mistake identification in the described text buffer with described preferable replacement text.

7. according to the method for claim 1, also comprise the steps:

Receive the second voice corrective command, described second voice corrective command indication wishes to replace with the replacement text that combines into syllables out the text of the described mistake identification in the described text buffer;

Response receives the described second voice corrective command, the replacement text that acceptance can combine into syllables out with listening, and the described replacement text that combines into syllables out comprises a series of alphanumeric characters of saying with listening;

The described alphanumeric character series of saying of speech-to-text conversion, the alphanumeric character that each speech-to-text is changed is stored in the temporary buffer, and with the synthetic replacement text that combines into syllables out of the alphanumeric character set of described speech-to-text conversion; And

Discern text with the described mistake that the described replacement text that combines into syllables is out replaced in the described text buffer.

8. according to the method for claim 7, also comprise the steps:

Detect the 3rd voice corrective command in the described replacement text that combines into syllables out, described the 3rd voice corrective command indication wishes that deletion is stored in the particular letter numerical character in the described temporary buffer with listening; And

Response detects described the 3rd voice corrective command, the described particular letter numerical character of deletion from described temporary buffer.

9. according to the method for claim 7, also comprise the steps:

Detect the 3rd voice corrective command in the described replacement text that combines into syllables out, described the 3rd voice corrective command is not only indicated preferable replacement alphanumeric character but also indicate the described preferable replacement alphanumeric character of wishing with in the described temporary buffer to replace specific alphanumeric character with listening; And

Response detects described the 3rd voice corrective command, replaces described particular letter numerical character with the described preferable alphanumeric character in the described temporary buffer.

10. according to the method for claim 7, the wherein said step of accepting further comprises the steps:

Before accepting to combine out the replacement text into syllables, play the one group of instruction of storage in advance that is used to provide the described replacement text that combines into syllables out with listening.

11. one kind is used for proofreading and correct the speech recognition application programming interface voice means for correcting of the text of identification by mistake, comprises

12. according to the device of claim 11, wherein said receiving trap further comprises:

Can confirm the device of the described speech-to-text conversion of described speech recognition text with listening.

13. according to the device of claim 12, the device of the described speech-to-text conversion of the described speech recognition text of wherein said affirmation comprises with listening:

Described speech recognition text can be play with listening so that can judge the device of in described speech-to-text conversion equipment, whether having discerned the speech recognition text of described record.

14. according to the device of claim 11, also comprise following apparatus, this device:

Response detects the described first voice corrective command in described speech recognition text, described first voice corrective command indication is wished to stop described voice means for correcting, determines whether to combine into syllables out the described speech recognition text that is stored in the described text buffer;

But the described speech recognition text that is defined as having combined into syllables out is added in the speech recognition vocabulary of word of recognizing voice; And

Stop described voice means for correcting.

15. according to the device of claim 11, also comprise following apparatus, this device:

Response detects the described first voice corrective command in described speech recognition text, described first voice corrective command indication selects to proofread and correct the text of the mistake identification in the described text buffer, can play voice with listening and proofread and correct candidate list, the candidate proofreaied and correct in each voice in the wherein said table is to selectable identification text on the statistics of described audio speech input;

Receive described voice in the described table and proofread and correct one of candidate's selection; And

16. according to the device of claim 11, also comprise following apparatus, this device:

Receive the second voice corrective command, the described second voice corrective command is not only indicated preferable replacement text but also indicate the text of selecting to replace in described text buffer with described preferable replacement text described mistake identification; And

Response receives the described second voice corrective command, replaces the text of the described mistake identification in the described text buffer with described preferable replacement text.

17. according to the device of claim 11, also comprise following apparatus, this device:

Response detects the described first voice corrective command in described speech recognition text, described first voice corrective command indication selects to proofread and correct the text of the mistake identification in the described text buffer, can play the voice table of corrections, the candidate proofreaied and correct in each voice in the wherein said table is to selectable identification text on the statistics of described audio speech input with listening;

Receive the second voice corrective command, described second voice corrective command indication selects to replace with the replacement text that combines into syllables out the text of the described mistake identification in the described text buffer;

Replace the text of the described mistake identification in the described text buffer with the described replacement text that combines into syllables out.

18. according to the device of claim 17, also comprise following apparatus, this device:

Detect the 3rd voice corrective command in the described replacement text of hearing that combines into syllables out, described the 3rd voice corrective command indication wishes that deletion is stored in the particular letter numerical character in the described temporary buffer; And

19. according to the device of claim 17, also comprise following apparatus, this device:

20. according to the device of claim 17, the wherein said step of accepting further comprises following apparatus, this device:

Before the alphanumeric text of accepting to combine into syllables out, play the instruction group of the prior storage that is used to provide the described replacement text that combines into syllables out with listening.