CN103021412B - Voice recognition method and system


Info

Publication number
CN103021412B
CN103021412B (application CN201210584746.2A)
Authority
CN
China
Prior art keywords
character string
error
voice
error correction
identification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210584746.2A
Other languages
Chinese (zh)
Other versions
CN103021412A (en)
Inventor
何婷婷 (He Tingting)
胡郁 (Hu Yu)
刘庆峰 (Liu Qingfeng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority application: CN201210584746.2A
Publication of CN103021412A
Application granted
Publication of CN103021412B
Legal status: Active
Anticipated expiration


Landscapes

  • Document Processing Apparatus (AREA)
  • Machine Translation (AREA)

Abstract

An embodiment of the invention discloses a voice recognition method and system. The method includes: performing voice recognition on a voice signal input by a user to obtain a voice recognition result and the speech segment corresponding to each character in that result; receiving error-correction information input separately by the user and generating an error-correction character string from it; determining, according to the error-correction character string, the misrecognized speech segment in the user's voice signal; determining, according to the per-character speech segments, the character string in the recognition result that corresponds to the misrecognized segment, and taking it as the erroneous character string; and replacing the erroneous character string with the error-correction character string. Because the misrecognized speech segment is determined from the error-correction string generated from information the user inputs independently, and the erroneous character string is then located through that segment, erroneous text in the recognition result is located automatically, which solves the problem that manual positioning is inconvenient.

Description

Voice recognition method and system
Technical field
The present invention relates to the field of speech recognition technology, and more particularly to a voice recognition method and system.
Background art
Speech recognition technology identifies a voice signal entered by a user and ultimately converts it into text (that is, the recognition result is a character string), providing a natural and convenient form of human-computer interaction. Taking a mobile device that adopts speech recognition as an example: with the support of speech recognition technology, the user only needs to speak to the device, and text is formed automatically after the speech recognition system identifies the signal, which greatly improves input efficiency.
However, in large-vocabulary, freely spoken applications, speech recognition still cannot achieve a fully accurate recognition rate, so the recognition result must be revised and edited manually. After the mobile device (speech recognition system) displays the recognition result in the text input area of the screen, a user who wants to edit the result must first locate the characters to be revised within it.
On mobile devices, and especially on small-screen finger-touch devices where screen size is limited,
it is inconvenient for the user to position the cursor on a particular character within a long run of continuous text, particularly when an editing cursor must be inserted between two adjacent characters.
Summary of the invention
In view of this, embodiments of the present invention aim to provide a voice recognition method and system that solve the above problem of inconvenient manual positioning.
To achieve the above object, embodiments of the present invention provide the following technical solutions:
According to one aspect of the embodiments of the present invention, a voice recognition method is provided, comprising:
performing speech recognition on a voice signal input by a user to obtain a first optimal decoding path, where the first optimal decoding path comprises a voice recognition result and the speech segment corresponding to each character in the recognition result;
receiving error-correction information input separately by the user and generating a corresponding error-correction character string, where the error-correction information is input in a non-voice mode or a voice mode;
determining, according to the error-correction character string, the speech segment of the user's voice signal in which a recognition error occurred;
determining, according to the speech segment corresponding to each character in the recognition result, the character string in the recognition result that corresponds to the misrecognized speech segment, and taking it as the erroneous character string;
replacing the erroneous character string with the error-correction character string.
According to another aspect of the embodiments of the present invention, a speech recognition system is provided, comprising:
a voice recognition unit configured to perform speech recognition on the voice signal input by the user and obtain a first optimal decoding path, where the first optimal decoding path comprises the voice recognition result and the speech segment corresponding to each character in the recognition result;
an error-correction string input unit configured to receive the error-correction information input separately by the user and generate the corresponding error-correction character string, where the error-correction information is input in a non-voice mode or a voice mode;
an automatic error-correction unit configured to determine, according to the error-correction character string, the misrecognized speech segment in the user's voice signal; to determine, according to the speech segment corresponding to each character in the recognition result, the character string in the recognition result that corresponds to that segment, taking it as the erroneous character string; and to replace the erroneous character string with the error-correction character string.
As the above technical solutions show, the disclosed scheme determines the misrecognized speech segment from the error-correction character string generated from error-correction information that the user inputs separately, and then uses that segment to find the corresponding erroneous character string in the recognition result. The error-correction string is thereby matched to the erroneous string, the erroneous string is located automatically within the recognition result, and the problem of inconvenient manual positioning is solved.
Brief description of the drawings
To describe the technical solutions of the embodiments of the present invention or of the prior art more clearly, the accompanying drawings required by the embodiments are briefly introduced below. Obviously, the drawings described here cover only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of the voice recognition method provided by an embodiment of the present invention;
Fig. 2 is a flowchart of handwriting input recognition provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of the minimum area covered by a character, provided by an embodiment of the present invention;
Fig. 4 is a flowchart of the automatic error-correction process provided by an embodiment of the present invention;
Fig. 5 is a schematic diagram of the structure of the error-correction string retrieval network provided by an embodiment of the present invention;
Fig. 6 is a schematic diagram of the structure of the speech recognition system provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.
As a simple, convenient, and efficient input method, speech recognition has replaced traditional keyboard input based on complex encodings or pinyin, providing a natural and convenient form of human-computer interaction. In recent years especially, with scientific and technological progress and the spread of wireless communication networks, online speech recognition applications such as posting microblogs, composing messages, and instant messaging have received growing attention. With the support of speech recognition technology, the user only needs to speak to a mobile device and text is formed automatically after recognition, greatly improving input efficiency.
However, in large-vocabulary, freely spoken applications, speech recognition still cannot achieve a fully accurate recognition rate, so the recognition result must be revised and edited manually. After the mobile device (speech recognition system) displays the recognition result in the text input area of the screen, a user who wants to edit the result must locate the characters to be revised within it.
On mobile devices, and especially on small-screen finger-touch devices where screen size is limited, positioning a particular character within a long run of continuous text, particularly inserting an editing cursor between two adjacent characters, is inaccurate and inconvenient.
For ease of understanding, speech recognition is first described as follows:
Denote a segment of speech to be recognized as S. After a series of processing steps, S yields a corresponding speech feature sequence O, denoted O = {O1, O2, ..., Oi, ..., OT}, where Oi is the i-th speech feature and T is the total number of speech features. The sentence corresponding to the speech signal S can be regarded as a word string made up of many words, denoted W = {w1, w2, ..., wN}. The task of speech recognition is to find, from the known feature sequence O, the most probable word string W'.
In a concrete speech recognition procedure, the speech feature parameters corresponding to the voice signal are generally extracted first; then, in a network search space formed by the preset acoustic model and language model, a search algorithm (such as the Viterbi algorithm) looks for the optimal path (that is, the optimal decoding path) with respect to the extracted feature parameters.
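Written out explicitly, the criterion behind this search is the standard Bayes decomposition (implied above but not spelled out in the original):

```latex
W' = \arg\max_{W} P(W \mid O)
   = \arg\max_{W} \frac{P(O \mid W)\, P(W)}{P(O)}
   = \arg\max_{W} \underbrace{P(O \mid W)}_{\text{acoustic model}} \; \underbrace{P(W)}_{\text{language model}}
```

The last equality holds because P(O) does not depend on W; the retrieval network described below combines exactly these two models (together with the dictionary) into one search space.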
Having covered these speech recognition concepts, the technical solutions of the embodiments of the present invention are now described.
To solve the above positioning inconvenience, the voice recognition method provided by the embodiments of the present invention comprises at least the following steps:
Speech recognition process: perform speech recognition on the voice signal input by the user to obtain the optimal decoding path, where the optimal decoding path comprises the voice recognition result and the speech segment corresponding to each character in the recognition result;
Error-correction string generation process: receive the error-correction information input separately by the user and generate the corresponding error-correction character string; the error-correction information may be input in a non-voice mode or a voice mode;
Automatic error-correction process: determine, according to the error-correction character string, the speech segment of the user's voice signal in which the recognition error occurred; determine, according to the speech segment corresponding to each character in the recognition result, the character string in the recognition result corresponding to the misrecognized segment, taking it as the erroneous character string; and replace that erroneous character string with the error-correction character string. For brevity, "erroneous character string" is used hereinafter as shorthand for "the character string of the recognition error".
Each process is introduced in turn below.
1. Speech recognition process
To meet users' daily interaction needs as far as possible, the embodiments of the present invention adopt large-vocabulary continuous speech recognition to convert freely spoken speech into text.
Referring to Fig. 1, the speech recognition process specifically comprises:
S11: track and collect the voice signal input by the user (that is, the above-mentioned speech signal to be recognized);
In other embodiments of the invention, this voice signal may be stored in a data buffer.
S12: pre-process the voice signal to obtain pre-processed speech data.
The pre-processing may include sampling, anti-aliasing band-pass filtering, frame segmentation, removal of noise effects caused by individual pronunciation differences, equipment, and the environment, and endpoint detection. To improve the robustness of the speech recognition system, the pre-processing may also include front-end noise reduction, so that subsequent speech processing receives a relatively clean signal.
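As an illustration of the framing step above, the following is a minimal sketch; the function name and parameter defaults are our own assumptions, and endpoint detection and noise reduction are omitted:

```python
import numpy as np

def frame_signal(signal: np.ndarray, sample_rate: int = 16000,
                 frame_ms: float = 25.0, shift_ms: float = 10.0,
                 preemph: float = 0.97) -> np.ndarray:
    """Split a speech signal into overlapping windowed frames."""
    frame_len = int(sample_rate * frame_ms / 1000)    # 400 samples at 16 kHz
    frame_shift = int(sample_rate * shift_ms / 1000)  # 160 samples at 16 kHz
    if len(signal) < frame_len:
        raise ValueError("signal shorter than one frame")
    # Pre-emphasis boosts the high frequencies attenuated in speech production.
    emphasized = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    frames = np.stack([
        emphasized[i * frame_shift : i * frame_shift + frame_len]
        for i in range(n_frames)
    ])
    # A Hamming window reduces spectral leakage at the frame boundaries.
    return frames * np.hamming(frame_len)
```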
S13: perform feature extraction on each frame of the pre-processed speech data to obtain a feature vector sequence.
In step S13, feature extraction on each frame of speech data yields an effective speech feature (or feature vector). Thus, after feature extraction each frame of speech data forms one feature vector, and correspondingly the speech data can be represented as a feature vector sequence.
Those skilled in the art will understand that if the pre-processed speech data comprises 30 frames, then 30 feature vectors can be extracted from those 30 frames, and these 30 vectors form the above feature vector sequence in chronological order.
In other embodiments of the invention, the effective speech feature may be linear prediction cepstral coefficients or MFCC (Mel-frequency cepstral) features. Concretely, taking MFCC features as an example, short-time analysis of each frame of speech data, with a 25 ms window and a 10 ms frame shift, yields the MFCC parameters and/or their first- and second-order differences, 39 dimensions in total. In this way, feature extraction produces one 39-dimensional feature vector per frame, as sketched below.
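A minimal sketch of this feature extraction, assuming the librosa library and the 25 ms / 10 ms framing named above (the function name is our own):

```python
import numpy as np
import librosa

def extract_features(signal: np.ndarray, sample_rate: int = 16000) -> np.ndarray:
    """Return a (T, 39) feature sequence: 13 MFCCs plus first- and
    second-order differences, one 39-dimensional vector per frame."""
    mfcc = librosa.feature.mfcc(
        y=signal, sr=sample_rate, n_mfcc=13,
        n_fft=int(0.025 * sample_rate),       # 25 ms analysis window
        hop_length=int(0.010 * sample_rate))  # 10 ms frame shift
    delta1 = librosa.feature.delta(mfcc, order=1)   # first-order difference
    delta2 = librosa.feature.delta(mfcc, order=2)   # second-order difference
    return np.vstack([mfcc, delta1, delta2]).T
```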
In other embodiments of the invention, the speech feature vector sequence may be stored in a feature buffer.
S14: perform an optimal path search for the feature vector sequence in a pre-built retrieval network (composed mainly of the system's preset acoustic model, dictionary, language model, and so on), to obtain the model string with the maximum model likelihood for the feature vector sequence and output (display) it as the voice recognition result.
In a specific implementation, the mainstream Viterbi search algorithm based on dynamic programming can be adopted: each feature vector traverses the retrieval network, the accumulated history path probability is computed for the active nodes that meet preset conditions, the active nodes whose history paths meet those conditions are retained for the subsequent search, and finally the input speech is recognized by backtracking the path with the maximum history path probability (that is, the first optimal decoding path). During decoding, the first optimal decoding path retains the recognition-unit model corresponding to every frame of speech data, so the speech segment corresponding to each character in the recognition result can be obtained, together with the start-position and end-position information of that segment; a bare-bones sketch follows.
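The following Viterbi sketch works over a generic state space; the beam pruning of active nodes mentioned above is omitted, and all names are our own. Contiguous runs of states belonging to one character's model give that character's speech segment:

```python
import numpy as np

def viterbi(log_emit: np.ndarray, log_trans: np.ndarray,
            log_init: np.ndarray) -> list[int]:
    """Find the most likely state path for a feature sequence.

    log_emit:  (T, S) log-likelihood of each frame under each model state
    log_trans: (S, S) log transition probabilities between states
    log_init:  (S,)   log initial-state probabilities
    Returns the state index for every frame.
    """
    T, S = log_emit.shape
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0] = log_init + log_emit[0]
    for t in range(1, T):
        # Accumulated history path probability for every candidate transition.
        cand = score[t - 1][:, None] + log_trans        # (S, S)
        back[t] = np.argmax(cand, axis=0)
        score[t] = cand[back[t], np.arange(S)] + log_emit[t]
    # Backtrace from the best final state to recover the optimal path.
    path = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```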
It should be noted that a "speech segment" here may be a segment of the voice signal input by the user, at least one frame of the pre-processed speech data, or a feature vector subsequence of the feature vector sequence. For brevity, the user's voice signal, the pre-processed speech data, and the feature vector sequence are hereinafter referred to collectively as the speech signal to be recognized.
That is, the speech signal to be recognized mentioned below may specifically be the voice signal input by the user, the pre-processed speech data, or the feature vector sequence; and a speech segment mentioned below may specifically be a segment of the user's voice signal, at least one frame of speech data, or a feature vector subsequence.
In other words, the voice signal of step S11, the pre-processed speech data of step S12, or the feature vector sequence of step S13 can be divided into speech segments corresponding to the characters of the recognition result, so that each character in the recognition result corresponds to a definite speech segment.
By way of example, if the voice recognition result is the character string "我们去爬山" ("we go to climb the mountain"), the decoding path information corresponding to this string can be stored as: (0000000 2200000), (2200000 3600000), (3600000 4300000), (4300000 5000000), (5000000 7400000).
Here (0000000 2200000) indicates the start-position and end-position information of the speech segment corresponding to the character "我". That is, 0000000 is the start position (moment) of the segment for "我" within the speech signal to be recognized, and 2200000 is its end position (moment).
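In code form, such an alignment can be held in a simple per-character table. The structure below is a hypothetical sketch matching the spans quoted above; the characters of "我们去爬山" are our reconstruction of the example:

```python
# (character, start position, end position) for each recognized character
alignment = [
    ("我", 0,       2200000),
    ("们", 2200000, 3600000),
    ("去", 3600000, 4300000),
    ("爬", 4300000, 5000000),
    ("山", 5000000, 7400000),
]

def segment_of(char_index: int) -> tuple[int, int]:
    """Return the speech segment covered by one recognized character."""
    _, start, end = alignment[char_index]
    return start, end

print(segment_of(0))  # (0, 2200000): the segment for "我"
```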
2. Error-correction string generation process
The embodiments of the present invention allow the user to input error-correction information in a non-voice mode or a voice mode and to generate an error-correction character string from it.
When error-correction information is input in voice mode, the input is itself a voice signal. Since it is entered by voice just like the original speech, the system may be unable to tell whether the current voice input continues new text or is an error-correction input for the original text. Therefore, a dedicated error-correction input control button can be provided to switch from voice input of new text to voice error-correction input for the original text. In this mode, because the error-correction information is a voice signal, the process of converting it into an error-correction character string is the same as the speech recognition process described above and is not repeated here; multiple candidate recognition strings can also be offered for the user to choose from, to improve the accuracy of the generated error-correction string.
In addition, the embodiments of the present invention also allow the user to input error-correction information in non-voice modes such as key input (for example pinyin input, stroke input, or area-code input) or handwriting input. With key input the error-correction information is a keystroke sequence; with handwriting input it is handwritten strokes.
Taking pinyin input and handwriting input as examples, the non-voice input process is introduced below.
The specific flow, still referring to Fig. 1, is:
S21: determine the user's input mode; for pinyin key input proceed to step S22, for handwriting input proceed to step S23.
S22: convert the keystroke sequence input by the user into candidate error-correction character strings.
Step S22 may specifically comprise:
S221: track and collect the user's keystroke sequence and convert it into a letter-string sequence;
S222: match the collected letter-string sequence against a preset pinyin dictionary to find candidate error-correction character strings, and display them.
For example, after the user inputs "qinghua", the system may display several candidates such as 清华 (Tsinghua), 青花 (blue-and-white), and 亲华 (pro-China) for the user to choose from.
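A toy sketch of the dictionary lookup in S222; the dictionary contents and function name are illustrative assumptions, not the preset lexicon of the patent:

```python
# A toy pinyin dictionary; a real system uses a large preset lexicon.
PINYIN_DICT = {
    "qinghua": ["清华", "青花", "亲华"],  # Tsinghua / blue-and-white / pro-China
}

def pinyin_candidates(key_sequence: str, limit: int = 5) -> list[str]:
    """Map a typed letter sequence to candidate correction strings."""
    return PINYIN_DICT.get(key_sequence, [])[:limit]

print(pinyin_candidates("qinghua"))  # ['清华', '青花', '亲华']
```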
S23: recognize the handwritten strokes input by the user and convert them into at least one candidate error-correction character string.
Referring to Fig. 2, step S23 may specifically comprise:
S231: track the handwritten strokes input by the user and keep the collected strokes in a handwriting data buffer;
In an online handwriting recognition system, the user's strokes are usually represented as a sequence of two-dimensional (position) or three-dimensional (position plus pen-up/pen-down state) point coordinates, which describe the spatial and temporal information of the writing.
S232: pre-process the handwritten strokes.
Because of the collection device or user behavior such as hand jitter while writing, the raw collected strokes may contain various kinds of noise. To improve system robustness, the collected strokes can be pre-processed, for example by combining character-size normalization, outlier removal, smoothing, and resampling, to reduce as far as possible the drop in recognition rate caused by noise.
S233: extract handwriting features from the pre-processed strokes.
As in speech recognition, handwriting recognition also needs to extract character features reflecting the character's shape from the raw stroke trajectory.
Concretely, this embodiment extracts the eight-direction features commonly used in the handwriting recognition field and improves their discriminability with techniques such as LDA (a sketch of this direction feature follows the step list below).
S234: match the extracted character features against preset models and compute similarities.
S235: choose at least one preset model with the highest similarity to the character features as the candidate error-correction character string(s), and display them.
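The sketch referenced in S233: a simplified eight-direction feature for one stroke, computed as a histogram of the writing direction between consecutive pen points. Real systems compute such features per grid cell and then apply LDA; all names here are our own:

```python
import numpy as np

def direction_histogram(stroke: np.ndarray) -> np.ndarray:
    """Eight-direction feature for one stroke.

    stroke: (N, 2) array of (x, y) pen coordinates.
    Returns a normalized 8-bin histogram of pen movement directions.
    """
    deltas = np.diff(stroke, axis=0)
    angles = np.arctan2(deltas[:, 1], deltas[:, 0])           # range [-pi, pi]
    bins = ((angles + np.pi) / (2 * np.pi) * 8).astype(int) % 8
    hist = np.bincount(bins, minlength=8).astype(float)
    return hist / max(hist.sum(), 1.0)                        # normalize

# Example: a roughly horizontal stroke falls mostly into one direction bin.
stroke = np.array([[0, 0], [1, 0], [2, 0], [3, 1]])
print(direction_histogram(stroke))
```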
Since the accuracy of pinyin input and handwriting recognition is generally quite good, the number of candidate error-correction strings can usually be limited to 3 to 5.
Of course, those skilled in the art will appreciate that when the user's non-voice input is long enough, there may be only one candidate error-correction string.
S25: determine the error-correction character string from the candidates.
Step S25 may specifically comprise:
accepting the user's selection and determining a unique error-correction character string from the at least one candidate.
S25 is listed separately, as a further confirmation of the error-correction string, so as to be compatible with both voice and non-voice input modes.
3. Automatic error-correction process
Considering that the error-correction character string and the speech segment corresponding to the erroneous character string in the recognition result are generally consistent, the core idea of automatic error correction in the embodiments of the present invention is: map the error-correction string onto a speech segment, then use that segment to find the corresponding words in the recognition result (that is, the erroneous character string), thereby matching the error-correction string to the erroneous string. In this way the erroneous string is located automatically within the recognition result, solving the problem of inconvenient manual positioning.
Specifically, the speech segment corresponding to the error-correction string is first found in the speech signal to be recognized; then the character string corresponding to that segment is located in the recognition result as the erroneous character string. This erroneous character string is a substring of the model string obtained in step S14, and the start and end times of the speech segment corresponding to this substring are consistent with the start and end times of the speech segment corresponding to the error-correction string in the speech signal to be recognized.
The flow of the automatic error-correction process, still referring to Fig. 1, comprises:
S31: determine, according to the error-correction character string, the misrecognized speech segment in the speech signal to be recognized;
S32: determine, according to the speech segment corresponding to each character in the recognition result, the character string in the recognition result of the first optimal decoding path that corresponds to the misrecognized segment, taking it as the erroneous character string;
S33: replace the erroneous character string with the error-correction character string.
In other embodiments of the invention, step S33 may comprise:
when the number of erroneous character strings equals 1, directly replacing that erroneous string with the error-correction string generated from the user's error-correction information;
when the number of erroneous character strings is greater than 1, replacing the erroneous string designated by the user with the error-correction string.
Some embodiments of the invention let the user actively participate in the selection; the specific flow of "replacing the erroneous string designated by the user" may therefore comprise:
A: highlight all erroneous character strings in the recognition result.
In other embodiments, besides highlighting all erroneous strings, the rest of the recognition result can be set to an inactive state to improve positioning accuracy;
B: accept the user's selection and update the selected erroneous string with the error-correction string.
In addition, other embodiments of the invention support fuzzy selection: the user is not required to position the erroneous string precisely, but positions it by a nearest-neighbor rule. When the pen-down point of the stylus falls within the neighborhood of an erroneous string, that string is selected automatically.
Specifically, the shortest distance from the pen-down point to the minimum area covered by each erroneous string is computed, and the erroneous string with the smallest such distance is taken as the user's selection. For example, referring to Fig. 3, the height H of the minimum area covered by a character may be set to A times the character height h, and the width W of that area to B times the character width w, where A and B are positive numbers greater than or equal to 1. The minimum area covered by an erroneous string is then the union of the minimum areas covered by all its characters.
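A minimal sketch of this nearest-neighbor selection, under the assumption that each erroneous string's covering region is stored as a rectangle (all coordinates and names are illustrative):

```python
def rect_distance(px: float, py: float,
                  rect: tuple[float, float, float, float]) -> float:
    """Shortest distance from a pen-down point to a covering rectangle.

    rect is (left, top, width, height); the distance is 0 inside the rect.
    """
    left, top, width, height = rect
    dx = max(left - px, 0.0, px - (left + width))
    dy = max(top - py, 0.0, py - (top + height))
    return (dx * dx + dy * dy) ** 0.5

def pick_error_string(pen_down: tuple[float, float],
                      regions: dict[str, tuple[float, float, float, float]]) -> str:
    """Select the erroneous string whose covering region is nearest the pen."""
    px, py = pen_down
    return min(regions, key=lambda s: rect_distance(px, py, regions[s]))

# Hypothetical on-screen covering regions for two erroneous strings
# (the A- and B-fold enlargement is assumed to be applied already).
regions = {"清华": (10, 40, 60, 30), "西站": (120, 40, 60, 30)}
print(pick_error_string((80, 50), regions))  # nearer to "清华"
```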
Referring to Fig. 4, in other embodiments of the invention step S31 may specifically comprise:
S311: build an error-correction string retrieval network from the error-correction character string.
Referring to Fig. 5, the error-correction string retrieval network comprises an error-correction string model and preset absorbing models.
The error-correction string model is built from the error-correction string: the string is expanded into the corresponding model sequence via a preset dictionary to obtain the corresponding error-correction string model. Since the error-correction string generated from each piece of user input differs, the error-correction string model in the network must be updated in real time.
Step S31 may therefore further comprise:
obtaining the error-correction string model corresponding to the error-correction string;
obtaining the preset absorbing models;
generating the error-correction string retrieval network from the obtained error-correction string model and absorbing models.
Note that if the recognition result contains multiple non-adjacent, unrelated recognition errors, for example both "清华" and "西站" are misrecognized, the user must input error-correction information several times, by voice or non-voice means, to generate the error-correction strings. Each piece of error-correction information and the string generated from it, however many words it contains, is treated as one independent error-correction string. For example, if the user inputs 3 Chinese characters in one correction, the error-correction string comprises those 3 characters and is then expanded via the dictionary into the corresponding error-correction string model.
When expanding an error-correction string into a model, different expansions can be adopted depending on the preset acoustic model: for example an acoustic model based on syllable units (where one Chinese character corresponds to 1 syllable) or one based on phoneme units (where one Chinese character corresponds to 2 phonemes); the choice is determined by the model unit used during speech recognition. Thus, expanding the above 3-character error-correction string yields either a model concatenated from 3 syllable units or one concatenated from 6 phoneme units, as sketched below.
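A toy sketch of the dictionary expansion, with a hypothetical four-character lexicon; a real system expands via the full preset dictionary and its acoustic model units:

```python
# Toy lexicon mapping characters to model units (1 syllable or 2 phonemes).
SYLLABLES = {"清": "qing", "华": "hua", "西": "xi", "站": "zhan"}
PHONEMES = {"清": ["q", "ing"], "华": ["h", "ua"],
            "西": ["x", "i"], "站": ["zh", "an"]}

def expand(correction: str, unit: str = "syllable") -> list[str]:
    """Expand a correction string into a model-unit sequence."""
    if unit == "syllable":
        return [SYLLABLES[ch] for ch in correction]   # 1 unit per character
    units: list[str] = []
    for ch in correction:
        units.extend(PHONEMES[ch])                    # 2 units per character
    return units

print(expand("清华"))             # ['qing', 'hua']
print(expand("清华", "phoneme"))  # ['q', 'ing', 'h', 'ua']
```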
As for the absorbing model, it is a background model trained in advance by the system on massive speech data; multiple absorbing models can also be adopted to improve matching accuracy for complex speech. Note that multiple independent absorbing models are connected in parallel.
S312: decode the speech signal to be recognized again within the error-correction string retrieval network to obtain the second optimal decoding path.
The speech segment corresponding to the error-correction string model on the second optimal decoding path is taken as the misrecognized speech segment.
Concretely, the segment corresponding to the error-correction string model may be a segment of the user's voice signal, at least one frame of pre-processed speech data, or a feature vector subsequence. For simplicity, the feature vector subsequence corresponding to the error-correction string model can be chosen as the misrecognized segment. Step S312 may then specifically comprise:
searching the error-correction string retrieval network for the optimal path (that is, the second optimal path) corresponding to the feature vector sequence, and obtaining the start and end positions, within the whole feature vector sequence, of the feature vector subsequence corresponding to the error-correction string model.
The decoding in step S312 is similar to that in step S14; the difference is that step S312 uses the retrieval network built from the error-correction string, whose scope is smaller than that of the retrieval network used in step S14. The decoding can therefore still adopt the mainstream dynamic-programming Viterbi search: each frame's feature vector traverses the error-correction string retrieval network, active nodes meeting the preset conditions are evaluated and those with qualifying history paths are retained for the subsequent search, and finally the speech segment corresponding to the error-correction string model is obtained from the path with the maximum history path probability (the second optimal decoding path), thereby determining the misrecognized segment.
Since step S312 obtains the start position (moment) and end position (moment) of the speech segment corresponding to the error-correction string model, the subsequent step S32 can use the per-character speech segments of the recognition result to determine the beginning character in the recognition result corresponding to the start position of the misrecognized segment, and likewise the terminating character corresponding to its end position. Once the beginning and terminating characters are determined, the erroneous character string is determined.
More specifically, the beginning character can be determined as follows:
take the character corresponding to the start position as the first character, and the speech segment of this first character as the first segment;
if the start position lies in the front portion of the first segment, take the first character as the beginning character; otherwise select the next character of the recognition result as the beginning character.
The terminating character can be determined as follows:
take the character corresponding to the end position as the second character, and the speech segment of this second character as the second segment;
if the end position lies in the front portion of the second segment, select the previous character of the recognition result as the terminating character; otherwise take the second character as the terminating character.
Again taking the recognition result "我们去爬山" ("we go to climb the mountain") as an example, as noted earlier the start and end positions of the speech segment for each character are respectively: (0000000 2200000), (2200000 3600000), (3600000 4300000), (4300000 5000000), (5000000 7400000).
Suppose the start and end positions of the misrecognized segment obtained in step S312 are (0000050 3600000). Since start position 0000050 lies in the front portion of (0000000 2200000), "我" is determined to be the beginning character; and since end position 3600000 lies in the rear portion of (2200000 3600000), "们" is determined to be the terminating character. Hence "我们" ("we") is the erroneous character string in the recognition result corresponding to the misrecognized segment; a sketch of this mapping follows.
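The sketch referenced above, implementing the beginning/terminating character rules against the per-character alignment (the alignment list repeats the earlier hypothetical example):

```python
# (character, start, end) per character of the recognition result, in order.
alignment = [
    ("我", 0, 2200000), ("们", 2200000, 3600000), ("去", 3600000, 4300000),
    ("爬", 4300000, 5000000), ("山", 5000000, 7400000),
]

def front_half(pos: int, start: int, end: int) -> bool:
    """True if pos lies in the front portion of a character's segment."""
    return pos < start + (end - start) / 2

def locate_error_string(seg_start: int, seg_end: int) -> str:
    """Map a misrecognized speech segment onto characters of the result."""
    begin = next(i for i, (_, s, e) in enumerate(alignment) if s <= seg_start < e)
    if not front_half(seg_start, *alignment[begin][1:]):
        begin += 1   # start falls in the rear: begin with the next character
    stop = next(i for i, (_, s, e) in enumerate(alignment) if s < seg_end <= e)
    if front_half(seg_end, *alignment[stop][1:]):
        stop -= 1    # end falls in the front: stop at the previous character
    return "".join(ch for ch, _, _ in alignment[begin:stop + 1])

print(locate_error_string(50, 3600000))  # -> "我们"
```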
Corresponding to the above method, the embodiments of the present invention also provide a speech recognition system. Fig. 6 shows one structure of the system, comprising:
a voice recognition unit 1, configured to perform speech recognition on the voice signal input by the user and obtain the optimal decoding path, where the optimal decoding path comprises the voice recognition result and the speech segment corresponding to each character in the recognition result;
more specifically, the voice recognition unit may comprise a processor that performs the speech recognition on the user's voice signal;
an error-correction string input unit 2, configured to receive the error-correction information input separately by the user and generate the corresponding error-correction character string;
more specifically, when the error-correction information is input by voice, the error-correction string input unit may still comprise the above processor, which performs speech recognition on the error-correction information and generates the error-correction string;
when the error-correction information is input by key presses, the unit may comprise at least a keyboard and a processor: the processor converts the user's keystroke sequence into candidate error-correction strings and accepts the user's selection to determine a unique error-correction string from the at least one candidate. Alternatively, another independent chip or processor may perform the conversion and selection;
when the error-correction information is input by handwriting, the unit may comprise at least a stylus, a touch screen, and a processor: the processor converts the user's handwritten strokes into candidate error-correction strings and accepts the user's selection to determine a unique error-correction string from the at least one candidate. Again, another independent chip or processor may perform the conversion and selection;
of course, to ensure that the user can input error-correction information in multiple ways, the error-correction string input unit may also comprise several of the above devices simultaneously;
an automatic error-correction unit 3, configured to determine, according to the error-correction character string, the misrecognized speech segment in the user's voice signal; to determine, according to the speech segment corresponding to each character in the recognition result, the character string in the recognition result corresponding to that segment, taking it as the erroneous character string; and to replace the erroneous string with the error-correction string.
More specifically, the functions of the automatic error-correction unit 3 may also be realized by the above processor or by another independent chip or processor.
The detailed functions of the above units are recorded in the preceding method description and are not repeated here.
Those of ordinary skill in the art will recognize that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above generally in terms of function. Whether these functions are executed in hardware or software depends on the particular application and the design constraints of the technical solution. Skilled practitioners may implement the described functions differently for each particular application, but such implementations should not be considered beyond the scope of the present invention.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the devices and units described above may be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be realized in other ways. For example, the device embodiments described above are only schematic: the division of units is only a logical functional division, and other divisions are possible in actual implementation; multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or units, and may be electrical, mechanical, or of other forms.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, may exist physically on their own, or two or more units may be integrated into one unit. The integrated unit may be realized in the form of hardware or as a software functional unit.
If the integrated unit is realized as a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the essence of the technical solution of the present invention, the part that contributes to the prior art, or all or part of the technical solution may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes instructions that cause a computer device (a personal computer, server, network device, or the like) to execute all or some steps of the methods described in the embodiments of the present invention. The storage medium includes media that can store program code, such as a USB flash disk, portable hard drive, read-only memory (ROM), random access memory (RAM), magnetic disk, or optical disc.
The above description of the disclosed embodiments enables those skilled in the art to realize or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be realized in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but accords with the widest scope consistent with the principles and novel features disclosed herein.

Claims (9)

1. A voice recognition method, characterized in that it comprises:
performing speech recognition on a voice signal input by a user to obtain a first optimal decoding path, the first optimal decoding path comprising a voice recognition result and the speech segment corresponding to each character in the voice recognition result;
receiving error-correction information input separately by the user and generating a corresponding error-correction character string, the error-correction information being input in a non-voice mode or a voice mode;
determining, according to the error-correction character string, the speech segment of the user's voice signal in which a recognition error occurred;
determining, according to the speech segment corresponding to each character in the voice recognition result, the character string in the voice recognition result corresponding to the misrecognized speech segment, as the erroneous character string of the recognition error;
replacing the erroneous character string with the error-correction character string.
2. The method of claim 1, characterized in that determining, according to the error-correction character string, the misrecognized speech segment in the user's voice signal comprises:
building an error-correction string retrieval network from the error-correction character string, comprising: obtaining the error-correction string model corresponding to the error-correction string, obtaining preset absorbing models, and generating the error-correction string retrieval network from the obtained error-correction string model and absorbing models; the error-correction string retrieval network comprises the error-correction string model corresponding to the error-correction string and the preset absorbing models; the error-correction string model is built from the error-correction string by expanding the string into the corresponding model sequence via a preset dictionary; the absorbing model is a background model trained in advance by the system on massive speech data;
searching, in the error-correction string retrieval network, for the second optimal decoding path corresponding to the user's voice signal, the second optimal decoding path being the path with the maximum history path probability, and taking the speech segment corresponding to the error-correction string model on the second optimal decoding path as the misrecognized speech segment;
determining the start position and end position, in the user's voice signal, of the misrecognized speech segment.
3. The method of claim 1, characterized in that determining, according to the speech segment corresponding to each character in the voice recognition result, the character string in the voice recognition result corresponding to the misrecognized speech segment, as the erroneous character string, comprises:
determining the beginning character in the voice recognition result corresponding to the start position of the misrecognized speech segment;
determining the terminating character in the voice recognition result corresponding to the end position of the misrecognized speech segment;
determining, according to the beginning character and the terminating character, the character string in the voice recognition result as the erroneous character string of the recognition error.
4. The method of claim 3, wherein determining the beginning character in the voice recognition result corresponding to the start position of the misrecognized speech segment comprises:
taking the character corresponding to the start position of the misrecognized speech segment as the first character, and the speech segment corresponding to the first character as the first segment;
when the start position of the misrecognized speech segment lies in the front portion of the first segment, taking the first character as the beginning character;
when the start position of the misrecognized speech segment lies in the rear portion of the first segment, selecting the next character of the voice recognition result as the beginning character.
5. The method of claim 3, wherein determining the terminating character in the voice recognition result corresponding to the end position of the misrecognized speech segment comprises:
taking the character corresponding to the end position of the misrecognized speech segment as the second character, and the speech segment corresponding to the second character as the second segment;
when the end position of the misrecognized speech segment lies in the front portion of the second segment, selecting the previous character of the voice recognition result as the terminating character;
when the end position of the misrecognized speech segment lies in the rear portion of the second segment, taking the second character as the terminating character.
6. The method of any one of claims 1 to 5, characterized in that replacing the erroneous character string with the error-correction character string specifically comprises:
when the number of erroneous character strings equals 1, directly replacing that erroneous character string with the error-correction character string;
when the number of erroneous character strings is greater than 1, replacing the erroneous character string designated by the user with the error-correction character string.
7. The method of claim 6, characterized in that replacing the erroneous character string designated by the user with the error-correction character string specifically comprises:
highlighting all erroneous character strings in the voice recognition result;
accepting the user's selection and updating the erroneous character string selected by the user with the error-correction character string.
8. A speech recognition system, characterized in that it comprises:
a voice recognition unit configured to perform speech recognition on the voice signal input by the user and obtain a first optimal decoding path, the first optimal decoding path comprising a voice recognition result and the speech segment corresponding to each character in the voice recognition result;
an error-correction string input unit configured to receive the error-correction information input separately by the user and generate the corresponding error-correction character string, the error-correction information being input in a non-voice mode or a voice mode;
an automatic error-correction unit configured to determine, according to the error-correction character string, the misrecognized speech segment in the user's voice signal; to determine, according to the speech segment corresponding to each character in the voice recognition result, the character string in the voice recognition result corresponding to that segment, as the erroneous character string; and to replace the erroneous character string with the error-correction character string.
9. The system as claimed in claim 8, characterized in that:
determining, according to the error correction character string, the voice segment in which a recognition error occurs in the voice signal input by the user comprises:
building an error correction character string retrieval network from the error correction character string, which comprises: obtaining the error correction character string model corresponding to the error correction character string, obtaining a preset absorbing model, and generating the error correction character string retrieval network from the obtained error correction character string model and absorbing model; the error correction character string retrieval network thus comprises the error correction character string model corresponding to the error correction character string and the preset absorbing model, where the error correction character string model is obtained by expanding the error correction character string, through a preset dictionary, into its corresponding model sequence, and the absorbing model is a background model trained in advance by the system on massive speech data;
searching, in the error correction character string retrieval network, for a second optimal decoding path of the voice signal input by the user, the second optimal decoding path being the path with the maximum historical path probability, and taking the voice segment corresponding to the error correction character string model on the second optimal decoding path as the voice segment in which the recognition error occurs;
determining the corresponding start position and end position, in the voice signal input by the user, of the voice segment in which the recognition error occurs.
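
Claim 9 is in essence keyword spotting: the utterance is re-decoded against a tiny network in which the correction-string model competes with the absorbing (filler) model, and the frames won by the correction-string model on the best path mark the mis-recognized segment. The sketch below is a deliberately simplified stand-in, not the patent's decoder: instead of Viterbi decoding over HMM states, it assumes per-frame log-likelihoods under each model are already available and finds the contiguous frame range with the greatest score advantage (a maximum-sum subarray), which plays the role of the correction segment on the second optimal decoding path:

```python
import numpy as np

def locate_error_segment(corr_scores: np.ndarray,
                         bg_scores: np.ndarray) -> tuple:
    """Return (start_frame, end_frame) of the stretch best explained by the
    error-correction string model rather than the absorbing model.

    `corr_scores[t]` and `bg_scores[t]` are hypothetical per-frame
    log-likelihoods of frame t under the two models; the best
    background -> correction -> background path maximizes the summed
    score advantage over the correction stretch.
    """
    advantage = corr_scores - bg_scores
    best_sum, best_span = float("-inf"), (0, 0)
    run_sum, run_start = 0.0, 0
    for t, a in enumerate(advantage):
        if run_sum <= 0.0:            # start a new candidate stretch here
            run_sum, run_start = float(a), t
        else:                         # extend the current candidate
            run_sum += float(a)
        if run_sum > best_sum:
            best_sum, best_span = run_sum, (run_start, t + 1)
    return best_span
```

Scaling the returned frame indices by the frame shift (e.g. 10 ms) gives the start and end positions in the input signal that the claim's final step requires.
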
CN201210584746.2A 2012-12-28 2012-12-28 Voice recognition method and system Active CN103021412B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210584746.2A CN103021412B (en) 2012-12-28 2012-12-28 Voice recognition method and system

Publications (2)

Publication Number Publication Date
CN103021412A (en) 2013-04-03
CN103021412B (en) 2014-12-10

Family

ID=47969943

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210584746.2A Active CN103021412B (en) 2012-12-28 2012-12-28 Voice recognition method and system

Country Status (1)

Country Link
CN (1) CN103021412B (en)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105374356B (en) * 2014-08-29 2019-07-30 株式会社理光 Audio recognition method, speech assessment method, speech recognition system and speech assessment system
CN105469801B (en) * 2014-09-11 2019-07-12 阿里巴巴集团控股有限公司 A kind of method and device thereof for repairing input voice
CN105786438A (en) * 2014-12-25 2016-07-20 联想(北京)有限公司 Electronic system
CN105786204A (en) * 2014-12-26 2016-07-20 联想(北京)有限公司 Information processing method and electronic equipment
JP6128146B2 (en) * 2015-02-24 2017-05-17 カシオ計算機株式会社 Voice search device, voice search method and program
CN105182763A (en) * 2015-08-11 2015-12-23 中山大学 Intelligent remote controller based on voice recognition and realization method thereof
CN105206260B (en) * 2015-08-31 2016-09-28 努比亚技术有限公司 A kind of terminal speech broadcasting method, device and terminal speech operational approach
CN106782546A (en) * 2015-11-17 2017-05-31 深圳市北科瑞声科技有限公司 Audio recognition method and device
CN105679319B (en) * 2015-12-29 2019-09-03 百度在线网络技术(北京)有限公司 Voice recognition processing method and device
CN106297797B (en) * 2016-07-26 2019-05-31 百度在线网络技术(北京)有限公司 Method for correcting error of voice identification result and device
CN106328145B (en) * 2016-08-19 2019-10-11 北京云知声信息技术有限公司 Voice modification method and device
CN107220235B (en) * 2017-05-23 2021-01-22 北京百度网讯科技有限公司 Speech recognition error correction method and device based on artificial intelligence and storage medium
CN108182001B (en) * 2017-12-28 2021-06-08 科大讯飞股份有限公司 Input error correction method and device, storage medium and electronic equipment
CN110647785B (en) * 2018-06-27 2022-09-23 珠海金山办公软件有限公司 Method and device for identifying accuracy of input text and electronic equipment
JP6718182B1 (en) * 2019-05-08 2020-07-08 株式会社インタラクティブソリューションズ Wrong conversion dictionary creation system
CN110764647B (en) * 2019-10-21 2023-10-31 科大讯飞股份有限公司 Input error correction method, input error correction device, electronic equipment and storage medium
CN112820276B (en) * 2020-12-21 2023-05-16 北京捷通华声科技股份有限公司 Speech processing method, device, computer readable storage medium and processor
CN112669825A (en) * 2020-12-24 2021-04-16 杭州中科先进技术研究院有限公司 Speech recognition system and method automatically trained through speech synthesis method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1282072A (en) * 1999-07-27 2001-01-31 国际商业机器公司 Error correcting method for voice identification result and voice identification system
CN1979638A (en) * 2005-12-02 2007-06-13 中国科学院自动化研究所 Method for correcting error of voice identification result
CN101295293A (en) * 2007-04-29 2008-10-29 摩托罗拉公司 Automatic error correction method for input character string of ideographic character

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7941316B2 (en) * 2005-10-28 2011-05-10 Microsoft Corporation Combined speech and alternate input modality to a mobile device

Also Published As

Publication number Publication date
CN103021412A (en) 2013-04-03

Similar Documents

Publication Publication Date Title
CN103021412B (en) Voice recognition method and system
CN103000176B (en) Speech recognition method and system
CN109313896B (en) Extensible dynamic class language modeling method, system for generating an utterance transcription, computer-readable medium
TWI266280B (en) Multimodal disambiguation of speech recognition
US10134388B1 (en) Word generation for speech recognition
CN102682763B (en) Method, device and terminal for correcting named entity vocabularies in voice input text
US6415258B1 (en) Background audio recovery system
EP3091535B1 (en) Multi-modal input on an electronic device
JP7200405B2 (en) Context Bias for Speech Recognition
WO2020001458A1 (en) Speech recognition method, device, and system
JP5098613B2 (en) Speech recognition apparatus and computer program
US20100281435A1 (en) System and method for multimodal interaction using robust gesture processing
KR20170063037A (en) Apparatus and method for speech recognition
US20120016671A1 (en) Tool and method for enhanced human machine collaboration for rapid and accurate transcriptions
KR20170106951A (en) Method and device for performing speech recognition using a grammar model
CN101415259A (en) System and method for searching information of embedded equipment based on double-language voice enquiry
CN101847405A (en) Speech recognition equipment and method, language model generation device and method and program
CN104157285A (en) Voice recognition method and device, and electronic equipment
KR20140028174A (en) Method for recognizing speech and electronic device thereof
Vertanen et al. Parakeet: A continuous speech recognition system for mobile touch-screen devices
JP5753769B2 (en) Voice data retrieval system and program therefor
JP2021529337A (en) Multi-person dialogue recording / output method using voice recognition technology and device for this purpose
CN110287364B (en) Voice search method, system, device and computer readable storage medium
CN111462748A (en) Voice recognition processing method and device, electronic equipment and storage medium
CN103903618A (en) Voice input method and electronic device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CP03 Change of name, title or address
CP03 Change of name, title or address

Address after: No. 666 Wangjiang West Road, Hefei High-tech Zone, Anhui 230031

Patentee after: iFlytek Co., Ltd.

Address before: No. 616 Huangshan Road, High-tech Development Zone, Hefei, Anhui 230088

Patentee before: Anhui USTC iFLYTEK Co., Ltd.