CN103971681A - Voice recognition method and system - Google Patents

Voice recognition method and system

Info

Publication number
CN103971681A
CN103971681A
Authority
CN
China
Prior art keywords
model
voice
data
voice data
client
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410168436.1A
Other languages
Chinese (zh)
Inventor
穆向禹
彭守业
刘思成
贾磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201410168436.1A
Publication of CN103971681A
Legal status: Pending

Landscapes

  • Telephonic Communication Services (AREA)

Abstract

An embodiment of the invention provides a voice recognition method comprising: collecting first audio data; and performing voice recognition on the first audio data using a first model and a second model to obtain a voice recognition result, wherein the first model is used to recognize second audio data played by a client that is contained in the first audio data, and the second model is used to recognize third audio data contained in the first audio data other than the second audio data played by the client. An embodiment of the invention further provides a voice recognition system. The method and system can increase the success rate of voice wake-up in a voice recognition system.

Description

Voice recognition method and system
[Technical field]
The present invention relates to speech recognition technology, and in particular to a voice recognition method and system.
[Background]
Speech recognition technology has made marked progress in recent years and is entering fields such as industry, home appliances, communications, automotive electronics, medical care, home services, and consumer electronics. For example, speech recognition is often applied to navigation: because it is inconvenient for a user to operate a navigation client by hand while driving, voice input is a good interaction mode. While in a listening state, the navigation client monitors the user's voice commands and performs voice recognition on them to obtain a recognition result; when the result meets a wake-up condition, the client's voice navigation function is woken up and traffic information is provided to the user in audio form.
However, the navigation client sometimes needs to play traffic information frequently, so the voice commands the client hears from the user are often mixed with audio data played by the client itself. The user's voice command then fails to wake the navigation client effectively, and the probability of a failed wake-up is high.
[Summary]
In view of this, embodiments of the present invention provide a voice recognition method and system that can improve the success rate of voice wake-up in a voice recognition system.
An embodiment of the present invention provides a voice recognition method, comprising:
collecting first audio data; and
performing voice recognition on the first audio data using a first model and a second model to obtain a voice recognition result;
wherein the first model is used to recognize second audio data played by a client that is contained in the first audio data, and the second model is used to recognize third audio data contained in the first audio data other than the second audio data played by the client.
In the above method, before performing voice recognition on the first audio data using the first model and the second model to obtain the voice recognition result, the method further comprises:
obtaining the text corresponding to the second audio data played by the client;
segmenting the text to obtain M characters, M being an integer greater than or equal to 2;
clustering or screening the M characters to obtain N characters, N being a positive integer less than or equal to M; and
obtaining the first model from the N characters.
In the above method, the third audio data is the user's voice command; the first model is a voice rejection model, and the second model is a voice wake-up model.
In the above method, performing voice recognition on the first audio data using the first model and the second model to obtain the voice recognition result comprises:
performing echo cancellation on the collected first audio data; and
performing voice recognition, using the first model and the second model, on the first audio data obtained after echo cancellation to obtain the voice recognition result.
In the above method, performing echo cancellation on the collected first audio data comprises:
obtaining the start position of the third audio data relative to the second audio data;
converting the third audio data into first frequency-domain data, and converting the second audio data after the start position into second frequency-domain data; and
filtering the first frequency-domain data according to the second frequency-domain data.
An embodiment of the present invention also provides a voice recognition system, comprising:
a data input unit, for collecting first audio data; and
a data recognition unit, for performing voice recognition on the first audio data using a first model and a second model to obtain a voice recognition result;
wherein the first model is used to recognize second audio data played by a client that is contained in the first audio data, and the second model is used to recognize third audio data contained in the first audio data other than the second audio data played by the client.
In the above system, the system further comprises:
a model generation unit, for obtaining the text corresponding to the second audio data played by the client; segmenting the text to obtain M characters, M being an integer greater than or equal to 2; clustering or screening the M characters to obtain N characters, N being a positive integer less than or equal to M; and obtaining the first model from the N characters.
In the above system, the third audio data is the user's voice command; the first model is a voice rejection model, and the second model is a voice wake-up model.
In the above system, the data recognition unit is specifically for:
performing echo cancellation on the collected first audio data; and
performing voice recognition, using the first model and the second model, on the first audio data obtained after echo cancellation to obtain the voice recognition result.
In the above system, the data recognition unit performing echo cancellation on the collected first audio data specifically comprises:
obtaining the start position of the third audio data relative to the second audio data;
converting the third audio data into first frequency-domain data, and converting the second audio data after the start position into second frequency-domain data; and
filtering the first frequency-domain data according to the second frequency-domain data.
As can be seen from the above technical solutions, embodiments of the present invention have the following beneficial effect:
The client uses the first model to recognize, within the collected audio data, the audio data played by the client itself. A model dedicated to recognizing the client's own playback can thus be used to identify the interfering audio, which reduces the influence of the playback's recognition result on the final recognition result and reduces the probability that the playback's recognition result is used to decide whether to wake up. This improves the success rate of voice wake-up in the voice recognition system.
[Brief description of the drawings]
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly described below. The drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of the navigation client used by the technical solution provided by an embodiment of the present invention;
Fig. 2 is a flow diagram of the voice recognition method provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of the first model provided by an embodiment of the present invention;
Fig. 4 is an example of a client using the first model and the second model to perform voice recognition, provided by an embodiment of the present invention;
Fig. 5 is a functional block diagram of the voice recognition system provided by an embodiment of the present invention.
[Detailed description]
For a better understanding of the technical solutions of the present invention, the embodiments of the present invention are described in detail below with reference to the drawings.
It should be clear that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
The terms used in the embodiments of the present invention are for describing specific embodiments only and are not intended to limit the present invention. The singular forms "a", "said" and "the" used in the embodiments of the present invention and the appended claims are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in the embodiments of the present invention to describe various audio data and frequency-domain data, the audio data and frequency-domain data should not be limited by these terms. These terms are only used to distinguish audio data and frequency-domain data from one another.
Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", "in response to determining" or "in response to detecting". Similarly, the phrase "if it is determined" or "if (a stated condition or event) is detected" may be interpreted as "when it is determined", "in response to determining", "when (the stated condition or event) is detected" or "in response to detecting (the stated condition or event)".
Taking a navigation client as an example, the navigation client used by the technical solution provided by an embodiment of the present invention is shown in Fig. 1. It consists mainly of a voice recognition system and a voice navigation system. The method and system provided by the embodiment are implemented in the voice recognition system of the navigation client and are mainly used to wake up the voice navigation system, so that the voice navigation system can provide a voice navigation service to the user and realize the client's voice navigation function.
In the embodiments of the present invention, besides a navigation client, the client may be any client that provides information in audio form to the user through voice interaction. The client may reside on a navigation terminal, a smart TV, or a user device; the user device may include a personal computer (PC), a laptop, a mobile phone, a tablet, and so on.
An embodiment of the present invention provides a voice recognition method. Fig. 2 is a flow diagram of the method provided by the embodiment; as shown in the figure, the method comprises the following steps:
S201: collect first audio data.
Specifically, the client collects the first audio data.
Preferably, the first audio data may include second audio data played by the client itself and third audio data other than the second audio data played by the client.
Preferably, if the client is a navigation client, the second audio data played by the client itself may be audio data generated by text-to-speech (TTS), such as traffic information played by the client. For example, "There is a speed camera 500 meters ahead" played by the client may be such second audio data. As another example, the third audio data other than the second audio data played by the client may be the voice command the user sends when the user wants to use the voice navigation function; this voice command is used to wake up the client's voice navigation function.
Preferably, the client may use an audio collection device to collect the first audio data. For example, when the client resides on a mobile phone or tablet, the client may use a microphone to collect the first audio data.
S202: perform voice recognition on the first audio data using a first model and a second model to obtain a voice recognition result, wherein the first model is used to recognize the second audio data played by the client that is contained in the first audio data, and the second model is used to recognize the third audio data contained in the first audio data other than the second audio data played by the client.
Specifically, after collecting the first audio data, the client performs voice recognition on it using the first model and the second model to obtain the voice recognition result.
Preferably, before the client performs voice recognition on the first audio data using the first model and the second model to obtain the voice recognition result, the first model and the second model need to be set up in the client in advance. The first model may be a voice rejection model, which is what the embodiments of the present invention set up in the client; the second model may be a voice wake-up model, which is already set up in the client in the prior art.
For example, if the first model is a voice rejection model, the method of generating the first model in the client may be as follows:
First, obtain the text corresponding to the second audio data played by the client. For example, if the client is a navigation client, when it plays second audio data it first determines, from a preset announcement text library, the text to announce, then converts the text into second audio data using TTS, and finally plays the second audio data through a loudspeaker. The client in the embodiments may keep a playback history, so it can count the number of times each piece of second audio data has been played and then obtain the text corresponding to the second audio data whose play count exceeds a preset threshold. There is no need to obtain all texts in the announcement library; obtaining only the texts of the most frequently played second audio data reduces the amount of data processed when generating the first model. For example, if "There is a speed camera 500 meters ahead" and "Turn right on the road ahead" are played frequently, the texts corresponding to these two pieces of second audio data can be obtained.
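The play-count filtering step above can be sketched as follows. This is a minimal illustration, not code from the patent: the function name, the history format (one text string per playback event), and the threshold value are all assumptions.

```python
from collections import Counter

def frequent_prompt_texts(play_history, min_plays):
    """Count how often each TTS prompt text was played and keep the
    texts whose play count reaches the preset threshold."""
    counts = Counter(play_history)
    return sorted(text for text, n in counts.items() if n >= min_plays)

# Example: two prompts played often, one played only once.
history = (["speed camera 500m ahead"] * 3
           + ["turn right ahead"] * 2
           + ["rerouting"])
print(frequent_prompt_texts(history, min_plays=2))
# ['speed camera 500m ahead', 'turn right ahead']
```

Only the frequently played texts are then passed on to the segmentation step, keeping the rejection model small.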
Next, segment the obtained texts to obtain M characters, M being an integer greater than or equal to 2. For example, after obtaining the texts corresponding to the second audio data played by the client, segment each text so that it is cut into R characters, each character being an independent word; then remove the digits among the R characters and deduplicate them to obtain M characters. Deduplication merges identical characters among the R characters. R is an integer greater than or equal to 2, and M is an integer greater than or equal to 2 and less than or equal to R.
For example, segmenting the texts "There is a speed camera 500 meters ahead" (前方道路500米处有超速摄像头) and "Turn right on the road ahead" (前方道路右转) yields the characters: 前, 方, 道, 路, 500, 米, 处, 有, 超, 速, 摄, 像, 头, 前, 方, 道, 路, 右, 转. Preferably, the digits "500" may also be converted into the corresponding Chinese characters 五百. Keeping only one copy of each repeated character, the final characters are: 前, 方, 道, 路, 五, 百, 米, 处, 有, 超, 速, 摄, 像, 头, 右, 转.
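The segmentation and deduplication step above can be sketched as follows. This is an illustrative reading of the embodiment, not the patent's implementation: the per-character digit map is a simplified stand-in for full digit-to-Chinese conversion, and the toy example uses Latin letters in place of Chinese characters.

```python
def segment_and_dedup(texts, digit_map):
    """Cut each prompt text into single characters, map digits through
    a (simplified, per-character) replacement table, and keep only the
    first occurrence of each character -- the R-to-M dedup step."""
    seen, kept = set(), []
    for text in texts:
        for ch in text:
            ch = digit_map.get(ch, ch)
            if ch and ch not in seen:  # empty mapping drops the digit
                seen.add(ch)
                kept.append(ch)
    return kept

# Toy example; "5" maps to a word, "0" is dropped.
chars = segment_and_dedup(["road500m", "roadright"], {"5": "five", "0": ""})
print(chars)  # ['r', 'o', 'a', 'd', 'five', 'm', 'i', 'g', 'h', 't']
```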
Then, cluster or screen the M characters to obtain N characters, N being a positive integer less than or equal to M. Preferably, clustering the M characters may work as follows: each individual character can serve as a category, and to reduce the number of categories, similar categories are merged. For example, the pinyin of each of the M characters can be obtained and the similarity of two characters computed from their pinyin; two characters whose similarity exceeds a preset threshold are merged into one character, e.g. one of the two characters is chosen arbitrarily, the chosen character is kept, and the other is removed. Preferably, screening the M characters may work as follows: keep every other one of the M characters and screen out the rest. For example, if the M characters are 前, 方, 道, 路, 五, 百, 米, 处, 有, 超, 速, 摄, 像, 头, 右, 转, screening them yields: 前, 道, 五, 米, 有, 速, 像, 右. The purpose of clustering or screening the M characters is to reduce the number of characters.
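Both reduction variants above can be sketched as follows. The patent does not specify the pinyin similarity measure; `SequenceMatcher` over romanised pinyin strings and the threshold value are assumptions, and the small pinyin table is illustrative.

```python
from difflib import SequenceMatcher

def screen_characters(chars):
    """Screening variant: keep every other character."""
    return chars[::2]

def cluster_characters(chars, pinyin_of, threshold=0.75):
    """Clustering variant: merge characters whose pinyin is similar,
    keeping the first character of each merged group."""
    kept = []
    for ch in chars:
        similar = any(
            SequenceMatcher(None, pinyin_of[ch], pinyin_of[k]).ratio() > threshold
            for k in kept
        )
        if not similar:
            kept.append(ch)
    return kept

print(screen_characters(list("abcdefgh")))  # ['a', 'c', 'e', 'g']
# 道 and 到 share the pinyin "dao", so they merge into one category.
pinyin = {"道": "dao", "到": "dao", "米": "mi"}
print(cluster_characters(["道", "到", "米"], pinyin))  # ['道', '米']
```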
Finally, obtain the first model from the N characters. It should be understood that characters can be related to one another: one character can transition to another, there is a transition probability between every two characters, and each character's transition probabilities to other characters differ. Therefore, from the various ordered combinations of the N characters, at least one character string can be obtained, each character string containing at least two characters. The first model can then be obtained from the at least one character string; the first model may contain all character strings obtained from the N characters, or only the character strings with the largest weights among them. The weight of a character string may equal the product of the transition probabilities between every two adjacent characters in the string; the transition probabilities can be obtained from a preset acoustic model, which is a probability model that may include the probabilities of initials and finals occurring together, the transition probabilities between characters, and so on.
For example, Fig. 3 is a schematic diagram of the first model provided by an embodiment of the present invention. As shown in the figure, the 14 obtained characters are 前, 方, 面, 道, 路, 有, 左, 右, 直, 行, 摄, 像, 头, 转, and 4 character strings can be obtained from them as shown in Fig. 3: "go straight on the road ahead" (前方道路直行), "there is a camera ahead" (前面有摄像头), "turn left ahead" (前方左转), and "turn right ahead" (前方右转).
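The string-weighting scheme described above (weight equals the product of adjacent transition probabilities, keeping only the heaviest strings) can be sketched as follows. The transition table is invented for illustration; the patent's transition probabilities come from a preset acoustic model.

```python
def string_weight(s, transitions):
    """Weight of a character string: the product of the transition
    probabilities between each pair of adjacent characters."""
    w = 1.0
    for a, b in zip(s, s[1:]):
        w *= transitions.get((a, b), 0.0)
    return w

def build_first_model(candidates, transitions, keep_top):
    """Keep the candidate strings with the largest weights (the patent
    allows keeping all strings or only the heaviest ones)."""
    ranked = sorted(candidates, key=lambda s: string_weight(s, transitions),
                    reverse=True)
    return ranked[:keep_top]

# Made-up transition probabilities for two candidate strings.
trans = {("前", "方"): 0.9, ("方", "右"): 0.5, ("右", "转"): 0.8,
         ("方", "左"): 0.3, ("左", "转"): 0.8}
print(build_first_model(["前方右转", "前方左转"], trans, keep_top=1))
# ['前方右转']  (weight 0.36 vs 0.216)
```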
Preferably, the client may first perform echo cancellation on the collected first audio data, and then use the first model and the second model to perform voice recognition on the first audio data obtained after echo cancellation, to obtain the voice recognition result. In this way, before performing voice recognition on the first audio data, the client can use echo cancellation to filter out part of the second audio data played by the client.
For example, the method by which the client performs echo cancellation on the collected first audio data may be as follows:
First, the client obtains the start position of the third audio data relative to the second audio data. Since the client plays the second audio data to the user, it has access to the audio it plays. For example, the client may use a correlation algorithm to correlate the collected first audio data with the second audio data played by the client, to obtain the start position, relative to the second audio data, of the third audio data contained in the first audio data.
Then, the client performs echo cancellation on the collected first audio data according to the obtained start position. For example, the client converts the collected first audio data into first frequency-domain data and converts the second audio data after the start position into second frequency-domain data. The client feeds the first and second frequency-domain data into a filter, which filters the first frequency-domain data according to the second frequency-domain data, thereby using echo cancellation to filter out, from the collected first audio data, the second audio data played by the client.
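The alignment-then-subtract idea above can be sketched as follows. This is a deliberately simplified stand-in: the lag is found by brute-force cross-correlation, and the cancellation is a fixed-gain time-domain subtraction rather than the adaptive frequency-domain filtering the embodiment describes.

```python
def best_lag(mic, ref):
    """Estimate where the played-back reference starts inside the
    microphone capture by maximising the cross-correlation."""
    scores = [
        (sum(mic[lag + i] * r for i, r in enumerate(ref)), lag)
        for lag in range(len(mic) - len(ref) + 1)
    ]
    return max(scores)[1]

def cancel_echo(mic, ref, gain=1.0):
    """Subtract the gain-scaled reference at the estimated lag.
    A real canceller would adapt the filter per frequency bin."""
    lag = best_lag(mic, ref)
    out = list(mic)
    for i, r in enumerate(ref):
        out[lag + i] -= gain * r
    return out

mic = [0.0, 0.0, 1.0, 2.0, 3.0, 0.5]  # capture: playback starts at index 2
ref = [1.0, 2.0, 3.0]                  # what the client played
print(cancel_echo(mic, ref))  # [0.0, 0.0, 0.0, 0.0, 0.0, 0.5]
```

After cancellation, only the residual (here the trailing 0.5, standing in for the user's voice) remains for recognition.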
It should be noted that performing echo cancellation on the first audio data is a preferred embodiment; the client may also skip echo cancellation and perform voice recognition on the first audio data directly.
For example, the method by which the client performs voice recognition on the first audio data using the first model and the second model to obtain the voice recognition result may be as follows. Fig. 4 is an example, provided by an embodiment of the present invention, of a client using the first model and the second model to perform voice recognition. As shown in the figure, the client performs voice recognition on the first audio data using the first model to obtain a first recognition result. Because the first model is built from the text corresponding to the second audio data played by the client, when it processes first audio data that contains second audio data it can recognize the second audio data played by the client. As shown in Fig. 4, because the characters in the first model have been clustered or screened, the first recognition result contains only part of the characters in the text corresponding to the second audio data, so the recognition rate is low; the recognition rate equals the ratio of the number of characters in the recognition result to the total number of characters in the audio data, and the weight of the first recognition result is proportional to the recognition rate, so the weight of the first recognition result is low. Meanwhile, the client performs voice recognition on the first audio data using the second model to obtain a second recognition result. Because the second model is a voice wake-up model containing at least one wake keyword (such as "Baidu Navigation" in Fig. 4), performing voice recognition on the first audio data with the second model yields the second recognition result corresponding to the third audio data (e.g. the user's voice command) contained in the first audio data. The weight of the second recognition result is compared with the weight of the first recognition result, and the recognition result with the larger weight is taken as the final voice recognition result.
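The weight comparison and wake decision described above can be sketched as follows. The (text, weight) pair representation and the example weights are illustrative assumptions.

```python
def final_result(scored_results):
    """Pick the recognition result with the largest weight from the
    per-model (text, weight) pairs."""
    return max(scored_results, key=lambda r: r[1])[0]

def should_wake(text, wake_keywords):
    """Wake up only if the winning result contains a preset wake keyword."""
    return any(kw in text for kw in wake_keywords)

# The rejection model scores a low-weight partial match of the playback;
# the wake-up model scores the user's command higher.
results = [("前 路 五 米", 0.2), ("百度导航", 0.9)]
winner = final_result(results)
print(winner, should_wake(winner, ["百度导航"]))  # 百度导航 True
```

Because clustering or screening depresses the rejection model's weight, the wake-up model's result usually wins the comparison when the user actually speaks a wake keyword.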
Optionally, after obtaining the final voice recognition result, the client may check whether it contains the preset wake keyword. If it does, the client wakes up its voice navigation function so that the client can provide a voice navigation service to the user, realizing the client's voice navigation function. Otherwise, if the result does not contain the wake keyword, the client does not wake up the voice navigation function.
It should be noted that in the prior art the first model is a generic rejection model, not a rejection model built specifically for the second audio data played by the client. In practice, when the client performs voice recognition on the first audio data it collects using a generic rejection model and a voice wake-up model respectively, the weight of the rejection model's recognition result is in most cases greater than or equal to that of the wake-up model's. The client then takes the rejection model's result as the final recognition result and checks whether it contains the preset wake keyword; since a generic rejection model usually does not contain the user's preset wake keyword, waking up the voice navigation function fails. With the method above, an embodiment of the present invention builds the first model from the text corresponding to the second audio data played by the client and uses it as the rejection model for voice recognition on the first audio data, while clustering or screening the characters to lower the first model's recognition rate on the second audio data contained in the first audio data. This lowers the weight of the recognition result obtained with the first model, so the client is more likely to output the second recognition result, obtained with the second model, as the final recognition result. Because the second recognition result is obtained from the user's voice command, it generally contains the wake keyword, so the voice navigation function can be woken up successfully. This improves the success rate of voice wake-up under interference from audio data played by the client.
In the embodiments of the present invention, to wake up the voice navigation function of the navigation client, the client needs to recognize a preset wake keyword in the collected audio data. The model used to recognize wake keywords in collected audio data is the voice wake-up model above; it may contain at least one preset wake keyword, and if the collected audio data hits one of the wake keywords in the voice wake-up model, the voice navigation function can be woken up successfully. For audio data other than the user's voice command, some non-wake keywords can be defined; the model used to recognize non-wake keywords in collected audio data is the rejection model above, which may contain at least one preset non-wake keyword. A non-wake keyword hit by the collected audio data cannot wake up the voice navigation function.
The embodiment of the present invention further provides an apparatus embodiment implementing the steps and methods of the above method embodiment.
Please refer to Fig. 5, a functional block diagram of the speech recognition system provided by the embodiment of the present invention. As shown in the figure, the system comprises:
a data input unit 501, configured to collect first audio data; and
a data identification unit 502, configured to perform speech recognition on the first audio data using a first model and a second model, to obtain a speech recognition result;
wherein the first model is used to recognize second audio data, played by the client, contained in the first audio data, and the second model is used to recognize third audio data, other than the second audio data played by the client, contained in the first audio data.
Preferably, the system further comprises:
a model generation unit 503, configured to: obtain text information corresponding to the second audio data played by the client; segment the text information to obtain M characters, M being an integer greater than or equal to 2; perform clustering or screening on the M characters to obtain N characters, N being a positive integer less than or equal to M; and obtain the first model according to the N characters.
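The three steps performed by the model generation unit 503 (cutting, clustering or screening, model construction) can be sketched as follows. The frequency-based screening rule and the set of characters standing in for "the first model" are assumptions; the embodiment allows either clustering or screening without fixing the criterion.

```python
from collections import Counter

def build_first_model(played_text, keep_n):
    """Segment the text of the client-played audio into M characters,
    then screen them down to N characters (N <= M) and build the
    first (rejection) model from them.

    The most-common screening rule is illustrative only; the embodiment
    allows either clustering or screening without fixing the criterion.
    """
    chars = list(played_text.replace(" ", ""))   # cutting: M characters
    m = len(chars)
    assert m >= 2                                # M is an integer >= 2
    kept = [c for c, _ in Counter(chars).most_common(keep_n)]
    assert 1 <= len(kept) <= m                   # N is a positive integer <= M
    return set(kept)                             # stand-in for the first model

model = build_first_model("turn left in two hundred meters", keep_n=8)
assert len(model) <= 8
```
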
Preferably, the third audio data is a voice instruction of the user; the first model is a voice rejection model, and the second model is a voice wake-up model.
Preferably, the data identification unit 502 is specifically configured to: perform echo cancellation on the collected first audio data; and perform speech recognition, using the first model and the second model, on the first audio data obtained after echo cancellation, to obtain the speech recognition result.
Preferably, the data identification unit 502 performs echo cancellation on the collected first audio data by: obtaining a reference position of the third audio data relative to the second audio data; converting the third audio data into first frequency-domain data, and converting the second audio data after the reference position into second frequency-domain data; and filtering the first frequency-domain data according to the second frequency-domain data.
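One possible realization of the frequency-domain filtering described here is magnitude spectral subtraction, sketched below. This is an assumption: the embodiment only states that the first frequency-domain data is filtered according to the second, without naming the filter. The naive DFT stands in for an FFT, and the reference-position alignment is assumed to have been applied before framing.

```python
import cmath

def dft(frame):
    """Naive discrete Fourier transform (stand-in for an FFT)."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spec):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def echo_suppress(mic_frame, ref_frame, alpha=1.0):
    """Filter the collected frame's spectrum according to the playback
    reference's spectrum: subtract the reference magnitude while keeping
    the microphone phase (magnitude spectral subtraction)."""
    out = []
    for m, r in zip(dft(mic_frame), dft(ref_frame)):
        mag = max(abs(m) - alpha * abs(r), 0.0)
        out.append(cmath.rect(mag, cmath.phase(m)))
    return idft(out)

# A frame that is pure playback echo is suppressed to (near) zero.
echo = [0.5, -0.2, 0.3, 0.1]
assert max(abs(v) for v in echo_suppress(echo, echo)) < 1e-9
```
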
Since each unit in this embodiment can execute the method shown in Fig. 2, for parts not described in detail here, reference may be made to the related description of Fig. 2.
The technical scheme of the embodiment of the present invention has the following beneficial effects:
The client uses the first model, built to recognize the audio data played by the client, to identify the interfering audio in the collected data. The interference of the recognition result corresponding to the client-played audio with the final recognition result is thereby reduced, as is the probability that that result is used to decide whether to wake up, improving the voice wake-up success rate of the speech recognition system.
The foregoing is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A speech recognition method, characterized in that the method comprises:
collecting first audio data;
performing speech recognition on the first audio data using a first model and a second model, to obtain a speech recognition result;
wherein the first model is used to recognize second audio data, played by a client, contained in the first audio data, and the second model is used to recognize third audio data, other than the second audio data played by the client, contained in the first audio data.
2. The method according to claim 1, characterized in that, before performing speech recognition on the first audio data using the first model and the second model to obtain the speech recognition result, the method further comprises:
obtaining text information corresponding to the second audio data played by the client;
segmenting the text information to obtain M characters, wherein M is an integer greater than or equal to 2;
performing clustering or screening on the M characters to obtain N characters, wherein N is a positive integer less than or equal to M;
obtaining the first model according to the N characters.
3. The method according to claim 1 or 2, characterized in that
the third audio data is a voice instruction of a user;
the first model is a voice rejection model, and the second model is a voice wake-up model.
4. The method according to claim 1 or 2, characterized in that performing speech recognition on the first audio data using the first model and the second model, to obtain the speech recognition result, comprises:
performing echo cancellation on the collected first audio data;
performing speech recognition, using the first model and the second model, on the first audio data obtained after echo cancellation, to obtain the speech recognition result.
5. The method according to claim 4, characterized in that performing echo cancellation on the collected first audio data comprises:
obtaining a reference position of the third audio data relative to the second audio data;
converting the third audio data into first frequency-domain data, and converting the second audio data after the reference position into second frequency-domain data;
filtering the first frequency-domain data according to the second frequency-domain data.
6. A speech recognition system, characterized in that the system comprises:
a data input unit, configured to collect first audio data;
a data identification unit, configured to perform speech recognition on the first audio data using a first model and a second model, to obtain a speech recognition result;
wherein the first model is used to recognize second audio data, played by a client, contained in the first audio data, and the second model is used to recognize third audio data, other than the second audio data played by the client, contained in the first audio data.
7. The system according to claim 6, characterized in that the system further comprises:
a model generation unit, configured to: obtain text information corresponding to the second audio data played by the client; segment the text information to obtain M characters, wherein M is an integer greater than or equal to 2; perform clustering or screening on the M characters to obtain N characters, wherein N is a positive integer less than or equal to M; and obtain the first model according to the N characters.
8. The system according to claim 6 or 7, characterized in that
the third audio data is a voice instruction of a user;
the first model is a voice rejection model, and the second model is a voice wake-up model.
9. The system according to claim 6 or 7, characterized in that the data identification unit is specifically configured to:
perform echo cancellation on the collected first audio data;
perform speech recognition, using the first model and the second model, on the first audio data obtained after echo cancellation, to obtain the speech recognition result.
10. The system according to claim 9, characterized in that the data identification unit performs echo cancellation on the collected first audio data by:
obtaining a reference position of the third audio data relative to the second audio data;
converting the third audio data into first frequency-domain data, and converting the second audio data after the reference position into second frequency-domain data;
filtering the first frequency-domain data according to the second frequency-domain data.
CN201410168436.1A 2014-04-24 2014-04-24 Voice recognition method and system Pending CN103971681A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410168436.1A CN103971681A (en) 2014-04-24 2014-04-24 Voice recognition method and system


Publications (1)

Publication Number Publication Date
CN103971681A true CN103971681A (en) 2014-08-06

Family

ID=51241099

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410168436.1A Pending CN103971681A (en) 2014-04-24 2014-04-24 Voice recognition method and system

Country Status (1)

Country Link
CN (1) CN103971681A (en)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1235332A (en) * 1998-04-02 1999-11-17 日本电气株式会社 Speech recognition noise removing system and speech recognition noise removing method
CN1337670A (en) * 2001-09-28 2002-02-27 北京安可尔通讯技术有限公司 Fast voice identifying method for Chinese phrase of specific person
CN1397062A (en) * 2000-12-29 2003-02-12 祖美和 Voice-controlled television set and control method thereof
CN1542734A * 2003-05-02 2004-11-03 Voice recognition system and method
CN101238511A (en) * 2005-08-11 2008-08-06 旭化成株式会社 Sound source separating device, speech recognizing device, portable telephone, and sound source separating method, and program
US20080270131A1 (en) * 2007-04-27 2008-10-30 Takashi Fukuda Method, preprocessor, speech recognition system, and program product for extracting target speech by removing noise
US20100254539A1 (en) * 2009-04-07 2010-10-07 Samsung Electronics Co., Ltd. Apparatus and method for extracting target sound from mixed source sound
CN101964192A (en) * 2009-07-22 2011-02-02 索尼公司 Sound processing device, sound processing method, and program
CN102097099A (en) * 2009-12-11 2011-06-15 冲电气工业株式会社 Source sound separator with spectrum analysis through linear combination and method therefor
CN102111468A (en) * 2010-12-20 2011-06-29 上海华勤通讯技术有限公司 Denoising calling mobile phone and method thereof
US20110231185A1 (en) * 2008-06-09 2011-09-22 Kleffner Matthew D Method and apparatus for blind signal recovery in noisy, reverberant environments
CN102918592A (en) * 2010-05-25 2013-02-06 日本电气株式会社 Signal processing method, information processing device, and signal processing program
CN103366740A (en) * 2012-03-27 2013-10-23 联想(北京)有限公司 Voice command recognition method and voice command recognition device

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105469806A (en) * 2014-09-12 2016-04-06 联想(北京)有限公司 Sound processing method, device and system
CN105469806B (en) * 2014-09-12 2020-02-21 联想(北京)有限公司 Sound processing method, device and system
CN106796786A (en) * 2014-09-30 2017-05-31 三菱电机株式会社 Speech recognition system
CN104535071B (en) * 2014-12-05 2018-12-14 百度在线网络技术(北京)有限公司 A kind of phonetic navigation method and device
CN104535071A (en) * 2014-12-05 2015-04-22 百度在线网络技术(北京)有限公司 Voice navigation method and device
WO2016127550A1 (en) * 2015-02-13 2016-08-18 百度在线网络技术(北京)有限公司 Method and device for human-machine voice interaction
CN105096939A (en) * 2015-07-08 2015-11-25 百度在线网络技术(北京)有限公司 Voice wake-up method and device
CN105427866A (en) * 2015-10-29 2016-03-23 北京云知声信息技术有限公司 Voice processing method and device, and pickup circuit
WO2017071183A1 (en) * 2015-10-29 2017-05-04 北京云知声信息技术有限公司 Voice processing method and device, and pickup circuit
CN106782547A (en) * 2015-11-23 2017-05-31 芋头科技(杭州)有限公司 A kind of robot semantics recognition system based on speech recognition
CN105575386A (en) * 2015-12-18 2016-05-11 百度在线网络技术(北京)有限公司 Method and device for voice recognition
CN105575386B (en) * 2015-12-18 2019-07-30 百度在线网络技术(北京)有限公司 Audio recognition method and device
CN105681579A (en) * 2016-03-11 2016-06-15 广东欧珀移动通信有限公司 Terminal, and screen control method and device for terminal in navigation state
CN108062213A (en) * 2017-10-20 2018-05-22 沈阳美行科技有限公司 A kind of methods of exhibiting and device at quick search interface
CN108090112A (en) * 2017-10-20 2018-05-29 沈阳美行科技有限公司 The exchange method and device of a kind of search interface
CN108847222A (en) * 2018-06-19 2018-11-20 Oppo广东移动通信有限公司 Speech recognition modeling generation method, device, storage medium and electronic equipment
CN109065036A (en) * 2018-08-30 2018-12-21 出门问问信息科技有限公司 Method, apparatus, electronic equipment and the computer readable storage medium of speech recognition
CN110895936A (en) * 2018-09-13 2020-03-20 珠海格力电器股份有限公司 Voice processing method and device based on household appliance
CN109489803A (en) * 2018-10-17 2019-03-19 浙江大学医学院附属邵逸夫医院 A kind of environmental noise intellectual analysis and alarm set
CN109489803B (en) * 2018-10-17 2020-09-01 浙江大学医学院附属邵逸夫医院 Intelligent environmental noise analysis and reminding device
CN109801491A (en) * 2019-01-18 2019-05-24 深圳壹账通智能科技有限公司 Intelligent navigation method, device, equipment and storage medium based on risk assessment
WO2021008534A1 (en) * 2019-07-15 2021-01-21 华为技术有限公司 Voice wakeup method and electronic device

Similar Documents

Publication Publication Date Title
CN103971681A (en) Voice recognition method and system
CN107147618A (en) A kind of user registering method, device and electronic equipment
CN110570840B (en) Intelligent device awakening method and device based on artificial intelligence
CN110349579B (en) Voice wake-up processing method and device, electronic equipment and storage medium
CN106935253A (en) The method of cutting out of audio file, device and terminal device
CN103700370A (en) Broadcast television voice recognition method and system
US9424743B2 (en) Real-time traffic detection
CN104217717A (en) Language model constructing method and device
CN103903621A (en) Method for voice recognition and electronic equipment
CN110047481A (en) Method for voice recognition and device
CN105374352A (en) Voice activation method and system
CN110503944B (en) Method and device for training and using voice awakening model
CN105938399A (en) Text input identification method of intelligent equipment based on acoustics
CN108595406B (en) User state reminding method and device, electronic equipment and storage medium
CN106936991A (en) The method and terminal of a kind of automatic regulating volume
CN110875045A (en) Voice recognition method, intelligent device and intelligent television
CN106531195B (en) A kind of dialogue collision detection method and device
CN113436611B (en) Test method and device for vehicle-mounted voice equipment, electronic equipment and storage medium
CN106228047A (en) A kind of application icon processing method and terminal unit
CN105869622B (en) Chinese hot word detection method and device
CN104282303A (en) Method for conducting voice recognition by voiceprint recognition and electronic device thereof
CN113658586B (en) Training method of voice recognition model, voice interaction method and device
CN110992953A (en) Voice data processing method, device, system and storage medium
CN112466328B (en) Breath sound detection method and device and electronic equipment
CN108231074A (en) A kind of data processing method, voice assistant equipment and computer readable storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20140806

RJ01 Rejection of invention patent application after publication