CN103440867A - Method and system for recognizing voice

Info

Publication number: CN103440867A
Application number: CN2013103350500A
Authority: CN
Inventors: 朱国正; 任严佳
Original assignee: iFlytek Co Ltd
Current assignee: iFlytek Co Ltd
Priority date: 2013-08-02
Filing date: 2013-08-02
Publication date: 2013-12-11
Anticipated expiration: 2033-08-02
Also published as: CN103440867B

Abstract

The invention discloses a speech recognition method and system. The method comprises: obtaining voice information sent by a user; sending the voice information to both a cloud recognition engine and a local recognition engine, so that each engine recognizes the voice information; outputting the cloud recognition result if the result returned by the cloud recognition engine is received first; and outputting the local recognition result if the result of the local recognition engine is received first and its confidence exceeds the upper limit of a set confidence interval. The method and system can provide the user with a reliable speech recognition result even when the network is poor or unavailable.

Description

Speech recognition method and system

Technical field

The present invention relates to the field of speech recognition technology, and in particular to a speech recognition method and system.

Background technology

With the continuing development of computer science and technology, speech recognition technology has gradually matured and is widely used in mobile phones, televisions, in-vehicle systems and other fields. Taking in-vehicle systems as an example, a driver cannot conveniently operate a manual interface while driving, so speech recognition offers a relatively easy mode of interaction and allows the in-vehicle system to provide more functions. In the prior art, speech recognition generally works as follows: the client receives the user's voice information, connects to a cloud speech recognition server and sends the voice information to the server; the server recognizes the information and returns the recognition result to the client. However, a mobile device does not necessarily have a stable network connection. In that case the cloud response may be subject to considerable delay, which degrades the user experience; with no network at all, cloud recognition is unavailable.

Summary of the invention

The invention provides a speech recognition method and system that can provide the user with a reliable speech recognition result even when the network is poor or unavailable.

To this end, the invention provides the following technical solution:

A speech recognition method, comprising:

obtaining voice information sent by a user;

sending the voice information to both a cloud recognition engine and a local recognition engine, so that the cloud recognition engine and the local recognition engine each recognize the voice information;

if the cloud recognition result returned by the cloud recognition engine is received first, outputting the cloud recognition result;

if the local recognition result of the local recognition engine is received first, and the confidence of the local recognition result is greater than the upper limit of a set confidence interval, outputting the local recognition result.

Preferably, the method further comprises:

if the confidence lies within the confidence interval, successively lowering the upper limit of the confidence interval at a set waiting interval;

if the cloud recognition result returned by the cloud recognition engine is received within the waiting interval, outputting the cloud recognition result;

if the cloud recognition result returned by the cloud recognition engine is not received within the waiting interval, and the confidence of the local recognition result is greater than the lowered upper limit of the confidence interval, outputting the local recognition result.

Preferably, the waiting intervals are the same or different.

Preferably, the method further comprises:

if, after the number of times the upper limit of the confidence interval has been lowered exceeds a set count threshold, the confidence of the local recognition result is still less than the lower limit of the confidence interval after the reduction and the cloud recognition result has still not been received, returning recognition failure information to the user.

Preferably, the method further comprises:

if the local recognition result is received first and its confidence is less than the lower limit of the set confidence interval, discarding the local recognition result and continuing to wait for the cloud recognition engine to return the cloud recognition result;

if the waiting time exceeds a set blocking duration, returning recognition failure information to the user.

Preferably, the method further comprises:

after receiving a speech recognition request sent by the user, starting the cloud recognition engine and the local recognition engine.

A speech recognition system, comprising:

a voice information acquiring unit, configured to obtain voice information sent by a user;

a sending unit, configured to send the voice information to both a cloud recognition engine and a local recognition engine, so that the cloud recognition engine and the local recognition engine each recognize the voice information;

a receiving unit, configured to receive the cloud recognition result returned by the cloud recognition engine and the local recognition result of the local recognition engine;

an output unit, configured to output the cloud recognition result when the receiving unit first receives the cloud recognition result returned by the cloud recognition engine, and to output the local recognition result when the receiving unit first receives the local recognition result of the local recognition engine and the confidence of the local recognition result is greater than the upper limit of a set confidence interval.

Preferably, the system further comprises:

a confidence adjustment unit, configured to successively lower the upper limit of the confidence interval at a set waiting interval when the confidence lies within the confidence interval;

the output unit being further configured to output the cloud recognition result when the receiving unit receives, within the waiting interval, the cloud recognition result returned by the cloud recognition engine, and to output the local recognition result when the receiving unit does not receive the cloud recognition result within the waiting interval and the confidence of the local recognition result is greater than the lowered upper limit of the confidence interval.

Preferably, the system further comprises:

a statistics unit, configured to count the number of times the confidence adjustment unit lowers the upper limit of the confidence interval;

the output unit being further configured to return recognition failure information to the user if, after the number of times exceeds a set count threshold, the confidence of the local recognition result is still less than the lower limit of the confidence interval after the reduction and the cloud recognition result has still not been received.

Preferably, the receiving unit is further configured to discard the local recognition result and continue to wait for the cloud recognition engine to return the cloud recognition result when the local recognition result is received first and its confidence is less than the lower limit of the set confidence interval, and to return recognition failure information to the user after the waiting time exceeds a set blocking duration.

Preferably, the system further comprises:

a trigger unit, configured to start the cloud recognition engine and the local recognition engine after a speech recognition request sent by the user is received.

The speech recognition method and system provided by the embodiments of the present invention combine local recognition with cloud recognition. After the voice information sent by the user is received, it is sent to both the cloud recognition engine and the local recognition engine for recognition. If the cloud recognition result returned by the cloud recognition engine is received first, the cloud recognition result is output directly. If the local recognition result of the local recognition engine is received first and its confidence is greater than the upper limit of the set confidence interval, the local recognition result is output. The scheme holds that the cloud recognition result is preferable to the local one: if cloud recognition can return a result before local recognition provides a relatively accurate result, the cloud recognition result is adopted. As a result, when there is no network access, the local recognition engine alone can carry out local functions that do not require a network, such as making phone calls, sending text messages and playing music.

Further, if the confidence of the local recognition result received first is lower and falls within the set confidence interval, the confidence threshold for local recognition is lowered step by step until a qualifying output is obtained or recognition fails.

Because the scheme provided by the embodiments of the present invention combines local recognition with cloud recognition, it can provide a reliable speech recognition result as far as possible even when the network is poor or unavailable.

Brief description of the drawings

To describe the technical solutions of the embodiments of the present application or of the prior art more clearly, the drawings used in the embodiments are briefly introduced below. The drawings described below are evidently only some of the embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them.

Fig. 1 is a flowchart of the speech recognition method according to an embodiment of the present invention;

Fig. 2 is another flowchart of the speech recognition method according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of the speech recognition system according to an embodiment of the present invention;

Fig. 4 is another schematic structural diagram of the speech recognition system according to an embodiment of the present invention.

Embodiment

To help those skilled in the art better understand the solutions of the embodiments of the present invention, the embodiments are described in further detail below with reference to the drawings and implementations.

The embodiments of the present invention provide a speech recognition method and system that combine cloud recognition with local recognition. When there is no network access, the local recognition engine alone can carry out local functions that do not require a network, such as making phone calls, sending text messages and playing music. The requirement placed on the local engine's result can also be relaxed dynamically according to the network connection delay.

Fig. 1 is a flowchart of the speech recognition method according to an embodiment of the present invention. The method comprises the following steps:

Step 101: obtain the voice information sent by the user.

Step 102: send the voice information to both the cloud recognition engine and the local recognition engine, so that the cloud recognition engine and the local recognition engine each recognize the voice information.

Specifically, a recording module can record the voice information sent by the user. The recorded voice information can be sent directly to the cloud recognition engine and the local recognition engine; alternatively, a voice detection module can first extract the effective speech segment, which is then sent to the cloud recognition engine and the local recognition engine.
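The optional voice detection step can be illustrated with a deliberately simple energy-based trimmer. This is only a sketch: the text does not say how the voice detection module works, so the energy-threshold technique, the function name `trim_silence`, the frame length and the threshold value are all assumptions chosen for illustration.

```python
from typing import List


def trim_silence(samples: List[float], frame_len: int = 160,
                 energy_threshold: float = 0.01) -> List[float]:
    """Keep only the span between the first and last frame whose mean energy
    exceeds the threshold (hypothetical stand-in for a voice detection module)."""
    # Split the recording into fixed-size frames; the last frame may be shorter.
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    # Indices of frames whose mean squared amplitude is above the threshold.
    voiced = [i for i, frame in enumerate(frames)
              if sum(x * x for x in frame) / len(frame) > energy_threshold]
    if not voiced:
        return []  # nothing that looks like speech
    start = voiced[0] * frame_len
    end = min((voiced[-1] + 1) * frame_len, len(samples))
    return samples[start:end]
```

The trimmed samples, rather than the full recording, would then be handed to both engines in step 102.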

Step 103: if the cloud recognition result returned by the cloud recognition engine is received first, output the cloud recognition result.

Because the servers behind the cloud recognition engine are powerful, their recognition results have higher confidence. Therefore, when the cloud recognition result is received first, it can be output directly.

Step 104: if the local recognition result of the local recognition engine is received first and its confidence is greater than the upper limit of the set confidence interval, output the local recognition result.

When the network environment is poor, the cloud recognition result may be considerably delayed. In that case, the local recognition result for the voice information and the confidence value of that result are obtained. If the confidence value is greater than the confidence threshold set by the system, the recognition result is fully usable, so the local recognition result is output without waiting any longer for the cloud recognition result.

As can be seen, the speech recognition method provided by this embodiment combines local recognition with cloud recognition and decides which recognition result to adopt according to the order in which the cloud and local recognition results are returned and, when the local result is returned first, its confidence. The method consistently treats the cloud result as preferable to the local one: if cloud recognition can return a result before local recognition provides a relatively accurate one, the cloud result is adopted.
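The flow of Fig. 1 can be sketched with two concurrent tasks, one per engine. The following is a minimal illustration under stated assumptions, not the patented implementation: the engines are simulated by `fake_engine` with artificial delays, all names and the default upper limit of 0.8 are hypothetical, and the fallback of simply waiting for the cloud when the local confidence is not high enough is an assumption (Fig. 1 itself does not specify that branch).

```python
import asyncio
from typing import Tuple

Result = Tuple[str, str, float]  # (source, text, confidence)


async def fake_engine(source: str, text: str, confidence: float,
                      delay: float) -> Result:
    """Simulated recognition engine that answers after `delay` seconds."""
    await asyncio.sleep(delay)
    return source, text, confidence


async def recognize_first_arrival(cloud_delay: float, local_delay: float,
                                  local_confidence: float,
                                  upper_limit: float = 0.8) -> Result:
    # Step 102: hand the same utterance to both engines at once.
    cloud = asyncio.create_task(
        fake_engine("cloud", "navigate home", 1.0, cloud_delay))
    local = asyncio.create_task(
        fake_engine("local", "navigate home", local_confidence, local_delay))
    done, _pending = await asyncio.wait(
        {cloud, local}, return_when=asyncio.FIRST_COMPLETED)
    # Step 103: a cloud result that arrives first is output directly.
    if cloud in done:
        local.cancel()
        return cloud.result()
    # Step 104: a local result that arrives first is output only if its
    # confidence is greater than the upper limit of the confidence interval.
    result = local.result()
    if result[2] > upper_limit:
        cloud.cancel()
        return result
    # Assumption: otherwise keep waiting for the cloud result.
    return await cloud
```

For example, `asyncio.run(recognize_first_arrival(0.2, 0.01, 0.9))` simulates a slow network with a confident local result, in which case the local result is returned.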

To further ensure that a speech recognition result with a certain accuracy can still be obtained when the network is delayed or unavailable, another embodiment of the speech recognition method of the present invention can also dynamically adjust the confidence threshold for local recognition according to current network conditions, so as to output the best result with the shortest delay.

Fig. 2 is another flowchart of the speech recognition method according to an embodiment of the present invention. The method comprises the following steps:

Steps 201 to 203 in Fig. 2 are identical to steps 101 to 103 in Fig. 1 and are not described again here.

Step 204: if the local recognition result of the local recognition engine is received first, obtain the confidence of the local recognition result.

In step 204, the subsequent processing is determined by the confidence of the local recognition result, so that the best result is output with the shortest delay. Specifically, if the confidence is less than the lower limit of the set confidence interval, step 205 is performed; if the confidence lies within the set confidence interval, step 208 is performed; if the confidence is greater than the upper limit of the set confidence interval, step 213 is performed.
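The three-way branch of step 204 can be written as a small dispatch function. This is a sketch only: the function name `next_step` is hypothetical, and treating a confidence exactly equal to either limit as lying inside the interval is an assumption, since the text only says "less than" and "greater than".

```python
def next_step(confidence: float, lower_limit: float, upper_limit: float) -> int:
    """Return the number of the step in Fig. 2 that follows step 204."""
    if confidence < lower_limit:
        return 205  # discard the local result and wait for the cloud
    if confidence > upper_limit:
        return 213  # output the local result immediately
    return 208      # inside the interval: start lowering the upper limit
```

With a confidence interval of 0.4 to 0.8, a confidence of 0.6 leads to step 208, while 0.9 leads straight to step 213.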

Step 205: discard the local recognition result and continue to wait for the cloud recognition engine to return the cloud recognition result.

Step 206: judge whether the waiting time exceeds the set blocking duration. If so, perform step 207; otherwise, continue to wait.

Step 207: return recognition failure information to the user.

Step 208: successively lower the upper limit of the confidence interval at the set waiting interval.

Step 209: judge whether the cloud recognition result is received within the waiting interval. If so, perform step 210; otherwise, perform step 211.

Step 210: output the cloud recognition result.

Step 211: judge whether the confidence of the local recognition result is greater than the currently lowered upper limit of the confidence interval. If so, perform step 213; otherwise, perform step 212.

Step 212: judge whether the number of times the upper limit of the confidence interval has been lowered exceeds the set count threshold (for example, the count threshold may be 1 to 3). If so, perform step 207; otherwise, return to step 208.

Step 213: output the local recognition result.

Note that the waiting interval mentioned in step 208 is the time interval between successive reductions of the upper limit of the confidence interval, for example 2 to 5 seconds; the interval before each reduction may be the same or different. The waiting time mentioned in step 206 is a different concept from the waiting interval: the waiting time is the time spent waiting to receive the cloud recognition result. Its timing may start when the voice information is sent to the cloud recognition engine and the local recognition engine, or when the local recognition result is discarded; the embodiments of the present invention place no restriction on this.
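Steps 208 to 213 form a loop that can be sketched as follows. This is an illustration under assumptions rather than a definitive implementation: the function name, the fixed `reduction_step`, the injected `wait_for_cloud` callable (which blocks for one waiting interval and returns the cloud text, or `None` if nothing arrived) and the choice to wait one interval before each reduction are all hypothetical; the text does not specify by how much the upper limit is lowered.

```python
from typing import Callable, Optional, Sequence, Tuple


def resolve_in_interval(
    local_text: str,
    local_confidence: float,
    upper_limit: float,
    reduction_step: float,
    waiting_intervals: Sequence[float],
    count_threshold: int,
    wait_for_cloud: Callable[[float], Optional[str]],
) -> Tuple[str, Optional[str]]:
    """Return ("cloud", text), ("local", text) or ("failure", None)."""
    reductions = 0
    while True:
        # Step 208: wait one interval (the last interval is reused if the
        # list runs out), then lower the upper limit once.
        interval = waiting_intervals[min(reductions, len(waiting_intervals) - 1)]
        cloud_text = wait_for_cloud(interval)
        upper_limit -= reduction_step
        reductions += 1
        # Steps 209 and 210: a cloud result received in the interval wins.
        if cloud_text is not None:
            return "cloud", cloud_text
        # Steps 211 and 213: the local result now clears the lowered limit.
        if local_confidence > upper_limit:
            return "local", local_text
        # Steps 212 and 207: give up once the count threshold is exceeded.
        if reductions > count_threshold:
            return "failure", None
```

Passing the wait as a callable keeps the sketch independent of any particular networking or threading model; a test can supply scripted replies, as in `resolve_in_interval("call home", 0.65, 0.8, 0.1, [2.0], 3, lambda seconds: None)`.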

In addition, in practical applications, when the cloud recognition result is not received within a certain time after each reduction of the upper limit of the confidence interval and the confidence of the local recognition result still does not meet the requirement, it is also possible not to judge whether the number of reductions exceeds the set count threshold, but instead to judge whether the time spent waiting exceeds a limit on the waiting time. If it does, recognition failure information is returned to the user, so that an overly long wait does not harm the user experience.
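The time-based variant replaces the reduction counter with an elapsed-time check. The sketch below uses a simulated clock so that it runs instantly; the function name, the fixed interval and step, and the `cloud_ready_at` parameter (the simulated moment at which the cloud result arrives, or `None` if it never does) are assumptions made for illustration.

```python
from typing import Optional


def resolve_with_time_limit(local_confidence: float, upper_limit: float,
                            reduction_step: float, waiting_interval: float,
                            max_wait: float,
                            cloud_ready_at: Optional[float]) -> str:
    """Return "cloud", "local" or "failure" using a waiting-time limit."""
    elapsed = 0.0
    while True:
        # One waiting interval passes, then the upper limit is lowered.
        elapsed += waiting_interval
        upper_limit -= reduction_step
        if cloud_ready_at is not None and cloud_ready_at <= elapsed:
            return "cloud"
        if local_confidence > upper_limit:
            return "local"
        # Instead of counting reductions, compare the total time waited
        # against the limit.
        if elapsed > max_wait:
            return "failure"
```

With 2-second intervals and a 5-second limit, a low-confidence local result and no cloud reply yield a failure after the third interval.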

The cloud has powerful server processing capability and a massive amount of speech data for comparison, so its recognition results have high confidence. Local recognition needs no network support, offers high recognition speed and a wide range of application, and is especially suitable for mobile devices without a stable network connection. The speech recognition method of this embodiment therefore combines local recognition with cloud recognition to take advantage of both: after the voice information sent by the user is obtained, it is sent simultaneously to the cloud recognition engine and the local recognition engine for recognition. If cloud recognition can return a result before local recognition provides a relatively accurate one, the cloud recognition result is adopted. Otherwise, the confidence threshold for local recognition is lowered step by step until a qualifying output is obtained or recognition fails. A reliable speech recognition result can thus be provided as far as possible even when the network is poor or unavailable.

The speech recognition method of this embodiment uses a simple and efficient local recognition engine to recognize local commands when the network is unavailable. In addition, the strategy for choosing between cloud and local recognition results reduces recognition delay, and the confidence threshold for local recognition can be adjusted dynamically according to current network conditions, so that the best result is output with the shortest delay.

It should also be noted that, in practical applications, the cloud recognition engine and the local recognition engine can be started after a speech recognition request sent by the user is received. For example, the speech recognition request may be sent when the user presses a speech recognition key; alternatively, a voice wake-up function may be provided, in which recording stays on in the background and the request is sent when a specific keyword is recognized.

The local recognition engine can recognize the specific keyword using conventional recognition methods. For example, the local recognition engine reads a predefined grammar file that defines the set of command words supported by speech recognition; the command words for the same function are all stored in a dictionary, which the local recognition engine can access efficiently. The local recognition engine generates a recognition network from the grammar file, extracts the feature information of the input speech and performs path matching on the recognition network. As a result, any sentence the user speaks that falls within the scope defined by the grammar file can be recognized by the system, and the specific keyword is thereby identified.
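The dictionary of command words can be illustrated with a toy lookup. This sketch covers only the grammar and dictionary idea; it operates on already-transcribed text and does not attempt the acoustic feature extraction or path matching on a recognition network. The grammar format (`function: phrase | phrase`), the `#` comment marker and both function names are assumptions, since the text does not define the grammar file's syntax.

```python
from typing import Dict, Optional


def load_grammar(text: str) -> Dict[str, str]:
    """Map each command phrase to the function it triggers."""
    table: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        function, _, phrases = line.partition(":")
        for phrase in phrases.split("|"):
            phrase = phrase.strip().lower()
            if phrase:
                table[phrase] = function.strip()
    return table


def match_command(table: Dict[str, str], utterance: str) -> Optional[str]:
    """Return the function for an utterance inside the grammar, else None."""
    return table.get(utterance.strip().lower())
```

Grouping all phrases for one function on a single line mirrors the statement that command words for the same function are stored together, and the dictionary gives constant-time lookup for any phrase in the defined set.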

Of course, the embodiments of the present invention do not restrict which speech recognition technology the cloud recognition engine and the local recognition engine use. The local recognition engine in particular can be chosen according to the needs of the specific application scenario without affecting the effects described above.

Correspondingly, an embodiment of the present invention also provides a speech recognition system; Fig. 3 is a schematic structural diagram of this system.

In this embodiment, the system comprises:

A voice information acquiring unit 301, configured to obtain voice information sent by a user.

A sending unit 302, configured to send the voice information to both the cloud recognition engine and the local recognition engine, so that the cloud recognition engine and the local recognition engine each recognize the voice information.

A receiving unit 303, configured to receive the cloud recognition result returned by the cloud recognition engine and the local recognition result of the local recognition engine.

An output unit 304, configured to output the cloud recognition result when the receiving unit 303 first receives the cloud recognition result returned by the cloud recognition engine, and to output the local recognition result when the receiving unit 303 first receives the local recognition result of the local recognition engine and the confidence of the local recognition result is greater than the upper limit of the set confidence interval.

The speech recognition system provided by this embodiment combines local recognition with cloud recognition and decides which recognition result to adopt according to the order in which the cloud and local recognition results are returned and, when the local result is returned first, its confidence. The system consistently treats the cloud result as preferable to the local one: if cloud recognition can return a result before local recognition provides a relatively accurate one, the cloud result is adopted.

To further ensure that a speech recognition result with a certain accuracy can still be obtained when the network is delayed or unavailable, another embodiment of the speech recognition system of the present invention can also dynamically adjust the confidence threshold for local recognition according to current network conditions, so as to output the best result with the shortest delay.

Fig. 4 is a schematic structural diagram of another embodiment of the speech recognition system of the present invention.

Unlike the embodiment shown in Fig. 3, in this embodiment the system further comprises:

A confidence adjustment unit 401, configured to successively lower the upper limit of the confidence interval at the set waiting interval when the confidence lies within the confidence interval.

Correspondingly, in this embodiment, the output unit 304 is further configured to output the cloud recognition result when the receiving unit 303 receives, within the waiting interval, the cloud recognition result returned by the cloud recognition engine; and to output the local recognition result when the receiving unit 303 does not receive the cloud recognition result within the waiting interval and the confidence of the local recognition result is greater than the lowered upper limit of the confidence interval.

In addition, to prevent an overly long wait for the recognition result from harming the user experience, the system may further comprise, as shown in Fig. 4, a statistics unit 402, configured to count the number of times the confidence adjustment unit 401 lowers the upper limit of the confidence interval.

Correspondingly, the output unit 304 may further be configured to return recognition failure information to the user if, after the number of times counted by the statistics unit 402 exceeds the set count threshold, the confidence of the local recognition result is still less than the lower limit of the confidence interval after the reduction and the cloud recognition result has still not been received.

To ensure the accuracy of the local recognition result that is output, in the embodiments shown in Fig. 3 and Fig. 4 the receiving unit 303 may further be configured to discard the local recognition result and continue to wait for the cloud recognition engine to return the cloud recognition result when the local recognition result is received first and its confidence is less than the lower limit of the set confidence interval, and to return recognition failure information to the user after the waiting time exceeds the set blocking duration. Of course, in practical applications the receiving unit 303 may instead notify the output unit 304 of this situation, and the output unit 304 then returns the recognition failure information to the user.

In addition, the cloud recognition engine and the local recognition engine can be started in different ways. For example, in each of the above embodiments the system may further comprise a trigger unit (not shown), configured to start the cloud recognition engine and the local recognition engine after a speech recognition request sent by the user is received. The speech recognition request may be sent when the user presses a speech recognition key; alternatively, a voice wake-up function may be provided, in which recording stays on in the background and the request is sent when a specific keyword is recognized.

As the above description shows, the speech recognition system of this embodiment uses a simple and efficient local recognition engine to recognize local commands when the network is unavailable. In addition, the strategy for choosing between cloud and local recognition results reduces recognition delay, and the confidence threshold for local recognition can be adjusted dynamically according to current network conditions, so that the best result is output with the shortest delay.

The embodiments in this specification are described in a progressive manner. For parts that are the same or similar across embodiments, reference may be made from one embodiment to another; each embodiment focuses on its differences from the others. In particular, because the system embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, see the corresponding description of the method embodiments.

Note that the system embodiments described above are only illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. A person of ordinary skill in the art can understand and implement this without creative effort.

The embodiments of the present invention have been described in detail above, and specific implementations have been used herein to illustrate the invention. The description of the above embodiments is intended only to help in understanding the method and apparatus of the present invention. At the same time, a person of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims

1. A speech recognition method, characterized by comprising:

obtaining voice information sent by a user;

2. The method according to claim 1, characterized in that the method further comprises:

3. The method according to claim 2, characterized in that the waiting intervals are the same or different.

4. The method according to claim 2, characterized in that the method further comprises:

5. The method according to claim 2, characterized in that the method further comprises:

6. The method according to any one of claims 1 to 5, characterized in that the method further comprises:

7. A speech recognition system, characterized by comprising:

8. The system according to claim 7, characterized in that the system further comprises:

9. The system according to claim 8, characterized in that the system further comprises:

10. The system according to claim 8, characterized in that

the receiving unit is further configured to discard the local recognition result and continue to wait for the cloud recognition engine to return the cloud recognition result when the local recognition result is received first and its confidence is less than the lower limit of the set confidence interval, and to return recognition failure information to the user after the waiting time exceeds a set blocking duration.

11. The system according to any one of claims 7 to 10, characterized in that the system further comprises: