CN103440867B

CN103440867B - Speech recognition method and system

Info

Publication number: CN103440867B
Application number: CN201310335050.0A
Authority: CN
Inventors: 朱国正; 任严佳
Original assignee: iFlytek Co Ltd
Current assignee: iFlytek Co Ltd
Priority date: 2013-08-02
Filing date: 2013-08-02
Publication date: 2016-08-10
Anticipated expiration: 2033-08-02
Also published as: CN103440867A

Abstract

The invention discloses a speech recognition method and system. The method includes: obtaining voice information sent by a user; sending the voice information to both a cloud recognition engine and a local recognition engine, so that the cloud recognition engine and the local recognition engine each recognize the voice information; if the cloud recognition result returned by the cloud recognition engine is received first, outputting the cloud recognition result; and if the local recognition result of the local recognition engine is received first and the confidence corresponding to the local recognition result is greater than a set upper limit of a confidence interval, outputting the local recognition result. With the invention, a reliable speech recognition result can be provided to the user even when the network is poor or unavailable.

Description

Speech recognition method and system

Technical field

The present invention relates to the technical field of speech recognition, and in particular to a speech recognition method and system.

Background art

With the continuing development of computer science and technology, speech recognition technology has matured and is widely used in fields such as mobile phones, televisions, and in-vehicle systems. Taking in-vehicle use as an example, because a driver cannot conveniently operate an interface by hand, speech recognition serves as a more convenient mode of interaction and allows the in-vehicle system to offer more functions. In the prior art, the speech recognition mode is usually as follows: the client receives the user's voice information, establishes a connection with a cloud speech recognition server, and sends the voice information to the server; the server recognizes the information and returns the recognition result to the client. However, a mobile device does not necessarily have a stable network connection. In that case the cloud response may be subject to a large delay, which degrades the user experience, and when there is no network at all, cloud recognition cannot be used.

Summary of the invention

The present invention provides a speech recognition method and system that can provide the user with a reliable speech recognition result even when the network is poor or unavailable.

To this end, the present invention provides the following technical solutions:

A speech recognition method, including:

obtaining voice information sent by a user;

sending the voice information to both a cloud recognition engine and a local recognition engine, so that the cloud recognition engine and the local recognition engine each recognize the voice information;

if the cloud recognition result returned by the cloud recognition engine is received first, outputting the cloud recognition result;

if the local recognition result of the local recognition engine is received first, and the confidence corresponding to the local recognition result is greater than a set upper limit of a confidence interval, outputting the local recognition result.

Preferably, the method further includes:

if the confidence is within the confidence interval, successively lowering the upper limit of the confidence interval within a set waiting duration;

if the cloud recognition result returned by the cloud recognition engine is received within the waiting duration, outputting the cloud recognition result;

if the cloud recognition result returned by the cloud recognition engine is not received within the waiting duration, and the confidence corresponding to the local recognition result is greater than the lowered upper limit of the confidence interval, outputting the local recognition result.

Preferably, each waiting duration is the same or different.

Preferably, the method further includes:

if, after the number of times the upper limit of the confidence interval has been lowered exceeds a set count threshold, the confidence corresponding to the local recognition result is still less than the lower limit of the confidence interval after the lowering, and the cloud recognition result has still not been received, returning recognition failure information to the user.

Preferably, the method further includes:

if the local recognition result is received first, and the confidence corresponding to the local recognition result is less than a set lower limit of the confidence interval, discarding the local recognition result and continuing to wait for the cloud recognition engine to return the cloud recognition result;

if the waiting time exceeds a set blocking duration, returning recognition failure information to the user.

Preferably, the method further includes:

after receiving a speech recognition request sent by the user, starting the cloud recognition engine and the local recognition engine.

A speech recognition system, including:

a voice information acquiring unit, configured to obtain voice information sent by a user;

a sending unit, configured to send the voice information to both a cloud recognition engine and a local recognition engine, so that the cloud recognition engine and the local recognition engine each recognize the voice information;

a receiving unit, configured to receive the cloud recognition result returned by the cloud recognition engine and the local recognition result of the local recognition engine;

an output unit, configured to output the cloud recognition result when the receiving unit first receives the cloud recognition result returned by the cloud recognition engine, and to output the local recognition result when the receiving unit first receives the local recognition result of the local recognition engine and the confidence corresponding to the local recognition result is greater than a set upper limit of a confidence interval.

Preferably, the system further includes:

a confidence adjustment unit, configured to successively lower the upper limit of the confidence interval within a set waiting duration when the confidence is within the confidence interval;

the output unit is further configured to output the cloud recognition result when the receiving unit receives, within the waiting duration, the cloud recognition result returned by the cloud recognition engine; and to output the local recognition result when the receiving unit does not receive the cloud recognition result within the waiting duration and the confidence corresponding to the local recognition result is greater than the lowered upper limit of the confidence interval.

Preferably, the system further includes:

a statistics unit, configured to count the number of times the confidence adjustment unit lowers the upper limit of the confidence interval;

the output unit is further configured to return recognition failure information to the user if, after the count exceeds a set count threshold, the confidence corresponding to the local recognition result is still less than the lower limit of the confidence interval after the lowering and the cloud recognition result has still not been received.

Preferably, the receiving unit is further configured to discard the local recognition result and continue to wait for the cloud recognition engine to return the cloud recognition result when the local recognition result is received first and the confidence corresponding to the local recognition result is less than a set lower limit of the confidence interval; and to return recognition failure information to the user after the waiting time exceeds a set blocking duration.

Preferably, the system further includes:

a trigger unit, configured to start the cloud recognition engine and the local recognition engine after receiving a speech recognition request sent by the user.

The speech recognition method and system provided by the embodiments of the present invention combine local recognition with cloud recognition. After the voice information sent by the user is received, it is sent to both the cloud recognition engine and the local recognition engine for recognition. When the cloud recognition result returned by the cloud recognition engine is received first, the cloud recognition result is output directly. If the local recognition result of the local recognition engine is received first and its confidence is greater than the set upper limit of the confidence interval, the local recognition result is output. The scheme holds that the cloud recognition result is preferable to the local recognition result: if cloud recognition can return a result before local recognition provides a relatively accurate one, the cloud recognition result is used. As a result, when there is no network access, the local recognition engine can still carry out local functions that do not require a network, such as making a phone call, sending a text message, or playing music.

Further, if the confidence of the local recognition result received first is relatively low but lies within the set confidence interval, the confidence threshold for local recognition is lowered repeatedly until either a qualifying result is output or recognition fails.

Because the scheme provided by the embodiments of the present invention combines local recognition with cloud recognition, it provides a speech recognition result that is as reliable as possible when the network is poor or unavailable.

Brief description of the drawings

In order to describe the technical solutions of the embodiments of the present application or of the prior art more clearly, the accompanying drawings needed in the embodiments are briefly introduced below. Evidently, the drawings described below are only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them.

Fig. 1 is a flow chart of a speech recognition method according to an embodiment of the present invention;

Fig. 2 is another flow chart of a speech recognition method according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a speech recognition system according to an embodiment of the present invention;

Fig. 4 is another schematic structural diagram of a speech recognition system according to an embodiment of the present invention.

Detailed description

In order that those skilled in the art may better understand the solutions of the embodiments of the present invention, the embodiments are described in further detail below with reference to the accompanying drawings and implementations.

The embodiments of the present invention provide a speech recognition method and system that combine cloud recognition with local recognition. When there is no network access, the local recognition engine can carry out local functions that do not require a network, such as making a phone call, sending a text message, or playing music. The requirement placed on the local engine's result can also be lowered dynamically according to the latency of the network connection.

Fig. 1 shows a flow chart of a speech recognition method according to an embodiment of the present invention, which includes the following steps:

Step 101: obtain the voice information sent by the user.

Step 102: send the voice information to both the cloud recognition engine and the local recognition engine, so that the cloud recognition engine and the local recognition engine each recognize the voice information.

Specifically, a recording module may be used to record the voice information sent by the user. The recorded voice information may be sent directly to the cloud recognition engine and the local recognition engine; alternatively, a voice detection module may first pick out the start and end points of the effective information, after which the information is sent to the cloud recognition engine and the local recognition engine.
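By way of a non-limiting illustration, the optional start and end point filtering mentioned above could be sketched as follows. The embodiment does not specify how the voice detection module works; the frame-energy thresholding used here, and the names `find_endpoints`, `frame_len` and `energy_threshold`, are assumptions introduced for this sketch only.

```python
from typing import Optional, Sequence, Tuple


def find_endpoints(
    samples: Sequence[float],
    frame_len: int,
    energy_threshold: float,
) -> Optional[Tuple[int, int]]:
    """Return (start, end) sample indices of the active region, or None.

    Assumption: a frame counts as "effective information" when its mean
    squared amplitude exceeds energy_threshold. This is a stand-in for the
    unspecified voice detection module, not the patented technique.
    """
    active_frames = []
    for frame_index, offset in enumerate(range(0, len(samples), frame_len)):
        frame = samples[offset:offset + frame_len]
        energy = sum(x * x for x in frame) / len(frame)
        if energy > energy_threshold:
            active_frames.append(frame_index)
    if not active_frames:
        return None
    start = active_frames[0] * frame_len
    end = min((active_frames[-1] + 1) * frame_len, len(samples))
    return (start, end)


# Usage: only samples[start:end] would be forwarded to the two engines.
signal = [0.0] * 100 + [0.5] * 200 + [0.0] * 100
region = find_endpoints(signal, frame_len=50, energy_threshold=0.01)
```

Trimming before sending reduces the amount of audio uploaded to the cloud engine, which may matter on a slow connection; whether the actual system trims in this way is not stated in the source.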

Step 103: if the cloud recognition result returned by the cloud recognition engine is received first, output the cloud recognition result.

Because the server-side recognition engine in the cloud is powerful, its recognition result has higher credibility. Therefore, when the cloud recognition result is received first, it can be output directly.

Step 104: if the local recognition result of the local recognition engine is received first, and the confidence corresponding to the local recognition result is greater than the set upper limit of the confidence interval, output the local recognition result.

When the network environment is poor, the cloud recognition result may be considerably delayed. In this case, the local recognition result for the voice information and the confidence value of that result are obtained. If the confidence value is greater than the confidence threshold set by the system, the recognition result is fully usable, so the local recognition result is output without waiting for the cloud recognition result.
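The decision made in steps 103 and 104 can be sketched, purely for illustration, as a function of whichever result arrives first. The `Result` type, the function name `arbitrate_first_arrival` and the numeric values are hypothetical; the source does not prescribe any data structure or threshold value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    source: str              # "cloud" or "local"
    text: str
    confidence: float = 1.0  # only consulted for local results in this sketch


def arbitrate_first_arrival(first: Result, upper_limit: float) -> Optional[Result]:
    """Apply steps 103 and 104 to the first result received.

    Returns the result to output, or None when Fig. 1 alone does not
    decide (a local result whose confidence does not exceed upper_limit).
    """
    if first.source == "cloud":
        return first          # step 103: cloud result is output directly
    if first.confidence > upper_limit:
        return first          # step 104: sufficiently confident local result
    return None


# Usage with an assumed upper limit of 0.8.
chosen = arbitrate_first_arrival(Result("local", "call home", 0.93), upper_limit=0.8)
```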

It can be seen that the speech recognition method provided by the embodiment of the present invention combines local recognition with cloud recognition, and selects the recognition result according to the order in which the cloud and local recognition results are returned and the confidence of a local recognition result that is returned first. The method consistently treats the cloud result as preferable to the local one: if cloud recognition can return a result before local recognition provides a relatively accurate recognition, the cloud result is used.

To further ensure that a speech recognition result with a certain accuracy can still be obtained when the network is delayed or unavailable, another embodiment of the speech recognition method of the present invention can also dynamically adjust the confidence threshold for local recognition according to the current network conditions, so as to output the best result within the shortest delay.

Fig. 2 shows another flow chart of the speech recognition method according to an embodiment of the present invention, which includes the following steps:

Steps 201 to 203 in Fig. 2 are the same as steps 101 to 103 in Fig. 1 and are not described again here.

Step 204: if the local recognition result of the local recognition engine is received first, obtain the confidence corresponding to the local recognition result.

In step 204, the subsequent processing must be determined from the confidence of the local recognition result, so that the best result is output within the shortest delay. Specifically, if the confidence is less than the set lower limit of the confidence interval, step 205 is performed; if the confidence is within the set confidence interval, step 208 is performed; if the confidence is greater than the set upper limit of the confidence interval, step 213 is performed.
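The three-way branch of step 204 can be illustrated with a small function that returns the number of the next step. Treating the interval endpoints as lying inside the interval is an assumption of this sketch; the source only says "less than", "within" and "greater than". The name `next_step` is hypothetical.

```python
def next_step(confidence: float, lower_limit: float, upper_limit: float) -> int:
    """Map a local confidence value to the next step number of Fig. 2."""
    if confidence < lower_limit:
        return 205  # discard the local result and wait for the cloud
    if confidence > upper_limit:
        return 213  # output the local result immediately
    return 208      # inside the interval: start lowering the upper limit


# Usage with an assumed confidence interval of [0.4, 0.8].
branch = next_step(0.6, lower_limit=0.4, upper_limit=0.8)
```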

Step 205: discard the local recognition result and continue to wait for the cloud recognition engine to return the cloud recognition result.

Step 206: determine whether the waiting time exceeds the set blocking duration. If so, perform step 207; otherwise, continue to wait.

Step 207: return recognition failure information to the user.

Step 208: successively lower the upper limit of the confidence interval within a set waiting duration.

Step 209: determine whether the cloud recognition result is received within the waiting duration. If so, perform step 210; otherwise, perform step 211.

Step 210: output the cloud recognition result.

Step 211: determine whether the confidence corresponding to the local recognition result is greater than the currently lowered upper limit of the confidence interval. If so, perform step 213; otherwise, perform step 212.

Step 212: determine whether the number of times the upper limit of the confidence interval has been lowered exceeds a set count threshold (for example, the count threshold may be 1 to 3). If so, perform step 207; otherwise, return to step 208.

Step 213: output the local recognition result.

It should be noted that the waiting duration mentioned in step 208 is the time interval between successive lowerings of the upper limit of the confidence interval, for example 2 to 5 seconds; the interval may be the same or different each time the upper limit is lowered. The waiting time mentioned in step 206 is a different concept from this waiting duration: it is the time spent waiting to receive the cloud recognition result. It may be timed from the moment the voice information is sent to the cloud recognition engine and the local recognition engine, or from the moment the local recognition result is discarded; the embodiment of the present invention does not limit this.
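The loop formed by steps 208 to 213 can be sketched with simulated time, so that the example runs deterministically. Several points are assumptions made only for this sketch: the upper limit is lowered by a fixed `step` each round (the source does not say by how much), the cloud result is modelled as a single arrival time measured from the start of the loop, and the lower limit of the interval is not modelled. The function and parameter names are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def adaptive_arbitration(
    local_confidence: float,
    cloud_arrival: Optional[float],  # seconds until the cloud result arrives; None = never
    upper_limit: float,
    step: float,                     # assumed fixed decrement per round
    waits: Sequence[float],          # waiting duration per round; last value is reused
    max_reductions: int,             # count threshold of step 212
) -> Tuple[str, int]:
    """Return (outcome, number_of_reductions); outcome is 'cloud', 'local' or 'failure'."""
    elapsed = 0.0
    reductions = 0
    while True:
        # Step 208: lower the upper limit, then wait one waiting duration.
        upper_limit -= step
        elapsed += waits[min(reductions, len(waits) - 1)]
        reductions += 1
        # Steps 209 and 210: a cloud result that arrived in time wins.
        if cloud_arrival is not None and cloud_arrival <= elapsed:
            return ("cloud", reductions)
        # Steps 211 and 213: the local result now clears the lowered limit.
        if local_confidence > upper_limit:
            return ("local", reductions)
        # Steps 212 and 207: give up once the count threshold is exceeded.
        if reductions > max_reductions:
            return ("failure", reductions)


# Usage: no network, local confidence 0.65, upper limit starting at 0.8.
outcome = adaptive_arbitration(0.65, None, upper_limit=0.8, step=0.1,
                               waits=[2.0], max_reductions=3)
```

Passing a list with several entries in `waits` models the case where the interval differs from one lowering to the next.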

In addition, in practical applications, when no cloud recognition result is received within a certain time after each lowering of the upper limit of the confidence interval and the confidence of the local recognition result still does not meet the requirement, it is also possible not to check whether the number of lowerings exceeds the set count threshold, but instead to check whether the time already waited exceeds a specified waiting limit. If it does, recognition failure information is returned to the user, so that an overly long wait does not harm the user experience.

The cloud has powerful server processing capability and a massive body of speech data for comparison, so its recognition results are highly credible. Local recognition needs no network support, is fast, and has a wide range of application, making it particularly suitable for mobile devices without a stable network connection. The speech recognition method of the embodiment of the present invention therefore combines local recognition with cloud recognition and takes advantage of both: after the voice information sent by the user is obtained, it is sent simultaneously to the cloud recognition engine and the local recognition engine for recognition. If cloud recognition can return a result before local recognition provides a relatively accurate recognition result, the cloud recognition result is used. Otherwise, the confidence threshold for local recognition is lowered repeatedly until a qualifying result is output or recognition fails. This provides a speech recognition result that is as reliable as possible when the network is poor or unavailable.
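Sending the voice information to both engines at once and reacting to whichever answers first can be sketched with Python's `asyncio`. The two engines are replaced here by a hypothetical `fake_engine` coroutine that simply sleeps and returns a canned result; all names, delays and thresholds are assumptions. For brevity, any local result that does not exceed the upper limit is handled as in steps 205 to 207 (wait for the cloud up to the blocking duration, timed from the moment the local result is set aside), and the gradual lowering of steps 208 to 212 is omitted.

```python
import asyncio
from typing import Tuple

Result = Tuple[str, str, float]  # (source, text, confidence)


async def fake_engine(source: str, text: str, confidence: float, delay: float) -> Result:
    # Stand-in for a real recognition engine: wait, then return a canned result.
    await asyncio.sleep(delay)
    return (source, text, confidence)


async def recognize(
    cloud_delay: float,
    local_delay: float,
    local_confidence: float,
    upper_limit: float,
    block_timeout: float,
) -> Result:
    # Step 102: dispatch the same utterance to both engines concurrently.
    cloud = asyncio.create_task(fake_engine("cloud", "navigate home", 1.0, cloud_delay))
    local = asyncio.create_task(
        fake_engine("local", "navigate home", local_confidence, local_delay)
    )
    done, _ = await asyncio.wait({cloud, local}, return_when=asyncio.FIRST_COMPLETED)
    if cloud in done:
        local.cancel()              # step 103: the cloud result is preferred
        return cloud.result()
    result = local.result()
    if result[2] > upper_limit:
        cloud.cancel()              # step 104: confident local result
        return result
    try:
        # Steps 205 and 206: set the local result aside, wait for the cloud.
        return await asyncio.wait_for(cloud, timeout=block_timeout)
    except asyncio.TimeoutError:
        return ("failure", "", 0.0)  # step 207


# Usage: slow cloud, fast and confident local engine.
picked = asyncio.run(recognize(0.2, 0.01, 0.9, upper_limit=0.8, block_timeout=1.0))
```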

With the speech recognition method of the embodiment of the present invention, a simple and efficient local recognition engine handles the recognition of local commands when the network is unavailable. In addition, the strategy for choosing between cloud and local recognition results reduces recognition delay, and the confidence threshold for local recognition can be dynamically adjusted according to the current network conditions, so that the best result is output within the shortest delay.

It should also be noted that, in practical applications, the cloud recognition engine and the local recognition engine may be started after a speech recognition request sent by the user is received. For example, the speech recognition request may be sent when the user presses a speech recognition key; alternatively, a voice wake-up function may be provided to the user, in which recording remains on in the background and the request is sent when a specific keyword is recognized.

The local recognition engine may use a conventional recognition method to recognize the specific keyword. For example, the local recognition engine reads a predefined grammar file that defines the set of command words supported by speech recognition; this set of command words also exists in a dictionary that the local recognition engine can access efficiently. The local recognition engine generates a recognition network from the grammar file, extracts the feature information of the input speech, and performs recognition path matching on the network. In the end, any sentence within the scope defined by the grammar file that the user speaks can be recognized by the system, and the specific keyword is thereby identified.
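The idea of building a recognition network from a grammar file and matching a path through it can be illustrated with a word-level prefix tree. This is a deliberately simplified sketch: a real engine matches acoustic features against the network, whereas the example below matches already-decoded words. The grammar format (one command per line), the `<end>` marker and the function names are all assumptions and do not come from the source.

```python
from typing import Dict, List

END = "<end>"  # hypothetical marker for "a complete command ends here"


def build_network(grammar_text: str) -> Dict[str, dict]:
    """Build a prefix tree from a grammar with one command phrase per line."""
    root: Dict[str, dict] = {}
    for line in grammar_text.splitlines():
        words = line.split()
        if not words:
            continue
        node = root
        for word in words:
            node = node.setdefault(word, {})
        node[END] = {}
    return root


def match_path(network: Dict[str, dict], words: List[str]) -> bool:
    """Return True only if the word sequence is a complete command in the grammar."""
    node = network
    for word in words:
        if word not in node:
            return False
        node = node[word]
    return END in node


# Usage with an assumed three-command grammar.
network = build_network("call home\nplay music\nsend message\n")
is_command = match_path(network, ["play", "music"])
```

Restricting recognition to a closed command set in this way keeps the search space small, which is consistent with the source's description of the local engine as simple and efficient.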

Of course, the embodiment of the present invention does not limit which speech recognition technology the cloud recognition engine and the local recognition engine use. The local recognition engine in particular may be chosen according to the specific application scenario, without affecting the effects described above.

Correspondingly, an embodiment of the present invention also provides a speech recognition system. Fig. 3 is a schematic structural diagram of the system.

In this embodiment, the system includes:

A voice information acquiring unit 301, configured to obtain the voice information sent by the user.

A sending unit 302, configured to send the voice information to both the cloud recognition engine and the local recognition engine, so that the cloud recognition engine and the local recognition engine each recognize the voice information.

A receiving unit 303, configured to receive the cloud recognition result returned by the cloud recognition engine and the local recognition result of the local recognition engine.

An output unit 304, configured to output the cloud recognition result when the receiving unit 303 first receives the cloud recognition result returned by the cloud recognition engine, and to output the local recognition result when the receiving unit 303 first receives the local recognition result of the local recognition engine and the confidence corresponding to the local recognition result is greater than the set upper limit of the confidence interval.

The speech recognition system provided by the embodiment of the present invention combines local recognition with cloud recognition, and selects the recognition result according to the order in which the cloud and local recognition results are returned and the confidence of a local recognition result that is returned first. The system consistently treats the cloud result as preferable to the local one: if cloud recognition can return a result before local recognition provides a relatively accurate recognition, the cloud result is used.

To further ensure that a speech recognition result with a certain accuracy can still be obtained when the network is delayed or unavailable, another embodiment of the speech recognition system of the present invention can also dynamically adjust the confidence threshold for local recognition according to the current network conditions, so as to output the best result within the shortest delay.

Fig. 4 is a schematic structural diagram of another embodiment of the speech recognition system of the present invention.

Unlike the embodiment shown in Fig. 3, in this embodiment the system further includes:

A confidence adjustment unit 401, configured to successively lower the upper limit of the confidence interval within a set waiting duration when the confidence is within the confidence interval.

Correspondingly, in this embodiment, the output unit 304 is further configured to output the cloud recognition result when the receiving unit 303 receives, within the waiting duration, the cloud recognition result returned by the cloud recognition engine; and to output the local recognition result when the receiving unit 303 does not receive the cloud recognition result within the waiting duration and the confidence corresponding to the local recognition result is greater than the lowered upper limit of the confidence interval.

In addition, to prevent an overly long wait for the recognition result from harming the user experience, the system may further include, as shown in Fig. 4, a statistics unit 402, configured to count the number of times the confidence adjustment unit 401 lowers the upper limit of the confidence interval.

Correspondingly, the output unit 304 may be further configured to return recognition failure information to the user if, after the count recorded by the statistics unit 402 exceeds the set count threshold, the confidence corresponding to the local recognition result is still less than the lower limit of the confidence interval after the lowering and the cloud recognition result has still not been received.

To ensure the accuracy of the local recognition result that is output, in the embodiments shown in Fig. 3 and Fig. 4 the receiving unit 303 may be further configured to discard the local recognition result and continue to wait for the cloud recognition engine to return the cloud recognition result when the local recognition result is received first and its confidence is less than the set lower limit of the confidence interval, and to return recognition failure information to the user after the waiting time exceeds the set blocking duration. Of course, in practical applications, the receiving unit 303 may instead notify the output unit 304 of this situation, and the output unit 304 returns the recognition failure information to the user.

In addition, the cloud recognition engine and the local recognition engine may be started in different ways. For example, in each of the above embodiments, the system may further include a trigger unit (not shown), configured to start the cloud recognition engine and the local recognition engine after receiving a speech recognition request sent by the user. The speech recognition request may be sent when the user presses a speech recognition key; alternatively, a voice wake-up function may be provided to the user, in which recording remains on in the background and the request is sent when a specific keyword is recognized.

It can be seen from the above description that, in the speech recognition system of the embodiment of the present invention, a simple and efficient local recognition engine handles the recognition of local commands when the network is unavailable. In addition, the strategy for choosing between cloud and local recognition results reduces recognition delay, and the confidence threshold for local recognition can be dynamically adjusted according to the current network conditions, so that the best result is output within the shortest delay.

The embodiments in this specification are described in a progressive manner. For parts that are the same or similar among the embodiments, reference may be made from one to another, and each embodiment focuses on its differences from the others. The system embodiments in particular are described relatively briefly because they are substantially similar to the method embodiments; for the relevant parts, refer to the description of the method embodiments.

It should be noted that the system embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. A person of ordinary skill in the art can understand and implement this without creative effort.

The embodiments of the present invention have been described in detail above, and specific implementations have been used herein to explain the invention. The description of the above embodiments is intended only to help in understanding the method and apparatus of the present invention. A person of ordinary skill in the art may, in accordance with the idea of the present invention, make changes. In summary, the content of this specification should not be construed as limiting the present invention.

Claims

1. A speech recognition method, characterized by including:

obtaining voice information sent by a user;

sending the voice information to both a cloud recognition engine and a local recognition engine, so that the cloud recognition engine and the local recognition engine each recognize the voice information;

if the cloud recognition result returned by the cloud recognition engine is received first, outputting the cloud recognition result;

if the local recognition result of the local recognition engine is received first, and the confidence corresponding to the local recognition result is greater than a set upper limit of a confidence interval, outputting the local recognition result;

if the confidence is within the confidence interval, successively lowering the upper limit of the confidence interval within a set waiting duration;

if the cloud recognition result returned by the cloud recognition engine is received within the waiting duration, outputting the cloud recognition result;

if the cloud recognition result returned by the cloud recognition engine is not received within the waiting duration, and the confidence corresponding to the local recognition result is greater than the lowered upper limit of the confidence interval, outputting the local recognition result.

2. The method according to claim 1, characterized in that the successive waiting periods are the same or different in duration.

3. The method according to claim 1, characterized in that the method further comprises:

If, after the number of times the upper limit of the confidence interval has been lowered exceeds a set count threshold, the confidence corresponding to the local recognition result is still below the lower limit of the confidence interval after the reduction, and the cloud recognition result has still not been received, returning recognition failure information to the user.

If the local recognition result is received first, and the confidence corresponding to the local recognition result is less than the set lower limit of the confidence interval, discarding the local recognition result and continuing to wait for the cloud recognition engine to return the cloud recognition result;

5. The method according to any one of claims 1 to 4, characterized in that the method further comprises:

6. A speech recognition system, characterized by comprising:

A receiving unit, configured to receive the cloud recognition result returned by the cloud recognition engine and the local recognition result of the local recognition engine;

An output unit, configured to output the cloud recognition result when the receiving unit first receives the cloud recognition result returned by the cloud recognition engine, and to output the local recognition result when the receiving unit first receives the local recognition result of the local recognition engine and the confidence corresponding to the local recognition result is greater than the set upper limit of the confidence interval;

A confidence adjustment unit, configured to successively lower the upper limit of the confidence interval within set waiting periods when the confidence is within the confidence interval;

The output unit being further configured to output the cloud recognition result when the receiving unit receives, within the waiting period, the cloud recognition result returned by the cloud recognition engine; and to output the local recognition result when, within the waiting period, the receiving unit does not receive the cloud recognition result returned by the cloud recognition engine and the confidence corresponding to the local recognition result is greater than the lowered upper limit of the confidence interval.

7. The system according to claim 6, characterized in that the system further comprises:

The output unit being further configured, after the number of times exceeds a set count threshold, to return recognition failure information to the user if the confidence corresponding to the local recognition result is still below the lower limit of the confidence interval after the reduction and the cloud recognition result has still not been received.

8. The system according to claim 6, characterized in that

The receiving unit is further configured, when the local recognition result is received first and the confidence corresponding to the local recognition result is less than the set lower limit of the confidence interval, to discard the local recognition result and continue waiting for the cloud recognition engine to return the cloud recognition result; and, after the waiting time exceeds a set blocking duration, to return recognition failure information to the user.

9. The system according to any one of claims 6 to 8, characterized in that the system further comprises:

A trigger unit, configured to start the cloud recognition engine and the local recognition engine after receiving a speech recognition request sent by the user.