CN104916283A - Voice recognition method and device - Google Patents

Voice recognition method and device

Info

Publication number
CN104916283A
Authority
CN
China
Prior art keywords
recognition
buffer storage
speech
speech buffer
recognition result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510319421.5A
Other languages
Chinese (zh)
Inventor
段弘
唐立亮
谢延
彭守业
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu Online Network Technology Beijing Co Ltd
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201510319421.5A priority Critical patent/CN104916283A/en
Publication of CN104916283A publication Critical patent/CN104916283A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a voice recognition method and device. The voice recognition method comprises the following steps: S1) receiving input voice information and dividing the voice information into a plurality of speech buffer segments; S2) sequentially carrying out online recognition on the plurality of speech buffer segments; S3) when the online recognition fails, obtaining a plurality of first recognition results corresponding to the speech buffer segments for which online recognition has been completed, carrying out offline recognition on the speech buffer segments for which online recognition has not been completed, and obtaining a plurality of second recognition results corresponding to the offline recognition; and S4) merging the plurality of first recognition results and the plurality of second recognition results to generate a final recognition result. The voice recognition method and device improve the stability and accuracy of voice recognition and thereby improve the user experience.

Description

Voice recognition method and device
Technical field
The present invention relates to the technical field of voice recognition, and in particular to a voice recognition method and device.
Background art
Speech recognition technology is an interdisciplinary field that involves many technical areas. As science continues to advance, the range of applications of speech recognition keeps widening; for example, a voice input method can convert a user's speech into text and thus save the user the time needed to type.
At present, speech recognition can be divided into two kinds: recognition based on a cloud engine (online speech recognition) and recognition based on a local engine (offline speech recognition). Online speech recognition offers high accuracy and good real-time performance and does not consume client device resources, but it places high demands on the network environment; if the network is not fast enough, the online recognition process becomes slow or even fails, so its stability is poor. Offline speech recognition relies mainly on the local engine to recognize speech, so it does not depend on the network and its stability is guaranteed, but its recognition accuracy is lower.
At present there are also products that can use both online and offline recognition, but they are all based on a retry strategy: if a network problem occurs during online recognition, the failure of the online recognition process is reported, the user has to re-enter the voice information, recognition is performed again, and online or offline recognition is then chosen according to the network condition. This is inconvenient to operate and gives a poor user experience.
Therefore, a recognition method or device with high stability and high recognition accuracy is urgently needed.
Summary of the invention
The present invention aims to solve at least one of the technical problems in the related art to at least some extent. To this end, one object of the present invention is to propose a voice recognition method that can improve the stability and accuracy of recognition and thereby improve the user experience.
A second object of the present invention is to propose a voice recognition device.
To achieve these objects, an embodiment of the first aspect of the present invention proposes a voice recognition method, comprising: S1, receiving input voice information and cutting the voice information into a plurality of speech buffer segments; S2, performing online recognition on the plurality of speech buffer segments in sequence; S3, when the online recognition fails, obtaining a plurality of first recognition results corresponding to the speech buffer segments for which online recognition has been completed, performing offline recognition on the speech buffer segments for which online recognition has not been completed, and obtaining a plurality of second recognition results corresponding to the offline recognition; and S4, merging the plurality of first recognition results and the plurality of second recognition results to generate a final recognition result.
In the voice recognition method of the embodiment of the present invention, the voice information is cut into a plurality of speech buffer segments, online recognition is performed on the segments in sequence, and when the online recognition fails, offline recognition is performed directly on the segments that have not yet been recognized; the first recognition results from online recognition and the second recognition results from offline recognition are then merged. This improves the stability and accuracy of speech recognition and thereby improves the user experience.
An embodiment of the second aspect of the present invention proposes a voice recognition device, comprising: a cutting module, configured to receive input voice information and cut the voice information into a plurality of speech buffer segments; an online recognition module, configured to perform online recognition on the plurality of speech buffer segments in sequence; an acquisition module, configured to obtain, when the online recognition fails, a plurality of first recognition results corresponding to the speech buffer segments for which online recognition has been completed; an offline recognition module, configured to perform offline recognition on the speech buffer segments for which online recognition has not been completed when the online recognition fails, the acquisition module being further configured to obtain a plurality of second recognition results corresponding to the offline recognition; and a merging module, configured to merge the plurality of first recognition results and the plurality of second recognition results to generate a final recognition result.
In the voice recognition device of the embodiment of the present invention, the voice information is cut into a plurality of speech buffer segments, online recognition is performed on the segments in sequence, and when the online recognition fails, offline recognition is performed directly on the segments that have not yet been recognized; the first recognition results from online recognition and the second recognition results from offline recognition are then merged. This improves the stability and accuracy of speech recognition and thereby improves the user experience.
Brief description of the drawings
Fig. 1 is a flowchart of a voice recognition method according to an embodiment of the present invention.
Fig. 2 is a flowchart of a voice recognition method according to a specific embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a voice recognition device according to an embodiment of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below, and examples of these embodiments are shown in the drawings, in which the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the drawings are exemplary and are intended to explain the present invention; they should not be construed as limiting the present invention.
The voice recognition method and device of the embodiments of the present invention are described below with reference to the drawings.
Fig. 1 is a flowchart of a voice recognition method according to an embodiment of the present invention.
As shown in Fig. 1, the voice recognition method may comprise:
S1: receiving input voice information, and cutting the voice information into a plurality of speech buffer segments.
In an embodiment of the present invention, the voice information input by the user through an input device such as a microphone may be received, and the received voice information is then cut into a plurality of speech buffer segments.
Specifically, a plurality of pairs of voice endpoints may be obtained based on voice endpoint detection, and the voice data between each pair of voice endpoints is then buffered to generate the plurality of speech buffer segments. Each pair of voice endpoints comprises a voice start point and the corresponding voice end point.
For example, the voice input by the user may be analyzed to obtain the voice endpoints s1, e1, s2, e2, s3, e3, ..., where s1 is the first voice start point and e1 is the first voice end point, so the voice data between s1 and e1 is buffered to generate the first speech buffer segment v1; s2 is the second voice start point and e2 is the second voice end point, so the voice data between s2 and e2 is buffered to generate the second speech buffer segment v2, and so on, finally generating the plurality of speech buffer segments.
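To make the segmentation step concrete, the following is a minimal sketch of endpoint detection and buffering; the energy-threshold detector, its parameters, and the function names are illustrative assumptions rather than the patent's actual implementation.

    from typing import List, Tuple

    def detect_endpoints(samples: List[float], threshold: float = 0.01,
                         min_silence: int = 1600) -> List[Tuple[int, int]]:
        """Toy energy-based voice endpoint detector: returns (start, end) index pairs."""
        endpoints, start, silence = [], None, 0
        for i, s in enumerate(samples):
            if abs(s) >= threshold:
                if start is None:
                    start = i                                # voice start point s_k
                silence = 0
            elif start is not None:
                silence += 1
                if silence >= min_silence:                   # enough trailing silence closes the segment
                    endpoints.append((start, i - silence))   # voice end point e_k
                    start, silence = None, 0
        if start is not None:
            endpoints.append((start, len(samples) - 1))
        return endpoints

    def cut_into_segments(samples: List[float]) -> List[List[float]]:
        """Buffer the voice data between each endpoint pair to form segments v1, v2, ..."""
        return [samples[s:e + 1] for s, e in detect_endpoints(samples)]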
S2: performing online recognition on the plurality of speech buffer segments in sequence.
After the voice information has been cut into a plurality of speech buffer segments, online recognition may be performed on the segments in sequence.
Specifically, the cloud engine performs feature extraction on a speech buffer segment to generate an acoustic feature sequence, the acoustic feature sequence is then decoded according to an acoustic model and a dictionary to obtain the acoustic model sequence matching the acoustic feature sequence, and finally the word sequence corresponding to the acoustic model sequence is obtained according to a language model and taken as the first recognition result corresponding to the speech buffer segment. This is consistent with existing online recognition techniques and is therefore not described further here.
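As a rough illustration of this pipeline, the helper below strings the three stages together; the engine object and its method names (extract_features, decode, to_word_sequence) are placeholders assumed for illustration, since the patent refers only to standard existing techniques.

    def recognize_segment_online(segment, cloud_engine):
        """Online recognition of one speech buffer segment via a hypothetical cloud-engine API."""
        features = cloud_engine.extract_features(segment)    # acoustic feature sequence
        phones = cloud_engine.decode(features)               # acoustic model sequence (acoustic model + dictionary)
        return cloud_engine.to_word_sequence(phones)         # word sequence from the language model = first result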
S3: when the online recognition fails, obtaining a plurality of first recognition results corresponding to the speech buffer segments for which online recognition has been completed, performing offline recognition on the speech buffer segments for which online recognition has not been completed, and obtaining a plurality of second recognition results corresponding to the offline recognition.
During online recognition the network may behave abnormally, for example becoming slow or disconnecting, causing the online recognition to fail. In that case the process can switch directly to offline recognition instead of retrying online recognition. For example, if an error occurs while the third speech buffer segment v3 is being recognized online, the recognition results a1 and a2 corresponding to the first two segments, which have already been recognized, can be obtained, and the first and second speech buffer segments v1 and v2 can be deleted at the same time. Offline recognition then starts from the unfinished segment v3, ensuring a seamless hand-over from online to offline recognition and guaranteeing the completeness and accuracy of the speech recognition.
The offline recognition technique is the same as the online recognition technique; the difference is that offline recognition uses the local engine. Specifically, the local engine performs feature extraction on a speech buffer segment to generate an acoustic feature sequence, the acoustic feature sequence is then decoded according to an acoustic model and a dictionary to obtain the acoustic model sequence matching the acoustic feature sequence, and finally the word sequence corresponding to the acoustic model sequence is obtained according to a language model and taken as the second recognition result corresponding to the speech buffer segment. This is consistent with existing offline recognition techniques and is therefore not described further here.
After offline recognition is complete, the corresponding plurality of second recognition results can be obtained.
In this embodiment, the recognition results corresponding to online recognition are the first recognition results, the recognition results corresponding to offline recognition are the second recognition results, and both may be text information. For example, a1 and a2 are results of online recognition and are therefore first recognition results, while a3, obtained by offline recognition, is a second recognition result.
S4: merging the plurality of first recognition results and the plurality of second recognition results to generate the final recognition result.
After the plurality of first recognition results and the plurality of second recognition results have been obtained, a merge operation can be performed to generate the final recognition result, for example A = a1 + a2 + a3 + ...
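Putting steps S2 to S4 together, a minimal control-flow sketch could look as follows; it reuses the hypothetical recognize_segment_online helper above, assumes an analogous offline helper, and treats a ConnectionError as the online recognition failure.

    def recognize_segment_offline(segment, local_engine):
        """Offline counterpart of the online helper: the same steps, through the local engine."""
        features = local_engine.extract_features(segment)
        return local_engine.to_word_sequence(local_engine.decode(features))

    def recognize(segments, cloud_engine, local_engine):
        """S2-S4: sequential online recognition, direct switch to offline on failure, then merging."""
        first_results, second_results = [], []
        failed_at = None
        for i, segment in enumerate(segments):
            try:
                first_results.append(recognize_segment_online(segment, cloud_engine))  # a1, a2, ...
            except ConnectionError:            # network slows down or drops: online recognition fails
                failed_at = i
                break
        if failed_at is not None:
            # Offline recognition starts from the unfinished segment; online recognition is not retried.
            second_results = [recognize_segment_offline(s, local_engine)
                              for s in segments[failed_at:]]
        return "".join(first_results + second_results)   # A = a1 + a2 + a3 + ...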
In the voice recognition method of the embodiment of the present invention, the voice information is cut into a plurality of speech buffer segments, online recognition is performed on the segments in sequence, and when the online recognition fails, offline recognition is performed directly on the segments that have not yet been recognized; the first recognition results from online recognition and the second recognition results from offline recognition are then merged. This improves the stability and accuracy of speech recognition and thereby improves the user experience.
Fig. 2 is a flowchart of a voice recognition method according to a specific embodiment of the present invention.
As shown in Fig. 2, the voice recognition method may comprise:
S201: receiving input voice information and performing voice endpoint detection on the voice information.
For example, the voice information input by the user is "Hello, this is Baidu, may I ask who you are looking for?". Voice endpoint detection can be performed on this voice information, yielding the voice endpoints s1, e1, s2, e2, s3, e3, where s1 is the voice start point of "Hello", e1 is the voice end point of "Hello", s2 is the voice start point of "this is Baidu", e2 is the voice end point of "this is Baidu", s3 is the voice start point of "may I ask who you are looking for", and e3 is the voice end point of "may I ask who you are looking for".
S202: generating a plurality of speech buffers according to the detected voice endpoints.
According to the above voice endpoints, three speech buffers can be generated: "Hello", "this is Baidu", and "may I ask who you are looking for".
S203: performing online recognition on the plurality of speech buffers.
Online recognition is performed in sequence on "Hello", "this is Baidu", and "may I ask who you are looking for".
S204: when the network fails, obtaining the first recognition results corresponding to the speech buffers that have been recognized, and performing offline recognition on the speech buffers that have not been recognized to obtain the corresponding second recognition results.
Suppose that when the network fails, the speech buffers "Hello" and "this is Baidu" have already been recognized; the corresponding first recognition results "Hello" and "this is Baidu", which are obtained by online recognition and are text information, can then be obtained. Offline recognition then starts from the speech buffer "may I ask who you are looking for", finally obtaining the corresponding second recognition result "may I ask who you are looking for", which is obtained by offline recognition and is also text information.
S205: splicing the first recognition results and the second recognition results to generate the final recognition result.
The text "Hello", "this is Baidu", and "may I ask who you are looking for" is spliced, generating the final recognition result, namely the text "Hello, this is Baidu, may I ask who you are looking for?".
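For the Fig. 2 example, a hypothetical end-to-end run of the recognize() sketch above can be simulated with trivial stand-in engines; the strings below stand in for the buffered audio of the three segments.

    class MockEngine:
        """Stand-in engine that simply passes the segment text through (illustration only)."""
        def extract_features(self, segment): return segment
        def decode(self, features): return features
        def to_word_sequence(self, phones): return phones

    class FlakyCloudEngine(MockEngine):
        """Simulates the network dropping after two segments have been recognized online."""
        def __init__(self): self.calls = 0
        def extract_features(self, segment):
            self.calls += 1
            if self.calls > 2:
                raise ConnectionError("network dropped")
            return segment

    segments = ["Hello", "this is Baidu", "may I ask who you are looking for"]  # stand-ins for v1-v3
    print(recognize(segments, FlakyCloudEngine(), MockEngine()))
    # first results (online): "Hello", "this is Baidu"; second result (offline): the third segment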
In the voice recognition method of this embodiment of the present invention, voice endpoint detection is performed on the input voice information, a plurality of speech buffers are generated according to the detected voice endpoints, and online recognition is performed on the speech buffers; when the network fails, the first recognition results that have already been obtained are retrieved, offline recognition is then performed on the speech buffers that have not yet been recognized to obtain the corresponding second recognition results, and finally the first recognition results and the second recognition results are spliced to obtain the final recognition result. This improves the stability and accuracy of speech recognition and thereby improves the user experience.
To achieve the above objects, the present invention further proposes a voice recognition device.
Fig. 3 is a schematic structural diagram of a voice recognition device according to an embodiment of the present invention.
As shown in Fig. 3, the voice recognition device may comprise: a cutting module 110, an online recognition module 120, an acquisition module 130, an offline recognition module 140, and a merging module 150.
The cutting module 110 is configured to receive input voice information and cut the voice information into a plurality of speech buffer segments.
In an embodiment of the present invention, the cutting module 110 may receive the voice information input by the user through an input device such as a microphone, and then cut the received voice information into a plurality of speech buffer segments.
Specifically, the cutting module 110 may obtain a plurality of pairs of voice endpoints based on voice endpoint detection, and then buffer the voice data between each pair of voice endpoints to generate the plurality of speech buffer segments. Each pair of voice endpoints comprises a voice start point and the corresponding voice end point.
For example, the voice input by the user may be analyzed to obtain the voice endpoints s1, e1, s2, e2, s3, e3, ..., where s1 is the first voice start point and e1 is the first voice end point, so the voice data between s1 and e1 is buffered to generate the first speech buffer segment v1; s2 is the second voice start point and e2 is the second voice end point, so the voice data between s2 and e2 is buffered to generate the second speech buffer segment v2, and so on, finally generating the plurality of speech buffer segments.
The online recognition module 120 is configured to perform online recognition on the plurality of speech buffer segments in sequence.
After the cutting module 110 has cut the voice information into a plurality of speech buffer segments, the online recognition module 120 may perform online recognition on the segments in sequence.
Specifically, the online recognition module 120 performs feature extraction on a speech buffer segment through the cloud engine to generate an acoustic feature sequence, decodes the acoustic feature sequence according to an acoustic model and a dictionary to obtain the acoustic model sequence matching the acoustic feature sequence, and finally obtains, according to a language model, the word sequence corresponding to the acoustic model sequence as the first recognition result corresponding to the speech buffer segment. This is consistent with existing online recognition techniques and is therefore not described further here.
The acquisition module 130 is configured to obtain, when the online recognition fails, a plurality of first recognition results corresponding to the speech buffer segments for which online recognition has been completed.
The offline recognition module 140 is configured to perform offline recognition on the speech buffer segments for which online recognition has not been completed when the online recognition fails, and the acquisition module 130 is further configured to obtain a plurality of second recognition results corresponding to the offline recognition.
During online recognition the network may behave abnormally, for example becoming slow or disconnecting, causing the online recognition to fail. In that case the process can switch directly to offline recognition instead of retrying online recognition. For example, if an error occurs while the third speech buffer segment v3 is being recognized online, the recognition results a1 and a2 corresponding to the first two segments, which have already been recognized, can be obtained, and the first and second speech buffer segments v1 and v2 can be deleted at the same time. Offline recognition then starts from the unfinished segment v3.
Specifically, the offline recognition module 140 performs feature extraction on a speech buffer segment through the local engine to generate an acoustic feature sequence, decodes the acoustic feature sequence according to an acoustic model and a dictionary to obtain the acoustic model sequence matching the acoustic feature sequence, and finally obtains, according to a language model, the word sequence corresponding to the acoustic model sequence as the second recognition result corresponding to the speech buffer segment. This is consistent with existing offline recognition techniques and is therefore not described further here.
After offline recognition is complete, the acquisition module 130 can obtain the corresponding plurality of second recognition results.
In this embodiment, the recognition results corresponding to online recognition are the first recognition results, the recognition results corresponding to offline recognition are the second recognition results, and both may be text information. For example, a1 and a2 are results of online recognition and are therefore first recognition results, while a3, obtained by offline recognition, is a second recognition result.
The merging module 150 is configured to merge the plurality of first recognition results and the plurality of second recognition results to generate the final recognition result.
After the acquisition module 130 has obtained the plurality of first recognition results and the plurality of second recognition results, the merging module 150 can perform a merge operation to generate the final recognition result, for example A = a1 + a2 + a3 + ...
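To mirror the module structure of Fig. 3 in code, a minimal sketch reusing the hypothetical helpers above might look like the following; the class and method names are illustrative assumptions, not the patent's implementation.

    class VoiceRecognitionDevice:
        """Sketch of the five modules of Fig. 3: cutting, online recognition, acquisition, offline recognition, merging."""

        def __init__(self, cloud_engine, local_engine):
            self.cloud_engine = cloud_engine     # used by the online recognition module 120
            self.local_engine = local_engine     # used by the offline recognition module 140

        def cut(self, samples):
            # Cutting module 110: endpoint detection and buffering (see cut_into_segments above).
            return cut_into_segments(samples)

        def recognize(self, samples):
            segments = self.cut(samples)
            first, second, failed_at = [], [], None
            for i, segment in enumerate(segments):             # online recognition module 120
                try:
                    first.append(recognize_segment_online(segment, self.cloud_engine))
                except ConnectionError:
                    failed_at = i                               # acquisition module 130 keeps a1..ai
                    break
            if failed_at is not None:                           # offline recognition module 140
                second = [recognize_segment_offline(s, self.local_engine)
                          for s in segments[failed_at:]]
            return "".join(first + second)                      # merging module 150: A = a1 + a2 + ...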
In the voice recognition device of the embodiment of the present invention, the voice information is cut into a plurality of speech buffer segments, online recognition is performed on the segments in sequence, and when the online recognition fails, offline recognition is performed directly on the segments that have not yet been recognized; the first recognition results from online recognition and the second recognition results from offline recognition are then merged. This improves the stability and accuracy of speech recognition and thereby improves the user experience.
In the description of the present invention, it should be understood that the orientation or positional relationships indicated by terms such as "center", "longitudinal", "transverse", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", "axial", "radial", and "circumferential" are based on the orientations or positional relationships shown in the drawings. They are used only to facilitate and simplify the description of the present invention and do not indicate or imply that the referenced device or element must have a specific orientation or be constructed and operated in a specific orientation; they therefore cannot be construed as limiting the present invention.
In addition, the terms "first" and "second" are used only for purposes of description and cannot be understood as indicating or implying relative importance or implying the number of the indicated technical features. Thus, a feature limited by "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality of" means at least two, for example two or three, unless expressly and specifically limited otherwise.
In the present invention, unless otherwise expressly specified and limited, terms such as "mounted", "connected", "coupled", and "fixed" should be understood broadly: for example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct, or indirect through an intermediate medium; and it may be an internal connection between two elements or an interaction between two elements, unless otherwise expressly limited. For a person of ordinary skill in the art, the specific meanings of the above terms in the present invention can be understood according to the specific situation.
In the present invention, unless otherwise expressly specified and limited, a first feature being "on" or "under" a second feature may mean that the first and second features are in direct contact, or that they are in indirect contact through an intermediate medium. Moreover, a first feature being "on", "above", or "over" a second feature may mean that the first feature is directly above or obliquely above the second feature, or may merely mean that the level of the first feature is higher than that of the second feature. A first feature being "under", "below", or "beneath" a second feature may mean that the first feature is directly below or obliquely below the second feature, or may merely mean that the level of the first feature is lower than that of the second feature.
In the description of this specification, a description with reference to the terms "one embodiment", "some embodiments", "an example", "a specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic references to the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, where no contradiction arises, those skilled in the art may combine the different embodiments or examples described in this specification and the features of the different embodiments or examples.
Although embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are exemplary and cannot be construed as limiting the present invention; those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present invention.

Claims (10)

1. A voice recognition method, characterized by comprising:
S1, receiving input voice information, and cutting the voice information into a plurality of speech buffer segments;
S2, performing online recognition on the plurality of speech buffer segments in sequence;
S3, when the online recognition fails, obtaining a plurality of first recognition results corresponding to the speech buffer segments for which online recognition has been completed, performing offline recognition on the speech buffer segments for which online recognition has not been completed, and obtaining a plurality of second recognition results corresponding to the offline recognition; and
S4, merging the plurality of first recognition results and the plurality of second recognition results to generate a final recognition result.
2. The method of claim 1, characterized in that cutting the voice information into a plurality of speech buffer segments comprises:
obtaining a plurality of pairs of voice endpoints based on voice endpoint detection, wherein each pair of voice endpoints comprises a voice start point and a voice end point corresponding to the voice start point; and
buffering the voice data between each pair of voice endpoints to generate the plurality of speech buffer segments.
3. The method of claim 1, characterized in that performing online recognition on the plurality of speech buffer segments in sequence comprises:
performing feature extraction on the speech buffer segment through a cloud engine to generate an acoustic feature sequence;
decoding the acoustic feature sequence according to an acoustic model and a dictionary to obtain an acoustic model sequence matching the acoustic feature sequence; and
obtaining, according to a language model, the word sequence corresponding to the acoustic model sequence as the first recognition result corresponding to the speech buffer segment.
4. The method of claim 1, characterized in that performing offline recognition on the speech buffer segments for which online recognition has not been completed comprises:
performing feature extraction on the speech buffer segment through a local engine to generate an acoustic feature sequence;
decoding the acoustic feature sequence according to an acoustic model and a dictionary to obtain an acoustic model sequence matching the acoustic feature sequence; and
obtaining, according to a language model, the word sequence corresponding to the acoustic model sequence as the second recognition result corresponding to the speech buffer segment.
5. The method of claim 1, characterized in that the first recognition results and the second recognition results are text information.
6. A voice recognition device, characterized by comprising:
a cutting module, configured to receive input voice information and cut the voice information into a plurality of speech buffer segments;
an online recognition module, configured to perform online recognition on the plurality of speech buffer segments in sequence;
an acquisition module, configured to obtain, when the online recognition fails, a plurality of first recognition results corresponding to the speech buffer segments for which online recognition has been completed;
an offline recognition module, configured to perform offline recognition, when the online recognition fails, on the speech buffer segments for which online recognition has not been completed, the acquisition module being further configured to obtain a plurality of second recognition results corresponding to the offline recognition; and
a merging module, configured to merge the plurality of first recognition results and the plurality of second recognition results to generate a final recognition result.
7. The device of claim 6, characterized in that the cutting module is specifically configured to:
obtain a plurality of pairs of voice endpoints based on voice endpoint detection, and buffer the voice data between each pair of voice endpoints to generate the plurality of speech buffer segments, wherein each pair of voice endpoints comprises a voice start point and a voice end point corresponding to the voice start point.
8. The device of claim 6, characterized in that the online recognition module is specifically configured to:
perform feature extraction on the speech buffer segment through a cloud engine to generate an acoustic feature sequence, decode the acoustic feature sequence according to an acoustic model and a dictionary to obtain an acoustic model sequence matching the acoustic feature sequence, and obtain, according to a language model, the word sequence corresponding to the acoustic model sequence as the first recognition result corresponding to the speech buffer segment.
9. The device of claim 6, characterized in that the offline recognition module is specifically configured to:
perform feature extraction on the speech buffer segment through a local engine to generate an acoustic feature sequence, decode the acoustic feature sequence according to an acoustic model and a dictionary to obtain an acoustic model sequence matching the acoustic feature sequence, and obtain, according to a language model, the word sequence corresponding to the acoustic model sequence as the second recognition result corresponding to the speech buffer segment.
10. The device of claim 6, characterized in that the first recognition results and the second recognition results are text information.
CN201510319421.5A 2015-06-11 2015-06-11 Voice recognition method and device Pending CN104916283A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510319421.5A CN104916283A (en) 2015-06-11 2015-06-11 Voice recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510319421.5A CN104916283A (en) 2015-06-11 2015-06-11 Voice recognition method and device

Publications (1)

Publication Number Publication Date
CN104916283A true CN104916283A (en) 2015-09-16

Family

ID=54085312

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510319421.5A Pending CN104916283A (en) 2015-06-11 2015-06-11 Voice recognition method and device

Country Status (1)

Country Link
CN (1) CN104916283A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101409072A (en) * 2007-10-10 2009-04-15 松下电器产业株式会社 Embedded equipment, bimodule voice synthesis system and method
CN102682770A (en) * 2012-02-23 2012-09-19 西安雷迪维护系统设备有限公司 Cloud-computing-based voice recognition system
CN102708865A (en) * 2012-04-25 2012-10-03 北京车音网科技有限公司 Method, device and system for voice recognition
CN103079258A (en) * 2013-01-09 2013-05-01 广东欧珀移动通信有限公司 Method for improving speech recognition accuracy and mobile intelligent terminal
WO2014186143A1 (en) * 2013-05-13 2014-11-20 Facebook, Inc. Hybrid, offline/online speech translation system

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105632487A (en) * 2015-12-31 2016-06-01 北京奇艺世纪科技有限公司 Voice recognition method and device
CN105632487B (en) * 2015-12-31 2020-04-21 北京奇艺世纪科技有限公司 Voice recognition method and device
CN105719642A (en) * 2016-02-29 2016-06-29 黄博 Continuous and long voice recognition method and system and hardware equipment
WO2017219495A1 (en) * 2016-06-21 2017-12-28 宇龙计算机通信科技(深圳)有限公司 Speech recognition method and system
CN110060687A (en) * 2016-09-05 2019-07-26 北京金山软件有限公司 A kind of conversion of voice messaging, information generating method and device
CN107170450A (en) * 2017-06-14 2017-09-15 上海木爷机器人技术有限公司 Audio recognition method and device
CN107767873A (en) * 2017-10-20 2018-03-06 广东电网有限责任公司惠州供电局 A kind of fast and accurately offline speech recognition equipment and method
CN108172212A (en) * 2017-12-25 2018-06-15 横琴国际知识产权交易中心有限公司 A kind of voice Language Identification and system based on confidence level
CN108172212B (en) * 2017-12-25 2020-09-11 横琴国际知识产权交易中心有限公司 Confidence-based speech language identification method and system
WO2019134474A1 (en) * 2018-01-08 2019-07-11 珠海格力电器股份有限公司 Voice control method and device
CN109065037A (en) * 2018-07-10 2018-12-21 福州瑞芯微电子股份有限公司 A kind of audio method of flow control based on interactive voice
CN109065037B (en) * 2018-07-10 2023-04-25 瑞芯微电子股份有限公司 Audio stream control method based on voice interaction
CN109064815A (en) * 2018-09-04 2018-12-21 北京粉笔未来科技有限公司 Online testing method and apparatus calculate equipment and storage medium
CN109410927A (en) * 2018-11-29 2019-03-01 北京蓦然认知科技有限公司 Offline order word parses the audio recognition method combined, device and system with cloud
CN109410927B (en) * 2018-11-29 2020-04-03 北京蓦然认知科技有限公司 Voice recognition method, device and system combining offline command word and cloud analysis
CN109741753A (en) * 2019-01-11 2019-05-10 百度在线网络技术(北京)有限公司 A kind of voice interactive method, device, terminal and server
CN109840052A (en) * 2019-01-31 2019-06-04 成都超有爱科技有限公司 A kind of audio-frequency processing method, device, electronic equipment and storage medium
CN109840052B (en) * 2019-01-31 2022-03-18 成都超有爱科技有限公司 Audio processing method and device, electronic equipment and storage medium
CN111210822A (en) * 2020-02-12 2020-05-29 支付宝(杭州)信息技术有限公司 Speech recognition method and device
CN111445911A (en) * 2020-03-28 2020-07-24 大连鼎创科技开发有限公司 Home offline online voice recognition switching logic method

Similar Documents

Publication Publication Date Title
CN104916283A (en) Voice recognition method and device
CN107195295B (en) Voice recognition method and device based on Chinese-English mixed dictionary
CN107731228B (en) Text conversion method and device for English voice information
US9390711B2 (en) Information recognition method and apparatus
CN111309889A (en) Method and device for text processing
CN112487173B (en) Man-machine conversation method, device and storage medium
CN112100349A (en) Multi-turn dialogue method and device, electronic equipment and storage medium
US9601107B2 (en) Speech recognition system, recognition dictionary registration system, and acoustic model identifier series generation apparatus
CN107168546B (en) Input prompting method and device
CN105513590A (en) Voice recognition method and device
CN108710704B (en) Method and device for determining conversation state, electronic equipment and storage medium
CN109243468B (en) Voice recognition method and device, electronic equipment and storage medium
CN104992704A (en) Speech synthesizing method and device
CN102322866B (en) Navigation method and system based on natural speech recognition
CN112509566B (en) Speech recognition method, device, equipment, storage medium and program product
CN103177721A (en) Voice recognition method and system
CN103594085A (en) Method and system providing speech recognition result
CN103514882A (en) Voice identification method and system
CN109410923B (en) Speech recognition method, apparatus, system and storage medium
CN113282736B (en) Dialogue understanding and model training method, device, equipment and storage medium
CN112861548A (en) Natural language generation and model training method, device, equipment and storage medium
CN104714954A (en) Information searching method and system based on context understanding
CN113793599B (en) Training method of voice recognition model, voice recognition method and device
CN105070289A (en) English name recognition method and device
CN105469801A (en) Input speech restoring method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150916

RJ01 Rejection of invention patent application after publication