CN109545203A - Speech recognition method, apparatus, device and storage medium

Info

Publication number: CN109545203A
Application number: CN201811534858.0A
Authority: CN
Inventors: 俞诗洪
Original assignee: OneConnect Smart Technology Co Ltd
Current assignee: OneConnect Smart Technology Co Ltd
Priority date: 2018-12-14
Filing date: 2018-12-14
Publication date: 2019-03-29

Abstract

Embodiments of the present invention relate to a speech recognition method, apparatus, computer device and storage medium. The method includes: after detecting that a user has input speech, establishing a dialogue with the user; if a homonym corresponding to multiple meanings is detected while performing speech recognition on the voice information, initiating a clarification dialogue for the homonym so that the user can confirm its correct meaning; and finally determining the correct meaning of the homonym according to the user's reply and the context in the session. Compared with the existing single-interaction recognition mode, the embodiments use a two-way human-machine interaction mode: by establishing a one-to-one session with the user, they provide scenario support for speech recognition, so that the speech model can better understand the meaning expressed by the voice information through the context in the session. In addition, the method can initiate a clarification session for a homonym so that the user confirms its meaning, which improves the accuracy of speech recognition.

Description

Speech recognition method, apparatus, device and storage medium

Technical field

Embodiments of the present invention relate to the technical field of data processing, and in particular to a speech recognition method, apparatus, computer device and storage medium.

Background

Speech recognition is currently a popular technical field. Speech recognition technology can be applied in many products, such as mobile phones, wearable devices and smart home devices, and a user can control a device simply by speaking. The speech recognition technology on current platforms is traditional single-interaction recognition: the machine answers only the current question and is confined to a single-turn dialogue.

For example:

User: What is good to eat in "Zhongshan"?

Machine: OK, I found the following restaurants: (nearby restaurants may be recommended to the user by default)

User: I don't feel like eating.

Machine: OK.

However, in this recognition mode the machine answers only the current question and is confined to a single-turn dialogue, so it lacks the support of conversational context, and its accuracy in distinguishing homonyms and polysemous words is low.

Summary of the invention

In view of this, embodiments of the present invention provide a speech recognition method, apparatus, device and storage medium for improving the recognition accuracy of homonyms during speech recognition.

In a first aspect, an embodiment of the present invention provides a speech recognition method, which may include:

after detecting that a user has input voice information, generating a session with the user through a session manager according to first information, wherein the first information is detected characteristic information characterizing the user, or a preset time period;

in the session, while performing speech recognition on the voice information input by the user, if it is determined that the voice information contains a homonym corresponding to multiple semantic results, initiating a clarification dialogue for the homonym, the clarification dialogue being used to confirm with the user the correct meaning of the homonym;

after detecting the user's reply to the clarification dialogue, determining the correct meaning of the homonym according to the reply and the context in the session.

Optionally, determining that the voice information contains a homonym corresponding to multiple semantic results comprises:

recognizing the voice information to obtain a plurality of corresponding syllables;

performing a word segmentation operation on the obtained syllables to obtain a word segmentation result;

performing semantic understanding on the word segmentation result, and if a first segment corresponds to multiple meanings, determining that the first segment is a homonym.

Optionally, the clarification dialogue includes the multiple meanings and a mark corresponding to each meaning, and the conventional meanings of the mark and their corresponding weights are stored in advance in the model used for speech recognition;

the user's reply to the clarification dialogue is the mark corresponding to the correct meaning;

determining the correct meaning of the homonym according to the reply comprises:

storing in the model the correct meaning corresponding to the mark replied by the user, and setting the weight corresponding to the correct meaning to be greater than the weights corresponding to the conventional meanings of the mark;

sorting all meanings corresponding to the mark by weight in descending order, and determining the first-ranked meaning as the correct meaning of the homonym.

Optionally, the method further comprises:

when a session termination condition is detected to be met, ending the session;

deleting from the model the correct meaning corresponding to the mark.

Optionally, the method further comprises:

completing the recognition result of the voice information with the correct meaning of the homonym, and performing semantic understanding on the recognition result.

Optionally, the method further comprises:

searching for corresponding reply content according to the result of the semantic understanding, and presenting the reply content.

Optionally, the characteristic information of the user includes: account information of the user or voiceprint information of the user.

In a second aspect, an embodiment of the present invention provides a speech recognition apparatus, comprising:

a session control unit, configured to generate, after detecting that a user has input voice information, a session with the user through a session manager according to first information, wherein the first information is detected characteristic information characterizing the user, or a preset time period;

a semantic understanding unit, configured to, in the session, while performing speech recognition on the voice information, initiate a clarification dialogue for a homonym if it is determined that the voice information contains a homonym corresponding to multiple semantic results, the clarification dialogue being used to confirm with the user the correct meaning of the homonym;

the semantic understanding unit being further configured to, after detecting the user's reply to the clarification dialogue, determine the correct meaning of the homonym according to the reply and the context in the session.

In some embodiments, the semantic understanding unit determining that the voice information contains a homonym corresponding to multiple semantic results comprises:

recognizing the voice information to obtain a plurality of corresponding syllables;

In some embodiments, the clarification dialogue includes the multiple meanings and a mark corresponding to each meaning, and the conventional meanings of the mark and their corresponding weights are stored in advance in the model used for speech recognition;

The semantic understanding unit determining the correct meaning of the homonym according to the reply comprises:

In some embodiments, the session control unit is further configured to:

when a session termination condition is detected to be met, end the session;

delete from the model the correct meaning corresponding to the mark.

In some embodiments, the semantic understanding unit is further configured to:

In some embodiments, the characteristic information of the user includes: account information of the user or voiceprint information of the user.

In a third aspect, an embodiment of the present invention provides a computer device including a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of the speech recognition method according to the first aspect.

In a fourth aspect, an embodiment of the present invention provides a storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the speech recognition method according to the first aspect.

Embodiments of the present invention provide a speech recognition method, apparatus, computer device and storage medium. The method includes: after detecting that a user has input speech, establishing a dialogue with the user; if a homonym corresponding to multiple meanings is detected while performing speech recognition on the voice information, initiating a clarification dialogue for the homonym so that the user can confirm its correct meaning; and finally determining the correct meaning of the homonym according to the user's reply and the context in the session. Compared with the existing single-interaction recognition mode, the embodiments use a two-way human-machine interaction mode: by establishing a one-to-one session with the user, they provide scenario support for speech recognition, so that the speech model can better understand the meaning expressed by the voice information through the context in the session. In addition, the method can initiate a clarification session for a homonym so that the user confirms its meaning, which improves the accuracy of speech recognition.

Brief description of the drawings

Fig. 1 is a block diagram of the internal structure of a computer device in one embodiment;

Fig. 2 is a flowchart of a speech recognition method in one embodiment;

Fig. 3 is a structural block diagram of a speech recognition apparatus in one embodiment.

Detailed description of the embodiments

In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit it.

In a first aspect, Fig. 1 is a schematic diagram of the internal structure of a computer device in one embodiment. As shown in Fig. 1, the computer device includes a processor, a non-volatile storage medium, a memory and a network interface connected by a system bus. The non-volatile storage medium of the computer device stores an operating system, a database and computer-readable instructions. The database may store a control information sequence, and the computer-readable instructions, when executed by the processor, cause the processor to implement a speech recognition method. The processor of the computer device provides computing and control capabilities and supports the operation of the entire computer device. The memory of the computer device may also store computer-readable instructions which, when executed by the processor, cause the processor to perform a speech recognition method. The network interface of the computer device is used to connect and communicate with the outside. Those skilled in the art will understand that the structure shown in Fig. 1 is only a block diagram of the part of the structure relevant to the solution of the present application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.

In a second aspect, as shown in Fig. 2, an embodiment of the present invention provides a speech recognition method, comprising the following steps:

S201, after detecting that a user has input voice information, generating a session with the user through a session manager according to first information;

Here, detecting that the user has input voice information may mean detecting that the user has pressed a voice input button, or detecting that the user has directly input voice information.

The first information may be detected characteristic information characterizing the user, for example the user's ID, the user's account name, the user's voiceprint information, or any other information that can identify the user. Besides the characteristic information of the user, the first information may also be a preset time period, which can be set according to the actual situation; that is, a session of a preset duration can be established, for example a 10-minute session with the user. Alternatively, the first information may be a combination of the characteristic information of the user and the preset time period, for example establishing a one-to-one session with the user according to the user's characteristic information and setting the duration of the session to 10 minutes.

The session manager is a container that manages all sessions of a given web application. Here the session manager may generate a one-to-one dialogue with the user according to the user's ID, account name or voiceprint information, or a preset time period. After generating the dialogue, the session manager maintains the session until a session end condition is detected. The session end condition may be: the user account or ID is detected to have logged out, or the preset time period has been reached. Once the session end condition is detected, the session ends automatically.
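The session handling described for S201 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the names `Session`, `SessionManager`, `get_or_create` and `end`, the in-memory dictionary, and the use of seconds for the preset period are all hypothetical.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Session:
    """One-to-one session with a single user (hypothetical structure)."""
    user_key: str                       # user ID, account name, or voiceprint ID
    expires_at: Optional[float]         # None means no preset period was given
    context: List[str] = field(default_factory=list)  # utterances so far


class SessionManager:
    """Creates, maintains and ends sessions keyed by user identity."""

    def __init__(self) -> None:
        self._sessions: Dict[str, Session] = {}

    def get_or_create(self, user_key: str, period_s: Optional[float] = None,
                      now: Optional[float] = None) -> Session:
        now = time.time() if now is None else now
        session = self._sessions.get(user_key)
        # End condition 1: the preset period has been reached.
        if session is not None and self._expired(session, now):
            self.end(user_key)
            session = None
        if session is None:
            expires = None if period_s is None else now + period_s
            session = Session(user_key, expires)
            self._sessions[user_key] = session
        return session

    def end(self, user_key: str) -> None:
        """End condition 2: the user account or ID has logged out."""
        self._sessions.pop(user_key, None)

    @staticmethod
    def _expired(session: Session, now: float) -> bool:
        return session.expires_at is not None and now >= session.expires_at
```

Passing `now` explicitly is only a convenience for testing; a caller would normally omit it and let the manager read the clock.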

S202, in the session, while performing speech recognition on the voice information, if it is determined that the voice information contains a homonym corresponding to multiple semantic results, initiating a clarification dialogue for the homonym, the clarification dialogue being used to confirm with the user the correct meaning corresponding to the homonym;

Specifically, since the session manager generated a one-to-one dialogue with the user in S201, recognition of this user's voice information is also carried out within the corresponding session.

A homonym detected during speech recognition is a word whose syllables can correspond to multiple meanings. For example, the user says "how is the weather in zhong shan". Here the syllables zhong shan may correspond to the "Zhongshan" ("middle mountain") of Zhongshan City, or to the "Zhongshan" ("Zhong Mountain") of Zhongshan County. These meanings are learned by the speech recognition model from training on a large amount of voice data, and the model determines the probability of each meaning of the syllables from the frequency with which that meaning occurs in the training data. It can be understood that the more frequently a meaning occurs, the higher its probability. When the prior art detects such a homonym, it directly takes the meaning with the highest probability as the correct meaning of the syllables. In some cases, however, the meaning selected in this way is not what the user actually intends. For example, the user means "Zhong Mountain", but because "middle mountain" has the highest probability among the meanings of zhong shan, the recognition result is always "middle mountain", which gives the user a poor experience. Therefore, in the method provided by the embodiment of the present invention, a clarification dialogue is initiated to the user when such a homonym is detected. The clarification dialogue may take the form of a question back to the user that lists all meanings corresponding to the syllables and asks which meaning the user actually intends, instead of inferring the user's intent by default from big data as in the prior art. For the example above, the clarification dialogue may ask the user whether zhong shan is the "middle mountain" of Zhongshan City or the "Zhong Mountain" of Zhongshan County. The clarification dialogue can of course take many forms, which the embodiment of the present invention does not specifically limit.

S203, after detecting the user's reply to the clarification dialogue, determining the correct meaning of the homonym according to the reply and the context in the session.

The user's reply to the clarification dialogue indicates the meaning that the user actually intends for the syllables. Therefore, the correct meaning corresponding to the syllables can be determined from this reply together with the context of the session conducted with the user.

In the method provided by the embodiment of the present invention, after detecting that a user has input speech, a dialogue with the user is established; if a homonym corresponding to multiple meanings is detected while performing speech recognition on the voice information, a clarification dialogue for the homonym is initiated so that the user can confirm its correct meaning; and the correct meaning of the homonym is finally determined according to the user's reply and the context in the session. Compared with the existing single-interaction recognition mode, the embodiment uses a two-way human-machine interaction mode: by establishing a one-to-one session with the user, it provides scenario support for speech recognition, so that the speech model can better understand the meaning expressed by the voice information through the context in the session. In addition, the method can initiate a clarification session for a homonym so that the user confirms its meaning, which improves the accuracy of speech recognition.

In some embodiments, there are many ways to determine in step S202 whether the voice information contains a homonym. One optional implementation is:

S2021, recognizing the voice information to obtain a plurality of corresponding syllables;

For example, the user says: "How is the weather in zhong shan?" After the voice data input by the user is detected, natural language understanding (NLU) technology may be used to recognize the speech as individual syllables, giving: zhong shan de tian qi zen me yang.

S2022, performing a word segmentation operation on the obtained syllables to obtain a word segmentation result;

Segmentation may also be performed here using NLU technology. Word segmentation algorithms are relatively mature: a model is trained on a large amount of data, and after receiving new input, the model uses what it learned in training to judge whether two syllables are likely to form a word; if the probability is high, the two syllables are combined into one word. For example, segmenting zhong shan de tian qi zen me yang gives: zhong shan, de, tian qi, zen me yang.
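To make the segmentation step concrete, the sketch below groups syllables into words. Note that it swaps in a simple dictionary-based forward maximum matching instead of the trained probabilistic model the description refers to; the lexicon contents, `MAX_WORD_LEN` and the function name `segment` are assumptions chosen only to reproduce the example.

```python
from typing import List, Set, Tuple

# Hypothetical lexicon of known multi-syllable words, stored as syllable tuples.
LEXICON: Set[Tuple[str, ...]] = {
    ("zhong", "shan"),
    ("tian", "qi"),
    ("zen", "me", "yang"),
}
MAX_WORD_LEN = 3  # longest word in the lexicon, in syllables


def segment(syllables: List[str]) -> List[str]:
    """Forward maximum matching: take the longest lexicon word at each position."""
    words: List[str] = []
    i = 0
    while i < len(syllables):
        longest = min(MAX_WORD_LEN, len(syllables) - i)
        for n in range(longest, 0, -1):
            candidate = tuple(syllables[i:i + n])
            # A single syllable is always accepted as a fallback word.
            if n == 1 or candidate in LEXICON:
                words.append(" ".join(candidate))
                i += n
                break
    return words


segments = segment("zhong shan de tian qi zen me yang".split())
# segments == ["zhong shan", "de", "tian qi", "zen me yang"]
```

A trained model would replace the lexicon lookup with a learned score for whether adjacent syllables form a word, but the input and output shapes would stay the same.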

S2023, performing semantic understanding on the word segmentation result, and if a first segment corresponds to multiple meanings, determining that the first segment is a homonym.

After segmentation, semantic understanding is performed on each segment, that is, each segment is converted from syllables into text. For example, zhong shan, de, tian qi, zen me yang is converted into "how is the weather in zhong shan". Here zhong shan is not converted into text because it may correspond to the "Zhongshan" ("middle mountain") of Zhongshan City or to the "Zhongshan" ("Zhong Mountain") of Zhongshan County; that is, the segment corresponds to multiple meanings, so the segment is a homonym.
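The check in S2023 amounts to looking up each segment and flagging those with more than one candidate meaning. The following sketch assumes a hypothetical lookup table `MEANINGS` with made-up probabilities; it also shows, for contrast, the highest-probability selection that the description attributes to the prior art.

```python
from typing import Dict, List, Tuple

# Hypothetical table: segment -> candidate meanings with illustrative probabilities.
MEANINGS: Dict[str, List[Tuple[str, float]]] = {
    "zhong shan": [("Zhongshan City", 0.8), ("Zhongshan County", 0.2)],
    "de": [("of", 1.0)],
    "tian qi": [("weather", 1.0)],
    "zen me yang": [("how is", 1.0)],
}


def find_homonyms(segments: List[str]) -> List[str]:
    """Return the segments that map to more than one meaning."""
    return [s for s in segments if len(MEANINGS.get(s, [])) > 1]


def baseline_pick(segment: str) -> str:
    """Prior-art style choice: always take the most probable meaning."""
    return max(MEANINGS[segment], key=lambda pair: pair[1])[0]


segments = ["zhong shan", "de", "tian qi", "zen me yang"]
ambiguous = find_homonyms(segments)   # only "zhong shan" has two meanings
default = baseline_pick("zhong shan")  # "Zhongshan City", regardless of intent
```

In the method of the embodiment, a non-empty `ambiguous` list would trigger a clarification dialogue rather than a call to `baseline_pick`.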

In some embodiments, the clarification dialogue initiated for a homonym in step S202 can take many forms. One optional implementation includes the multiple meanings and a mark corresponding to each meaning. The mark may be a number, a letter, or the like. For example, the homonym zhong shan may correspond to the meanings "middle mountain" and "Zhong Mountain". It can be understood that, because the clarification dialogue is also played to the user as speech, a dialogue containing only "middle mountain" and "Zhong Mountain" would sound identical to the user and could not be told apart. The meanings included in the clarification dialogue therefore need to be ones that the user can distinguish by sound; that is, the clarification dialogue may contain "Zhongshan City" and "Zhongshan County", which together with the corresponding marks gives 1: "Zhongshan City", 2: "Zhongshan County".

As noted above, the purpose of the clarification dialogue is to let the user confirm the correct meaning of the homonym, so in addition to the meanings and marks, the dialogue also needs to explain its purpose to the user. For example: the user says "How is the weather in zhong shan?" and the machine replies "Which one do you mean? 1: "Zhongshan City", 2: "Zhongshan County"". This lets the user know that zhong shan is a homonym and that the user needs to confirm which meaning is correct. The above is only one optional form of the clarification dialogue; other forms that allow the user to confirm the correct meaning may also be used, and the embodiment of the present invention does not specifically limit this.
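The mark-based clarification format can be illustrated with a short sketch. The function name `build_clarification`, the numeric marks and the exact prompt wording are assumptions; the description leaves the form of the dialogue open.

```python
from typing import Dict, List, Tuple


def build_clarification(candidates: List[str]) -> Tuple[str, Dict[str, str]]:
    """Number the audibly distinct candidates and build the question text.

    Returns the prompt to play to the user and the mark -> meaning mapping
    needed later to interpret the reply.
    """
    marks = {str(i): meaning for i, meaning in enumerate(candidates, start=1)}
    options = ", ".join(f'{mark}: "{meaning}"' for mark, meaning in marks.items())
    prompt = f"Which one do you mean? {options}"
    return prompt, marks


prompt, marks = build_clarification(["Zhongshan City", "Zhongshan County"])
```

Keeping the `marks` mapping in the session context is what later allows a bare reply such as "1" to be read as "Zhongshan City".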

In some embodiments, step S203 of the above method embodiment, determining the correct meaning of the homonym according to the user's reply and the context in the session, can be implemented in many ways. One optional implementation is described below.

After a clarification dialogue has been initiated, the user can reply according to the format the session requires. For the clarification dialogue described above, the user can reply with the mark corresponding to the correct meaning. That is:

Machine: Which one do you mean? 1: "Zhongshan City", 2: "Zhongshan County";

User: It is 1.

From the context of this session it is easy to see that, although the user replied 1, the meaning intended is Zhongshan City, which 1 represented in the machine's previous question. However, 1 also has other meanings of its own, such as the number 1, so the meaning of 1 must be understood from the context in order to obtain the correct meaning. One specific implementation provided by the embodiment of the present invention for determining the correct meaning of the homonym according to the reply and the context may include:

S2031, storing in the model the correct meaning corresponding to the mark replied by the user, and setting the weight corresponding to the correct meaning to be greater than the weights corresponding to the conventional meanings of the mark;

Specifically, the format of the clarification dialogue and the mark for each meaning in that format are set in advance. It can be understood that a mark may have some conventional meanings; for example, 1 also means the number 1 itself. The conventional meanings of the mark and their corresponding weights can be stored in advance in the model used for speech recognition. After the user's reply to the clarification dialogue is detected, the correct meaning given by the reply is also stored in the model, and its weight is set greater than the weights of the conventional meanings.

For example, the meanings of mark 1 may include: the number 1 itself (weight w1), meaning 1 (weight w2), and meaning 2 (weight w3). These are stored in the model in advance. After the user answers 1, information completion technology is used to label 1 as Zhongshan City, and the weight of "1 represents Zhongshan City" is set greater than the weight of "1 is the number 1". That is, after completion the meanings of 1 are: the number 1 itself (weight w1), meaning 1 (weight w2), meaning 2 (weight w3), and Zhongshan City (weight w4), where w4 is greater than w1, w2 and w3.

S2032, sorting all meanings corresponding to the mark by weight in descending order, and determining the first-ranked meaning as the correct meaning of the homonym.

Continuing the example above, sorting the meanings of 1 by weight shows that w4 is greater than w1, w2 and w3, so the meaning corresponding to w4 (Zhongshan City) is taken as the correct meaning of the homonym zhong shan.
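Steps S2031 and S2032 can be sketched as a weight table plus a descending sort. The concrete weights, the "+1.0" rule for choosing w4, and the function names are illustrative assumptions; the description only requires that the new weight exceed the conventional ones.

```python
from typing import Dict


def add_confirmed_meaning(meanings: Dict[str, float], confirmed: str) -> None:
    """S2031: store the confirmed meaning with a weight above all existing ones."""
    meanings[confirmed] = max(meanings.values(), default=0.0) + 1.0


def top_meaning(meanings: Dict[str, float]) -> str:
    """S2032: sort by weight in descending order and return the first entry."""
    ranked = sorted(meanings.items(), key=lambda item: item[1], reverse=True)
    return ranked[0][0]


# Conventional meanings of the mark "1" with hypothetical weights w1, w2, w3.
mark_1 = {"the number 1": 3.0, "meaning 1": 2.0, "meaning 2": 1.0}
before = top_meaning(mark_1)                 # "the number 1"
add_confirmed_meaning(mark_1, "Zhongshan City")  # w4 = 4.0 > w1, w2, w3
after = top_meaning(mark_1)                  # "Zhongshan City"
```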

It can be understood that, for mark 1, the newly added meaning "Zhongshan City" is useful only in this dialogue with this user. At other times, or for other users, this meaning is useless and could even interfere with semantic understanding. Therefore, when the end of this session is detected, the newly added meaning is deleted, so that understanding of the mark at other times is not affected.
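One way to keep the added meaning scoped to a single session is to record what was added and remove exactly those entries when the session ends. The class `SessionScopedMeanings` and its methods are hypothetical, and the sketch assumes the confirmed meaning does not already exist among the conventional meanings of the mark.

```python
from typing import Dict, List, Tuple


class SessionScopedMeanings:
    """Mark -> {meaning: weight} table with session-only additions."""

    def __init__(self, conventional: Dict[str, Dict[str, float]]) -> None:
        # Copy so the caller's conventional table is never mutated.
        self._weights = {mark: dict(ms) for mark, ms in conventional.items()}
        self._added: List[Tuple[str, str]] = []  # (mark, meaning) added this session

    def learn_for_session(self, mark: str, meaning: str) -> None:
        meanings = self._weights.setdefault(mark, {})
        meanings[meaning] = max(meanings.values(), default=0.0) + 1.0
        self._added.append((mark, meaning))

    def best(self, mark: str) -> str:
        meanings = self._weights[mark]
        return max(meanings, key=meanings.get)

    def end_session(self) -> None:
        """Delete every meaning added during the session."""
        for mark, meaning in self._added:
            self._weights[mark].pop(meaning, None)
        self._added.clear()


model = SessionScopedMeanings({"1": {"the number 1": 1.0}})
model.learn_for_session("1", "Zhongshan City")
during = model.best("1")   # "Zhongshan City" while the session is open
model.end_session()
afterwards = model.best("1")  # back to "the number 1"
```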

After the correct meaning of the homonym has been determined, the method provided by the embodiment of the present invention may further include:

S204, completing the recognition result of the voice information with the correct meaning of the homonym, and performing semantic understanding on the recognition result.

Continuing the example above, after it is determined that the correct meaning expressed by zhong shan is "middle mountain", zhong shan is replaced with "Zhongshan" (the "middle mountain" of Zhongshan City), and "zhong shan de tian qi zen me yang" is recognized as "how is the weather in Zhongshan".
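The completion in S204 is essentially a substitution of the confirmed meaning into the segmented result. A minimal sketch, assuming a hypothetical `complete_result` helper and a dictionary of resolved homonyms:

```python
from typing import Dict, List


def complete_result(segments: List[str], resolved: Dict[str, str]) -> List[str]:
    """Replace each resolved homonym segment with its confirmed meaning.

    Segments without an entry in `resolved` are passed through unchanged.
    """
    return [resolved.get(segment, segment) for segment in segments]


segments = ["zhong shan", "de", "tian qi", "zen me yang"]
completed = complete_result(segments, {"zhong shan": "Zhongshan City"})
# completed == ["Zhongshan City", "de", "tian qi", "zen me yang"]
```

The completed list would then be passed on to semantic understanding as a whole, as the step describes.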

After the recognition result has been determined, the method provided by the embodiment of the present invention further includes:

S205, searching for corresponding reply content according to the result of the semantic understanding, and presenting the reply content.

For example, if it is determined that the user wants the weather information for Zhongshan, the weather information for Zhongshan is searched for, and after the search result is obtained it is presented to the user, for example displayed as text or broadcast by voice.

In a third aspect, as shown in Fig. 3, an embodiment of the present invention provides a speech recognition apparatus. The speech recognition apparatus can be integrated in the above computer device 110 and may include a session control unit 301 and a semantic understanding unit 302.

The session control unit 301 is configured to generate, after detecting that a user has input voice information, a session with the user through a session manager according to first information, wherein the first information is detected characteristic information characterizing the user, or a preset time period;

The semantic understanding unit 302 is configured to, in the session, while performing speech recognition on the voice information, initiate a clarification dialogue for a homonym if it is determined that the voice information contains a homonym corresponding to multiple semantic results, the clarification dialogue being used to confirm with the user the correct meaning of the homonym;

The semantic understanding unit 302 is further configured to, after detecting the user's reply to the clarification dialogue, determine the correct meaning of the homonym according to the reply and the context in the session.

In some embodiments, the semantic understanding unit 302 determining that the voice information contains a homonym corresponding to multiple semantic results comprises:

recognizing the voice information to obtain a plurality of corresponding syllables;

The semantic understanding unit 302 determining the correct meaning of the homonym according to the reply comprises:

In some embodiments, the session control unit 301 is further configured to:

when a session termination condition is detected to be met, end the session;

delete from the model the correct meaning corresponding to the mark.

In some embodiments, the semantic understanding unit 302 is further configured to:

In a fourth aspect, an embodiment of the present invention provides a computer device. The computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the steps described in the method embodiment of the first aspect when executing the computer program.

Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The computer program can be stored in a computer-readable storage medium, and when executed, the program may include the processes of the embodiments of the above methods. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc or a read-only memory (ROM), or a random access memory (RAM), etc.

The technical features of the embodiments described above can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.

The embodiments described above express only several implementations of the present invention, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims

1. A speech recognition method, characterized by comprising:

after detecting that a user has input voice information, generating a session with the user through a session manager according to first information, wherein the first information is detected characteristic information characterizing the user and/or a preset time period;

in the session, while performing speech recognition on the voice information input by the user, if it is determined that the voice information contains a homonym corresponding to multiple semantic results, initiating a clarification dialogue for the homonym, the clarification dialogue being used to confirm with the user the correct meaning corresponding to the homonym;

after detecting the user's reply to the clarification dialogue, determining the correct meaning of the homonym according to the reply and the context in the session.

2. The method according to claim 1, wherein determining that the voice information contains a homonym corresponding to multiple semantic results comprises:

recognizing the voice information to obtain a plurality of corresponding syllables;

performing semantic understanding on the word segmentation result, and if a first segment corresponds to multiple meanings, determining that the first segment is a homonym.

3. The method according to claim 1, wherein the clarification dialogue includes the multiple meanings and a mark corresponding to each meaning, and the conventional meanings of the mark and their corresponding weights are stored in advance in the model used for speech recognition;

storing in the model the correct meaning corresponding to the mark replied by the user, and setting the weight corresponding to the correct meaning to be greater than the weights corresponding to the conventional meanings of the mark;

sorting all meanings corresponding to the mark by weight in descending order, and determining the first-ranked meaning as the correct meaning of the homonym.

4. The method according to claim 3, characterized in that the method further comprises:

when a session termination condition is detected to be met, ending the session;

deleting from the model the correct meaning corresponding to the mark.

5. The method according to claim 1, wherein the method further comprises:

completing the recognition result of the voice information with the correct meaning of the homonym, and performing semantic understanding on the recognition result.

6. The method according to claim 5, characterized in that the method further comprises:

7. The method according to any one of claims 1-6, characterized in that the characteristic information of the user includes: account information of the user or voiceprint information of the user.

8. A speech recognition apparatus, characterized by comprising:

a session control unit, configured to generate, after detecting that a user has input voice information, a session with the user through a session manager according to first information, wherein the first information is detected characteristic information characterizing the user, or a preset time period;

a semantic understanding unit, configured to, in the session, while performing speech recognition on the voice information, initiate a clarification dialogue for a homonym if it is determined that the voice information contains a homonym corresponding to multiple semantic results, the clarification dialogue being used to confirm with the user the correct meaning of the homonym;

the semantic understanding unit being further configured to, after detecting the user's reply to the clarification dialogue, determine the correct meaning of the homonym according to the reply and the context in the session.

9. A computer device, including a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of the speech recognition method according to any one of claims 1 to 7.

10. A storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the speech recognition method according to any one of claims 1 to 7.