CN110060662A - Speech recognition method and device

Info

Publication number: CN110060662A
Application number: CN201910293280.2A
Authority: CN
Inventors: 马赛; 杜念冬
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2019-04-12
Filing date: 2019-04-12
Publication date: 2019-07-26
Anticipated expiration: 2039-04-12
Also published as: CN110060662B

Abstract

The present invention proposes a speech recognition method and device. The method includes: obtaining a voice to be recognized and parameter information, where the parameter information includes the current mode, the identification serial number of the voice, internal and external noise information, and azimuth information; extracting the feature vector corresponding to the voice; determining, according to the current mode and the identification serial number, whether the voice is a non-first voice in the single-wake multi-recognition mode; if so, obtaining a speech recognition result, an acoustic determination result, and a semantic determination result according to the parameter information and the feature vector, and determining, according to the acoustic determination result and the semantic determination result, whether the voice belongs to the music field; and if the voice belongs to the music field, determining the instruction and/or resource corresponding to the voice according to the speech recognition result. Because the internal and external noise information, the azimuth information, and similar parameters are collected, the accuracy of speech recognition is improved, and voices in the music field can be identified automatically and accurately in the single-wake multi-recognition mode, which makes it easier to subsequently provide the user with a high-quality music service.

Description

Speech recognition method and device

Technical field

The present invention relates to the field of voice processing technology, and in particular to a speech recognition method and device.

Background art

With the development of artificial intelligence technology, more and more conversational artificial intelligence products have emerged. Taking a smart speaker as an example of a conversational artificial intelligence product, voice wake-up technology and speech recognition technology are its core technologies and directly affect the user's interactive experience with the smart speaker.

In practical application scenarios, a smart speaker faces a complex environment. For example, various internal and external noises (such as the sound of indoor TV programs and outdoor vehicle noise), the voices of people other than the user, and unclear articulation can all interfere with speech recognition. At present, parameters such as internal and external noise information and azimuth information are not collected when a voice is collected, so the speech recognition result is prone to deviation and has poor accuracy, which degrades the interaction.

Summary of the invention

The present invention aims to solve, at least to some extent, one of the technical problems in the related art.

To this end, the first object of the present invention is to propose a speech recognition method, to solve the problem in the prior art that the speech recognition result is prone to deviation and has poor accuracy, which degrades the interaction.

The second object of the present invention is to propose a speech recognition device.

The third object of the present invention is to propose another speech recognition device.

The fourth object of the present invention is to propose a non-transitory computer-readable storage medium.

The fifth object of the present invention is to propose a computer program product.

To achieve the above objects, an embodiment of the first aspect of the present invention proposes a speech recognition method, comprising:

obtaining a voice to be recognized and parameter information, where the parameter information includes: the current mode, the identification serial number of the voice, internal and external noise information, and azimuth information;

extracting the feature vector corresponding to the voice;

determining, according to the current mode and the identification serial number, whether the voice is a non-first voice in the single-wake multi-recognition mode;

if the voice is a non-first voice in the single-wake multi-recognition mode, obtaining a speech recognition result, an acoustic determination result, and a semantic determination result according to the parameter information and the feature vector, and determining, according to the acoustic determination result and the semantic determination result, whether the voice belongs to the music field; and

if the voice belongs to the music field, determining the instruction and/or resource corresponding to the voice according to the speech recognition result.

Further, determining, according to the current mode and the identification serial number, whether the voice is a non-first voice in the single-wake multi-recognition mode comprises:

judging whether the current mode is the single-wake multi-recognition mode; and

if the current mode is the single-wake multi-recognition mode, determining, according to the identification serial number, whether the voice is a non-first voice.

Further, obtaining the speech recognition result, the acoustic determination result, and the semantic determination result according to the parameter information and the feature vector, and determining, according to the acoustic determination result and the semantic determination result, whether the voice belongs to the music field comprises:

inputting the internal and external noise information, the azimuth information, and the feature vector into an acoustic recognition model to obtain the speech recognition result and the acoustic determination result;

determining, according to the acoustic determination result, whether the voice acoustically belongs to the music field;

if the voice acoustically belongs to the music field, inputting the speech recognition result, the internal and external noise information, and the azimuth information into a semantic recognition model to obtain the semantic determination result;

determining, according to the semantic determination result, whether the voice semantically belongs to the music field; and

if the voice semantically belongs to the music field, determining that the voice belongs to the music field.

Further, obtaining the speech recognition result, the acoustic determination result, and the semantic determination result according to the parameter information and the feature vector, and determining, according to the acoustic determination result and the semantic determination result, whether the voice belongs to the music field further comprises:

if the voice does not acoustically belong to the music field, or the voice does not semantically belong to the music field, determining that the voice does not belong to the music field.

Further, the method also comprises:

if the current mode is the single-wake single-recognition mode or the geek mode, or the voice is the first voice in the single-wake multi-recognition mode, obtaining a speech recognition result according to the parameter information and the feature vector; and

determining the corresponding instruction and/or resource according to the speech recognition result.

Further, after determining the instruction and/or resource corresponding to the voice according to the speech recognition result, the method also comprises:

executing the instruction, and/or providing the resource to the user of the smart speaker.

In the speech recognition method of the embodiment of the present invention, a voice to be recognized and parameter information are obtained, where the parameter information includes the current mode, the identification serial number of the voice, internal and external noise information, and azimuth information; the feature vector corresponding to the voice is extracted; whether the voice is a non-first voice in the single-wake multi-recognition mode is determined according to the current mode and the identification serial number; if so, a speech recognition result, an acoustic determination result, and a semantic determination result are obtained according to the parameter information and the feature vector, and whether the voice belongs to the music field is determined according to the acoustic determination result and the semantic determination result; and if the voice belongs to the music field, the instruction and/or resource corresponding to the voice is determined according to the speech recognition result. Because the internal and external noise information, the azimuth information, and similar parameters are collected, the accuracy of speech recognition is improved, and voices in the music field can be identified automatically and accurately in the single-wake multi-recognition mode, which makes it easier to subsequently provide the user with a high-quality music service.

To achieve the above objects, an embodiment of the second aspect of the present invention proposes a speech recognition device, comprising:

an obtaining module, configured to obtain a voice to be recognized and parameter information, where the parameter information includes: the current mode, the identification serial number of the voice, internal and external noise information, and azimuth information;

an extraction module, configured to extract the feature vector corresponding to the voice;

a determining module, configured to determine, according to the current mode and the identification serial number, whether the voice is a non-first voice in the single-wake multi-recognition mode;

the determining module being further configured to, when the voice is a non-first voice in the single-wake multi-recognition mode, obtain a speech recognition result, an acoustic determination result, and a semantic determination result according to the parameter information and the feature vector, and determine, according to the acoustic determination result and the semantic determination result, whether the voice belongs to the music field; and

the determining module being further configured to, when the voice belongs to the music field, determine the instruction and/or resource corresponding to the voice according to the speech recognition result.

Further, the determining module is specifically used for,

Further, the determining module is specifically also used to,

Further, the obtaining module is also configured to obtain a speech recognition result according to the parameter information and the feature vector when the current mode is the single-wake single-recognition mode or the geek mode, or when the voice is the first voice in the single-wake multi-recognition mode;

the determining module is also configured to determine the corresponding instruction and/or resource according to the speech recognition result.

Further, the device also comprises: an execution module, configured to execute the instruction, and/or provide the resource to the user of the smart speaker.

In the speech recognition device of the embodiment of the present invention, a voice to be recognized and parameter information are obtained, where the parameter information includes the current mode, the identification serial number of the voice, internal and external noise information, and azimuth information; the feature vector corresponding to the voice is extracted; whether the voice is a non-first voice in the single-wake multi-recognition mode is determined according to the current mode and the identification serial number; if so, a speech recognition result, an acoustic determination result, and a semantic determination result are obtained according to the parameter information and the feature vector, and whether the voice belongs to the music field is determined according to the acoustic determination result and the semantic determination result; and if the voice belongs to the music field, the instruction and/or resource corresponding to the voice is determined according to the speech recognition result. Because the internal and external noise information, the azimuth information, and similar parameters are collected, the accuracy of speech recognition is improved, and voices in the music field can be identified automatically and accurately in the single-wake multi-recognition mode, which makes it easier to subsequently provide the user with a high-quality music service.

To achieve the above objects, an embodiment of the third aspect of the present invention proposes another speech recognition device, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the speech recognition method described above when executing the program.

To achieve the above objects, an embodiment of the fourth aspect of the present invention proposes a non-transitory computer-readable storage medium on which a computer program is stored, where the program implements the speech recognition method described above when executed by a processor.

To achieve the above objects, an embodiment of the fifth aspect of the present invention proposes a computer program product, where the speech recognition method described above is implemented when instructions in the computer program product are executed by a processor.

Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from the following description or be learned through practice of the invention.

Brief description of the drawings

The above and/or additional aspects and advantages of the present invention will become apparent and easy to understand from the following description of the embodiments in conjunction with the accompanying drawings, in which:

Fig. 1 is a schematic flowchart of a speech recognition method provided by an embodiment of the present invention;

Fig. 2 is a schematic flowchart of another speech recognition method provided by an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a speech recognition device provided by an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of another speech recognition device provided by an embodiment of the present invention.

Detailed description of the embodiments

Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, where the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary and intended to explain the present invention; they should not be construed as limiting it.

The speech recognition method and device of the embodiments of the present invention are described below with reference to the accompanying drawings.

Fig. 1 is a schematic flowchart of a speech recognition method provided by an embodiment of the present invention. As shown in Fig. 1, the speech recognition method includes the following steps:

S101, obtain a voice to be recognized and parameter information.

The speech recognition method provided by the present invention is executed by a speech recognition device, which may be a hardware device such as a terminal device or a server, or software installed on a hardware device. The speech recognition device has conversational artificial intelligence functions: through voice interaction with the user, it can control smart home devices and also provides basic functions such as looking up the weather and information, listening to audio, setting alarms, dressing guidance, traffic conditions, and stock quotes.

In this embodiment, after the speech recognition device is woken up, the user can hold a dialogue with it. Generally, the first voice received by the speech recognition device is the user's voice, whereas a non-first voice it receives may not be the user's voice.

Take a smart speaker as the speech recognition device. In practical application scenarios, the smart speaker faces a complex environment: various internal and external noises (the sound of indoor TV programs, outdoor vehicle noise, and so on), the voices of people other than the user, and unclear articulation can all interfere with voice processing. Generally, after the user wakes up the smart speaker, the first voice it receives can be understood as a user voice that reflects the user's real need. For example, a user who wants to listen to songs can say "I want to listen to the songs I often listen to" to the smart speaker immediately after waking it up, and the smart speaker plays the corresponding songs to meet the user's need. After the smart speaker starts playing the songs, if the user turns on the television to watch a program, the second voice received by the smart speaker may be the sound of the TV program; alternatively, the second voice may still be the user's voice but mixed with a great deal of noise such as the sound of the TV program. In this case, the voices to be recognized that the smart speaker receives need to be screened.

In this embodiment, the parameter information of the voice to be recognized includes the current mode, the identification serial number of the voice, internal and external noise information, and azimuth information. Obtaining information such as the internal and external noise information and the azimuth information improves the accuracy of speech recognition and the interactive experience.

In this embodiment, the current mode is any one of the following: the single-wake single-recognition mode, the single-wake multi-recognition mode, or the geek mode, but is not limited thereto.

The single-wake single-recognition mode can be understood as follows: for each interaction with the speech recognition device, the user must first say a "voice containing the wake-up word" to wake up the device and then say the interactive voice. Taking a smart speaker as the speech recognition device, the "voice containing the wake-up word" that activates the single-wake single-recognition mode is "Xiaodu Xiaodu (wake-up word), activate the single-wake single-recognition mode". The first time, the user wants to check the weather and must first say "Xiaodu Xiaodu, activate the single-wake single-recognition mode"; after completing the weather query task, the smart speaker goes to sleep. The next time, when the user wants to check stocks, the user must again first say "Xiaodu Xiaodu (wake-up word), activate the single-wake single-recognition mode" to wake up the smart speaker.

The single-wake multi-recognition mode can be understood as follows: for each interaction with the speech recognition device, after saying a "voice containing the wake-up word" to wake up the device, the user can speak to the device multiple times. Taking a smart speaker as the speech recognition device, the "voice containing the wake-up word" that activates the single-wake multi-recognition mode is "Xiaodu Xiaodu (wake-up word), activate the single-wake multi-recognition mode". The first time the user wants to check the weather, the second time the user wants to listen to songs, and the third time the user wants to check stocks; throughout these three interactions, the speech recognition device remains awake.

The geek mode can be understood as follows: when interacting with the speech recognition device, the user does not need to say a voice containing the wake-up word each time and can hold multiple consecutive dialogues with the device within a short time. Taking a smart speaker as the speech recognition device, the "voice containing the wake-up word" that activates the geek mode is "Xiaodu Xiaodu (wake-up word), activate the geek mode". The first time the user wants to check the weather, the second time the user wants to listen to songs, and the third time the user wants to check stocks; throughout these three interactions, the speech recognition device remains awake. The difference between the geek mode and the single-wake multi-recognition mode is that the single-wake multi-recognition mode can collect multiple voices continuously, whereas the geek mode does not collect other voices until processing of the current voice is complete.

In this embodiment, the identification serial number indicates which voice, counted from the wake-up of the speech recognition device, the voice to be recognized is. For example, an identification serial number of 1 indicates that the voice to be recognized is the first voice; an identification serial number of 2 indicates the second voice; and an identification serial number of 3 indicates the third voice.

In this embodiment, the internal and external noise information can be understood as the ambient noise of the environment in which the speech recognition device is located. Taking a smart speaker as the speech recognition device, when the smart speaker is placed in a living room, bedroom, kitchen, or similar place, it can first identify the ambient noise of that place; later, when processing a voice to be recognized, it can apply noise processing to the voice according to the internal and external noise information, thereby improving recognition accuracy.

In this embodiment, the azimuth information can be understood as sound source position information. Using sound source localization technology, the speech recognition device can identify the sound source position of the voice to be recognized, thereby improving recognition accuracy.

It should be noted that the parameters of the voice to be recognized are not limited to the current mode, the identification serial number of the voice, the internal and external noise information, and the azimuth information. For example, the parameters may also include an identification mark, which can be understood as indicating the number of times the current mode has been activated. Taking the single-wake multi-recognition mode as the current mode, an identification mark of 1 indicates that the single-wake multi-recognition mode has been activated once in total; an identification mark of 2 indicates twice in total; and an identification mark of 3 indicates three times in total.
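As an illustration only, the parameter information described above could be carried in a small record type. The sketch below is not taken from the patent: the names `Mode` and `ParameterInfo`, and the choice to encode the noise information as a tuple of floats and the azimuth as an angle in degrees, are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Mode(Enum):
    """The three modes named in the text (member names are hypothetical)."""
    SINGLE_WAKE_SINGLE_RECOGNITION = "single_wake_single_recognition"
    SINGLE_WAKE_MULTI_RECOGNITION = "single_wake_multi_recognition"
    GEEK = "geek"


@dataclass(frozen=True)
class ParameterInfo:
    current_mode: Mode
    identification_serial_number: int          # 1 = first voice after wake-up
    noise_info: Tuple[float, ...]              # assumed encoding of ambient noise
    azimuth_deg: float                         # assumed encoding of source position
    identification_mark: Optional[int] = None  # optional activation count of the mode

    def __post_init__(self) -> None:
        # The text counts voices from 1, so smaller values are rejected.
        if self.identification_serial_number < 1:
            raise ValueError("identification serial number starts at 1")


# Second voice received after waking up in single-wake multi-recognition mode.
params = ParameterInfo(
    current_mode=Mode.SINGLE_WAKE_MULTI_RECOGNITION,
    identification_serial_number=2,
    noise_info=(0.12, 0.03),   # placeholder values
    azimuth_deg=45.0,          # placeholder value
    identification_mark=1,
)
```

Making the record immutable keeps the parameters that accompany one voice from being altered while that voice is being processed.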

S102, extract the feature vector corresponding to the voice.

In this embodiment, the feature vector of the voice to be recognized is the basis of speech recognition. Specifically, the feature vector is obtained by performing feature extraction on the voice to be recognized. The feature extraction method may be any one of the following: linear prediction coefficients (Linear Prediction Coefficients, LPC), perceptual linear prediction (Perceptual Linear Predictive, PLP), linear prediction cepstral coefficients (Linear Predictive Cepstrum Coefficient, LPCC), or Mel-frequency cepstral coefficients (Mel Frequency Cepstrum Coefficient, MFCC), but is not limited to these methods.
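To make one of the listed options concrete, the following is a minimal pure-Python sketch of LPC feature extraction using the autocorrelation method with the Levinson-Durbin recursion. The patent does not specify an implementation; the function names, frame length, hop size, and prediction order are assumptions for illustration, and a production system would normally add pre-emphasis and windowing.

```python
from typing import List, Sequence


def autocorrelation(frame: Sequence[float], max_lag: int) -> List[float]:
    """Return r[0..max_lag], where r[k] = sum over n of frame[n] * frame[n + k]."""
    n = len(frame)
    return [
        sum(frame[i] * frame[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def lpc(frame: Sequence[float], order: int) -> List[float]:
    """Predictor coefficients a[1..order] such that x[n] ~ sum_k a[k] * x[n - k]."""
    r = autocorrelation(frame, order)
    if r[0] == 0.0:
        return [0.0] * order          # silent frame: nothing to predict
    a = [0.0] * (order + 1)           # a[0] is unused
    err = r[0]                        # prediction error energy
    for i in range(1, order + 1):
        if err <= 0.0:
            break                     # frame is perfectly predicted already
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                 # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:]


def lpc_features(signal: Sequence[float], frame_len: int,
                 hop: int, order: int) -> List[List[float]]:
    """Split the signal into overlapping frames and compute LPC per frame."""
    return [
        lpc(signal[start:start + frame_len], order)
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]


# A decaying exponential x[n] = 0.9 ** n is predicted by x[n] ~ 0.9 * x[n - 1].
signal = [0.9 ** n for n in range(200)]
coeffs = lpc(signal, order=2)
features = lpc_features(signal, frame_len=100, hop=50, order=2)
```

For the test signal, the first coefficient comes out very close to 0.9 and the second very close to 0, which matches the first-order recurrence that generated the signal.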

S103, determine, according to the current mode and the identification serial number, whether the voice is a non-first voice in the single-wake multi-recognition mode.

In this embodiment, the speech recognition device may execute step S103 as follows: judge whether the current mode is the single-wake multi-recognition mode; if the current mode is the single-wake multi-recognition mode, determine, according to the identification serial number, whether the voice is a non-first voice.

It should be noted that if the current mode is the single-wake single-recognition mode or the geek mode, or the voice is the first voice in the single-wake multi-recognition mode, a speech recognition result is obtained according to the parameter information and the feature vector, and the corresponding instruction and/or resource is determined according to the speech recognition result.
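The branching in step S103 can be sketched as follows. This is an illustrative reading of the text rather than the patent's implementation; the names `Mode`, `is_non_first_voice_in_multi_mode`, and `choose_path`, and the two path labels, are hypothetical.

```python
from enum import Enum


class Mode(Enum):
    SINGLE_WAKE_SINGLE_RECOGNITION = "single_wake_single_recognition"
    SINGLE_WAKE_MULTI_RECOGNITION = "single_wake_multi_recognition"
    GEEK = "geek"


def is_non_first_voice_in_multi_mode(mode: Mode, serial_number: int) -> bool:
    """Step S103: check the mode first, then the identification serial number."""
    if mode is not Mode.SINGLE_WAKE_MULTI_RECOGNITION:
        return False
    return serial_number > 1  # serial number 1 is the first voice after wake-up


def choose_path(mode: Mode, serial_number: int) -> str:
    """Return which branch of the method handles this voice."""
    if is_non_first_voice_in_multi_mode(mode, serial_number):
        # Non-first voices are screened for the music field (step S104).
        return "screen_for_music_field"
    # Other modes, and the first voice in multi-recognition mode, are
    # recognized directly and mapped to an instruction and/or resource.
    return "recognize_directly"
```

Only the combination of single-wake multi-recognition mode and a serial number greater than 1 triggers the screening path; every other case is recognized directly.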

In this embodiment, the instruction and/or resource corresponding to a voice is set according to actual conditions. For example, if the voice to be recognized is "turn on the air conditioner", the instruction executed by the smart speaker is an air-conditioner start instruction; if the voice to be recognized is "turn on the light", the instruction executed by the smart speaker is a light-on instruction; and if the voice to be recognized is "I want to listen to songs by a certain singer", the resource provided by the smart speaker is that singer's songs.

S104, if the voice is a non-first voice in the single-wake multi-recognition mode, obtain a speech recognition result, an acoustic determination result, and a semantic determination result according to the parameter information and the feature vector, and determine, according to the acoustic determination result and the semantic determination result, whether the voice belongs to the music field.

In this embodiment, when the current mode of the speech recognition device is the single-wake multi-recognition mode, received non-first voices are screened: only when the field to which a non-first voice acoustically belongs is consistent with the field to which it semantically belongs is the non-first voice considered to also be the user's voice, which also indicates that the user again wants a service in that field.

In this embodiment, the speech recognition result, the acoustic determination result, and the semantic determination result can be obtained according to the parameter information and the feature vector of the voice to be recognized; then, whether the voice to be recognized belongs to the music field is determined according to the acoustic determination result and the semantic determination result.

The speech recognition result is obtained by performing speech recognition on the voice to be recognized.

The acoustic determination result indicates the field to which the acoustic features of the voice to be recognized belong. For example, an acoustic determination result of 1 indicates that the voice to be recognized belongs to the music field; 2 indicates the weather lookup field; 3 indicates the stock quotes field; 4 indicates the smart home device control field; and so on.

The semantic determination result indicates the field to which the semantic features of the voice to be recognized belong. For example, a semantic determination result of 1 indicates that the voice to be recognized belongs to the music field; 2 indicates the weather lookup field; 3 indicates the stock quotes field; 4 indicates the smart home device control field; and so on.
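The example codes above and the requirement that both determinations point to the music field can be sketched as below. The numeric codes 1 to 4 follow the examples in the text; the names `Field`, `to_field`, and `belongs_to_music_field`, and the handling of unlisted codes as "no known field", are assumptions for illustration.

```python
from enum import IntEnum
from typing import Optional


class Field(IntEnum):
    MUSIC = 1
    WEATHER_LOOKUP = 2
    STOCK_QUOTES = 3
    SMART_HOME_CONTROL = 4


def to_field(code: int) -> Optional[Field]:
    """Map a determination result code to a field; unlisted codes give None."""
    try:
        return Field(code)
    except ValueError:
        return None


def belongs_to_music_field(acoustic_result: int, semantic_result: int) -> bool:
    """The voice counts as music only if both determinations say music."""
    return (to_field(acoustic_result) is Field.MUSIC
            and to_field(semantic_result) is Field.MUSIC)
```

A voice whose acoustic result is 1 but whose semantic result is 2 is rejected, which reflects the screening of non-first voices described above.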

S105, if the voice belongs to the music field, determine the instruction and/or resource corresponding to the voice according to the speech recognition result.

Further, after step S105, the method may also include the following step: executing the instruction, and/or providing the resource to the user of the smart speaker.

In this embodiment, the instruction and/or resource corresponding to a voice is set according to actual conditions. For example, if the voice to be recognized is "the song is too quiet" or "the song is too loud", the instruction executed by the smart speaker is a volume adjustment instruction; if the voice to be recognized is "change the song", the instruction executed by the smart speaker is a song switching instruction; and if the voice to be recognized is "I want to listen to songs by a certain singer", the resource provided by the smart speaker is that singer's songs.
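Since the text says this mapping is set according to actual conditions, the following is only a toy sketch of one possible lookup from a recognition result to an instruction and/or resource. The phrase table, the instruction names, the `songs_by:` resource format, and the `Action` type are all hypothetical.

```python
from typing import NamedTuple, Optional


class Action(NamedTuple):
    instruction: Optional[str]
    resource: Optional[str]


# Hypothetical phrase-to-instruction table for the music field.
INSTRUCTION_PHRASES = {
    "too quiet": "adjust_volume",
    "too loud": "adjust_volume",
    "change the song": "switch_song",
}


def determine_action(recognized_text: str) -> Action:
    """Map a speech recognition result to an instruction and/or resource."""
    text = recognized_text.lower().strip()
    for phrase, instruction in INSTRUCTION_PHRASES.items():
        if phrase in text:
            return Action(instruction=instruction, resource=None)
    marker = "songs by "
    if marker in text:
        singer = text.split(marker, 1)[1].strip()
        if singer:
            return Action(instruction=None, resource=f"songs_by:{singer}")
    return Action(instruction=None, resource=None)  # nothing matched
```

A real system would rely on the semantic model's output rather than substring matching; the table form is used here only to show that instructions and resources are distinct kinds of result.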

Fig. 2 is a schematic flowchart of another speech recognition method provided by an embodiment of the present invention. As shown in Fig. 2, on the basis of the embodiment shown in Fig. 1, the step of "obtaining a speech recognition result, an acoustic determination result, and a semantic determination result according to the parameter information and the feature vector, and determining, according to the acoustic determination result and the semantic determination result, whether the voice belongs to the music field" is implemented through the following steps:

S1041, input the internal and external noise information, the azimuth information, and the feature vector into the acoustic recognition model to obtain the speech recognition result and the acoustic determination result, and then execute step S1042.

In this embodiment, a massive number of training samples is used to train any neural network, such as a deep neural network (DNN, Deep Neural Network), a convolutional neural network (CNN, Convolutional Neural Network), or a recursive neural network (RNN, Recursive Neural Network), to obtain the acoustic recognition model. It should be noted that taking information such as the internal and external noise information and the azimuth information into account when training the acoustic recognition model can improve the accuracy of speech recognition.

Each training sample consists of the feature vector of a known voice, the internal and external noise information of the known voice, the azimuth information of the known voice, and the field to which the known voice acoustically belongs. For example, for the known voice "I want to listen to nice songs", the field to which it acoustically belongs is the music field; for the known voice "How is the weather today", the field is the weather lookup field; and for the known voice "How is the stock market today", the field is the stock market query field. The internal and external noise information of a known voice can be obtained through existing noise detection technology, and the azimuth information of a known voice can be obtained through existing sound source localization technology. For the specific model training method, refer to the related art; details are not repeated here.
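The composition of a training sample for the acoustic recognition model can be pictured as a labeled record. The sketch below only illustrates the four components named in the text; the type name `AcousticTrainingSample`, the field label strings, and every numeric value are placeholders invented for the example, not data from the patent.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class AcousticTrainingSample:
    feature_vector: Tuple[float, ...]  # features of the known voice
    noise_info: Tuple[float, ...]      # from noise detection (assumed encoding)
    azimuth_deg: float                 # from sound source localization (assumed encoding)
    acoustic_field: str                # label: field the voice acoustically belongs to


# Placeholder samples mirroring the three example utterances in the text.
samples: List[AcousticTrainingSample] = [
    AcousticTrainingSample((0.1, 0.2), (0.05,), 30.0, "music"),
    AcousticTrainingSample((0.3, 0.1), (0.02,), 90.0, "weather_lookup"),
    AcousticTrainingSample((0.2, 0.4), (0.07,), 150.0, "stock_market_query"),
]


def label_counts(data: List[AcousticTrainingSample]) -> Counter:
    """Count samples per field, e.g. to check class balance before training."""
    return Counter(sample.acoustic_field for sample in data)


counts = label_counts(samples)
```

Keeping the noise and azimuth values in the same record as the feature vector reflects the point made above that these inputs are considered during training.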

S1042: determine, according to the acoustic determination result, whether the voice acoustically belongs to the music field; if so, execute step S1043; if not, execute step S1046.

S1043: if the voice acoustically belongs to the music field, input the speech recognition result, the internal and external noise information, and the azimuth information into a semantic recognition model to obtain the semantic determination result, then execute step S1044.

In this embodiment, the semantic determination result of the voice to be recognized is obtained only after the acoustic determination result indicates that the voice belongs to the music field. The voice to be recognized is considered to belong to the music field only when the semantic determination result also indicates that it belongs to the music field.

In this embodiment, the semantic recognition model is obtained by training any neural network, such as a deep neural network (DNN, Deep Neural Network), a convolutional neural network (CNN, Convolutional Neural Network), or a recurrent neural network (RNN, Recurrent Neural Network), on a massive set of training samples. Each training sample consists of the speech recognition result of a known voice, the internal and external noise information of the known voice, the azimuth information of the known voice, and the field to which the known voice belongs semantically. For example, if the speech recognition result of the known voice is "I want to listen to a nice song", the field to which the known voice belongs semantically is the music field; if the speech recognition result is "How is the weather today", the field is the weather query field; if the speech recognition result is "How is the stock market today", the field is the stock market query field. The internal and external noise information of a known voice can be obtained through existing noise detection techniques, and the azimuth information of a known voice can be obtained through existing sound source localization techniques. The specific model training method is described in detail in the related art and is not repeated here.

S1044: determine, according to the semantic determination result, whether the voice semantically belongs to the music field; if so, execute step S1045; if not, execute step S1046.

S1045: if the voice semantically belongs to the music field, determine that the voice belongs to the music field.

S1046: if the voice does not acoustically belong to the music field, or the voice does not semantically belong to the music field, determine that the voice does not belong to the music field.
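Steps S1041 to S1046 form a two-stage cascade: the semantic model is consulted only when the acoustic model already points to the music field. The sketch below illustrates that control flow under stated assumptions. The function names, the callable signatures, the string label "music", and the two stand-in models are hypothetical; real models would be trained neural networks as described above.

```python
from typing import Callable, Sequence, Tuple

MUSIC_FIELD = "music"  # hypothetical label value for the music field

# Assumed signatures: the acoustic model returns (recognized text, acoustic field),
# the semantic model returns the semantic field.
AcousticModel = Callable[[Sequence[float], Sequence[float], float], Tuple[str, str]]
SemanticModel = Callable[[str, Sequence[float], float], str]


def is_music_field(
    feature_vector: Sequence[float],
    noise_info: Sequence[float],
    azimuth: float,
    acoustic_model: AcousticModel,
    semantic_model: SemanticModel,
) -> Tuple[str, bool]:
    """Return (speech recognition result, whether the voice is in the music field)."""
    # S1041: acoustic model gives the recognition result and the acoustic decision.
    text, acoustic_field = acoustic_model(feature_vector, noise_info, azimuth)
    # S1042 -> S1046: not music acoustically, so stop without a semantic check.
    if acoustic_field != MUSIC_FIELD:
        return text, False
    # S1043: semantic model sees the recognized text plus noise and azimuth.
    semantic_field = semantic_model(text, noise_info, azimuth)
    # S1044 -> S1045 or S1046: music only if the semantic decision agrees.
    return text, semantic_field == MUSIC_FIELD


# Stand-in models for demonstration only; they are not trained networks.
def demo_acoustic_model(features, noise_info, azimuth):
    return "I want to listen to a nice song", MUSIC_FIELD


def demo_semantic_model(text, noise_info, azimuth):
    return MUSIC_FIELD if "song" in text else "other"


text, in_music = is_music_field(
    [0.1, 0.2], [0.05, 0.30], 45.0, demo_acoustic_model, demo_semantic_model
)
print(text, in_music)
```

Requiring both decisions to agree trades some recall for precision: a voice that only sounds like a music request, or only reads like one, is rejected.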

Fig. 3 is a schematic structural diagram of a speech recognition device provided in an embodiment of the present invention. As shown in Fig. 3, the device includes: an acquisition module 31, an extraction module 32, and a determining module 33.

The acquisition module 31 is configured to obtain a voice to be recognized and parameter information; the parameter information includes: the current mode, the identification serial number of the voice, internal and external noise information, and azimuth information;

The extraction module 32 is configured to extract the feature vector corresponding to the voice;

The determining module 33 is configured to determine, according to the current mode and the identification serial number, whether the voice is a non-first voice in the single-wake-up multiple-recognition mode;

The determining module 33 is further configured to, when the voice is a non-first voice in the single-wake-up multiple-recognition mode, obtain a speech recognition result, an acoustic determination result, and a semantic determination result according to the parameter information and the feature vector, and determine, according to the acoustic determination result and the semantic determination result, whether the voice belongs to the music field;

The determining module 33 is further configured to, when the voice belongs to the music field, determine the instruction and/or resource corresponding to the voice according to the speech recognition result.

Further, the determining module 33 is specifically used for,

Further, the determining module 33 is specifically also used to,

Further, the acquisition module 31 is also configured to obtain a speech recognition result according to the parameter information and the feature vector when the current mode is the single-wake-up single-recognition mode or the geek mode, or when the voice is the first voice in the single-wake-up multiple-recognition mode;

The determining module 33 is further configured to determine the corresponding instruction and/or resource according to the speech recognition result.
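The routing implied by the module descriptions above can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the `Mode` enum, the function names, and the convention that the identification serial number equals 1 for the first voice after a wake-up are hypothetical, since the text does not state how the serial number is encoded.

```python
from enum import Enum


class Mode(Enum):
    """Hypothetical encoding of the modes named in the text."""
    SINGLE_WAKE_SINGLE_RECOGNITION = 1
    SINGLE_WAKE_MULTI_RECOGNITION = 2
    OTHER = 3  # stands for the third mode named in the text


def is_non_first_voice_in_multi_recognition(mode: Mode, serial_number: int) -> bool:
    """Assumes serial_number == 1 marks the first voice after a wake-up."""
    return mode is Mode.SINGLE_WAKE_MULTI_RECOGNITION and serial_number > 1


def choose_path(mode: Mode, serial_number: int) -> str:
    """Pick which processing path a voice takes."""
    if is_non_first_voice_in_multi_recognition(mode, serial_number):
        # Later voices in a multi-recognition session: recognize, then run the
        # acoustic and semantic music-field checks before acting.
        return "recognize_and_check_music_field"
    # Every other case: recognize and act on the result directly.
    return "recognize_only"


for mode, serial in [
    (Mode.SINGLE_WAKE_MULTI_RECOGNITION, 1),
    (Mode.SINGLE_WAKE_MULTI_RECOGNITION, 2),
    (Mode.SINGLE_WAKE_SINGLE_RECOGNITION, 1),
]:
    print(mode.name, serial, choose_path(mode, serial))
```

Only follow-up voices in a multi-recognition session need the music-field filter, because they arrive without a fresh wake word and may not be addressed to the device at all.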

It should be noted that the foregoing explanation of the speech recognition method embodiments also applies to the speech recognition device of this embodiment, and details are not repeated here.

Fig. 4 is a schematic structural diagram of another speech recognition device provided in an embodiment of the present invention. The speech recognition device includes:

A memory 1001, a processor 1002, and a computer program stored in the memory 1001 and executable on the processor 1002.

The processor 1002 implements the speech recognition method provided in the above embodiments when executing the program.

Further, the speech recognition device also includes:

Communication interface 1003, for the communication between memory 1001 and processor 1002.

Memory 1001, for storing the computer program that can be run on processor 1002.

The memory 1001 may include a high-speed RAM memory, and may also include a non-volatile memory, for example at least one magnetic disk memory.

The processor 1002 is configured to implement the speech recognition method described in the above embodiments when executing the program.

If the memory 1001, the processor 1002, and the communication interface 1003 are implemented independently, they may be connected to one another and communicate with one another through a bus. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, the bus is represented by only one thick line in Fig. 4, but this does not mean that there is only one bus or only one type of bus.

Optionally, in a specific implementation, if the memory 1001, the processor 1002, and the communication interface 1003 are integrated on a single chip, they may communicate with one another through an internal interface.

The processor 1002 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention.

The present invention also provides a non-transitory computer-readable storage medium on which a computer program is stored; the speech recognition method described above is implemented when the program is executed by a processor.

The present invention also provides a computer program product; the speech recognition method described above is implemented when instructions in the computer program product are executed by a processor.

In the description of this specification, descriptions that refer to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" mean that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, provided they do not conflict with each other, those skilled in the art may combine the different embodiments or examples described in this specification and the features of those different embodiments or examples.

In addition, the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance or as implicitly indicating the number of technical features referred to. Thus, a feature defined with "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "plurality" means at least two, for example two or three, unless otherwise specifically defined.

Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing the steps of a custom logic function or process. The scope of the preferred embodiments of the present invention includes other implementations in which functions may be performed out of the order shown or discussed, including in a substantially simultaneous manner or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.

The logic and/or steps represented in a flowchart or otherwise described herein, which may for example be regarded as an ordered list of executable instructions for implementing logic functions, may be embodied in any computer-readable medium for use by, or in combination with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from the instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate, or transmit a program for use by, or in combination with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program can be printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.

It should be understood that each part of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented by software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented by any one or a combination of the following techniques well known in the art: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.

Those of ordinary skill in the art will understand that all or part of the steps carried by the above embodiment methods may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and when executed, the program includes one or a combination of the steps of the method embodiments.

In addition, the functional units in the embodiments of the present invention may be integrated in one processing module, each unit may exist physically on its own, or two or more units may be integrated in one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like. Although embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are exemplary and should not be construed as limiting the present invention; those of ordinary skill in the art may change, modify, replace, and vary the above embodiments within the scope of the present invention.

Claims

1. A speech recognition method, characterized by comprising:

obtaining a voice to be recognized and parameter information, wherein the parameter information includes: a current mode, an identification serial number of the voice, internal and external noise information, and azimuth information;

extracting a feature vector corresponding to the voice;

determining, according to the current mode and the identification serial number, whether the voice is a non-first voice in a single-wake-up multiple-recognition mode;

if the voice is a non-first voice in the single-wake-up multiple-recognition mode, obtaining a speech recognition result, an acoustic determination result, and a semantic determination result according to the parameter information and the feature vector, and determining, according to the acoustic determination result and the semantic determination result, whether the voice belongs to the music field;

if the voice belongs to the music field, determining an instruction and/or resource corresponding to the voice according to the speech recognition result.

2. The method according to claim 1, characterized in that determining, according to the current mode and the identification serial number, whether the voice is a non-first voice in the single-wake-up multiple-recognition mode comprises:

if the current mode is the single-wake-up multiple-recognition mode, determining, according to the identification serial number, whether the voice is a non-first voice.

3. The method according to claim 1, characterized in that obtaining a speech recognition result, an acoustic determination result, and a semantic determination result according to the parameter information and the feature vector, and determining, according to the acoustic determination result and the semantic determination result, whether the voice belongs to the music field comprises:

inputting the internal and external noise information, the azimuth information, and the feature vector into an acoustic recognition model to obtain the speech recognition result and the acoustic determination result;

if the voice acoustically belongs to the music field, inputting the speech recognition result, the internal and external noise information, and the azimuth information into a semantic recognition model to obtain the semantic determination result;

4. The method according to claim 3, characterized in that obtaining a speech recognition result, an acoustic determination result, and a semantic determination result according to the parameter information and the feature vector, and determining, according to the acoustic determination result and the semantic determination result, whether the voice belongs to the music field further comprises:

if the voice does not acoustically belong to the music field, or the voice does not semantically belong to the music field, determining that the voice does not belong to the music field.

5. The method according to claim 1, characterized by further comprising:

if the current mode is the single-wake-up single-recognition mode or the geek mode, or the voice is the first voice in the single-wake-up multiple-recognition mode, obtaining a speech recognition result according to the parameter information and the feature vector;

6. The method according to claim 1, characterized in that, after determining the instruction and/or resource corresponding to the voice according to the speech recognition result, the method further comprises:

7. A speech recognition device, characterized by comprising:

an acquisition module, configured to obtain a voice to be recognized and parameter information, wherein the parameter information includes: a current mode, an identification serial number of the voice, internal and external noise information, and azimuth information;

a determining module, configured to determine, according to the current mode and the identification serial number, whether the voice is a non-first voice in a single-wake-up multiple-recognition mode;

the determining module being further configured to, when the voice is a non-first voice in the single-wake-up multiple-recognition mode, obtain a speech recognition result, an acoustic determination result, and a semantic determination result according to the parameter information and the feature vector, and determine, according to the acoustic determination result and the semantic determination result, whether the voice belongs to the music field;

the determining module being further configured to, when the voice belongs to the music field, determine an instruction and/or resource corresponding to the voice according to the speech recognition result.

8. The device according to claim 7, characterized in that the determining module is specifically configured to,

9. The device according to claim 7, characterized in that the determining module is specifically configured to,

10. The device according to claim 9, characterized in that the determining module is further specifically configured to,

11. The device according to claim 7, characterized in that

the acquisition module is further configured to obtain a speech recognition result according to the parameter information and the feature vector when the current mode is the single-wake-up single-recognition mode or the geek mode, or when the voice is the first voice in the single-wake-up multiple-recognition mode;

12. The device according to claim 7, characterized by further comprising: an execution module, configured to execute the instruction and/or provide the resource to a user of the smart speaker.

13. A speech recognition device, characterized by comprising:

a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that

the processor implements the speech recognition method according to any one of claims 1-7 when executing the program.

14. A non-transitory computer-readable storage medium on which a computer program is stored, characterized in that the speech recognition method according to any one of claims 1-7 is implemented when the program is executed by a processor.

15. A computer program product, wherein the speech recognition method according to any one of claims 1-7 is implemented when instructions in the computer program product are executed by a processor.