CN110517682A - Speech recognition method, apparatus, device and storage medium

Info

Publication number: CN110517682A
Application number: CN201910822237.0A
Authority: CN
Inventors: 朱振岭
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2019-09-02
Filing date: 2019-09-02
Publication date: 2019-11-29
Anticipated expiration: 2039-09-02
Also published as: CN110517682B

Abstract

This application provides a speech recognition method, apparatus, device and storage medium. The method includes: performing adaptive beamforming (ADBF) processing on acquired first voice information to obtain frequency spectra in at least two directions; among the frequency spectra in the at least two directions, determining the direction corresponding to the frequency spectrum whose spectral feature meets a preset condition as the target direction; and obtaining second voice information acquired in the target direction and performing speech recognition on the second voice information. With this application, second voice information in an accurate direction can be obtained, which improves the accuracy of speech recognition.

Description

Speech recognition method, apparatus, device and storage medium

Technical field

This application relates to the technical field of electronic devices, and relates to, but is not limited to, a speech recognition method, apparatus, device and storage medium.

Background

At present, in an electronic device with a speech recognition function, front-end signal processing usually applies echo cancellation and single-channel noise reduction to the signal collected by the microphone, and the processed signal is used to wake up the electronic device. Speech recognition is performed after the electronic device has been woken up.

However, in the speech recognition methods of the related art, the signal obtained after acoustic echo cancellation (Acoustic Echo Cancellation, AEC) and single-channel noise suppression (Noise Suppression, NS) still contains directional interference noise from other directions. In addition, when the interference is strong or the environment is acoustically live, sound source localization tends to be inaccurate, which reduces the accuracy of the subsequent speech recognition.

Summary of the invention

The embodiments of this application provide a speech recognition method, apparatus, device and storage medium that can accurately locate the direction of the sound source and thereby improve the accuracy of speech recognition.

The technical solutions of the embodiments of this application are implemented as follows:

An embodiment of this application provides a speech recognition method, including:

performing ADBF processing on acquired first voice information to obtain frequency spectra in at least two directions;

among the frequency spectra in the at least two directions, determining the direction corresponding to the frequency spectrum whose spectral feature meets a preset condition as the target direction; and

obtaining second voice information acquired in the target direction, and performing speech recognition on the second voice information.

An embodiment of this application provides a speech recognition apparatus, including:

a first processing module, configured to perform ADBF processing on acquired first voice information to obtain frequency spectra in at least two directions;

a determining module, configured to determine, among the frequency spectra in the at least two directions, the direction corresponding to the frequency spectrum whose spectral feature meets a preset condition as the target direction; and

a second processing module, configured to obtain second voice information acquired in the target direction and to perform speech recognition on the second voice information.

An embodiment of this application provides a speech recognition device, including:

a memory, configured to store executable instructions; and

a processor, configured to implement the above method when executing the executable instructions stored in the memory.

An embodiment of this application provides a storage medium storing executable instructions which, when executed, cause a processor to implement the above method.

The embodiments of this application have the following beneficial effects:

ADBF processing is performed on the acquired first voice information to obtain frequency spectra in at least two directions, and among these frequency spectra the direction corresponding to the frequency spectrum whose spectral feature meets the preset condition is determined as the target direction. In this way, the direction of the sound source can be located accurately, so that second voice information in an accurate direction can be obtained during subsequent speech recognition, which improves the accuracy of speech recognition.

Brief description of the drawings

Fig. 1 is a schematic flowchart of a speech recognition method in the related art;

Fig. 2A is an optional architecture diagram of the speech recognition system provided by an embodiment of this application;

Fig. 2B is a schematic structural diagram of the server provided by an embodiment of this application;

Fig. 3A is an optional schematic flowchart of the speech recognition method provided by an embodiment of this application;

Fig. 3B is an optional scenario diagram of the speech recognition method provided by an embodiment of this application;

Fig. 3C is an optional scenario diagram of the speech recognition method provided by an embodiment of this application;

Fig. 3D is an optional scenario diagram of the speech recognition method provided by an embodiment of this application;

Fig. 4 is an optional schematic flowchart of the speech recognition method provided by an embodiment of this application;

Fig. 5 is an optional schematic flowchart of the speech recognition method provided by an embodiment of this application;

Fig. 6A is an optional schematic flowchart of the speech recognition method provided by an embodiment of this application;

Fig. 6B is a schematic diagram of determining the user direction provided by an embodiment of this application;

Fig. 7 is an optional schematic flowchart of the speech recognition method provided by an embodiment of this application;

Fig. 8 is an optional schematic flowchart of the speech recognition method provided by an embodiment of this application;

Fig. 9A is a beam pattern for the 0° direction in an embodiment of this application;

Fig. 9B is a beam pattern for the 90° direction in an embodiment of this application;

Fig. 9C is a beam pattern for the 180° direction in an embodiment of this application;

Fig. 10 is a diagram of the wake-up word spectrum delivered to the wake-up unit of the electronic device.

Detailed description

To make the purposes, technical solutions and advantages of this application clearer, this application is described in further detail below with reference to the drawings. The described embodiments should not be regarded as limiting this application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of this application.

The following description refers to "some embodiments", which describe subsets of all possible embodiments. It should be understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with one another where no conflict arises.

Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of this application. The terms used herein are only for the purpose of describing the embodiments of this application and are not intended to limit this application.

Before the speech recognition method of the embodiments of this application is explained further, the speech recognition method in the related art is described first.

Fig. 1 is a schematic flowchart of a speech recognition method in the related art. As shown in Fig. 1, the method includes the following steps:

Step S101, a voice signal is collected through a microphone.

Step S102, AEC processing is performed on the voice signal to obtain an AEC-processed voice signal.

Step S103, NS processing is performed on the AEC-processed voice signal to obtain an NS-processed voice signal.

Step S104, the NS-processed voice signal is sent to the wake-up module of the electronic device to perform wake-up processing on the electronic device.

Step S105, it is determined whether the electronic device has been woken up.

If so, step S106 is performed; if not, the process ends.

Step S106, when the electronic device is woken up, the electronic device turns on the speech recognition function and performs speech recognition on the collected voice information.
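
The control flow of steps S101 to S106 can be sketched in Python as follows. This is only an illustration: the stage functions `aec`, `ns`, `wake_detector` and `recognizer` are hypothetical callables supplied by the caller, and the related art does not prescribe this interface.

```python
def related_art_pipeline(mic_signal, aec, ns, wake_detector, recognizer, later_speech):
    """Run the S101-S106 flow on an already-captured microphone signal.

    All four stage callables are hypothetical placeholders supplied by the caller.
    Returns the recognition result, or None if the device was not woken up.
    """
    after_aec = aec(mic_signal)        # S102: echo cancellation
    after_ns = ns(after_aec)           # S103: single-channel noise suppression
    woken = wake_detector(after_ns)    # S104/S105: wake-up attempt and check
    if not woken:
        return None                    # the flow ends without recognition
    return recognizer(later_speech)    # S106: recognize subsequently captured speech
```

Note that no stage in this flow carries any information about the direction of the sound source.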

However, the above method in the related art has at least the following problems:

1) Before the electronic device is woken up, the sound source is unknown, and the related art applies only AEC processing and NS processing to the collected voice signal. The NS-processed voice signal therefore still contains directional interference noise from other directions, so the voice signal sent to the wake-up module of the electronic device retains strong interference. As a result, the electronic device cannot be woken up effectively and the wake-up rate is low.

2) The angle of the sound source also has to be estimated at the moment the electronic device is woken up. For an electronic device with dual microphones, sound source localization easily becomes inaccurate or outright wrong when the interference is strong or the environment is acoustically live. The subsequent processing then fails to enhance the speech and may even damage it, which seriously degrades the speech recognition result and reduces recognition accuracy.

To perform speech recognition accurately, the embodiments of this application provide a speech recognition method, apparatus, device and storage medium that can accurately locate the direction of the sound source and thereby improve the accuracy of speech recognition.

An exemplary application of the speech recognition device provided by the embodiments of this application is described below. The speech recognition device may be implemented as a server, and the following describes the exemplary application in which the device is implemented as a server.

Referring to Fig. 2A, Fig. 2A is an optional architecture diagram of the speech recognition system 10 provided by an embodiment of this application. To perform speech recognition on a user's voice information, terminals 100 (terminal 100-1 and terminal 100-2 are shown as examples) are connected to the server 300 through the network 200. The network 200 may be a wide area network, a local area network, or a combination of the two.

The terminal 100 displays the interface of an application (Application, APP) on the current interface 110 (current interface 110-1 and current interface 110-2 are shown as examples); for example, the APP may be an APP with a voice input function. Terminal 100-1 and terminal 100-2 may display, on the current interface, the text information corresponding to the recognized voice information. In the embodiments of this application, the server 300 obtains, through the network 200, the first voice information collected by terminal 100-1 or terminal 100-2; performs ADBF processing on the acquired first voice information to obtain frequency spectra in at least two directions; among the frequency spectra in the at least two directions, determines the direction corresponding to the frequency spectrum whose spectral feature meets the preset condition as the target direction; then obtains the second voice information collected by the terminal, performs speech recognition on it, and feeds the recognition result back to the terminal. Note that the voice information shown in Fig. 2A includes the first voice information and the second voice information.

Fig. 2B is a schematic structural diagram of the server 300 provided by an embodiment of this application. As shown in Fig. 2B, the server 300 includes at least one processor 210, a memory 250, at least one network interface 220 and a user interface 230. The components of the server 300 are coupled together by a bus system 240. It can be understood that the bus system 240 implements the connections and communication among these components. In addition to a data bus, the bus system 240 includes a power bus, a control bus and a status signal bus. For clarity, however, all of these buses are labeled as the bus system 240 in Fig. 2B.

The processor 210 may be an integrated circuit chip with signal processing capability, such as a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The general-purpose processor may be a microprocessor or any conventional processor.

The user interface 230 includes one or more output devices 231 that enable media content to be presented, including one or more speakers and/or one or more visual displays. The user interface 230 also includes one or more input devices 232, including user interface components that facilitate user input, such as a keyboard, a mouse, a microphone, a touch screen display, a camera, and other input buttons and controls.

The memory 250 may be removable, non-removable, or a combination of the two. Exemplary hardware devices include solid-state memory, hard disk drives, optical disc drives and the like. The memory 250 optionally includes one or more storage devices physically located away from the processor 210. The memory 250 includes volatile memory or non-volatile memory, and may include both. The non-volatile memory may be read-only memory (Read Only Memory, ROM), and the volatile memory may be random access memory (Random Access Memory, RAM). The memory 250 described in the embodiments of this application is intended to include any suitable type of memory. In some embodiments, the memory 250 can store data to support various operations. Examples of such data include programs, modules and data structures, or subsets or supersets thereof, as illustrated below.

An operating system 251, including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer and a driver layer, used to implement various basic services and process hardware-based tasks;

A network communication module 252, used to reach other computing devices via one or more (wired or wireless) network interfaces 220. Exemplary network interfaces 220 include Bluetooth, Wireless Fidelity (WiFi), Universal Serial Bus (USB) and the like;

An input processing module 253, used to detect one or more user inputs or interactions from one of the one or more input devices 232 and to translate the detected inputs or interactions.

In some embodiments, the apparatus provided by the embodiments of this application may be implemented in software. Fig. 2B shows the speech recognition apparatus 254 stored in the memory 250, which may be software in the form of a program, a plug-in or the like, and includes the following software modules: a first processing module 2541, a determining module 2542 and a second processing module 2543. These modules are logical, so they can be combined arbitrarily or split further according to the functions they implement. The function of each module is described below.

In other embodiments, the apparatus provided by the embodiments of this application may be implemented in hardware. As an example, the apparatus may be a processor in the form of a hardware decoding processor that is programmed to execute the speech recognition method provided by the embodiments of this application. For example, the processor in the form of a hardware decoding processor may use one or more application-specific integrated circuits (Application Specific Integrated Circuit, ASIC), DSPs, programmable logic devices (Programmable Logic Device, PLD), complex programmable logic devices (Complex Programmable Logic Device, CPLD), field-programmable gate arrays (Field-Programmable Gate Array, FPGA) or other electronic components.

The speech recognition method provided by the embodiments of this application is described below in conjunction with the exemplary application and implementation of the speech recognition device provided by the embodiments of this application.

Referring to Fig. 3A, Fig. 3A is an optional schematic flowchart of the speech recognition method provided by an embodiment of this application, described below with reference to the steps shown in Fig. 3A.

Step S301, the server performs adaptive beamforming (Adaptive Beamforming, ADBF) processing on the acquired first voice information to obtain frequency spectra in at least two directions.

Here, the first voice information is collected by a voice collection unit of the electronic device, which may be a microphone on the electronic device. In the embodiments of this application, the electronic device may be a device with multiple microphones, for example dual microphones, which allow voice to be collected accurately and effectively.

ADBF processing uses the prior data that has been obtained and adjusts the weighting coefficients according to an adaptive algorithm and criterion, so as to retain the desired signal and filter out interference. In the embodiments of this application, the acquired first voice information may be collected in the form of a sound wave, so ADBF processing can be applied to the sound wave to retain the desired signal in it and filter out interfering noise.

Note that the embodiments of this application perform ADBF processing on the first voice information in at least two directions. A direction can be defined by its angle relative to the voice collection unit of the electronic device. For example, the angle relative to the electronic device may be 0°, 90°, 180° and so on, and the corresponding directions are then expressed as the 0° direction, the 90° direction, the 180° direction and so on.

During ADBF processing, the first voice information is processed in multiple directions simultaneously, and one frequency spectrum is obtained for each direction, so frequency spectra and directions are in one-to-one correspondence. The frequency spectrum carries the wake-up word used to wake up the electronic device; the wake-up word wakes the electronic device and puts it in a working state.

For example, a user in the 90° direction says "please power on" to the electronic device. The voice collection unit of the electronic device collects the user's voice information, and ADBF processing is applied to it in the 0°, 90° and 180° directions, yielding one frequency spectrum for each of the three directions. Each of these frequency spectra carries the wake-up word requesting the electronic device to power on.
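
As an illustration of how weighting coefficients can adapt to the data while a beam is steered toward a chosen direction, the following sketch computes minimum variance distortionless response (MVDR) weights for one frequency bin of a two-microphone array. MVDR is one well-known adaptive beamformer and is used here as an assumption; the application does not name the adaptive algorithm or criterion it uses. The microphone spacing, the frequency, the diagonal loading value, the sample snapshots and all function names are likewise hypothetical.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # metres per second, assumed

def steering_vector(angle_deg, freq_hz, mic_spacing_m):
    """Far-field steering vector of a two-microphone line array."""
    delay = mic_spacing_m * math.cos(math.radians(angle_deg)) / SPEED_OF_SOUND
    return [1 + 0j, cmath.exp(-2j * math.pi * freq_hz * delay)]

def covariance(snapshots, loading=1e-3):
    """2x2 sample covariance of per-bin snapshots [(x1, x2), ...], diagonally loaded."""
    n = len(snapshots)
    r11 = sum(abs(a) ** 2 for a, _ in snapshots) / n + loading
    r22 = sum(abs(b) ** 2 for _, b in snapshots) / n + loading
    r12 = sum(a * b.conjugate() for a, b in snapshots) / n
    return [[r11, r12], [r12.conjugate(), r22]]

def mvdr_weights(cov, d):
    """MVDR weights w = R^-1 d / (d^H R^-1 d) for a 2x2 covariance R."""
    (a, b), (c, e) = cov
    det = a * e - b * c
    inv = [[e / det, -b / det], [-c / det, a / det]]
    rd = [inv[0][0] * d[0] + inv[0][1] * d[1],
          inv[1][0] * d[0] + inv[1][1] * d[1]]
    denom = d[0].conjugate() * rd[0] + d[1].conjugate() * rd[1]
    return [rd[0] / denom, rd[1] / denom]

def beam_output(weights, snapshot):
    """Beamformer output w^H x for one snapshot."""
    return (weights[0].conjugate() * snapshot[0]
            + weights[1].conjugate() * snapshot[1])

# One set of weights per candidate direction, computed from the same data.
snapshots = [(1 + 0j, 1 + 0j), (0 + 1j, 0 + 1j), (0.5 + 0j, -0.5 + 0j)]
cov = covariance(snapshots)
weights_by_direction = {
    angle: mvdr_weights(cov, steering_vector(angle, 1000.0, 0.05))
    for angle in (0, 90, 180)
}
```

The weights change whenever the covariance of the incoming data changes, which is the adaptive part; the constraint built into the formula keeps unit gain toward the steered direction while the remaining output power is minimized.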

Step S302, among the frequency spectra in the at least two directions, the direction corresponding to the frequency spectrum whose spectral feature meets the preset condition is determined as the target direction.

Here, the target direction is the direction closest to the user's actual direction. In the embodiments of this application, the direction corresponding to the frequency spectrum whose spectral feature meets the preset condition can be determined as the target direction.

The spectral feature is attribute information of the frequency spectrum, which reflects the quality of the frequency spectrum and the parameters corresponding to it.
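
The selection in step S302 can be sketched as follows. The spectral feature and the preset condition are not specified at this point in the text, so this sketch assumes, purely for illustration, that the feature is total spectral energy and that the condition is "largest value among all directions". The function names and sample values are hypothetical.

```python
def spectral_energy(spectrum):
    """Assumed spectral feature: sum of squared magnitudes."""
    return sum(m * m for m in spectrum)

def pick_target_direction(spectra_by_direction, feature=spectral_energy):
    """Return the direction whose spectrum has the largest feature value."""
    if len(spectra_by_direction) < 2:
        raise ValueError("spectra in at least two directions are required")
    return max(spectra_by_direction,
               key=lambda direction: feature(spectra_by_direction[direction]))

# Magnitude spectra keyed by direction in degrees (made-up values).
spectra = {0: [0.1, 0.2], 90: [0.9, 0.8], 180: [0.3, 0.1]}
target = pick_target_direction(spectra)
```

Passing a different `feature` function swaps in another criterion without changing the selection logic.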

Step S303, the second voice information acquired in the target direction is obtained, and speech recognition is performed on the second voice information.

Here, once the target direction has been determined, the user is known to be in the target direction. Therefore, when speech recognition is to be performed on the second voice information that follows the user's first voice information, the second voice information can be collected in the target direction. During voice collection, the user's second voice information is the speech that is actually needed, while any sound other than the user's voice can be regarded as noise. Collecting only the second voice information in the target direction therefore captures only the needed speech, obtains the user's voice information accurately, and avoids picking up noise from other directions.

In the embodiments of this application, any speech recognition approach may be used to recognize the second voice information. The lexical content of the second voice information is converted into input that the electronic device can read, such as key presses, binary codes or character strings, or the recognized voice information is converted into readable text information that is displayed as output.

Note that the sound source of the second voice information may be the same as or different from the sound source of the first voice information. For example, the embodiments of this application can be applied to the following scenarios:

Scenario 1: as shown in Fig. 3B, user 31 utters the first voice information in the 90° direction of the electronic device 30. The server performs ADBF processing on the first voice information to obtain frequency spectra in at least two directions and, among these frequency spectra, determines the direction corresponding to the frequency spectrum whose spectral feature meets the preset condition as the target direction; that is, it determines that the direction closest to user 31 is the 90° direction. The electronic device 30 then continues to collect the second voice information in the 90° direction. The second voice information is also uttered by user 31, and speech recognition is required for user 31's voice information, so speech recognition is performed on the second voice information.

Scenario 2: as shown in Fig. 3C, user 31 utters the first voice information in the 90° direction of the electronic device 30. The server performs ADBF processing on the first voice information to obtain frequency spectra in at least two directions and, among these frequency spectra, determines the direction corresponding to the frequency spectrum whose spectral feature meets the preset condition as the target direction; that is, it determines that the direction closest to user 31 is the 90° direction. The electronic device 30 then continues to collect the second voice information in the 90° direction. The second voice information is uttered by user 32; in other words, user 31 and user 32 are at the same position, and the electronic device is woken up by user 31's first voice information. Speech recognition is required for user 32's voice information, so speech recognition is performed on the second voice information.

Scenario 3: still referring to Fig. 3C, user 31 utters the first voice information in the 90° direction of the electronic device 30. The server performs ADBF processing on the first voice information to obtain frequency spectra in at least two directions and, among these frequency spectra, determines the direction corresponding to the frequency spectrum whose spectral feature meets the preset condition as the target direction; that is, it determines that the direction closest to user 31 is the 90° direction. The electronic device then continues to collect the second voice information in the 90° direction. The second voice information is uttered by user 32; in other words, user 31 and user 32 are at the same position, and the electronic device is woken up by user 31's first voice information. Here, however, speech recognition is required for user 31's voice information. The method of the embodiments of this application may therefore further include a voice information judging step, which determines whether the timbre of the second voice information is the same as the timbre of the first voice information. If the timbres are the same, the second voice information comes from the same person as the first voice information, and speech recognition can be performed directly on the collected second voice information. If they differ, the second voice information comes from a different person, and voice information must be collected again until the second voice information of user 31 is collected. Alternatively, in other embodiments, when the collected voice information is that of user 32 and speech recognition is not currently required for user 32's voice information, the voice information of user 32 can be saved.
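
The timbre check in scenario 3 could be sketched as a similarity test between two voice feature vectors. The application does not say how timbre is represented or compared; the feature vectors, the cosine-similarity measure and the 0.8 threshold below are assumptions made purely for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def same_speaker(first_timbre, second_timbre, threshold=0.8):
    """Treat two utterances as the same speaker if their (hypothetical)
    timbre vectors are similar enough."""
    return cosine_similarity(first_timbre, second_timbre) >= threshold
```

When `same_speaker` returns False, the flow described above would collect voice information again instead of recognizing the current utterance.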

Scenario 4: still referring to Fig. 3C, user 31 utters the first voice information in the 90° direction of the electronic device 30. The server performs ADBF processing on the first voice information to obtain frequency spectra in at least two directions and, among these frequency spectra, determines the direction corresponding to the frequency spectrum whose spectral feature meets the preset condition as the target direction; that is, it determines that the direction closest to user 31 is the 90° direction. The electronic device then continues to collect the second voice information in the 90° direction. The second voice information is uttered by user 32; in other words, user 31 and user 32 are at the same position, and the electronic device is woken up by user 31's first voice information. In this scenario there is no requirement as to whose voice information must be recognized, so recognition need not wait for user 31's voice information. The second voice information of user 32 may also be treated as first voice information to wake up the electronic device a second time, so as to meet user 32's current need to use the electronic device.

Scenario 5: as shown in Fig. 3D, user 31 utters the first voice information in the 90° direction of the electronic device 30. The server performs ADBF processing on the first voice information to obtain frequency spectra in at least two directions and, among these frequency spectra, determines the direction corresponding to the frequency spectrum whose spectral feature meets the preset condition as the target direction; that is, it determines that the direction closest to user 31 is the 90° direction. The electronic device 30 then continues to collect the second voice information in the 90° direction. If the electronic device 30 does not collect second voice information in the 90° direction within a preset time, it can collect voice information in other directions. For example, it may collect the voice information of user 32 in the 60° direction. In that case, the direction of user 32 can be determined as the target direction according to the voice information of user 32, and the second voice information of user 32 is then collected and speech recognition is performed on it.
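
The fallback in scenario 5 can be sketched as a small decision function. The event format (time in seconds since wake-up, direction in degrees) and the function name are hypothetical, and the sketch simply switches to the earliest speech heard from another direction.

```python
def next_capture_direction(target, detections, timeout_s):
    """Decide which direction to keep capturing from after wake-up.

    detections: list of (time_s, direction_deg) speech events observed after
    the wake-up (hypothetical format). If speech arrives from the target
    direction within the timeout, the target is kept; otherwise the earliest
    speech from any other direction becomes the new target.
    """
    for time_s, direction in detections:
        if time_s <= timeout_s and direction == target:
            return target
    for time_s, direction in sorted(detections):
        if direction != target:
            return direction
    return target  # nothing heard elsewhere either: keep the old target
```

For example, with a 90° target, a 5-second timeout and a single detection at 60° after 6 seconds, the function returns 60.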

In the speech recognition method provided by the embodiments of this application, ADBF processing is performed on the acquired first voice information to obtain frequency spectra in at least two directions, and among these frequency spectra the direction corresponding to the frequency spectrum whose spectral feature meets the preset condition is determined as the target direction. In this way, the direction of the sound source can be located accurately, so that second voice information in an accurate direction can be obtained during subsequent speech recognition, which improves the accuracy of speech recognition.

Fig. 4 is an optional schematic flowchart of the speech recognition method provided by an embodiment of this application. As shown in Fig. 4, the method includes the following steps:

Step S401, the server determines the keyword contained in the first voice information.

Here, the keyword can be determined by parsing the first voice information. For example, text recognition can be performed on the first voice information to obtain text information, and word segmentation can be performed on the text information to obtain at least one word. Then, according to the part of speech of each word, the words that meet a preset part-of-speech condition are determined as the keyword. Alternatively, artificial intelligence techniques can be used to parse the first voice information and determine the keyword it contains.

For example, when the first voice information is the user saying "please play music" to the electronic device, the keyword can be determined to be "music", so an application related to music needs to be started on the electronic device.
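
The segmentation-and-part-of-speech route described above can be sketched with a toy lexicon. The lexicon, the whitespace-based segmentation and the "nouns only" condition are assumptions for illustration; a real system would need a proper word segmenter and part-of-speech tagger, which the application does not specify.

```python
# Toy part-of-speech lexicon (hypothetical, for illustration only).
TOY_POS = {
    "please": "adverb",
    "play": "verb",
    "music": "noun",
    "power": "noun",
    "on": "adverb",
}

def extract_keywords(text, pos_lexicon, wanted=("noun",)):
    """Split recognized text into words and keep those whose part of speech
    meets the preset condition (here: membership in `wanted`)."""
    words = text.lower().split()  # stand-in for real word segmentation
    return [word for word in words if pos_lexicon.get(word) in wanted]

keywords = extract_keywords("please play music", TOY_POS)
```

With the toy lexicon, the recognized text "please play music" yields the single keyword "music", matching the example in the text.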

Step S402, the ADBF processing is performed on the first voice information in at least two directions to obtain, for each direction, a corresponding beam that includes the keyword.

Here, the first voice information is a sound wave. When the sound wave is obtained, ADBF processing is performed on it from at least two directions. Because the first voice information includes the keyword, the beam obtained after ADBF processing of the sound wave also includes the keyword.

Step S403, the wake-up word spectrum in each direction is determined according to the beam that corresponds to that direction and includes the keyword.

Here, once the beam corresponding to each direction has been obtained, the frequency distribution of the beam can be determined, which gives the frequency distribution curve of the beam, that is, the frequency spectrum. Because the beam includes the keyword, the frequency spectrum also carries the keyword. The keyword is used to wake up the electronic device, so the resulting frequency spectrum that includes the keyword is the wake-up word spectrum.
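
One common way to obtain the frequency distribution of a beam is a discrete Fourier transform (DFT) of its time-domain samples. The application does not state which transform it uses, so the plain DFT below is an assumption, written without external libraries for clarity rather than speed. The function name and test signal are hypothetical.

```python
import cmath
import math

def magnitude_spectrum(samples):
    """Magnitude of the DFT of a real-valued beam signal, bins 0..N/2."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(samples)))
        for k in range(n // 2 + 1)
    ]

# A made-up beam signal: a cosine with exactly 2 cycles over 8 samples.
beam = [math.cos(2 * math.pi * 2 * i / 8) for i in range(8)]
spectrum = magnitude_spectrum(beam)
peak_bin = spectrum.index(max(spectrum))
```

Here the spectrum peaks at bin 2, the frequency of the test cosine. Applying the same function to the beam of each direction gives one spectrum per direction.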

Step S404: among the wake-up word spectra in the at least two directions, determine the direction corresponding to the wake-up word spectrum whose spectral feature meets a preset condition as the target direction.

Here, the target direction is the direction closest to the actual direction of the user. In the embodiments of this application, the direction corresponding to the wake-up word spectrum whose spectral feature meets the preset condition may be determined as the target direction.

The spectral feature is attribute information of the wake-up word spectrum; the attribute information can reflect the quality of the wake-up word spectrum and the parameters corresponding to it.

Step S405: acquire the second voice information collected in the target direction, and perform speech recognition on the second voice information.

The speech recognition method provided by the embodiments of this application performs ADBF processing on the acquired first voice information to obtain wake-up word spectra in at least two directions and, among those wake-up word spectra, determines the direction corresponding to the wake-up word spectrum whose spectral feature meets the preset condition as the target direction. In this way, the direction of the sound source can be located accurately, so that the second voice information in the accurate direction can be obtained in the subsequent speech recognition process, which improves the accuracy of speech recognition.

Fig. 5 is an optional schematic flowchart of the speech recognition method provided by the embodiments of this application. As shown in Fig. 5, the method includes the following steps:

Step S501: the server determines the keyword included in the first voice information.

Step S502: perform the ADBF processing on the first voice information in at least two directions to obtain, for each direction, a corresponding beam that includes the keyword.

It should be noted that steps S501 to S502 are the same as steps S401 to S402 above and are not described again in the embodiments of this application.

Step S503: perform speech enhancement processing on the beam, including the keyword, corresponding to each direction to obtain the speech enhancement beam of the corresponding direction.

Here, the purpose of the speech enhancement processing is to enhance the signal of the beam, so that the keyword in the speech enhancement beam can be recognized accurately.

In the embodiments of this application, speech enhancement processing can be performed in either of the following two ways:

Mode 1: perform single-channel speech enhancement processing on the beam corresponding to each direction to obtain the speech enhancement beam of the corresponding direction.

Here, the single-channel speech enhancement processing enhances the speech of the beam in the corresponding direction. The signal strength of the resulting speech enhancement beam is higher than that of the original beam, which makes it easier to determine the spectrum of the speech enhancement beam accurately in subsequent steps.

Mode 2: eliminate the noise in the beam corresponding to each direction to obtain the speech enhancement beam of the corresponding direction.

Here, eliminating the noise in the beam reduces the influence of noise on the effective speech. Because the noise is reduced, the strength of the effective speech signal is enhanced in relative terms, so this also achieves the effect of enhancing the beam corresponding to the first voice information and yields the speech enhancement beam.
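Mode 2 can be illustrated with magnitude spectral subtraction, one common noise-reduction technique. The application does not name a specific noise-elimination algorithm, so spectral subtraction is a stand-in chosen for this sketch, and the function `spectral_subtract`, the floor `min_gain`, and all bin values are hypothetical.

```python
def spectral_subtract(magnitudes, noise_floor, min_gain=0.05):
    """Subtract an estimated noise magnitude from each frequency bin.

    min_gain keeps a small fraction of the original bin so that no bin
    becomes negative (a hypothetical floor value).
    """
    return [max(m - n, min_gain * m) for m, n in zip(magnitudes, noise_floor)]


# Hypothetical beam spectrum: bin 1 carries speech, the other bins carry noise.
beam_bins = [0.2, 1.0, 0.3, 0.2]
noise_estimate = [0.2, 0.2, 0.2, 0.2]

enhanced = spectral_subtract(beam_bins, noise_estimate)
ratio_before = beam_bins[1] / beam_bins[0]
ratio_after = enhanced[1] / enhanced[0]
print(enhanced, ratio_before, ratio_after)
```

The speech bin loses a little magnitude, but the noise-only bins drop far more, so the speech is stronger relative to the noise, which is the relative enhancement that Mode 2 describes.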

Step S504: determine the speech enhancement beam in each direction as the wake-up word spectrum in the corresponding direction.

Here, the wake-up word spectrum includes a wake-up word. The wake-up word may be the keyword, and it is used to wake up the electronic device.

Step S505: among the wake-up word spectra in the at least two directions, determine the direction corresponding to the wake-up word spectrum whose spectral feature meets a preset condition as the target direction.

Step S506: acquire the second voice information collected in the target direction, and perform speech recognition on the second voice information.

It should be noted that steps S505 to S506 are the same as steps S404 to S405 above and are not described again in the embodiments of this application.

In some embodiments, the spectral feature includes a signal-to-noise ratio or a wake-up rate. Accordingly, step S302 above can be implemented in either of the following two ways:

First, among the spectra in the at least two directions, determine the direction corresponding to the spectrum with the highest signal-to-noise ratio as the target direction.

Second, determine the direction corresponding to the spectrum with the highest wake-up rate as the target direction.
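Both selection rules reduce to taking the direction with the largest feature value, as the following sketch shows. The function names and the per-direction signal and noise powers are hypothetical. The wake-up scores are the values this description reports for the 0°, 90° and 180° beams in its Fig. 10 experiment, used here as a stand-in for the wake-up rate.

```python
import math


def snr_db(signal_power, noise_power):
    """Signal-to-noise ratio in decibels."""
    return 10.0 * math.log10(signal_power / noise_power)


def pick_target_direction(feature_by_direction):
    """Return the direction whose spectral feature value is highest."""
    return max(feature_by_direction, key=feature_by_direction.get)


# Hypothetical (signal power, noise power) per beam direction in degrees.
powers = {0: (1.0, 4.0), 90: (1.0, 5.0), 180: (8.0, 1.0)}
snr_by_direction = {d: snr_db(s, n) for d, (s, n) in powers.items()}
target_by_snr = pick_target_direction(snr_by_direction)

# Wake-up scores reported for the 0°, 90° and 180° beams.
wake_scores = {0: 0.10, 90: 0.09, 180: 0.9}
target_by_wake = pick_target_direction(wake_scores)
print(target_by_snr, target_by_wake)
```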

Fig. 6 A is an optional flow diagram of audio recognition method provided by the embodiments of the present application, such as Fig. 6 A institute Show, the described method comprises the following steps:

Step S601 acquires the first voice messaging by the voice collecting unit on electronic equipment, to first voice Information carries out echo cancellor.

Here, the electronic equipment can have two voice collecting units, for example, the electronic equipment can for The intelligent sound box of two microphones.The first voice messaging of user is acquired by two microphones on intelligent sound box.

In the embodiment of the present application, after collecting first voice messaging, first voice messaging is returned Sound is eliminated, to remove the echo in first voice messaging.

In other embodiments, in the first voice messaging of microphone acquisition user, the electronic equipment can be Audio is played, therefore the microphone can also acquire some back production signals, the back production information is the intelligent sound box inner part Voice signal.It when intelligent sound box collects the back production signal, needs to carry out voice elimination to the back production information, to eliminate The back production signal avoids influence of the back production signal to the first voice messaging of user.

Step S602: perform ADBF processing on the echo-cancelled first voice information in the at least two directions to obtain the spectra in the at least two directions.

Here, a method for determining the direction is provided first. Taking the above smart speaker with two microphones as an example, the determination of the direction in the embodiments of this application is explained.

As shown in Fig. 6B, the smart speaker 60 has a first microphone 601 and a second microphone 602, and point 603 is the midpoint between the first microphone 601 and the second microphone 602. When a user speaks to the smart speaker 60, the user 61 is at the position shown in the figure. The line segment between the position of the user 61 and the midpoint 603 is determined as the first line, and the line segment between the first microphone 601 and the second microphone 602 is determined as the second line. The direction of the user is then the angle 62 between the first line and the second line; that is, the user 61 is located in the direction of angle 62 relative to the smart speaker 60.
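The geometry of Fig. 6B can be written down directly. The sketch below computes the angle between the first line (user to midpoint) and the second line (microphone to microphone) from 2-D coordinates. The coordinates, the 5 cm spacing, and the convention that 0° points from the first microphone toward the second microphone are assumptions; the figure itself does not fix which end of the array is 0°.

```python
import math


def user_angle_deg(mic1, mic2, user):
    """Angle in degrees (0..180) between the user line and the microphone line."""
    mid = ((mic1[0] + mic2[0]) / 2.0, (mic1[1] + mic2[1]) / 2.0)
    second_line = (mic2[0] - mic1[0], mic2[1] - mic1[1])  # mic 601 -> mic 602
    first_line = (user[0] - mid[0], user[1] - mid[1])     # midpoint 603 -> user 61
    dot = second_line[0] * first_line[0] + second_line[1] * first_line[1]
    cross = second_line[0] * first_line[1] - second_line[1] * first_line[0]
    return math.degrees(math.atan2(abs(cross), dot))


# Hypothetical layout: microphones 5 cm apart on the x axis.
MIC_601 = (-0.025, 0.0)
MIC_602 = (0.025, 0.0)

front = user_angle_deg(MIC_601, MIC_602, (0.0, 1.0))
speaker_pos = (math.cos(math.radians(160)), math.sin(math.radians(160)))
angle_62 = user_angle_deg(MIC_601, MIC_602, speaker_pos)
print(front, angle_62)
```

Using `atan2` on the cross and dot products avoids the loss of precision that `acos` shows near 0° and 180°.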

Based on the method of determining the angle direction shown in Fig. 6B, in the embodiments of this application, after the first voice information of the user is acquired, ADBF processing can be performed on the echo-cancelled first voice information in any at least two angle directions to obtain the spectrum corresponding to each angle direction.

Step S603: obtain the wake-up word included in the spectrum.

Step S604: wake up the electronic device by means of the wake-up word, so that the speech recognition is carried out by the electronic device.

For example, the wake-up word may be identification information pre-stored in the smart speaker. When the spectrum is found to include a wake-up word corresponding to the identification information of the smart speaker, the smart speaker is woken up.

Step S605: among the spectra in the at least two directions, determine the direction corresponding to the spectrum whose spectral feature meets a preset condition as the target direction.

Step S606: acquire the second voice information collected in the target direction, and perform speech recognition on the second voice information.

It should be noted that steps S605 to S606 are the same as steps S302 to S303 above and are not described again in the embodiments of this application.

Fig. 7 is an optional schematic flowchart of the speech recognition method provided by the embodiments of this application. As shown in Fig. 7, the method includes the following steps:

Step S701: the server performs ADBF processing on the acquired first voice information to obtain spectra in at least two directions.

Step S702: among the spectra in the at least two directions, determine the direction corresponding to the spectrum whose spectral feature meets a preset condition as the target direction.

It should be noted that steps S701 to S702 are the same as steps S301 to S302 above and are not described again in the embodiments of this application.

Step S703: obtain the voice collection direction pre-stored on the electronic device.

Here, a preset voice collection direction is stored in the storage unit of the electronic device. The preset voice collection direction may be the direction, obtained in a historical time period, of the voice information used to activate the electronic device.

Step S704: store the target direction when the target direction differs from the voice collection direction.

Here, the determined target direction is the direction closest to the actual direction of the user, that is, closest to the sound source direction. Therefore, after the target direction is determined, it is compared with the voice collection direction pre-stored on the electronic device. If the target direction is the same as the voice collection direction, the voice collection direction can be used directly as the target direction for subsequent processing. If the target direction differs from the voice collection direction, the sound source is in a direction other than the pre-stored voice collection direction, so the target direction can also be stored in the storage unit of the electronic device, so that in subsequent speech processing the target direction can be used as the direction in which voice information is collected.
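Steps S703 and S704 amount to a compare-and-update on the stored direction. In the sketch below the storage unit is modelled as a plain dictionary, and the key name `voice_collection_direction` and the function name are hypothetical.

```python
def update_stored_direction(storage, target_direction,
                            key="voice_collection_direction"):
    """Store the target direction if it differs from the pre-stored one.

    Returns True when the stored direction was changed.
    """
    changed = storage.get(key) != target_direction  # S703: read pre-stored value
    if changed:
        storage[key] = target_direction             # S704: store on mismatch
    return changed


storage_unit = {"voice_collection_direction": 90}  # hypothetical pre-stored value
first_update = update_stored_direction(storage_unit, 180)
second_update = update_stored_direction(storage_unit, 180)
print(first_update, second_update, storage_unit)
```

The second call leaves the storage untouched because the target direction already equals the stored voice collection direction.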

Step S705: acquire the second voice information collected in the target direction, and perform speech recognition on the second voice information.

In the speech recognition method provided by the embodiments of this application, the target direction is stored in the storage unit of the electronic device as the pre-stored voice collection direction. In this way, during subsequent speech recognition and speech processing, when the sound source direction is consistent with the stored voice collection direction, the pre-stored voice collection direction can be used directly to collect and recognize voice information. This keeps the subsequent voice collection direction consistent with the historical voice collection direction, so that voice information in the same direction is collected.

The following describes an exemplary application of the embodiments of this application in an actual application scenario.

The embodiments of this application provide a speech recognition method that can greatly improve the wake-up rate and recognition rate under strong noise and strong reverberation.

Fig. 8 is an optional schematic flowchart of the speech recognition method provided by the embodiments of this application. As shown in Fig. 8, the method includes the following steps:

Step S801: collect the first voice information of the user through the voice collection unit.

Step S802: perform echo cancellation on the first voice information by AEC processing.

Step S803: perform ADBF processing on the first voice information in the 0°, 90° and 180° directions respectively to obtain beams in these three different directions.

As shown in Fig. 8, step S803a performs ADBF processing in the 0° direction, step S803b performs ADBF processing in the 90° direction, and step S803c performs ADBF processing in the 180° direction.

Step S804: perform NS processing on the beams in the different directions to obtain the speech enhancement beam of the corresponding direction.

Here, as shown in Fig. 8, step S804 includes steps S804a, S804b and S804c, which perform NS processing on the beam in the 0° direction, the beam in the 90° direction and the beam in the 180° direction, respectively.

Step S805: using KWS technology, perform keyword recognition on each speech enhancement beam to determine the keyword.

Here, as shown in Fig. 8, step S805 includes steps S805a, S805b and S805c, which perform keyword recognition on the speech enhancement beam in the 0° direction, the speech enhancement beam in the 90° direction and the speech enhancement beam in the 180° direction, respectively.

Step S806: wake up the electronic device according to the keyword, and judge whether the wake-up is valid.

Here, if the judgment result is yes, step S807 is performed; if the judgment result is no, the process ends.

Step S807: using DOA technology, perform direction-of-arrival estimation on the speech enhancement beam, using the array information to estimate the direction of the sound source.

Step S808: using ADBF technology, perform adaptive beamforming on the speech enhancement beam, designing the weighting coefficients according to real-time data to form a beam with directivity.

Step S809: perform ASR speech recognition on the formed beam with directivity.
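The control flow of steps S802 to S809 can be outlined as below. This is only a flow sketch: every processing stage is passed in as a placeholder callable, and the function `run_pipeline`, the stage names, the `wake_threshold` value used to decide whether the wake-up is valid, and the toy stages are all hypothetical. None of them is a real AEC, ADBF, NS, KWS, DOA or ASR implementation.

```python
def run_pipeline(mic_frames, stages, directions=(0, 90, 180), wake_threshold=0.5):
    """Run the Fig. 8 flow with caller-supplied stage functions."""
    clean = stages["aec"](mic_frames)                             # S802
    beams = {d: stages["adbf"](clean, d) for d in directions}     # S803a-c
    enhanced = {d: stages["ns"](b) for d, b in beams.items()}     # S804a-c
    scores = {d: stages["kws"](b) for d, b in enhanced.items()}   # S805a-c
    best = max(scores, key=scores.get)
    if scores[best] < wake_threshold:                             # S806
        return None                                               # not woken: end
    angle = stages["doa"](enhanced, best)                         # S807
    directed = stages["adbf"](enhanced[best], angle)              # S808
    return stages["asr"](directed)                                # S809


# Toy stages that only pass labels around, to exercise the control flow.
toy_stages = {
    "aec": lambda frames: frames,
    "adbf": lambda signal, direction: (signal, direction),
    "ns": lambda beam: beam,
    "kws": lambda beam: {0: 0.10, 90: 0.09, 180: 0.9}[beam[1]],
    "doa": lambda enhanced, best: 160,
    "asr": lambda beam: "recognized speech from %d degrees" % beam[1],
}

result = run_pipeline("frames", toy_stages)
rejected = run_pipeline("frames", toy_stages, wake_threshold=0.95)
print(result, rejected)
```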

The speech recognition method of the embodiments of this application can be applied to an electronic device with two microphones. The two microphones collect the voice information of the user at the same time, yielding two voice signals.

In the embodiments of this application, adaptive beam design is performed in the 0°, 90° and 180° directions on the two voice signals collected by the two microphones, yielding three beams, and single-channel speech enhancement is then performed on the three beams. The directivity of the three beams is shown in Fig. 9A to Fig. 9C, each of which shows the directivity pattern of beam 901: Fig. 9A shows the beam pattern in the 0° direction, Fig. 9B the beam pattern in the 90° direction, and Fig. 9C the beam pattern in the 180° direction. The three beams can effectively suppress noise outside the beam. As Fig. 9A to Fig. 9C show, because the beams of a two-microphone array are wide, the tolerance for angle errors is somewhat higher: a speaker within a certain range of deviation from the main direction of a beam does not suffer obvious damage to the speech signal.

As an example, assume that the speaker is in the 160° direction and the interfering noise is in the 60° direction. Fig. 10 shows the wake-up word spectra delivered to the wake-up unit of the electronic device: panel a is the clean speech signal, panel b is the speech signal with the interference superimposed, panel c is the speech after single-channel noise reduction, and panels d, e and f are the beam output spectra in the 0°, 90° and 180° directions, respectively. Because the noise is speech interference, the single-channel noise reduction in panel c has almost no effect. Because the interfering noise comes from the 60° direction, panels d and e retain mostly the noise signal, whereas panel f retains more of the target speech signal. Panel f therefore has the highest signal-to-noise ratio and, naturally, the highest wake-up rate.
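The qualitative behaviour in this example can be reproduced with a much simpler beamformer than ADBF. The sketch below uses a fixed narrowband delay-and-sum beam for a two-microphone array, which is a stand-in and not the adaptive design of this application. The microphone spacing, the analysis frequency, and the function names are assumptions; only the 160° speaker, the 60° interferer, and the 0°, 90° and 180° look directions come from the text.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s
MIC_SPACING = 0.05      # metres, hypothetical
FREQ_HZ = 1000.0        # single analysis frequency, hypothetical


def inter_mic_phase(angle_deg):
    """Phase lag between the two microphones for a plane wave from angle_deg."""
    delay = MIC_SPACING * math.cos(math.radians(angle_deg)) / SPEED_OF_SOUND
    return 2.0 * math.pi * FREQ_HZ * delay


def beam_gain(look_deg, source_deg):
    """Magnitude response of a delay-and-sum beam steered to look_deg."""
    mic_signals = [1.0, cmath.exp(-1j * inter_mic_phase(source_deg))]
    weights = [0.5, 0.5 * cmath.exp(-1j * inter_mic_phase(look_deg))]
    output = sum(w.conjugate() * x for w, x in zip(weights, mic_signals))
    return abs(output)


SPEAKER_DEG, NOISE_DEG = 160, 60
ratios = {look: beam_gain(look, SPEAKER_DEG) / beam_gain(look, NOISE_DEG)
          for look in (0, 90, 180)}
best_look = max(ratios, key=ratios.get)
print(ratios, best_look)
```

Under these assumptions the 0° and 90° beams pass the interferer more strongly than the speaker, while the 180° beam favours the speaker, which agrees with the description of panels d, e and f. An adaptive beamformer would additionally place a null toward the interferer, which this fixed beam does not do.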

In one experimental result, the wake-up scores of panels a, b, c, d, e and f are 0.98, 0.32, 0.33, 0.10, 0.09 and 0.9, respectively, which shows that the method of the embodiments of this application has a higher wake-up rate.

In other embodiments, still referring to Fig. 10, the three processed signals are sent in real time to the wake-up unit of the electronic device to detect whether they contain the wake-up word. If the electronic device wakes up, the DOA unit of the electronic device estimates the speaker angle from the cached wake-up word speech. A two-microphone array usually produces large angle estimation errors, or even wrong estimates, under high reverberation or strong noise. If the angular range is first delimited using the wake-up information and the angle is then estimated within the delimited range, the accuracy of speaker angle estimation can be greatly improved.

As shown in Fig. 10, because panel f has the highest wake-up score, when DOA technology is used to estimate the speaker angle, the estimation range is set to a range biased toward 180°, for example 120° to 180°, which clearly reduces angle estimation errors substantially.
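A restricted-range angle search of this kind might look like the following sketch. It uses a narrowband steered-response grid search over whole degrees, which is one simple DOA estimator chosen for illustration and not necessarily the one used in this application. The 60° half-width in `search_range`, the microphone spacing, the analysis frequency, and the noise-free measured phase are assumptions; the 120° to 180° range for the 180° beam is the example from the text.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s
MIC_SPACING = 0.05      # metres, hypothetical
FREQ_HZ = 1000.0        # single analysis frequency, hypothetical


def inter_mic_phase(angle_deg):
    """Phase lag between the two microphones for a plane wave from angle_deg."""
    delay = MIC_SPACING * math.cos(math.radians(angle_deg)) / SPEED_OF_SOUND
    return 2.0 * math.pi * FREQ_HZ * delay


def search_range(beam_deg, half_width=60):
    """Angular range around the winning wake-up beam, clamped to 0..180."""
    return max(0, beam_deg - half_width), min(180, beam_deg + half_width)


def estimate_angle(measured_phase, lo_deg, hi_deg):
    """Pick the whole-degree angle whose steered response power is largest."""
    def steered_power(angle_deg):
        diff = inter_mic_phase(angle_deg) - measured_phase
        return 2.0 + 2.0 * math.cos(diff)  # equals |1 + exp(j * diff)| ** 2
    return max(range(lo_deg, hi_deg + 1), key=steered_power)


lo, hi = search_range(180)              # the 180° beam had the best wake-up score
measured = inter_mic_phase(160)         # idealized measurement, speaker at 160°
speaker_angle = estimate_angle(measured, lo, hi)
print((lo, hi), speaker_angle)
```

Limiting the grid to the delimited range means that a spurious peak outside 120° to 180°, which reverberation or strong noise could create in a real measurement, can no longer be selected.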

In the embodiments of this application, after the electronic device is woken up, the speech of the speaker undergoes processing such as dereverberation, beamforming and single-channel speech enhancement, and is then sent to the ASR unit for recognition, completing the voice interaction with the device.

The speech recognition method provided by the embodiments of this application uses multi-beam wake-up with two microphones, which can improve the wake-up rate. It also uses the wake-up information to improve the accuracy of angle estimation, which reduces the speech damage caused by beam angle deviation and improves the speech signal-to-noise ratio, thereby further improving the accuracy of speech recognition.

The following continues with an exemplary structure in which the speech recognition apparatus 254 provided by the embodiments of this application is implemented as software modules. In some embodiments, as shown in Fig. 2B, the software modules stored in the speech recognition apparatus 254 of the memory 250 may include a first processing module 2541, a determining module 2542 and a second processing module 2543.

The first processing module 2541 is configured to perform ADBF processing on the acquired first voice information to obtain spectra in at least two directions;

The determining module 2542 is configured to determine, among the spectra in the at least two directions, the direction corresponding to the spectrum whose spectral feature meets a preset condition as the target direction;

The second processing module 2543 is configured to acquire the second voice information collected in the target direction and perform speech recognition on the second voice information.

In some embodiments, the spectrum is a wake-up word spectrum. Accordingly, the first processing module is further configured to: determine the keyword included in the first voice information; perform the ADBF processing on the first voice information in at least two directions to obtain, for each direction, a corresponding beam that includes the keyword; and determine the wake-up word spectrum in each direction according to the beam, including the keyword, that corresponds to that direction.

In some embodiments, the apparatus further includes:

A speech enhancement processing module, configured to perform speech enhancement processing on the beam, including the keyword, corresponding to each direction to obtain the speech enhancement beam of the corresponding direction;

Accordingly, the first processing module is further configured to determine the speech enhancement beam in each direction as the wake-up word spectrum in the corresponding direction.

In some embodiments, the speech enhancement processing module is further configured to:

Perform single-channel speech enhancement processing on the beam corresponding to each direction to obtain the speech enhancement beam of the corresponding direction; or eliminate the noise in the beam corresponding to each direction to obtain the speech enhancement beam of the corresponding direction.

In some embodiments, the spectral feature includes a signal-to-noise ratio or a wake-up rate;

Accordingly, the determining module is further configured to: among the spectra in the at least two directions, determine the direction corresponding to the spectrum with the highest signal-to-noise ratio as the target direction, or determine the direction corresponding to the spectrum with the highest wake-up rate as the target direction.

In some embodiments, the apparatus further includes:

An echo cancellation module, configured to perform echo cancellation on the first voice information before ADBF processing is performed on the acquired first voice information;

Accordingly, the first processing module is further configured to perform ADBF processing on the echo-cancelled first voice information in the at least two directions to obtain the spectra in the at least two directions.

In some embodiments, the apparatus further includes:

A first obtaining module, configured to obtain the wake-up word included in the spectrum;

A wake-up module, configured to wake up the electronic device by means of the wake-up word, so that the speech recognition is carried out by the electronic device.

In some embodiments, the apparatus further includes:

A second obtaining module, configured to obtain the voice collection direction pre-stored on the electronic device;

A storage module, configured to store the target direction when the target direction differs from the voice collection direction.

It should be noted that the description of the apparatus embodiments of this application is similar to the description of the method embodiments above and has beneficial effects similar to those of the method embodiments, so it is not repeated here. For technical details not disclosed in the apparatus embodiments, refer to the description of the method embodiments of this application.

The embodiments of this application provide a storage medium storing executable instructions. When the executable instructions are executed by a processor, they cause the processor to perform the method provided by the embodiments of this application, for example, the method shown in Fig. 3A.

In some embodiments, the storage medium may be a memory such as a ferroelectric memory (Ferroelectric Random Access Memory, FRAM), read-only memory (Read Only Memory, ROM), programmable read-only memory (Programmable Read Only Memory, PROM), erasable programmable read-only memory (Erasable Programmable Read Only Memory, EPROM), electrically erasable programmable read-only memory (Electrically Erasable Programmable Read Only Memory, EEPROM), flash memory, magnetic surface memory, optical disc, or compact disc read-only memory (Compact Disc Read-Only Memory, CD-ROM); it may also be any device that includes one of the above memories or any combination of them.

In some embodiments, the executable instructions may take the form of a program, software, a software module, a script or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine or other unit suitable for use in a computing environment.

As an example, the executable instructions may, but need not, correspond to a file in a file system. They may be stored as part of a file that holds other programs or data, for example, in one or more scripts in a Hypertext Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (for example, files that store one or more modules, subprograms or portions of code).

As an example, the executable instructions may be deployed to be executed on one computing device, on multiple computing devices located at one site, or on multiple computing devices distributed across multiple sites and interconnected by a communication network.

The above are merely embodiments of this application and are not intended to limit the scope of protection of this application. Any modification, equivalent replacement, improvement and the like made within the spirit and scope of this application shall fall within the scope of protection of this application.

Claims

1. A speech recognition method, characterized by comprising:

performing adaptive beamforming (ADBF) processing on acquired first voice information to obtain spectra in at least two directions;

among the spectra in the at least two directions, determining the direction corresponding to the spectrum whose spectral feature meets a preset condition as a target direction;

acquiring second voice information collected in the target direction, and performing speech recognition on the second voice information.

2. The method according to claim 1, characterized in that the spectrum is a wake-up word spectrum; accordingly, the performing ADBF processing on the acquired first voice information to obtain spectra in at least two directions comprises:

determining the keyword included in the first voice information;

performing the ADBF processing on the first voice information in at least two directions to obtain, for each direction, a corresponding beam that includes the keyword;

determining the wake-up word spectrum in each direction according to the beam, including the keyword, that corresponds to that direction.

3. The method according to claim 2, characterized in that the method further comprises:

performing speech enhancement processing on the beam, including the keyword, corresponding to each direction to obtain the speech enhancement beam of the corresponding direction;

accordingly, the determining the wake-up word spectrum in each direction according to the beam, including the keyword, that corresponds to that direction comprises: determining the speech enhancement beam in each direction as the wake-up word spectrum in the corresponding direction.

4. The method according to claim 3, characterized in that the performing speech enhancement processing on the beam, including the keyword, corresponding to each direction to obtain the speech enhancement beam of the corresponding direction comprises:

performing single-channel speech enhancement processing on the beam corresponding to each direction to obtain the speech enhancement beam of the corresponding direction; or eliminating the noise in the beam corresponding to each direction to obtain the speech enhancement beam of the corresponding direction.

5. The method according to claim 1, characterized in that the spectral feature includes a signal-to-noise ratio or a wake-up rate;

accordingly, the determining, among the spectra in the at least two directions, the direction corresponding to the spectrum whose spectral feature meets a preset condition as a target direction comprises:

among the spectra in the at least two directions, determining the direction corresponding to the spectrum with the highest signal-to-noise ratio as the target direction, or determining the direction corresponding to the spectrum with the highest wake-up rate as the target direction.

6. The method according to any one of claims 1 to 5, characterized in that the method further comprises:

before performing ADBF processing on the acquired first voice information, performing echo cancellation on the first voice information;

accordingly, the performing ADBF processing on the acquired first voice information to obtain spectra in at least two directions comprises:

performing ADBF processing on the echo-cancelled first voice information in the at least two directions to obtain the spectra in the at least two directions.

7. The method according to any one of claims 1 to 5, characterized in that the method further comprises:

obtaining the wake-up word included in the spectrum; waking up an electronic device by means of the wake-up word, so that the speech recognition is carried out by the electronic device;

or, the method further comprises: obtaining the voice collection direction pre-stored on the electronic device;

storing the target direction when the target direction differs from the voice collection direction.

8. A speech recognition apparatus, characterized by comprising:

a first processing module, configured to perform ADBF processing on acquired first voice information to obtain spectra in at least two directions;

a determining module, configured to determine, among the spectra in the at least two directions, the direction corresponding to the spectrum whose spectral feature meets a preset condition as a target direction;

a second processing module, configured to acquire second voice information collected in the target direction and perform speech recognition on the second voice information.

9. A speech recognition device, characterized by comprising:

a memory, configured to store executable instructions;

a processor, configured to implement the method according to any one of claims 1 to 7 when executing the executable instructions stored in the memory.

10. A storage medium, characterized by storing executable instructions which, when executed by a processor, cause the processor to implement the method according to any one of claims 1 to 7.