CN108986822A

CN108986822A - Speech recognition method, apparatus, electronic device and non-transitory computer storage medium

Info

Publication number: CN108986822A
Application number: CN201811011980.XA
Authority: CN
Inventors: 栗强; 雷欣; 胡亚光; 周羊
Original assignee: Chumen Wenwen Information Technology Co Ltd
Current assignee: Chumen Wenwen Information Technology Co Ltd
Priority date: 2018-08-31
Filing date: 2018-08-31
Publication date: 2018-12-11

Abstract

Embodiments of the present invention relate to the field of speech processing and provide a speech recognition method, apparatus, electronic device and non-transitory computer storage medium. The speech recognition method comprises: obtaining audio data to be recognized that has been collected by a terminal; determining, by means of a VAD algorithm, whether the audio data to be recognized includes a voice signal; and, if a voice signal is included, performing wake-up word recognition based on the audio data. With the method of the embodiments, wake-up word recognition is performed on audio data only when the audio data includes a voice signal. This avoids having to run wake-up word recognition on every piece of collected audio data regardless of its content, thereby greatly reducing the power consumption of the system and improving the user experience.

Description

Speech recognition method, apparatus, electronic device and non-transitory computer storage medium

Technical field

Embodiments of the present invention relate to the field of speech processing technology, and in particular to a speech recognition method, apparatus, electronic device and non-transitory computer storage medium.

Background art

With the continuous development of terminal devices, smart voice devices such as smart speakers and robots are used more and more widely. A user can input a voice signal through a smart voice hardware device; the smart voice hardware device, or its background server, can then perform semantic recognition on that voice signal and execute a corresponding operation according to the semantic recognition result. In some cases the result of the operation can also be returned to the user.

At present, after a smart voice device obtains audio data input by a user, it first needs to detect whether the obtained sound signal includes a wake-up word. If it does, the speech recognition system is activated to recognize the obtained sound signal so that a corresponding operation can be executed according to the recognized signal. If it does not, the speech recognition system is not activated and the obtained sound signal is not recognized. In other words, voice wake-up acts as a switch-like entry point: the user initiates human-computer interaction by speaking the wake-up word, and only after the smart voice device has been woken by the wake-up word spoken by the user can the user's subsequent voice signals be recognized.

In the course of implementation, the inventors found the following defect in the prior art: as long as the smart voice device is switched on, the wake-up word detection module must run detection on every sound signal the device obtains, whatever its content, for example a sound signal containing only a car horn or only engine roar. This greatly increases the power consumption of the system and results in a poor user experience.

Summary of the invention

In view of this, embodiments of the present invention provide a speech recognition method, apparatus, electronic device and non-transitory computer storage medium, which can greatly reduce the power consumption of the system and improve the user experience.

To solve the above problem, embodiments of the present invention mainly provide the following technical solutions:

In a first aspect, an embodiment of the present invention provides a speech recognition method, the method comprising:

obtaining collected audio data to be recognized;

determining, by means of a voice activity detection (VAD) algorithm, whether the audio data to be recognized includes a voice signal;

if a voice signal is included, performing wake-up word recognition based on the audio data.

In a second aspect, an embodiment of the present invention further provides a speech recognition apparatus, the apparatus comprising:

an obtaining module, configured to obtain collected audio data to be recognized;

a determining module, configured to determine, by means of a voice activity detection (VAD) algorithm, whether the audio data to be recognized includes a voice signal;

a recognition module, configured to perform wake-up word recognition based on the audio data when a voice signal is included.

In a third aspect, an embodiment of the present invention further provides an electronic device, comprising:

at least one processor;

and at least one memory connected to the processor, and a bus; wherein

the processor and the memory communicate with each other via the bus;

the processor is configured to call program instructions in the memory to execute the above speech recognition method.

In a fourth aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores computer instructions, and the computer instructions cause a computer to execute the above speech recognition method.

Through the above technical solutions, the technical solutions provided by the embodiments of the present invention have at least the following advantages:

In the speech recognition method provided by the embodiments of the present invention, obtaining the audio data to be recognized that has been collected by the terminal provides the precondition for subsequently determining whether the audio data to be recognized includes a voice signal; determining, by means of the VAD algorithm, whether the audio data to be recognized includes a voice signal provides a preliminary detection and screening of the obtained audio data and lays the foundation for subsequent wake-up word recognition based on the audio data; and performing wake-up word recognition based on the audio data only if a voice signal is included means that wake-up word recognition is run only when the audio data actually includes a voice signal. This avoids having to run wake-up word recognition on every piece of collected audio data regardless of its content, thereby greatly reducing the power consumption of the system and improving the user experience.

The above is only an overview of the technical solutions of the embodiments of the present invention. So that the technical means of the embodiments can be understood more clearly and implemented in accordance with the contents of the specification, and so that the above and other objects, features and advantages of the embodiments are more readily apparent, specific implementations of the embodiments are set out below.

Brief description of the drawings

Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered as limiting the embodiments of the present invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:

Fig. 1 shows a schematic flowchart of the speech recognition method provided by an embodiment of the present invention;

Fig. 2 shows a schematic diagram of the basic structure of the speech recognition apparatus provided by an embodiment of the present invention;

Fig. 3 shows a schematic diagram of the detailed structure of the speech recognition apparatus provided by an embodiment of the present invention;

Fig. 4 shows a schematic structural diagram of the electronic device provided by an embodiment of the present invention.

Detailed description of the embodiments

Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.

Embodiment one

An embodiment of the present invention provides a speech recognition method which, as shown in Fig. 1, comprises: step S110, obtaining collected audio data to be recognized; step S120, determining, by means of a voice activity detection (VAD) algorithm, whether the audio data to be recognized includes a voice signal; and step S130, if a voice signal is included, performing wake-up word recognition based on the audio data.

Specifically, the embodiment may be executed by a smart voice device or by a server; the embodiments of the present invention place no limitation on this. The speech recognition process with the smart voice device as the executing entity is similar to that with the server as the executing entity. Taking the smart voice device as the executing entity by way of example, the speech recognition method of the embodiment is described in detail below:

Step S110: obtain collected audio data to be recognized.

Specifically, when the smart voice device is switched on, it collects audio data from its surroundings in real time through its built-in high-performance audio collection device (such as a microphone or a microphone array). The audio data may be a sound signal containing only a car horn, a sound signal containing only engine roar, a sound signal containing only a user's speech, or a sound signal containing various kinds of noise.

Further, after the audio collection device of the smart voice device collects audio data, it can transmit the collected audio data to the audio processing module of the smart voice device so that the audio processing module can perform recognition processing on it. That is, the audio processing module of the smart voice device obtains the collected audio data to be recognized.

Step S120: determine, by means of a voice activity detection (VAD) algorithm, whether the audio data to be recognized includes a voice signal.

Specifically, after the audio processing module of the smart voice device obtains the audio data to be recognized, it can process that audio data to determine whether it includes a voice signal. A VAD (Voice Activity Detection) algorithm can be used for this determination.

The VAD algorithm can distinguish voice signals from background noise in the audio data, that is, it can determine whether a signal is a voice signal or a noise signal; in communication applications it can also distinguish speech segments from silent segments.

Step S130: if a voice signal is included, perform wake-up word recognition based on the audio data.

Specifically, when it is determined that the audio data to be recognized includes a voice signal, the smart voice device can perform wake-up word recognition based on the audio data through its built-in speech recognition module; that is, the speech recognition module identifies whether the audio data includes the wake-up word. The wake-up word is set in advance. For example, if the wake-up word is preset to "small A", the speech recognition module looks for "small A" in the audio data.

Further, although the wake-up word is set in advance, it can be modified as needed during a later upgrade of the smart voice device, for example from "small A" to "small B", in which case the speech recognition module looks for "small B" in the audio data.
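By way of illustration only, the control flow of steps S120 and S130 can be sketched as follows. The function names, the toy amplitude-based VAD and the stand-in detector are hypothetical and are not taken from the specification; the sketch only shows that the wake-up word detector is never invoked when no frame is judged to be speech.

```python
from typing import Callable, List, Optional, Sequence

Frame = Sequence[float]


def recognize_wake_word(
    frames: Sequence[Frame],
    is_speech_frame: Callable[[Frame], bool],
    detect_wake_word: Callable[[Sequence[Frame]], Optional[str]],
) -> Optional[str]:
    """Run wake-up word recognition only if the VAD finds a voice signal."""
    # Step S120: decide whether the audio data contains a voice signal at all.
    if not any(is_speech_frame(frame) for frame in frames):
        return None  # no voice signal: the wake-up word detector is skipped
    # Step S130: only now run the (more expensive) wake-up word detector.
    return detect_wake_word(frames)


# Hypothetical stand-ins, used only to exercise the control flow.
detector_calls: List[int] = []


def toy_vad(frame: Frame) -> bool:
    # Assumption: a frame is "speech" if its peak amplitude exceeds 0.1.
    return max(abs(x) for x in frame) > 0.1


def toy_detector(frames: Sequence[Frame]) -> Optional[str]:
    detector_calls.append(len(frames))  # record that the detector ran
    return "small A"


quiet_audio = [[0.0, 0.01, -0.01]] * 4
audio_with_speech = quiet_audio + [[0.5, -0.4, 0.3]]

result_quiet = recognize_wake_word(quiet_audio, toy_vad, toy_detector)
result_speech = recognize_wake_word(audio_with_speech, toy_vad, toy_detector)
```

In this sketch the detector runs once, for the input that contains a loud frame, which is the behaviour the power-saving argument relies on.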

Compared with the prior art, in the speech recognition method provided by this embodiment, obtaining the audio data to be recognized that has been collected by the terminal provides the precondition for subsequently determining whether the audio data to be recognized includes a voice signal; determining, by means of the VAD algorithm, whether the audio data to be recognized includes a voice signal provides a preliminary detection and screening of the obtained audio data and lays the foundation for subsequent wake-up word recognition based on the audio data; and performing wake-up word recognition based on the audio data only if a voice signal is included means that wake-up word recognition is run only when the audio data actually includes a voice signal. This avoids having to run wake-up word recognition on every piece of collected audio data regardless of its content, thereby greatly reducing the power consumption of the system and improving the user experience.

Embodiment two

This embodiment of the present invention provides another possible implementation which, on the basis of Embodiment one, further includes the method shown in Embodiment two, wherein:

Step S120 specifically includes step S1201 (not marked in the figure), step S1202 (not marked in the figure) and step S1203 (not marked in the figure), wherein:

Step S1201: divide the audio data to be recognized into frames to obtain multiple audio frames.

Step S1202: traverse the multiple audio frames with the VAD algorithm and detect whether the current audio frame is a speech frame.

Step S1203: if the current audio frame is a speech frame, determine that the audio data to be recognized includes a voice signal.

Step S130 specifically includes step S1301 (not marked in the figure) and step S1302 (not marked in the figure), wherein:

Step S1301: extract the voice signal from the audio data.

Step S1302: perform wake-up word recognition based on the voice signal.

Specifically, audio data is usually transmitted in units of audio frames. Dividing the audio data to be recognized into frames therefore makes it convenient to subsequently detect voice signals frame by frame, so that the detection can be carried out quickly and accurately.
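The framing of step S1201 can be sketched as below. The specification does not state a frame length or whether frames overlap, so the parameters frame_len and hop_len, and the choice to drop a trailing partial frame, are assumptions made for illustration.

```python
from typing import List, Sequence


def split_into_frames(
    samples: Sequence[float], frame_len: int, hop_len: int
) -> List[List[float]]:
    """Split a sample sequence into fixed-length frames.

    frame_len is the number of samples per frame and hop_len the step between
    frame starts (hop_len < frame_len gives overlapping frames). A trailing
    partial frame is dropped in this sketch.
    """
    if frame_len <= 0 or hop_len <= 0:
        raise ValueError("frame_len and hop_len must be positive")
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frames.append(list(samples[start:start + frame_len]))
    return frames


samples = [float(i) for i in range(10)]
frames = split_into_frames(samples, frame_len=4, hop_len=2)
```

With ten samples, a frame length of four and a hop of two, the sketch yields four frames, the last one covering samples 6 to 9.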

Further, when detecting whether each audio frame includes a voice signal, the multiple audio frames can be traversed with the VAD algorithm to determine whether the current audio frame is a speech frame, which in turn indicates whether the audio data to be recognized includes a voice signal.

Further, when it is determined that the audio data to be recognized includes a voice signal, wake-up word recognition is performed based on the audio data, that is, the wake-up word is looked for in the audio data. In doing so, wake-up word recognition can be applied to the voice signal only, and not to non-voice signals. The voice signal can first be extracted from the audio data, and wake-up word recognition then performed based on the extracted voice signal. This not only shortens the time needed for wake-up word recognition, improving the user experience, but also reduces the power the wake-up word detection module consumes in detecting the wake-up word.

Further, when extracting the voice signal from the audio data, the frequency ranges of noise and of speech can first be preset. Frequency response analysis is then performed on the collected audio data to separate a noise signal and a voice signal. Anti-phase suppression is then applied to the noise signal according to the phase of the noise signal, yielding a residual voice signal, and the current voice signal is obtained from the voice signal and the residual voice signal.
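Steps S1301 and S1302 can be sketched in a deliberately simplified form. The sketch below does not implement the frequency response analysis or anti-phase noise suppression described above; as a stand-in, it simply keeps the frames that a VAD flags as speech and passes only those to the wake-up word detector. All names, and the toy VAD and detector, are hypothetical.

```python
from typing import Callable, List, Optional, Sequence

Frame = Sequence[float]


def extract_speech_frames(
    frames: Sequence[Frame], is_speech_frame: Callable[[Frame], bool]
) -> List[Frame]:
    """Step S1301 (simplified): keep only the frames judged to be speech."""
    return [frame for frame in frames if is_speech_frame(frame)]


def recognize_on_speech_only(
    frames: Sequence[Frame],
    is_speech_frame: Callable[[Frame], bool],
    detect_wake_word: Callable[[Sequence[Frame]], Optional[str]],
) -> Optional[str]:
    """Step S1302: run wake-up word recognition on the extracted speech only."""
    speech_frames = extract_speech_frames(frames, is_speech_frame)
    if not speech_frames:
        return None
    return detect_wake_word(speech_frames)


# Hypothetical stand-ins used to show how much data reaches the detector.
frames_seen: List[int] = []


def toy_vad(frame: Frame) -> bool:
    return max(abs(x) for x in frame) > 0.1


def toy_detector(frames: Sequence[Frame]) -> Optional[str]:
    frames_seen.append(len(frames))
    return None  # this toy detector never finds a wake-up word


audio = [[0.0, 0.01]] * 3 + [[0.6, -0.5], [0.4, 0.2]]
recognize_on_speech_only(audio, toy_vad, toy_detector)
```

Here the detector receives two of the five frames, which illustrates why restricting recognition to the voice signal can reduce both latency and power consumption.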

In this embodiment, determining by means of the VAD algorithm whether the audio data to be recognized includes a voice signal provides a preliminary detection and screening of the obtained audio data and lays the foundation for subsequent wake-up word recognition based on the audio data. Extracting the voice signal from the audio data and performing wake-up word recognition based on the voice signal not only shortens the time needed for wake-up word recognition, improving the user experience, but also reduces the power the wake-up word detection module consumes in detecting the wake-up word.

Embodiment three

This embodiment of the present invention provides another possible implementation which, on the basis of Embodiment two, further includes the method shown in Embodiment three.

Specifically, this embodiment mainly concerns several possible ways of determining, by means of the VAD algorithm, whether the current audio frame is a speech frame, such as the signal-to-noise ratio method, the short-time energy method and the spectral energy method. Other methods capable of determining whether an audio frame is a speech frame may of course also be used; the embodiments of the present invention place no limitation on this. Taking the signal-to-noise ratio method, the short-time energy method and the spectral energy method as examples, the determination of whether the current audio frame is a speech frame is described in detail below:

A. Signal-to-noise ratio method

Step S1202 specifically includes step S12021 (not marked in the figure), step S12022 (not marked in the figure) and step S12023 (not marked in the figure), wherein:

Step S12021: calculate the signal-to-noise ratio value of the current audio frame.

Step S12022: detect whether the signal-to-noise ratio value is greater than a first preset threshold.

Step S12023: if it is greater than the first preset threshold, determine that the current audio frame is a speech frame.

Specifically, while the smart voice device traverses the multiple audio frames with the voice activity detection (VAD) algorithm to determine whether the current audio frame is a speech frame, it can calculate the signal-to-noise ratio (that is, the ratio of signal to noise) of the audio frame currently being traversed and detect whether that value is greater than the first preset threshold. If it is, the current audio frame is determined to be a speech frame.
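A minimal sketch of the signal-to-noise ratio test follows. The specification does not say how the noise level is obtained or in what unit the ratio is expressed; here the noise power is assumed to be supplied by the caller (for example, estimated from frames known to be silent), the frame's mean power is treated as the signal power, and the ratio is expressed in decibels. The function names and threshold values are illustrative.

```python
import math
from typing import Sequence


def frame_power(frame: Sequence[float]) -> float:
    """Mean squared amplitude of a frame."""
    if not frame:
        raise ValueError("frame must not be empty")
    return sum(x * x for x in frame) / len(frame)


def snr_db(frame: Sequence[float], noise_power: float) -> float:
    """Step S12021: ratio of frame power to an assumed noise power, in dB."""
    eps = 1e-12  # avoids log(0) and division by zero for all-zero input
    return 10.0 * math.log10((frame_power(frame) + eps) / (noise_power + eps))


def is_speech_by_snr(
    frame: Sequence[float], noise_power: float, threshold_db: float
) -> bool:
    """Steps S12022 and S12023: speech if the SNR exceeds the first threshold."""
    return snr_db(frame, noise_power) > threshold_db


loud_frame = [0.1, -0.1, 0.1, -0.1]      # mean power 0.01
quiet_frame = [0.01, -0.01, 0.01, -0.01]  # mean power 0.0001
noise_power = 0.0001                      # assumed noise floor

loud_snr = snr_db(loud_frame, noise_power)
```

With a noise floor of 0.0001, the loud frame comes out at about 20 dB and the quiet frame at about 0 dB, so a threshold of 10 dB separates them.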

B. Short-time energy method

Step S1202 specifically includes step S12024 (not marked in the figure), step S12025 (not marked in the figure) and step S12026 (not marked in the figure), wherein:

Step S12024: calculate the short-time energy value of the current audio frame.

Step S12025: detect whether the short-time energy value is greater than a second preset threshold.

Step S12026: if it is greater than the second preset threshold, determine that the current audio frame is a speech frame.

Specifically, while the smart voice device traverses the multiple audio frames with the voice activity detection (VAD) algorithm to determine whether the current audio frame is a speech frame, it can calculate the short-time energy value of the audio frame currently being traversed and detect whether that value is greater than the second preset threshold. If it is, the current audio frame is determined to be a speech frame. Short-time energy is usually computed as the sum of the absolute values of the signal amplitudes and/or the sum of their squares.
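The two ways of computing short-time energy mentioned above, the sum of absolute amplitudes and the sum of squared amplitudes, can be sketched as follows. The function names and the threshold value are illustrative; the specification does not give a value for the second preset threshold.

```python
from typing import Sequence


def short_time_energy(frame: Sequence[float], squared: bool = True) -> float:
    """Step S12024: energy of one frame.

    squared=True sums the squared amplitudes; squared=False sums the
    absolute amplitudes.
    """
    if squared:
        return sum(x * x for x in frame)
    return sum(abs(x) for x in frame)


def is_speech_by_energy(frame: Sequence[float], threshold: float) -> bool:
    """Steps S12025 and S12026: speech if the energy exceeds the threshold."""
    return short_time_energy(frame) > threshold


frame = [1.0, -2.0, 0.5]
energy_squared = short_time_energy(frame)                   # 1 + 4 + 0.25
energy_absolute = short_time_energy(frame, squared=False)   # 1 + 2 + 0.5
```

The appropriate threshold depends on the frame length and the sample scale, so in practice it would need to be tuned for the device.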

C. Spectral energy method

Step S1202 specifically includes step S12027 (not marked in the figure), step S12028 (not marked in the figure) and step S12029 (not marked in the figure), wherein:

Step S12027: calculate the spectral energy value of the current audio frame.

Step S12028: detect whether the spectral energy value is greater than a third preset threshold.

Step S12029: if it is greater than the third preset threshold, determine that the current audio frame is a speech frame.

Specifically, while the smart voice device traverses the multiple audio frames with the voice activity detection (VAD) algorithm to determine whether the current audio frame is a speech frame, it can calculate the spectral energy value of the audio frame currently being traversed and detect whether that value is greater than the third preset threshold. If it is, the current audio frame is determined to be a speech frame.
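One possible reading of the spectral energy value is the energy of the frame's discrete Fourier transform restricted to a frequency band where speech is expected. The specification does not define the quantity this precisely, so the band limits (300 Hz to 3400 Hz), the sample rate and the function names below are assumptions. A direct DFT is used to keep the sketch dependency-free; a real implementation would normally use an FFT.

```python
import cmath
import math
from typing import List, Sequence


def band_energy(
    frame: Sequence[float], sample_rate: float, low_hz: float, high_hz: float
) -> float:
    """Step S12027 (one reading): sum of |X[k]|^2 over bins inside the band."""
    n = len(frame)
    energy = 0.0
    for k in range(n // 2 + 1):  # non-negative frequency bins only
        freq = k * sample_rate / n
        if low_hz <= freq <= high_hz:
            coeff = sum(
                x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(frame)
            )
            energy += abs(coeff) ** 2
    return energy


def is_speech_by_spectrum(
    frame: Sequence[float], sample_rate: float, threshold: float
) -> bool:
    """Steps S12028 and S12029, with an assumed 300-3400 Hz speech band."""
    return band_energy(frame, sample_rate, 300.0, 3400.0) > threshold


def tone(freq_hz: float, sample_rate: float, n: int) -> List[float]:
    """Unit-amplitude sine wave, used here only as test input."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


in_band = band_energy(tone(1000.0, 8000.0, 64), 8000.0, 300.0, 3400.0)
out_of_band = band_energy(tone(3750.0, 8000.0, 64), 8000.0, 300.0, 3400.0)
```

A 1000 Hz tone falls inside the assumed band and produces a large energy value, while a 3750 Hz tone falls outside it and contributes essentially nothing.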

It should be noted that, depending on the needs of the practical application, the signal-to-noise ratio method, the short-time energy method and the spectral energy method can be combined with one another for speech frame detection in order to improve detection accuracy.

In this embodiment, the signal-to-noise ratio method, the short-time energy method or the spectral energy method can be used to determine quickly and accurately whether the current audio frame is a speech frame.
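The specification does not say how the three methods are to be combined. One simple possibility, shown here purely as an assumption, is a vote: each method gives a yes/no decision for the frame, and the frame is treated as speech when at least min_votes of them agree.

```python
from typing import Iterable


def is_speech_combined(decisions: Iterable[bool], min_votes: int = 2) -> bool:
    """Combine per-method decisions; speech if at least min_votes are True.

    decisions would hold, for one frame, the outcomes of the signal-to-noise
    ratio, short-time energy and spectral energy tests.
    """
    return sum(1 for decision in decisions if decision) >= min_votes


majority = is_speech_combined([True, True, False])
minority = is_speech_combined([True, False, False])
any_method = is_speech_combined([True, False, False], min_votes=1)
```

Setting min_votes to 1 makes the detector more sensitive (any method suffices), while setting it to 3 makes it stricter (all methods must agree).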

Embodiment four

Fig. 2 is a schematic structural diagram of a speech recognition apparatus provided by an embodiment of the present invention. As shown in Fig. 2, the apparatus 20 may include an obtaining module 21, a determining module 22 and a recognition module 23, wherein:

The obtaining module 21 is configured to obtain collected audio data to be recognized;

the determining module 22 is configured to determine, by means of a voice activity detection (VAD) algorithm, whether the audio data to be recognized includes a voice signal;

the recognition module 23 is configured to perform wake-up word recognition based on the audio data when a voice signal is included.

Specifically, as shown in Fig. 3, the determining module 22 includes a framing submodule 221, a traversal submodule 222 and a determination submodule 223, wherein:

the framing submodule 221 is configured to divide the audio data to be recognized into frames to obtain multiple audio frames;

the traversal submodule 222 is configured to traverse the multiple audio frames with the VAD algorithm and detect whether the current audio frame is a speech frame;

the determination submodule 223 is configured to determine, when the current audio frame is a speech frame, that the audio data to be recognized includes a voice signal.

Further, the traversal submodule 222 may include a first calculation subunit 2221 (not marked in the figure), a first detection subunit 2222 (not marked in the figure) and a first determination subunit 2223 (not marked in the figure), wherein:

the first calculation subunit 2221 is configured to calculate the signal-to-noise ratio value of the current audio frame;

the first detection subunit 2222 is configured to detect whether the signal-to-noise ratio value is greater than the first preset threshold;

the first determination subunit 2223 is configured to determine, when the value is greater than the first preset threshold, that the current audio frame is a speech frame.

Further, the traversal submodule 222 may include a second calculation subunit 2224 (not marked in the figure), a second detection subunit 2225 (not marked in the figure) and a second determination subunit 2226 (not marked in the figure), wherein:

the second calculation subunit 2224 is configured to calculate the short-time energy value of the current audio frame;

the second detection subunit 2225 is configured to detect whether the short-time energy value is greater than the second preset threshold;

the second determination subunit 2226 is configured to determine, when the value is greater than the second preset threshold, that the current audio frame is a speech frame.

Further, the traversal submodule 222 may include a third calculation subunit 2227 (not marked in the figure), a third detection subunit 2228 (not marked in the figure) and a third determination subunit 2229 (not marked in the figure), wherein:

the third calculation subunit 2227 is configured to calculate the spectral energy value of the current audio frame;

the third detection subunit 2228 is configured to detect whether the spectral energy value is greater than the third preset threshold;

the third determination subunit 2229 is configured to determine, when the value is greater than the third preset threshold, that the current audio frame is a speech frame.

Further, as shown in Fig. 3, the recognition module 23 includes an extraction submodule 231 and a wake-up word recognition submodule 232, wherein:

the extraction submodule 231 is configured to extract the voice signal from the audio data;

the wake-up word recognition submodule 232 is configured to perform wake-up word recognition based on the voice signal.

Compared with the prior art, in the speech recognition apparatus provided by this embodiment, obtaining the audio data to be recognized that has been collected by the terminal provides the precondition for subsequently determining whether the audio data to be recognized includes a voice signal; determining, by means of the VAD algorithm, whether the audio data to be recognized includes a voice signal provides a preliminary detection and screening of the obtained audio data and lays the foundation for subsequent wake-up word recognition based on the audio data; and performing wake-up word recognition based on the audio data only if a voice signal is included means that wake-up word recognition is run only when the audio data actually includes a voice signal. This avoids having to run wake-up word recognition on every piece of collected audio data regardless of its content, thereby greatly reducing the power consumption of the system and improving the user experience.

Since the speech recognition apparatus described in this embodiment is an apparatus capable of executing the speech recognition method of the embodiments of the present invention, those skilled in the art can, on the basis of the speech recognition method described in the embodiments, understand the specific implementation of the speech recognition apparatus of this embodiment and its variations. How the apparatus implements the speech recognition method is therefore not described in detail here. Any apparatus used by those skilled in the art to implement the speech recognition method of the embodiments of the present invention falls within the scope of protection of the present invention.

Embodiment five

An embodiment of the present invention provides an electronic device. As shown in Fig. 4, the electronic device 40 includes a processor 41 and a memory 42. The processor 41 is connected to the memory 42, for example via a bus 43. Further, the electronic device 40 may also include a transceiver 44 (not marked in the figure). It should be noted that in practical applications the transceiver 44 is not limited to one, and the structure of the electronic device 40 does not constitute a limitation on the embodiments of the present invention.

The processor 41 is applied in the embodiment of the present invention to implement the functions of the obtaining module, the determining module and the recognition module shown in Fig. 2 or Fig. 3.

The processor 41 may be a CPU, a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various illustrative logic blocks, modules and circuits described in connection with the present disclosure. The processor 41 may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.

The bus 43 may include a path for transferring information between the above components. The bus 43 may be a PCI bus, an EISA bus or the like, and may be divided into an address bus, a data bus, a control bus and so on. For ease of representation, only one thick line is used in Fig. 4, but this does not mean that there is only one bus or one type of bus.

The memory 42 may be a ROM or another type of static storage device capable of storing static information and instructions, or a RAM or another type of dynamic storage device capable of storing information and instructions. It may also be an EEPROM, a CD-ROM or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs and the like), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.

The memory 42 is used to store the application program code for executing the solution of the present invention, and execution is controlled by the processor 41. The processor 41 executes the application program code stored in the memory 42 to implement the actions of the speech recognition apparatus provided by the embodiment shown in Fig. 2 or Fig. 3.

The electronic device provided by the embodiment of the present invention includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, the following can be achieved compared with the prior art: obtaining the audio data to be recognized that has been collected by the terminal provides the precondition for subsequently determining whether the audio data to be recognized includes a voice signal; determining, by means of the VAD algorithm, whether the audio data to be recognized includes a voice signal provides a preliminary detection and screening of the obtained audio data and lays the foundation for subsequent wake-up word recognition based on the audio data; and performing wake-up word recognition based on the audio data only if a voice signal is included means that wake-up word recognition is run only when the audio data actually includes a voice signal. This avoids having to run wake-up word recognition on every piece of collected audio data regardless of its content, thereby greatly reducing the power consumption of the system and improving the user experience.

This embodiment provides a non-transitory computer-readable storage medium storing computer instructions that cause a computer to execute the methods provided by the above method embodiments. Compared with the prior art, obtaining the audio data to be recognized that has been collected by the terminal provides the precondition for subsequently determining whether the audio data to be recognized includes a voice signal; determining, by means of the VAD algorithm, whether the audio data to be recognized includes a voice signal provides a preliminary detection and screening of the obtained audio data and lays the foundation for subsequent wake-up word recognition based on the audio data; and performing wake-up word recognition based on the audio data only if a voice signal is included means that wake-up word recognition is run only when the audio data actually includes a voice signal. This avoids having to run wake-up word recognition on every piece of collected audio data regardless of its content, thereby greatly reducing the power consumption of the system and improving the user experience.

Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage and the like) containing computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and any combination of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or another programmable data processing device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture that includes an instruction apparatus. The instruction apparatus implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing. The instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface, and memory.

The memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

It should also be noted that the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element introduced by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.

The foregoing is merely an embodiment of the present invention and is not intended to limit the present invention. Various modifications and changes may be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present invention shall fall within the scope of the claims of the present invention.

Claims

1. A speech recognition method, characterized by comprising:

obtaining collected audio data to be recognized;

determining, by a voice activity detection (VAD) algorithm, whether the audio data to be recognized contains a speech signal; and

if a speech signal is contained, performing wake-word recognition based on the audio data.

2. The method according to claim 1, characterized in that determining, by the VAD algorithm, whether the audio data to be recognized contains a speech signal comprises:

dividing the audio data to be recognized into frames to obtain a plurality of audio frames;

traversing the plurality of audio frames by the VAD algorithm, and detecting whether a current audio frame is a speech frame; and

if the current audio frame is a speech frame, determining that the audio data to be recognized contains a speech signal.
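
By way of non-limiting illustration only, the framing and traversal recited above can be sketched as follows. The names `split_into_frames` and `contains_speech`, and the default frame length of 400 samples with a hop of 160 samples (25 ms and 10 ms at an assumed 16 kHz sampling rate), are assumptions of this sketch and are not specified by the claims.

```python
from typing import Callable, List, Sequence


def split_into_frames(
    samples: Sequence[float], frame_len: int, hop_len: int
) -> List[List[float]]:
    """Divide the audio into fixed-length frames; a trailing partial frame is dropped."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frames.append(list(samples[start:start + frame_len]))
    return frames


def contains_speech(
    samples: Sequence[float],
    is_speech_frame: Callable[[Sequence[float]], bool],
    frame_len: int = 400,  # assumed: 25 ms at 16 kHz
    hop_len: int = 160,    # assumed: 10 ms at 16 kHz
) -> bool:
    """Traverse the frames and stop at the first frame judged to be speech."""
    frames = split_into_frames(samples, frame_len, hop_len)
    return any(is_speech_frame(frame) for frame in frames)


demo_frames = split_into_frames(list(range(10)), frame_len=4, hop_len=2)
```

Because `any` stops at the first speech frame, the remaining frames are not examined once a speech signal has been found.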

3. The method according to claim 2, characterized in that determining, by the VAD algorithm, whether the current audio frame is a speech frame comprises:

calculating a signal-to-noise ratio (SNR) value of the current audio frame;

detecting whether the SNR value is greater than a first preset threshold; and

if the SNR value is greater than the first preset threshold, determining that the current audio frame is a speech frame.
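
By way of non-limiting illustration only, one possible SNR test is sketched below. The claims do not specify how the SNR value is computed; this sketch assumes that a noise power estimate is available (for example, from a frame known to contain only background noise) and uses the ratio of frame power to noise power in decibels. The function names and the 6 dB default threshold are hypothetical.

```python
import math
from typing import Sequence


def frame_power(frame: Sequence[float]) -> float:
    """Mean squared amplitude of the frame (0.0 for an empty frame)."""
    if not frame:
        return 0.0
    return sum(s * s for s in frame) / len(frame)


def frame_snr_db(frame: Sequence[float], noise_power: float, eps: float = 1e-12) -> float:
    """Assumed estimate: frame power relative to noise power, in dB."""
    return 10.0 * math.log10((frame_power(frame) + eps) / (noise_power + eps))


def is_speech_by_snr(
    frame: Sequence[float], noise_power: float, threshold_db: float = 6.0
) -> bool:
    """Speech frame if the SNR value exceeds the first preset threshold."""
    return frame_snr_db(frame, noise_power) > threshold_db


noise_frame = [0.01] * 160
noise_power = frame_power(noise_frame)
loud_frame = [0.1] * 160
loud_snr = frame_snr_db(loud_frame, noise_power)
```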

4. The method according to claim 2, characterized in that determining, by the VAD algorithm, whether the current audio frame is a speech frame comprises:

calculating a short-time energy value of the current audio frame;

detecting whether the short-time energy value is greater than a second preset threshold; and

if the short-time energy value is greater than the second preset threshold, determining that the current audio frame is a speech frame.
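
By way of non-limiting illustration only, the short-time energy test can be sketched as follows, assuming the common definition of short-time energy as the sum of squared sample values in the frame. The function names are hypothetical, and the threshold is left as a parameter because the claims do not give a value.

```python
from typing import Sequence


def short_time_energy(frame: Sequence[float]) -> float:
    """Sum of squared samples in the frame (assumed definition)."""
    return sum(s * s for s in frame)


def is_speech_by_energy(frame: Sequence[float], threshold: float) -> bool:
    """Speech frame if the energy exceeds the second preset threshold."""
    return short_time_energy(frame) > threshold


example_energy = short_time_energy([1.0, -2.0, 3.0])
```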

5. The method according to claim 2, characterized in that determining, by the VAD algorithm, whether the current audio frame is a speech frame comprises:

calculating a spectral energy value of the current audio frame;

detecting whether the spectral energy value is greater than a third preset threshold; and

if the spectral energy value is greater than the third preset threshold, determining that the current audio frame is a speech frame.
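
By way of non-limiting illustration only, one possible spectral energy test is sketched below. The claims do not define the spectral energy value; this sketch assumes it is the sum of squared DFT magnitudes over a frequency band, with a hypothetical default band of 300 Hz to 3400 Hz. A direct DFT is used to keep the example self-contained; a practical implementation would likely use an FFT.

```python
import cmath
import math
from typing import Sequence


def dft_bin(frame: Sequence[float], k: int) -> complex:
    """Value of the k-th DFT bin of the frame, computed directly."""
    n = len(frame)
    return sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))


def band_energy(
    frame: Sequence[float],
    sample_rate: float,
    low_hz: float = 300.0,    # assumed lower band edge
    high_hz: float = 3400.0,  # assumed upper band edge
) -> float:
    """Sum of squared DFT magnitudes for bins whose frequency lies in the band."""
    n = len(frame)
    energy = 0.0
    for k in range(n // 2 + 1):
        freq = k * sample_rate / n
        if low_hz <= freq <= high_hz:
            energy += abs(dft_bin(frame, k)) ** 2
    return energy


def is_speech_by_spectral_energy(
    frame: Sequence[float], sample_rate: float, threshold: float
) -> bool:
    """Speech frame if the band energy exceeds the third preset threshold."""
    return band_energy(frame, sample_rate) > threshold


# A 1000 Hz tone falls inside the assumed band; a 125 Hz tone does not.
tone_in_band = [math.sin(2 * math.pi * 1000 * i / 8000) for i in range(64)]
tone_out_of_band = [math.sin(2 * math.pi * 125 * i / 8000) for i in range(64)]
```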

6. The method according to any one of claims 1 to 5, characterized in that performing wake-word recognition based on the audio data comprises:

extracting the speech signal from the audio data; and

performing wake-word recognition based on the speech signal.
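
By way of non-limiting illustration only, extracting the speech signal can be sketched as keeping only the frames that a frame-level test judges to be speech. The function name `extract_speech`, the use of non-overlapping frames, and the default frame length are assumptions of this sketch; the claims do not specify how the extraction is performed.

```python
from typing import Callable, List, Sequence


def extract_speech(
    samples: Sequence[float],
    is_speech_frame: Callable[[Sequence[float]], bool],
    frame_len: int = 160,  # assumed: 10 ms at 16 kHz, non-overlapping
) -> List[float]:
    """Concatenate the samples of every frame judged to be speech."""
    speech: List[float] = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        if is_speech_frame(frame):
            speech.extend(frame)
    return speech


def energetic(frame: Sequence[float]) -> bool:
    # Hypothetical frame test: sum of squares above 0.5.
    return sum(s * s for s in frame) > 0.5


audio = [0.0] * 4 + [1.0] * 4 + [0.0] * 4
speech_only = extract_speech(audio, energetic, frame_len=4)
```

The extracted samples, rather than the full recording, would then be passed to the wake-word recognizer.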

7. A speech recognition apparatus, characterized by comprising:

an obtaining module, configured to obtain collected audio data to be recognized;

a determining module, configured to determine, by a voice activity detection (VAD) algorithm, whether the audio data to be recognized contains a speech signal; and

a recognition module, configured to perform wake-word recognition based on the audio data when a speech signal is contained.

8. The apparatus according to claim 7, characterized in that the determining module comprises a framing submodule, a traversal submodule, and a determining submodule;

the framing submodule is configured to divide the audio data to be recognized into frames to obtain a plurality of audio frames;

the traversal submodule is configured to traverse the plurality of audio frames by the VAD algorithm and determine whether a current audio frame is a speech frame; and

the determining submodule is configured to determine, when the current audio frame is a speech frame, that the audio data to be recognized contains a speech signal.

9. An electronic device, characterized by comprising:

at least one processor; and

at least one memory and a bus connected to the processor; wherein

the processor and the memory communicate with each other through the bus; and

the processor is configured to invoke program instructions in the memory to perform the speech recognition method according to any one of claims 1 to 6.

10. A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer instructions, and the computer instructions cause a computer to perform the speech recognition method according to any one of claims 1 to 6.