
CN108831508A - Voice activity detection method, device and equipment

Info

Publication number: CN108831508A
Application number: CN201810605698.8A
Authority: CN
Inventors: 李超; 文铭; 朱唯鑫
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Baidu Online Network Technology Beijing Co Ltd; Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2018-06-13
Filing date: 2018-06-13
Publication date: 2018-11-16

Abstract

Embodiments of the present invention provide a voice activity detection method, device and equipment. The method includes: smoothing an audio signal to be detected; calculating the energy and zero-crossing rate of each frame in the smoothed audio signal; determining, according to the energy and zero-crossing rate of each frame and a pre-trained detection model, the probability that each frame is a speech signal; and determining the noise signal and the speech signal in the audio signal according to that probability. By smoothing an audio signal that contains noise, the method substantially weakens the noise in the audio signal and improves the performance of voice activity detection in noisy environments.

Description

Voice activity detection method, device and equipment

Technical field

Embodiments of the present invention relate to the field of speech signal processing, and in particular to a voice activity detection method, device and equipment.

Background technique

Voice activity detection (VAD), also known as speech endpoint detection or speech boundary detection, detects speech and non-speech in a speech signal in order to identify and eliminate long silent periods in the speech signal stream. It is commonly used in speech processing systems such as speech recognition, speech coding, and speech enhancement, where it serves to reduce the speech coding bit rate, save communication bandwidth, reduce the energy consumption of mobile devices, and improve the recognition rate.

Because speech signals are non-stationary and easily disturbed by noise, noise can seriously affect the accuracy of VAD. The existing VAD method based on the G.729 standard computes the energy of the signal and then sets a threshold to classify each frame of the signal in a simple way. This method, however, cannot achieve satisfactory results in the presence of noise.

As speech processing technology continues to develop, the requirements placed on voice activity detection keep rising. A voice activity detection method that maintains good detection performance in noisy environments is therefore needed.

Summary of the invention

Embodiments of the present invention provide a voice activity detection method, device and equipment to solve the prior-art problem that voice activity detection performs poorly in noisy environments.

In a first aspect, an embodiment of the present invention provides a voice activity detection method, including:

smoothing an audio signal to be detected;

calculating the energy and zero-crossing rate of each frame in the smoothed audio signal;

determining, according to the energy and zero-crossing rate of each frame and a pre-trained detection model, the probability that each frame is a speech signal; and

determining the noise signal and the speech signal in the audio signal according to the probability that each frame is a speech signal.

In one possible implementation, smoothing the audio signal to be detected includes:

calculating one average value for every N sampling points in the audio signal to be detected and using it as the smoothed output value of those N sampling points, where N is a natural number greater than 1.

In one possible implementation, before the energy and zero-crossing rate of each frame in the smoothed audio signal are calculated, the method further includes:

dividing the smoothed audio signal into frames according to a preset frame length and a preset frame shift, where the preset frame length is greater than the preset frame shift.

In one possible implementation, determining the noise signal and the speech signal in the audio signal according to the probability that each frame is a speech signal includes:

if the probability is greater than a preset probability value, determining that the frame is a speech signal;

if the probability is less than or equal to the preset probability value, determining that the frame is a noise signal.

In one possible implementation, before the probability that each frame is a speech signal is determined according to the energy and zero-crossing rate of each frame and the pre-trained detection model, the method further includes:

smoothing and framing the audio signals in a training corpus to generate multiple training samples;

training the detection model using the energy and zero-crossing rate of the training samples as the input features of the detection model, and whether each training sample is a speech signal as the expected output of the detection model.

In one possible implementation, the detection model is trained based on a deep neural network, a logistic regression model, or a support vector machine model.

In a second aspect, an embodiment of the present invention provides a voice activity detection apparatus, including:

a smoothing module, configured to smooth an audio signal to be detected;

a computing module, configured to calculate the energy and zero-crossing rate of each frame in the smoothed audio signal;

a determining module, configured to determine, according to the energy and zero-crossing rate of each frame and a pre-trained detection model, the probability that each frame is a speech signal;

the determining module being further configured to determine the noise signal and the speech signal in the audio signal according to the probability that each frame is a speech signal.

In one possible implementation, the smoothing module is specifically configured to calculate one average value for every N sampling points in the audio signal to be detected and use it as the smoothed output value of those N sampling points, where N is a natural number greater than 1.

In a third aspect, an embodiment of the present invention provides a voice activity detection device, including:

a memory;

a processor; and

a computer program;

where the computer program is stored in the memory and is configured to be executed by the processor to implement the method of any implementation of the first aspect.

In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored, the computer program being executed by a processor to implement the method of any implementation of the first aspect.

The voice activity detection method, device and equipment provided by the embodiments of the present invention smooth the audio signal to be detected, calculate the energy and zero-crossing rate of each frame in the smoothed audio signal, determine the probability that each frame is a speech signal according to the energy and zero-crossing rate of each frame and a pre-trained detection model, and determine the noise signal and the speech signal in the audio signal according to that probability, thereby achieving high-performance voice activity detection in noisy environments. Because speech signals and noise signals have different characteristics, smoothing substantially flattens the amplitude of the noise signal in the audio signal while affecting the amplitude of the speech signal far less. More discriminative acoustic features can therefore be extracted, which improves the performance of voice activity detection in noisy environments.

Brief description of the drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present invention and, together with the description, serve to explain the principles of the invention.

Fig. 1 is a flowchart of one embodiment of the voice activity detection method provided by the present invention;

Fig. 2 is a flowchart of another embodiment of the voice activity detection method provided by the present invention;

Fig. 3 is a flowchart of training the detection model in one embodiment of the voice activity detection method provided by the present invention;

Fig. 4 is a schematic structural diagram of one embodiment of the voice activity detection apparatus provided by the present invention;

Fig. 5 is a schematic structural diagram of one embodiment of the voice activity detection device provided by the present invention.

The above drawings show specific embodiments of the present invention, which are described in more detail below. The drawings and the accompanying text are not intended to limit the scope of the inventive concept in any way, but to explain the concept of the invention to those skilled in the art by reference to specific embodiments.

Detailed description of the embodiments

Exemplary embodiments are described in detail here, and examples of them are shown in the accompanying drawings. Where the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the invention as detailed in the appended claims.

The terms "include" and "have" and any variations of them in the description and claims of the present invention are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that contains a series of steps or units is not limited to the listed steps or units, but may optionally also include steps or units that are not listed, or other steps or units inherent to such a process, method, product, or device.

"First" and "second" in the present invention serve only as labels and should not be understood as indicating or implying an order, relative importance, or the number of the technical features referred to. "Multiple" means two or more. "And/or" describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist at the same time, or B exists alone. The character "/" generally indicates an "or" relationship between the objects before and after it.

"One embodiment" or "an embodiment" mentioned throughout the specification means that a particular feature, structure, or characteristic related to the embodiment is included in at least one embodiment of the present application. Therefore, "in one embodiment" or "in an embodiment" appearing in various places throughout the specification does not necessarily refer to the same embodiment. It should be noted that, where no conflict arises, the embodiments of the present invention and the features in the embodiments may be combined with each other.

Fig. 1 is a flowchart of one embodiment of the voice activity detection method provided by the present invention. As shown in Fig. 1, the voice activity detection method provided by this embodiment may include:

Step S101: smooth the audio signal to be detected.

As artificial intelligence technology continues to develop, a steady stream of intelligent applications based on speech recognition is being released. On mobile phones, for example, applications such as voice search and voice navigation are gradually changing users' habits. A mobile phone usually collects audio through its microphone, and the collection process is inevitably affected by ambient noise, which degrades the processing of the speech signal.

A specific scenario illustrates this. A smartphone provides a driving mode so that the user can control the phone by voice while driving. For example, while driving the user can say "call Zhang San" to have the phone call the contact named Zhang San in the address book, say "answer the call" to have the phone answer an incoming call, or say "search for the nearest parking lot" to have the phone provide navigation to the nearest parking lot. However, the user may well issue no voice commands at all during an entire drive, or only a few; for instance, in a 40-minute drive the voice commands may last only 1 minute in total. Running speech recognition on all 40 minutes of collected audio would place a heavy load on the phone's processor, waste a large amount of processing resources, and drive up the phone's power consumption. Voice activity detection is therefore needed to identify the 1 minute of speech within the 40 minutes of collected audio so that only that 1 minute is recognized, which improves the efficiency of speech recognition and reduces the phone's power consumption. In a vehicle, however, in-vehicle noise is severe. This directly degrades the performance and stability of voice activity detection, makes the detection results inaccurate, and in turn affects subsequent speech recognition and processing.

Based on a detailed analysis of the similarities and differences between noise signals and speech signals, this embodiment exploits the non-stationary nature of speech signals and the stationary nature of in-vehicle noise signals: the collected audio signal to be detected is smoothed so as to weaken the influence of noise on speech signal detection.

Smoothing can substantially flatten the amplitude of the noise signal in the audio signal, while the amplitude of the speech signal is affected far less than that of the noise signal. More discriminative acoustic features can therefore be extracted, giving high-performance voice activity detection in noisy environments.

Step S102: calculate the energy and zero-crossing rate of each frame in the smoothed audio signal.

Optionally, before the calculation, the method may further include: dividing the smoothed audio signal into frames according to a preset frame length and a preset frame shift, where the preset frame length is greater than the preset frame shift. Specifically, the frame length may be 25 ms and the frame shift 10 ms, in which case an audio signal 85 ms long is divided into 7 frames.
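The framing arithmetic above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function name `split_into_frames`, the helper `ms_to_samples`, and the 16 kHz sampling rate are assumptions introduced here (the text gives no sampling rate), and trailing samples that do not fill a whole frame are simply dropped.

```python
SAMPLE_RATE = 16000  # assumption: the text does not specify a sampling rate


def ms_to_samples(ms):
    """Convert a duration in milliseconds to a sample count."""
    return SAMPLE_RATE * ms // 1000


def split_into_frames(samples, frame_len, frame_shift):
    """Split samples into overlapping frames (lengths given in samples)."""
    if frame_shift <= 0 or frame_len <= frame_shift:
        raise ValueError("frame length must exceed a positive frame shift")
    last_start = len(samples) - frame_len
    # Assumption: samples after the last complete frame are discarded.
    return [samples[s:s + frame_len]
            for s in range(0, last_start + 1, frame_shift)]


audio = [0.0] * ms_to_samples(85)  # 85 ms of placeholder audio
frames = split_into_frames(audio, ms_to_samples(25), ms_to_samples(10))
print(len(frames))  # 7, matching (85 - 25) / 10 + 1
```

With a 25 ms frame and a 10 ms shift, frames start at 0, 10, ..., 60 ms, which gives the 7 frames stated in the text.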

In this embodiment, the energy of each frame may be represented by the L2 norm of the frame; that is, the energy of a frame equals the sum of the squares of the values of the sampling points in the frame. The zero-crossing rate of each frame is the number of times the frame crosses zero, that is, the number of times the sign of the signal changes within the frame. Energy and zero-crossing rate are cheap to compute, place little computational load on the device, and therefore suit low-power operation.
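The two features can be sketched in a few lines. The function names are hypothetical, the energy follows the text's definition as a sum of squares (strictly, the squared L2 norm), and the handling of samples that are exactly zero is an assumption, since the text does not address it.

```python
def frame_energy(frame):
    """Energy of a frame: the sum of the squares of its sample values."""
    return sum(x * x for x in frame)


def zero_crossing_count(frame):
    """Number of sign changes between consecutive samples in a frame."""
    # Assumption: a sample equal to zero is treated as non-negative.
    signs = [x >= 0 for x in frame]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)


frame = [0.5, -0.5, 0.25, 0.75, -1.0]  # hypothetical frame
print(frame_energy(frame))         # 2.125
print(zero_crossing_count(frame))  # 3
```

Both functions make a single pass over the frame, which is consistent with the low computational cost the text attributes to these features.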

Step S103: determine the probability that each frame is a speech signal according to the energy and zero-crossing rate of each frame and the pre-trained detection model.

The energy and zero-crossing rate obtained for each frame are fed as input features into the pre-trained detection model, which outputs the probability that each frame is a speech signal. Optionally, the probability that each frame is a noise signal may also be obtained.
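As an illustration of this inference step, the sketch below assumes the detection model is a logistic regression, one of the model families the document lists. The weights, the bias, and the function name `speech_probability` are hypothetical placeholders, not values taken from the patent.

```python
import math


def speech_probability(energy, zero_crossings, weights, bias):
    """Map a frame's two features to P(speech) with a logistic model."""
    z = weights[0] * energy + weights[1] * zero_crossings + bias
    # Numerically stable logistic function for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# Hypothetical parameters standing in for a pre-trained model.
WEIGHTS = (0.8, -0.05)
BIAS = -1.0

p_speech = speech_probability(2.125, 3, WEIGHTS, BIAS)
p_noise = 1.0 - p_speech  # the optional noise probability
print(p_speech, p_noise)
```

A deep neural network or support vector machine would replace the body of `speech_probability` while keeping the same two-feature input.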

Step S104: determine the noise signal and the speech signal in the audio signal according to the probability that each frame is a speech signal.

The noise signal and the speech signal in the audio signal are determined from the obtained probability values. Specifically, if the probability that a frame is a speech signal is greater than a preset probability value, the frame is determined to be a speech signal; if the probability is less than or equal to the preset probability value, the frame is determined to be a noise signal. The preset probability value may, for example, be set to 0.5.

Alternatively, if the probability that a frame is a speech signal is greater than the probability that it is a noise signal, the frame is determined to be a speech signal; if the probability that a frame is a speech signal is less than or equal to the probability that it is a noise signal, the frame is determined to be a noise signal. Normally, the probability that a frame is a speech signal and the probability that it is a noise signal sum to 1.
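The threshold rule can be sketched as below. The function name `label_frames` and the string labels are hypothetical. The default threshold of 0.5 is the example value from the text, and a probability exactly equal to the threshold is labeled as noise, as the text specifies.

```python
def label_frames(speech_probs, threshold=0.5):
    """Label each frame by comparing P(speech) with a preset threshold."""
    return ["speech" if p > threshold else "noise" for p in speech_probs]


probs = [0.9, 0.5, 0.2, 0.51]  # hypothetical per-frame speech probabilities
print(label_frames(probs))  # ['speech', 'noise', 'noise', 'speech']
```

When the two probabilities sum to 1, comparing the speech probability with the noise probability gives the same result as a threshold of 0.5.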

The voice activity detection method provided by this embodiment of the present invention smooths the audio signal to be detected, calculates the energy and zero-crossing rate of each frame in the smoothed audio signal, determines the probability that each frame is a speech signal according to the energy and zero-crossing rate of each frame and a pre-trained detection model, and determines the noise signal and the speech signal in the audio signal according to that probability, thereby achieving high-performance voice activity detection in noisy environments. Because speech signals and noise signals have different characteristics, smoothing substantially flattens the amplitude of the noise signal in the audio signal while affecting the amplitude of the speech signal far less. More discriminative acoustic features can therefore be extracted, which improves the performance of voice activity detection in noisy environments.

The smoothing in the method embodiment shown in Fig. 1 is described in detail below through a specific embodiment. As intelligent terminals become widespread, voice activity detection increasingly runs on terminal devices, so it must not only be stable and reliable but also computationally light. With this in mind, in this embodiment smoothing the audio signal to be detected may include: calculating one average value for every N sampling points in the audio signal to be detected and using it as the smoothed output value of those N sampling points, where N is a natural number greater than 1.

For example, if N is 4, that is, the averaging scale used for smoothing is 4, then an audio signal containing 160 sampling points yields a smoothed output of 40 sampling points, each of which is the average of 4 input sampling points.
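The block-averaging step and the N = 4 example can be sketched as follows. The function name `smooth_by_block_average` and the test signal are hypothetical, and dropping an incomplete final block is an assumption, since the text does not say how a signal whose length is not a multiple of N is handled.

```python
def smooth_by_block_average(samples, n):
    """Replace every n consecutive samples with their average."""
    if n <= 1:
        raise ValueError("n must be a natural number greater than 1")
    # Assumption: an incomplete final block is discarded.
    usable = len(samples) - len(samples) % n
    return [sum(samples[i:i + n]) / n for i in range(0, usable, n)]


signal = [float(i % 8) for i in range(160)]  # hypothetical 160-sample signal
smoothed = smooth_by_block_average(signal, 4)
print(len(smoothed))  # 40
print(smoothed[:2])   # [1.5, 5.5]
```

Because each output sample replaces N input samples, the effective sampling rate of the smoothed signal is also divided by N, which matters when frame lengths in milliseconds are converted to sample counts.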

In the voice activity detection method provided by this embodiment, one average value is calculated for every N sampling points in the audio signal to be detected and used as the smoothed output value of those N sampling points. This smoothing method is cheap to compute. In addition, because multiple sampling points are merged into one during smoothing, the number of sampling points is greatly reduced, which reduces the amount of data processed during voice activity detection. This both improves prediction efficiency and meets the requirement for low power consumption.

Fig. 2 is a flowchart of another embodiment of the voice activity detection method provided by the present invention. As shown in Fig. 2, the voice activity detection method provided by this embodiment may include:

Step S201: calculate one average value for every N sampling points of the audio signal to be detected and use it as the smoothed output value of those N sampling points.

Step S202: divide the smoothed audio signal into frames according to a preset frame length and a preset frame shift, where the preset frame length is greater than the preset frame shift.

Step S203: calculate the energy and zero-crossing rate of each frame in the smoothed audio signal.

Step S204: determine the probability that each frame is a speech signal according to the energy and zero-crossing rate of each frame and the pre-trained detection model.

Step S205: if the probability is greater than a preset probability value, determine that the frame is a speech signal; if the probability is less than or equal to the preset probability value, determine that the frame is a noise signal.

The voice activity detection method provided by this embodiment merges multiple sampling points into one by averaging, which reduces the number of sampling points, the amount of data processed during voice activity detection, and the power consumption. Smoothing substantially flattens the amplitude of the noise signal in the audio signal, which improves the performance of voice activity detection in noisy environments. Using energy and zero-crossing rate, which are cheap to compute, as input features meets the requirement for low power consumption, so the method can run on terminal devices.

Building on the embodiments above, this embodiment describes in detail the training process of the detection model used in those embodiments. Fig. 3 is a flowchart of training the detection model in one embodiment of the voice activity detection method provided by the present invention. As shown in Fig. 3, the training process for the detection model may include:

Step S301: smooth and frame the audio signals in the training corpus to generate multiple training samples.

The smoothing may use the same method as in the detection embodiment above: one average value is calculated for every N sampling points and used as the smoothed output value of those N sampling points. The smoothed audio signal is then divided into frames according to a preset frame length and a preset frame shift, with the preset frame length greater than the preset frame shift, so that adjacent frames partially overlap. In the training stage, this framing method increases the number of training samples. Note that the frame length must be the same in the training stage and the detection stage, whereas the frame shift may differ; for an audio signal of a given duration, a smaller frame shift yields more training samples. For example, the frame length may be 25 ms and the frame shift 5 ms.

The training corpus may be a public audio corpus or may be collected in-house. Each training sample is an audio signal whose duration equals the preset frame length and is labeled as either a speech signal or a noise signal. For example, speech signal samples may be labeled 1, and noise signal samples may be given a different label.

Step S302: calculate the energy and zero-crossing rate of each training sample.

Here, the energy is the L2 norm of the training sample, and the zero-crossing rate is the number of times the sign of the training sample changes.

Step S303: train the detection model using the energy and zero-crossing rate of the training samples as the input features of the detection model, and whether each training sample is a speech signal as the expected output of the detection model.

The detection model in this embodiment may be built on a deep neural network, a logistic regression model, or a support vector machine model. For the training samples obtained, the energy and zero-crossing rate of each training sample serve as the two-dimensional acoustic input feature of the detection model, and whether each training sample is a speech signal or a noise signal serves as the expected output of the detection model. The detection model is trained on this basis.
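A minimal training sketch for the logistic regression option is shown below, using plain batch gradient descent. Everything specific in it is an assumption made for illustration: the toy feature values (taken to be already normalised to a small range), the use of 0 as the noise label, the learning rate, the epoch count, and the function names.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(features, labels, lr=0.5, epochs=2000):
    """Fit weights for (energy, zero-crossing) features by gradient descent."""
    w = [0.0, 0.0]
    b = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0, 0.0]
        grad_b = 0.0
        for (energy, zcr), y in zip(features, labels):
            err = sigmoid(w[0] * energy + w[1] * zcr + b) - y
            grad_w[0] += err * energy
            grad_w[1] += err * zcr
            grad_b += err
        w[0] -= lr * grad_w[0] / n
        w[1] -= lr * grad_w[1] / n
        b -= lr * grad_b / n
    return w, b


# Hypothetical, already normalised (energy, zero-crossing rate) pairs.
features = [(0.9, 0.2), (0.8, 0.3), (0.7, 0.1),
            (0.1, 0.8), (0.2, 0.9), (0.15, 0.7)]
labels = [1, 1, 1, 0, 0, 0]  # 1 = speech; 0 for noise is an assumption

w, b = train_logistic_regression(features, labels)
predictions = [1 if sigmoid(w[0] * e + w[1] * z + b) > 0.5 else 0
               for e, z in features]
print(predictions)
```

In practice the raw energy can span several orders of magnitude, so some form of feature scaling would likely be needed before training; the document does not describe one.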

An embodiment of the present invention also provides a voice activity detection apparatus, shown in Fig. 4. The embodiment is described using Fig. 4 only as an example, which does not mean that the present invention is limited to it. Fig. 4 is a schematic structural diagram of one embodiment of the voice activity detection apparatus provided by the present invention. As shown in Fig. 4, the voice activity detection apparatus 40 provided by this embodiment includes a smoothing module 401, a computing module 402, and a determining module 403.

The smoothing module 401 is configured to smooth the audio signal to be detected.

The computing module 402 is configured to calculate the energy and zero-crossing rate of each frame in the smoothed audio signal.

The determining module 403 is configured to determine the probability that each frame is a speech signal according to the energy and zero-crossing rate of each frame and the pre-trained detection model.

The determining module 403 is further configured to determine the noise signal and the speech signal in the audio signal according to the probability that each frame is a speech signal.

The apparatus provided by this embodiment can be used to carry out the technical solution of the method embodiment shown in Fig. 1. Its implementation principle and technical effect are similar and are not repeated here.

In one possible implementation, the smoothing module 401 is specifically configured to calculate one average value for every N sampling points in the audio signal to be detected and use it as the smoothed output value of those N sampling points, where N is a natural number greater than 1.

In one possible implementation, the voice activity detection apparatus may further include a framing module, configured to divide the smoothed audio signal into frames according to a preset frame length and a preset frame shift before the energy and zero-crossing rate of each frame are calculated, where the preset frame length is greater than the preset frame shift.

In one possible implementation, the determining module 403 may further be specifically configured to determine that a frame is a speech signal if the probability is greater than a preset probability value, and to determine that the frame is a noise signal if the probability is less than or equal to the preset probability value.

An embodiment of the present invention also provides a voice activity detection device, shown in Fig. 5. The embodiment is described using Fig. 5 only as an example, which does not mean that the present invention is limited to it. Fig. 5 is a schematic structural diagram of one embodiment of the voice activity detection device provided by the present invention. The detection device may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, and so on. As shown in Fig. 5, the detection device provided by this embodiment may include one or more of the following components: a processing component 501, a memory 502, an audio component 503, a power supply module 504, a communication component 505, a multimedia component 506, a sensor module 507, and an input/output (I/O) interface 508.

The processing component 501 generally controls the overall operation of the detection device, such as operations associated with display, telephone calls, data communication, camera operation, and recording. The processing component 501 may include one or more processors 5011 that execute instructions to perform all or part of the steps of the methods described above. In addition, the processing component 501 may include one or more modules that facilitate interaction between the processing component 501 and other components. For example, the processing component 501 may include a multimedia module to facilitate interaction between the multimedia component 506 and the processing component 501.

The memory 502 is configured to store various types of data to support operation of the detection device. Examples of such data include instructions for any application or method operated on the detection device, contact data, phone book data, messages, pictures, videos, and so on. The memory 502 may be implemented by any type of volatile or non-volatile storage device or a combination of them, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc. In this embodiment, a computer program is stored in the memory 502, and the computer program can be executed by the processor 5011 to implement the technical solution of any of the voice activity detection method embodiments above.

The power supply module 504 supplies power to the various components of the detection device. The power supply module 504 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the detection device.

The multimedia component 506 includes a screen that provides an output interface between the detection device and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. A touch sensor may sense not only the boundary of a touch or slide action but also the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 506 includes a front camera and/or a rear camera. When the detection device is in an operating mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or may have focusing and optical zoom capability.

The audio component 503 is configured to output and/or input audio signals. For example, the audio component 503 includes a microphone (MIC), which is configured to receive external audio signals when the detection device is in an operating mode such as a call mode, a recording mode, or a speech recognition mode. The received audio signal may be further stored in the memory 502. In this embodiment, the microphone can collect the speech signal with which the user controls the detection device by voice; the processing component 501 then performs voice activity detection on it, followed by a series of subsequent processing steps such as speech recognition. In some embodiments, the audio component 503 further includes a loudspeaker for outputting audio signals. In this embodiment, the loudspeaker can play prompt information to the user.

The I/O interface 508 provides an interface between the processing component 501 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. These buttons may include, but are not limited to, a home button, a volume button, a start button, and a lock button.

The sensor module 507 includes one or more sensors for providing status assessments of various aspects of the detection device. For example, the sensor module 507 can detect the open/closed state of the detection device and the relative positioning of components, such as the display and keypad of the detection device. The sensor module 507 can also detect a change in position of the detection device or of one of its components, the presence or absence of user contact with the detection device, the orientation or acceleration/deceleration of the detection device, and a change in temperature of the detection device. The sensor module 507 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor module 507 may also include an optical sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor module 507 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 505 is configured to facilitate wired or wireless communication between the detection device and other devices. In this embodiment, the communication component 505 is used to implement interaction between the detection device and a cloud server. The detection device can access a wireless network based on a communication standard, such as WiFi, 2G, 3G, or 4G, or a combination thereof. In one exemplary embodiment, the communication component 505 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 505 further includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the detection device may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above method.

In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, for example the memory 502 including instructions, which can be executed by the processor 5011 of the detection device to perform the above method. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.

The voice activity detection apparatus provided in this embodiment of the present invention can be used to execute the technical solution of any of the above method embodiments; its implementation principle and technical effect are similar and are not described again here.

An embodiment of the present invention also provides a computer-readable storage medium on which a computer program is stored, the computer program being executed by a processor to implement the technical solution of any of the above method embodiments.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A voice activity detection method, characterized by comprising:

smoothing an audio signal to be detected;

calculating the energy and zero-crossing rate of each frame signal in the smoothed audio signal;

determining, according to the energy and zero-crossing rate of each frame signal and a pre-trained detection model, the probability that each frame signal is a voice signal; and

determining, according to the probability that each frame signal is a voice signal, the noise signal and the voice signal in the audio signal.

2. The method according to claim 1, characterized in that smoothing the audio signal to be detected comprises:

calculating one average value for every N sampling points in the audio signal to be detected, as the smoothed output value of those N sampling points, where N is a natural number greater than 1.

3. The method according to claim 1, characterized in that, before calculating the energy and zero-crossing rate of each frame signal in the smoothed audio signal, the method further comprises:

framing the smoothed audio signal according to a preset frame length and a preset frame shift, the preset frame length being greater than the preset frame shift.

4. The method according to claim 1, characterized in that determining, according to the probability that each frame signal is a voice signal, the noise signal and the voice signal in the audio signal comprises:

if the probability is greater than a preset probability value, determining that the frame signal is a voice signal;

if the probability is less than or equal to the preset probability value, determining that the frame signal is a noise signal.

5. The method according to claim 1, characterized in that, before determining, according to the energy and zero-crossing rate of each frame signal and the pre-trained detection model, the probability that each frame signal is a voice signal, the method further comprises:

training the detection model using the energy and zero-crossing rate of the plurality of training samples as input features of the detection model, and whether each of the plurality of training samples is a voice signal as the desired output of the detection model.

6. The method according to claim 5, characterized in that the detection model is trained based on a deep neural network, a logistic regression model, or a support vector machine model.

7. A voice activity detection apparatus, characterized by comprising:

a smoothing module, configured to smooth an audio signal to be detected;

a determining module, configured to determine, according to the energy and zero-crossing rate of each frame signal and a pre-trained detection model, the probability that each frame signal is a voice signal;

the determining module being further configured to determine, according to the probability that each frame signal is a voice signal, the noise signal and the voice signal in the audio signal.

8. The apparatus according to claim 7, characterized in that the smoothing module is specifically configured to calculate one average value for every N sampling points in the audio signal to be detected, as the smoothed output value of those N sampling points, where N is a natural number greater than 1.

9. A voice activity detection apparatus, characterized by comprising:

a memory;

a processor; and

a computer program;

wherein the computer program is stored in the memory and is configured to be executed by the processor to implement the method according to any one of claims 1 to 6.

10. A computer-readable storage medium, characterized in that a computer program is stored thereon, the computer program being executed by a processor to implement the method according to any one of claims 1 to 6.
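For illustration only, the steps recited in claims 1 to 4 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the embodiments: the smoothing is read as averaging non-overlapping blocks of N sampling points, energy is taken as the sum of squared samples, and the pre-trained detection model is stood in for by a logistic function whose weights and bias are hypothetical placeholders. All function and parameter names are introduced here for the example.

```python
import math


def smooth(samples, n):
    """Average each non-overlapping block of n samples (assumed reading of claim 2)."""
    if n <= 1:
        raise ValueError("n must be a natural number greater than 1")
    return [sum(samples[i:i + n]) / n
            for i in range(0, len(samples) - n + 1, n)]


def split_frames(samples, frame_len, frame_shift):
    """Split into overlapping frames; the frame length must exceed the frame shift."""
    if frame_shift < 1 or frame_len <= frame_shift:
        raise ValueError("need frame_len > frame_shift >= 1")
    return [samples[s:s + frame_len]
            for s in range(0, len(samples) - frame_len + 1, frame_shift)]


def energy(frame):
    """Short-time energy, taken here as the sum of squared samples."""
    return sum(x * x for x in frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def speech_probability(e, z, weights=(1.0, -1.0), bias=0.0):
    """Stand-in for the pre-trained model: logistic function of (energy, ZCR).

    The weights and bias are hypothetical; a real model would be trained
    on labeled samples as described in claim 5.
    """
    s = weights[0] * e + weights[1] * z + bias
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    t = math.exp(s)  # numerically stable branch for negative scores
    return t / (1.0 + t)


def detect(samples, n, frame_len, frame_shift, threshold=0.5,
           model=speech_probability):
    """Label each frame "speech" if its probability exceeds the threshold, else "noise"."""
    smoothed = smooth(samples, n)
    labels = []
    for frame in split_frames(smoothed, frame_len, frame_shift):
        p = model(energy(frame), zero_crossing_rate(frame))
        labels.append("speech" if p > threshold else "noise")
    return labels
```

A call such as detect(samples, n=2, frame_len=4, frame_shift=2) returns one label per frame. The threshold of 0.5 is only a placeholder for the preset probability value, and any model that maps energy and zero-crossing rate to a probability can be passed in through the model parameter.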