CN113270118B

CN113270118B - Voice activity detection method and device, storage medium and electronic equipment

Info

Publication number: CN113270118B
Application number: CN202110529801.7A
Authority: CN
Inventors: 郝一亚; 阮良; 陈功; 李莹
Original assignee: Hangzhou Netease Zhiqi Technology Co Ltd
Current assignee: Hangzhou Netease Zhiqi Technology Co Ltd
Priority date: 2021-05-14
Filing date: 2021-05-14
Publication date: 2024-02-13
Anticipated expiration: 2041-05-14
Also published as: CN113270118A

Abstract

The embodiment of the invention provides a voice activity detection method and device, a storage medium and electronic equipment. The voice activity detection method comprises the steps of collecting an audio signal and determining a short-time energy histogram of the audio signal; determining a background noise energy value of the audio signal according to the short-time energy histogram; determining an energy threshold value according to the background noise energy value; and determining a first voice activity detection value according to the energy threshold value and the energy value of the current frame audio signal, wherein the first voice activity detection value is used for representing the audio state of the current frame audio. The technical scheme of the embodiment of the invention can improve the accuracy of voice signal recognition in real-time voice communication.

Description

Voice activity detection method and device, storage medium and electronic equipment

Technical Field

Embodiments of the present invention relate to the field of information processing, and more particularly, to a method and apparatus for detecting voice activity, a storage medium, and an electronic device.

Background

This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.

VAD (Voice Activity Detection) is a signal processing technique that identifies long periods of silence within a voice signal stream. A VAD with high accuracy and good robustness can improve the performance of several audio algorithm modules at the same time.

In real-time voice communication, the voice signal is inevitably subject to interference from environmental noise, spatial reverberation, differences between playback and capture devices, and the like, so that the VAD recognizes the voice signal poorly.

Disclosure of Invention

In this context, embodiments of the present invention desirably provide a voice activity detection method and apparatus, an audio processing model training method and apparatus, a storage medium, and an electronic device.

In a first aspect of the embodiments of the present invention, there is provided a voice activity detection method, including:

collecting an audio signal and determining a short-time energy histogram of the audio signal;

determining a background noise energy value of the audio signal according to the short-time energy histogram;

determining an energy threshold value according to the background noise energy value;

and determining a first voice activity detection value according to the energy threshold value and the energy value of the current frame audio signal, wherein the first voice activity detection value is used for representing the audio state of the current frame audio.

In some embodiments of the invention, determining the short-time energy histogram of the audio signal comprises:

dividing the audio signal in a preset time period by using a preset short time period, and determining short-time energy corresponding to each divided preset short time period;

and counting the short-time energy by using a histogram, and obtaining the short-time energy histogram.

In some embodiments of the invention, determining a background noise energy value of the audio signal from the short-time energy histogram comprises:

fitting the envelope of the short-time energy histogram to obtain a short-time energy envelope map;

and determining the minimum short-time energy value among the short-time energy values corresponding to peaks in the short-time energy envelope map as the background noise energy value.

In some embodiments of the present invention, determining the energy threshold based on the background noise energy value includes:

smoothing the background noise energy values of at least two adjacent frames to obtain smoothed background noise energy values;

and determining the energy threshold value according to the smoothed background noise energy value.

In some embodiments of the present invention, determining the energy threshold value according to the smoothed background noise energy value includes:

weighting the smoothed background noise energy value and using the weighted value as the energy threshold value.

In some embodiments of the present invention, determining a first voice activity detection value according to the energy threshold value and an energy value of the current frame audio signal, wherein the first voice activity detection value is used for representing an audio state of the current frame audio comprises:

determining that the current frame audio is voice audio under the condition that the energy value of the current frame audio signal is larger than or equal to the energy threshold value;

and under the condition that the energy value of the current frame audio signal is smaller than the energy threshold value, determining that the current frame audio is non-voice audio.

In some embodiments of the invention, the method further comprises:

determining, according to the first voice activity detection value, the voice probability of the audio signals of the current frame and of a preset number of frames before the current frame;

determining a voice probability threshold value according to the voice scene;

and determining a second voice activity detection value according to the voice probability threshold value and the voice probability, wherein the second voice activity detection value is used for representing the audio state of the current frame of audio.

In some embodiments of the present invention, determining, according to the first voice activity detection value, the voice probability of the audio signals of the current frame and of a preset number of frames before the current frame includes:

determining an average value of the first voice activity detection value of the current frame audio signal and the first voice activity detection values of the audio signals of the preset number of frames, and taking the average value as the voice probability.

In some embodiments of the present invention, determining the speech probability threshold value based on the speech scenario includes:

and determining the voice probability threshold value according to the misjudgment rate or the missed judgment rate of the scene.

In some embodiments of the present invention, determining a second voice activity detection value according to the voice probability threshold and the voice probability, wherein the second voice activity detection value is used to represent an audio state of the current frame audio comprises:

determining that the current frame audio is voice audio under the condition that the voice probability is greater than or equal to the voice probability threshold value;

and under the condition that the voice probability is smaller than the voice probability threshold value, determining that the current frame of audio is non-voice audio.

In some embodiments of the invention, the method further comprises:

and determining whether the current frame of audio is a voice frame according to the first voice activity detection value or the second voice activity detection value.

In a second aspect of the embodiments of the present invention, there is provided a voice activity detection apparatus, comprising:

the histogram determining module is used for collecting the audio signal and determining the short-time energy histogram of the audio signal;

the energy value determining module is used for determining the background noise energy value of the audio signal according to the short-time energy histogram;

the energy threshold value determining module is used for determining an energy threshold value according to the background noise energy value;

and the first detection value determining module is used for determining a first voice activity detection value according to the energy threshold value and the energy value of the current frame audio signal, wherein the first voice activity detection value is used for representing the audio state of the current frame audio.

In some embodiments of the present invention, the histogram determining module is configured to divide the audio signal in a preset time period by using a preset short period, and determine short-time energy corresponding to each of the divided preset short periods; and counting the short-time energy by using a histogram, and obtaining the short-time energy histogram.

In some embodiments of the present invention, the energy value determining module is configured to fit an envelope of the short-time energy histogram to obtain a short-time energy envelope map; and determine the minimum short-time energy value among the short-time energy values corresponding to peaks in the short-time energy envelope map as the background noise energy value.

In some embodiments of the present invention, the energy threshold determining module is configured to smooth the background noise energy values of at least two adjacent frames to obtain a smoothed background noise energy value; and determine the energy threshold value according to the smoothed background noise energy value.

In some embodiments of the present invention, the energy threshold determining module is further configured to weight the smoothed background noise energy value and use the weighted value as the energy threshold value.

In some embodiments of the present invention, the first detection value determining module is configured to determine that the current frame audio is speech audio when an energy value of the current frame audio signal is greater than or equal to the energy threshold value; and under the condition that the energy value of the current frame audio signal is smaller than the energy threshold value, determining that the current frame audio is non-voice audio.

In some embodiments of the invention, the apparatus further comprises:

the voice probability determining module is used for determining, according to the first voice activity detection value, the voice probability of the audio signals of the current frame and of a preset number of frames before the current frame;

the probability threshold value determining module is used for determining a voice probability threshold value according to the voice scene;

and the second detection value determining module is used for determining a second voice activity detection value according to the voice probability threshold value and the voice probability, wherein the second voice activity detection value is used for representing the audio state of the current frame audio.

In some embodiments of the present invention, the voice probability determining module is configured to determine an average value of the first voice activity detection value of the current frame audio signal and the first voice activity detection values of the audio signals of the preset number of frames, and take the average value as the voice probability.

In some embodiments of the present invention, the probability threshold determining module is configured to determine the voice probability threshold value according to the misjudgment rate or the missed judgment rate of the scene.

In some embodiments of the present invention, the second detection value determining module is configured to determine that the current frame audio is speech audio if the speech probability is greater than or equal to the speech probability threshold value; and under the condition that the voice probability is smaller than the voice probability threshold value, determining that the current frame of audio is non-voice audio.

In some embodiments of the invention, the apparatus further comprises:

and the voice frame determining module is used for determining whether the current frame of audio is a voice frame according to the first voice activity detection value or the second voice activity detection value.

In a third aspect of the embodiments of the present invention, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described voice activity detection method.

In a fourth aspect of the embodiments of the present invention, there is provided an electronic device, including:

a processor; and

a memory for storing executable instructions of the processor;

wherein the processor is configured to perform the voice activity detection method described above by executing the executable instructions.

According to the voice activity detection method and device, the storage medium and the electronic equipment of the embodiments of the invention, a short-time energy histogram of the audio signal is obtained and the background noise energy value is determined from that histogram. Because a background noise energy value determined in this way is more stable, the energy threshold value derived from it is more accurate, which improves the accuracy of the audio-state decision and the quality of audio signal recognition.

Drawings

The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. Several embodiments of the present invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:

FIG. 1 schematically illustrates a flow chart of a voice activity detection method according to an exemplary embodiment of the present invention;

FIG. 2 schematically illustrates a waveform diagram of a noise segment audio signal according to an exemplary embodiment of the present invention;

FIG. 3 schematically illustrates a short-time energy histogram after processing the audio signal of FIG. 2 according to an exemplary embodiment of the present invention;

FIG. 4 schematically illustrates a waveform diagram of a speech segment audio signal according to an exemplary embodiment of the present invention;

FIG. 5 schematically illustrates a short-time energy histogram after processing the audio signal of FIG. 4 according to an exemplary embodiment of the present invention;

FIG. 6 schematically illustrates a waveform diagram of a speech and noise mixed segment audio signal according to an exemplary embodiment of the present invention;

FIG. 7 schematically illustrates a short-time energy histogram after processing the audio signal of FIG. 6 according to an exemplary embodiment of the present invention;

FIG. 8 schematically illustrates a short-time energy envelope map obtained by envelope fitting of FIG. 7 according to an exemplary embodiment of the present invention;

FIG. 9 schematically illustrates a waveform diagram of an audio signal in which speech and noise alternate according to an exemplary embodiment of the present invention;

FIG. 10 schematically illustrates the change of the background noise energy value with the number of frames after processing the audio signal shown in FIG. 9;

FIG. 11 schematically illustrates a flowchart of a process for determining a second voice activity detection value according to an exemplary embodiment of the present invention;

FIG. 12 schematically illustrates a waveform diagram of an audio signal according to an exemplary embodiment of the present invention;

FIG. 13 schematically illustrates the change of the background noise energy value with the number of frames after processing the audio signal shown in FIG. 12;

FIG. 14 schematically illustrates the change of the voice frequency with the number of frames after processing the audio signal shown in FIG. 12;

FIG. 15 schematically illustrates the change of the second voice activity detection value with the number of frames after processing the audio signal shown in FIG. 12;

FIG. 16 schematically illustrates a flow chart of steps of a voice activity detection method according to an exemplary embodiment of the present invention;

FIG. 17 schematically illustrates a block diagram I of a voice activity detection apparatus according to an exemplary embodiment of the present invention;

FIG. 18 schematically illustrates a block diagram II of a voice activity detection apparatus according to an exemplary embodiment of the present invention;

FIG. 19 schematically illustrates a second block diagram of a voice activity detection apparatus according to an exemplary embodiment of the present invention;

FIG. 20 schematically illustrates a block diagram of an electronic device according to an exemplary embodiment of the present invention.

In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.

Detailed Description

The principles and spirit of the present invention will be described below with reference to several exemplary embodiments. It should be understood that these embodiments are presented merely to enable those skilled in the art to better understand and practice the invention and are not intended to limit the scope of the invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

Those skilled in the art will appreciate that embodiments of the invention may be implemented as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the following forms: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.

According to an embodiment of the invention, a voice activity detection method and a voice activity detection device are provided.

The number of any element in the figures is illustrative rather than limiting, and any naming is used only for distinction and carries no limiting meaning.

The principles and spirit of the present invention are explained in detail below with reference to several representative embodiments thereof.

Summary of The Invention

The inventor has found that, in real-time communication scenarios, factors such as environmental noise, spatial reverberation, and differences between playback and capture devices can greatly interfere with the voice detection performance of a VAD algorithm.

Based on the above, the invention tracks and estimates the background noise energy value based on the short-time energy histogram to provide an accurate threshold value, thereby improving the accuracy of voice activity detection in real-time communication.

Having described the basic principles of the present invention, various non-limiting embodiments of the invention are described in detail below.

Exemplary method

A voice activity detection method according to an exemplary embodiment of the present invention is described below with reference to fig. 1.

Fig. 1 schematically shows a flow chart of a voice activity detection method according to an exemplary embodiment of the invention. Referring to fig. 1, a voice activity detection method according to an exemplary embodiment of the present invention may include the steps of:

S12, collecting an audio signal and determining a short-time energy histogram of the audio signal.

An audio signal is an information carrier representing the frequency and amplitude variations of regular sound waves such as speech, music and sound effects. The voice signal is one kind of audio signal; the signal collected and transmitted in real-time communication between people is a voice signal.

The audio signal is inevitably subject to interference from various factors. In real-time voice communication in particular, environmental noise, spatial reverberation, differences between playback and capture devices and other factors can affect the result of voice activity detection, and can even cause some voice detection techniques to stop working altogether.

In order to improve the accuracy of detection in the real-time voice communication process, the voice activity detection method provided by the exemplary embodiment of the invention particularly provides a scheme for determining a short-time energy histogram in an audio signal.

The characteristics of an audio signal vary over time as a whole, but within a short time range they remain substantially unchanged and relatively stable; that is, the signal is short-time stationary. Thus, the process of determining the short-time energy histogram of the audio signal in an exemplary embodiment of the present invention may include: dividing the audio signal within a preset time period into frames using a preset short period, determining the short-time energy corresponding to each divided frame, and computing the short-time energy histogram H(k) = Histogram[E(k), E(k-T)], where E(k) and E(k-T) respectively denote the short-time energy of the k-th frame and of the (k-T)-th frame, and Histogram[E(k), E(k-T)] denotes the short-time energy histogram calculated over the range from the (k-T)-th frame to the k-th frame.

In practical applications, the preset short period may be any value from 3 ms to 30 ms; for example, it may be 4 ms. The preset time period refers to all or part of the duration of the collected audio signal; for example, the preset time period is 5 s to 10 s.

Taking a preset short period of 4 ms and a preset time period of 8 s as an example, dividing the 8 s audio signal by the 4 ms preset short period yields 2000 sub-audio signals, each of which represents one frame k. After the multi-frame sub-audio signals are obtained, the short-time energy corresponding to each frame of sub-audio signal can be determined.

The short-time energy reflects the energy of the audio signal, and if the calculated short-time energy of a certain sub-audio signal is higher, it is indicated that the energy of the sub-audio signal is higher.

After obtaining the short-time energies corresponding to the plurality of sub-audio signals, the plurality of short-time energies described above may be counted using the histogram to obtain a short-time energy histogram.
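The framing and short-time energy computation described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `short_time_energies_db`, the 16 kHz sample rate, the use of non-overlapping frames, the mean-square definition of energy, and the dB conversion with a small `eps` are all assumptions made for the example.

```python
import math

def short_time_energies_db(samples, sample_rate, frame_ms=4.0, eps=1e-12):
    """Split `samples` into non-overlapping frames and return each frame's mean energy in dB."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    energies = []
    # Only full frames are used; a trailing partial frame is dropped.
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        mean_square = sum(x * x for x in frame) / frame_len
        # eps keeps log10 defined for an all-zero frame.
        energies.append(10.0 * math.log10(mean_square + eps))
    return energies

# 8 s of a 440 Hz tone sampled at 16 kHz, divided into 4 ms frames.
sample_rate = 16000
signal = [0.1 * math.sin(2 * math.pi * 440 * n / sample_rate)
          for n in range(8 * sample_rate)]
energies = short_time_energies_db(signal, sample_rate)
print(len(energies))  # 2000, matching the 8 s / 4 ms example above
```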

Specifically, in the exemplary embodiment of the present invention, in the process of determining the short-time energy histogram, a value range Emin-Emax of the short-time energy may be determined according to the acquired plurality of short-time energies, and the value range Emin-Emax may be used as an interval of the short-time energy. Where Emin is the minimum of the short-time energy and Emax is the maximum of the short-time energy.

The determination of the short-time energy histogram is explained below with reference to FIG. 2 and FIG. 3. The noise segment audio signal shown in FIG. 2 is divided using the preset short period, and the short-time energy corresponding to each divided preset short period is determined. Next, the minimum value Emin = -55 dB and the maximum value Emax = -25 dB of the short-time energy are determined, as shown on the abscissa of FIG. 3.

The range of the short-time energy is divided uniformly into a plurality of intervals; for example, as shown in FIG. 3, it is divided into 60 intervals. The number of short-time energy values falling in each interval is then counted, and the probability of the short-time energy falling in each interval is calculated from that count and the total number of short-time energy values. This yields the short-time energy histogram shown in FIG. 3, in which the abscissa is the short-time energy and the ordinate is the probability of the short-time energy occurring in each interval. As can be seen from FIG. 3, the short-time energy histogram resembles a Gaussian distribution, with the short-time energy concentrated around -37 dB.
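As a hedged sketch of the histogram step just described, the function below divides the range Emin to Emax into equal intervals and converts the per-interval counts into probabilities. The name `short_time_energy_histogram` and the handling of the degenerate case where all energies are equal are assumptions for illustration; the 60-interval default follows the example of FIG. 3.

```python
def short_time_energy_histogram(energies_db, num_bins=60):
    """Return (bin_centers, probabilities) for a list of short-time energies in dB."""
    e_min, e_max = min(energies_db), max(energies_db)
    width = (e_max - e_min) / num_bins
    if width == 0.0:
        width = 1.0  # all energies equal: use an arbitrary non-zero bin width
    counts = [0] * num_bins
    for energy in energies_db:
        # The maximum value would land in bin `num_bins`; clamp it into the last bin.
        index = min(int((energy - e_min) / width), num_bins - 1)
        counts[index] += 1
    total = len(energies_db)
    probabilities = [count / total for count in counts]
    centers = [e_min + (i + 0.5) * width for i in range(num_bins)]
    return centers, probabilities

# Tiny example with three intervals so the result is easy to check by hand.
centers, probabilities = short_time_energy_histogram(
    [-55.0, -40.0, -40.0, -25.0], num_bins=3)
print(centers)        # [-50.0, -40.0, -30.0]
print(probabilities)  # [0.25, 0.5, 0.25]
```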

Using the above method of determining the short-time energy histogram, the short-time energy histogram of FIG. 5, corresponding to the speech segment audio signal shown in FIG. 4, and the short-time energy histogram of FIG. 7, corresponding to the speech and noise mixed segment audio signal shown in FIG. 6, are plotted.

As can be seen from FIG. 5 and FIG. 7, each of the two short-time energy histograms contains two Gaussian distributions. One of them lies on the horizontal axis close to the position of the Gaussian distribution in the short-time energy histogram of the noise segment in FIG. 3 and has the smaller short-time energy values, so it can be identified as the noise Gaussian distribution. The other Gaussian distribution can be identified as the speech Gaussian distribution.

S14, determining the background noise energy value of the audio signal according to the short-time energy histogram.

In an exemplary embodiment of the present invention, after determining the short-time energy histogram of the audio signal, in order to determine a specific value of the background noise energy value, an envelope of the short-time energy histogram may be fitted to obtain a short-time energy envelope map. For example, envelope fitting is performed on the short-time energy histogram corresponding to the voice and noise mixed section audio signal shown in fig. 7, and a short-time energy envelope map as shown in fig. 8 is obtained.

After the short-time energy envelope map is obtained, the minimum short-time energy value among the short-time energy values corresponding to peaks in the short-time energy envelope map can be determined as the background noise energy value. As can be seen from FIG. 8, the short-time energy envelope map includes two peaks, and the peak on the left corresponds to the smaller short-time energy value; therefore, the short-time energy value corresponding to that peak is determined as the background noise energy value Ed.

In order to filter out interference while losing as little information as possible, exemplary embodiments of the present invention fit the envelope of the short-time energy histogram using rejection sampling. In practical applications, the envelope of the short-time energy histogram can also be fitted by various other methods, and the invention is not particularly limited in this respect.

As can be seen from FIG. 8, the short-time energy envelope map removes the interfering glitches of the original histogram and retains two main peaks, which represent the background noise segment envelope and the speech segment envelope, respectively. The short-time energy value corresponding to the peak of the background noise segment envelope is determined as the background noise energy value Ed.
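The peak selection described above can be illustrated with the sketch below. Note the substitution: the envelope here is obtained with a simple moving average, not with the rejection sampling used in the exemplary embodiment, which the text allows since other fitting methods are not excluded. The names `smooth` and `noise_floor_from_histogram`, the window size, and the `min_rel_height` filter for ignoring minor bumps are assumptions for illustration.

```python
import math

def smooth(values, half_window=2):
    """Moving-average stand-in for envelope fitting (not rejection sampling)."""
    n = len(values)
    out = []
    for i in range(n):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

def noise_floor_from_histogram(centers, probabilities, half_window=2, min_rel_height=0.2):
    """Return the energy of the lowest-energy peak of the smoothed histogram."""
    envelope = smooth(probabilities, half_window)
    top = max(envelope)
    peaks = []
    for i, value in enumerate(envelope):
        left = envelope[i - 1] if i > 0 else float("-inf")
        right = envelope[i + 1] if i + 1 < len(envelope) else float("-inf")
        # A peak is a local maximum that is not negligible compared with the tallest one.
        if value > left and value >= right and value >= min_rel_height * top:
            peaks.append(i)
    if not peaks:
        raise ValueError("no peak found in the envelope")
    return min(centers[i] for i in peaks)

# Synthetic bimodal histogram: a noise mode at -40 dB and a speech mode at -15 dB.
centers = [-55.0 + i for i in range(51)]
raw = [math.exp(-((c + 40.0) / 3.0) ** 2 / 2.0)
       + 0.8 * math.exp(-((c + 15.0) / 4.0) ** 2 / 2.0) for c in centers]
total = sum(raw)
probabilities = [v / total for v in raw]
noise_floor = noise_floor_from_histogram(centers, probabilities)
print(noise_floor)  # -40.0
```

The height filter matters in practice because a raw histogram has many small local maxima; without some form of smoothing and filtering, the lowest-energy "peak" would often be a glitch rather than the noise mode.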

S16, determining an energy threshold value according to the background noise energy value.

As described above, the currently estimated background noise energy value noise_floor(k) can be determined from the short-time energy histogram. This value may differ from the background noise energy value noise_floor(k-1) of the preceding adjacent frame. To reduce abrupt changes in the voice activity detection result, the exemplary embodiment of the present invention smooths the background noise energy values of at least two adjacent frames, as shown in formula (1), to obtain the smoothed background noise energy value:

$$\mathrm{noise\_floor}'(k) = \mu\,\mathrm{noise\_floor}(k) + (1-\mu)\,\mathrm{noise\_floor}(k-1) \tag{1}$$

In formula (1), μ is a smoothing coefficient with a value between 0 and 1. The specific value of μ may be determined according to the actual situation and is not particularly limited herein.
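Formula (1) is a two-point weighted average and can be written directly as below. The function name `smooth_noise_floor` and the example value of μ are assumptions for illustration; the source does not prescribe a particular μ.

```python
def smooth_noise_floor(current, previous, mu=0.5):
    """Formula (1): mu * noise_floor(k) + (1 - mu) * noise_floor(k-1)."""
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie between 0 and 1")
    return mu * current + (1.0 - mu) * previous

# With mu = 0.25 the smoothed value stays close to the previous estimate.
smoothed = smooth_noise_floor(current=-40.0, previous=-44.0, mu=0.25)
print(smoothed)  # -43.0
```

A smaller μ gives more weight to the previous estimate and therefore a steadier threshold, at the cost of reacting more slowly when the noise level really changes.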

Referring to fig. 9, an audio signal in which voice and noise are switched with each other is exemplarily shown. The audio signal of fig. 9 is processed in accordance with the above-described method of the present invention, and the corresponding background noise energy values (ordinate in fig. 10) obtained are maintained substantially in a steady state. Therefore, the voice activity detection method provided by the exemplary embodiment of the invention can obtain a stable background noise energy value, thereby laying a foundation for the determination of a stable voice activity detection result.

In the exemplary embodiment of the invention, after the background noise energy value is determined, it can be weighted according to how strict the detection decision needs to be, and the weighted value is used as the energy threshold value. For example, as shown in formula (2), the smoothed background noise energy value is weighted by a coefficient α to obtain the energy threshold value β_E:

$$\beta_E = \alpha \cdot \mathrm{noise\_floor}'(k) \tag{2}$$

S18, determining a first voice activity detection value according to the energy threshold value and the energy value of the current frame audio signal, wherein the first voice activity detection value is used for representing the audio state of the current frame audio.

After the energy threshold value β_E is determined, the first voice activity detection value VAD(k) can be obtained, as shown in formula (3):

$$\mathrm{VAD}(k) = \begin{cases} 1, & E(k) \ge \beta_E \\ 0, & E(k) < \beta_E \end{cases} \tag{3}$$

where VAD(k) represents the first voice activity detection value of the k-th frame, 0 represents non-voice audio, and 1 represents voice audio. E(k) represents the average energy value of the k-th frame audio signal, and β_E represents the energy threshold value described above.

In an exemplary embodiment of the present invention, the first voice activity detection value VAD(k) represents the audio state of the current frame audio. If the energy value E(k) of the current frame audio signal is smaller than the energy threshold value β_E, then VAD(k) = 0, indicating that the current frame audio is non-voice audio; if the energy value E(k) of the current frame audio signal is greater than or equal to the energy threshold value β_E, then VAD(k) = 1, indicating that the current frame audio is voice audio.
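The weighting of formula (2) and the frame-level decision described above can be sketched together as follows. The names `energy_threshold` and `first_vad`, the value α = 4, and the choice of linear-domain energies are assumptions for illustration; with negative dB values a multiplicative weight greater than 1 would lower the threshold, so this sketch works in the linear domain.

```python
def energy_threshold(smoothed_noise_floor, alpha=4.0):
    """Formula (2): weight the smoothed background noise energy (linear domain assumed)."""
    return alpha * smoothed_noise_floor

def first_vad(frame_energy, threshold):
    """Return 1 (voice) when the frame energy reaches the threshold, else 0 (non-voice)."""
    return 1 if frame_energy >= threshold else 0

threshold = energy_threshold(1e-4, alpha=4.0)
frame_energies = [5e-5, 2e-4, 3e-3, 8e-4]
decisions = [first_vad(e, threshold) for e in frame_energies]
print(decisions)  # [0, 0, 1, 1]
```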

In practical applications, if the first voice activity detection value is determined solely from the energy of the current frame of audio, two problems arise: first, noise can cause misjudgment, i.e., non-voice audio is judged as voice audio; second, the lower-energy parts of a voice segment can be missed, i.e., voice audio is judged as non-voice audio.

In order to solve the above two problems, the voice activity detection method according to the exemplary embodiment of the present invention further provides a second voice activity detection value based on the first voice activity detection value.

Referring to fig. 11, the determining of the second voice activity detection value may specifically include the following steps:

S112, determining, according to the first voice activity detection value, the voice probability of the audio signals of the current frame and of a preset number of frames before the current frame.

In an exemplary embodiment of the present invention, the first voice activity detection values of the current frame and of the preset number of frames before it are smoothed; that is, the average of the first voice activity detection values corresponding to the current frame audio signal and to the audio signals of the preset number of preceding frames is computed, and this average is taken as the voice probability. The obtained voice probability P(k) is as shown in formula (4):

where k refers to the current frame and L refers to the preset number of frames.
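Formula (4) itself is not reproduced in this text. One form consistent with the description above, offered only as a reading and not as the original formula, averages the first voice activity detection values of the current frame and the L frames before it. The normalization by L + 1 is an assumption; the original may normalize differently.

```latex
P(k) = \frac{1}{L+1} \sum_{i=k-L}^{k} \mathrm{VAD}(i)
```

Because each VAD(i) is 0 or 1, P(k) under this reading lies between 0 and 1 and equals the fraction of frames in the window that were judged to be speech.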

In practical application, the preset frame number L may be chosen according to the actual situation; for example, L may correspond to a duration between 80 ms and 120 ms, and specifically to 100 ms.
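A minimal sketch of the averaging step is given below. It assumes that the window contains the current frame plus up to L preceding frames, that early frames are averaged over whatever history is available, and that frames are 10 ms long so that a 100 ms look-back gives L = 10; none of these details is stated in the description, and the function name is hypothetical.

```python
from collections import deque


def speech_probabilities(vad_values, num_prev_frames):
    """Average each first VAD value with up to num_prev_frames preceding values."""
    window = deque(maxlen=num_prev_frames + 1)  # current frame plus L previous frames
    probabilities = []
    for value in vad_values:
        window.append(value)
        # Early frames have fewer than L predecessors; average over what is available.
        probabilities.append(sum(window) / len(window))
    return probabilities


# Assumption: 10 ms frames, so a 100 ms look-back corresponds to L = 10 frames.
frame_ms = 10
lookback_ms = 100
L = lookback_ms // frame_ms

# Small illustration with L = 2 so the window is easy to follow by hand.
probs = speech_probabilities([0, 1, 1, 1, 0, 1], num_prev_frames=2)
```

The `deque` with `maxlen` drops the oldest value automatically, which keeps the per-frame cost proportional to the window length and suits frame-by-frame processing.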

S114, determining a voice probability threshold value according to the voice scene.

In an exemplary embodiment of the present invention, after the speech probability P(k) is determined, the speech probability threshold value β_P is also determined according to the speech scenario, for example according to the misjudgment rate or the missed-judgment rate of the speech scenario.

In practical application, for a scene with a strict requirement on the misjudgment rate, i.e., a scene in which the probability of non-voice audio being judged as voice audio must be small, the voice probability threshold value β_P needs to be raised, for example set to a value close to 1, between 0.85 and 0.99.

In practical application, for a scene with a strict requirement on the missed-judgment rate, i.e., a scene in which the probability of voice audio being judged as non-voice audio must be small, the voice probability threshold value β_P needs to be lowered, for example set to any value between 0.2 and 0.5.

For other scenes without a strict requirement on the misjudgment rate or the missed-judgment rate, the voice probability threshold value β_P can be determined according to actual needs, e.g., β_P = 0.6 or another value; the exemplary embodiment of the invention places no particular limitation on the specific voice probability threshold value β_P.
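The scene-dependent choice of threshold can be illustrated with a simple lookup. The scene labels and the specific values 0.95 and 0.3 are hypothetical picks from within the ranges given above; only the ranges and the example value 0.6 come from the description.

```python
# Hypothetical scene labels; the ranges noted in the comments come from the description.
SCENE_THRESHOLDS = {
    "strict_misjudgment": 0.95,     # within 0.85-0.99: fewer non-speech frames judged as speech
    "strict_missed_judgment": 0.3,  # within 0.2-0.5: fewer speech frames judged as non-speech
    "default": 0.6,                 # example value for scenes without a strict requirement
}


def speech_probability_threshold(scene: str) -> float:
    """Look up the speech probability threshold for a scene, falling back to the default."""
    return SCENE_THRESHOLDS.get(scene, SCENE_THRESHOLDS["default"])
```

A higher threshold demands that more frames in the window were already judged as speech, which suppresses isolated noise bursts; a lower threshold lets a few speech frames carry the decision, which preserves low-energy speech.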

S116, determining a second voice activity detection value according to the voice probability threshold value and the voice probability, wherein the second voice activity detection value is used for representing the audio state of the current frame of audio.

After determining the speech probability and the speech probability threshold value, a second speech activity detection value VAD' (k) representing the audio state of the current frame may be determined from the two values, as shown in equation (5):

As can be seen from equation (5), when the speech probability P(k) is smaller than the speech probability threshold value β_P, the second voice activity detection value VAD'(k) takes 0, indicating that the current frame audio is non-voice audio; when P(k) is greater than or equal to β_P, VAD'(k) takes 1, indicating that the current frame audio is voice audio.
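Equation (5) is not reproduced in this text. From the two cases just described, it can be written in the following piecewise form, which is a reconstruction from the prose rather than a copy of the original:

```latex
\mathrm{VAD}'(k) =
\begin{cases}
0, & P(k) < \beta_P \\
1, & P(k) \ge \beta_P
\end{cases}
```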

Referring to figs. 12-15, waveforms are shown for determining the audio state of an audio signal based on the second voice activity detection value. After the audio signal shown in fig. 12 is processed by the voice activity detection method provided by the present invention, the variation of the background noise energy value with the frame number shown in fig. 13 is obtained; as can be seen from fig. 13, the determined background noise energy value remains essentially stable. Fig. 14 shows the variation of the speech probability P(k) with the frame number, obtained after the audio signal is further processed. After the second voice activity detection value is determined from the speech probability P(k) shown in fig. 14, the variation of the second voice activity detection value VAD'(k) with the frame number shown in fig. 15 is obtained; as can be seen from fig. 15, frames whose VAD'(k) is 1 are voice audio, and frames whose VAD'(k) is 0 are non-voice audio.

Referring to fig. 16, a flowchart of the steps of a voice activity detection method according to an exemplary embodiment of the present invention is shown. As shown in fig. 16, step S1601 is first performed to collect an audio signal; next, in step S1602, a short-time energy histogram of the audio signal is determined; in step S1603, the envelope of the short-time energy histogram is fitted to obtain a short-time energy envelope map; in step S1604, the background noise energy value is determined according to the short-time energy envelope map; in step S1605, the background noise energy value is smoothed to obtain a smoothed background noise energy value; and in step S1606, an energy threshold value is determined from the smoothed background noise energy value.

After the energy threshold value is determined, step S1607 may be entered to determine a first voice activity detection value according to the energy threshold value and the energy value of the current frame audio signal obtained from the collected audio signal; after the first voice activity detection value is obtained, step S1608 may be entered to determine, according to the first voice activity detection value, the voice probability of the audio signal of the current frame and of the preset number of frames before the current frame; step S1609 is then entered to determine a voice probability threshold value according to the voice scene; and step S1610 is then executed to determine a second voice activity detection value according to the voice probability threshold value and the voice probability. According to the determined first voice activity detection value and second voice activity detection value, the audio state of the current frame audio can be judged, so as to determine whether the current frame audio is a voice frame.
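Steps S1607 to S1610 can be chained as in the sketch below. It takes the energy threshold and the speech probability threshold as given inputs, assumes a window of the current frame plus up to L preceding frames, and uses hypothetical energies; it is an illustration under those assumptions rather than the patented implementation.

```python
def detect_frames(energies, energy_threshold, num_prev_frames, prob_threshold):
    """Return (first, probabilities, second) lists for a sequence of frame energies."""
    # S1607: first detection value from the per-frame energy comparison.
    first = [1 if e >= energy_threshold else 0 for e in energies]
    # S1608: speech probability as the mean of the first values in the window.
    probabilities = []
    for k in range(len(first)):
        start = max(0, k - num_prev_frames)
        window = first[start:k + 1]
        probabilities.append(sum(window) / len(window))
    # S1610: second detection value from the probability comparison
    # (prob_threshold stands in for the scene-dependent choice of S1609).
    second = [1 if p >= prob_threshold else 0 for p in probabilities]
    return first, probabilities, second


# Hypothetical energies: one isolated loud frame, then a speech burst with a one-frame dip.
energies = [0.1, 2.0, 0.1, 0.1, 0.1, 2.0, 2.0, 0.5, 2.0, 2.0]
first, probs, second = detect_frames(
    energies, energy_threshold=1.0, num_prev_frames=2, prob_threshold=0.6
)
print(first)   # [0, 1, 0, 0, 0, 1, 1, 0, 1, 1]
print(second)  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
```

In this toy input the isolated loud frame at index 1 is no longer reported as speech, and the low-energy frame at index 7 inside the burst is kept as speech, which are the two problems the second detection value is meant to address. The price in this sketch is a one-frame delay at the onset of the burst.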

According to the technical scheme of the exemplary embodiment of the present invention, on the one hand, a short-time energy histogram of the audio signal is obtained and the background noise energy value is determined based on that histogram; because the background noise energy value determined in this way is more stable, the accuracy of the energy threshold value determined from it can be improved, which in turn improves the accuracy of audio state judgment and the effect of identifying the audio signal. On the other hand, by taking the average of the first voice activity detection values of the current frame and of the preset number of frames before the current frame as the voice probability, and combining it with the voice probability threshold value determined from the misjudgment rate or missed-judgment rate of the voice scene, another voice activity detection value for audio state judgment can be determined, so that audio state judgment can be performed in scenes with strict requirements on missed judgment or misjudgment, further improving the accuracy of audio state judgment and the effect of identifying the audio signal. In yet another aspect, because the accuracy of audio state judgment in the above scheme is high, the voice activity detection method provided by the exemplary embodiment of the invention can be used for audio state judgment in complex scenes such as real-time audio signals.

Exemplary apparatus

Having described the voice activity detection method of the exemplary embodiment of the present invention, next, a voice activity detection apparatus of the exemplary embodiment of the present invention will be described with reference to fig. 17. The device embodiment part can inherit the related description in the method embodiment, so that the device embodiment can be supported by the related detailed description of the method embodiment.

Referring to fig. 17, the voice activity detection apparatus 17 according to an exemplary embodiment of the present invention may include: a histogram determination module 171, an energy value determination module 173, an energy threshold determination module 175, and a first detection value determination module 177.

Specifically, the histogram determination module 171 may be configured to collect an audio signal and determine a short-time energy histogram of the audio signal; the energy value determining module 173 may be configured to determine a background noise energy value of the audio signal according to the short-time energy histogram; the energy threshold determining module 175 may be configured to determine an energy threshold according to the background noise energy value; the first detection value determining module 177 may be configured to determine a first voice activity detection value according to the energy threshold value and an energy value of the audio signal of the current frame, where the first voice activity detection value is used to represent an audio state of the audio signal of the current frame.

In some embodiments of the present invention, the histogram determining module 171 may be configured to divide the audio signal within the preset time period by using the preset short period, and determine short-time energy corresponding to each of the divided preset short periods; and counting short-time energy by using the histogram, and obtaining a short-time energy histogram.
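The two operations of the histogram determination module can be sketched as follows. The description does not define short-time energy in this passage, so mean squared amplitude over non-overlapping frames is an assumption, as are the equal-width bins and all function and parameter names.

```python
def short_time_energies(samples, frame_len):
    """Split samples into non-overlapping frames and return each frame's mean squared amplitude."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def energy_histogram(energies, num_bins, lo, hi):
    """Count energies into num_bins equal-width bins over [lo, hi]."""
    counts = [0] * num_bins
    width = (hi - lo) / num_bins
    for e in energies:
        index = int((e - lo) / width)
        index = max(0, min(index, num_bins - 1))  # clamp out-of-range values into the end bins
        counts[index] += 1
    return counts


# Hypothetical samples: three frames of two samples each.
energies = short_time_energies([1, 1, 2, 2, 0, 0], frame_len=2)
counts = energy_histogram(energies, num_bins=4, lo=0.0, hi=4.0)
```

In practice the energies are often expressed in decibels before binning so that quiet and loud frames are spread more evenly across bins; that choice is not specified here.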

In some embodiments of the present invention, the energy value determining module 173 may be configured to fit the envelope of the short-time energy histogram to obtain a short-time energy envelope map; and to determine, as the background noise energy value, the minimum short-time energy value among the short-time energy values corresponding to the peaks in the short-time energy envelope map.
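The peak selection can be sketched as below. The envelope fitting method is not specified in this passage, so a zero-padded moving average is used purely as a stand-in; the function names, the `radius` parameter, and the sample counts are hypothetical.

```python
def smooth(counts, radius=1):
    """Zero-padded moving average, used here as a simple stand-in for envelope fitting."""
    smoothed = []
    for i in range(len(counts)):
        lo = max(0, i - radius)
        hi = min(len(counts), i + radius + 1)
        smoothed.append(sum(counts[lo:hi]) / (2 * radius + 1))
    return smoothed


def noise_floor_energy(counts, bin_centers, radius=1):
    """Return the energy at the lowest-energy local peak of the smoothed histogram."""
    envelope = smooth(counts, radius)
    for i in range(len(envelope)):
        left = envelope[i - 1] if i > 0 else float("-inf")
        right = envelope[i + 1] if i < len(envelope) - 1 else float("-inf")
        if envelope[i] > 0 and envelope[i] > left and envelope[i] >= right:
            # Bins are in ascending energy order, so the first peak has the minimum energy.
            return bin_centers[i]
    return None


# Hypothetical histogram with a low-energy (noise) peak and a high-energy (speech) peak.
counts = [0, 2, 8, 3, 0, 1, 5, 9, 4, 0]
bin_centers = [i + 0.5 for i in range(len(counts))]
floor = noise_floor_energy(counts, bin_centers)
```

Here the histogram has peaks near bins 2 and 7, and the lower-energy one is returned, reflecting the idea that the lowest-energy peak corresponds to background noise and higher-energy peaks to speech.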

In some embodiments of the present invention, the energy threshold determining module 175 may be configured to perform smoothing on the background noise energy values of at least two adjacent frames to obtain a smoothed background noise energy value; and determining an energy threshold value according to the smoothed background noise energy value.

In some embodiments of the present invention, the energy threshold determination module 175 may also be configured to apply a weight to the smoothed background noise energy value and to use the weighted value as the energy threshold.
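The smoothing and weighting performed by the energy threshold determination module might look like the sketch below. The description only says that values from at least two adjacent frames are smoothed and that the result is weighted; the recursive average, the smoothing factor `alpha`, and the multiplicative `weight` are assumptions chosen for illustration.

```python
def smooth_noise_floor(previous_smoothed, current, alpha=0.9):
    """Recursive average over adjacent frames; alpha is a hypothetical smoothing factor."""
    if previous_smoothed is None:
        return current  # first frame: nothing to smooth against yet
    return alpha * previous_smoothed + (1.0 - alpha) * current


def energy_threshold(smoothed_floor, weight=2.0):
    """Scale the smoothed noise floor by a hypothetical weight to obtain the threshold."""
    return weight * smoothed_floor


# Hypothetical noise floor estimates for two adjacent frames.
smoothed = smooth_noise_floor(None, 1.0)
smoothed = smooth_noise_floor(smoothed, 2.0, alpha=0.5)
threshold = energy_threshold(smoothed, weight=2.0)
```

A weight above 1 places the threshold some margin above the estimated noise floor, so that frames at noise level fall below it.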

In some embodiments of the present invention, the first detection value determining module 177 may be configured to determine that the current frame audio is speech audio if the energy value of the current frame audio signal is greater than or equal to the energy threshold value; and under the condition that the energy value of the current frame audio signal is smaller than the energy threshold value, determining that the current frame audio is non-voice audio.

In some embodiments of the present invention, referring to fig. 18, the voice activity detection apparatus 17 according to an exemplary embodiment of the present invention may further include: a speech probability determination module 181, a probability threshold determination module 183, and a second detection value determination module 185.

Specifically, the voice probability determining module 181 may be configured to determine, according to the first voice activity detection value, a voice probability of the current frame and an audio signal of a preset frame number before the current frame; the probability threshold value determining module 183 may be configured to determine a speech probability threshold value according to a speech scenario; the second detection value determining module 185 may be configured to determine a second voice activity detection value according to the voice probability threshold and the voice probability, where the second voice activity detection value is used to represent an audio state of the current frame of audio.

In some embodiments of the present invention, the voice probability determining module 181 may be configured to determine an average value of the first voice activity detection value of the current frame audio signal and the first voice activity detection value of the preset frame audio signal, and take the average value as the voice probability.

In some embodiments of the present invention, the probability threshold determining module 183 may be configured to determine the speech probability threshold value according to the misjudgment rate or the missed-judgment rate of the voice scene.

In some embodiments of the present invention, the second detection value determining module 185 may be configured to determine that the current frame audio is speech audio if the speech probability is greater than or equal to the speech probability threshold value; and under the condition that the voice probability is smaller than the voice probability threshold value, determining that the current frame of audio is non-voice audio.

In some embodiments of the present invention, referring to fig. 19, the voice activity detection apparatus 17 of the exemplary embodiment of the present invention may further include: the voice frame determining module 191 is configured to determine whether the current frame of audio is a voice frame according to the first voice activity detection value or the second voice activity detection value.

Since each functional module of the voice activity detection apparatus according to the embodiment of the present invention corresponds to a step of the above-mentioned method embodiment, the details are not repeated here.

According to the voice activity detection device of the exemplary embodiment of the present invention, on the one hand, the short-time energy histogram of the audio signal can be obtained through the histogram determination module, and the energy value determination module determines the background noise energy value based on the short-time energy histogram; because the background noise energy value determined in this way is more stable, the accuracy of the energy threshold value determined from it can be improved, which in turn improves the accuracy of audio state judgment and the effect of identifying the audio signal. On the other hand, through the first detection value determining module and the second detection value determining module, the average of the first voice activity detection values of the current frame and of the preset number of frames before the current frame can be taken as the voice probability and combined with the voice probability threshold value determined from the misjudgment rate or missed-judgment rate of the voice scene, so that another voice activity detection value for audio state judgment can be determined; audio state judgment can thus be performed in scenes with strict requirements on missed judgment or misjudgment, further improving the accuracy of audio state judgment and the effect of identifying the audio signal. In yet another aspect, because the accuracy of audio state judgment in the above scheme is high, the voice activity detection device provided by the exemplary embodiment of the invention can be used for audio state judgment in complex scenes such as real-time audio signals.

Exemplary device

Having introduced the voice activity detection method and apparatus of the exemplary embodiments of the present invention, the electronic device of the exemplary embodiments of the present invention will be described next. The electronic device of the exemplary embodiment of the invention comprises the voice activity detection device.

Those skilled in the art will appreciate that the various aspects of the invention may be implemented as a system, method, or program product. Accordingly, aspects of the invention may be embodied in the following forms: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be referred to collectively herein as a "circuit," "module," or "system."

In some possible embodiments, an electronic device according to the invention may comprise at least one processing unit and at least one memory unit. Wherein the storage unit stores program code which, when executed by the processing unit, causes the processing unit to perform the steps according to various exemplary embodiments of the invention described in the "method" section of the specification above.

An electronic device 2000 according to this embodiment of the present invention is described below with reference to fig. 20. The electronic device 2000 illustrated in fig. 20 is merely an example, and should not be construed to limit the functionality and scope of use of embodiments of the present invention in any way.

As shown in fig. 20, the electronic device 2000 is embodied in the form of a general purpose computing device. Components of the electronic device 2000 may include, but are not limited to: the at least one processing unit 2010, the at least one storage unit 2020, a bus 2030 connecting the different system components (including the storage unit 2020 and the processing unit 2010), and a display unit 2040.

Wherein the storage unit stores program code that is executable by the processing unit 2010 such that the processing unit 2010 performs steps according to various exemplary embodiments of the present invention described in the "exemplary methods" section of the present specification. For example, the processing unit 2010 may perform step S12 as shown in fig. 1 and 11: collecting an audio signal and determining a short-time energy histogram of the audio signal; step S14: determining a background noise energy value of the audio signal according to the short-time energy histogram; step S16: determining an energy threshold value according to the bottom noise energy value; step S18: determining a first voice activity detection value according to the energy threshold value and the energy value of the current frame audio signal, wherein the first voice activity detection value is used for representing the audio state of the current frame audio; step S112: determining the voice probability of the current frame and the audio signal of the preset frame number before the current frame according to the first voice activity detection value; step S114: determining a voice probability threshold value according to the voice scene; step S116: and determining a second voice activity detection value according to the voice probability threshold value and the voice probability, wherein the second voice activity detection value is used for representing the audio state of the current frame of audio.

The storage unit 2020 may include readable media in the form of volatile storage units such as random access memory unit (RAM) 20201 and/or cache memory unit 20202, and may further include read only memory unit (ROM) 20203.

The storage unit 2020 may also include a program/utility 20204 having a set (at least one) of program modules 20205, such program modules 20205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.

The bus 2030 may be one or more of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, a graphics accelerator port, a processing unit, or a local bus using any of a variety of bus architectures.

The electronic device 2000 may also communicate with one or more external devices 2070 (e.g., keyboard, pointing device, bluetooth device, etc.), one or more devices that enable a user to interact with the electronic device 2000, and/or any device (e.g., router, modem, etc.) that enables the electronic device 2000 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 2050. Also, the electronic device 2000 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, through a network adapter 2060. As shown, the network adapter 2060 communicates with other modules of the electronic device 2000 via the bus 2030. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with the electronic device 2000, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.

Exemplary program product

In some possible embodiments, aspects of the present invention may also be implemented in the form of a program product comprising program code which, when the program product is run on a terminal device, causes the terminal device to perform the steps of the voice activity detection method or the audio processing model training method according to the various exemplary embodiments of the present invention described in the "method" section of the present specification; for example, the terminal device may perform steps S12 to S18 as described in fig. 1, or steps S112 to S116 as described in fig. 11.

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical disk, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In addition, as technology advances, readable storage media should also be interpreted accordingly.

Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).

It should be noted that while several modules or sub-modules in the above-described apparatus are mentioned in the above detailed description, such partitioning is merely exemplary and not mandatory. Indeed, the features and functions of two or more modules described above may be embodied in one module in accordance with embodiments of the present invention. Conversely, the features and functions of one module described above may be further divided into a plurality of modules to be embodied.

Furthermore, although the operations of the methods of the present invention are depicted in the drawings in a particular order, this is not required to either imply that the operations must be performed in that particular order or that all of the illustrated operations be performed to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform.

While the spirit and principles of the present invention have been described with reference to several particular embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, nor does the division into aspects imply that features of those aspects cannot be used in combination; the division is made merely for convenience of description. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims

1. A method for detecting voice activity, comprising:

determining the minimum short-time energy value in short-time energy values corresponding to peaks in the short-time energy envelope map as a bottom noise energy value;

2. The voice activity detection method of claim 1, wherein determining a short-time energy histogram of the audio signal comprises:

3. The voice activity detection method of claim 1, wherein determining an energy threshold based on the background noise energy value comprises:

4. The voice activity detection method of claim 3, wherein determining the energy threshold based on the smoothed floor noise energy value comprises:

5. The method of claim 1, wherein determining a first voice activity detection value based on the energy threshold and an energy value of a current frame audio signal, wherein the first voice activity detection value is used to represent an audio state of the current frame audio comprises:

6. The voice activity detection method according to any one of claims 1-5, further comprising:

determining a voice probability threshold value according to the voice scene;

7. The method of claim 6, wherein determining the speech probability of the current frame and the audio signal a preset number of frames before the current frame based on the first speech activity detection value comprises:

8. The voice activity detection method of claim 6, wherein determining a voice probability threshold based on a voice scenario comprises:

9. The voice activity detection method of claim 6, wherein determining a second voice activity detection value based on the voice probability threshold and the voice probability, wherein the second voice activity detection value is used to represent an audio state of a current frame of audio comprises:

10. The voice activity detection method of claim 9, further comprising:

11. A voice activity detection apparatus, comprising:

the energy value determining module is used for fitting the envelope of the short-time energy histogram to obtain a short-time energy envelope map; determining the minimum short-time energy value in short-time energy values corresponding to peaks in the short-time energy envelope map as a bottom noise energy value;

12. The voice activity detection apparatus according to claim 11, wherein the histogram determination module is configured to divide the audio signal in a preset period of time using preset short periods of time, and determine short-time energy corresponding to each of the divided preset short periods of time; and counting the short-time energy by using a histogram, and obtaining the short-time energy histogram.

13. The voice activity detection apparatus according to claim 11, wherein the energy threshold determining module is configured to perform smoothing on the background noise energy values of at least two adjacent frames to obtain a smoothed background noise energy value; and determining the energy threshold value according to the smooth background noise energy value.

14. The voice activity detection apparatus of claim 13, wherein the energy threshold determination module is further configured to weight the smoothed floor noise energy value as the energy threshold.

15. The voice activity detection apparatus according to claim 11, wherein the first detection value determination module is configured to determine that the current frame audio is voice audio if an energy value of the current frame audio signal is greater than or equal to the energy threshold value; and under the condition that the energy value of the current frame audio signal is smaller than the energy threshold value, determining that the current frame audio is non-voice audio.

16. The voice activity detection apparatus according to any one of claims 11-15, further comprising:

17. The voice activity detection apparatus according to claim 16, wherein the voice probability determination module is configured to determine an average value of the first voice activity detection value of the current frame audio signal and the first voice activity detection value of the preset frame audio signal, and take the average value as the voice probability.

18. The voice activity detection apparatus according to claim 16, wherein the probability threshold determining module is configured to determine the voice probability threshold according to a misjudgment rate or a missed judgment rate of the scene.

19. The voice activity detection apparatus according to claim 16, wherein the second detection value determination module is configured to determine that the current frame audio is voice audio if the voice probability is greater than or equal to the voice probability threshold; and under the condition that the voice probability is smaller than the voice probability threshold value, determining that the current frame of audio is non-voice audio.

20. The voice activity detection apparatus of claim 19, further comprising:

21. A storage medium having stored thereon a computer program, which when executed by a processor implements the voice activity detection method of any of claims 1 to 10.

22. An electronic device, comprising:

a processor; and

a memory for storing executable instructions of the processor;

wherein the processor is configured to perform the voice activity detection method of any one of claims 1 to 10 via execution of the executable instructions.