CN108039182B - Voice activation detection method

Info

Publication number: CN108039182B
Application number: CN201711407711.0A
Authority: CN
Inventors: 张亦希; 陈晨; 王陈春; 王业芳; 常浩宇; 王蕴; 舒敏; 王琼
Original assignee: Shaanxi Fenghuo Communication Group Co Ltd
Current assignee: Shaanxi Fenghuo Communication Group Co Ltd
Priority date: 2017-12-22
Filing date: 2017-12-22
Publication date: 2021-10-08
Anticipated expiration: 2037-12-22
Also published as: CN108039182A

Abstract

The invention belongs to the technical field of voice signal processing and discloses a voice activation detection method. The method performs voice activation detection by exploiting the fact that a voice signal has stronger autocorrelation than noise. It achieves low missed-detection and false-detection probabilities even in a relatively strong noise environment, has low computational complexity, and is easy to implement on various embedded platforms.

Description

Voice activation detection method

Technical Field

The invention belongs to the technical field of voice signal processing, and particularly relates to a voice activation detection method.

Background

A radio station can generally perform only half-duplex voice communication, whereas the voice signal from an IP network is generally full-duplex. A radio IP gateway must therefore convert between full duplex and half duplex. When the audio signal from the IP network contains no voice and only noise, the radio station is placed in the receiving state and the audio it receives is sent to the IP network. When the audio signal from the IP network contains a voice signal, the radio station is placed in the sending state and transmits the voice signal from the IP network.

Therefore, the radio IP gateway needs a voice activation detection algorithm to detect whether the audio signal from the IP network contains voice. The requirements on this algorithm generally include: (1) low complexity: the radio IP gateway usually adopts an embedded platform (such as one of the various ARM platforms) and uses a Linux operating system to process various protocols, so the algorithm must be simple enough to run on various embedded Linux platforms; (2) strong anti-noise performance: voice signals sent from different places through an IP network often contain noise of different amplitudes, so the algorithm must achieve low missed-detection and false-detection probabilities even in a strong noise environment.

Currently, the most widely used voice activity detection algorithm on embedded Linux platforms is the short-time energy and zero-crossing rate algorithm. It compares the calculated energy and zero-crossing rate of each frame with preset thresholds: if both exceed their thresholds at the same time, the current frame is judged to be a voice frame; if both, or either one, falls below a second set of thresholds, the current frame is judged to be noise.
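For reference, the two features used by this baseline can be computed per frame as in the following minimal Python sketch. The function names, the sum-of-squares energy, and the sign-change fraction are common textbook conventions assumed here; the source gives neither its exact formulas nor its thresholds.

```python
def short_time_energy(frame):
    """Sum of squared samples in one frame (assumed definition)."""
    return sum(s * s for s in frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (assumed definition)."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


# Hypothetical six-sample frame, for illustration only.
frame = [3, 1, -2, -4, 5, 2]
print(short_time_energy(frame))
print(zero_crossing_rate(frame))
```

Both features need only one pass over the frame, which is why this baseline is popular on embedded platforms.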

Disclosure of Invention

In view of the foregoing problems, an object of the present invention is to provide a voice activation detection method that achieves low missed-detection and false-detection probabilities in a relatively strong noise environment, has low computational complexity, and is easy to implement on various embedded platforms.

In order to achieve the purpose, the invention is realized by adopting the following technical scheme.

A voice activity detection method, the voice activity detection method comprising:

step 1, acquiring an audio signal sample stream, and dividing the audio signal sample stream into consecutive frames of audio samples;

step 2, setting a voice threshold and a noise threshold, and calculating the autocorrelation degree of the i-th frame of audio samples, wherein 1 ≤ i ≤ M, and M is the total number of audio sample frames contained in the audio signal sample stream;

step 3, when the autocorrelation degree of the i-th frame of audio samples is greater than the voice threshold, judging the i-th frame of audio samples to be a voice frame;

when the autocorrelation degree of the i-th frame of audio samples is smaller than the noise threshold, judging the i-th frame of audio samples to be a noise frame;

otherwise, when i is equal to 1, judging the 1st frame of audio samples to be a noise frame;

and when i is greater than 1, giving the i-th frame of audio samples the same judgment result as the (i-1)-th frame of audio samples.
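The framing of step 1 and the two-threshold decision of step 3 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, non-overlapping frames and dropping a short final remainder are assumptions, and the per-frame autocorrelation degrees are taken as given inputs.

```python
def split_into_frames(samples, frame_len):
    """Step 1: cut the stream into consecutive frames.

    ASSUMPTION: frames do not overlap and a short final remainder is dropped.
    """
    usable = len(samples) - len(samples) % frame_len
    return [samples[i:i + frame_len] for i in range(0, usable, frame_len)]


def classify_frames(degrees, speech_threshold, noise_threshold):
    """Step 3: label each frame True (voice) or False (noise)."""
    labels = []
    for r in degrees:
        if r > speech_threshold:
            labels.append(True)
        elif r < noise_threshold:
            labels.append(False)
        elif not labels:
            labels.append(False)       # first frame defaults to noise
        else:
            labels.append(labels[-1])  # inherit the previous decision
    return labels


# Hypothetical autocorrelation degrees and thresholds, for illustration only.
print(split_into_frames(list(range(10)), 4))
print(classify_frames([5, 20, 12, 3, 9], speech_threshold=15, noise_threshold=6))
```

Values that fall between the two thresholds keep the previous label, so the decision does not flicker when the autocorrelation degree hovers near a single cut-off.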

The technical scheme of the invention has the characteristics and further improvements that:

(1) In step 2, the autocorrelation degree R_i of the i-th frame of audio samples is calculated as follows:

wherein N represents the total number of sampling points contained in the i-th frame of audio samples, x_i(k) represents the k-th sampling point in the i-th frame of audio samples, x_i(k+1) represents the (k+1)-th sampling point in the i-th frame of audio samples, sgn(·) represents the sign function, and C represents a set constant greater than zero.

(2) The 1st frame of audio samples is set as a noise frame, the noise energy E of the 1st frame of audio samples is calculated, and the constant C is determined according to the noise energy E:

The method of the invention performs voice activation detection by exploiting the fact that a voice signal has stronger autocorrelation than noise. It achieves low missed-detection and false-detection probabilities even in a relatively strong noise environment, has low computational complexity, and is easy to implement on various embedded platforms.
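One way to picture an autocorrelation degree built from the symbols defined above is the Python sketch below. It is an assumed form only: it sums sgn(x_i(k)·x_i(k+1) + C) over adjacent sample pairs, which matches the defined symbols but is not confirmed by the source, whose formula may differ in scaling, summation limits, or the handling of a zero argument. The function names and the value of `c` are hypothetical.

```python
def sign(value):
    """Sign function: 1 for positive, -1 for negative, 0 for zero."""
    return (value > 0) - (value < 0)


def autocorrelation_degree(frame, c):
    """Sum sgn(x(k) * x(k+1) + c) over adjacent sample pairs.

    ASSUMPTION: this form is inferred from the symbols defined in the text;
    the source formula may differ.
    """
    return sum(sign(frame[k] * frame[k + 1] + c) for k in range(len(frame) - 1))


slowly_varying = [10, 12, 13, 12, 9, 5]    # neighbouring samples share a sign
alternating = [10, -12, 13, -12, 9, -5]    # neighbouring samples flip sign
print(autocorrelation_degree(slowly_varying, c=1))
print(autocorrelation_degree(alternating, c=1))
```

Under this assumed form, each frame costs one multiplication, one addition, and one sign test per sample pair, with no division or floating-point arithmetic for integer samples.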

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flow chart of a voice activation detection method according to an embodiment of the present invention;

Fig. 2 is a schematic diagram comparing the probability distribution function with the standard normal distribution function according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of simulation results of the conventional method and the method of the present invention according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

An embodiment of the present invention provides a voice activation detection method. As shown in Fig. 1, the voice activation detection method includes:

Step 1, acquiring an audio signal sample stream, and dividing the audio signal sample stream into consecutive frames of audio samples.

Step 2, setting a voice threshold and a noise threshold, and calculating the autocorrelation degree of the i-th frame of audio samples, wherein 1 ≤ i ≤ M, and M is the total number of audio sample frames contained in the audio signal sample stream.

In step 2, the autocorrelation degree R_i of the i-th frame of audio samples is calculated as follows:

Further, the 1st frame of audio samples is set as a noise frame, the noise energy E of the 1st frame of audio samples is calculated, and the constant C is determined according to the noise energy E as follows:

It should be noted that, when the audio samples do not include a speech signal, x(k) can be assumed to be an additive white Gaussian noise signal that obeys a normal distribution.

Let:

Z = X(k)X(k+1) (1)

The probability distribution function of Z can be shown to be:

A comparison of the probability distribution function f(z) and the standard normal distribution function is shown in Fig. 2. Next, let:

U = Z + C (3)

The probability that U is greater than or equal to 0 can be expressed as:

As can be seen from the shaded part in Fig. 2, a reasonable choice of C makes the probability P{U ≥ 0} decrease rapidly, whereas a speech signal usually has stronger correlation, so the invention can significantly improve the anti-noise performance of the voice activation detection algorithm. When C is equal to 0, P{U ≥ 0} is larger, so the short-time energy and zero-crossing rate algorithm cannot distinguish the speech signal from the noise signal in a strong noise environment.
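As a reference point for the two quantities discussed here, and under the added assumption that X(k) and X(k+1) are independent zero-mean Gaussian variables with a common variance σ² (the symbol σ is introduced for illustration and does not appear in the source), the density of their product has a known closed form, and the probability follows from the definition U = Z + C. This is the textbook result for a product of two normal variables, not necessarily the exact expression used in the source:

```latex
f(z) = \frac{1}{\pi\sigma^{2}}\, K_{0}\!\left(\frac{|z|}{\sigma^{2}}\right),
\qquad
P\{U \ge 0\} = P\{Z \ge -C\} = \int_{-C}^{\infty} f(z)\,\mathrm{d}z
```

Here K_0 denotes the modified Bessel function of the second kind of order zero. The density is symmetric about z = 0 and sharply peaked there, so the probability is sensitive to where the integration limit sits relative to the origin.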

Computer simulation results also confirm the effectiveness and superiority of the method. The original speech signal, the noisy speech signal, and the short-time energy, zero-crossing rate and autocorrelation degree of each frame are shown in Fig. 3, where Fig. 3(a) is the original speech time-domain signal, Fig. 3(b) is the noisy speech time-domain signal, Fig. 3(c) shows the detection result of the existing short-time energy and short-time zero-crossing rate method, and Fig. 3(d) shows the detection result of the autocorrelation-based method of the present invention. As can be seen from Fig. 3, at a signal-to-noise ratio of 2 dB it is already difficult to distinguish speech from noise using the short-time energy and zero-crossing rate indicators, but the two can still be effectively distinguished using the autocorrelation degree. The autocorrelation-based voice activation detection algorithm therefore effectively improves anti-noise performance at the cost of N additional integer multiplications, and can run on various embedded Linux platforms.
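A test signal at a chosen signal-to-noise ratio, such as the 2 dB case mentioned above, can be prepared along the lines of the following Python sketch. The source does not describe its simulation setup, so everything here is an assumption for illustration: the function names, the sine tone standing in for speech, the 8 kHz sampling rate, and the fixed random seed.

```python
import math
import random


def add_noise_at_snr(signal, snr_db, seed=0):
    """Return signal plus white Gaussian noise scaled to the requested SNR (dB)."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = sum(n * n for n in noise) / len(noise)
    # Scale the noise so that 10*log10(signal_power / scaled_noise_power) == snr_db.
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(signal, noise)]


def measured_snr_db(clean, noisy):
    """SNR in dB computed from a clean signal and its noisy version."""
    signal_power = sum(s * s for s in clean)
    noise_power = sum((y - s) ** 2 for s, y in zip(clean, noisy))
    return 10 * math.log10(signal_power / noise_power)


# Hypothetical stand-in for speech: a 200 Hz tone sampled at 8 kHz.
clean = [math.sin(2 * math.pi * 200 * t / 8000) for t in range(800)]
noisy = add_noise_at_snr(clean, snr_db=2.0)
print(round(measured_snr_db(clean, noisy), 6))
```

Scaling the noise by its measured power, rather than its nominal variance, makes the realized SNR match the target for the specific noise sequence drawn.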

Those of ordinary skill in the art will understand that: all or part of the steps for realizing the method embodiments can be completed by hardware related to program instructions, the program can be stored in a computer readable storage medium, and the program executes the steps comprising the method embodiments when executed; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.

The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims

1. A voice activity detection method, characterized in that the voice activity detection method comprises:

calculating the autocorrelation degree R_i of the i-th frame of audio samples as follows:

wherein N represents the total number of sampling points contained in the i-th frame of audio samples, x_i(k) represents the k-th sampling point in the i-th frame of audio samples, x_i(k+1) represents the (k+1)-th sampling point in the i-th frame of audio samples, sgn(·) represents the sign function, and C represents a set constant greater than zero;

step 3, when the autocorrelation degree of the i-th frame of audio samples is greater than the voice threshold, judging the i-th frame of audio samples to be a voice frame; when the autocorrelation degree of the i-th frame of audio samples is smaller than the noise threshold, judging the i-th frame of audio samples to be a noise frame;

otherwise,

when i is equal to 1, judging the 1st frame of audio samples to be a noise frame; and when i is greater than 1, giving the i-th frame of audio samples the same judgment result as the (i-1)-th frame of audio samples.

2. A voice activity detection method as claimed in claim 1,

the 1st frame of audio samples is set as a noise frame, the noise energy E of the 1st frame of audio samples is calculated, and the constant C is determined according to the noise energy E: