CN113327589B - Voice activity detection method based on attitude sensor - Google Patents

Voice activity detection method based on attitude sensor

Info

Publication number
CN113327589B
CN113327589B (application CN202110646290.7A)
Authority
CN
China
Prior art keywords
data
gesture
characteristic data
neural network
voice activity
Prior art date
Legal status
Active
Application number
CN202110646290.7A
Other languages
Chinese (zh)
Other versions
CN113327589A (en)
Inventor
王蒙
胡奎
姜黎
Current Assignee
Hangzhou Ccvui Intelligent Technology Co ltd
Original Assignee
Hangzhou Ccvui Intelligent Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Ccvui Intelligent Technology Co ltd filed Critical Hangzhou Ccvui Intelligent Technology Co ltd
Priority to CN202110646290.7A
Publication of CN113327589A
Application granted
Publication of CN113327589B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS — G10 MUSICAL INSTRUMENTS; ACOUSTICS — G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/063 Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/24 Speech recognition using non-acoustical features
    • G10L19/02 Speech or audio signal analysis-synthesis for redundancy reduction using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L21/0232 Noise filtering characterised by the method used for estimating noise, with processing in the frequency domain
    • G10L25/18 Speech or voice analysis techniques characterised by the extracted parameters being spectral information of each sub-band

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a voice activity detection method based on an attitude sensor and relates to the technical field of human-computer interaction. According to the invention, attitude feature data and sound feature data are spliced to obtain mixed feature data; the neural network model is trained on the mixed feature data, so that voice activity can be detected accurately under different postures, which solves the problem that the user's posture affects the accuracy of voice activity detection. The weights of the trained neural network are quantized and compressed with a three-value (ternary) quantization method, and the 32-bit floating-point weights are quantized into 2-bit fixed-point weights, which further reduces the memory footprint and greatly reduces the consumption of computation space and time. A recurrent neural network model is used to build the data connection between preceding and following frames to improve the model effect, and the small number of recurrent neural network parameters further reduces the memory footprint.

Description

Voice activity detection method based on attitude sensor
Technical Field
The invention relates to the technical field of man-machine interaction, in particular to a voice activity detection method based on a gesture sensor.
Background
Voice activity detection (Voice Activity Detection, VAD) is the classical problem of detecting speech segments and non-speech segments in a noise-containing speech signal. It has become an indispensable component of various speech signal processing systems, such as speech coding (Speech Coding), speech enhancement (Speech Enhancement) and automatic speech recognition (Automatic Speech Recognition), and as digital devices continue to develop, voice activity detection is used on them more and more widely.
As a popular product, embedded earphone technology is continuously being innovated. An embedded earphone is generally connected to a smart device, has an audio playback function, and can interact with the smart device by collecting the user's voice, the user's gesture information and the like. Compared with a traditional earphone it is more intelligent and richer in functions, and it has quickly gained people's approval.
As a device that interacts with smart devices, an embedded earphone has high requirements on its data acquisition capability. For example, when a smartphone is controlled through the embedded earphone, clear speech needs to be acquired; although the smartphone can usually perform noise reduction, separation and other operations on the collected audio data, if the embedded earphone cannot guarantee the clarity and accuracy of the audio data it provides, even powerful audio processing software on the smartphone will not help.
The working environment of an embedded earphone is complex and varied. The various postures of the user affect the collection and recognition of sound, and these posture changes reduce the quality of the collected audio data, so relevant measures are needed to improve it.
To this end, application CN201911174434.2 discloses a system for detecting the voice activity of a headset wearer based on microphone technology, comprising: a microphone array, a first estimation module, a second estimation module and a joint control module. The microphone array receives sound signals; the first estimation module determines a first voice presence probability of the wearer according to the direction of arrival of the sound source; the second estimation module determines a second voice presence probability of the wearer according to the direct-to-reverberant ratio of the sound source; and the joint control module determines a third voice presence probability from the first and second voice presence probabilities and performs voice activity detection for the wearer. Using microphone array technology, the headset wearer's voice activity can be detected even in complex acoustic scenes with low signal-to-noise ratio, high reverberation and interference from multiple speakers, providing an important basis for subsequent speech enhancement and speech recognition.
However, that system does not address the changes in audio data acquisition caused by the user's posture, so a voice activity detection method that eliminates the influence of the user's posture needs to be proposed to solve the above problems.
Disclosure of Invention
In order to solve the above technical problems, the voice activity detection method based on the attitude sensor is applied to an audio acquisition device provided with an attitude sensor; it performs quantized neural network training by constructing mixed feature data that takes both attitude feature data and sound feature data into account, and obtains an optimal solution of a neural network model, wherein the neural network model is used for voice activity detection and the mixed feature data are constructed through the following steps:
acquiring the gesture change of the audio acquisition device through a gesture sensor and recording the gesture change as gesture characteristic data;
collecting external sound changes through an audio collecting device and taking the external sound changes as sound characteristic data;
respectively carrying out data preprocessing operation on the attitude characteristic data and the sound characteristic data;
performing feature stitching on the preprocessed gesture feature data and sound feature data to obtain mixed feature data;
and taking the mixed characteristic data as neural network quantization training data for subsequent model training.
As a more specific solution, the sound feature data are MFCC feature data, and MFCC feature extraction and sound feature data preprocessing are performed through the following steps:
pre-emphasizing the sound data with a high-pass filter;
framing the pre-emphasized data with a framing function;
applying a window function to each frame to perform windowing;
performing a fast Fourier transform on each windowed frame to obtain the energy spectrum of each frame;
performing a discrete cosine transform on the energy spectrum to obtain the MFCC coefficients;
extracting first-order differential parameters from the Mel spectrogram;
and splicing the MFCC coefficients and the first-order differential parameters to obtain the MFCC feature data.
As a more specific solution, the gesture feature data preprocessing operation is an operation of converting time domain gesture feature data into frequency domain gesture feature data, wherein the gesture feature data is gesture feature data including an X axis, a Y axis and a Z axis, and the gesture feature data preprocessing operation is performed by:
framing operation is carried out on the gesture feature data, and each frame of the gesture feature data corresponds to each frame of the sound feature data one by one;
calculating the frame-to-frame displacement (velocity) and the acceleration of each frame from the attitude feature data, where the calculation formulas are as follows:
s(n)=f(n)-f(n-1);n∈(0,512];
as(n)=s(n)-s(n-1);n∈(0,512];
where s(n) represents the velocity of the nth frame, as(n) represents the acceleration of the nth frame, and f(n) represents the position data tag of the nth frame;
respectively carrying out logarithmic transformation on the calculated speed and acceleration;
and splicing the speed and the acceleration together to obtain gesture characteristic data.
As a more specific solution, the preprocessed gesture feature data and sound feature data are feature-spliced through the following steps:
labeling the collected sound feature data and attitude feature data with one-to-one point markers according to their real-time corresponding positions;
annotating the gesture feature data of the gesture sensor with the start and end positions of the sound feature data;
according to the signal-to-noise ratio requirement, mixing random noise data into the labeled sound feature data at a random SNR, while ensuring that the mixed data still corresponds one-to-one to the start and end positions of the sound feature data;
matching the mixed data against the gesture feature data labeled with point markers to obtain one piece of feature-spliced training data;
and performing feature splicing on all the gesture feature data and sound feature data to obtain the feature-spliced training data set.
As a more specific solution, the neural network model is a recurrent neural network model, and the recurrent neural network model collects the information of adjacent frames and adjusts the weight matrix for voice activity detection of the current frame according to the information of the adjacent frames.
As a more specific solution, the weights of the trained neural network are quantized and compressed, and the 32-bit floating-point weights are quantized into 2-bit fixed-point weights through quantization compression; the quantization compression steps are as follows:
calculating the threshold Δ and the scaling factor α from the original weight matrix;
converting the original weights into three-value weights;
multiplying the input X by α as a new input, and then combining it with the three-value weights by addition, in place of the original multiplication, for forward propagation;
performing iterative training using back propagation with the SGD algorithm.
As a more specific solution, the original weight matrix W is approximately expressed as the three-value weight W^t multiplied by the scaling factor α, where the three-value weight W^t is expressed as:
W^t_i = +1, if W_i > Δ; W^t_i = 0, if |W_i| ≤ Δ; W^t_i = −1, if W_i < −Δ
wherein: the threshold Δ is generated from the original weight matrix, the threshold being:
Δ = (0.7 / n) · Σ_{i=1…n} |W_i|
wherein: i represents the sequence number of a weight element, and n represents the total number of weight elements;
the scaling factor is:
α = (1 / |I_Δ|) · Σ_{i∈I_Δ} |W_i|
wherein: I_Δ = {i | 1 ≤ i ≤ n, |W_i| > Δ}, and |I_Δ| represents the number of elements in I_Δ.
As a more specific solution, the windowing operation is performed by a hamming window function, which is:
W(n) = a₀ − (1 − a₀) · cos(2πn / (N − 1)), 0 ≤ n ≤ N − 1, where a₀ = 25/46 is the Hamming window constant and N − 1 is the length of the intercepted window;
the pre-emphasis weighting factor is 0.97, the first-order differential parameter extraction of the Mel spectrogram is completed by a Mel filter, and the Mel filter function is:
Mel(f) = 2595 · lg(1 + f / 700)
where f represents the actual frequency of the signal that needs to be filtered.
As a more specific solution, voice activity detection is performed through the trained neural network model; the neural network model is a deep neural network model that processes the feature data of the audio signal requiring voice activity detection frame by frame, and the speech/non-speech posterior probability is calculated from the output of the deep neural network model through a softmax function; the posterior probability lies between 0 and 1, and a frame is considered speech when the posterior probability exceeds the decision threshold and non-speech when it does not.
Compared with the related art, the voice activity detection method based on the attitude sensor provided by the invention has the following beneficial effects:
1. According to the invention, the gesture feature data and the sound feature data are spliced to obtain mixed feature data; the neural network model is trained on the mixed feature data, so that voice activity can be detected accurately under different postures, solving the problem that the user's posture affects the accuracy of voice activity detection;
2. The invention quantizes and compresses the weights of the trained neural network with a three-value (ternary) quantization method, quantizing the 32-bit floating-point weights into 2-bit fixed-point weights, which further reduces the memory footprint and greatly reduces the consumption of computation space and time;
3. The invention considers the influence of adjacent-frame information on the VAD decision for the current frame and uses a recurrent neural network model to build the data connection between preceding and following frames so as to improve the model effect; moreover, the recurrent neural network has few model parameters, which further reduces the memory footprint.
Drawings
Fig. 1 is a system schematic diagram of a voice activity detection method based on an attitude sensor according to a preferred embodiment of the present invention.
Detailed Description
The invention will be further described with reference to the drawings and embodiments.
As shown in fig. 1, the voice activity detection method based on the gesture sensor is applied to an audio acquisition device with the gesture sensor.
Specifically, conventional voice activity detection methods are difficult to adapt to the usage scenarios of devices such as earphones: because different user postures keep changing the detection scene, the accuracy of voice activity detection is difficult to guarantee, and the influence of the user's posture is difficult to eliminate through simple algorithmic improvements.
This embodiment combines an attitude sensor with the audio acquisition device in order to eliminate the influence of posture and increase the robustness of the system. The attitude sensor, usually with three or more axes, is mounted together with the audio acquisition device, so that the attitude information of the audio acquisition device can be acquired in real time. Features are extracted from the acquired attitude information and sound information, quantized neural network training is performed on mixed feature data that takes both attitude feature data and sound feature data into account, and the optimal solution of the neural network model is obtained. The neural network model trained in this way can combine attitude information to perform real-time voice activity detection on the sound information, thereby improving the accuracy and robustness of voice activity detection.
Specifically, the neural network model is used for voice activity detection, and the mixed characteristic data is constructed through the following steps:
acquiring the gesture change of the audio acquisition device through a gesture sensor and recording the gesture change as gesture characteristic data;
collecting external sound changes through an audio collecting device and taking the external sound changes as sound characteristic data;
respectively carrying out data preprocessing operation on the attitude characteristic data and the sound characteristic data;
performing feature stitching on the preprocessed gesture feature data and sound feature data to obtain mixed feature data;
and taking the mixed characteristic data as neural network quantization training data for subsequent model training.
It should be noted that the mixed feature data take both sound features and attitude features into account and are used for subsequent model training to enhance the adaptability and robustness of the model for voice activity detection under different postures.
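For illustration only, a minimal Python sketch of the splicing idea is given below; the array shapes, feature dimensions and function names are assumptions for the example and are not specified by the patent. Per-frame sound features and per-frame attitude features that correspond one-to-one are concatenated along the feature dimension to form the mixed feature data.

```python
# Illustrative sketch only: frame-aligned concatenation of sound features and
# attitude features into mixed feature data. Shapes and names are assumptions.
import numpy as np

def build_mixed_features(mfcc_feats: np.ndarray, attitude_feats: np.ndarray) -> np.ndarray:
    """Concatenate per-frame MFCC features (T, D_sound) with per-frame attitude
    features (T, D_attitude) into mixed features of shape (T, D_sound + D_attitude)."""
    assert mfcc_feats.shape[0] == attitude_feats.shape[0], "frames must correspond one-to-one"
    return np.concatenate([mfcc_feats, attitude_feats], axis=1)

# Example: 512 frames, 26-dim MFCC + delta, 6-dim attitude features (assumed sizes)
mixed = build_mixed_features(np.random.randn(512, 26), np.random.randn(512, 6))
print(mixed.shape)  # (512, 32)
```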
As a more specific solution, the sound feature data are MFCC feature data, and MFCC feature extraction and sound feature data preprocessing are performed through the following steps:
pre-emphasizing the sound data with a high-pass filter;
framing the pre-emphasized data with a framing function;
applying a window function to each frame to perform windowing;
performing a fast Fourier transform on each windowed frame to obtain the energy spectrum of each frame;
performing a discrete cosine transform on the energy spectrum to obtain the MFCC coefficients;
extracting first-order differential parameters from the Mel spectrogram;
and splicing the MFCC coefficients and the first-order differential parameters to obtain the MFCC feature data.
It should be noted that in voice activity detection the present embodiment uses Mel-frequency cepstral coefficients (Mel-scale Frequency Cepstral Coefficients, MFCC for short). The MFCC is designed according to the hearing mechanism of the human ear, which has different sensitivity to sound waves of different frequencies. Speech signals from 200 Hz to 5000 Hz have a large impact on speech intelligibility. When two sounds of unequal loudness act on the human ear, the presence of the louder frequency components affects the perception of the quieter frequency components, making them less noticeable; this phenomenon is known as the masking effect. Since lower-frequency sounds travel farther along the basilar membrane of the cochlea than higher-frequency sounds, low tones tend to mask high tones, while it is more difficult for high tones to mask low tones, and the critical bandwidth of sound masking at low frequencies is smaller than at higher frequencies. Therefore, a set of band-pass filters, arranged from dense to sparse according to the critical bandwidth over the band from low to high frequency, is used to filter the input signal. The energy of the signal output by each band-pass filter is taken as a basic feature of the signal, and after further processing it can be used as the input feature for speech. Because these features do not depend on the nature of the signal, make no assumptions or restrictions on the input signal, and exploit the research results of auditory models, such parameters are more robust than the vocal-tract-model-based LPCC, fit the auditory properties of the human ear better, and still offer good recognition performance when the signal-to-noise ratio decreases.
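As a non-authoritative illustration of the extraction steps listed above, the following sketch uses the librosa library; the library choice, sampling rate, frame/hop lengths and number of coefficients are assumptions for the example and are not prescribed by the patent.

```python
# Illustrative sketch only: one common way to realize the listed MFCC steps with librosa.
import librosa
import numpy as np

def extract_mfcc_features(wav_path: str, sr: int = 16000, n_mfcc: int = 13) -> np.ndarray:
    y, _ = librosa.load(wav_path, sr=sr)
    # Pre-emphasis with the 0.97 weighting factor mentioned in the patent
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])
    # Framing, Hamming windowing, FFT, Mel filtering and DCT are handled inside librosa
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=512, hop_length=256, window="hamming")
    delta = librosa.feature.delta(mfcc, order=1)        # first-order differential parameters
    return np.concatenate([mfcc, delta], axis=0).T      # shape: (frames, 2 * n_mfcc)
```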
As a more specific solution, the gesture feature data preprocessing operation is an operation of converting time domain gesture feature data into frequency domain gesture feature data, wherein the gesture feature data is gesture feature data including an X axis, a Y axis and a Z axis, and the gesture feature data preprocessing operation is performed by:
framing operation is carried out on the gesture feature data, and each frame of the gesture feature data corresponds to each frame of the sound feature data one by one;
calculating the frame-to-frame displacement (velocity) and the acceleration of each frame from the attitude feature data, where the calculation formulas are as follows:
s(n)=f(n)-f(n-1);n∈(0,512];
as(n)=s(n)-s(n-1);n∈(0,512];
where s(n) represents the velocity of the nth frame, as(n) represents the acceleration of the nth frame, and f(n) represents the position data tag of the nth frame;
respectively carrying out logarithmic transformation on the calculated speed and acceleration;
and splicing the speed and the acceleration together to obtain gesture characteristic data.
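The following is a minimal sketch of this preprocessing, assuming the attitude sensor provides one X/Y/Z position tag f(n) per frame. Taking the logarithm of the absolute value plus a small epsilon is an assumption for the example, since the patent only states that a logarithmic transformation is applied.

```python
# Illustrative sketch only: per-frame velocity/acceleration from attitude position tags,
# followed by a logarithmic transformation and splicing. Details are assumptions.
import numpy as np

def preprocess_attitude(frames_xyz: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """frames_xyz: per-frame X/Y/Z position tags f(n), shape (T, 3), aligned one-to-one
    with the sound frames. Returns per-frame spliced features of shape (T, 6)."""
    speed = np.diff(frames_xyz, axis=0, prepend=frames_xyz[:1])   # s(n) = f(n) - f(n-1)
    accel = np.diff(speed, axis=0, prepend=speed[:1])             # as(n) = s(n) - s(n-1)
    log_speed = np.log(np.abs(speed) + eps)                       # logarithmic transformation
    log_accel = np.log(np.abs(accel) + eps)
    return np.concatenate([log_speed, log_accel], axis=1)         # splice speed and acceleration
```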
As a more specific solution, the preprocessed gesture feature data and sound feature data are feature-spliced through the following steps:
labeling the collected sound feature data and attitude feature data with one-to-one point markers according to their real-time corresponding positions;
annotating the gesture feature data of the gesture sensor with the start and end positions of the sound feature data;
according to the signal-to-noise ratio requirement, mixing random noise data into the labeled sound feature data at a random SNR, while ensuring that the mixed data still corresponds one-to-one to the start and end positions of the sound feature data;
matching the mixed data against the gesture feature data labeled with point markers to obtain one piece of feature-spliced training data;
and performing feature splicing on all the gesture feature data and sound feature data to obtain the feature-spliced training data set.
It should be noted that labeling the point markers and annotating the gesture feature data and the sound feature data is the precondition for ensuring their strict real-time correspondence; only if this processing is done correctly can a good training result be obtained.
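A hedged sketch of the noise-mixing step is shown below, operating on a labeled speech waveform. The SNR range, the power-based scaling and the use of np.resize to length-match the noise are assumptions for the example; the patent only requires mixing at a random SNR while keeping the labeled start and end positions unchanged.

```python
# Illustrative sketch only: mix noise into labeled speech at a randomly chosen SNR.
import numpy as np

def mix_at_random_snr(speech: np.ndarray, noise: np.ndarray,
                      snr_db_range=(0.0, 20.0), rng=np.random) -> np.ndarray:
    """Add noise to the labeled speech at a random SNR; the speech samples stay in
    place, so the labeled start/end positions (and frame labels) are unchanged."""
    snr_db = rng.uniform(*snr_db_range)
    noise = np.resize(noise, speech.shape)                  # repeat/trim noise to match length
    p_speech = np.mean(speech ** 2) + 1e-12
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```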
As a more specific solution, the neural network model is a recurrent neural network model, and the recurrent neural network model collects the information of adjacent frames and adjusts the weight matrix for voice activity detection of the current frame according to the information of the adjacent frames.
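As an illustrative sketch only (the patent does not specify the architecture beyond its recurrent nature), a small GRU-based model of this idea could look as follows; the choice of GRU, the layer sizes and the two-class output are assumptions.

```python
# Illustrative sketch only: a recurrent VAD model that carries adjacent-frame
# information through a GRU and outputs per-frame speech/non-speech posteriors.
import torch
import torch.nn as nn

class RnnVad(nn.Module):
    def __init__(self, feat_dim: int = 32, hidden: int = 64):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)   # links preceding/following frames
        self.out = nn.Linear(hidden, 2)                          # non-speech / speech

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, feat_dim) mixed feature data
        h, _ = self.rnn(x)
        return torch.softmax(self.out(h), dim=-1)                # per-frame posteriors
```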
As a more specific solution, the weights of the trained neural network are quantized and compressed, and the 32-bit floating-point weights are quantized into 2-bit fixed-point weights through quantization compression; the quantization compression steps are as follows:
calculating the threshold Δ and the scaling factor α from the original weight matrix;
converting the original weights into three-value weights;
multiplying the input X by α as a new input, and then combining it with the three-value weights by addition, in place of the original multiplication, for forward propagation;
performing iterative training using back propagation with the SGD algorithm.
It should be noted that artificial neural networks allow computers to reach an unprecedented level of performance on speech recognition tasks. However, the high complexity of such models brings high consumption of memory space and computing resources, which makes it difficult to deploy them on every hardware platform.
To address these issues, the model is compressed to minimize its consumption of computation space and time. Currently, a mainstream network such as VGG16 has more than 130 million parameters, occupies more than 500 MB of space, and requires more than 30 billion floating-point operations to complete one recognition task.
An artificial neural network contains a large number of redundant nodes, and only a small fraction (5-10%) of the weights participate in the main computation; in other words, training only a small fraction of the weight parameters can achieve performance similar to that of the original network. Therefore, the trained neural network model needs to be compressed. Neural network model compression includes tensor decomposition, model pruning and model quantization.
Tensor decomposition approximates the full-rank weight matrices of the network with several low-rank matrices. It is suitable for model compression but is not easy to implement, involves computationally expensive decomposition operations, and requires extensive retraining to reach convergence.
Model pruning removes relatively unimportant weights from the weight matrix and then fine-tunes the network. However, pruning leads to irregular network connections, so sparse representations are needed to reduce memory usage, and forward propagation then requires a large amount of conditional branching and extra space to mark zero and non-zero parameter positions. The pruned model is therefore not suitable for parallel computation, and unstructured sparsity requires special software libraries or hardware.
Therefore, quantization is used to compress the model. Generally, the weights of a neural network model are represented as 32-bit floating-point numbers. Such high precision is often unnecessary, and the weights can instead be quantized, for example to 8 bits, reducing the space required for each weight at the cost of some precision. The precision required by SGD is only 6-8 bits, so a reasonable quantization scheme reduces the storage size of the model while preserving accuracy. According to the quantization method, quantization can be divided into binary quantization, ternary (three-value) quantization and multi-valued quantization. In this embodiment, ternary quantization is selected: compared with binary quantization, a value of 0 is added to the two values +1 and -1 to form a ternary network, without increasing the amount of computation.
Iterative training is performed using back propagation with the SGD algorithm, where the computed gradients are used to adjust the weights of the neural network. SGD is a form of gradient descent; as it adjusts the weights, the neural network produces more desirable outputs, and the overall error of the network should decrease as training proceeds.
As a more specific solution, the original weight matrix W is approximately expressed as the three-value weight W^t multiplied by the scaling factor α, where the three-value weight W^t is expressed as:
W^t_i = +1, if W_i > Δ; W^t_i = 0, if |W_i| ≤ Δ; W^t_i = −1, if W_i < −Δ
wherein: a threshold Δ is generated from the original weight matrix W, which is:
Δ = (0.7 / n) · Σ_{i=1…n} |W_i|
wherein: i represents the sequence number of a weight element, and n represents the total number of weight elements;
the scaling factor α is:
α = (1 / |I_Δ|) · Σ_{i∈I_Δ} |W_i|
wherein: I_Δ = {i | 1 ≤ i ≤ n, |W_i| > Δ}, and |I_Δ| represents the number of elements in I_Δ.
As a more specific solution, the windowing operation is performed by a hamming window function, which is:
W(n) = a₀ − (1 − a₀) · cos(2πn / (N − 1)), 0 ≤ n ≤ N − 1, where a₀ = 25/46 is the Hamming window constant and N − 1 is the length of the intercepted window;
the pre-emphasis weighting factor is 0.97, the first-order differential parameter extraction of the Mel spectrogram is completed by a Mel filter, and the Mel filter function is:
Mel(f) = 2595 · lg(1 + f / 700)
where f represents the actual frequency of the signal that needs to be filtered.
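For reference, a short sketch evaluating the Hamming window and the Hz-to-Mel conversion as reconstructed above; the 2595·lg(1 + f/700) form is the standard Mel mapping and is assumed here, since the original figure is not legible.

```python
# Illustrative sketch only: Hamming window with a0 = 25/46 and Hz-to-Mel conversion.
import numpy as np

def hamming(N: int, a0: float = 25 / 46) -> np.ndarray:
    n = np.arange(N)
    return a0 - (1 - a0) * np.cos(2 * np.pi * n / (N - 1))

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

print(hamming(512)[:3], hz_to_mel(1000.0))  # mel(1000 Hz) ≈ 1000
```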
As a more specific solution, voice activity detection is performed through the trained neural network model; the neural network model is a deep neural network model that processes the feature data of the audio signal requiring voice activity detection frame by frame, and the speech/non-speech posterior probability is calculated from the output of the deep neural network model through a softmax function; the posterior probability lies between 0 and 1, and a frame is considered speech when the posterior probability exceeds the decision threshold and non-speech when it does not.
It should be noted that the neural network model obtained by training on the mixed feature data adapts well to voice activity detection under various postures. The softmax function is mainly used to normalize the model's output: it "compresses" a K-dimensional vector z of arbitrary real numbers into another K-dimensional real vector σ(z) in which each element lies in the range (0, 1) and all elements sum to 1. Speech and non-speech can therefore be classified accurately from the softmax output.
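A minimal sketch of this decision step follows; the two-class output layout and the 0.5 threshold are assumptions for the example, since the patent only requires a decision threshold on the posterior probability between 0 and 1.

```python
# Illustrative sketch only: softmax posterior per frame followed by a threshold decision.
import numpy as np

def vad_decision(logits: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """logits: per-frame model outputs of shape (T, 2) for [non-speech, speech].
    Returns a boolean speech decision per frame."""
    z = logits - logits.max(axis=1, keepdims=True)            # numerically stable softmax
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    speech_posterior = probs[:, 1]                             # posterior lies in (0, 1)
    return speech_posterior > threshold                        # speech if above the threshold
```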
The foregoing description is only illustrative of the present invention and is not intended to limit the scope of the invention, and all equivalent structures or equivalent processes or direct or indirect application in other related technical fields are included in the scope of the present invention.

Claims (8)

1. The voice activity detection method based on the gesture sensor is applied to an audio acquisition device with the gesture sensor, and is characterized in that the method comprises the steps of performing neural network quantization training by constructing mixed characteristic data considering gesture characteristic data and sound characteristic data, and obtaining an optimal solution of a neural network model, wherein the neural network model is used for voice activity detection, and the mixed characteristic data is constructed through the following steps:
acquiring the gesture change of the audio acquisition device through a gesture sensor and recording the gesture change as gesture characteristic data;
collecting external sound changes through an audio collecting device and taking the external sound changes as sound characteristic data;
respectively carrying out data preprocessing operation on the attitude characteristic data and the sound characteristic data;
performing feature stitching on the preprocessed gesture feature data and sound feature data to obtain mixed feature data;
taking the mixed characteristic data as neural network quantized training data for subsequent model training;
and performing characteristic splicing on the preprocessed gesture characteristic data and sound characteristic data by the following steps:
the collected sound characteristic data and attitude characteristic data are subjected to one-to-one point location marking information according to the real-time corresponding positions;
information labeling of the starting position and the ending position of sound characteristic data is carried out on the gesture characteristic data of the gesture sensor;
according to the signal-to-noise ratio requirement, mixing the random noise data with the marked voice characteristic data in a random SNR mode, and ensuring that the mixed data corresponds to the starting position and the ending position of the voice characteristic data one by one;
performing standard matching on the mixed data and gesture feature data marked with point location information, and obtaining training data with a spliced feature;
and performing feature stitching on all the gesture feature data and the sound feature data, and obtaining a training data set after feature stitching.
2. The method for detecting voice activity based on an attitude sensor according to claim 1, wherein the sound feature data is MFCC feature data, and the MFCC sound feature data extraction and sound feature data preprocessing operations are performed by:
pre-emphasis is carried out on the sound characteristic data through a high-pass filter;
framing the pre-emphasis data by a framing function;
carrying each frame into a window function to carry out windowing operation;
performing fast Fourier transform on each frame of the windowed signal to obtain an energy spectrum of each frame;
performing a discrete cosine transform on the energy spectrum to obtain the MFCC coefficients;
extracting first-order differential parameters from the Mel spectrogram;
and splicing the MFCC coefficients and the first-order differential parameters to obtain MFCC characteristic data.
3. The method for detecting voice activity based on an attitude sensor according to claim 1, wherein the operation of preprocessing the attitude feature data is an operation of converting time domain attitude feature data into frequency domain attitude feature data, the attitude feature data being attitude feature data including an X axis, a Y axis and a Z axis, the operation of preprocessing the attitude feature data being performed by:
framing operation is carried out on the gesture feature data, and each frame of the gesture feature data corresponds to each frame of the sound feature data one by one;
calculating the displacement of each frame by the attitude characteristic data, wherein the calculation formula is as follows:
s(n)=f(n)-f(n-1);n∈(0,512];
as(n)=s(n)-s(n-1);n∈(0,512];
where s (n) represents the speed of the nth frame, as (n) represents the acceleration of the nth frame, and f (n) represents the data position tag of the nth frame;
respectively carrying out logarithmic transformation on the calculated speed and acceleration;
and splicing the speed and the acceleration together to obtain gesture characteristic data.
4. The method for detecting voice activity based on an attitude sensor according to claim 1, wherein the neural network model is a recurrent neural network model, and the recurrent neural network model collects the information of adjacent frames and adjusts the weight matrix for voice activity detection of the current frame according to the information of the adjacent frames.
5. The voice activity detection method based on the gesture sensor according to claim 1, wherein the weights of the trained neural network are quantized and compressed, and the 32-bit floating-point weights are quantized into 2-bit fixed-point weights through quantization compression; the quantization compression steps are as follows:
calculating the threshold Δ and the scaling factor α from the original weight matrix;
converting the original weights into three-value weights;
multiplying the input X by α as a new input, and then combining it with the three-value weights by addition, in place of the original multiplication, for forward propagation;
performing iterative training using back propagation with the SGD algorithm.
6. The voice activity detection method based on the gesture sensor according to claim 5, wherein the original weight matrix W is approximately expressed by multiplying the three-value weight W^t by the scaling factor α, the three-value weight W^t being expressed as:
W^t_i = +1, if W_i > Δ; W^t_i = 0, if |W_i| ≤ Δ; W^t_i = −1, if W_i < −Δ
wherein: a threshold Δ is generated from the original weight matrix W, which is:
Δ = (0.7 / n) · Σ_{i=1…n} |W_i|
wherein: i represents the sequence number of a weight element, and n represents the total number of weight elements;
the scaling factor α is:
α = (1 / |I_Δ|) · Σ_{i∈I_Δ} |W_i|
wherein: I_Δ = {i | 1 ≤ i ≤ n, |W_i| > Δ}, and |I_Δ| represents the number of elements in I_Δ.
7. The method of claim 2, wherein the windowing is performed by a hamming window function, the hamming window function being:
W(n) = a₀ − (1 − a₀) · cos(2πn / (N − 1)), 0 ≤ n ≤ N − 1
wherein n represents the sample index within the intercepted signal; a₀ represents the Hamming window constant, having a value of 25/46; and N − 1 represents the length of the window intercepted by the Hamming window;
the pre-emphasis weighting factor is 0.97, the first-order differential parameter extraction of the Mel spectrogram is completed by a Mel filter, and the Mel filter function is:
Mel(f) = 2595 · lg(1 + f / 700)
where f represents the actual frequency of the signal that needs to be filtered.
8. The voice activity detection method based on the gesture sensor according to claim 2, wherein voice activity detection is performed through the trained neural network model; the neural network model is a deep neural network model that processes the feature data of the audio signal requiring voice activity detection frame by frame, and the speech/non-speech posterior probability is calculated from the output of the deep neural network model through a softmax function; the posterior probability lies between 0 and 1, and a frame is considered speech when the posterior probability exceeds the decision threshold and non-speech when it does not.
CN202110646290.7A 2021-06-10 2021-06-10 Voice activity detection method based on attitude sensor Active CN113327589B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110646290.7A CN113327589B (en) 2021-06-10 2021-06-10 Voice activity detection method based on attitude sensor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110646290.7A CN113327589B (en) 2021-06-10 2021-06-10 Voice activity detection method based on attitude sensor

Publications (2)

Publication Number Publication Date
CN113327589A CN113327589A (en) 2021-08-31
CN113327589B true CN113327589B (en) 2023-04-25

Family

ID=77420338

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110646290.7A Active CN113327589B (en) 2021-06-10 2021-06-10 Voice activity detection method based on attitude sensor

Country Status (1)

Country Link
CN (1) CN113327589B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114818773B (en) * 2022-03-12 2024-04-16 西北工业大学 Low-rank matrix sparsity compensation method for improving reverberation suppression robustness

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120226498A1 (en) * 2011-03-02 2012-09-06 Microsoft Corporation Motion-based voice activity detection
CN106531186B (en) * 2016-10-28 2019-07-12 中国科学院计算技术研究所 Merge the step detection method of acceleration and audio-frequency information
US10692485B1 (en) * 2016-12-23 2020-06-23 Amazon Technologies, Inc. Non-speech input to speech processing system
CN109872728A (en) * 2019-02-27 2019-06-11 南京邮电大学 Voice and posture bimodal emotion recognition method based on kernel canonical correlation analysis
CN111798875A (en) * 2020-07-21 2020-10-20 杭州芯声智能科技有限公司 VAD implementation method based on three-value quantization compression

Also Published As

Publication number Publication date
CN113327589A (en) 2021-08-31

Similar Documents

Publication Publication Date Title
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
CN109065067B (en) Conference terminal voice noise reduction method based on neural network model
WO2021143327A1 (en) Voice recognition method, device, and computer-readable storage medium
CN111833896B (en) Voice enhancement method, system, device and storage medium for fusing feedback signals
CN108447495B (en) Deep learning voice enhancement method based on comprehensive feature set
US8880396B1 (en) Spectrum reconstruction for automatic speech recognition
CN110600018A (en) Voice recognition method and device and neural network training method and device
CN111899757B (en) Single-channel voice separation method and system for target speaker extraction
CN111899756B (en) Single-channel voice separation method and device
Wang et al. Deep learning assisted time-frequency processing for speech enhancement on drones
CN108564956B (en) Voiceprint recognition method and device, server and storage medium
WO2019232833A1 (en) Speech differentiating method and device, computer device and storage medium
Cai et al. Multi-Channel Training for End-to-End Speaker Recognition Under Reverberant and Noisy Environment.
Sharma et al. Study of robust feature extraction techniques for speech recognition system
CN111798875A (en) VAD implementation method based on three-value quantization compression
CN113327589B (en) Voice activity detection method based on attitude sensor
CN118197309A (en) Intelligent multimedia terminal based on AI speech recognition
CN114566179A (en) Time delay controllable voice noise reduction method
CN110970044A (en) Speech enhancement method oriented to speech recognition
CN111785262B (en) Speaker age and gender classification method based on residual error network and fusion characteristics
CN116030824A (en) Directional voice separation method based on deep neural network
CN113808604B (en) Sound scene classification method based on gamma through spectrum separation
CN114023352B (en) Voice enhancement method and device based on energy spectrum depth modulation
Pan et al. Application of hidden Markov models in speech command recognition
Srinivasarao An efficient recurrent Rats function network (Rrfn) based speech enhancement through noise reduction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant