CN113327589B - Voice activity detection method based on attitude sensor - Google Patents

Voice activity detection method based on attitude sensor

Info

Publication number
CN113327589B
CN113327589B (application CN202110646290.7A)
Authority
CN
China
Prior art keywords
data
gesture
characteristic data
neural network
voice activity
Prior art date
Legal status
Active
Application number
CN202110646290.7A
Other languages
Chinese (zh)
Other versions
CN113327589A (en)
Inventor
王蒙
胡奎
姜黎
Current Assignee
Hangzhou Ccvui Intelligent Technology Co ltd
Original Assignee
Hangzhou Ccvui Intelligent Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Ccvui Intelligent Technology Co ltd filed Critical Hangzhou Ccvui Intelligent Technology Co ltd
Priority to CN202110646290.7A
Publication of CN113327589A
Application granted
Publication of CN113327589B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS — G10 MUSICAL INSTRUMENTS; ACOUSTICS — G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/063 Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/24 Speech recognition using non-acoustical features
    • G10L19/02 Speech or audio signal analysis-synthesis for redundancy reduction using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L21/0232 Noise filtering characterised by the method used for estimating noise, with processing in the frequency domain
    • G10L25/18 Speech or voice analysis techniques characterised by the extracted parameters being spectral information of each sub-band

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a voice activity detection method based on an attitude sensor and relates to the technical field of human-computer interaction. According to the invention, attitude feature data and sound feature data are spliced to obtain mixed feature data; the neural network model is trained on the mixed feature data, so that voice activity can be detected accurately under different postures, which solves the problem that the user's posture affects the accuracy of voice activity detection. The weights of the trained neural network are quantized and compressed with a three-value (ternary) quantization method, and the 32-bit floating-point weights are quantized into 2-bit fixed-point weights, which further reduces the memory footprint and greatly reduces the consumption of computation space and time. A recurrent neural network model is used to build the data connection between preceding and following frames to improve the model effect, and the small number of recurrent neural network parameters further reduces the memory footprint.

Description

Voice activity detection method based on attitude sensor
Technical Field
The invention relates to the technical field of man-machine interaction, in particular to a voice activity detection method based on a gesture sensor.
Background
Voice activity detection (Voice Activity Detection, VAD) is the classical problem of detecting speech segments and non-speech segments in a noise-containing speech signal. It has become an indispensable component of various speech signal processing systems, such as speech coding (Speech Coding), speech enhancement (Speech Enhancement) and automatic speech recognition (Automatic Speech Recognition), and as digital devices continue to develop, voice activity detection is used on them more and more widely.
As a popular product, embedded earphone technology is continuously being innovated. An embedded earphone is generally connected to a smart device, has an audio playback function, and can interact with the smart device by collecting the user's voice, the user's gesture information and the like. Compared with a traditional earphone it is more intelligent and richer in functions, and it has quickly gained people's approval.
As a device that interacts with smart devices, an embedded earphone has high requirements on its data acquisition capability. For example, when a smartphone is controlled through the embedded earphone, clear speech needs to be acquired; although the smartphone can usually perform noise reduction, separation and other operations on the collected audio data, if the embedded earphone cannot guarantee the clarity and accuracy of the audio data it provides, even powerful audio processing software on the smartphone will not help.
The working environment of an embedded earphone is complex and varied. The various postures of the user affect the collection and recognition of sound, and these posture changes reduce the quality of the collected audio data, so relevant measures are needed to improve it.
To this end, application CN201911174434.2 discloses a system for detecting the voice activity of a headset wearer based on microphone technology, comprising: a microphone array, a first estimation module, a second estimation module and a joint control module. The microphone array receives sound signals; the first estimation module determines a first voice presence probability of the wearer according to the direction of arrival of the sound source; the second estimation module determines a second voice presence probability of the wearer according to the direct-to-reverberant ratio of the sound source; and the joint control module determines a third voice presence probability from the first and second voice presence probabilities and performs voice activity detection for the wearer. Using microphone array technology, the headset wearer's voice activity can be detected even in complex acoustic scenes with low signal-to-noise ratio, high reverberation and interference from multiple speakers, providing an important basis for subsequent speech enhancement and speech recognition.
However, that system does not address the changes in audio data acquisition caused by the user's posture, so a voice activity detection method that eliminates the influence of the user's posture needs to be proposed to solve the above problems.
Disclosure of Invention
In order to solve the above technical problems, the voice activity detection method based on the attitude sensor is applied to an audio acquisition device provided with an attitude sensor; it performs quantized neural network training by constructing mixed feature data that takes both attitude feature data and sound feature data into account, and obtains an optimal solution of a neural network model, wherein the neural network model is used for voice activity detection and the mixed feature data are constructed through the following steps:
acquiring the gesture change of the audio acquisition device through a gesture sensor and recording the gesture change as gesture characteristic data;
collecting external sound changes through an audio collecting device and taking the external sound changes as sound characteristic data;
respectively carrying out data preprocessing operation on the attitude characteristic data and the sound characteristic data;
performing feature stitching on the preprocessed gesture feature data and sound feature data to obtain mixed feature data;
and taking the mixed characteristic data as neural network quantization training data for subsequent model training.
As a more specific solution, the sound feature data are MFCC feature data, and MFCC feature extraction and sound feature data preprocessing are performed through the following steps:
pre-emphasizing the sound data with a high-pass filter;
framing the pre-emphasized data with a framing function;
applying a window function to each frame to perform windowing;
performing a fast Fourier transform on each windowed frame to obtain the energy spectrum of each frame;
performing a discrete cosine transform on the energy spectrum to obtain the MFCC coefficients;
extracting first-order differential parameters from the Mel spectrogram;
and splicing the MFCC coefficients and the first-order differential parameters to obtain the MFCC feature data.
As a more specific solution, the gesture feature data preprocessing operation is an operation of converting time domain gesture feature data into frequency domain gesture feature data, wherein the gesture feature data is gesture feature data including an X axis, a Y axis and a Z axis, and the gesture feature data preprocessing operation is performed by:
framing operation is carried out on the gesture feature data, and each frame of the gesture feature data corresponds to each frame of the sound feature data one by one;
calculating the frame-to-frame displacement (velocity) and the acceleration of each frame from the attitude feature data, where the calculation formulas are as follows:
s(n)=f(n)-f(n-1);n∈(0,512];
as(n)=s(n)-s(n-1);n∈(0,512];
where s(n) represents the velocity of the nth frame, as(n) represents the acceleration of the nth frame, and f(n) represents the position data tag of the nth frame;
respectively carrying out logarithmic transformation on the calculated speed and acceleration;
and splicing the speed and the acceleration together to obtain gesture characteristic data.
As a more specific solution, the preprocessed gesture feature data and sound feature data are feature-spliced through the following steps:
labeling the collected sound feature data and attitude feature data with one-to-one point markers according to their real-time corresponding positions;
annotating the gesture feature data of the gesture sensor with the start and end positions of the sound feature data;
according to the signal-to-noise ratio requirement, mixing random noise data into the labeled sound feature data at a random SNR, while ensuring that the mixed data still corresponds one-to-one to the start and end positions of the sound feature data;
matching the mixed data against the gesture feature data labeled with point markers to obtain one piece of feature-spliced training data;
and performing feature splicing on all the gesture feature data and sound feature data to obtain the feature-spliced training data set.
As a more specific solution, the neural network model is a recurrent neural network model, and the recurrent neural network model collects the information of adjacent frames and adjusts the weight matrix for voice activity detection of the current frame according to the information of the adjacent frames.
As a more specific solution, the weights of the trained neural network are quantized and compressed, and the 32-bit floating-point weights are quantized into 2-bit fixed-point weights through quantization compression; the quantization compression steps are as follows:
calculating the threshold Δ and the scaling factor α from the original weight matrix;
converting the original weights into three-value weights;
multiplying the input X by α as a new input, and then combining it with the three-value weights by addition, in place of the original multiplication, for forward propagation;
performing iterative training using back propagation with the SGD algorithm.
As a more specific solution, the original weight matrix W is approximately expressed as the three-value weight W^t multiplied by the scaling factor α, where the three-value weight W^t is expressed as:
W^t_i = +1, if W_i > Δ; W^t_i = 0, if |W_i| ≤ Δ; W^t_i = −1, if W_i < −Δ
wherein: the threshold Δ is generated from the original weight matrix, the threshold being:
Δ = (0.7 / n) · Σ_{i=1…n} |W_i|
wherein: i represents the sequence number of a weight element, and n represents the total number of weight elements;
the scaling factor is:
α = (1 / |I_Δ|) · Σ_{i∈I_Δ} |W_i|
wherein: I_Δ = {i | 1 ≤ i ≤ n, |W_i| > Δ}, and |I_Δ| represents the number of elements in I_Δ.
As a more specific solution, the windowing operation is performed by a hamming window function, which is:
W(n) = a₀ − (1 − a₀) · cos(2πn / (N − 1)), 0 ≤ n ≤ N − 1, where a₀ = 25/46 is the Hamming window constant and N − 1 is the length of the intercepted window;
the pre-emphasis weighting factor is 0.97, the first-order differential parameter extraction of the Mel spectrogram is completed by a Mel filter, and the Mel filter function is:
Mel(f) = 2595 · lg(1 + f / 700)
where f represents the actual frequency of the signal that needs to be filtered.
As a more specific solution, voice activity detection is performed through the trained neural network model; the neural network model is a deep neural network model that processes the feature data of the audio signal requiring voice activity detection frame by frame, and the speech/non-speech posterior probability is calculated from the output of the deep neural network model through a softmax function; the posterior probability lies between 0 and 1, and a frame is considered speech when the posterior probability exceeds the decision threshold and non-speech when it does not.
Compared with the related art, the voice activity detection method based on the attitude sensor provided by the invention has the following beneficial effects:
1. According to the invention, the gesture feature data and the sound feature data are spliced to obtain mixed feature data; the neural network model is trained on the mixed feature data, so that voice activity can be detected accurately under different postures, solving the problem that the user's posture affects the accuracy of voice activity detection;
2. The invention quantizes and compresses the weights of the trained neural network with a three-value (ternary) quantization method, quantizing the 32-bit floating-point weights into 2-bit fixed-point weights, which further reduces the memory footprint and greatly reduces the consumption of computation space and time;
3. The invention considers the influence of adjacent-frame information on the VAD decision for the current frame and uses a recurrent neural network model to build the data connection between preceding and following frames so as to improve the model effect; moreover, the recurrent neural network has few model parameters, which further reduces the memory footprint.
Drawings
Fig. 1 is a system schematic diagram of a voice activity detection method based on an attitude sensor according to a preferred embodiment of the present invention.
Detailed Description
The invention will be further described with reference to the drawings and embodiments.
As shown in fig. 1, the voice activity detection method based on the gesture sensor is applied to an audio acquisition device with the gesture sensor.
Specifically, conventional voice activity detection methods are difficult to adapt to the usage scenarios of devices such as earphones: because different user postures keep changing the detection scene, the accuracy of voice activity detection is difficult to guarantee, and the influence of the user's posture is difficult to eliminate through simple algorithmic improvements.
This embodiment combines an attitude sensor with the audio acquisition device in order to eliminate the influence of posture and increase the robustness of the system. The attitude sensor, usually with three or more axes, is mounted together with the audio acquisition device, so that the attitude information of the audio acquisition device can be acquired in real time. Features are extracted from the acquired attitude information and sound information, quantized neural network training is performed on mixed feature data that takes both attitude feature data and sound feature data into account, and the optimal solution of the neural network model is obtained. The neural network model trained in this way can combine attitude information to perform real-time voice activity detection on the sound information, thereby improving the accuracy and robustness of voice activity detection.
Specifically, the neural network model is used for voice activity detection, and the mixed characteristic data is constructed through the following steps:
acquiring the gesture change of the audio acquisition device through a gesture sensor and recording the gesture change as gesture characteristic data;
collecting external sound changes through an audio collecting device and taking the external sound changes as sound characteristic data;
respectively carrying out data preprocessing operation on the attitude characteristic data and the sound characteristic data;
performing feature stitching on the preprocessed gesture feature data and sound feature data to obtain mixed feature data;
and taking the mixed characteristic data as neural network quantization training data for subsequent model training.
It should be noted that the mixed feature data take both sound features and attitude features into account and are used for subsequent model training to enhance the adaptability and robustness of the model for voice activity detection under different postures.
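For illustration only, a minimal Python sketch of the splicing idea is given below; the array shapes, feature dimensions and function names are assumptions for the example and are not specified by the patent. Per-frame sound features and per-frame attitude features that correspond one-to-one are concatenated along the feature dimension to form the mixed feature data.

```python
# Illustrative sketch only: frame-aligned concatenation of sound features and
# attitude features into mixed feature data. Shapes and names are assumptions.
import numpy as np

def build_mixed_features(mfcc_feats: np.ndarray, attitude_feats: np.ndarray) -> np.ndarray:
    """Concatenate per-frame MFCC features (T, D_sound) with per-frame attitude
    features (T, D_attitude) into mixed features of shape (T, D_sound + D_attitude)."""
    assert mfcc_feats.shape[0] == attitude_feats.shape[0], "frames must correspond one-to-one"
    return np.concatenate([mfcc_feats, attitude_feats], axis=1)

# Example: 512 frames, 26-dim MFCC + delta, 6-dim attitude features (assumed sizes)
mixed = build_mixed_features(np.random.randn(512, 26), np.random.randn(512, 6))
print(mixed.shape)  # (512, 32)
```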
As a more specific solution, the sound feature data are MFCC feature data, and MFCC feature extraction and sound feature data preprocessing are performed through the following steps:
pre-emphasizing the sound data with a high-pass filter;
framing the pre-emphasized data with a framing function;
applying a window function to each frame to perform windowing;
performing a fast Fourier transform on each windowed frame to obtain the energy spectrum of each frame;
performing a discrete cosine transform on the energy spectrum to obtain the MFCC coefficients;
extracting first-order differential parameters from the Mel spectrogram;
and splicing the MFCC coefficients and the first-order differential parameters to obtain the MFCC feature data.
It should be noted that in voice activity detection the present embodiment uses Mel-frequency cepstral coefficients (Mel-scale Frequency Cepstral Coefficients, MFCC for short). The MFCC is designed according to the hearing mechanism of the human ear, which has different sensitivity to sound waves of different frequencies. Speech signals from 200 Hz to 5000 Hz have a large impact on speech intelligibility. When two sounds of unequal loudness act on the human ear, the presence of the louder frequency components affects the perception of the quieter frequency components, making them less noticeable; this phenomenon is known as the masking effect. Since lower-frequency sounds travel farther along the basilar membrane of the cochlea than higher-frequency sounds, low tones tend to mask high tones, while it is more difficult for high tones to mask low tones, and the critical bandwidth of sound masking at low frequencies is smaller than at higher frequencies. Therefore, a set of band-pass filters, arranged from dense to sparse according to the critical bandwidth over the band from low to high frequency, is used to filter the input signal. The energy of the signal output by each band-pass filter is taken as a basic feature of the signal, and after further processing it can be used as the input feature for speech. Because these features do not depend on the nature of the signal, make no assumptions or restrictions on the input signal, and exploit the research results of auditory models, such parameters are more robust than the vocal-tract-model-based LPCC, fit the auditory properties of the human ear better, and still offer good recognition performance when the signal-to-noise ratio decreases.
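As a non-authoritative illustration of the extraction steps listed above, the following sketch uses the librosa library; the library choice, sampling rate, frame/hop lengths and number of coefficients are assumptions for the example and are not prescribed by the patent.

```python
# Illustrative sketch only: one common way to realize the listed MFCC steps with librosa.
import librosa
import numpy as np

def extract_mfcc_features(wav_path: str, sr: int = 16000, n_mfcc: int = 13) -> np.ndarray:
    y, _ = librosa.load(wav_path, sr=sr)
    # Pre-emphasis with the 0.97 weighting factor mentioned in the patent
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])
    # Framing, Hamming windowing, FFT, Mel filtering and DCT are handled inside librosa
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=512, hop_length=256, window="hamming")
    delta = librosa.feature.delta(mfcc, order=1)        # first-order differential parameters
    return np.concatenate([mfcc, delta], axis=0).T      # shape: (frames, 2 * n_mfcc)
```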
As a more specific solution, the gesture feature data preprocessing operation is an operation of converting time domain gesture feature data into frequency domain gesture feature data, wherein the gesture feature data is gesture feature data including an X axis, a Y axis and a Z axis, and the gesture feature data preprocessing operation is performed by:
framing operation is carried out on the gesture feature data, and each frame of the gesture feature data corresponds to each frame of the sound feature data one by one;
calculating the frame-to-frame displacement (velocity) and the acceleration of each frame from the attitude feature data, where the calculation formulas are as follows:
s(n)=f(n)-f(n-1);n∈(0,512];
as(n)=s(n)-s(n-1);n∈(0,512];
where s(n) represents the velocity of the nth frame, as(n) represents the acceleration of the nth frame, and f(n) represents the position data tag of the nth frame;
respectively carrying out logarithmic transformation on the calculated speed and acceleration;
and splicing the speed and the acceleration together to obtain gesture characteristic data.
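The following is a minimal sketch of this preprocessing, assuming the attitude sensor provides one X/Y/Z position tag f(n) per frame. Taking the logarithm of the absolute value plus a small epsilon is an assumption for the example, since the patent only states that a logarithmic transformation is applied.

```python
# Illustrative sketch only: per-frame velocity/acceleration from attitude position tags,
# followed by a logarithmic transformation and splicing. Details are assumptions.
import numpy as np

def preprocess_attitude(frames_xyz: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """frames_xyz: per-frame X/Y/Z position tags f(n), shape (T, 3), aligned one-to-one
    with the sound frames. Returns per-frame spliced features of shape (T, 6)."""
    speed = np.diff(frames_xyz, axis=0, prepend=frames_xyz[:1])   # s(n) = f(n) - f(n-1)
    accel = np.diff(speed, axis=0, prepend=speed[:1])             # as(n) = s(n) - s(n-1)
    log_speed = np.log(np.abs(speed) + eps)                       # logarithmic transformation
    log_accel = np.log(np.abs(accel) + eps)
    return np.concatenate([log_speed, log_accel], axis=1)         # splice speed and acceleration
```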
As a more specific solution, the preprocessed gesture feature data and sound feature data are feature-spliced through the following steps:
labeling the collected sound feature data and attitude feature data with one-to-one point markers according to their real-time corresponding positions;
annotating the gesture feature data of the gesture sensor with the start and end positions of the sound feature data;
according to the signal-to-noise ratio requirement, mixing random noise data into the labeled sound feature data at a random SNR, while ensuring that the mixed data still corresponds one-to-one to the start and end positions of the sound feature data;
matching the mixed data against the gesture feature data labeled with point markers to obtain one piece of feature-spliced training data;
and performing feature splicing on all the gesture feature data and sound feature data to obtain the feature-spliced training data set.
It should be noted that labeling the point markers and annotating the gesture feature data and the sound feature data is the precondition for ensuring their strict real-time correspondence; only if this processing is done correctly can a good training result be obtained.
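A hedged sketch of the noise-mixing step is shown below, operating on a labeled speech waveform. The SNR range, the power-based scaling and the use of np.resize to length-match the noise are assumptions for the example; the patent only requires mixing at a random SNR while keeping the labeled start and end positions unchanged.

```python
# Illustrative sketch only: mix noise into labeled speech at a randomly chosen SNR.
import numpy as np

def mix_at_random_snr(speech: np.ndarray, noise: np.ndarray,
                      snr_db_range=(0.0, 20.0), rng=np.random) -> np.ndarray:
    """Add noise to the labeled speech at a random SNR; the speech samples stay in
    place, so the labeled start/end positions (and frame labels) are unchanged."""
    snr_db = rng.uniform(*snr_db_range)
    noise = np.resize(noise, speech.shape)                  # repeat/trim noise to match length
    p_speech = np.mean(speech ** 2) + 1e-12
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```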
As a more specific solution, the neural network model is a recurrent neural network model, and the recurrent neural network model collects the information of adjacent frames and adjusts the weight matrix for voice activity detection of the current frame according to the information of the adjacent frames.
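As an illustrative sketch only (the patent does not specify the architecture beyond its recurrent nature), a small GRU-based model of this idea could look as follows; the choice of GRU, the layer sizes and the two-class output are assumptions.

```python
# Illustrative sketch only: a recurrent VAD model that carries adjacent-frame
# information through a GRU and outputs per-frame speech/non-speech posteriors.
import torch
import torch.nn as nn

class RnnVad(nn.Module):
    def __init__(self, feat_dim: int = 32, hidden: int = 64):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)   # links preceding/following frames
        self.out = nn.Linear(hidden, 2)                          # non-speech / speech

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, feat_dim) mixed feature data
        h, _ = self.rnn(x)
        return torch.softmax(self.out(h), dim=-1)                # per-frame posteriors
```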
As a more specific solution, the weights of the trained neural network are quantized and compressed, and the 32-bit floating-point weights are quantized into 2-bit fixed-point weights through quantization compression; the quantization compression steps are as follows:
calculating the threshold Δ and the scaling factor α from the original weight matrix;
converting the original weights into three-value weights;
multiplying the input X by α as a new input, and then combining it with the three-value weights by addition, in place of the original multiplication, for forward propagation;
performing iterative training using back propagation with the SGD algorithm.
It should be noted that artificial neural networks allow computers to reach an unprecedented level of performance on speech recognition tasks. However, the high complexity of such models brings high consumption of memory space and computing resources, which makes it difficult to deploy them on every hardware platform.
To address these issues, the model is compressed to minimize its consumption of computation space and time. Currently, a mainstream network such as VGG16 has more than 130 million parameters, occupies more than 500 MB of space, and requires more than 30 billion floating-point operations to complete one recognition task.
An artificial neural network contains a large number of redundant nodes, and only a small fraction (5-10%) of the weights participate in the main computation; in other words, training only a small fraction of the weight parameters can achieve performance similar to that of the original network. Therefore, the trained neural network model needs to be compressed. Neural network model compression includes tensor decomposition, model pruning and model quantization.
Tensor decomposition approximates the full-rank weight matrices of the network with several low-rank matrices. It is suitable for model compression but is not easy to implement, involves computationally expensive decomposition operations, and requires extensive retraining to reach convergence.
Model pruning removes relatively unimportant weights from the weight matrix and then fine-tunes the network. However, pruning leads to irregular network connections, so sparse representations are needed to reduce memory usage, and forward propagation then requires a large amount of conditional branching and extra space to mark zero and non-zero parameter positions. The pruned model is therefore not suitable for parallel computation, and unstructured sparsity requires special software libraries or hardware.
Therefore, quantization is used to compress the model. Generally, the weights of a neural network model are represented as 32-bit floating-point numbers. Such high precision is often unnecessary, and the weights can instead be quantized, for example to 8 bits, reducing the space required for each weight at the cost of some precision. The precision required by SGD is only 6-8 bits, so a reasonable quantization scheme reduces the storage size of the model while preserving accuracy. According to the quantization method, quantization can be divided into binary quantization, ternary (three-value) quantization and multi-valued quantization. In this embodiment, ternary quantization is selected: compared with binary quantization, a value of 0 is added to the two values +1 and -1 to form a ternary network, without increasing the amount of computation.
Iterative training is performed using back propagation with the SGD algorithm, where the computed gradients are used to adjust the weights of the neural network. SGD is a form of gradient descent; as it adjusts the weights, the neural network produces more desirable outputs, and the overall error of the network should decrease as training proceeds.
As a more specific solution, the original weight matrix W is approximately expressed as the three-value weight W^t multiplied by the scaling factor α, where the three-value weight W^t is expressed as:
W^t_i = +1, if W_i > Δ; W^t_i = 0, if |W_i| ≤ Δ; W^t_i = −1, if W_i < −Δ
wherein: a threshold Δ is generated from the original weight matrix W, which is:
Δ = (0.7 / n) · Σ_{i=1…n} |W_i|
wherein: i represents the sequence number of a weight element, and n represents the total number of weight elements;
the scaling factor α is:
α = (1 / |I_Δ|) · Σ_{i∈I_Δ} |W_i|
wherein: I_Δ = {i | 1 ≤ i ≤ n, |W_i| > Δ}, and |I_Δ| represents the number of elements in I_Δ.
As a more specific solution, the windowing operation is performed by a hamming window function, which is:
W(n) = a₀ − (1 − a₀) · cos(2πn / (N − 1)), 0 ≤ n ≤ N − 1, where a₀ = 25/46 is the Hamming window constant and N − 1 is the length of the intercepted window;
the pre-emphasis weighting factor is 0.97, the first-order differential parameter extraction of the Mel spectrogram is completed by a Mel filter, and the Mel filter function is:
Mel(f) = 2595 · lg(1 + f / 700)
where f represents the actual frequency of the signal that needs to be filtered.
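For reference, a short sketch evaluating the Hamming window and the Hz-to-Mel conversion as reconstructed above; the 2595·lg(1 + f/700) form is the standard Mel mapping and is assumed here, since the original figure is not legible.

```python
# Illustrative sketch only: Hamming window with a0 = 25/46 and Hz-to-Mel conversion.
import numpy as np

def hamming(N: int, a0: float = 25 / 46) -> np.ndarray:
    n = np.arange(N)
    return a0 - (1 - a0) * np.cos(2 * np.pi * n / (N - 1))

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

print(hamming(512)[:3], hz_to_mel(1000.0))  # mel(1000 Hz) ≈ 1000
```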
As a more specific solution, voice activity detection is performed through the trained neural network model; the neural network model is a deep neural network model that processes the feature data of the audio signal requiring voice activity detection frame by frame, and the speech/non-speech posterior probability is calculated from the output of the deep neural network model through a softmax function; the posterior probability lies between 0 and 1, and a frame is considered speech when the posterior probability exceeds the decision threshold and non-speech when it does not.
It should be noted that the neural network model obtained by training on the mixed feature data adapts well to voice activity detection under various postures. The softmax function is mainly used to normalize the model's output: it "compresses" a K-dimensional vector z of arbitrary real numbers into another K-dimensional real vector σ(z) in which each element lies in the range (0, 1) and all elements sum to 1. Speech and non-speech can therefore be classified accurately from the softmax output.
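A minimal sketch of this decision step follows; the two-class output layout and the 0.5 threshold are assumptions for the example, since the patent only requires a decision threshold on the posterior probability between 0 and 1.

```python
# Illustrative sketch only: softmax posterior per frame followed by a threshold decision.
import numpy as np

def vad_decision(logits: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """logits: per-frame model outputs of shape (T, 2) for [non-speech, speech].
    Returns a boolean speech decision per frame."""
    z = logits - logits.max(axis=1, keepdims=True)            # numerically stable softmax
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    speech_posterior = probs[:, 1]                             # posterior lies in (0, 1)
    return speech_posterior > threshold                        # speech if above the threshold
```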
The foregoing description is only illustrative of the present invention and is not intended to limit the scope of the invention, and all equivalent structures or equivalent processes or direct or indirect application in other related technical fields are included in the scope of the present invention.

Claims (8)

1. The voice activity detection method based on the gesture sensor is applied to an audio acquisition device with the gesture sensor, and is characterized in that the method comprises the steps of performing neural network quantization training by constructing mixed characteristic data considering gesture characteristic data and sound characteristic data, and obtaining an optimal solution of a neural network model, wherein the neural network model is used for voice activity detection, and the mixed characteristic data is constructed through the following steps:
acquiring the gesture change of the audio acquisition device through a gesture sensor and recording the gesture change as gesture characteristic data;
collecting external sound changes through an audio collecting device and taking the external sound changes as sound characteristic data;
respectively carrying out data preprocessing operation on the attitude characteristic data and the sound characteristic data;
performing feature stitching on the preprocessed gesture feature data and sound feature data to obtain mixed feature data;
taking the mixed characteristic data as neural network quantized training data for subsequent model training;
and performing characteristic splicing on the preprocessed gesture characteristic data and sound characteristic data by the following steps:
the collected sound characteristic data and attitude characteristic data are subjected to one-to-one point location marking information according to the real-time corresponding positions;
information labeling of the starting position and the ending position of sound characteristic data is carried out on the gesture characteristic data of the gesture sensor;
according to the signal-to-noise ratio requirement, mixing the random noise data with the marked voice characteristic data in a random SNR mode, and ensuring that the mixed data corresponds to the starting position and the ending position of the voice characteristic data one by one;
performing standard matching on the mixed data and gesture feature data marked with point location information, and obtaining training data with a spliced feature;
and performing feature stitching on all the gesture feature data and the sound feature data, and obtaining a training data set after feature stitching.
2. The method for detecting voice activity based on an attitude sensor according to claim 1, wherein the sound feature data is MFCC feature data, and the MFCC sound feature data extraction and sound feature data preprocessing operations are performed by:
pre-emphasis is carried out on the sound characteristic data through a high-pass filter;
framing the pre-emphasis data by a framing function;
carrying each frame into a window function to carry out windowing operation;
performing fast Fourier transform on each frame of the windowed signal to obtain an energy spectrum of each frame;
performing a discrete cosine transform on the energy spectrum to obtain the MFCC coefficients;
extracting first-order differential parameters from the Mel spectrogram;
and splicing the MFCC coefficients and the first-order differential parameters to obtain MFCC characteristic data.
3. The method for detecting voice activity based on an attitude sensor according to claim 1, wherein the operation of preprocessing the attitude feature data is an operation of converting time domain attitude feature data into frequency domain attitude feature data, the attitude feature data being attitude feature data including an X axis, a Y axis and a Z axis, the operation of preprocessing the attitude feature data being performed by:
framing operation is carried out on the gesture feature data, and each frame of the gesture feature data corresponds to each frame of the sound feature data one by one;
calculating the displacement of each frame by the attitude characteristic data, wherein the calculation formula is as follows:
s(n)=f(n)-f(n-1);n∈(0,512];
as(n)=s(n)-s(n-1);n∈(0,512];
where s (n) represents the speed of the nth frame, as (n) represents the acceleration of the nth frame, and f (n) represents the data position tag of the nth frame;
respectively carrying out logarithmic transformation on the calculated speed and acceleration;
and splicing the speed and the acceleration together to obtain gesture characteristic data.
4. The method for detecting voice activity based on an attitude sensor according to claim 1, wherein the neural network model is a recurrent neural network model, and the recurrent neural network model collects the information of adjacent frames and adjusts the weight matrix for voice activity detection of the current frame according to the information of the adjacent frames.
5. The voice activity detection method based on the gesture sensor according to claim 1, wherein the weights of the trained neural network are quantized and compressed, and the 32-bit floating-point weights are quantized into 2-bit fixed-point weights through quantization compression; the quantization compression steps are as follows:
calculating the threshold Δ and the scaling factor α from the original weight matrix;
converting the original weights into three-value weights;
multiplying the input X by α as a new input, and then combining it with the three-value weights by addition, in place of the original multiplication, for forward propagation;
performing iterative training using back propagation with the SGD algorithm.
6. The voice activity detection method based on the gesture sensor according to claim 5, wherein the original weight matrix W is approximately expressed by multiplying the three-value weight W^t by the scaling factor α, the three-value weight W^t being expressed as:
W^t_i = +1, if W_i > Δ; W^t_i = 0, if |W_i| ≤ Δ; W^t_i = −1, if W_i < −Δ
wherein: a threshold Δ is generated from the original weight matrix W, which is:
Δ = (0.7 / n) · Σ_{i=1…n} |W_i|
wherein: i represents the sequence number of a weight element, and n represents the total number of weight elements;
the scaling factor α is:
α = (1 / |I_Δ|) · Σ_{i∈I_Δ} |W_i|
wherein: I_Δ = {i | 1 ≤ i ≤ n, |W_i| > Δ}, and |I_Δ| represents the number of elements in I_Δ.
7. The method of claim 2, wherein the windowing is performed by a hamming window function, the hamming window function being:
W(n) = a₀ − (1 − a₀) · cos(2πn / (N − 1)), 0 ≤ n ≤ N − 1
wherein n represents the sample index within the intercepted signal; a₀ represents the Hamming window constant, having a value of 25/46; and N − 1 represents the length of the window intercepted by the Hamming window;
the pre-emphasis weighting factor is 0.97, the first-order differential parameter extraction of the Mel spectrogram is completed by a Mel filter, and the Mel filter function is:
Mel(f) = 2595 · lg(1 + f / 700)
where f represents the actual frequency of the signal that needs to be filtered.
8. The voice activity detection method based on the gesture sensor according to claim 2, wherein voice activity detection is performed through the trained neural network model; the neural network model is a deep neural network model that processes the feature data of the audio signal requiring voice activity detection frame by frame, and the speech/non-speech posterior probability is calculated from the output of the deep neural network model through a softmax function; the posterior probability lies between 0 and 1, and a frame is considered speech when the posterior probability exceeds the decision threshold and non-speech when it does not.
CN202110646290.7A 2021-06-10 2021-06-10 Voice activity detection method based on attitude sensor Active CN113327589B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110646290.7A CN113327589B (en) 2021-06-10 2021-06-10 Voice activity detection method based on attitude sensor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110646290.7A CN113327589B (en) 2021-06-10 2021-06-10 Voice activity detection method based on attitude sensor

Publications (2)

Publication Number Publication Date
CN113327589A CN113327589A (en) 2021-08-31
CN113327589B true CN113327589B (en) 2023-04-25

Family

ID=77420338

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110646290.7A Active CN113327589B (en) 2021-06-10 2021-06-10 Voice activity detection method based on attitude sensor

Country Status (1)

Country Link
CN (1) CN113327589B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114818773B (en) * 2022-03-12 2024-04-16 西北工业大学 Low-rank matrix sparsity compensation method for improving reverberation suppression robustness

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120226498A1 (en) * 2011-03-02 2012-09-06 Microsoft Corporation Motion-based voice activity detection
CN106531186B (en) * 2016-10-28 2019-07-12 中国科学院计算技术研究所 Merge the step detection method of acceleration and audio-frequency information
US10692485B1 (en) * 2016-12-23 2020-06-23 Amazon Technologies, Inc. Non-speech input to speech processing system
CN109872728A (en) * 2019-02-27 2019-06-11 南京邮电大学 Voice and posture bimodal emotion recognition method based on kernel canonical correlation analysis
CN111798875A (en) * 2020-07-21 2020-10-20 杭州芯声智能科技有限公司 VAD implementation method based on three-value quantization compression

Also Published As

Publication number Publication date
CN113327589A (en) 2021-08-31

Similar Documents

Publication Publication Date Title
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
CN109065067B (en) Conference terminal voice noise reduction method based on neural network model
WO2021143327A1 (en) Voice recognition method, device, and computer-readable storage medium
CN111833896B (en) Voice enhancement method, system, device and storage medium for fusing feedback signals
CN108447495B (en) Deep learning voice enhancement method based on comprehensive feature set
US8880396B1 (en) Spectrum reconstruction for automatic speech recognition
CN110600018A (en) Voice recognition method and device and neural network training method and device
CN111899757B (en) Single-channel voice separation method and system for target speaker extraction
CN111899756B (en) Single-channel voice separation method and device
Wang et al. Deep learning assisted time-frequency processing for speech enhancement on drones
CN108564956B (en) Voiceprint recognition method and device, server and storage medium
WO2019232833A1 (en) Speech differentiating method and device, computer device and storage medium
Cai et al. Multi-Channel Training for End-to-End Speaker Recognition Under Reverberant and Noisy Environment.
Sharma et al. Study of robust feature extraction techniques for speech recognition system
CN111798875A (en) VAD implementation method based on three-value quantization compression
CN113327589B (en) Voice activity detection method based on attitude sensor
CN118197309A (en) Intelligent multimedia terminal based on AI speech recognition
CN114566179A (en) Time delay controllable voice noise reduction method
CN110970044A (en) Speech enhancement method oriented to speech recognition
CN111785262B (en) Speaker age and gender classification method based on residual error network and fusion characteristics
CN116030824A (en) Directional voice separation method based on deep neural network
CN113808604B (en) Sound scene classification method based on gamma through spectrum separation
CN114023352B (en) Voice enhancement method and device based on energy spectrum depth modulation
Pan et al. Application of hidden Markov models in speech command recognition
Srinivasarao An efficient recurrent Rats function network (Rrfn) based speech enhancement through noise reduction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant