CN110580919B - Voice feature extraction method and reconfigurable voice feature extraction device under multi-noise scene - Google Patents
- Publication number
- CN110580919B (grant), application CN201910764547.1A
- Authority
- CN
- China
- Prior art keywords
- signal
- noise
- voice
- feature extraction
- low
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/84—Detection of presence or absence of voice signals for discriminating voice from noise
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/87—Detection of discrete points within a voice signal
Abstract
The invention discloses a voice feature extraction method and a reconfigurable voice feature extraction device for multi-noise scenes, belonging to the technical field of voice recognition. The device combines the low power consumption of low-pass-filter feature extraction with the high recognition accuracy of Mel-filter feature extraction: according to a bottom-noise threshold analysis judgment and the output of the low-pass filter and its neural network, it dynamically selects a voice feature extraction mode and switches the feature extraction channel through a reconfigurable feature extraction function configuration module. When the environment contains no voice, or contains voice at a high signal-to-noise ratio, the low-pass filter extracts voice features and the neural network recognizes them; only when the signal-to-noise ratio is low and voice input is present is the Mel filter used for feature extraction, which reduces the overall power consumption of voice feature extraction.
Description
Technical Field
The invention discloses a voice feature extraction method and a reconfigurable voice feature extraction device for multi-noise scenes, relates to artificial-intelligence neural network technology, and belongs to the technical field of voice recognition.
Background
Currently, with the development of voice recognition, digital equipment, and multimedia technology, voice endpoint detection has matured considerably. Voice endpoint detection locates voice segments in a continuous signal and can be combined with automatic voice recognition and voiceprint recognition systems; because it provides accurate and effective voice endpoints, it is an important front-end module.
To detect voice endpoints at all times, the voice endpoint detection module must remain on, so the power consumption of the whole module must be considered. To reduce power consumption while maintaining recognition accuracy, a reconfigurable voice endpoint detection method for multi-noise environments is provided.
Disclosure of Invention
The invention aims to provide, in view of the defects of the background art, a voice feature extraction method and a reconfigurable voice feature extraction device for multi-noise scenes, which dynamically select a voice feature extraction mode to reduce power consumption while maintaining precision, and solve the technical problem that a conventional voice endpoint detection module is either low-power but low-precision, or high-precision but high-power.
The invention adopts the following technical scheme for realizing the aim of the invention:
The method for extracting voice features in a multi-noise scene detects the environmental signal-to-noise ratio, performs low-pass-filtering-based feature extraction and recognition on the input signal, and performs Mel-filtering-based feature extraction and recognition on the input signal only when the environmental signal-to-noise ratio is low and the input signal is recognized to contain a voice signal.
Further, in the voice feature extraction method under the multi-noise scene, the input signal is obtained by amplifying and performing analog-to-digital conversion on the signal acquired by the microphone array.
Further, in the method for extracting voice features in a multi-noise scene, the environmental signal-to-noise ratio is detected as follows: detect the analog value of the environmental bottom noise and quantize it into an n-bit environmental signal-to-noise ratio.
Further, in the method for extracting voice features in a multi-noise scene, the voice signal in the input signal is identified as follows: shift bit by bit and store the low-pass-filtering-based feature extraction and recognition results under the same clock signal, and judge that the input signal contains a voice signal when at least one bit of the stored recognition results under the same clock signal has an output value.
Further, in the method for extracting voice features in a multi-noise scene, the environmental signal-to-noise ratio is judged as follows: compare the environmental signal-to-noise ratio with a preset value bit by bit; judge the environmental signal-to-noise ratio low when it is below the preset value and high when it is above the preset value.
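A minimal sketch of this judgment (function name is illustrative; the 2-bit code values follow the embodiment described later, which is one possible choice of n):

```python
def snr_is_low(snr_code: int, preset: int) -> bool:
    """Compare the quantized n-bit environmental SNR code against a preset
    value: True means the SNR is judged 'low' (noisy environment),
    False means it is judged 'high'."""
    return snr_code < preset

# With a 2-bit encoding (9 dB -> 0b00, 10 dB -> 0b01, 11 dB -> 0b10,
# 12 dB -> 0b11) and a 10 dB preset expressed as 0b01, only a 9 dB
# environment is judged "low":
assert snr_is_low(0b00, 0b01) is True
assert snr_is_low(0b10, 0b01) is False
```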
The reconfigurable speech feature extraction device under a multi-noise scene comprises:
a noise detection module for detecting the environmental signal-to-noise ratio,
a low-pass-filtering-based feature extraction and recognition module, which first low-pass filters the input signal, then performs feature extraction, and outputs a recognition result,
a function configuration module, which outputs a start signal to the Mel-filtering-based feature extraction and recognition module when the environmental signal-to-noise ratio is low and the input signal is recognized to contain a voice signal, and,
a Mel-filtering-based feature extraction and recognition module, which, after receiving the start signal, first Mel filters the input signal, then performs feature extraction, and outputs a recognition result.
Furthermore, in the device for extracting reconfigurable speech features in a multi-noise scene, the noise detection module comprises:
a bottom noise detecting unit for detecting the environment bottom noise analog value, quantizing the environment bottom noise analog value into an environment signal-to-noise ratio of n bits, and,
and the noise threshold judging unit compares the environmental signal-to-noise ratio with a preset value according to bits, judges that the environmental signal-to-noise ratio is low when the environmental signal-to-noise ratio is lower than the preset value, and judges that the environmental signal-to-noise ratio is high when the environmental signal-to-noise ratio is higher than the preset value.
Further, in the device for extracting reconfigurable voice features in a multi-noise scene, the function configuration module includes:
an existence judgment unit, which shifts bit by bit and stores the low-pass-filtering-based feature extraction and recognition results under the same clock signal, and judges that the input signal contains a voice signal when at least one bit of the recognition results under the same clock signal has an output value, and,
and the NAND gate unit is used for carrying out NAND operation on the output value of the noise detection module and the output value of the existence judgment unit.
A voice endpoint detection system under a multi-noise scene comprises:
a voice collecting device for amplifying and analog-to-digital converting the signals collected by the microphone array to obtain input signals,
any one of the above reconfigurable speech feature extraction devices extracts speech signal features from an input signal obtained by the speech acquisition device and then recognizes the input signal.
Furthermore, in the voice endpoint detection system under the multi-noise scene, the voice acquisition device comprises a low-noise amplifier, a programmable gain amplifier and an analog-to-digital converter which are connected in sequence, the input end of the low-noise amplifier is connected with the signals acquired by the microphone array, and the analog-to-digital converter outputs input signals.
The voice sampling input module amplifies and samples the input voice into a digital signal. Low-pass-filtering-based feature extraction and recognition is realized by a low-pass filter and a forward neural network: the low-pass filter extracts voice features from the input voice data, and the forward neural network recognizes the extracted data. Mel-filtering-based feature extraction and recognition is realized by a Mel filter and a forward neural network in the same way: the Mel filter extracts the voice features, and the extracted data are input to the forward neural network for recognition. The bottom noise detection module detects the environmental bottom noise and the environmental signal-to-noise ratio. The noise threshold judgment module judges, from the signal-to-noise ratio output by the bottom noise detection module, whether the environmental noise is below a preset value. The existence judgment module checks whether a '1' appears in the output of the low-pass filter and its neural network, and thereby judges whether voice is present. When the noise-magnitude requirement is met and voice is judged present, the function configuration module opens the Mel-filtering-based feature extraction and recognition module; speech recognition is then realized by the Mel filter and its forward neural network, and that module outputs the final recognition result. The voice feature extraction channel is thus reconfigured, switching the extraction mode when voice is input in a low-signal-to-noise-ratio environment.
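The channel-selection rule described above can be sketched as follows (names are illustrative; the actual device realizes this selection with the gates of the function configuration module):

```python
def select_feature_channel(snr_is_low: bool, voice_detected: bool) -> str:
    """Dynamic channel selection: the low-power low-pass channel is always
    on; the higher-power Mel channel is opened only when the environment
    is noisy (low SNR) AND the low-pass channel has detected voice."""
    if snr_is_low and voice_detected:
        return "mel"        # high-accuracy, higher-power path
    return "lowpass"        # low-power default path

assert select_feature_channel(True, True) == "mel"
assert select_feature_channel(True, False) == "lowpass"
assert select_feature_channel(False, True) == "lowpass"
```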
By adopting the technical scheme, the invention has the following beneficial effects:
(1) Addressing the influence of the feature extraction mode on recognition accuracy in different noise environments, the method switches between the low-pass-filtering-based and Mel-filtering-based feature extraction and recognition modes. In the cases where the Mel channel is not needed (no voice input, or voice input at a high signal-to-noise ratio), the low-power low-pass-filtering-based mode meets the detection accuracy requirement at lower power consumption; in a low-signal-to-noise-ratio environment, the Mel-filtering-based mode overcomes the poor accuracy of low-pass filtering.
(2) The constructed reconfigurable voice feature extraction device reconfigures the voice feature extraction channel based on bottom-noise threshold analysis: through the function configuration module, it dynamically selects between the low-pass-filtering-based and Mel-filtering-based feature extraction and recognition modules according to the environmental signal-to-noise ratio, and reasonably controls the power consumption of the whole device while maintaining voice endpoint detection accuracy.
Drawings
Fig. 1 is a schematic overall architecture diagram of the reconfigurable speech feature extraction device disclosed in the present invention.
FIG. 2 is a flow chart of extracting speech features in a multi-noise scene according to the present invention.
Fig. 3 is a circuit for implementing the noise threshold determination module according to the present invention.
FIG. 4 is a circuit diagram of an implementation of the presence determination module of the present invention.
Fig. 5 is a circuit for implementing the functional configuration module according to the present invention.
Fig. 6 shows the detailed steps of the function control of the reconfigurable speech feature extraction device according to the present invention.
Detailed Description
The present invention is further illustrated by the following examples, which are purely exemplary and do not limit the scope of protection; equivalent modifications falling within the limits of the appended claims will occur to persons skilled in the art after reading the present invention.
The reconfigurable voice feature extraction device for multi-noise scenes dynamically selects the voice feature extraction and recognition mode according to the relationship between the environmental signal-to-noise ratio and a threshold, and according to the presence of voice, under the coordinated control of its internal modules. As shown in fig. 1, the entire apparatus includes: a voice sampling input module, a reconfigurable feature extraction function configuration module based on bottom-noise threshold analysis judgment, a low-pass-filtering-based feature extraction and recognition module, and a Mel-filtering-based feature extraction and recognition module.
The voice sampling input module collects the input sound as an analog quantity with the microphone array; the analog quantity is amplified first by the low-noise amplifier and then by the programmable gain amplifier, and is sent by the driving module to the analog-to-digital converter for data sampling, which outputs a digital signal with a fixed number of bits.
The sampled sound data can pass through a Mel filter and a forward neural network behind the Mel filter, a low-pass filter and a forward neural network behind the low-pass filter, and a reconfigurable feature extraction function configuration module based on bottom noise threshold analysis and judgment.
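For the Mel-filter path, the filterbank itself can be sketched in a few lines. This is a numpy sketch under assumed parameters: the patent does not specify the number of filters, FFT length, or sample rate, so the values below are illustrative only.

```python
import numpy as np

def mel_filterbank(n_filters=8, n_fft=256, sr=16000):
    """Triangular Mel filterbank matrix (n_filters x (n_fft//2 + 1)).
    Filter count, FFT length and sample rate are assumptions."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # Center frequencies spaced evenly on the Mel scale
    pts = inv_mel(np.linspace(mel(0.0), mel(sr / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):                 # rising slope
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                 # falling slope
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mel_features(frame, n_fft=256):
    """Log Mel-filtered energies for one analysis frame."""
    spec = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    return np.log(mel_filterbank(n_fft=n_fft) @ spec + 1e-10)
```

In the device these energies would be fed to the forward neural network behind the Mel filter; the low-pass path replaces the filterbank with a single low-pass filter.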
The reconfigurable feature extraction function configuration module based on bottom-noise threshold analysis judgment comprises a bottom noise detection module, a noise threshold judgment module, an existence judgment module, and the reconfigurable feature extraction function configuration unit. The bottom noise detection module detects the signal-to-noise ratio in the environment: the output of the analog-to-digital converter is compared with the energy of a preset noise-free sample to obtain the signal-to-noise ratio of the ambient sound. The output range of the signal-to-noise ratio may be a small range around the preset value. In practice, the preset value can be set to 10 dB and the output range to 9 dB to 12 dB: values below 9 dB are treated as a 9 dB output, and values above 12 dB as a 12 dB output. The result is expressed as a 2-bit digital signal: 9 dB as "00", 10 dB as "01", 11 dB as "10", and 12 dB as "11".
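The clamp-and-quantize step just described can be sketched as (the function name is illustrative):

```python
def quantize_snr_db(snr_db: float) -> int:
    """Clamp the measured SNR to the 9-12 dB range described above and
    map each integer dB step to one 2-bit code:
    9 dB -> 0b00, 10 dB -> 0b01, 11 dB -> 0b10, 12 dB -> 0b11."""
    snr_db = min(max(snr_db, 9.0), 12.0)   # below 9 dB reads as 9, above 12 as 12
    return int(round(snr_db)) - 9

assert quantize_snr_db(7.5) == 0b00    # clamped up to 9 dB
assert quantize_snr_db(10.0) == 0b01
assert quantize_snr_db(20.0) == 0b11   # clamped down to 12 dB
```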
The noise threshold judgment module is a 2-bit numerical comparator; the 10 dB preset value set above is expressed as "01". The 2-bit binary number output by the bottom noise detection module is compared with the 2-bit binary preset value: if it is smaller than the preset value, the output signal is 1, otherwise 0. As shown in fig. 3, A is the output signal of the noise detection module, B is the 10 dB preset value "01", and F(A&lt;B) is the required output signal. The process of selecting the speech feature extraction method according to the bottom noise detection value is shown in fig. 2.
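A bit-level sketch of this 2-bit comparator (equivalent in behaviour to the circuit of fig. 3; signal names are illustrative):

```python
def less_than_2bit(a1: int, a0: int, b1: int, b0: int) -> bool:
    """2-bit magnitude comparator output F(A<B): A < B when the high bit
    of A is strictly smaller, or the high bits are equal and the low bit
    of A is strictly smaller. Inputs are single bits (0 or 1)."""
    hi_lt = (1 - a1) & b1          # A1 < B1
    hi_eq = 1 - (a1 ^ b1)          # A1 == B1
    lo_lt = (1 - a0) & b0          # A0 < B0
    return bool(hi_lt | (hi_eq & lo_lt))

# Against the 10 dB preset B = "01":
assert less_than_2bit(0, 0, 0, 1) is True    # A = "00" (9 dB): low SNR
assert less_than_2bit(0, 1, 0, 1) is False   # A = "01" (10 dB): not below preset
```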
The existence judgment module can be regarded as a combination of a shift register and an OR gate. The output of the low-pass filter and its forward neural network is stored in the shift register as input; if any bit in the shift register is "1", the output is "1", otherwise "0". As shown in fig. 4, the output of the low-pass-filter speech feature extraction and its neural network enters at the input port, the outputs of the four registers are the inputs of the OR gate, and the output of the module is the output of the OR gate.
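A behavioural sketch of this shift-register-plus-OR structure (four stages, as in fig. 4; names are illustrative):

```python
from collections import deque

def make_presence_detector(depth: int = 4):
    """The last `depth` one-bit recognition results from the low-pass
    channel are shifted in on each clock; the module output is the OR
    of every stored bit."""
    taps = deque([0] * depth, maxlen=depth)
    def clock(result_bit: int) -> int:
        taps.appendleft(result_bit)   # shift in the newest result
        return int(any(taps))         # OR over all register outputs
    return clock

detect = make_presence_detector()
assert detect(0) == 0
assert detect(1) == 1   # a single '1' anywhere in the register -> voice present
assert detect(0) == 1   # the '1' is still inside the 4-deep register
```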
As shown in fig. 5, the reconfigurable feature extraction function configuration can be regarded as a NAND gate. When the noise threshold judgment signal is "1" and the output of the existence judgment module is also "1", the output of the reconfigurable feature extraction function configuration is "1" in order to improve accuracy: data can then be input to the Mel filter, the forward neural network behind it is turned on, and the displayed result is the output of that forward neural network. When the existence judgment module outputs "0" for the low-pass-filter path, the Mel filter and the forward neural network behind it are not started; likewise, when the environmental signal-to-noise ratio is not smaller than the preset value, the Mel filter module is not used for analysis and judgment.
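Note that the enable behaviour described (output "1" only when both inputs are "1") is logically an AND of the two judgments, even though the text names the unit a NAND gate; a NAND followed by an inverter realizes the same enable. A sketch of the described behaviour (names illustrative):

```python
def mel_enable(noise_low: int, voice_present: int) -> int:
    """Switching signal of fig. 5: the Mel channel is enabled only when
    BOTH the noise-threshold judgment and the existence judgment are '1'.
    (The description labels the unit a NAND gate; the behaviour described
    corresponds to the AND of the two bits.)"""
    return noise_low & voice_present

assert mel_enable(1, 1) == 1   # low SNR and voice: open the Mel channel
assert mel_enable(1, 0) == 0   # low SNR but no voice: stay on low-pass
assert mel_enable(0, 1) == 0   # high SNR: stay on low-pass
```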
Fig. 6 shows a reconfigurable speech feature extraction module and a function control method for a multi-noise scene in this embodiment, and the specific implementation steps are as follows:
step 101: the input sound is directly collected by the microphone array, amplified by the low-noise amplifier, amplified by the programmable gain amplifier and sent to the analog-to-digital converter by the driving module for data sampling;
step 102: the data after analog-to-digital conversion are simultaneously input to the reconfigurable feature extraction function configuration module based on bottom-noise threshold analysis judgment, to the Mel filter and its following forward neural network, and to the low-pass filter and its following forward neural network; the configuration module actually controls whether the Mel filter and its following forward neural network module are started;
step 103-A: the bottom noise detection module performs bottom-noise detection on the input voice signal to obtain the noise magnitude; the signal-to-noise ratio of the environmental sound is calculated and expressed as a 2-bit digital signal;
step 103-B: the noise threshold judgment module compares the measured signal-to-noise ratio of the environmental sound with a preset value, and outputs a judgment result after comparison, wherein the judgment result is '0' or '1';
step 104: the low-pass filter and its following forward neural network perform low-pass-filtered voice feature extraction on the voice signal preprocessed by the voice acquisition module, send the extracted voice features to the following neural network for forward analysis, and output the analysis result;
step 103-C: the existence judgment module inputs these results into a shift register in sequence, performs an OR operation on each bit of the shift register's outputs, and outputs a one-bit digital signal "0" or "1" for use by the reconfigurable feature extraction function configuration module in step 103-D;
step 103-D: the reconfigurable feature extraction function configuration module performs a NAND operation on the output of the noise threshold judgment and the output of the existence judgment module from step 103-C, and outputs a switching signal that determines whether to perform step 105;
step 105: when the output of the reconfigurable feature extraction function configuration module in step 103-D opens the Mel filter and its following forward neural network, they perform Mel-filtered voice feature extraction on the voice signal preprocessed by the voice acquisition module, send the extracted voice features to the following neural network for forward analysis, and output the analyzed result;
step 106: the result output preferentially selects the output of the Mel filter and its following forward neural network; if no such output is available, the output of the low-pass filter and its following forward neural network is output.
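Steps 101 to 106 can be tied together in a behavioural sketch. The two filter-plus-neural-network channels are replaced by placeholder callables, and the default energy-threshold recognizers are pure assumptions for illustration:

```python
def run_detection(frames, snr_code, preset=0b01,
                  lowpass_recognize=None, mel_recognize=None):
    """End-to-end sketch of the control flow: the low-pass channel runs on
    every frame; its results feed a 4-deep shift register (step 103-C);
    the Mel channel is consulted only when the SNR is below the preset
    AND the register shows voice (steps 103-D, 105, 106)."""
    if lowpass_recognize is None:   # toy one-bit recognizer (assumption)
        lowpass_recognize = lambda f: int(sum(x * x for x in f) / len(f) > 0.5)
    if mel_recognize is None:
        mel_recognize = lowpass_recognize
    shift_reg = [0, 0, 0, 0]                      # step 103-C state (fig. 4)
    noise_low = int(snr_code < preset)            # step 103-B
    outputs = []
    for frame in frames:
        lp = lowpass_recognize(frame)             # step 104: always-on channel
        shift_reg = [lp] + shift_reg[:-1]         # step 103-C: shift in result
        voice = int(any(shift_reg))               # OR over register bits
        if noise_low and voice:                   # step 103-D: switching signal
            outputs.append(mel_recognize(frame))  # steps 105-106: Mel preferred
        else:
            outputs.append(lp)                    # step 106: low-pass fallback
    return outputs

# Low SNR (code "00" < preset "01"): once voice is seen, the Mel channel decides.
frames = [[0.0] * 8, [1.0] * 8]
assert run_detection(frames, 0b00, mel_recognize=lambda f: 7) == [0, 7]
# High SNR (code "11"): the low-power low-pass channel decides throughout.
assert run_detection(frames, 0b11, mel_recognize=lambda f: 7) == [0, 1]
```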
Claims (9)
1. The method for extracting the voice features in the multi-noise scene is characterized in that an environmental signal-to-noise ratio is detected, feature extraction and recognition based on low-pass filtering are carried out on an input signal, and feature extraction and recognition based on Mel filtering are carried out on the input signal only when the environmental signal-to-noise ratio is low and the input signal is recognized to contain the voice signal; the method for identifying the voice signal contained in the input signal comprises the following steps: and carrying out bit-by-bit shift operation on the feature extraction and recognition results based on low-pass filtering under the same clock signal and storing the result, and judging that the input signal contains a voice signal when at least one bit of the recognition results under the same clock signal has an output value.
2. The method for extracting speech features in multiple noise scenes according to claim 1, wherein the input signal is obtained by performing amplification processing and analog-to-digital conversion on signals collected by a microphone array.
3. The method for extracting speech features in a multi-noise scene according to claim 1, wherein the method for detecting the signal-to-noise ratio of the environment comprises: and detecting the environment bottom noise analog value, and quantizing the environment bottom noise analog value into an environment signal-to-noise ratio of n bits.
4. The method for extracting speech features in a multi-noise scene according to claim 3, wherein the method for judging the environmental signal-to-noise ratio is as follows: and comparing the environmental signal-to-noise ratio with a preset value according to bits, judging that the environmental signal-to-noise ratio is low when the environmental signal-to-noise ratio is lower than the preset value, and judging that the environmental signal-to-noise ratio is high when the environmental signal-to-noise ratio is higher than the preset value.
5. A reconfigurable speech feature extraction device under a multi-noise scene, characterized by comprising:
a noise detection module for detecting the signal-to-noise ratio of the environment,
the feature extraction and identification module based on low-pass filtering firstly performs low-pass filtering and then performs feature extraction on the input signal, outputs a feature extraction and identification result based on low-pass filtering,
a function configuration module for bit-wise shifting and storing the feature extraction and recognition result based on low-pass filtering under the same clock signal, determining that the input signal contains a voice signal when at least one bit of the recognition result under the same clock signal has an output value, outputting a start signal of the feature extraction and recognition module based on Mel filtering when the environmental signal-to-noise ratio is low and the input signal contains the voice signal, and,
and the characteristic extraction and identification module based on the Mel filtering firstly carries out the Mel filtering and then carries out the characteristic extraction on the input signal after receiving the starting signal, and outputs the characteristic extraction and identification result based on the Mel filtering.
6. The device for extracting the reconfigurable speech feature under the multi-noise scene according to claim 5, wherein the noise detection module comprises:
a bottom noise detecting unit for detecting the environment bottom noise analog value, quantizing the environment bottom noise analog value into an environment signal-to-noise ratio of n bits, and,
and the noise threshold judging unit compares the environmental signal-to-noise ratio with a preset value according to bits, judges that the environmental signal-to-noise ratio is low when the environmental signal-to-noise ratio is lower than the preset value, and judges that the environmental signal-to-noise ratio is high when the environmental signal-to-noise ratio is higher than the preset value.
7. The device for extracting the reconfigurable speech feature under the multi-noise scene according to claim 5, wherein the functional configuration module comprises:
a presence judging unit which bit-wise shifts and stores the feature extraction and recognition result based on the low-pass filtering in the same clock signal, judges that the input signal contains a voice signal when at least one bit of the recognition result has an output value in the same clock signal, and,
and the NAND gate unit is used for carrying out NAND operation on the output value of the noise detection module and the output value of the existence judgment unit.
8. A voice endpoint detection system under a multi-noise scene, characterized by comprising:
a voice collecting device for amplifying and analog-to-digital converting the signals collected by the microphone array to obtain input signals,
a reconfigurable speech feature extraction device according to claim 5, 6 or 7, which extracts speech signal features from the input signal obtained by the voice collecting device and then recognizes the input signal.
9. The system according to claim 8, wherein the voice capturing device comprises a low noise amplifier, a programmable gain amplifier, and an analog-to-digital converter, which are connected in sequence, an input terminal of the low noise amplifier is connected to the signal captured by the microphone array, and the analog-to-digital converter outputs the input signal.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910764547.1A CN110580919B (en) | 2019-08-19 | 2019-08-19 | Voice feature extraction method and reconfigurable voice feature extraction device under multi-noise scene |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910764547.1A CN110580919B (en) | 2019-08-19 | 2019-08-19 | Voice feature extraction method and reconfigurable voice feature extraction device under multi-noise scene |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110580919A CN110580919A (en) | 2019-12-17 |
CN110580919B true CN110580919B (en) | 2021-09-28 |
Family
ID=68811160
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910764547.1A Active CN110580919B (en) | 2019-08-19 | 2019-08-19 | Voice feature extraction method and reconfigurable voice feature extraction device under multi-noise scene |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110580919B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112002307B (en) * | 2020-08-31 | 2023-11-21 | 广州市百果园信息技术有限公司 | Voice recognition method and device |
CN112786021A (en) * | 2021-01-26 | 2021-05-11 | 东南大学 | Lightweight neural network voice keyword recognition method based on hierarchical quantization |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6070140A (en) * | 1995-06-05 | 2000-05-30 | Tran; Bao Q. | Speech recognizer |
CN101809862A (en) * | 2007-08-03 | 2010-08-18 | 沃福森微电子股份有限公司 | Amplifier circuit |
CN102221991A (en) * | 2011-05-24 | 2011-10-19 | 华润半导体(深圳)有限公司 | 4-bit RISC (Reduced Instruction-Set Computer) microcontroller |
CN102483925A (en) * | 2009-07-07 | 2012-05-30 | 意法爱立信有限公司 | Digital audio signal processing system |
CN104038864A (en) * | 2013-03-08 | 2014-09-10 | 亚德诺半导体股份有限公司 | Microphone Circuit Assembly And System With Speech Recognition |
CN106601229A (en) * | 2016-11-15 | 2017-04-26 | 华南理工大学 | Voice awakening method based on soc chip |
CN106814788A (en) * | 2015-12-01 | 2017-06-09 | 马维尔国际贸易有限公司 | For the apparatus and method of active circuit |
CN207909193U (en) * | 2017-09-15 | 2018-09-25 | 苏州大学 | A kind of image filtering circuit of removal salt-pepper noise |
CN109410977A (en) * | 2018-12-19 | 2019-03-01 | 东南大学 | A kind of voice segments detection method of the MFCC similarity based on EMD-Wavelet |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8185389B2 (en) * | 2008-12-16 | 2012-05-22 | Microsoft Corporation | Noise suppressor for robust speech recognition |
US8942975B2 (en) * | 2010-11-10 | 2015-01-27 | Broadcom Corporation | Noise suppression in a Mel-filtered spectral domain |
CN102194452B (en) * | 2011-04-14 | 2013-10-23 | Xi'an Fenghuo Electronic Technology Co., Ltd. | Voice activity detection method in complex background noise |
EP2788979A4 (en) * | 2011-12-06 | 2015-07-22 | Intel Corp | Low power voice detection |
US9390727B2 (en) * | 2014-01-13 | 2016-07-12 | Facebook, Inc. | Detecting distorted audio signals based on audio fingerprinting |
DE112016000287T5 (en) * | 2015-01-07 | 2017-10-05 | Knowles Electronics, Llc | Use of digital microphones for low power keyword detection and noise reduction |
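The citations above center on noise-robust speech features such as MFCCs and mel-filtered spectra. As an illustrative aside only (not the patent's own method; all parameter values, the single-frame simplification, and the function name are assumptions), a minimal single-frame MFCC-style extraction can be sketched with plain NumPy:

```python
import numpy as np

def mfcc_sketch(signal, sample_rate=16000, n_fft=512, n_mels=26, n_ceps=13):
    """Toy single-frame MFCC pipeline: pre-emphasis -> Hamming window ->
    power spectrum -> mel filterbank -> log -> DCT. Illustrative only."""
    # Pre-emphasis boosts high frequencies relative to low ones
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    frame = emphasized[:n_fft] * np.hamming(n_fft)
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2 / n_fft

    # Triangular filters equally spaced on the mel scale
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        rise = bins[i] - bins[i - 1]
        fall = bins[i + 1] - bins[i]
        fbank[i - 1, bins[i - 1]:bins[i]] = np.linspace(0, 1, rise, endpoint=False)
        fbank[i - 1, bins[i]:bins[i + 1]] = np.linspace(1, 0, fall, endpoint=False)

    # Log compression, then type-II DCT to decorrelate the filterbank energies
    log_energy = np.log(fbank @ power + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_mels))
    return dct @ log_energy  # first n_ceps cepstral coefficients
```

This sketch omits framing/overlap, liftering, and delta features, which a full front end would include.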
2019-08-19: application CN201910764547.1A filed (CN); patent CN110580919B, legal status Active
Non-Patent Citations (4)
Title |
---|
An energy-efficient voice activity detector using deep neural networks and approximate computing; Bo Liu, et al.; Microelectronics Journal; 2019-05-31; Vol. 87; full text *
Low power speech detector on a FPAA; Sahil Shah, et al.; 2017 IEEE International Symposium on Circuits and Systems (ISCAS); IEEE; 2017-09-28; full text *
Design of a high-speed, high-precision comparator for ADC circuits; Wu Guanglin, et al.; Journal of Applied Sciences; CNKI; 2005-02-25 (No. 6); full text *
Design of a multifunctional noise alarm; Mao Wenyi, et al.; Chinese Journal of Electron Devices; CNKI; 2017-08-15 (No. 4); full text *
Also Published As
Publication number | Publication date |
---|---|
CN110580919A (en) | 2019-12-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2018137704A1 (en) | Microphone array-based pick-up method and system | |
CN110580919B (en) | Voice feature extraction method and reconfigurable voice feature extraction device under multi-noise scene | |
US4633499A (en) | Speech recognition system | |
CN101071566B (en) | Small array microphone system, noise reducing device and reducing method | |
CA2469442A1 (en) | Automatic magnetic detection in hearing aids | |
KR100968970B1 (en) | Antenna diversity receiver | |
CN115951124A (en) | Time-frequency domain combined continuous and burst signal detection method and system | |
JP3471550B2 (en) | A / D converter | |
CN101377449B (en) | Automatic test device of erbium-doped optical fiber amplifier | |
CN114374411A (en) | Low-frequency power line carrier topology identification method | |
CN110108929B (en) | Anti-interference lightning current collecting device | |
JP3675047B2 (en) | Data processing device | |
CN112769506B (en) | Quick radio detection method based on frequency spectrum | |
US5640430A (en) | Method for detecting valid data in a data stream | |
CN109257247B (en) | Communication module's quality detection system | |
CN102088292A (en) | Multi-path gain adaptive matched signal acquisition method and device thereof | |
CN109639280A (en) | Have both the optical sampling circuit and acquisition method of sampling width and precision | |
CN211669266U (en) | Multichannel waveform acquisition device | |
CN101382593B (en) | Method for detecting weak low frequency signal form unknown strong frequency conversion signal | |
CN114783448A (en) | Audio signal processing device and method and storage medium | |
CN109150216A (en) | A kind of dual band receiver and its auto gain control method | |
CN113163282B (en) | Noise reduction pickup system and method based on USB | |
US7472025B2 (en) | Energy detection apparatus and method thereof | |
CN117687102A (en) | Multichannel parallel nuclear magnetic data acquisition system and method | |
US20040176062A1 (en) | Method for detecting a tone signal through digital signal processing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||