CN110556125B

CN110556125B - Feature extraction method and device based on voice signal and computer storage medium

Info

Publication number: CN110556125B
Application number: CN201910976850.8A
Authority: CN
Inventors: 李勤; 付聪
Original assignee: Mobvoi Information Technology Co Ltd
Current assignee: Mobvoi Information Technology Co Ltd
Priority date: 2019-10-15
Filing date: 2019-10-15
Publication date: 2022-06-10
Anticipated expiration: 2039-10-15
Also published as: CN110556125A

Abstract

The invention discloses a method, equipment and computer storage medium for extracting a characteristic value based on a voice signal, wherein the method comprises the following steps: carrying out time domain to frequency domain conversion on the noisy speech signal to obtain a frequency domain signal of the noisy speech signal; carrying out Mel filtering processing on the frequency domain signal to obtain a Mel power spectrum value of the frequency domain signal; denoising the Mel power spectrum value to obtain a noise-reduced Mel power spectrum value; and performing voice recognition according to the noise-reduced Mel power spectrum value to obtain voice characteristics corresponding to the noise-containing voice signal.

Description

Feature extraction method and device based on voice signal and computer storage medium

Technical Field

The present invention relates to the field of speech processing technologies, and in particular, to a method and an apparatus for extracting feature values based on speech signals, and a computer storage medium.

Background

Speech recognition technology converts a speaker's speech signal into information that a computer program can recognize, so that the speaker's voice commands and text content can be identified. At present, speech recognition is widely applied in fields such as customer service quality inspection, navigation and smart homes. Speech recognition generally includes modules such as front-end processing and feature extraction. The input voice data stream passes through front-end processing and the subsequent modules to obtain a recognition result.

Voice denoising removes noise from noisy input speech. The noisy speech signal is the input of a denoising module: the time domain signal is generally first transformed into the frequency domain, the frequency domain signal is then denoised, and the result is finally transformed back into the time domain. A carrier with speech processing capabilities typically hosts several functions at once, such as speech recognition and noise reduction. As functions are added, the amount of data the carrier must process grows, and memory occupation increases with it.

Disclosure of Invention

The embodiment of the invention provides a method and equipment for extracting a characteristic value based on a voice signal and a computer storage medium, which have the effect of reducing the data volume of operation.

The embodiment of the invention provides a feature extraction method based on a voice signal, which comprises the following steps: carrying out time domain to frequency domain conversion on the noisy speech signal to obtain a frequency domain signal of the noisy speech signal; carrying out Mel filtering processing on the frequency domain signal to obtain a Mel power spectrum value of the frequency domain signal; denoising the Mel power spectrum value to obtain a noise-reduced Mel power spectrum value; and performing voice recognition according to the noise-reduced Mel power spectrum value to obtain voice characteristics corresponding to the noise-containing voice signal.

In an implementation manner, the performing mel filtering processing on the frequency domain signal to obtain a mel power spectrum value of the frequency domain signal includes: calculating a signal power spectrum of the frequency domain signal; and carrying out Mel filtering processing on the signal power spectrum obtained by calculation through a Mel filter bank to obtain a Mel power spectrum value of the frequency domain signal.

In an embodiment, denoising the mel-power spectrum values to obtain denoised mel-power spectrum values includes: carrying out noise estimation on the signal power spectrum to obtain a noise estimation value; and carrying out noise suppression on the Mel power spectrum value according to the noise estimation value to obtain a noise-reduced Mel power spectrum value.

In one embodiment, performing noise estimation on the signal power spectrum to obtain a noise estimation value includes: calculating the signal power spectrum to obtain the minimum value of the noisy power within a set time; determining the minimum value of the noisy power as a noise estimation reference value; and compensating the noise estimation reference value to obtain the noise estimation value.

In an embodiment, the noise suppressing the mel-power spectrum value according to the noise estimation value to obtain a noise-reduced mel-power spectrum value includes: determining a first gain value of the mel-power spectrum value according to the noise estimation value; performing inter-spectrum smoothing on the first gain value to obtain a second gain value; and performing noise reduction processing on the Mel power spectrum value by using the second gain value to obtain a noise-reduced Mel power spectrum value.

In a further possible embodiment, determining a first gain value for the mel-power spectrum value based on the noise estimate comprises: calculating the posterior signal-to-noise ratio of the Mel power spectrum value according to the noise estimation value to obtain the posterior signal-to-noise ratio; carrying out prior signal-to-noise ratio calculation according to the posterior signal-to-noise ratio to obtain a prior signal-to-noise ratio; and calculating a gain value according to the prior signal-to-noise ratio to obtain the first gain value corresponding to the Mel power spectrum value.

Another aspect of the present invention provides a feature extraction apparatus based on a speech signal, the apparatus including: the conversion module is used for converting a noisy speech signal from a time domain to a frequency domain to obtain a frequency domain signal of the noisy speech signal; the filtering module is used for carrying out Mel filtering processing on the frequency domain signal to obtain a Mel power spectrum value of the frequency domain signal; the noise reduction module is used for carrying out noise reduction on the Mel power spectrum value to obtain a noise-reduced Mel power spectrum value; and the recognition module is used for carrying out voice recognition according to the noise-reduced Mel power spectrum value to obtain the voice characteristics corresponding to the noise-containing voice signal.

In an embodiment, the filtering module includes: the calculation submodule is used for calculating a signal power spectrum of the frequency domain signal; and the filtering submodule is used for carrying out Mel filtering processing on the signal power spectrum obtained by calculation through a Mel filter bank to obtain a Mel power spectrum value of the frequency domain signal.

In one embodiment, the noise reduction module includes: the noise estimation submodule is used for carrying out noise estimation on the signal power spectrum to obtain a noise estimation value; and the noise suppression submodule is used for performing noise suppression on the Mel power spectrum value according to the noise estimation value to obtain a noise-reduced Mel power spectrum value.

In one embodiment, the noise estimation sub-module includes: the computing unit is used for computing the signal power spectrum to obtain a minimum value of the noisy power within a set time; a first determining unit, configured to determine the minimum value of the noisy power as a noise estimation reference value; and the compensation unit is used for compensating the noise estimation reference value to obtain the noise estimation value.

In one embodiment, the noise suppression sub-module includes: a second determining unit for determining a first gain value of the mel-power spectrum value according to the noise estimation value; the smoothing unit is used for carrying out inter-spectrum smoothing processing on the first gain value to obtain a second gain value; and the noise reduction unit is used for carrying out noise reduction processing on the Mel power spectrum value by using the second gain value to obtain a noise-reduced Mel power spectrum value.

In an embodiment, the second determining unit is specifically configured to: calculate the posterior signal-to-noise ratio of the Mel power spectrum value according to the noise estimation value; calculate a prior signal-to-noise ratio according to the posterior signal-to-noise ratio; and calculate a gain value according to the prior signal-to-noise ratio to obtain the first gain value corresponding to the Mel power spectrum value.

Another aspect of the present invention provides a computer-readable storage medium that stores a set of computer-executable instructions which, when executed, perform any one of the above-mentioned methods for extracting feature values based on a speech signal.

According to the characteristic value extraction method and device based on the voice signal and the computer storage medium, voice noise reduction and voice feature recognition are combined in a single process rather than carried out separately, so the whole process needs only one conversion of the signal from the time domain to the frequency domain. This reduces the amount of noisy data that actually takes part in the noise reduction operation, and greatly reduces the memory resources and computing resources consumed by the noise reduction algorithm during voice feature recognition.

Drawings

The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:

In the drawings, like or corresponding reference characters designate like or corresponding parts.

FIG. 1 is a schematic flow chart illustrating an implementation of a feature extraction method based on a speech signal according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a Mel filtering process implementation flow of the extraction method of the embodiment of the present invention;

FIG. 3 is a schematic view of a noise reduction process implementation flow of the extraction method according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a noise estimation implementation flow of the extraction method according to the embodiment of the present invention;

FIG. 5 is a schematic diagram of a noise suppression implementation flow of the extraction method according to an embodiment of the present invention;

FIG. 6 is a schematic block diagram of a feature extraction device based on a speech signal according to an embodiment of the present invention.

Detailed Description

In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Fig. 1 is a schematic flow chart illustrating an implementation of a feature extraction method based on a speech signal according to an embodiment of the present invention.

Referring to Fig. 1, an embodiment of the present invention provides a feature extraction method based on a speech signal, where the method includes: step 101, converting a noisy speech signal from the time domain to the frequency domain to obtain a frequency domain signal of the noisy speech signal; step 102, carrying out Mel filtering processing on the frequency domain signal to obtain a Mel power spectrum value of the frequency domain signal; step 103, denoising the Mel power spectrum value to obtain a noise-reduced Mel power spectrum value; and step 104, performing voice recognition according to the noise-reduced Mel power spectrum value to obtain voice characteristics corresponding to the noisy speech signal.

The feature extraction method based on the voice signal provided by the embodiment of the invention takes an intermediate result of voice feature extraction, the Mel power spectrum value, as the input to noise reduction, and then performs voice recognition on the noise-reduced Mel power spectrum value to obtain the voice features corresponding to the noisy speech signal. Because voice noise reduction and voice feature recognition are combined rather than carried out separately, the whole process needs only one conversion of the signal from the time domain to the frequency domain. This reduces the amount of noisy data that actually takes part in the noise reduction operation, and greatly reduces the memory resources and computing resources consumed by the noise reduction algorithm during voice feature recognition.

Step 101 converts the noisy speech signal from the time domain to the frequency domain to obtain a frequency domain signal of the noisy speech signal. Specifically, the embodiment of the invention performs this conversion by framing the noisy speech signal, windowing each frame, and computing its FFT.

Step 102 performs Mel filtering on the frequency domain signal to obtain the Mel power spectrum values of the frequency domain signal. Mel filtering reduces the dimensionality of the frequency domain signal, so the Mel power spectrum values have low complexity while still giving a good processing result.

Step 103 performs noise reduction on the Mel power spectrum values to obtain noise-reduced Mel power spectrum values. The noise reduction consists of noise estimation and noise suppression. The embodiment of the invention does not limit the specific methods of noise estimation and noise suppression.

Step 104 performs voice recognition according to the noise-reduced Mel power spectrum values to obtain the voice features corresponding to the noisy speech signal. In the embodiment of the invention, the voice feature is the Fbank value, obtained by taking the natural logarithm of each noise-reduced Mel power spectrum value. The Fbank values are sent as voice features to a voice recognition engine for voice recognition.
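
As an illustration of the framing, windowing and FFT used for the time-to-frequency conversion, here is a minimal pure-Python sketch. The frame length, hop size, Hamming window and the helper names `fft` and `frames_to_spectra` are assumptions chosen for the example; the patent does not specify any of them.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def frames_to_spectra(samples, frame_len=8, hop=4):
    """Frame the signal, apply a Hamming window, and FFT each frame.

    frame_len (a power of two) and hop are illustrative values only.
    Returns frame_len // 2 + 1 complex bins per frame (0 Hz to Nyquist).
    """
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    spectra = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        spectra.append(fft(frame)[: frame_len // 2 + 1])
    return spectra


# A 16-sample tone with exactly two cycles per 8-sample frame.
tone = [math.sin(2 * math.pi * 2 * n / 8) for n in range(16)]
spectra = frames_to_spectra(tone)
```

With these toy settings the tone yields three overlapping frames, and the largest magnitude in each frame falls in bin 2, the bin that corresponds to two cycles per frame.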

Fig. 2 is a schematic flow diagram illustrating a mel filtering process of the extraction method according to the embodiment of the present invention.

Referring to fig. 2, in the embodiment of the present invention, in step 102, a mel filtering process is performed on the frequency domain signal to obtain a mel power spectrum value of the frequency domain signal, which includes: step 1021, calculating a signal power spectrum of the frequency domain signal; and step 1022, performing mel filtering processing on the calculated signal power spectrum through a mel filter bank to obtain a mel power spectrum value of the frequency domain signal.

Specifically, Mel filtering of the frequency domain signal proceeds in two stages. First, the signal power spectrum is calculated from the frequency domain signal obtained in step 101, which reduces the dimensionality of the frequency domain signal. Then the signal power spectrum is passed through a Mel filter bank, which reduces the dimensionality of the signal power spectrum and yields the Mel power spectrum values.
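
The two stages described above can be sketched as follows. The patent does not define the Mel filter bank, so this example assumes the common triangular filters and the widely used conversion mel = 2595 · log10(1 + f / 700); the function names, filter count and sample rate are likewise hypothetical.

```python
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_bank(num_filters, num_bins, sample_rate):
    """Triangular filters spaced evenly on the mel scale from 0 Hz to Nyquist.

    Returns num_filters rows, each holding one weight per frequency bin.
    """
    nyquist = sample_rate / 2.0
    top = hz_to_mel(nyquist)
    # num_filters + 2 edge frequencies: left edge, centres, right edge.
    edges_hz = [mel_to_hz(top * i / (num_filters + 1))
                for i in range(num_filters + 2)]
    bin_hz = [nyquist * k / (num_bins - 1) for k in range(num_bins)]
    bank = []
    for m in range(1, num_filters + 1):
        left, centre, right = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        row = []
        for f in bin_hz:
            rising = (f - left) / (centre - left)
            falling = (right - f) / (right - centre)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


def power_spectrum(spectrum):
    """Squared magnitude of each complex frequency bin."""
    return [abs(c) ** 2 for c in spectrum]


def mel_power(power, bank):
    """Weighted sum of the power spectrum under each triangular filter."""
    return [sum(w * p for w, p in zip(row, power)) for row in bank]


bank = mel_filter_bank(num_filters=4, num_bins=17, sample_rate=16000)
flat = power_spectrum([1 + 0j] * 17)
mel_values = mel_power(flat, bank)  # 17 bins reduced to 4 Mel values
```

The dimensionality reduction is visible in the shapes: 17 power spectrum bins go in and 4 Mel power spectrum values come out, so every later stage operates on far fewer values per frame.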

Fig. 3 is a schematic view of a noise reduction process implementation flow of the extraction method according to the embodiment of the present invention.

Referring to fig. 3, in the embodiment of the present invention, in step 103, performing noise reduction on the mel-power spectrum value to obtain a noise-reduced mel-power spectrum value, includes: step 1031, performing noise estimation on the signal power spectrum to obtain a noise estimation value; and step 1032, carrying out noise suppression on the Mel power spectrum value according to the noise estimation value to obtain a noise-reduced Mel power spectrum value.

Specifically, to denoise the Mel power spectrum values, the method performs noise estimation on the signal power spectrum obtained in step 1021 to obtain a noise estimation value, and then performs noise suppression by combining the noise estimation value with the Mel power spectrum values to obtain the noise-reduced Mel power spectrum values. The noise-reduced Mel power spectrum values can be used for voice recognition and can also serve as the input of other modules, for example for storage or other processing of the noise-reduced voice signal.
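
The text estimates noise on the signal power spectrum but suppresses it on the Mel power spectrum values, and does not say how the two are brought into the same domain. One plausible way, shown here purely as an assumption, is to pass the per-bin noise estimate through the same Mel filter bank as the noisy power spectrum. The helper `to_mel_domain` and the toy two-filter bank are hypothetical.

```python
def to_mel_domain(per_bin_values, bank):
    """Apply a filter bank (rows of per-bin weights) to per-bin values."""
    return [sum(w * v for w, v in zip(row, per_bin_values)) for row in bank]


# Toy filter bank: 2 Mel bands over 4 frequency bins (illustrative weights).
bank = [
    [1.0, 0.5, 0.0, 0.0],
    [0.0, 0.5, 1.0, 0.5],
]
noise_per_bin = [0.2, 0.2, 0.4, 0.4]  # noise estimate per frequency bin
noisy_power = [1.2, 0.2, 0.4, 2.4]    # noisy signal power spectrum

noise_mel = to_mel_domain(noise_per_bin, bank)  # approximately [0.3, 0.7]
noisy_mel = to_mel_domain(noisy_power, bank)    # approximately [1.3, 1.7]
```

Because the filter bank is a linear weighting with non-negative weights, a per-bin noise estimate that never exceeds the noisy power also stays below the noisy Mel power in every band.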

Fig. 4 is a schematic diagram of a noise estimation implementation flow of the extraction method according to the embodiment of the present invention.

Referring to fig. 4, in the embodiment of the present invention, in step 1031, performing noise estimation on the signal power spectrum to obtain a noise estimation value, includes: step 10311, calculating a signal power spectrum to obtain a minimum value of the noisy power within a set time; step 10312, determining the minimum value of the power with noise as a noise estimation reference value; and step 10313, compensating the noise estimation reference value to obtain a noise estimation value.

Specifically, the embodiment of the invention estimates the noise with a noise estimation algorithm based on minimum statistics. The method first calculates the signal power spectrum and obtains its minimum value within a set time, where the set time is a period chosen as required, such as 1 s, 2 s or any other set time. This minimum value of the noisy power is used as the noise estimation reference value, and the noise estimation value is then obtained by compensating the reference value. Alternatively, a recursive averaging noise algorithm, a minimum tracking algorithm, a histogram noise estimation algorithm, or another algorithm may be used to obtain the noise estimation value.
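
A minimal sketch of steps 10311 to 10313 follows, assuming a sliding window of frames stands in for the "set time" and a constant multiplicative factor stands in for the compensation. The class name, `window_frames` and `compensation` are hypothetical; the patent gives no concrete compensation rule, and published minimum-statistics estimators usually also smooth the power spectrum first, which is omitted here.

```python
from collections import deque


class MinimumStatisticsNoiseEstimator:
    """Per bin, track the minimum noisy power over the last window_frames frames."""

    def __init__(self, num_bins, window_frames=100, compensation=1.5):
        # window_frames plays the role of the set time (e.g. 100 frames
        # of 10 ms each would cover 1 s). compensation is an assumed
        # constant bias-correction factor.
        self.history = [deque(maxlen=window_frames) for _ in range(num_bins)]
        self.compensation = compensation

    def update(self, power_spectrum):
        estimate = []
        for bin_history, power in zip(self.history, power_spectrum):
            bin_history.append(power)
            reference = min(bin_history)  # noise estimation reference value
            estimate.append(self.compensation * reference)  # compensated value
        return estimate


estimator = MinimumStatisticsNoiseEstimator(num_bins=2, window_frames=3)
frames = [[4.0, 1.0], [2.0, 3.0], [6.0, 5.0], [8.0, 7.0]]
estimates = [estimator.update(frame) for frame in frames]
```

After the fourth frame the first frame has left the three-frame window, so the per-bin minima are 2.0 and 3.0 and the compensated estimates are 3.0 and 4.5.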

Fig. 5 is a schematic diagram of a flow chart of implementing noise suppression by the extraction method according to the embodiment of the present invention.

Referring to Fig. 5, in the embodiment of the present invention, step 1032, performing noise suppression on the Mel power spectrum value according to the noise estimation value to obtain a noise-reduced Mel power spectrum value, includes: step 10321, determining a first gain value of the Mel power spectrum value according to the noise estimation value; step 10322, performing inter-spectrum smoothing on the first gain value to obtain a second gain value; and step 10323, performing noise reduction on the Mel power spectrum value by using the second gain value to obtain a noise-reduced Mel power spectrum value.

Specifically, the embodiment of the invention suppresses noise with a noise suppression algorithm based on Wiener filtering. The method first derives a first gain value from the Mel power spectrum value and the noise estimation value, and then applies inter-spectrum smoothing to the first gain value to obtain a second gain value, which is the final gain value. The Mel power spectrum value is then multiplied by the second gain value to obtain the noise-reduced Mel power spectrum value. The noise-reduced Mel power spectrum value can be used to recognize voice features and can also serve as the input signal of other voice processing. Other noise suppression algorithms, such as spectral subtraction, can also be used to obtain the noise-reduced Mel power spectrum value.
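
Steps 10322 and 10323 can be sketched as below. The patent does not say what form the inter-spectrum smoothing takes; this example assumes a three-point weighted average across neighbouring Mel bands with the edge values repeated, and the names `smooth_across_bands`, `apply_gain` and `centre_weight` are hypothetical.

```python
def smooth_across_bands(gains, centre_weight=0.5):
    """Three-point weighted average over adjacent Mel bands (assumed form)."""
    side = (1.0 - centre_weight) / 2.0
    last = len(gains) - 1
    smoothed = []
    for i, g in enumerate(gains):
        left = gains[max(i - 1, 0)]      # repeat the edge value at the ends
        right = gains[min(i + 1, last)]
        smoothed.append(side * left + centre_weight * g + side * right)
    return smoothed


def apply_gain(mel_power, gains):
    """Multiply each Mel power spectrum value by its gain."""
    return [p * g for p, g in zip(mel_power, gains)]


first_gain = [1.0, 0.0, 1.0, 1.0]
second_gain = smooth_across_bands(first_gain)
denoised = apply_gain([4.0, 4.0, 4.0, 4.0], second_gain)
```

In this toy case the isolated zero gain in the second band is softened to 0.5 and its neighbours drop to 0.75, which illustrates why smoothing is applied: abrupt gain differences between adjacent bands are reduced before the gain is applied.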

In this embodiment of the present invention, step 10321, determining a first gain value of the mel-power spectrum value according to the noise estimation value, includes: firstly, calculating the posterior signal-to-noise ratio of a Mel power spectrum value according to a noise estimation value to obtain the posterior signal-to-noise ratio; then, carrying out prior signal-to-noise ratio calculation according to the posterior signal-to-noise ratio to obtain the prior signal-to-noise ratio; and then, calculating a gain value according to the prior signal-to-noise ratio to obtain a first gain value corresponding to the Mel power spectrum value.

Specifically, to determine the first gain value, the method first calculates the posterior signal-to-noise ratio (SNRpost) of the Mel power spectrum value according to the noise estimation value. The prior signal-to-noise ratio (SNRprio) is then calculated from the posterior signal-to-noise ratio using the decision-directed approach. The formula for the prior signal-to-noise ratio is:

SNRprio(i) = factor × SNRprio(i-1) + (1 - factor) × max(SNRpost(i) - 1, 0)

where factor is a smoothing factor, a positive real number between 0 and 1. After the prior signal-to-noise ratio is obtained, the first gain value (gain) of the noisy speech is calculated from it by the formula:

gain(i) = SNRprio(i) / (SNRprio(i) + 1)

A second gain value is obtained by applying inter-spectrum smoothing to the first gain value, and the Mel power spectrum value is multiplied by the second gain value to obtain the noise-reduced Mel power spectrum value.
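
The two formulas for the prior signal-to-noise ratio and the gain translate directly into code. This sketch assumes that the index i counts frames, so that SNRprio(i-1) is the value from the previous frame of the same Mel band, and that the posterior signal-to-noise ratio is the Mel power divided by the Mel-domain noise estimate. The function names, the division floor and the example value of `factor` are assumptions.

```python
def posterior_snr(mel_power, noise_mel, floor=1e-10):
    """SNRpost per band: noisy Mel power over noise estimate (floored)."""
    return [p / max(n, floor) for p, n in zip(mel_power, noise_mel)]


def prior_snr(previous_prior, post, factor=0.98):
    """SNRprio(i) = factor * SNRprio(i-1) + (1 - factor) * max(SNRpost(i) - 1, 0)."""
    return [factor * prev + (1.0 - factor) * max(s - 1.0, 0.0)
            for prev, s in zip(previous_prior, post)]


def wiener_gain(prior):
    """gain(i) = SNRprio(i) / (SNRprio(i) + 1)."""
    return [s / (s + 1.0) for s in prior]


# Two Mel bands: the first is dominated by noise, the second by speech.
post = posterior_snr(mel_power=[2.0, 8.0], noise_mel=[4.0, 2.0])
prior = prior_snr(previous_prior=[0.0, 3.0], post=post, factor=0.5)
gain = wiener_gain(prior)
```

Here the posterior ratios are 0.5 and 4.0. The max(…, 0) term clamps the first band to a prior ratio of 0 and therefore a gain of 0, while the second band gets a prior ratio of 3.0 and a gain of 0.75.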

To facilitate understanding of the above embodiments, a specific implementation scenario is described below. The feature extraction method based on the voice signal can be applied to devices with data processing functions, such as computers, mobile phones, smart speakers, smart headsets, smart watches and smart robots. In this implementation scenario, the device is a mobile phone.

When a user instructs the device to look up certain information, the device receives through its microphone a noisy speech signal containing the command, such as "inquiry information a". The device must extract voice features from the noisy speech signal to determine the user's instruction. To do so, the device first performs framing, windowing and FFT (fast Fourier transform) processing on the noisy speech signal, which converts it from the time domain to the frequency domain and yields a frequency domain signal. It then calculates the signal power spectrum from the frequency domain signal and passes it through a Mel filter bank to obtain a series of Mel power spectrum values. Next, it performs noise estimation and noise suppression to denoise the Mel power spectrum values, obtaining noise-reduced Mel power spectrum values, and takes the natural logarithm of each of them to obtain the Fbank values. The Fbank values serve as voice features and can be sent to a voice recognition system for voice recognition. The noise-reduced Mel power spectrum values can also be used as the input of other processing, such as conversion into text or storage.
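
The final step of the scenario, taking the natural logarithm of each noise-reduced Mel power spectrum value, is sketched below. The small floor that guards against log(0) when a gain of zero has removed a band entirely is an assumption of this example; the patent does not mention one.

```python
import math


def fbank(denoised_mel_power, floor=1e-10):
    """Natural logarithm of each noise-reduced Mel power spectrum value.

    floor is an assumed guard so that a fully suppressed band does not
    cause a math domain error.
    """
    return [math.log(max(value, floor)) for value in denoised_mel_power]


features = fbank([1.0, math.e ** 2, 0.0])
```

The three inputs map to 0.0, approximately 2.0, and the logarithm of the floor value respectively.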

By the method, the voice noise reduction algorithm with low memory occupation and low computation complexity can be realized; the method has great practical value on embedded equipment with limited resources.

Referring to Fig. 6, another aspect of the present invention provides a feature extraction device based on a speech signal, where the device includes: a conversion module 601, configured to perform time-domain to frequency-domain conversion on the noisy speech signal to obtain a frequency-domain signal of the noisy speech signal; a filtering module 602, configured to perform Mel filtering on the frequency domain signal to obtain a Mel power spectrum value of the frequency domain signal; a noise reduction module 603, configured to perform noise reduction on the Mel power spectrum value to obtain a noise-reduced Mel power spectrum value; and a recognition module 604, configured to perform speech recognition according to the noise-reduced Mel power spectrum value to obtain speech features corresponding to the noisy speech signal.

In an embodiment of the present invention, the filtering module 602 includes: the calculation submodule 6021 is used for calculating a signal power spectrum of the frequency domain signal; and the filtering submodule 6022 is configured to perform mel filtering on the calculated signal power spectrum through a mel filter bank to obtain a mel power spectrum value of the frequency domain signal.

In this embodiment of the present invention, the noise reduction module 603 includes: a noise estimation sub-module 6031, configured to perform noise estimation on the signal power spectrum to obtain a noise estimation value; and a noise suppression sub-module 6032, configured to perform noise suppression on the Mel power spectrum value according to the noise estimation value, so as to obtain a noise-reduced Mel power spectrum value.

In this embodiment of the present invention, the noise estimation sub-module 6031 includes: the calculating unit 60311 is configured to calculate a signal power spectrum to obtain a minimum value of the noisy power within a set time; a first determining unit 60312 for determining the minimum value of the noisy power as a noise estimation reference value; and a compensation unit 60313, configured to compensate the noise estimation reference value to obtain a noise estimation value.

In the embodiment of the present invention, the noise suppression sub-module 6032 includes: a second determining unit 60321, configured to determine a first gain value of the Mel power spectrum value according to the noise estimation value; a smoothing unit 60322, configured to perform inter-spectrum smoothing on the first gain value to obtain a second gain value; and a noise reduction unit 60323, configured to perform noise reduction processing on the Mel power spectrum value by using the second gain value, so as to obtain a noise-reduced Mel power spectrum value.

In the embodiment of the present invention, the second determining unit 60321 is specifically configured to: calculate the posterior signal-to-noise ratio of the Mel power spectrum value according to the noise estimation value; calculate a prior signal-to-noise ratio according to the posterior signal-to-noise ratio; and calculate a gain value according to the prior signal-to-noise ratio to obtain a first gain value corresponding to the Mel power spectrum value.

Another aspect of the embodiments of the present invention provides a computer-readable storage medium that stores a set of computer-executable instructions which, when executed, perform any one of the above-mentioned methods for extracting feature values based on a speech signal.

In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, one skilled in the art can combine the various embodiments or examples described in this specification, and the features of different embodiments or examples, provided they do not contradict one another.

Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.

The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for extracting features based on a speech signal, the method comprising:

carrying out time domain to frequency domain conversion on the noisy speech signal to obtain a frequency domain signal of the noisy speech signal;

carrying out Mel filtering processing on the frequency domain signal to obtain a Mel power spectrum value of the frequency domain signal;

denoising the Mel power spectrum value to obtain a noise-reduced Mel power spectrum value;

performing voice recognition according to the noise-reduced Mel power spectrum value to obtain voice characteristics corresponding to the noise-containing voice signal;

wherein, the denoising the mel power spectrum value to obtain the mel power spectrum value after denoising comprises:

calculating a signal power spectrum of the frequency domain signal;

carrying out noise estimation on the signal power spectrum to obtain a noise estimation value;

carrying out noise suppression on the Mel power spectrum value according to the noise estimation value to obtain a noise-reduced Mel power spectrum value;

wherein the performing voice recognition according to the noise-reduced Mel power spectrum value to obtain the voice characteristics corresponding to the noise-containing voice signal comprises:

and solving a natural logarithm of the denoised Mel power spectrum value to obtain an Fbank value, wherein the Fbank value is used as a voice characteristic to be sent to a voice recognition engine for voice recognition.
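By way of non-limiting illustration, the order of operations recited in claim 1 can be sketched as follows. This is a minimal sketch, not the claimed implementation: the function names, the `mel_weights` matrix, the per-band `gains` list and the `floor` guard against a logarithm of zero are all assumptions introduced for the example, and a naive DFT stands in for whatever transform a real system would use.

```python
import cmath
import math

def power_spectrum(frame):
    """Naive DFT of one real-valued frame; returns |X[k]|^2 for k = 0..N/2."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        x_k = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(x_k) ** 2)
    return spectrum

def fbank_features(frame, mel_weights, gains, floor=1e-10):
    """Time-to-frequency conversion, Mel filtering, noise reduction, natural log."""
    power = power_spectrum(frame)
    # Mel filtering: each row of mel_weights (hypothetical) weights the power bins.
    mel_power = [sum(w * p for w, p in zip(row, power)) for row in mel_weights]
    # Noise reduction: one gain per Mel band (hypothetical input, computed elsewhere).
    denoised = [g * m for g, m in zip(gains, mel_power)]
    # Natural logarithm gives the Fbank value; floor is an illustrative guard.
    return [math.log(max(v, floor)) for v in denoised]
```

For example, a unit impulse of length 4 has a flat power spectrum, so with two hypothetical Mel bands `fbank_features([1.0, 0.0, 0.0, 0.0], [[1, 0, 0], [0, 1, 1]], [1.0, 0.5])` yields two Fbank values, one per band.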

2. The method of claim 1, wherein performing the Mel filtering processing on the frequency domain signal to obtain the Mel power spectrum value of the frequency domain signal comprises:

performing Mel filtering processing on the calculated signal power spectrum through a Mel filter bank to obtain the Mel power spectrum value of the frequency domain signal.
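As an illustration only, a Mel filter bank applied to a power spectrum might be sketched as below. The claim does not fix the filter shape or the Mel mapping; this sketch assumes triangular filters and the commonly used mapping mel = 2595 * log10(1 + f / 700). The function names and parameters are hypothetical.

```python
import math

def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filter_bank(n_filters, n_fft, sample_rate):
    """Triangular filters over the n_fft // 2 + 1 bins of a power spectrum."""
    n_bins = n_fft // 2 + 1
    top = hz_to_mel(sample_rate / 2.0)
    # n_filters + 2 edge frequencies, equally spaced on the Mel scale.
    edges_hz = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        row = []
        for k in range(n_bins):
            f = k * sample_rate / n_fft  # centre frequency of bin k in Hz
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def apply_mel(bank, power):
    """One Mel power spectrum value per filter: weighted sum of the power bins."""
    return [sum(w * p for w, p in zip(row, power)) for row in bank]
```

Calling `apply_mel(mel_filter_bank(4, 64, 16000), power)` on a 33-bin power spectrum would give four Mel power spectrum values.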

3. The method of claim 1, wherein performing noise estimation on the signal power spectrum to obtain the noise estimation value comprises:

determining, from the signal power spectrum, the minimum value of the noisy power within a set time period;

determining the minimum value of the noisy power as a noise estimation reference value;

and compensating the noise estimation reference value to obtain the noise estimation value.
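The three steps of claim 3 can be illustrated with a simple minimum-tracking sketch. It assumes that the "set time" is a sliding window of recent frames and that the compensation is a constant multiplicative factor, chosen because the minimum of the noisy power tends to underestimate the average noise power. The window length, the factor 1.5 and the function name are hypothetical and not taken from the claim.

```python
def estimate_noise(power_frames, window=5, compensation=1.5):
    """power_frames: list of frames, each a list of per-bin power values.

    Returns one noise estimation value per bin for the most recent frame.
    """
    recent = power_frames[-window:]          # frames inside the set time window
    n_bins = len(recent[0])
    estimate = []
    for k in range(n_bins):
        # Minimum noisy power in the window serves as the reference value.
        reference = min(frame[k] for frame in recent)
        # Compensate the reference value (constant factor is an assumption).
        estimate.append(compensation * reference)
    return estimate
```

With `window=2` only the last two frames contribute, so an older, quieter frame no longer influences the estimate.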

4. The method of claim 1, wherein performing noise suppression on the Mel power spectrum value according to the noise estimation value to obtain the noise-reduced Mel power spectrum value comprises:

determining a first gain value of the mel power spectrum value according to the noise estimation value;

performing inter-spectrum smoothing on the first gain value to obtain a second gain value;

and performing noise reduction processing on the Mel power spectrum value by using the second gain value to obtain a noise-reduced Mel power spectrum value.
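As a hedged illustration of the smoothing and application steps of claim 4, the sketch below assumes that inter-spectrum smoothing means averaging each gain with its immediate neighbours across Mel bands. The three-point window and the function names are assumptions, since the claim does not specify the smoothing kernel.

```python
def smooth_gains(first_gain):
    """Three-point moving average across neighbouring Mel bands.

    The two edge bands average over the two available points.
    """
    n = len(first_gain)
    second_gain = []
    for m in range(n):
        neighbours = first_gain[max(0, m - 1):min(n, m + 2)]
        second_gain.append(sum(neighbours) / len(neighbours))
    return second_gain

def apply_gains(mel_power, second_gain):
    """Scale each Mel power spectrum value by its smoothed gain."""
    return [g * p for g, p in zip(second_gain, mel_power)]
```

Smoothing of this kind reduces abrupt band-to-band jumps in the gain, which is one plausible motivation for applying a second gain value rather than the first one directly.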

5. The method of claim 4, wherein determining the first gain value of the Mel power spectrum value according to the noise estimation value comprises:

calculating a posterior signal-to-noise ratio of the Mel power spectrum value according to the noise estimation value;

calculating a prior signal-to-noise ratio according to the posterior signal-to-noise ratio;

and calculating a gain value according to the prior signal-to-noise ratio to obtain the first gain value corresponding to the Mel power spectrum value.
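One possible reading of claim 5 is sketched below for illustration. The claim does not name the estimators, so the sketch swaps in two well-known choices as assumptions: a decision-directed style update for the prior signal-to-noise ratio and a Wiener-type gain, prior / (1 + prior). It further assumes that the noise estimation value is already expressed per Mel band and that a clean-power estimate from the previous frame (`prev_clean_power`, hypothetical) is available. Whether the resulting gain is applied to amplitude or to power is likewise left open here.

```python
def first_gain(mel_power, noise_estimate, prev_clean_power, alpha=0.98, eps=1e-10):
    """Return one gain per Mel band from posterior and prior SNR estimates."""
    gains = []
    for y, n, s_prev in zip(mel_power, noise_estimate, prev_clean_power):
        n = max(n, eps)                       # avoid division by zero
        posterior = y / n                     # posterior SNR of this band
        # Decision-directed style prior SNR (assumed estimator): blend the
        # previous frame's clean-power ratio with the current excess SNR.
        prior = alpha * (s_prev / n) + (1.0 - alpha) * max(posterior - 1.0, 0.0)
        gains.append(prior / (1.0 + prior))   # Wiener-type gain (assumed form)
    return gains
```

For instance, with `alpha=0.5`, a band whose power is four times the noise estimate and which has no previous clean-power estimate gets a prior signal-to-noise ratio of 1.5 and a gain of 0.6, while a band whose power equals the noise estimate gets a gain of 0.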

6. A feature extraction device based on a speech signal, characterized in that the device comprises:

the conversion module is used for converting a time domain to a frequency domain of the noisy speech signal to obtain a frequency domain signal of the noisy speech signal;

the filtering module is used for carrying out Mel filtering processing on the frequency domain signal to obtain a Mel power spectrum value of the frequency domain signal;

the noise reduction module is used for carrying out noise reduction on the Mel power spectrum value to obtain a noise-reduced Mel power spectrum value;

the recognition module is used for carrying out voice recognition according to the noise-reduced Mel power spectrum value to obtain voice features corresponding to the noisy speech signal;

the noise reduction module comprises a calculation submodule, a noise estimation submodule and a noise suppression submodule;

the calculation submodule is used for calculating a signal power spectrum of the frequency domain signal;

the noise estimation submodule is used for carrying out noise estimation on the signal power spectrum to obtain a noise estimation value;

the noise suppression submodule is used for performing noise suppression on the Mel power spectrum value according to the noise estimation value to obtain a noise-reduced Mel power spectrum value;

the recognition module is further configured to take the natural logarithm of the noise-reduced Mel power spectrum value to obtain an Fbank value, and the Fbank value is sent as a voice feature to a voice recognition engine for voice recognition.

7. The apparatus of claim 6, wherein the filtering module comprises:

and the filtering submodule is used for carrying out Mel filtering processing on the signal power spectrum obtained by calculation through a Mel filter bank to obtain a Mel power spectrum value of the frequency domain signal.

8. A computer storage medium comprising a set of computer-executable instructions which, when executed, perform the feature extraction method based on a speech signal according to any one of claims 1-5.