CN114822578A - Voice noise reduction method, device, equipment and storage medium - Google Patents

Voice noise reduction method, device, equipment and storage medium

Info

Publication number
CN114822578A
Authority
CN
China
Prior art keywords
noise reduction
time domain
layer
audio stream
sampling point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210413193.8A
Other languages
Chinese (zh)
Inventor
卢志强 (Lu Zhiqiang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
MIGU Culture Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
MIGU Culture Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, MIGU Culture Technology Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN202210413193.8A priority Critical patent/CN114822578A/en
Publication of CN114822578A publication Critical patent/CN114822578A/en
Pending legal-status Critical Current


Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 — Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 — Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 — Noise filtering
    • G10L21/0216 — Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 — Processing in the time domain
    • G10L21/0232 — Processing in the frequency domain
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique

Abstract

The invention discloses a voice noise reduction method, device, equipment and storage medium. The method comprises: collecting an audio stream and identifying the scene type corresponding to the audio stream; selecting a pre-trained target voice noise reduction model according to the scene type; using the target voice noise reduction model to sequentially perform frequency domain noise reduction and time domain noise reduction on each sampling point corresponding to the audio stream, obtaining a clean time domain signal for each sampling point; and overlap-adding the clean time domain signals of all the sampling points to obtain the noise-reduced audio stream. The invention achieves real-time noise reduction of the audio stream and reduces the transmission delay of the noise-reduced audio stream.

Description

Voice noise reduction method, device, equipment and storage medium
Technical Field
The present invention relates to the field of audio and video technologies, and in particular, to a method, an apparatus, a device, and a storage medium for speech noise reduction.
Background
Speech noise reduction enhances the useful speech signal by attenuating background noise. Although existing speech noise reduction methods, for example those based on deep learning, can denoise an audio signal in a non-stationary noise scene, their real-time performance is poor and they cannot denoise an audio signal in real-time applications.
Disclosure of Invention
The embodiments of the invention provide a voice noise reduction method, device, equipment and storage medium, aiming to solve the technical problem that existing voice noise reduction methods have poor real-time performance when denoising an audio signal and cannot denoise an audio signal in real-time applications.
The embodiment of the invention provides a voice noise reduction method, which comprises the following steps:
collecting audio streams and identifying scene types corresponding to the audio streams;
selecting a pre-trained target voice noise reduction model according to the scene type;
sequentially carrying out frequency domain noise reduction processing and time domain noise reduction processing on each sampling point corresponding to the audio stream by adopting the target voice noise reduction model to obtain a clean time domain signal of each sampling point; and
and overlapping and adding the clean time domain signals of all the sampling points to obtain the noise-reduced audio stream.
In an embodiment, the target speech noise reduction model includes a short-time Fourier transform layer, a first signal noise reduction layer, a short-time inverse Fourier transform layer, a first convolution layer, a second signal noise reduction layer, a second convolution layer and a signal reconstruction layer, connected in that order.
In an embodiment, after the step of identifying a scene type corresponding to the audio stream, the method further includes:
and determining a noise reduction parameter of the target voice noise reduction model according to the scene type, wherein the noise reduction parameter is used for adjusting the noise reduction effect of the target voice noise reduction model.
In an embodiment, the step of sequentially performing frequency domain noise reduction processing and time domain noise reduction processing on each sampling point corresponding to the audio stream by using the target speech noise reduction model to obtain a clean time domain signal of each sampling point includes:
carrying out short-time Fourier transform on each sampling point to obtain a frequency domain signal of each sampling point;
carrying out noise reduction processing on the frequency domain signals of the sampling points;
carrying out short-time inverse Fourier transform on each frequency domain signal subjected to noise reduction processing to obtain a first time domain signal of each sampling point;
reducing the dimension of the first time domain characteristics of the first time domain signals of each sampling point to obtain each first time domain signal with second time domain characteristics;
denoising each first time domain signal with a second time domain characteristic to obtain a second time domain signal of each sampling point;
and determining the clean time domain signal of each sampling point according to the second time domain signal of each sampling point.
In an embodiment, the step of performing overlap-add on the clean time domain signals of the sampling points to obtain the noise-reduced audio stream includes:
performing dimensionality enhancement on the third time domain characteristic of the clean time domain signal of each sampling point to obtain each clean time domain signal with a fourth time domain characteristic;
and overlapping and adding the clean time domain signals with the fourth time domain characteristic to obtain the noise-reduced audio stream.
In an embodiment, the step of identifying a scene type corresponding to the audio stream includes:
and recognizing the audio stream by adopting a pre-trained acoustic scene recognition model to obtain a scene type corresponding to the audio stream.
In an embodiment, the acoustic scene recognition model includes a convolution layer, a pooling layer, a fully-connected layer and a normalized exponential function layer, connected in sequence, and the step of recognizing the audio stream with the pre-trained acoustic scene recognition model to obtain the scene type corresponding to the audio stream includes:
extracting Mel frequency spectrum characteristics of the audio stream;
identifying the Mel frequency spectrum characteristics by adopting the convolution layer, the pooling layer, the fully-connected layer and the normalized exponential function layer to obtain a plurality of preset scene types and the probability corresponding to each preset scene type;
and taking the preset scene type corresponding to the maximum probability as the scene type.
In addition, to achieve the above object, the present invention also provides a voice noise reduction apparatus, including:
the type acquisition module is used for acquiring audio streams and identifying scene types corresponding to the audio streams;
the model selection module is used for selecting a pre-trained target voice noise reduction model according to the scene type;
the voice noise reduction module is used for sequentially carrying out frequency domain noise reduction processing and time domain noise reduction processing on each sampling point corresponding to the audio stream by adopting the target voice noise reduction model to obtain a clean time domain signal of each sampling point;
and the voice reconstruction module is used for overlapping and adding the clean time domain signals of all the sampling points to obtain the audio stream after noise reduction.
In addition, to achieve the above object, the present invention also provides a terminal device, including: a memory, a processor, and a voice noise reduction program stored on the memory and executable on the processor, wherein the voice noise reduction program, when executed by the processor, implements the steps of the voice noise reduction method described above.
In addition, to achieve the above object, the present invention also provides a storage medium having a voice noise reduction program stored thereon, which when executed by a processor, implements the steps of the voice noise reduction method described above.
The technical scheme of the voice noise reduction method, the device, the equipment and the storage medium provided by the embodiment of the invention at least has the following technical effects or advantages:
according to the technical scheme, the technical problems that the existing voice noise reduction method is poor in real-time performance of noise reduction of the audio signal and cannot perform noise reduction processing on the audio signal based on a real-time occasion are solved. According to the method and the device, the scene where the user is located is identified according to the audio stream, so that the corresponding voice noise reduction model is selected according to the scene type where the user is located to reduce the noise of the audio stream, the real-time noise reduction processing of the audio stream is realized, the transmission delay of the noise-reduced audio stream is reduced, and the quality of the noise-reduced audio stream is improved.
Drawings
FIG. 1 is a schematic diagram of a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a voice denoising method according to an embodiment of the present invention;
FIG. 3 is a block diagram of a voice denoising process of the voice denoising method of the present invention;
FIG. 4 is a schematic diagram of a network structure of an acoustic scene recognition model according to the present invention;
FIG. 5 is a schematic diagram of a network structure of a speech noise reduction model according to the present invention;
fig. 6 is a functional block diagram of the speech noise reduction apparatus according to the present invention.
Detailed Description
In order to better understand the above technical solutions, exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
As shown in fig. 1, fig. 1 is a schematic structural diagram of a hardware operating environment according to an embodiment of the present invention.
It should be noted that fig. 1 is a schematic structural diagram of a hardware operating environment of the terminal device.
As an implementation manner, as shown in fig. 1, an embodiment of the present invention relates to a terminal device, where the terminal device includes: a processor 1001, such as a CPU, a memory 1002, and a communication bus 1003. The communication bus 1003 is used to implement connection communication among these components.
The memory 1002 may be a high-speed RAM or a non-volatile memory, such as a disk memory. As shown in fig. 1, the memory 1002, as a storage medium, may include a voice noise reduction program; and the processor 1001 may be configured to call the voice noise reduction program stored in the memory 1002 and perform the following operations:
collecting audio streams and identifying scene types corresponding to the audio streams;
selecting a pre-trained target voice noise reduction model according to the scene type;
sequentially carrying out frequency domain noise reduction processing and time domain noise reduction processing on each sampling point corresponding to the audio stream by adopting the target voice noise reduction model to obtain a clean time domain signal of each sampling point; and
and overlapping and adding the clean time domain signals of all the sampling points to obtain the noise-reduced audio stream.
Further, the processor 1001 may be configured to call the voice noise reduction program stored in the memory 1002, and perform the following operations:
and determining a noise reduction parameter of the target voice noise reduction model according to the scene type, wherein the noise reduction parameter is used for adjusting the noise reduction effect of the target voice noise reduction model.
Further, the processor 1001 may be configured to call the voice noise reduction program stored in the memory 1002, and perform the following operations:
carrying out short-time Fourier transform on each sampling point to obtain a frequency domain signal of each sampling point;
carrying out noise reduction processing on the frequency domain signals of the sampling points;
carrying out short-time inverse Fourier transform on each frequency domain signal subjected to noise reduction processing to obtain a first time domain signal of each sampling point;
reducing the dimension of the first time domain characteristics of the first time domain signals of each sampling point to obtain each first time domain signal with second time domain characteristics;
denoising each first time domain signal with a second time domain characteristic to obtain a second time domain signal of each sampling point;
and determining the clean time domain signal of each sampling point according to the second time domain signal of each sampling point.
Further, the processor 1001 may be configured to call the voice noise reduction program stored in the memory 1002 and perform the following operations:
performing dimensionality enhancement on the third time domain characteristic of the clean time domain signal of each sampling point to obtain each clean time domain signal with a fourth time domain characteristic;
and overlapping and adding the clean time domain signals with the fourth time domain characteristic to obtain the noise-reduced audio stream.
Further, the processor 1001 may be configured to call the voice noise reduction program stored in the memory 1002 and perform the following operations:
and recognizing the audio stream by adopting a pre-trained acoustic scene recognition model to obtain a scene type corresponding to the audio stream.
Further, the acoustic scene recognition model includes a convolution layer, a pooling layer, a fully-connected layer and a normalized exponential function layer, connected in sequence, and the processor 1001 may be configured to call the voice noise reduction program stored in the memory 1002 and perform the following operations:
extracting Mel frequency spectrum characteristics of the audio stream;
identifying the Mel frequency spectrum characteristics by adopting the convolution layer, the pooling layer, the fully-connected layer and the normalized exponential function layer to obtain a plurality of preset scene types and the probability corresponding to each preset scene type;
and taking the preset scene type corresponding to the maximum probability as the scene type.
While a logical order is shown in the flow chart, in some cases, the steps shown or described may be performed in an order different than that shown.
As shown in fig. 2, in an embodiment of the present invention, the voice noise reduction method of the present invention is applied to a terminal device, where the terminal device may be a PC, a mobile communication device (e.g., a mobile phone), and the like, and the voice noise reduction method includes the following steps:
step S210: collecting audio streams and identifying scene types corresponding to the audio streams.
In this embodiment, the audio stream may be a call voice stream, a music audio stream or the like. Take a call voice stream as an example: when the user speaks, the audio stream contains not only the user's speech but also the ambient sound of the environment the user is in. After the audio stream is collected, the ambient sound in the audio stream is extracted, and the scene type of the user's environment is identified from it; this scene type is the scene type corresponding to the audio stream.
Specifically, the scene type corresponding to the audio stream may be obtained through model identification, that is, step S210 includes: and recognizing the audio stream by adopting a pre-trained acoustic scene recognition model to obtain a scene type corresponding to the audio stream.
The acoustic scene recognition model is trained in advance and used to recognize the type of the acoustic scene. Acoustic scene types include various real-life scene types, for example parks, subway stations and airports. The training process of the acoustic scene recognition model is as follows: extract Mel frequency spectrum characteristics from labeled audio scene data, input the Mel frequency spectrum characteristics and the corresponding scene labels into a deep convolution model for iterative training, take the ratio of mislabeled samples per round to the total number of labels as the loss, stop training when the maximum number of training rounds is reached or the loss falls below a set threshold, and save the stopped model as the acoustic scene recognition model. For example, with the model stored in tensorflow-lite format, the size is around 240 KB, the classification precision is essentially unchanged compared with the original model, the accuracy exceeds 95%, and recognizing a 10-second audio stream takes 0.2 to 0.4 seconds, shortening the recognition time.
As shown in fig. 3, in practical applications, after an audio stream is acquired, the audio stream is input into an acoustic scene recognition model, and after the audio stream is recognized and processed by the acoustic scene recognition model, a scene type corresponding to the audio stream is output. For example, the scene type corresponding to the audio stream is identified through an acoustic scene identification model, and the identified scene type result is a subway station.
Specifically, the network structure of the acoustic scene recognition model is shown in fig. 4: the model includes a convolution layer, a pooling layer, a fully-connected layer and a normalized exponential function layer, connected in sequence. Conv2d denotes a two-dimensional convolution and Relu an activation function, so Conv2d/Relu corresponds to the convolution layer; MaxPool2D corresponds to the pooling layer and denotes a two-dimensional max pooling layer; FullConnected denotes the fully-connected layer; and Softmax is the normalized exponential function, corresponding to the normalized exponential function layer.
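By way of illustration, the following is a minimal Keras sketch of a classifier with the layer sequence of fig. 4 (Conv2d/Relu, MaxPool2D, FullConnected, Softmax) and the tensorflow-lite export mentioned above. The filter count, kernel size and the NUM_SCENES class count are assumptions; the patent fixes only the layer types and the storage format.

```python
import tensorflow as tf

NUM_SCENES = 3  # e.g. park, subway station, airport (assumed count)

model = tf.keras.Sequential([
    # Conv2d + Relu over a mel feature input such as (128, 416, 6)
    tf.keras.layers.Conv2D(32, 3, activation="relu",
                           input_shape=(128, 416, 6)),
    tf.keras.layers.MaxPool2D(),        # two-dimensional max pooling layer
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(NUM_SCENES),  # fully-connected layer
    tf.keras.layers.Softmax(),          # normalized exponential function layer
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Export in the tensorflow-lite storage format named in the text:
tflite_model = tf.lite.TFLiteConverter.from_keras_model(model).convert()
```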
The specific implementation process of adopting a pre-trained acoustic scene recognition model to recognize the audio stream and obtaining the scene type corresponding to the audio stream comprises the following steps:
extracting Mel frequency spectrum characteristics of the audio stream;
identifying the Mel frequency spectrum characteristics by adopting the convolution layer, the pooling layer, the fully-connected layer and the normalized exponential function layer to obtain a plurality of preset scene types and the probability corresponding to each preset scene type;
and taking the preset scene type corresponding to the maximum probability as the scene type.
After the audio stream is input into the acoustic scene recognition model, the model extracts the Mel frequency spectrum characteristics of the audio stream; for example, the extracted characteristics form a feature matrix of shape (1, 128, 416, 6). The Mel frequency spectrum characteristics are then processed by the convolution layer, the pooling layer, the fully-connected layer and the normalized exponential function layer, and the recognition result is output by the normalized exponential function layer. The result comprises a plurality of preset scene types and the probability corresponding to each, and the preset scene type with the maximum probability is taken as the finally recognized scene type. For example, the normalized exponential function layer outputs a (1, 3)-dimensional matrix, where the 3 indicates that 3 preset scene types are output, and the preset scene type corresponding to the largest value is the scene type corresponding to the audio stream.
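As an illustrative sketch of these two steps, the following assumes librosa for the Mel frequency spectrum extraction; the hop length, FFT size and the stacking that produces the 6 channels of the (1, 128, 416, 6) matrix are not given in the patent and are left open here.

```python
import librosa
import numpy as np

def mel_features(path, sr=16000, n_mels=128):
    # Log-mel spectrogram; n_mels=128 matches the 128 mel bands of the
    # feature matrix quoted above, other parameters are librosa defaults.
    audio, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel)          # shape (n_mels, n_frames)

def pick_scene(probs):
    # probs: softmax output such as the (1, 3) matrix in the example;
    # the preset scene type with the maximum probability is selected.
    return int(np.argmax(probs))
```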
Step S220: and selecting a pre-trained target voice noise reduction model according to the scene type.
In this embodiment, referring to fig. 3, the intelligent voice noise reduction model in the figure includes a plurality of voice noise reduction models, each used to denoise the audio stream of a corresponding scene type. For example, a park-type voice noise reduction model denoises audio streams in park scenes, and an airport-type voice noise reduction model denoises audio streams in airport scenes. After the scene type corresponding to the audio stream is identified by the acoustic scene recognition model, the voice noise reduction model matching that scene type is selected from the intelligent voice noise reduction models; the selected model is the target voice noise reduction model, and the audio stream is input into it for noise reduction, yielding the noise-reduced audio stream.
Step S230: and sequentially carrying out frequency domain noise reduction processing and time domain noise reduction processing on each sampling point corresponding to the audio stream by adopting the target voice noise reduction model to obtain a clean time domain signal of each sampling point.
Step S240: and overlapping and adding the clean time domain signals of all the sampling points to obtain the noise-reduced audio stream.
In this embodiment, the target speech noise reduction model denoises the audio stream as follows: the model performs noise reduction on the audio stream twice, and the audio stream after the second noise reduction pass is taken as the final output. The target voice noise reduction model acquires the sampling points corresponding to the audio stream, where every two adjacent frames of sampling point data overlap each other.
Each sampling point is converted into the frequency domain to obtain its frequency domain signal, and frequency domain noise reduction is then performed on each frequency domain signal; this is the first noise reduction pass on the audio stream, completing the denoising of each sampling point in the frequency domain. The denoised frequency domain signals are then converted back into the time domain to obtain a first time domain signal for each sampling point, and time domain noise reduction is performed on each first time domain signal to obtain a second time domain signal for each sampling point; this is the second noise reduction pass, completing the denoising of each sampling point in the time domain. The second time domain signal of each sampling point is its first time domain signal after noise reduction, i.e. it no longer contains noise, and is therefore also called the clean time domain signal of each sampling point.
After the clean time domain signals of all the sampling points are obtained, they are overlap-added to obtain the noise-reduced audio stream, which is then output; this reduces the delay of the audio stream and improves the quality of the noise-reduced audio stream.
According to the technical scheme, the scene where the user is located is identified according to the audio stream, so that the corresponding voice noise reduction model is selected according to the scene type where the user is located to perform frequency domain and time domain noise reduction on the audio stream, real-time noise reduction processing of the audio stream is achieved, transmission delay of the noise-reduced audio stream is reduced, and quality of the noise-reduced audio stream is improved.
On one hand, the acoustic scene recognition model and the voice noise reduction model can be ported to mobile phones running Android or iOS, and can also be ported into client software, meeting the noise reduction requirements of real-time communication.
On the other hand, in order to adapt the speech noise reduction model to different scenes, the present invention controls the noise reduction effect of the speech noise reduction model by a noise reduction parameter d (as shown in fig. 3), that is, after step S220, the present invention further includes: and determining the noise reduction parameters of the target voice noise reduction model according to the scene type.
It should be understood that the noise reduction parameter is used to adjust the noise reduction effect of the target voice noise reduction model, and may be determined according to the recognized scene type. The parameter d ranges between 0 and 1 and controls the proportion of noise that is cancelled, with d = 0 meaning that all noise is cancelled. Different parameter values are selected for different scene types; to improve the quality of the noise-reduced speech, the default is d = 0.04. After the noise reduction parameter is determined, the default value stored in the target voice noise reduction model is updated, and the model then works according to the determined parameter, improving the noise reduction effect on the audio stream.
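The patent gives no explicit formula for how d acts on the signal; one plausible reading, sketched below purely as an assumption, blends the raw and denoised signals so that d = 0 cancels all noise and larger values retain proportionally more of it.

```python
import numpy as np

def apply_noise_parameter(noisy, denoised, d=0.04):
    # Assumed blending rule: d = 0 keeps only the model output (all
    # noise cancelled); d = 1 returns the unprocessed input.
    assert 0.0 <= d <= 1.0
    return d * np.asarray(noisy) + (1.0 - d) * np.asarray(denoised)
```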
Further, the training process of the speech noise reduction model is as follows:
assuming that the noisy speech data is x, x may be a combination of clean speech y and noise e, i.e., x ═ y + e, then the task of the speech noise reduction model is to estimate clean speech y under the condition of the speech data x with known noise, i.e., y1 ═ f (x), and the task of speech noise reduction is to find a function f so that the clean speech signal y1 predicted by the function f is closer to y. Noise voice data and corresponding voice (clean voice is equivalent to a label) of a specific scene (such as a conference scene, outdoors, traffic and the like) are collected, the collected voice data are accumulated for several hours, and then voice noise reduction models for different scene types are trained. Wherein the clean speech is speech without noise.
As shown in fig. 5, the network structure of the voice noise reduction model includes a short-time Fourier transform layer, a first signal noise reduction layer, a short-time inverse Fourier transform layer, a first convolution layer, a second signal noise reduction layer, a second convolution layer and a signal reconstruction layer, connected in that order; the first signal noise reduction layer and the second signal noise reduction layer each comprise long-short term memory neural networks and a fully-connected network, and the second convolution layer is a causal convolution layer.
The short-time Fourier transform layer corresponds to STFT (257) in fig. 5. 501 denotes the first signal noise reduction layer, which comprises: a plurality of long-short term memory neural networks, a plurality of random drop (Dropout) rates, a fully-connected network and an activation layer. Each long-short term memory neural network together with a Dropout rate forms a long-short term memory neural network layer, and the fully-connected network together with the activation layer forms a fully-connected layer. The first signal noise reduction layer thus contains two long-short term memory neural network layers and one fully-connected layer, corresponding to the left part of the network structure in fig. 5: from top to bottom, the first LSTM + Dropout denotes the first long-short term memory neural network layer, the second LSTM + Dropout denotes the second, and Dense + Activation denotes the fully-connected layer; the two long-short term memory neural network layers and the fully-connected layer are connected in sequence.
502 denotes the second signal noise reduction layer, which has the same internal structure: two long-short term memory neural network layers (each an LSTM with a Dropout rate) followed by a fully-connected layer (Dense + Activation), corresponding to the right part of the network structure in fig. 5 and likewise connected in sequence.
The short-time inverse Fourier transform layer corresponds to ISTFT (512) in fig. 5, the first convolution layer to Conv1D (256), the second convolution layer to Conv1D (512), and the signal reconstruction layer to Overlap-add.
STFT (Short-Time Fourier Transform) denotes the short-time Fourier transform; Dropout is the random drop rate (the four Dropouts in the figure are all set to 0.25); Dense is a fully-connected network; Activation is an activation layer whose activation function is a sigmoid; ISTFT (Inverse Short-Time Fourier Transform) is the inverse short-time Fourier transform; and Overlap-add is the overlap-add method used to reconstruct audio frames into an audio signal. Speech noise reduction generally begins with a short-time Fourier transform; audio data is continuous and adjacent frames are closely related, but exploiting subsequent frames would introduce a fixed time delay. To solve this problem, long-short term memory (LSTM) networks are added to the model to make full use of information from earlier audio frames, which both eliminates the fixed delay that subsequent frames would cause and preserves the noise reduction quality of the current audio frame.
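Putting fig. 5 together, the following sketch wires the seven layers in order using tf.signal and Keras. The layer order, the 0.25 Dropout rate and the 257/512/256 feature sizes follow the figure; the LSTM widths, kernel sizes, the 50% frame hop, the sigmoid masking and the SpeechDenoiser name are assumptions.

```python
import tensorflow as tf

FRAME = 512  # samples per frame (32 ms at 16 kHz)
STEP = 256   # hop between frames (assumed 50% overlap)

class SpeechDenoiser(tf.keras.Model):
    def __init__(self):
        super().__init__()
        # first signal noise reduction layer: 2 x (LSTM + Dropout) + Dense
        self.lstm1a = tf.keras.layers.LSTM(257, return_sequences=True)
        self.lstm1b = tf.keras.layers.LSTM(257, return_sequences=True)
        self.drop = tf.keras.layers.Dropout(0.25)
        self.mask1 = tf.keras.layers.Dense(257, activation="sigmoid")
        # first convolution layer: 512-dim -> 256-dim time features
        self.down = tf.keras.layers.Conv1D(256, 1)
        # second signal noise reduction layer
        self.lstm2a = tf.keras.layers.LSTM(256, return_sequences=True)
        self.lstm2b = tf.keras.layers.LSTM(256, return_sequences=True)
        self.mask2 = tf.keras.layers.Dense(256, activation="sigmoid")
        # second (causal) convolution layer: 256 -> 512 dims
        self.up = tf.keras.layers.Conv1D(512, 1, padding="causal")

    def call(self, frames, training=False):      # frames: (batch, T, 512)
        spec = tf.signal.rfft(frames)            # STFT: (batch, T, 257)
        h = self.drop(self.lstm1a(tf.abs(spec)), training=training)
        h = self.drop(self.lstm1b(h), training=training)
        spec = spec * tf.cast(self.mask1(h), tf.complex64)  # freq denoise
        first = tf.signal.irfft(spec)            # ISTFT: (batch, T, 512)
        t = self.down(first)                     # 512 -> 256 features
        h = self.drop(self.lstm2a(t), training=training)
        h = self.drop(self.lstm2b(h), training=training)
        clean = self.up(t * self.mask2(h))       # time denoise, 256 -> 512
        return tf.signal.overlap_and_add(clean, STEP)  # reconstruct stream
```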
Further, based on the above embodiment, the step S230 includes the following steps:
carrying out short-time Fourier transform on each sampling point to obtain a frequency domain signal of each sampling point;
carrying out noise reduction processing on the frequency domain signals of the sampling points;
carrying out short-time inverse Fourier transform on each frequency domain signal subjected to noise reduction processing to obtain a first time domain signal of each sampling point;
reducing the dimension of the first time domain characteristics of the first time domain signals of each sampling point to obtain each first time domain signal with second time domain characteristics;
denoising each first time domain signal with a second time domain characteristic to obtain a second time domain signal of each sampling point;
and determining the clean time domain signal of each sampling point according to the second time domain signal of each sampling point.
It should be understood that performing a short-time Fourier transform on each sampling point converts each sampling point into the frequency domain, yielding the frequency domain signal of each sampling point; the frequency domain signals are then denoised in the frequency domain to obtain the denoised frequency domain signals, completing the frequency domain noise reduction of each sampling point, i.e. the first noise reduction pass on the audio stream. The dimensionality of the frequency domain features is the same before and after this noise reduction.
After the first pass, a short-time inverse Fourier transform is applied to each denoised frequency domain signal, converting it into the time domain to obtain the first time domain signal of each sampling point; the time domain feature of this first time domain signal is called the first time domain feature, and its dimensionality is larger than that of the frequency domain features before and after noise reduction.
The first time domain features of the first time domain signals are then reduced in dimension in the time domain, yielding first time domain signals with second time domain features, the dimensionality of the second time domain features being smaller than that of the first. Each first time domain signal with a second time domain feature is then denoised in the time domain to obtain the second time domain signal of each sampling point, i.e. the denoised first time domain signals with second time domain features. The second time domain signal of each sampling point is its clean time domain signal, and the second noise reduction pass on the audio stream is complete.
Based on the network structure of the target speech noise reduction model, the specific implementation process of step S230 is as follows:
carrying out short-time Fourier transform on each sampling point by adopting the short-time Fourier transform layer to obtain a frequency domain signal of each sampling point;
carrying out noise reduction processing on the frequency domain signals of the sampling points by adopting the first signal noise reduction layer;
performing short-time inverse Fourier transform on each frequency domain signal subjected to noise reduction processing by using the short-time inverse Fourier transform layer to obtain a first time domain signal of each sampling point;
reducing the dimension of the first time domain characteristics of the first time domain signals of each sampling point by adopting the first convolution layer to obtain each first time domain signal with second time domain characteristics;
inputting the second time domain characteristics into the second signal noise reduction layer to reduce noise of each first time domain signal to obtain a second time domain signal of each sampling point;
and determining the clean time domain signal of each sampling point according to the second time domain signal of each sampling point.
It should be understood that performing a short-time Fourier transform on each sampling point of the audio stream with the short-time Fourier transform layer converts each sampling point into the frequency domain, yielding the frequency domain signal corresponding to each sampling point. For example, an audio stream at a monaural 16 kHz sampling rate corresponds to 16,000 sampling points per second; the input of the short-time Fourier transform layer (which is also the input of the voice noise reduction model) is 512 sampling points, corresponding to an audio duration of 32 ms. After the short-time Fourier transform layer transforms the 512 sampling points, the frequency domain signal corresponding to each sampling point is obtained, and the dimensionality of the spectral feature of the frequency domain signal is 257, i.e. a 257-dimensional spectral feature.
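These figures can be checked directly: at a 16 kHz sampling rate, 512 samples span 32 ms, and a 512-point real FFT yields 257 frequency bins.

```python
import numpy as np

sr, frame = 16000, 512
print(1000 * frame / sr)                   # 32.0 -> 32 ms of audio per frame
print(np.fft.rfft(np.zeros(frame)).shape)  # (257,) -> 257-dimensional spectrum
```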
After the spectral features of the frequency domain signals are obtained, they are input into the first signal noise reduction layer, which filters each frequency domain signal to denoise it; this yields the denoised frequency domain signals and completes the frequency domain noise reduction of each sampling point, i.e. the first noise reduction pass on the audio stream. For example, the 257-dimensional spectral features are processed by the two long-short term memory neural network layers and the fully-connected layer to obtain the denoised frequency domain signals, whose spectral features are likewise 257-dimensional.
The short-time inverse Fourier transform layer then applies a short-time inverse Fourier transform to each denoised frequency domain signal, converting it into the time domain to obtain the first time domain signal of each sampling point. For example, after the spectral features of the denoised frequency domain signals are input into the short-time inverse Fourier transform layer and inverse transformed, the first time domain signal of each sampling point is output, and the dimensionality of its first time domain feature is 512, i.e. a 512-dimensional time domain feature.
The first convolution layer is a one-dimensional convolution layer; it reduces the dimensionality of the first time domain features, yielding first time domain signals with second time domain features whose dimensionality is lower than that of the first. Each second time domain feature is then input into the second signal noise reduction layer, which denoises the first time domain signals by filtering the second time domain features, yielding the second time domain signal of each sampling point, i.e. the denoised first time domain signals. The second time domain signal of each sampling point is its clean time domain signal, and the second noise reduction pass on the audio stream is complete.
Further, based on the above embodiment, the step S240 includes the following steps:
performing dimensionality enhancement on the third time domain characteristic of the clean time domain signal of each sampling point to obtain each clean time domain signal with a fourth time domain characteristic;
and overlapping and adding the clean time domain signals with the fourth time domain characteristic to obtain the noise-reduced audio stream.
It will be appreciated that the third time domain feature of the clean time domain signal of each sampling point is the time domain feature of its second time domain signal. After the clean time domain signals are obtained, the third time domain feature of each is raised in dimension to obtain clean time domain signals with a fourth time domain feature, whose dimensionality is larger than that of the third. Although fully denoised, the clean time domain signals of the sampling points are still separate fragments; to produce output, the clean time domain signals with the fourth time domain feature must be overlap-added together in time order, yielding the final noise-reduced audio stream, which is then output. Here a frequency domain signal may be understood as an audio frame in the frequency domain, and a time domain signal as an audio frame in the time domain.
Based on the network structure of the target speech noise reduction model, the specific implementation process of step S240 is as follows:
performing dimensionality enhancement on the third time domain feature of the clean time domain signal of each sampling point by using the second convolution layer to obtain each clean time domain signal with a fourth time domain feature;
and overlapping and adding the clean time domain signals with the fourth time domain characteristic by adopting the signal reconstruction layer to obtain the noise-reduced audio stream.
After the clean time domain signals with the third time domain feature are obtained, the third time domain features are input into the second convolution layer, which raises their dimensionality to produce clean time domain signals with the fourth time domain feature. These signals are fully denoised but still separate fragments; to produce output they must be overlap-added in time order, i.e. the signal reconstruction layer overlap-adds the clean time domain signals with the fourth time domain feature, and the final noise-reduced audio stream is obtained and then output.
For example, the first time domain feature of each first time domain signal is a 512-dimensional time domain feature. The first convolution layer reduces the 512-dimensional feature to obtain first time domain signals with a 256-dimensional time domain feature, i.e. the second time domain feature is 256-dimensional. The second signal noise reduction layer filters the 256-dimensional features; the filtered 256-dimensional features (the third time domain feature) are input into the second convolution layer, which outputs clean time domain signals with a 512-dimensional time domain feature (the fourth time domain feature). The signal reconstruction layer then overlap-adds the clean time domain signals with the 512-dimensional features to obtain the noise-reduced audio stream, thereby denoising the audio stream, reducing the transmission delay of the noise-reduced audio stream and improving its quality.
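For illustration, a minimal overlap-add sketch follows, assuming a hop of 256 samples (50% overlap, which the patent does not state): each denoised 512-sample frame is shifted by the hop and summed into the output stream in time order.

```python
import numpy as np

def overlap_add(frames, hop=256):
    # frames: array of shape (n_frames, 512) holding the clean time
    # domain signals; adjacent frames overlap by frame_len - hop samples.
    n_frames, frame_len = frames.shape
    out = np.zeros((n_frames - 1) * hop + frame_len)
    for i, f in enumerate(frames):
        out[i * hop:i * hop + frame_len] += f
    return out
```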
As shown in fig. 6, the present invention provides a speech noise reduction apparatus, including:
the type obtaining module 310 is configured to collect an audio stream, and identify a scene type corresponding to the audio stream;
the model selection module 320 is used for selecting a pre-trained target voice noise reduction model according to the scene type;
the voice noise reduction module 330 is configured to sequentially perform frequency domain noise reduction processing and time domain noise reduction processing on each sampling point corresponding to the audio stream by using the target voice noise reduction model to obtain a clean time domain signal of each sampling point;
and the voice reconstruction module 340 is configured to overlap and add the clean time domain signals of the sampling points to obtain the noise-reduced audio stream.
Furthermore, the target voice noise reduction model comprises a short-time Fourier transform layer, a first signal noise reduction layer, a short-time Fourier inverse transform layer, a first convolution layer, a second signal noise reduction layer, a second convolution layer and a signal reconstruction layer, wherein the short-time Fourier transform layer, the first signal noise reduction layer, the short-time Fourier inverse transform layer, the first convolution layer, the second signal noise reduction layer, the second convolution layer and the signal reconstruction layer are sequentially connected.
Further, the voice noise reduction apparatus further includes:
and the parameter selection unit is used for determining the noise reduction parameters of the target voice noise reduction model according to the scene type, and the noise reduction parameters are used for adjusting the noise reduction effect of the target voice noise reduction model.
Further, the voice noise reduction module 330 includes:
the frequency domain conversion unit is used for carrying out short-time Fourier transform on each sampling point to obtain a frequency domain signal of each sampling point;
the first noise reduction unit is used for carrying out noise reduction processing on the frequency domain signals of the sampling points;
the time domain conversion unit is used for carrying out short-time inverse Fourier transform on each frequency domain signal subjected to noise reduction processing to obtain a first time domain signal of each sampling point;
the characteristic dimension reduction unit is used for reducing the dimension of the first time domain characteristic of the first time domain signal of each sampling point to obtain each first time domain signal with a second time domain characteristic;
the second noise reduction unit is used for carrying out noise reduction on each first time domain signal with second time domain characteristics to obtain a second time domain signal of each sampling point;
and the signal determining unit is used for determining the clean time domain signal of each sampling point according to the second time domain signal of each sampling point.
Further, the speech reconstruction module 340 includes:
the feature dimension increasing unit is used for increasing the dimension of the third time domain feature of the clean time domain signal of each sampling point to obtain each clean time domain signal with the fourth time domain feature;
and the signal superposition unit is used for performing overlap addition on each clean time domain signal with the fourth time domain characteristic to obtain the audio stream subjected to noise reduction.
Further, in identifying the scene type corresponding to the audio stream, the type obtaining module 310 is specifically configured to recognize the audio stream using a pre-trained acoustic scene recognition model to obtain the scene type corresponding to the audio stream.
Further, the acoustic scene recognition model includes a convolution layer, a pooling layer, a fully-connected layer and a normalized exponential function layer, connected in sequence, and for recognizing the audio stream with the pre-trained acoustic scene recognition model to obtain the scene type corresponding to the audio stream, the type obtaining module 310 includes:
a feature extraction unit, configured to extract Mel frequency spectrum characteristics of the audio stream;
and a type selection unit, configured to recognize the Mel frequency spectrum characteristics with the convolution layer, the pooling layer, the fully-connected layer and the normalized exponential function layer to obtain a plurality of preset scene types and the probability corresponding to each preset scene type, and to take the preset scene type corresponding to the maximum probability as the scene type.
The specific implementation of the speech noise reduction apparatus of the present invention is substantially the same as the embodiments of the speech noise reduction method, and will not be described herein again.
Further, the present invention also provides a terminal device, including: a memory, a processor, and a voice noise reduction program stored on the memory and executable on the processor, wherein the voice noise reduction program, when executed by the processor, implements the steps of the voice noise reduction method described above.
Further, the present invention also provides a storage medium, on which a voice noise reduction program is stored, and the voice noise reduction program, when executed by a processor, implements the steps of the voice noise reduction method described above.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A method for speech noise reduction, the method comprising:
collecting audio streams and identifying scene types corresponding to the audio streams;
selecting a pre-trained target voice noise reduction model according to the scene type;
sequentially carrying out frequency domain noise reduction processing and time domain noise reduction processing on each sampling point corresponding to the audio stream by adopting the target voice noise reduction model to obtain a clean time domain signal of each sampling point; and
and overlapping and adding the clean time domain signals of all the sampling points to obtain the noise-reduced audio stream.
2. The method of claim 1, wherein the target speech noise reduction model comprises a short-time Fourier transform layer, a first signal noise reduction layer, a short-time inverse Fourier transform layer, a first convolution layer, a second signal noise reduction layer, a second convolution layer and a signal reconstruction layer, connected in sequence.
3. The method of claim 1 or 2, wherein, after the step of identifying the scene type corresponding to the audio stream, the method further comprises:
determining a noise reduction parameter of the target speech noise reduction model according to the scene type, wherein the noise reduction parameter is used to adjust the noise reduction effect of the target speech noise reduction model.
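A simple realization of claim 3 could be a lookup from scene type to denoising strength; the scene names, parameter names and values below are purely illustrative assumptions:

    # Hypothetical scene-to-parameter table; all values are illustrative only.
    NOISE_PARAMS = {
        "street": {"suppression_db": 18.0, "mask_floor": 0.05},
        "office": {"suppression_db": 10.0, "mask_floor": 0.15},
        "subway": {"suppression_db": 24.0, "mask_floor": 0.02},
    }

    def configure(model, scene_type):
        # Adjust the target model's aggressiveness for the recognized scene.
        model.noise_params = NOISE_PARAMS.get(
            scene_type, {"suppression_db": 12.0, "mask_floor": 0.10})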
4. The method according to claim 1 or 2, wherein the step of sequentially performing frequency domain noise reduction processing and time domain noise reduction processing on each sampling point corresponding to the audio stream by using the target speech noise reduction model to obtain a clean time domain signal of each sampling point comprises:
performing a short-time Fourier transform on each sampling point to obtain a frequency domain signal of each sampling point;
performing noise reduction processing on the frequency domain signal of each sampling point;
performing a short-time inverse Fourier transform on each noise-reduced frequency domain signal to obtain a first time domain signal of each sampling point;
reducing the dimension of the first time domain feature of the first time domain signal of each sampling point to obtain first time domain signals having a second time domain feature;
denoising each first time domain signal having the second time domain feature to obtain a second time domain signal of each sampling point; and
determining the clean time domain signal of each sampling point according to the second time domain signal of each sampling point.
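The claim leaves the frequency-domain noise reduction operation itself open. As one concrete stand-in (not necessarily what the disclosed learned layer computes), a classical spectral-subtraction pass over one sampling window could look like this, assuming the window length equals n_fft:

    import numpy as np

    def freq_denoise_window(frame, noise_mag, n_fft=512):
        # Short-time Fourier transform of one Hann-windowed frame.
        spec = np.fft.rfft(frame * np.hanning(len(frame)), n=n_fft)
        mag, phase = np.abs(spec), np.angle(spec)
        # Subtract an estimated noise magnitude, floored at zero.
        clean_mag = np.maximum(mag - noise_mag, 0.0)
        # Inverse transform yields the first time domain signal of the window.
        return np.fft.irfft(clean_mag * np.exp(1j * phase), n=n_fft)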
5. The method according to claim 1 or 2, wherein the step of overlap-adding the clean time domain signals of the sampling points to obtain the noise-reduced audio stream comprises:
increasing the dimension of the third time domain feature of the clean time domain signal of each sampling point to obtain clean time domain signals having a fourth time domain feature; and
overlap-adding the clean time domain signals having the fourth time domain feature to obtain the noise-reduced audio stream.
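Overlap-add itself is standard signal reconstruction; a minimal sketch, assuming equal-length clean windows spaced hop samples apart:

    import numpy as np

    def overlap_add(segments, hop):
        # Sum the clean windows back at their original offsets in the stream.
        win_len = len(segments[0])
        out = np.zeros(hop * (len(segments) - 1) + win_len)
        for i, seg in enumerate(segments):
            out[i * hop : i * hop + win_len] += seg
        return out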
6. The method of claim 1, wherein the step of identifying the scene type corresponding to the audio stream comprises:
recognizing the audio stream using a pre-trained acoustic scene recognition model to obtain the scene type corresponding to the audio stream.
7. The method of claim 6, wherein the acoustic scene recognition model includes a convolutional layer, a pooling layer, a fully connected layer and a normalized exponential function (softmax) layer, which are connected in sequence, and the step of recognizing the audio stream using the pre-trained acoustic scene recognition model to obtain the scene type corresponding to the audio stream comprises:
extracting Mel spectrum features of the audio stream;
passing the Mel spectrum features through the convolutional layer, the pooling layer, the fully connected layer and the softmax layer to obtain a plurality of preset scene types and a probability for each preset scene type; and
taking the preset scene type with the highest probability as the scene type.
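Combined with the Mel features sketched earlier, the classification chain of claim 7 could be realized as below; the layer sizes and the number of preset scene types are assumptions:

    import torch
    import torch.nn as nn

    class SceneClassifier(nn.Module):
        # Convolution -> pooling -> fully connected -> softmax, as in claim 7.
        def __init__(self, n_scenes=5):
            super().__init__()
            self.conv = nn.Conv2d(1, 16, kernel_size=3, padding=1)
            self.pool = nn.AdaptiveAvgPool2d((8, 8))
            self.fc = nn.Linear(16 * 8 * 8, n_scenes)

        def forward(self, log_mel):                # (batch, 1, n_mels, n_frames)
            h = self.pool(torch.relu(self.conv(log_mel)))
            return torch.softmax(self.fc(h.flatten(1)), dim=-1)

    # The preset scene type with the highest probability becomes the scene type:
    # scene_index = SceneClassifier()(features).argmax(dim=-1)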
8. A speech noise reduction apparatus, comprising:
a type obtaining module, configured to collect an audio stream and identify a scene type corresponding to the audio stream;
a model selection module, configured to select a pre-trained target speech noise reduction model according to the scene type;
a speech noise reduction module, configured to sequentially perform frequency domain noise reduction processing and time domain noise reduction processing on each sampling point corresponding to the audio stream by using the target speech noise reduction model to obtain a clean time domain signal of each sampling point; and
a speech reconstruction module, configured to overlap-add the clean time domain signals of the sampling points to obtain a noise-reduced audio stream.
9. A terminal device, characterized in that the terminal device comprises: a memory, a processor and a speech noise reduction program stored on the memory and executable on the processor, the speech noise reduction program, when executed by the processor, implementing the steps of the speech noise reduction method according to any one of claims 1 to 7.
10. A storage medium having stored thereon a speech noise reduction program which, when executed by a processor, implements the steps of the speech noise reduction method according to any one of claims 1 to 7.
CN202210413193.8A 2022-04-19 2022-04-19 Voice noise reduction method, device, equipment and storage medium Pending CN114822578A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210413193.8A CN114822578A (en) 2022-04-19 2022-04-19 Voice noise reduction method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210413193.8A CN114822578A (en) 2022-04-19 2022-04-19 Voice noise reduction method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114822578A true CN114822578A (en) 2022-07-29

Family

ID=82505015

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210413193.8A Pending CN114822578A (en) 2022-04-19 2022-04-19 Voice noise reduction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114822578A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116994599A (en) * 2023-09-13 2023-11-03 湖北星纪魅族科技有限公司 Audio noise reduction method for electronic equipment, electronic equipment and storage medium
WO2024041512A1 (en) * 2022-08-25 2024-02-29 维沃移动通信有限公司 Audio noise reduction method and apparatus, and electronic device and readable storage medium

Similar Documents

Publication Publication Date Title
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
CN109841226A (en) A kind of single channel real-time noise-reducing method based on convolution recurrent neural network
CN109326299B (en) Speech enhancement method, device and storage medium based on full convolution neural network
CN109859767A (en) A kind of environment self-adaption neural network noise-reduction method, system and storage medium for digital deaf-aid
CN108172213B (en) Surge audio identification method, surge audio identification device, surge audio identification equipment and computer readable medium
CN114822578A (en) Voice noise reduction method, device, equipment and storage medium
CN108335694B (en) Far-field environment noise processing method, device, equipment and storage medium
CN112735482B (en) Endpoint detection method and system based on joint deep neural network
CN113611324B (en) Method and device for suppressing environmental noise in live broadcast, electronic equipment and storage medium
CN112767927A (en) Method, device, terminal and storage medium for extracting voice features
CN108806725A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN113793624B (en) Acoustic scene classification method
CN111681649B (en) Speech recognition method, interaction system and achievement management system comprising system
US20230186943A1 (en) Voice activity detection method and apparatus, and storage medium
CN113436640B (en) Audio noise reduction method, device and system and computer readable storage medium
CN114255778A (en) Audio stream noise reduction method, device, equipment and storage medium
CN115565533A (en) Voice recognition method, device, equipment and storage medium
CN115472174A (en) Sound noise reduction method and device, electronic equipment and storage medium
WO2020250220A1 (en) Sound analysis for determination of sound sources and sound isolation
Li et al. Dynamic attention based generative adversarial network with phase post-processing for speech enhancement
CN107993666B (en) Speech recognition method, speech recognition device, computer equipment and readable storage medium
CN115472152B (en) Voice endpoint detection method and device, computer equipment and readable storage medium
TWI749547B (en) Speech enhancement system based on deep learning
CN117727298B (en) Deep learning-based portable computer voice recognition method and system
CN111833897B (en) Voice enhancement method for interactive education

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination