CN117133303B

CN117133303B - Voice noise reduction method, electronic equipment and medium

Info

Publication number: CN117133303B
Application number: CN202311400580.9A
Authority: CN
Inventors: 夏殷锋; 孙玉涛; 王满洪; 李佳树
Original assignee: Honor Device Co Ltd
Current assignee: Honor Device Co Ltd
Priority date: 2023-10-26
Filing date: 2023-10-26
Publication date: 2024-03-29
Anticipated expiration: 2043-10-26
Also published as: CN118298840B; CN118298840A; CN117133303A

Abstract

This application provides a speech noise reduction method, an electronic device, and a medium. The method includes: acquiring noisy speech that includes original speech and noise, selecting a corresponding speech noise reduction mode according to the noise intensity of the noisy speech, and denoising the noisy speech based on the selected speech noise reduction mode to obtain the original speech. Determining the noise intensity establishes how strong the noise in the noisy speech is, so that different speech noise reduction modes can be selected for different noise intensities and the noisy speech can be denoised according to the selected mode. This realizes adaptive adjustment of speech noise reduction, reduces the waste of system resources, and reduces the delay of voice interaction.

Description

Voice noise reduction method, electronic equipment and medium

Technical Field

The present disclosure relates to the field of audio processing technologies, and in particular, to a method for noise reduction in speech, an electronic device, and a medium.

Background

With the development of artificial intelligence, voice interaction (such as voice wake-up and voice recognition) is widely applied to mobile devices and wearable devices (such as tablets, mobile phones, and smart watches). For example, a user can wake up a mobile phone assistant with specific voice content, or the device can convert voice input by the user into corresponding text through a voice recognition application.

During voice interaction, the input voice may be affected by a noisy environment, so that it contains both original voice and noise. The noise in the input voice may lower the wake-up rate of voice wake-up or the recognition rate of voice recognition. To solve this technical problem, the related art generally uses a voice noise reduction model to perform noise reduction on the input voice and thereby remove the influence of noise on voice interaction.

However, when the device performs noise reduction on noisy voice in this way, system resources are wasted and the time delay of the voice interaction process increases.

Disclosure of Invention

The voice noise reduction method, electronic device, and medium provided by the present application solve the problems that system resources are wasted and voice interaction time delay is high when a device performs noise reduction on noisy voice.

In order to achieve the above purpose, the present application adopts the following technical scheme:

In a first aspect, an embodiment of the present application provides a method for voice noise reduction, including:

obtaining noisy speech comprising original speech and noise, selecting a corresponding speech noise reduction mode according to the noise intensity of the noisy speech, and then performing noise reduction on the noisy speech based on the speech noise reduction mode to obtain the original speech.

In the embodiment of the application, the noise intensity of the noisy speech is determined to establish how strong the noise in the noisy speech is; different speech noise reduction modes are then selected according to the noise intensity, and the noisy speech is denoised according to the selected mode. This realizes adaptive adjustment of speech noise reduction, reduces the waste of system resources, and reduces the time delay of voice interaction.

As an example, the noise intensity of the noisy speech is obtained by determining the noisy speech feature corresponding to the noisy speech, performing enhancement processing on the noisy speech feature to obtain an output speech feature, and then processing the output speech feature with an activation function. Processing the output speech feature with the activation function allows the noise intensity of the noisy speech to be predicted.

As an example, after performing enhancement processing on the noisy speech feature to obtain an output speech feature, the method further includes: performing global average pooling on the output speech feature to obtain a one-dimensional vector, and then converting the one-dimensional vector into a noisy feature based on a fully connected layer. Correspondingly, the noisy feature is mapped into a preset interval by the activation function to obtain the noise intensity. Through this dimension reduction and mapping, the output speech feature is compressed along the time dimension into a one-dimensional vector, the vector is converted into a noisy feature, and the noisy feature is mapped into the preset interval by the activation function, thereby predicting the noise intensity of the noisy speech.

As an example, performing enhancement processing on the noisy speech feature to obtain the output speech feature includes: first processing the noisy speech feature based on a multi-head attention mechanism to obtain a first intermediate feature; then performing a residual connection and layer normalization operation on the first intermediate feature to obtain a second intermediate feature; further processing the second intermediate feature based on a feedforward layer to obtain a third intermediate feature; and finally performing a residual connection and layer normalization operation on the third intermediate feature to obtain the output speech feature. The two residual connection and layer normalization operations let the model focus on the difference part in the first intermediate feature or the third intermediate feature, and processing by the feedforward layer extracts deeper feature information from the second intermediate feature, which improves the accuracy of noise intensity prediction.

As an example, before noise reduction is performed on the noisy speech based on the speech noise reduction mode to obtain the original speech, the method further includes: acquiring a noisy spectrogram obtained by performing a short-time Fourier transform on the noisy speech, and determining the noisy speech feature corresponding to the noisy speech based on the noisy spectrogram. Correspondingly, based on the speech noise reduction mode, the noisy speech is denoised according to the noisy spectrogram and the noisy speech feature to obtain the original speech. Converting the noisy speech into a noisy spectrogram reduces the amount of calculation needed to determine the noisy speech feature and improves the efficiency of determining it.

As an example, denoising the noisy speech according to the noisy spectrogram and the noisy speech feature to obtain the original speech includes: determining a mask that corresponds to the noisy speech feature and has the same size as the noisy spectrogram, multiplying the mask with the noisy spectrogram to obtain an original spectrogram, and performing an inverse Fourier transform on the original spectrogram to obtain the original speech corresponding to the noisy speech. The mask filters the noise out of the noisy spectrogram to yield the original spectrogram, thereby achieving speech noise reduction.
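The masking step can be illustrated with a minimal sketch. It assumes the spectrogram is held as nested lists of complex values (frequency bins by frames) and that the mask is real-valued in [0, 1]; the function name apply_mask and the toy values are hypothetical and are not taken from the application.

```python
def apply_mask(noisy_spec, mask):
    """Element-wise multiply a spectrogram by a mask of the same size."""
    if len(noisy_spec) != len(mask) or any(
        len(s_row) != len(m_row) for s_row, m_row in zip(noisy_spec, mask)
    ):
        raise ValueError("mask must have the same size as the spectrogram")
    return [
        [s * m for s, m in zip(s_row, m_row)]
        for s_row, m_row in zip(noisy_spec, mask)
    ]


# Toy 2x3 spectrogram (frequency bins x frames) with complex values.
noisy_spec = [[1 + 1j, 2 + 0j, 0 + 3j],
              [4 + 0j, 0 + 0j, 1 - 1j]]
# Hypothetical mask: 1 keeps a bin, 0 removes it, values in between attenuate.
mask = [[1.0, 0.5, 0.0],
        [0.0, 1.0, 1.0]]
clean_spec = apply_mask(noisy_spec, mask)
```

In a full pipeline the masked spectrogram would then be passed through the inverse transform to recover the time-domain speech; that step is omitted here.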

As an example, the noise intensity of the noisy speech is determined by the noise intensity determining module in the speech noise reduction model, the number of the speech enhancing modules in the speech noise reduction model is determined according to the noise intensity, and the noisy speech is noise reduced by using the determined number of the speech enhancing modules to obtain the original speech, wherein the number is at least one. The noise intensity of the voice with noise can be determined through the voice noise reduction model, and the number of voice enhancement modules in the voice noise reduction model is adjusted through the noise intensity, so that the self-adaptive adjustment of voice noise reduction is realized.

As one example, a speech noise reduction model is trained by: acquiring training noisy speech comprising training original speech and training noise, and determining a corresponding training processing result according to the training noisy speech through a speech noise reduction model to be trained; and according to the training processing result and the training original voice, adjusting the model parameters of the voice noise reduction model and the model parameters of the noise intensity determination module in the voice noise reduction model. The voice noise reduction model can be used for reducing noise of the noisy voices with different noise intensities by training the voice noise reduction model.

In a second aspect, an embodiment of the present application provides an electronic device, including: a processor and a memory;

wherein one or more computer programs, including instructions, are stored in the memory; the instructions, when executed by a processor, cause an electronic device to perform the method of any of the first aspects.

In a third aspect, embodiments of the present application provide a computer storage medium comprising computer instructions which, when run on an electronic device, perform a method as in any of the first aspects.

Drawings

FIG. 1 is a schematic diagram of a speech noise reduction model according to the related art;

FIG. 2 is a flowchart of a voice noise reduction method according to an embodiment of the present application;

FIG. 3 is a schematic diagram of the software structure of an electronic device according to an embodiment of the present application;

FIG. 4 is a schematic diagram of different forms of speech provided in an embodiment of the present application;

FIG. 5 is a schematic diagram of a speech noise reduction model according to an embodiment of the present disclosure;

FIG. 6a is a schematic diagram of a noise intensity determination procedure according to an embodiment of the present application;

FIG. 6b is a schematic diagram of a preprocessing convolution module according to an embodiment of the present disclosure;

FIG. 6c is a schematic diagram of converting noisy speech into a noisy spectrogram according to an embodiment of the present application;

FIG. 6d is a schematic diagram of a first output feature determination process according to an embodiment of the present application;

FIG. 6e is a schematic diagram of a post-processing convolution module according to an embodiment of the present disclosure;

FIG. 6f is a schematic diagram of a mask according to an embodiment of the present application;

FIG. 7 is a schematic diagram of filtering a noisy spectrogram using a mask according to an embodiment of the present application;

FIG. 8 is a schematic diagram of a noise reduction process of noisy speech according to an embodiment of the present application;

FIG. 9 is a schematic view of a scenario of a voice noise reduction method according to an embodiment of the present application;

FIG. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

For clarity in describing the embodiments below, several terms that may appear are first explained:

Original speech means speech that does not include noise, where noise means any sound other than the original sound. As one example, the sound of a user speaking is the original speech; if the user is near a television that is playing video, the video sound played by the television is noise. As another example, the music played by a speaker is the original speech; if a computer near the speaker is playing video, the video sound played by the computer is noise.

Noisy speech means speech that contains noise. In the embodiments described below, the speech input to the electronic device is noisy speech: sound emitted by an external device or uttered by a user, collected by a microphone of the electronic device in an environment that includes noise, so that the collected sound contains noise in addition to the original sound. The original sound refers to the sound made by the external device or the speech of the user.

In the related art, noisy speech is generally processed by a speech noise reduction model to obtain the original speech. FIG. 1 shows a speech noise reduction model 100 of the related art, which mainly includes a preprocessing convolution module 101, N speech enhancement modules 102 connected in series, and a post-processing convolution module 103.

The speech noise reduction model 100 generally converts the input noisy speech into noisy speech features through the preprocessing convolution module 101 and inputs the noisy speech features into the N speech enhancement modules 102 connected in series; the N speech enhancement modules connected in series process the noisy speech features to obtain processed features; the post-processing convolution module 103 processes the processed features to obtain a mask corresponding to the noisy speech; and the speech noise reduction model 100 uses the noisy speech and the mask to obtain the original speech.

Research on this noise reduction model shows that noisy speech may contain noise of different intensities. Even when the noise intensity of the noisy speech is low, the speech noise reduction model 100 still uses all N speech enhancement modules connected in series to process it, which wastes system resources and makes the noise reduction processing time unnecessarily long.

In order to solve the above technical problem, referring to fig. 2, a method for voice noise reduction provided in an embodiment of the present application may include:

S201: Obtain the noisy speech.

Wherein the noisy speech comprises original speech and noise. In some examples, as shown in connection with fig. 3, the noisy speech may be collected by a microphone in the hardware layer.

S202: Determine the noise intensity of the noisy speech.

The noise intensity indicates how strong the noise in the noisy speech is: the larger the noise intensity, the stronger the noise. In some examples, the noise intensity of the noisy speech may be divided into four classes: low noise intensity (0-0.25), medium-low noise intensity (0.25-0.5), medium-high noise intensity (0.5-0.75), and high noise intensity (0.75-1).

In some examples, the noise strength of the noisy speech may be determined by a speech noise reduction model in the application layer.

S203: Determine a corresponding speech noise reduction mode based on the noise intensity.

The voice noise reduction mode refers to a processing mode of voice noise reduction corresponding to noise intensity of voice with noise, for example, a first voice noise reduction mode can be adopted by low noise intensity (0-0.25), and a second voice noise reduction mode can be adopted by medium-low noise intensity (0.25-0.5). Different speech noise reduction modes require different system resources, e.g., a first speech noise reduction mode may require 1% of the system resources and a second speech noise reduction mode may require 5% of the system resources.

It should be understood that in the embodiment of the present application, the noise intensity may be divided into a plurality of levels, and different levels may be processed by adopting different voice noise reduction modes, so as to implement adaptive voice noise reduction.
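The mapping from noise intensity to a speech noise reduction mode in S203 can be sketched as a simple threshold lookup. The four classes come from the example above; the function name select_noise_reduction_mode, the numbering of modes 1 to 4, and the choice to assign a boundary value such as 0.25 to the higher class are assumptions made only for illustration.

```python
def select_noise_reduction_mode(noise_intensity):
    """Map a noise intensity in [0, 1] to one of four mode indices (1..4)."""
    if not 0.0 <= noise_intensity <= 1.0:
        raise ValueError("noise intensity must lie in [0, 1]")
    # Upper bounds of the low, medium-low, and medium-high classes.
    thresholds = (0.25, 0.5, 0.75)
    for mode, upper in enumerate(thresholds, start=1):
        if noise_intensity < upper:
            return mode
    # Everything from 0.75 up to and including 1 is the high class.
    return 4


low_mode = select_noise_reduction_mode(0.1)
high_mode = select_noise_reduction_mode(0.9)
```

A lighter mode would then be dispatched for low_mode and a heavier one for high_mode, which is where the saving in system resources comes from.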

S204: Denoise the noisy speech based on the speech noise reduction mode to obtain the original speech.

It should be understood that in the embodiment of the application, by determining the noise intensity of the voice with noise, selecting the voice noise reduction mode corresponding to each according to different noise intensities, and reducing the noise of the voice with noise according to the voice noise reduction mode, the adaptive adjustment is realized, the waste of system resources is reduced, and the voice interaction time delay is reduced.

The voice noise reduction method provided by this embodiment can be applied to an electronic device on which application programs are installed and run. With reference to FIG. 3, the Android system of a mobile phone can be divided into four layers, which from top to bottom are an application layer, an application framework layer, a kernel layer, and a hardware abstraction layer. To facilitate understanding of the present solution, a hardware layer is added after the kernel layer.

The application layer may include a series of application packages. As shown in fig. 3, the application package may include: system applications such as audio recorders, talk, notes, video; cloud voice interaction applications such as map applications, ultranotes, input methods, AI captions; local voice interaction applications, including visual and speakable applications and voice assistants. Wherein, the system application means an application of the system of the electronic equipment; the cloud voice interaction application means an application for realizing voice interaction by the cloud platform; a local voice interaction application means an application that enables voice interaction through computing capabilities local to the electronic device.

In some examples, the system applications, the cloud voice interaction applications, and the local voice interaction applications may all use the voice noise reduction model provided in the embodiments of the present application and execute the flow of the voice noise reduction method. The embodiments of the present application take only the voice assistant among the local voice interaction applications as an example; for instance, the voice assistant may include a voice noise reduction model and scenario-based voice instructions.

The voice assistant recognizes the audio, converts the audio into text, and then replies an answer corresponding to the text or performs a corresponding operation.

It should be noted that, in some examples, voices 1 to 6 shown in FIG. 4 may be divided into an original microphone signal, an audio uplink data stream, an audio downlink data stream, call audio, video audio, and voice recognition audio. The system applications can take voices 1 to 6 shown in FIG. 4 as input; the cloud voice interaction applications and the local voice interaction applications may both take voice 6 shown in FIG. 4 as input.

The application framework layer provides an application programming interface (application programming interface, API) and programming framework for application programs of the application layer. The application framework layer includes some predefined functions. As shown in fig. 3, the application framework layer may include a content provider, a resource manager, an audio manager, and the like.

The content provider is used to store and retrieve data and make such data accessible to applications. The data may include video, images, audio, calls made and received, browsing history and bookmarks, phonebooks, etc.

The resource manager provides various resources for the application program, such as localization strings, audio files, video files, and the like.

The audio manager enables the application program to record and collect voices generated by the external equipment or voices of users, and can transmit collected voice signals to a voice recognition application or a voice assistant and the like to realize voice interaction.

The kernel layer is a layer between hardware and software. The kernel layer contains at least audio drivers.

The hardware abstraction layer is an interface layer located between the kernel layer and the hardware layer, and aims to abstract the hardware. As shown in fig. 3, the hardware abstraction layer may be used to convert noisy speech collected by a microphone in the hardware layer into an audio stream at the time of input.

As shown in fig. 3, the hardware layer provided in the embodiment of the present application includes at least an advanced digital signal processor, a microphone, and a speaker.

The advanced digital signal processor may perform speech pre-processing on the noisy speech input through the microphone and pass the processed noisy speech to the hardware abstraction layer. The speech pre-processing may include echo cancellation, noise reduction, gain control, and the like.

The noise in noisy speech is divided into stationary noise and transient noise. The spectrum of stationary noise is stable. Transient noise is highly bursty, appears as a decaying oscillation in the time domain, and lasts from tens of milliseconds to hundreds of milliseconds; its spectrum largely overlaps the spectrum of normal speech and is widely distributed in the frequency domain. The noise reduction in the speech pre-processing denoises the stationary noise in the noisy speech. The speech noise reduction method provided in the embodiments of the present application is used to reduce the transient noise in the noisy speech.

The speech noise reduction model in the voice assistant shown in FIG. 3 is a model that extracts the original speech from noisy speech. It is obtained by training an initial speech noise reduction model as described below.

In some examples, the speech noise reduction model of embodiments of the present application may be trained by:

Step 11: Acquire training noisy speech; the training noisy speech includes training original speech and training noise.

In some examples, the LibriSpeech dataset (an audiobook dataset containing text and speech) or the Aishell dataset (an eight-channel Mandarin Chinese conference-scene speech dataset recorded through a microphone array) may be employed as training data for the speech noise reduction model, and the Musha dataset may be employed as noise data. For example, a plurality of items in the LibriSpeech dataset are used as training original speech, and a plurality of items in the Musha dataset are used as training noise.

The signal-to-noise ratio (SNR or S/N) refers to the ratio of the speech signal to the noise in an electronic device or system. During training, the SNR may be a random value in [0, 20]. The higher the signal-to-noise ratio, the weaker the noise intensity, that is, the less noise in the noisy speech.
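One common way to build training noisy speech at a chosen signal-to-noise ratio is to scale the noise before adding it to the clean speech. The sketch below assumes the SNR is expressed in decibels and is defined on average power; neither the unit nor the mixing procedure is stated in the application, and the function name mix_at_snr and the synthetic signals are hypothetical.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power equals `snr_db` (in dB), then add."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    if p_speech == 0.0 or p_noise == 0.0:
        raise ValueError("speech and noise must both have non-zero power")
    # Solve 10 * log10(p_speech / (scale**2 * p_noise)) = snr_db for scale.
    scale = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]


rng = random.Random(0)
# Synthetic stand-ins for a training utterance and a noise clip.
speech = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1600)]
snr_db = rng.uniform(0.0, 20.0)  # random SNR in [0, 20], as in the text
noisy = mix_at_snr(speech, noise, snr_db)
```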

Step 12: determining a corresponding training processing result according to training noisy speech through a speech noise reduction model to be trained; and according to the training processing result and the training original voice, adjusting the model parameters of the voice noise reduction model and the model parameters of the signal-to-noise ratio determining module in the voice noise reduction model.

It should be understood that the speech noise reduction models obtained by executing the steps 11 and 12 are suitable for the speech noise reduction method provided in the embodiments of the present application, and the speech noise reduction models related to the following sections refer to the trained speech noise reduction models.

In some examples, it is assumed that the speech noise reduction model includes 4 groups of speech enhancement modules, each group including 4 speech enhancement modules, that is, the model includes 16 speech enhancement modules. A preset number of test noisy speech items from the LibriSpeech dataset are used as a test set to test the speech noise reduction model, determining the number of speech enhancement modules the model requires when denoising noisy speech at different signal-to-noise ratios, together with the evaluation indexes of the test results. The evaluation indexes of the test results are also determined for the related-art speech noise reduction model including 16 speech enhancement modules. The results are shown in Table 1 below:

TABLE 1

Referring to table 1 above, under the same signal-to-noise ratio, the voice noise reduction model provided by the present application can achieve the same noise reduction effect through fewer voice enhancement modules, for example, when the signal-to-noise ratio is [15, 20], the voice noise reduction model provided by the present application only needs 5.28 voice enhancement modules to perform voice noise reduction, and the voice noise reduction model in the related art needs 16 voice enhancement modules to perform voice noise reduction.

Compared with the voice noise reduction model in the related art, the voice noise reduction model provided by the embodiment of the application reduces the waste of system resources and reduces the time delay of voice interaction.

Referring to FIG. 5, a speech noise reduction model provided in an embodiment of the present application may include a preprocessing convolution module, M groups of speech enhancement modules, a noise intensity determination module, and a post-processing convolution module. Each of the M groups includes N speech enhancement modules, where M is a positive integer greater than or equal to 2 and N is a positive integer greater than or equal to 1.

The noise intensity determination module determines the noise intensity of the noisy speech. Referring to FIG. 6a, the process of determining the noise intensity of the noisy speech provided in the embodiment of the present application may be:

S601: After the noise intensity determination module receives the noisy speech feature corresponding to the noisy speech, the noisy speech feature is processed through a multi-head attention mechanism (Multi-head Attention) in the noise intensity determination module to obtain a first intermediate feature.

The weight of the original voice features in the noisy voice features can be increased through a multi-head attention mechanism, so that the subsequent extraction of the original voice features is facilitated.

The noisy speech feature is produced by the preprocessing convolution module; the relevant explanation can be found below.

S602: Perform a residual connection and layer normalization operation (Add & Norm) on the noisy speech feature and the first intermediate feature to obtain a second intermediate feature.

Through the residual connection, the noise intensity determination module needs to attend only to the difference part in the first intermediate feature during training; the layer normalization increases the convergence speed of the noise intensity determination module during training.

S603: Process the second intermediate feature based on the feedforward layer to obtain a third intermediate feature.

Processing the second intermediate feature with the feedforward layer extracts deeper feature information from it and improves the expressive capability of the speech noise reduction model.

S604: Perform a second residual connection and layer normalization operation (Add & Norm) on the second intermediate feature and the third intermediate feature to obtain the output speech feature.

The residual connection and layer normalization operations avoid the vanishing-gradient problem that can arise when parameters are updated through gradient back propagation.
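The wiring of S601 to S604 can be sketched as follows. This is a minimal illustration, assuming features are stored as a list of frames with one list of channel values per frame, and that layer normalization is applied per frame without learned scale and shift. The attention and feedforward sublayers are passed in as callables and replaced by identity functions in the demo, so only the residual and normalization structure is shown; the names add_and_norm and enhancement_block are hypothetical.

```python
import math


def add_and_norm(x, sublayer_out, eps=1e-5):
    """Residual connection followed by layer normalization of each frame."""
    out = []
    for x_row, y_row in zip(x, sublayer_out):
        summed = [a + b for a, b in zip(x_row, y_row)]  # residual connection
        mean = sum(summed) / len(summed)
        var = sum((v - mean) ** 2 for v in summed) / len(summed)
        out.append([(v - mean) / math.sqrt(var + eps) for v in summed])
    return out


def enhancement_block(x, attention, feed_forward):
    """Compose the four steps; `attention` and `feed_forward` are placeholders."""
    first = attention(x)                 # S601: multi-head attention
    second = add_and_norm(x, first)      # S602: Add & Norm
    third = feed_forward(second)         # S603: feedforward layer
    return add_and_norm(second, third)   # S604: second Add & Norm


def identity(features):
    # Stand-in sublayer used only to make the sketch runnable.
    return features


x = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]
output = enhancement_block(x, attention=identity, feed_forward=identity)
```

Each frame of the output has approximately zero mean and unit variance, which is the effect of the final layer normalization.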

S605: Perform a global average pooling (Global-Avg Pool) operation on the output speech feature, compressing it along the time dimension and turning the two-dimensional output speech feature into a one-dimensional vector.

S606: Convert the one-dimensional vector into a noisy feature based on the fully connected layer.

S607: Process the noisy feature using the activation function (Sigmoid function) in the Sigmoid layer, mapping the noisy feature into the [0,1] interval.

Processing the noisy feature with the activation function allows the noise intensity of the noisy speech to be predicted. The noise intensity is any number in [0,1], such as 0, 1, 0.5, 0.2, 0.1, 0.8, and so on.

In the embodiment of the application, the Sigmoid layer is used as the last layer of the noise intensity determination module, so that a noise intensity between 0 and 1 is output. The noise intensity indicates how strong the noise in the noisy speech is: the larger the noise intensity, the stronger the noise and the more noise the noisy speech contains.
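Steps S605 to S607 can be sketched in a few lines. The sketch assumes the output speech feature is a list of frames with one value per channel, and that the fully connected layer maps the pooled vector to a single scalar; the output width of that layer is not stated in the text. The function name predict_noise_intensity and the weights are hypothetical.

```python
import math


def predict_noise_intensity(output_features, fc_weights, fc_bias):
    """Pool over time, apply a fully connected layer, then squash with a sigmoid."""
    num_frames = len(output_features)
    num_channels = len(output_features[0])
    # S605: global average pooling over the time dimension.
    pooled = [
        sum(frame[c] for frame in output_features) / num_frames
        for c in range(num_channels)
    ]
    # S606: fully connected layer (assumed here to produce one scalar).
    logit = sum(w * p for w, p in zip(fc_weights, pooled)) + fc_bias
    # S607: numerically stable sigmoid, mapping the value into [0, 1].
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


# Two frames, two channels; pooled vector is [2.0, 3.0].
features = [[1.0, 2.0], [3.0, 4.0]]
intensity = predict_noise_intensity(features, fc_weights=[0.5, -0.5], fc_bias=0.5)
```

With these toy weights the pre-activation value is 0, so the predicted intensity sits at the midpoint of the interval.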

After the noise intensity of the noisy speech is determined, the number of speech enhancement modules used in the speech noise reduction model is determined based on the noise intensity, so as to realize adaptive adjustment of speech noise reduction. That is, each segment of noisy speech may have a different noise intensity and therefore a different number of corresponding speech enhancement modules; the noisy speech is processed by the number of speech enhancement modules that matches its noise intensity. The greater the noise intensity, the more speech enhancement modules are required; conversely, the lower the noise intensity, the fewer are required.
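One possible rule for turning a noise intensity into a module count is sketched below. It assumes modules are enabled one whole group at a time and that the [0, 1] range is split evenly across the M groups. The text states only that a higher noise intensity calls for more modules, so this rule, the function name modules_to_run, and the default of 4 groups of 4 modules (taken from the test example above) are illustrative assumptions.

```python
def modules_to_run(noise_intensity, num_groups=4, modules_per_group=4):
    """Return how many speech enhancement modules to execute for a given intensity."""
    if not 0.0 <= noise_intensity <= 1.0:
        raise ValueError("noise intensity must lie in [0, 1]")
    # Split [0, 1] evenly across the groups; always run at least one group
    # and never more than all of them.
    active_groups = min(num_groups, int(noise_intensity * num_groups) + 1)
    return active_groups * modules_per_group


light = modules_to_run(0.1)   # low noise: one group
heavy = modules_to_run(0.9)   # high noise: all groups
```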

In some embodiments, the noisy speech feature is obtained by the preprocessing convolution module in the speech noise reduction model. Referring to FIG. 6b, the preprocessing convolution module provided in the embodiments of the present application may include two one-dimensional convolutional neural networks, and the process of determining the noisy speech feature with the preprocessing convolution module may be: processing the noisy spectrogram corresponding to the noisy speech through the two one-dimensional convolutional neural networks in turn to obtain the noisy speech feature corresponding to the noisy speech.

A one-dimensional convolutional neural network convolves two-dimensional data along its width. For example, if the size of the input data is d×s, where d is the dimension of the word vector and s is the maximum length of the sentence, the convolution kernel window slides along the sentence-length direction to carry out the convolution operation.
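The sliding-window behaviour can be made concrete with a small sketch. It assumes stride 1, no padding, and the cross-correlation form that deep learning frameworks commonly call convolution; the function name conv1d and the toy kernel are hypothetical.

```python
def conv1d(inputs, kernels, bias=None):
    """1-D convolution over a d x s input.

    inputs:  d rows (channels) by s columns (positions along the width).
    kernels: one kernel per output channel, each d rows by k columns.
    Returns one row of length s - k + 1 per output channel.
    """
    d, s = len(inputs), len(inputs[0])
    k = len(kernels[0][0])
    outputs = []
    for o, kernel in enumerate(kernels):
        b = bias[o] if bias else 0.0
        row = []
        # The window slides along the width (length) direction only.
        for t in range(s - k + 1):
            acc = b
            for c in range(d):
                for j in range(k):
                    acc += kernel[c][j] * inputs[c][t + j]
            row.append(acc)
        outputs.append(row)
    return outputs


# d = 2 channels, s = 4 positions.
inputs = [[1, 2, 3, 4],
          [0, 1, 0, 1]]
# One output channel with a window of width k = 2.
kernels = [[[1, 0],
            [0, 1]]]
features = conv1d(inputs, kernels)
```

Here each output value adds the current value of the first channel to the next value of the second channel, giving three outputs from four input positions.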

The noise-carrying voice characteristics determined by the preprocessing convolution module can be respectively used as the input of the voice enhancement module and the noise intensity determination module.

In some examples, the process of converting noisy speech to a noisy spectrogram may be:

Step 21: Read noisy speech of a preset duration in real time, as shown in FIG. 6c.

It should be understood that the duration of the input noisy speech may be longer, and in order to reduce the amount of computation, the noisy speech of a preset duration may be read for processing. In some examples, assuming that the noisy speech is speech that is input in real-time, 20 milliseconds of noisy speech may be read as input and 30 milliseconds of noisy speech may be read as input the next time. The duration of reading the noisy speech may be the same or different, and is not specifically limited herein.

Step 22: and carrying out signal pre-emphasis processing on the voice with noise.

It should be understood that, in general, the intensity of the high-frequency component of the audio signal is smaller, and the intensity of the low-frequency component is larger, so that the intensity of the high-frequency component and the low-frequency component of the noisy speech can be similar through the signal pre-emphasis processing method.

Step 23: perform framing on the pre-emphasized noisy speech.

It should be understood that the noisy speech is long. If a Fourier transform is performed on noisy speech of the original duration, only the relationship between signal frequency and intensity is obtained, and the information in the time dimension is lost. Therefore, to obtain how the frequency content changes over time, the noisy speech needs to be divided into a plurality of frames, a short-time Fourier transform is performed on each frame, and the transform results are then spliced in time order.

Step 24: after framing is completed, apply a window function to each frame to obtain windowed frames, which suppresses side lobes well and avoids a large spectral difference between the noisy speech signal and its windowed counterpart.

Step 25: perform a short-time Fourier transform on each windowed frame to obtain the spectrogram corresponding to that frame.

Step 26: splice the spectrograms of the multiple frames to obtain the noisy spectrogram corresponding to the noisy speech, as shown in fig. 6c.

The data on the noisy spectrogram represents the features of the noisy speech. The abscissa of the noisy spectrogram is time, the ordinate is frequency, and the value at each coordinate point is the energy of the noisy speech. In the embodiments of the present application, converting the noisy speech into a two-dimensional noisy spectrogram allows the noisy speech to be processed more effectively, reduces the amount of computation, and improves noise reduction efficiency.
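Steps 22 to 26 can be sketched in NumPy as follows. This is an illustrative sketch only: the function name `noisy_spectrogram`, the frame length of 256 samples, the hop of 128 samples, the pre-emphasis coefficient 0.97, the Hann window, and the synthetic 1 kHz tone standing in for noisy speech are all assumptions, none of which is fixed by this application.

```python
import numpy as np

def noisy_spectrogram(signal, frame_len=256, hop=128, alpha=0.97):
    """Pre-emphasis, framing, windowing, per-frame FFT, and splicing.

    Returns an array of shape (frame_len // 2 + 1, n_frames):
    rows are frequency bins, columns are time frames, values are energy.
    """
    signal = np.asarray(signal, dtype=float)
    # Step 22: pre-emphasis, y[n] = x[n] - alpha * x[n-1], boosts high frequencies
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Step 23: overlapping frames (trailing samples that do not fill a frame are dropped)
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx]                       # (n_frames, frame_len)
    # Step 24: apply a Hann window to every frame
    frames = frames * np.hanning(frame_len)
    # Step 25: Fourier transform of each windowed frame
    spectra = np.fft.rfft(frames, axis=1)          # (n_frames, frame_len // 2 + 1)
    # Step 26: splice in time order; energy is the squared magnitude
    return (np.abs(spectra) ** 2).T

fs = 16000                                         # assumed sampling rate in Hz
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 1000 * t)                # synthetic stand-in for noisy speech
spec = noisy_spectrogram(tone)
print(spec.shape)                                  # (129, 124)
```

With these assumed settings each frequency bin is 62.5 Hz wide, so the 1 kHz tone peaks in bin 16 of every column, which illustrates how the spectrogram keeps both frequency and time information.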

Based on the above embodiment, after the preprocessing convolution module determines the noisy speech features and the noise intensity determination module determines the noise intensity of the noisy speech, the speech noise reduction model may determine the number of speech enhancement modules according to the noise intensity and process the noisy speech features through that number of speech enhancement modules to obtain the output features. Referring to fig. 6d, the process by which one speech enhancement module determines its first output feature in the embodiments of the present application may be:

S611: after the speech enhancement module receives the noisy speech features, the noisy speech features may be processed by a multi-head attention mechanism (Multi-head Attention) to obtain a first intermediate feature.

S612: after the first intermediate feature is obtained, a residual connection and layer normalization operation (Add & Norm) is performed on the first intermediate feature to obtain a second intermediate feature.

S613: the second intermediate feature is processed by a feedforward layer to obtain a third intermediate feature.

S614: a second residual connection and layer normalization operation (Add & Norm) is performed on the third intermediate feature to obtain the first output feature of the speech enhancement module.

It should be noted that the implementation process of step S611 to step S614 is the same as the implementation process of step S601 to step S604, and the relevant explanation can be referred to above, which is not repeated here.

It should be understood that each of the plurality of speech enhancement modules provided in the embodiments of the present application performs the operations of steps S611 to S614. The speech enhancement modules are connected in series, so that the output of the i-th speech enhancement module is the input of the (i+1)-th speech enhancement module, where i is a positive integer. That is, the output feature of the last speech enhancement module is taken as the output feature.
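Steps S611 to S614 and the series connection of the modules can be sketched in NumPy as follows. This is a sketch under stated assumptions, not the model of this application: the weights are random and untrained, the feature size of 16, the 4 attention heads, the ReLU feedforward layer, the layer normalization without learnable scale and offset, and all function and parameter names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def init_module(rng, dim, hidden):
    """Random (untrained) weights for one module; shapes are assumptions."""
    shapes = {"wq": (dim, dim), "wk": (dim, dim), "wv": (dim, dim),
              "wo": (dim, dim), "w1": (dim, hidden), "w2": (hidden, dim)}
    return {n: rng.standard_normal(s) / np.sqrt(s[0]) for n, s in shapes.items()}

def enhancement_module(x, p, n_heads):
    """x: (T, dim) features, returns (T, dim) features."""
    T, dim = x.shape
    hd = dim // n_heads
    split = lambda m: m.reshape(T, n_heads, hd).transpose(1, 0, 2)  # (heads, T, hd)
    q, k, v = split(x @ p["wq"]), split(x @ p["wk"]), split(x @ p["wv"])
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(hd)) @ v      # (heads, T, hd)
    first = attn.transpose(1, 0, 2).reshape(T, dim) @ p["wo"]       # S611: multi-head attention
    second = layer_norm(x + first)                                  # S612: Add & Norm
    third = np.maximum(second @ p["w1"], 0.0) @ p["w2"]             # S613: feedforward layer
    return layer_norm(second + third)                               # S614: Add & Norm

def run_modules(x, modules, n_heads=4):
    """Series connection: output of module i is the input of module i+1."""
    for p in modules:
        x = enhancement_module(x, p, n_heads)
    return x

rng = np.random.default_rng(0)
features = rng.standard_normal((10, 16))           # hypothetical noisy speech features
modules = [init_module(rng, dim=16, hidden=32) for _ in range(3)]
output_features = run_modules(features, modules)
print(output_features.shape)                       # (10, 16)
```

Because every module maps a (T, dim) input to a (T, dim) output, the list passed to `run_modules` can hold any number of modules, which is what allows the number of modules to vary with the noise intensity.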

It should be noted that, by processing the noisy speech features through a plurality of speech enhancement modules, the output features can be determined, and the mask corresponding to the noisy speech can be determined by using the output features. Referring to fig. 6e, the process of generating a mask by the post-processing convolution module provided in the embodiment of the present application may be:

S621: process the output features through a one-dimensional convolutional neural network to obtain a one-dimensional vector.

S622: in the activation function layer, process the one-dimensional vector based on the sigmoid function, setting the value of each element corresponding to the original speech features to 1 and the value of each element corresponding to the noise speech features to 0, to obtain the mask shown in fig. 6f.

It should be noted that a mask can be understood as a film laid over the noisy spectrogram of the noisy speech to mask or select certain feature elements. That is, noise elements on the noisy spectrogram can be filtered out through the mask. It should be appreciated that the noisy spectrogram may include m×n elements, comprising original speech elements, which correspond to the original speech portion, and noise elements, which correspond to the noise portion.

It should be understood that, after processing by the plurality of speech enhancement modules, the output features distinguish the original speech features from the noise speech features, so the sigmoid function can be used to set the values of the elements corresponding to the original speech features to 1 and the values of the elements corresponding to the noise speech features to 0, yielding the mask corresponding to the output features. This mask is used to filter the noise in the noisy spectrogram. The original speech features are the features corresponding to the original speech, and the noise speech features are the features corresponding to the noise.
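The mask generation can be illustrated with the following sketch. A sigmoid function by itself yields values strictly between 0 and 1, so this sketch assumes a threshold of 0.5 to reach the exact values 1 and 0; that threshold, the function name `make_mask`, and the input values are hypothetical and are not stated in this application.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_mask(values, threshold=0.5):
    """Map post-convolution values to a 0/1 mask.

    Elements whose sigmoid output reaches the (assumed) threshold are treated
    as original speech and set to 1; all others are treated as noise and set to 0.
    """
    soft = sigmoid(np.asarray(values, dtype=float))
    return (soft >= threshold).astype(float)

values = np.array([[4.0, -3.0],
                   [-5.0, 2.5]])       # hypothetical outputs of step S621
mask = make_mask(values)
print(mask)
```

In this example the two positive inputs are mapped to 1 and the two negative inputs to 0.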

After the mask corresponding to the output features is determined, the noisy spectrogram corresponding to the noisy speech is filtered through the mask to obtain the original spectrogram. Referring to fig. 7, the process of filtering the noisy spectrogram with the mask may be: multiplying the mask by the noisy spectrogram element by element, thereby filtering out the noise elements in the noisy spectrogram and obtaining the original spectrogram.

The mask has the same size as the noisy spectrogram. An element whose value is 0 in the mask, multiplied by the element at the same position in the noisy spectrogram, gives a value of 0 for the element at that position in the original spectrogram, so the noise element in the noisy spectrogram is filtered out. As an example, at position X in fig. 7, the value of the mask element is 0 and the value of the corresponding element of the noisy spectrogram is 0.2; multiplying the two gives a value of 0 for the element at position X in the original spectrogram.
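The position-X example can be reproduced with a small element-wise product. Apart from the 0.2 and the mask value 0 taken from the example above, the array values and the 2×2 size are hypothetical.

```python
import numpy as np

noisy_spec = np.array([[0.9, 0.2],
                       [0.1, 0.7]])    # 0.2 plays the role of position X
mask = np.array([[1.0, 0.0],
                 [0.0, 1.0]])          # 0 at position X marks a noise element

assert mask.shape == noisy_spec.shape  # the mask and the spectrogram have the same size
original_spec = mask * noisy_spec      # element-wise multiplication
print(original_spec)
```

Elements under a mask value of 1 pass through unchanged, while elements under a mask value of 0, including position X, become 0.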

In some examples, as shown in fig. 8, the process of denoising the noisy speech based on the speech noise reduction model may be:

Step 31: obtain the noisy spectrogram by applying a short-time Fourier transform to the noisy speech signal corresponding to the noisy speech.

Step 32: input the noisy spectrogram into the speech noise reduction model and obtain a mask from the model.

Step 33: filter the noise elements in the noisy spectrogram through the mask to obtain the original spectrogram.

Step 34: perform an inverse Fourier transform on the original spectrogram to obtain the original speech.

The noisy spectrogram, the mask and the original spectrogram are the same in size.

In the embodiments of the present application, the noisy speech is converted into a two-dimensional noisy spectrogram; a mask is generated by the speech enhancement modules in the speech noise reduction model and used to filter the noise in the noisy spectrogram, yielding the original spectrogram; and the original spectrogram is converted back into the original speech, thereby realizing adaptive noise reduction of the noisy speech.
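The flow of steps 31 to 34 can be sketched end to end as follows. The sketch makes several assumptions that this application does not state: a periodic Hann window with 50% overlap, reconstruction by overlap-add, application of the mask to the complex short-time spectrum so that phase is retained for the inverse transform, and a caller-supplied `mask_fn` standing in for the speech noise reduction model. All function names are hypothetical.

```python
import numpy as np

def stft(x, n=256):
    hop = n // 2
    win = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n) / n)    # periodic Hann window
    n_frames = 1 + (len(x) - n) // hop
    idx = np.arange(n)[None, :] + hop * np.arange(n_frames)[:, None]
    return np.fft.rfft(x[idx] * win, axis=1).T                # (frequency, time)

def istft(spec, n=256):
    hop = n // 2
    frames = np.fft.irfft(spec.T, n=n, axis=1)                # (time, n)
    out = np.zeros(hop * (frames.shape[0] - 1) + n)
    for i, frame in enumerate(frames):                        # overlap-add
        out[i * hop:i * hop + n] += frame
    return out

def denoise(noisy, mask_fn, n=256):
    spec = stft(noisy, n)                     # Step 31: noisy spectrogram
    mask = mask_fn(np.abs(spec) ** 2)         # Step 32: mask (stand-in for the model)
    assert mask.shape == spec.shape           # same size as the noisy spectrogram
    original_spec = mask * spec               # Step 33: filter noise elements
    return istft(original_spec, n)            # Step 34: back to a waveform

rng = np.random.default_rng(0)
noisy = rng.standard_normal(2048)             # synthetic stand-in for noisy speech
passthrough = denoise(noisy, lambda energy: np.ones_like(energy))
silenced = denoise(noisy, lambda energy: np.zeros_like(energy))
print(np.allclose(passthrough[128:-128], noisy[128:-128]))    # True
```

A mask of all ones returns the input unchanged away from the first and last half frame, which are attenuated by the window, and a mask of all zeros returns silence. A trained model would produce a mask between these two extremes.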

The voice noise reduction method provided by the embodiments of the present application can be applied to various application scenarios. The implementation of the method is described below, taking a scenario in which a user wakes up a voice assistant and a user driving scenario as examples.

In the scenario where the user is watching television, see B in fig. 9: while the television is playing audio, the user may wake up the smart speaker by saying "Hello, speaker assistant".

In the user driving scenario, see A in fig. 9: while driving the vehicle, the user may wake up the mobile phone's voice assistant by saying "Hi, voice assistant".

The mobile phone or smart speaker receives the specific speech, such as "Hi, voice assistant" or "Hello, speaker assistant", through its microphone; the speech noise reduction model of the mobile phone or smart speaker then performs noise reduction on the input noisy speech to obtain the original speech; the speech recognition model of the mobile phone or smart speaker performs speech recognition on the original speech; and after recognition succeeds, the mobile phone's voice assistant or the speaker assistant performs the related operation or response.

The speech noise reduction model of the mobile phone or smart speaker can process the noisy speech features through the noise intensity determining module to obtain the noise intensity of the noisy speech, and then, based on that noise intensity, perform noise reduction on the noisy speech features with a corresponding number of speech enhancement modules, thereby realizing adaptive noise reduction.
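One way the final stage of the noise intensity determining module could look, following the global average pooling, fully connected layer, and activation function recited in claim 2, is sketched below. The sigmoid activation, the interval (0, 1), the single-output fully connected layer, and the random untrained parameters are assumptions made for illustration.

```python
import numpy as np

def noise_intensity(output_features, weight, bias):
    """output_features: (T, dim); weight: (dim,); bias: scalar.

    Global average pooling over time, then a fully connected layer with one
    output, then a sigmoid that maps the result into the interval (0, 1).
    """
    pooled = output_features.mean(axis=0)      # one-dimensional vector, shape (dim,)
    score = pooled @ weight + bias             # fully connected layer
    return 1.0 / (1.0 + np.exp(-score))        # activation function

rng = np.random.default_rng(1)
feats = rng.standard_normal((10, 16))          # hypothetical output speech features
w, b = rng.standard_normal(16), 0.0            # untrained, illustrative parameters
intensity = noise_intensity(feats, w, b)
print(0.0 < intensity < 1.0)                   # True
```

Pooling over the time axis makes the result independent of how many frames were read, so segments of different durations yield a single comparable intensity value.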

In some examples, the switch of the speech noise reduction model may be switched to node (1) when the noise intensity of the noisy speech is 0.00-0.25; when the noise intensity of the voice with noise is 0.25-0.50, the switch of the voice noise reduction model can be switched to the node (2); when the noise intensity of the voice with noise is 0.50-0.75, the switch of the voice noise reduction model can be switched to the node (3); when the noise intensity of the voice with noise is 0.75-1.00, the switch of the voice noise reduction model can be switched to the node (4).
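The switching rule above can be written as a small lookup. The stated ranges share their end points, so this sketch assumes that the boundary values 0.25, 0.50, and 0.75 are assigned to the higher node; the function name `select_node` is likewise hypothetical.

```python
def select_node(noise_intensity):
    """Return the switch node (1 to 4) for a noise intensity in [0, 1]."""
    if not 0.0 <= noise_intensity <= 1.0:
        raise ValueError("noise intensity is expected in [0, 1]")
    # Assumption: a value exactly on a boundary goes to the higher node.
    if noise_intensity < 0.25:
        return 1
    if noise_intensity < 0.50:
        return 2
    if noise_intensity < 0.75:
        return 3
    return 4

for value in (0.10, 0.30, 0.60, 0.90):
    print(value, "-> node", select_node(value))
```

Lower intensities select lower-numbered nodes and therefore fewer speech enhancement modules, which is where the saving in computation and latency comes from.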

By way of example, assume that the environment of A in fig. 9 is relatively quiet and the environment of B in fig. 9 is relatively noisy, that is, the noise intensity corresponding to A in fig. 9 is small and that corresponding to B in fig. 9 is large. Then, in the environment of A in fig. 9, node (2) may be selected based on the smaller noise intensity, that is, 2 groups of speech enhancement modules may be selected to process the noisy speech features; in the environment of B in fig. 9, node (3) may be selected based on the greater noise intensity, that is, 3 groups of speech enhancement modules may be selected to process the noisy speech features. This realizes adaptive noise reduction, reduces the waste of system resources, improves noise reduction efficiency, and further reduces the delay of voice interaction.

The voice noise reduction method provided by the embodiments above can be applied to electronic devices, which can be mobile phones, tablet computers, notebook computers, wearable electronic devices (such as smart watches), devices with voice interaction functions (such as smart speakers, smart televisions, smart refrigerators, intelligent access control systems), vehicle-mounted devices and the like, and the specific form of the electronic devices is not particularly limited.

The hardware structure of the electronic device is described below, taking a mobile phone as an example.

As shown in fig. 10, the electronic device may include a processor 1010, an internal memory 1020, a universal serial bus (universal serial bus, USB) interface 1030, an audio module 1040, a speaker 1040A, a receiver 1040B, a microphone 1040C, and an earphone interface 1040D.

It is to be understood that the configuration illustrated in this embodiment does not constitute a specific limitation on the electronic apparatus. In other embodiments, the electronic device may include more or fewer components than shown, or certain components may be combined, or certain components may be split, or different arrangements of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.

The processor 1010 is configured to perform the voice noise reduction method provided in the above embodiment.

The processor 1010 may include one or more processing units, such as: the processor 1010 may include an advanced digital signal processor (analog digital signal processor, ADSP) and/or a neural Network Processor (NPU), etc. Wherein the different processing units may be separate devices or may be integrated in one or more processors.

An advanced digital signal processor is used to process the audio signal. For example, when a user enters noisy speech through the microphone 1040C of the handset, an advanced digital signal processor is used to process and analyze the noisy speech, etc. In some examples, the advanced digital signal processor may perform speech pre-processing on the noisy speech and input the noisy speech to the hardware abstraction layer.

The NPU is a neural-network (NN) computing processor that, by drawing on the structure of biological neural networks, for example the way signals are transmitted between neurons in the human brain, can process input information rapidly and can also continuously perform self-learning. Applications such as intelligent cognition of the electronic device, for example speech recognition and speech wake-up, can be realized through the NPU. In some examples, the NPU may be used for inference with the speech noise reduction model.

In some embodiments, the processor 1010 may include one or more interfaces. The interfaces may include an integrated circuit built-in audio (inter-integrated circuit sound, I2S) interface, a mobile industry processor interface (mobile industry processor interface, MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (subscriber identity module, SIM) interface, and/or a universal serial bus (universal serial bus, USB) interface, among others.

In some embodiments, the processor 1010 may include multiple sets of I2S buses. The processor 1010 may be coupled to the audio module 1040 via an I2S bus to enable communication between the processor 1010 and the audio module 1040.

The USB interface 1030 is an interface conforming to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type C interface, or the like. The USB interface 1030 may be used to connect headphones through which audio is played. The USB interface 1030 may also be used to connect to a wired headset and receive noisy speech through a microphone provided on the wired headset.

It should be understood that the connection relationship between the modules illustrated in this embodiment is only illustrative, and does not limit the structure of the electronic device. In other embodiments of the present application, the electronic device may also use interfacing manners different from those in the foregoing embodiments, or a combination of multiple interfacing manners.

The internal memory 1020 may be used to store computer-executable program code, which comprises instructions. The processor 1010 executes various functional applications and data processing of the electronic device by executing the instructions stored in the internal memory 1020. The internal memory 1020 may include a program storage area and a data storage area. The program storage area may store an operating system, an application program required for at least one function (such as a sound playing function or an image playing function), and the like. The data storage area may store data created during use of the electronic device (such as audio data and a phonebook), and the like. In addition, the internal memory 1020 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or universal flash storage (universal flash storage, UFS). The processor 1010 performs various functional applications and data processing of the electronic device by executing instructions stored in the internal memory 1020 and/or instructions stored in a memory provided in the processor.

The electronic device may implement audio functions, such as music playing and recording, through the audio module 1040, the speaker 1040A, the receiver 1040B, the microphone 1040C, the earphone interface 1040D, and so forth.

The audio module 1040 is used to convert digital audio information to an analog audio signal output and also to convert an analog audio input to a digital audio signal. The audio module 1040 may also be used to encode and decode audio signals. In some embodiments, the audio module 1040 may be disposed in the processor 1010, or some functional modules of the audio module 1040 may be disposed in the processor 1010.

The speaker 1040A, also called a "horn", is used to convert audio electrical signals into sound signals. In some examples, the speaker 1040A may play audio corresponding to the speech recognition result, or audio indicating a successful voice wake-up. In some examples, the user may listen to this audio through the speaker 1040A of the mobile phone.

A receiver 1040B, also referred to as an "earpiece", is used to convert the audio electrical signal into a sound signal. In some examples, the user may bring the receiver 1040B close to the ear and listen to the audio corresponding to the speech recognition result.

A microphone 1040C, also referred to as a "mic", is used to convert sound signals into electrical signals. In some embodiments, the microphone 1040C of the mobile phone may collect noisy speech. As shown in connection with fig. 3, the microphone 1040C may collect noisy speech and pass it to the processor 1010; the advanced digital signal processor in the processor 1010 may perform speech pre-processing on the noisy speech and input it to the hardware abstraction layer. The hardware abstraction layer in fig. 3 treats the input noisy speech as an audio stream, which is sent, via the audio driver of the kernel layer, to the audio processor of the application layer; the audio processor then supplies the noisy speech as input to the speech recognition application or the voice assistant.

The earphone interface 1040D is used for connecting a wired earphone. The earphone interface 1040D may be the USB interface 1030, or may be a 3.5 mm open mobile terminal platform (open mobile terminal platform, OMTP) standard interface or a cellular telecommunications industry association of the USA (cellular telecommunications industry association of the USA, CTIA) standard interface.

The terms "first", "second", "third" and the like in the description, the claims, and the drawings are used to distinguish between different objects, not to define a particular order.

In the embodiments of the present application, words such as "exemplary" or "such as" are used to mean serving as examples, illustrations, or descriptions. Any embodiment or design described herein as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.

The technical solution of the present embodiment, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or part of the steps of the methods described in the respective embodiments. The aforementioned storage medium includes: flash memory, a removable hard disk, read-only memory, random access memory, a magnetic disk, an optical disk, and the like.

Claims

1. A method of voice noise reduction, comprising:

acquiring voice with noise; the noisy speech comprises original speech and noise;

selecting a corresponding voice noise reduction mode according to the noise intensity of the voice with noise;

denoising the noisy speech based on the voice noise reduction mode to obtain the original voice;

the noise intensity of the voice with noise is obtained by the following steps:

determining the characteristics of the noisy speech corresponding to the noisy speech;

processing the noisy speech feature based on a multi-head attention mechanism to obtain a first intermediate feature;

performing residual connection and layer normalization operation on the first intermediate feature to obtain a second intermediate feature;

processing the second intermediate feature based on a feedforward layer to obtain a third intermediate feature;

performing residual connection and layer normalization operation on the third intermediate feature to obtain an output voice feature;

and processing the output voice characteristics by using an activation function to obtain the noise intensity of the voice with noise.

2. The method of claim 1, wherein after said enhancing said noisy speech feature to obtain an output speech feature, said method further comprises:

carrying out global average pooling on the output voice characteristics to obtain a one-dimensional vector;

converting the one-dimensional vector into noisy features based on a full connection layer;

the processing the noisy feature by using an activation function to obtain the noise intensity of the noisy speech comprises:

and mapping the noisy features into a preset interval by using an activation function to obtain the noise intensity.

3. The method of claim 1, wherein prior to denoising the noisy speech based on the speech denoising mode to obtain the original speech, the method further comprises:

acquiring a noisy spectrogram corresponding to the noisy speech; the noisy spectrogram is obtained by carrying out short-time Fourier transform on the noisy speech;

determining the characteristics of the noisy speech corresponding to the noisy speech based on the noisy spectrogram;

the denoising the noisy speech based on the voice noise reduction mode to obtain the original voice comprises:

denoising, based on the voice noise reduction mode, the noisy speech according to the noisy spectrogram and the noisy speech features to obtain the original voice.

4. The method of claim 3, wherein said denoising said noisy speech from said noisy spectrogram and said noisy speech features to obtain said original speech comprises:

determining a mask corresponding to the noisy speech features; the size of the mask is the same as that of the noisy spectrogram;

multiplying the mask with the noisy spectrogram to obtain an original spectrogram;

and performing inverse Fourier transform on the original spectrogram to obtain the original voice corresponding to the voice with noise.

5. The method of any of claims 1-4, wherein the determining the noise strength of the noisy speech comprises:

determining the noise intensity of the noise-carrying voice through a noise intensity determining module in the voice noise reduction model;

the selecting a corresponding voice noise reduction mode according to the noise intensity includes:

determining the number of voice enhancement modules in the voice noise reduction model according to the noise intensity; said number being at least one;

and denoising the noisy speech by using the number of speech enhancement modules to obtain the original speech.

6. The method of claim 5, wherein the speech noise reduction model is trained by:

acquiring training noisy speech; the training noisy speech comprises training original speech and training noise;

determining a corresponding training processing result according to the training noisy speech through a speech noise reduction model to be trained; and according to the training processing result and the training original voice, adjusting model parameters of the voice noise reduction model and model parameters of a noise intensity determining module in the voice noise reduction model.

7. An electronic device, comprising: a processor and a memory;

wherein one or more computer programs are stored in the memory, the one or more computer programs comprising instructions; the instructions, when executed by the processor, cause the electronic device to perform the method of any of claims 1-6.

8. A computer storage medium comprising computer instructions which, when run on an electronic device, perform the method of any of claims 1-6.