CN118098260A - Voice signal processing method and related equipment - Google Patents

Voice signal processing method and related equipment

Info

Publication number
CN118098260A
Authority
CN
China
Prior art keywords
signal
voice signal
gradient
noise reduction
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410350758.1A
Other languages
Chinese (zh)
Inventor
王泰辉
夏日升
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Honor Device Co Ltd
Original Assignee
Honor Device Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Honor Device Co Ltd filed Critical Honor Device Co Ltd
Priority to CN202410350758.1A
Publication of CN118098260A
Pending legal-status Current

Landscapes

  • Circuit For Audible Band Transducer (AREA)

Abstract

The application provides a speech signal processing method and related equipment, applied to the audio field. In the method, a speech signal to be processed (in particular a far-field speech signal) is first denoised based on a diffusion model, and its high-frequency components are then restored based on a second diffusion model, so as to obtain a target speech signal. The noise reduction makes the target speech signal sound clearer, and the restoration makes it sound fuller than noise reduction alone would. During restoration, the speech signal before noise reduction is used as the condition of the conditional diffusion model, that is, as reference information for restoring the high-frequency components, which reduces the interference that the noise reduction processing may cause to the restoration processing and improves the restoration effect. Because two diffusion models process the far-field speech signal in two separate stages, the algorithm is less complex and less computing power is needed than if a single model processed the far-field speech signal to the same effect.

Description

Voice signal processing method and related equipment
Technical Field
The present application relates to the field of terminal technologies, and in particular, to a method for processing a voice signal and a related device.
Background
Collecting speech signals at long distance with a terminal device (also described as collecting or picking up far-field speech signals) is very challenging. Long distance, or far field, means that the microphone used to pick up the sound is far from the sound source. In the acoustic field, a microphone-to-source distance of 10 m or more is generally referred to as long distance, and a distance of 20 m or 30 m may be referred to as ultra-long distance.
At present, far-field speech signals are mainly picked up with a microphone array, and the picked-up far-field speech signals are enhanced according to an acoustic model, so that the speech signal from a target direction is obtained.
Disclosure of Invention
The application provides a speech signal processing method and related equipment, which can make a collected speech signal (in particular a far-field speech signal) sound clearer and fuller.
In a first aspect, a speech signal processing method is provided, applied to an electronic device, and includes: acquiring a first speech signal to be processed; performing noise reduction processing on the first speech signal based on a first diffusion model to obtain a second speech signal; and performing restoration processing on the first speech signal and the second speech signal based on a second diffusion model to obtain a target speech signal, where the second diffusion model is a conditional diffusion model and the restoration processing includes: restoring the high-frequency components of the second speech signal with the first speech signal as the condition of the second diffusion model.
According to this scheme, the noise reduction makes the target speech signal sound clearer, and the restoration of the high-frequency components makes it sound fuller. Compared with a scheme that only performs noise reduction on the far-field speech signal, the target speech signal is fuller in hearing.

Moreover, since the attenuated high-frequency components may be submerged in noise, the noise reduction processing may remove them together with the noise, which can degrade the restoration effect. During restoration, the speech signal before noise reduction is therefore used as the condition of the conditional diffusion model, that is, as reference information for restoring the high-frequency components, which reduces the interference that the noise reduction processing may cause to the restoration processing and improves the restoration effect.

In addition, the processing of the first speech signal is decomposed into two steps handled by two diffusion models. Compared with processing the far-field speech signal with a single model to the same effect, this reduces the complexity of the algorithm and saves computing power: a single model would need a more complex algorithm to achieve what the two models achieve, placing higher demands on the power consumption, computing power, and the like of the terminal device. A minimal sketch of the two-stage pipeline follows.
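The following Python sketch illustrates the two-stage structure only; the function names denoise_diffusion and repair_diffusion are hypothetical placeholders rather than names from this application, and the STFT parameters are illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def denoise_diffusion(x):
    return x  # placeholder for the first diffusion model (noise reduction)

def repair_diffusion(y, condition):
    return y  # placeholder for the second, conditional diffusion model (restoration)

def process_speech(s, fs=16000):
    _, _, x = stft(s, fs=fs, nperseg=512)     # first speech signal x(f, t)
    y = denoise_diffusion(x)                  # second speech signal y(f, t)
    # The pre-denoising signal x is the condition, i.e. the reference
    # information for restoring the attenuated high-frequency components.
    r = repair_diffusion(y, condition=x)      # restored signal r(f, t)
    _, target = istft(r, fs=fs, nperseg=512)  # target speech signal, time domain
    return target
```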
In one possible embodiment, the first speech signal is a far-field speech signal, or the distance between the microphone of the electronic device and the sound source of the first speech signal is greater than or equal to a preset distance threshold.

The speech processing of the embodiments of the present application is particularly effective when picking up far-field speech signals.
In one possible embodiment, performing noise reduction processing on the first speech signal based on the first diffusion model to obtain the second speech signal includes: calculating a gradient of the first speech signal, where the gradient characterizes the probability distribution of the second speech signal, and that probability distribution corresponds to the distribution of time-frequency points in a spectrogram of the second speech signal; and sampling according to the gradient of the first speech signal to obtain the second speech signal.
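Read in score-based diffusion terms, which is an assumed interpretation rather than a formula printed in this application, the "gradient" is the score of the clean-signal distribution conditioned on the noisy observation:

```latex
s_\theta(x_t, y, t) \;\approx\; \nabla_{x_t} \log p_t(x_t \mid y)
```

where $y$ denotes the first (noisy) speech signal, $x_t$ the current sample state, and $p_t$ the distribution whose samples correspond to the time-frequency point distribution of the second speech signal.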
It can be appreciated that considering the probability distribution of the second speech signal means considering the distribution of its time-frequency points as a whole, where the whole includes the relationships among the time-frequency points.

It can further be understood that when time-frequency points are analyzed one by one without considering this wholeness, even if the total error (for example, a first error value) between the time-frequency points of the denoised signal and the corresponding points of an ideal clean signal (the signal obtained by removing all noise from the first speech signal in the ideal case) is very small, it is quite likely that a small portion of the points have a very large error while the majority have a very small one. Because of that badly erred portion, the denoised speech is in principle damaged to some extent, its spectral continuity is poor, and it sounds worse.

Compared with a scheme that ignores the wholeness of the time-frequency point distribution, the present application takes that wholeness into account, so that when the total error is controlled to the same small value (such as the first error value), the errors of almost all points are balanced and small; in principle this reduces the damage that the noise reduction processing does to the speech signal.
In one possible embodiment, calculating the gradient of the first speech signal includes: inputting the first speech signal into a first neural network to obtain a noise reduction gradient to be sampled of the first speech signal, where this gradient characterizes the probability distribution of the second speech signal. Sampling the gradient of the first speech signal includes: predicting a noise reduction sampling signal according to the noise reduction gradient to be sampled. Calculating the gradient of the first speech signal further includes: inputting the noise reduction sampling signal into the first neural network to obtain a noise reduction gradient to be corrected of the noise reduction sampling signal, where this gradient also characterizes the probability distribution of the second speech signal. Sampling the gradient of the first speech signal further includes: correcting the noise reduction sampling signal according to the noise reduction gradient to be corrected to obtain a noise reduction correction signal; and generating the second speech signal according to the noise reduction correction signal.
According to this scheme, a noise reduction gradient to be corrected is additionally computed for the noise reduction sampling signal that was predicted from the noise reduction gradient to be sampled, and the sampling signal is corrected based on it; the noise reduction effect is therefore better than that of a diffusion model that computes the gradient only once, or one that only predicts without correcting.

That is, sampling according to the gradient includes two sub-steps, prediction and correction. Illustratively, the prediction step uses ancestral sampling, and the correction step uses Langevin dynamics sampling or annealed Langevin dynamics sampling to correct the result of the prediction step. For the prediction and correction steps in the present application, reference may be made to the examples herein; they are not repeated here. A sketch of one such predictor-corrector step follows.
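The following is a minimal predictor-corrector sketch under the standard score-SDE formulation, which is an assumption; score_net, f, and g are hypothetical stand-ins for the trained gradient network and the drift and diffusion coefficients, and the snr constant is an illustrative choice.

```python
import numpy as np

def pc_step(x, t, dt, f, g, score_net, rng, snr=0.16):
    # Predictor: an ancestral / Euler-Maruyama step along the reverse SDE.
    score = score_net(x, t)                  # gradient to be sampled
    rev_drift = f(x, t) - g(t) ** 2 * score  # inverse drift coefficient
    z = rng.standard_normal(x.shape)         # Gaussian noise value
    x = x - rev_drift * dt + g(t) * np.sqrt(dt) * z

    # Corrector: one annealed Langevin dynamics step.
    score = score_net(x, t)                  # gradient to be corrected
    z = rng.standard_normal(x.shape)
    eps = 2.0 * (snr * np.linalg.norm(z) / np.linalg.norm(score)) ** 2
    x = x + eps * score + np.sqrt(2.0 * eps) * z
    return x
```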
In one possible embodiment, predicting the noise reduction sampling signal from the noise reduction gradient to be sampled includes: calculating a noise reduction drift coefficient from the first speech signal based on a stochastic differential equation of the first diffusion model; calculating a noise reduction inverse drift coefficient from the noise reduction drift coefficient and the noise reduction gradient to be sampled; and predicting the noise reduction sampling signal from the noise reduction inverse drift coefficient, the first speech signal, and a fourth Gaussian noise value, where the fourth Gaussian noise value is generated based on a third random seed.

It will be appreciated that the inverse drift coefficient describes the path from the noisy speech signal toward the clean speech signal during sampling; sampling the gradient based on the inverse drift coefficient therefore achieves noise reduction.
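In the standard score-SDE formulation, again an assumed reading rather than an equation printed here, a forward SDE $\mathrm{d}x = f(x,t)\,\mathrm{d}t + g(t)\,\mathrm{d}w$ with drift coefficient $f$ and diffusion coefficient $g$ has the reverse-time SDE

```latex
\mathrm{d}x = \underbrace{\left[ f(x,t) - g(t)^2 \, \nabla_x \log p_t(x) \right]}_{\text{inverse drift}} \mathrm{d}t + g(t)\, \mathrm{d}\bar{w},
```

so the inverse drift coefficient is obtained exactly as described above: from the drift coefficient and the gradient (score).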
In one possible embodiment, the first neural network is trained based on first input data and first target data, where the first input data includes a sample noisy speech signal and a sample noise reduction sampling signal, the sample noisy speech signal is generated from a first sample noise signal and a first sample speech signal, the sample noise reduction sampling signal is generated based on the sample noisy speech signal, and the first target data characterizes the probability distribution of the first sample speech signal.

According to this scheme, the neural network model is trained with target data that characterizes the probability distribution of the sample speech signal contained in the sample noisy speech signal, so that the gradient output by the trained neural network model can characterize the probability distribution of the second speech signal.
In one possible embodiment, the sample noise reduction sampling signal is a fifth Gaussian noise value generated from the sample noisy speech signal, the first sample speech signal, and a sixth Gaussian noise value that follows a standard normal distribution; the first target data is generated based on the standard deviation of the fifth Gaussian noise value and on the sixth Gaussian noise value, and the sixth Gaussian noise value is generated based on a fourth random seed.
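This matches the usual denoising score matching setup, which is an assumed reading: with a perturbed sample $\tilde{x} = \mu(x_0, y) + \sigma_t z$, $z \sim \mathcal{N}(0, I)$ (corresponding to the fifth and sixth Gaussian noise values above), the training target is $-z/\sigma_t$, giving the loss

```latex
\mathcal{L}(\theta) = \mathbb{E}\left[ \left\| s_\theta(\tilde{x}, y, t) + \frac{z}{\sigma_t} \right\|^2 \right],
```

which depends only on the standard deviation $\sigma_t$ and the noise $z$, consistent with how the first target data is said to be generated.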
It can be understood that, because the random numbers generated from a random seed differ at different times, when the speech signal processing method provided by the application performs noise reduction on the same segment of to-be-processed speech at different times, the second speech signals output each time are different (or not identical), so the finally output target speech signals are also different (or not identical).
In one possible embodiment, performing restoration processing on the first speech signal and the second speech signal based on the second diffusion model to obtain the target speech signal includes: calculating a gradient of the second speech signal from the first speech signal and the second speech signal, where the gradient characterizes the probability distribution of the target speech signal, and that probability distribution corresponds to the distribution of time-frequency points in a spectrogram of the target speech signal; and sampling the gradient of the second speech signal based on a first Gaussian noise value to obtain the target speech signal, where the first Gaussian noise value is generated based on a first random seed.

According to this scheme, Gaussian noise is superimposed according to the gradient in the sampling step; when the Gaussian noise conforms to the gradient, that is, to the probability distribution of the target speech signal, the attenuated high-frequency components are enhanced and thereby restored.

It can be appreciated that a conventional discriminative neural network cannot, like a generative neural network, restore high-frequency components by superimposing Gaussian noise according to a probability distribution.
In one possible embodiment, calculating the gradient of the second speech signal from the first speech signal and the second speech signal includes: inputting the first speech signal and the second speech signal into a second neural network to obtain a repair gradient to be sampled of the second speech signal, where this gradient characterizes the probability distribution of the target speech signal. Sampling the gradient of the second speech signal based on the first Gaussian noise value includes: predicting a repair sampling signal according to the repair gradient to be sampled and the first Gaussian noise value. Calculating the gradient of the second speech signal further includes: inputting the repair sampling signal into the second neural network to obtain a repair gradient to be corrected of the repair sampling signal, where this gradient also characterizes the probability distribution of the target speech signal. Sampling the gradient of the second speech signal further includes: correcting the repair sampling signal according to the repair gradient to be corrected to obtain a repair correction signal; and generating the target speech signal according to the repair correction signal.

According to this scheme, a repair gradient to be corrected is additionally computed for the repair sampling signal that was predicted from the repair gradient to be sampled, and the sampling signal is corrected based on it; the high-frequency restoration effect is therefore better than that of a diffusion model that computes the gradient only once, or one that only predicts without correcting.
In one possible embodiment, predicting a repair sampling signal from the repair gradient to be sampled and the first Gaussian noise value includes: calculating a repair drift coefficient from the second speech signal based on a stochastic differential equation of the second diffusion model; calculating a repair inverse drift coefficient from the repair drift coefficient and the repair gradient to be sampled; and predicting the repair sampling signal from the repair inverse drift coefficient, the second speech signal, and the first Gaussian noise value.

It will be appreciated that the inverse drift coefficient describes the path from the noisy speech signal toward the clean speech signal during sampling. That is, time-frequency points that do not follow the probability distribution of the target speech signal are attenuated, and the path along which time-frequency points that follow that distribution are generated relies on the first Gaussian noise value.
In one possible embodiment, the second neural network is trained based on second input data and second target data, where the second input data includes a sample noisy speech signal, a sample attenuated speech signal, and a sample repair sampling signal; the sample noisy speech signal is generated from a second sample noise signal and a second sample speech signal; the sample attenuated speech signal is obtained by convolving the second sample speech signal with a preset-distance room impulse response; the sample repair sampling signal is generated based on the sample attenuated speech signal and the second sample speech signal; the second target data characterizes the probability distribution of the second sample speech signal; and the preset-distance room impulse response simulates the process in which sound is emitted by the sound source and propagates to the microphone when the distance between the sound source and the microphone of the electronic device is greater than or equal to a preset distance threshold.

According to this scheme, the neural network model is trained with target data that characterizes the probability distribution of the sample speech signal used to generate the sample attenuated speech signal, so that the gradient output by the trained neural network model can characterize the probability distribution of the target speech signal.
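A hedged sketch of how such training inputs might be assembled from the description above; the function and variable names are illustrative rather than from this application, and the SNR-based scaling is an assumed detail.

```python
import numpy as np
from scipy.signal import fftconvolve

def make_training_pair(clean, noise, rir, snr_db=0.0):
    # Sample attenuated speech signal: the clean sample speech signal
    # convolved with a preset-distance RIR to simulate far-field propagation.
    attenuated = fftconvolve(clean, rir)[: len(clean)]

    # Sample noisy speech signal: the clean sample speech signal plus noise,
    # scaled here to a chosen signal-to-noise ratio.
    noise = noise[: len(clean)]
    gain = np.sqrt(np.mean(clean**2) / (np.mean(noise**2) * 10 ** (snr_db / 10)))
    noisy = clean + gain * noise
    return noisy, attenuated
```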
In one possible embodiment, the sample repair sampling signal is a second Gaussian noise value generated from the sample attenuated speech signal, the second sample speech signal, and a third Gaussian noise value that follows a standard normal distribution; the second target data is generated based on the standard deviation of the second Gaussian noise value and on the third Gaussian noise value, and the third Gaussian noise value is generated based on a second random seed.

It can be understood that, because the random numbers generated from a random seed differ at different times, when the speech signal processing method provided by the application performs restoration processing on the same segment of to-be-processed speech at different times, the target speech signals output each time are different (or not identical).
In one possible embodiment, inputting the first speech signal into the first neural network to obtain a noise reduction gradient to be sampled of the first speech signal includes: inputting the first speech signal into the first neural network to obtain the 1st noise reduction gradient to be sampled of the first speech signal. Calculating a noise reduction drift coefficient from the first speech signal based on a stochastic differential equation of the first diffusion model includes: calculating the 1st noise reduction drift coefficient from the first speech signal. Calculating a noise reduction inverse drift coefficient from the noise reduction drift coefficient and the noise reduction gradient to be sampled includes: calculating the 1st noise reduction inverse drift coefficient from the 1st noise reduction drift coefficient and the 1st noise reduction gradient to be sampled. Predicting the noise reduction sampling signal from the noise reduction inverse drift coefficient, the first speech signal, and the fourth Gaussian noise value includes: predicting the 1st noise reduction sampling signal from the 1st noise reduction inverse drift coefficient, the first speech signal, and the 1st fourth Gaussian noise value. Inputting the noise reduction sampling signal into the first neural network to obtain a noise reduction gradient to be corrected includes: inputting the 1st noise reduction sampling signal into the first neural network to obtain the 1st noise reduction gradient to be corrected of the 1st noise reduction sampling signal. Correcting the noise reduction sampling signal according to the noise reduction gradient to be corrected to obtain a noise reduction correction signal includes: correcting the 1st noise reduction sampling signal according to the 1st noise reduction gradient to be corrected to obtain the 1st noise reduction correction signal.

Or, for the n-th iteration: inputting the first speech signal into the first neural network to obtain a noise reduction gradient to be sampled includes: inputting the (n-1)-th noise reduction correction signal and the first speech signal into the first neural network to obtain the n-th noise reduction gradient to be sampled of the first speech signal. Calculating a noise reduction drift coefficient based on the stochastic differential equation of the first diffusion model includes: calculating the n-th noise reduction drift coefficient from the (n-1)-th noise reduction correction signal and the first speech signal. Calculating a noise reduction inverse drift coefficient includes: calculating the n-th noise reduction inverse drift coefficient from the n-th noise reduction drift coefficient and the n-th noise reduction gradient to be sampled. Predicting the noise reduction sampling signal includes: predicting the n-th noise reduction sampling signal from the n-th noise reduction inverse drift coefficient, the (n-1)-th noise reduction correction signal, and the n-th fourth Gaussian noise value. Obtaining the noise reduction gradient to be corrected includes: inputting the n-th noise reduction sampling signal and the first speech signal into the first neural network to obtain the n-th noise reduction gradient to be corrected of the n-th noise reduction sampling signal. Correcting the noise reduction sampling signal includes: correcting the n-th noise reduction sampling signal according to the n-th noise reduction gradient to be corrected to obtain the n-th noise reduction correction signal, where 2 ≤ n ≤ N, n and N are positive integers, and in the case of n = N the N-th noise reduction correction signal is used as the second speech signal.

According to this scheme, the steps of calculating the gradient, predicting, and correcting are iterated N times, achieving a better noise reduction effect. A compact sketch of this loop follows.
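In the sketch below, net1, predict, and correct are hypothetical stand-ins for the first neural network and the prediction and correction sub-steps. As a simplification of the description above, the current signal is initialized to the first speech signal, so the pair fed to the network in the 1st iteration degenerates to the first speech signal alone.

```python
def denoise_iterate(x, net1, predict, correct, N, rng):
    signal = x                                # input to the 1st iteration
    for n in range(1, N + 1):
        grad = net1(signal, x)                # n-th gradient to be sampled
        sampled = predict(signal, grad, rng)  # n-th noise reduction sampling signal
        grad_c = net1(sampled, x)             # n-th gradient to be corrected
        signal = correct(sampled, grad_c)     # n-th noise reduction correction signal
    return signal  # the N-th correction signal serves as the second speech signal
```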
In one possible embodiment, inputting the first speech signal and the second speech signal into the second neural network to obtain a repair gradient to be sampled of the second speech signal includes: inputting the first speech signal and the second speech signal into the second neural network to obtain the 1st repair gradient to be sampled of the second speech signal. Calculating a repair drift coefficient from the second speech signal based on a stochastic differential equation of the second diffusion model includes: calculating the 1st repair drift coefficient from the second speech signal. Calculating a repair inverse drift coefficient from the repair drift coefficient and the repair gradient to be sampled includes: calculating the 1st repair inverse drift coefficient from the 1st repair drift coefficient and the 1st repair gradient to be sampled. Predicting a repair sampling signal from the repair inverse drift coefficient, the second speech signal, and the first Gaussian noise value includes: predicting the 1st repair sampling signal from the 1st repair inverse drift coefficient, the second speech signal, and the first Gaussian noise value. Inputting the repair sampling signal into the second neural network to obtain a repair gradient to be corrected includes: inputting the 1st repair sampling signal into the second neural network to obtain the 1st repair gradient to be corrected of the 1st repair sampling signal. Correcting the repair sampling signal according to the repair gradient to be corrected to obtain a repair correction signal includes: correcting the 1st repair sampling signal according to the 1st repair gradient to be corrected to obtain the 1st repair correction signal.

Or, for the m-th iteration: inputting the first speech signal and the second speech signal into the second neural network to obtain a repair gradient to be sampled includes: inputting the first speech signal, the second speech signal, and the (m-1)-th repair correction signal into the second neural network to obtain the m-th repair gradient to be sampled of the second speech signal. Calculating a repair drift coefficient based on the stochastic differential equation of the second diffusion model includes: calculating the m-th repair drift coefficient from the second speech signal and the (m-1)-th repair correction signal. Calculating a repair inverse drift coefficient includes: calculating the m-th repair inverse drift coefficient from the m-th repair drift coefficient and the m-th repair gradient to be sampled. Predicting a repair sampling signal includes: predicting the m-th repair sampling signal from the m-th repair inverse drift coefficient, the (m-1)-th repair correction signal, and the first Gaussian noise value. Obtaining the repair gradient to be corrected includes: inputting the first speech signal, the second speech signal, and the m-th repair sampling signal into the second neural network to obtain the m-th repair gradient to be corrected of the m-th repair sampling signal. Correcting the repair sampling signal includes: correcting the m-th repair sampling signal according to the m-th repair gradient to be corrected to obtain the m-th repair correction signal, where 2 ≤ m ≤ M, m and M are positive integers, and in the case of m = M the M-th repair correction signal is used as the target speech signal.

According to this scheme, the steps of calculating the gradient, predicting, and correcting are iterated M times, achieving a better high-frequency restoration effect.
In a second aspect, the present application provides an electronic device comprising one or more processors and one or more memories; wherein the one or more memories are coupled to the one or more processors, the one or more memories being operable to store computer program code comprising computer instructions that, when executed by the one or more processors, cause the electronic device to perform the method as described in the first aspect and any possible implementation of the first aspect.
In a third aspect, embodiments of the present application provide a chip system applied to an electronic device, the chip system comprising one or more processors configured to invoke computer instructions to cause the electronic device to perform the method described in the first aspect and any possible implementation of the first aspect.
In a fourth aspect, the application provides a computer readable storage medium comprising instructions which, when run on an electronic device, cause the electronic device to perform a method as described in the first aspect and any possible implementation of the first aspect.
In a fifth aspect, the application provides a computer program product comprising instructions which, when run on an electronic device, cause the electronic device to perform a method as described in the first aspect and any possible implementation of the first aspect.
It will be appreciated that the electronic device provided in the second aspect, the chip system provided in the third aspect, the computer-readable storage medium provided in the fourth aspect, and the computer program product provided in the fifth aspect are all configured to perform the method provided by the present application. For the beneficial effects they achieve, reference may be made to those of the corresponding method, which are not repeated here.
Drawings
FIG. 1 is a schematic diagram of an example of a scenario in which an embodiment of the present application is applicable;
Fig. 2A is a time domain diagram and a time frequency diagram of a speech signal 1 acquired at 5m in the scenario shown in fig. 1;
Fig. 2B is a time domain diagram and a time frequency diagram of the speech signal 2 acquired at 30m in the scenario shown in fig. 1;
Fig. 3 is a schematic diagram of a speech signal processing method 100 according to an embodiment of the present application;
Fig. 4 is a schematic diagram of a voice conversion method 200 according to an embodiment of the present application;
fig. 5 is a schematic diagram of a voice conversion method 300 according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a training process 400 of a deep neural network model according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a training process 500 of a deep neural network model according to an embodiment of the present application;
Fig. 8 is a schematic hardware structure of an electronic device 1000 according to an embodiment of the present application;
fig. 9 is a block diagram of a software system of an electronic device 1000 according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
For a better understanding of the embodiments of the present application, technical terms related to the embodiments of the present application will be described first.
1. Long distance, or far field, means that the microphone used to pick up the sound is far from the sound source. In the acoustic field, a microphone-to-source distance of 10 m or more is generally referred to as long distance, and a distance of 20 m or 30 m may be referred to as ultra-long distance. In the embodiments of the present application, long distance is understood as the case where the distance between the microphone and the sound source is greater than or equal to a preset distance threshold.

A long-distance speech signal, or far-field speech signal, is a speech signal collected or picked up in a long-distance or far-field scene; it can be understood as a speech signal recorded when the distance between the microphone of the electronic device and the sound source is greater than or equal to the preset distance threshold.

2. Room impulse response (RIR). Its role in the present application includes simulating the propagation of sound from the sound source to the microphone. During this propagation, noise is likely present, and sound emitted by noise sources propagates together with the target sound; the distance between the sound source and the microphone affects the attenuation of the high-frequency components; and reverberation may also arise from wall reflections and the like.
In some embodiments, the embodiments of the present application apply to scenes where sound signals are collected at long distance. The RIR dataset used to train the second neural network is therefore mainly a long-distance RIR dataset, which may include, for example, long-distance RIRs recorded in oversized rooms, long-distance RIRs of oversized rooms produced with a simulation algorithm, or long-distance RIRs generated in other ways; the application is not limited in this respect. Illustratively, an oversized room here is typically a room containing a straight-line distance greater than or equal to a preset distance threshold (for example 20 m, 30 m, 10 m, 15 m, or 25 m).

The long-distance RIR in the present application may also be called a preset-distance RIR, where the preset distance equals the preset distance threshold described above. The preset-distance RIR simulates the process in which sound is emitted by the sound source and propagates to the microphone when the distance between the sound source and the microphone of the electronic device is greater than or equal to the preset distance threshold.
It should be noted that although the RIR datasets of the present application all relate to oversized rooms, the long-distance or far-field scenarios to which the application applies are not coupled to oversized rooms, or to rooms at all. That is, the speech signal processing method provided by the application is applicable not only to enclosed spaces such as oversized rooms, large tiered classrooms, and banquet halls, but also to non-enclosed spaces such as playgrounds and sites that are enclosed on all sides yet open above. Each of these enclosed or non-enclosed spaces contains straight-line distances of 20 m or more, or 30 m or more. A toy RIR sketch is given below.
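The following toy stand-in, entirely an illustration and not a true image-source simulation such as a real training set would use, shows the two ingredients a far-field RIR contributes: a distance-attenuated, delayed direct path and a decaying reverberant tail.

```python
import numpy as np

def toy_far_field_rir(distance_m, fs=16000, rt60=0.8, c=343.0, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    length = int(rt60 * fs)
    rir = np.zeros(length)
    delay = int(distance_m / c * fs)         # direct-path propagation delay
    rir[delay] = 1.0 / max(distance_m, 1.0)  # 1/r amplitude attenuation
    # Diffuse reverberant tail: noise decaying by 60 dB over rt60 seconds.
    decay = np.exp(-6.9 * np.arange(length) / (rt60 * fs))
    rir += 0.05 * decay * rng.standard_normal(length)
    return rir
```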
It can be appreciated that in the training of the second neural network, what matters about the noise dataset or the RIR dataset is not which particular dataset perturbs the sample clean speech signal (also called the sample speech signal) but the probability distribution of the sample speech signal; the second neural network is therefore not affected by the noise type and generalizes better. Consequently, even if the perturbation datasets used in training relate to enclosed spaces, the trained second neural network can still produce target speech signals for to-be-processed speech recorded in non-enclosed spaces.
3. Time domain diagrams, spectrograms and high frequency components.
In a time-domain diagram, the abscissa is time and the ordinate is amplitude. A corresponding spectrogram can be obtained by applying a feature transform to the time-domain signal.

In a spectrogram (also called a time-frequency diagram), the abscissa is time and the ordinate is frequency, and the diagram consists of many time-frequency points. The brightness or lightness of a time-frequency point represents its energy: the brighter or lighter the point, the larger the energy, and vice versa.

For example, if the frequency range of a spectrogram is 0-8000 Hz, the band from 5000 Hz to 8000 Hz is the high-frequency band and the band from 0 to 5000 Hz is the low-to-mid-frequency band.

The high-frequency component is the higher-frequency part of the speech signal, shown in the spectrogram as the time-frequency points of higher frequency. In the example above, the time-frequency points between 5000 Hz and 8000 Hz belong to the high-frequency component.
4. Diffusion model: a class of generative models based on stochastic processes that can generate various types of data such as images, text, and audio. Unlike conventional generative models, a diffusion model does not rely on known labels or target data; instead it starts from random noise and, by processing that noise, gradually generates high-quality data. A diffusion model can therefore be regarded as an unsupervised generative model.

The most common diffusion model is the unconditional diffusion model, whose main task is to generate random samples of the data distribution. For some applications, however, such as image restoration, image synthesis, and text-to-image generation, finer control over the generated samples is needed, which calls for the conditional diffusion model.

An important feature of the conditional diffusion model is that generation can be controlled by adding extra conditional information, such as a specific part of an image or a text description. In text-to-image generation, for example, a text description can control which elements the generated image should contain or what style it should present.

In a specific implementation, a conditional diffusion model typically encodes the extra conditional information as part of the model during the training phase and decodes that information during the generation phase to control the generated result. An important advantage of this approach is that it provides finer control over the generated result while retaining the generative capability of the diffusion model.
The diffusion model according to the present application is a conditional diffusion model.
5. Drift coefficients and inverse drift coefficients.
A clean signal (for example, the speech signal in the present application) is progressively noised by sampling. The drift coefficient and the diffusion coefficient together characterize the difference between the noised random signal and the audio signal to be processed during this noising of the clean signal.

The drift coefficient is the mean of the random signal at the n-th of N sampling steps; the diffusion coefficient is the variance of the random signal at the n-th of N sampling steps.

The audio signal to be processed is denoised by sampling. The inverse drift coefficient and the inverse diffusion coefficient together characterize the difference between the denoised random signal and the clean signal during this denoising of the audio signal to be processed.

The inverse drift coefficient is the mean of the random signal at the n-th of N sampling steps; the inverse diffusion coefficient is the variance of the random signal at the n-th of N sampling steps.

Here 1 ≤ n ≤ N, and n and N are integers.
6. Gaussian noise: a common random noise whose values follow a Gaussian distribution. Gaussian noise is often used as a tool to simulate various kinds of noise in signal processing and image processing. Its frequency response can be computed with a Gaussian function; in the frequency domain, the spectral density function of Gaussian noise presents a Gaussian curve.

7. Random seed: a computer term. A computer's random numbers are generally pseudo-random numbers: a true random number (the seed) is taken as the initial condition, and an algorithm then iterates continually from it to generate the subsequent random numbers.
It will be appreciated that in the present application, the random numbers (e.g., gaussian noise values) generated from the same random seed at different times are different.
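A short illustration of this point, using numpy as an illustrative choice: successive draws from the same seeded generator differ, which is why repeated runs over the same speech segment yield different, though similar, outputs.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
print(rng.standard_normal())  # first Gaussian noise value from this seed
print(rng.standard_normal())  # a different value, same seed, later time
```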
At present, far-field speech signals are mainly picked up with a microphone array, and the picked-up far-field speech signals are enhanced according to an acoustic model so that the speech signal from a target direction is obtained. Such methods are driven by linear models and have limited performance; if the localization module errs, the target speech signal is distorted; and multiple microphone devices are required, so the cost is high.
Fig. 1 is a schematic diagram of an example of a scenario to which an embodiment of the present application is applied.
As shown in fig. 1, the dashed box may represent an enclosed or non-enclosed space as described above. While the sound source is sounding, microphone A (for example, included in terminal device 1) and microphone B (for example, included in terminal device 2) collect speech signals at 5 m and 30 m from the source respectively, so that the same sound emitted by the same source at the same time is captured at a normal distance and at a long distance.

Here 5 m and 30 m serve only as examples to show that long-distance recording has problems to be solved compared with normal-distance recording; the specific values are not limiting.

For example, during a class in a large tiered classroom (or the like) with the teacher lecturing at the lectern, student 1 sitting in the 2nd row (about 5 m from the lectern) records with terminal device 1, and student 2 sitting in the last row (about 30 m away) records with terminal device 2. In general, the teacher's voice is easy to hear in the audio recorded by student 1. In the audio recorded by student 2, the teacher's voice is very faint and the signal-to-noise ratio is very low; even if the volume is amplified to make the teacher's voice louder, the noise is amplified as well and the voice remains submerged in it. Moreover, even the sentences that are barely audible do not sound full enough.

For another example, students listen to a lecture on a playground (or another non-enclosed space). Student 1, about 5 m from the presenter, records with terminal device 1, and student 2, about 30 m away, records with terminal device 2. In general, the presenter's voice is easy to hear in the audio recorded by student 1. In the audio recorded by student 2, the presenter's voice is very faint and the signal-to-noise ratio is very low; even if the volume is amplified, the noise is amplified as well and the voice remains submerged in it, and even the barely audible sentences do not sound full enough.
Fig. 2A is a time domain diagram and a time frequency diagram of the speech signal 1 acquired at 5m in the scenario shown in fig. 1.
Fig. 2B is a time domain diagram and a time frequency diagram of the speech signal 2 acquired at 30m in the scenario shown in fig. 1.
Note that because the volume of speech signal 2 collected at 30 m is very small and the sound of the source is barely audible even after amplification, fig. 2B was produced after amplifying the volume.

As shown in fig. 2A, the upper part is time-domain diagram 1 and the lower part is spectrogram 1 obtained from time-domain diagram 1 by feature transformation. In spectrogram 1, the harmonic structure of the low-to-mid-frequency region of speech signal 1 is evident, and so is the time-frequency point distribution of the high-frequency region. Overall, the small number of irregularly distributed time-frequency points are noise, but the energy of the points of the low-to-mid-frequency harmonic components and of the high-frequency components is clearly larger than that of the noise; that is, the speech time-frequency points are not submerged in the noise points. Audibly, the sound is full and clear, and the noise is negligible.

As shown in fig. 2B, the upper part is time-domain diagram 2 and the lower part is spectrogram 2 obtained from time-domain diagram 2 by feature transformation. Time-domain diagram 2 shows that the signal-to-noise ratio is small. In spectrogram 2, across speech signal 2 as a whole, the energy of the noise time-frequency points does not differ markedly from that of the non-noise points, which also indicates a very low signal-to-noise ratio. The low-to-mid-frequency harmonic structure is invisible in the first half of the time and comparatively evident in the second half. The high-frequency components are almost completely submerged in the noise time-frequency points. In the first half, where the noise greatly exceeds the speech, careful listening is strenuous and the words cannot be made out; in the second half the words can be made out with effort, but the sound is clearly not natural enough.

It can be understood that, for far-field or long-distance speech collection, as the distance increases, on the one hand the signal-to-noise ratio of the speech signal collected by the microphone gradually decreases, so the signal sounds less and less clear; on the other hand, the high-frequency components of the speech signal attenuate more, so the signal sounds less and less full.
Therefore, how to improve the speech quality of the collected far-field speech signal is a problem to be solved.
In view of the above, the present application provides a speech signal processing method: noise is first reduced based on a diffusion model, the high-frequency components are then restored based on another diffusion model, and the target speech signal is finally obtained.

According to this scheme, the noise reduction makes the target speech signal sound clearer, and the restoration of the high-frequency components makes it sound fuller. Compared with a scheme that only performs noise reduction on the far-field speech signal, the target speech signal is fuller in hearing.

Moreover, since the attenuated high-frequency components may be submerged in noise, the noise reduction processing may remove them together with the noise, which can degrade the restoration effect. During restoration, the speech signal before noise reduction is therefore used as the condition of the conditional diffusion model, that is, as reference information for restoring the high-frequency components, which reduces the interference that the noise reduction processing may cause to the restoration processing and improves the restoration effect.

In addition, the processing of the far-field speech signal is decomposed into two steps handled by two diffusion models. Compared with processing the far-field speech signal with a single model to the same effect, this reduces the complexity of the algorithm and saves computing power: a single model would need a more complex algorithm to achieve what the two models achieve, placing higher demands on the power consumption, computing power, and the like of the terminal device.
In some embodiments, the restoration processing includes a step of calculating the gradient of the denoised speech signal and a step of sampling that gradient based on Gaussian noise. Since Gaussian noise is superimposed according to the gradient in the sampling step, when the Gaussian noise conforms to the gradient, that is, to the probability distribution of the target speech signal, the attenuated high-frequency components are enhanced and thereby restored.

A conventional discriminative neural network cannot, like a generative neural network, restore high-frequency components by superimposing Gaussian noise according to a probability distribution.
Fig. 3 is a schematic diagram of a speech signal processing method 100 according to an embodiment of the application.
S101, acquiring a first voice signal to be processed.
For example, the speech signal s(k) to be processed is collected at long distance with the microphone module 110 of the electronic device. The to-be-processed speech signal recorded by the microphone of the electronic device is a time-domain signal. The time-domain signal s(k) can be converted by the feature transformation module 120 into the first speech signal x(f, t) in the time-frequency domain, and subsequent processing is then performed on x(f, t), where f denotes frequency and t denotes the time frame. A minimal sketch of this transform is given below.
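A minimal sketch of the feature transform in S101, assuming an STFT; the frame length, overlap, and placeholder signal are illustrative, as the description does not specify them.

```python
import numpy as np
from scipy.signal import stft

fs = 16000
s = np.random.default_rng(0).standard_normal(fs)  # placeholder 1 s recording s(k)
f, t, x = stft(s, fs=fs, nperseg=512, noverlap=384)
# x[i, j] is the complex time-frequency point x(f, t); its squared magnitude
# is the energy shown as brightness at that point in a spectrogram.
```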
S102: perform noise reduction processing on the first speech signal with the noise reduction module 130 based on the first diffusion model to obtain the second speech signal y(f, t).
As one possible example of S102, example 1 includes:
step A-1, calculating the gradient of the first voice signal.
The gradient is used for representing probability distribution of the second voice signal, and the probability distribution of the second voice signal corresponds to the distribution of time-frequency points in a spectrogram of the second voice signal.
And step A-2, sampling the gradient of the first voice signal to obtain a second voice signal.
It can be appreciated that considering the probability distribution of the second speech signal means considering the distribution of its time-frequency points as a whole, where the whole includes the relationships among the time-frequency points.

It can further be understood that when time-frequency points are analyzed one by one without considering this wholeness, even if the total error (for example, a first error value) between the time-frequency points of the speech-enhanced signal and the corresponding points of an ideal clean signal (the signal obtained by removing all noise from the first speech signal in the ideal case) is very small, the following situation is likely: a small portion of the time-frequency points have a large error relative to the corresponding points of the ideal clean signal (for example, significantly greater than the first error value), while the large majority have a small error (for example, significantly less than the first error value). That small portion of badly erred points means that, in principle, the denoised speech is damaged to some extent, its spectral continuity is poor, and it sounds worse.

Compared with a scheme that ignores the wholeness of the time-frequency point distribution, the present application takes that wholeness into account, so that when the total error is controlled to the same small value (for example, the first error value), the errors of almost all time-frequency points are balanced and small; in principle this reduces the damage that the noise reduction processing does to the speech signal.

The total error in the present application can be understood as the loss function value used, during training, by the neural network that computes the noise reduction gradient to be sampled or the noise reduction gradient to be corrected.

As one possible implementation of example 1, the gradient is calculated twice, and a correction procedure follows the sampling. Example 1-1 includes:
And C-1, inputting the first voice signal into a first neural network to obtain a noise reduction gradient to be sampled of the first voice signal. The noise reduction to-be-sampled gradient is used to characterize the probability distribution of the second speech signal.
And C-2, predicting a noise reduction sampling signal according to the noise reduction gradient to be sampled.
And C-3, inputting the noise reduction sampling signal into a first neural network to obtain a noise reduction gradient to be corrected of the noise reduction sampling signal, wherein the noise reduction gradient to be corrected is used for representing probability distribution of the second voice signal.
And C-4, correcting the noise reduction sampling signal according to the gradient to be corrected in the noise reduction process to obtain a noise reduction correction signal, wherein the noise reduction correction signal is used for generating a second voice signal.
Wherein, step C-1 and step C-3 are taken as a specific example of step A-1; step C-2 and step C-3 are given as a specific example of step A-2.
According to the scheme, the noise reduction to-be-corrected gradient is continuously calculated on the noise reduction sampling signal obtained based on the noise reduction to-be-sampled gradient prediction, and the noise reduction sampling signal is corrected based on the noise reduction to-be-corrected gradient, so that the noise reduction effect is better compared with a diffusion model which only calculates a primary gradient or compared with a diffusion model which only predicts an uncorrected diffusion model.
As a possible implementation of step C-2, the noise reduction sampling signal is predicted by a drift coefficient and an inverse drift coefficient. Examples 1-2, comprising:
Step E-1: calculate a noise reduction drift coefficient from the first speech signal based on a stochastic differential equation of the first diffusion model.
Step E-2: calculate a noise reduction inverse drift coefficient from the noise reduction drift coefficient and the noise reduction gradient to be sampled.
Step E-3: predict the noise reduction sampling signal from the noise reduction inverse drift coefficient, the first speech signal, and a fourth Gaussian noise value.
Wherein the fourth gaussian noise value is generated based on the third random seed.
It will be appreciated that the inverse drift coefficient may be used to describe the path from a noisy speech signal toward a clean speech signal during sampling; sampling the gradient based on the inverse drift coefficient therefore achieves noise reduction.
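For illustration, the following Python sketch shows what one prediction pass (steps E-1 to E-3) could look like. The drift function drift_fn, the score network score_net, the diffusion coefficient g, and the step size ds are all assumed placeholders; the application does not disclose their exact forms.

```python
import numpy as np

rng = np.random.default_rng(seed=3)  # stands in for the third random seed

def predict_step(y, x, s, score_net, drift_fn, g=1.2, ds=0.1):
    """One noise reduction prediction step (sketch of steps E-1 to E-3).

    y : current sampling signal; x : the first speech signal (network input);
    s : current sampling time.
    """
    fc = drift_fn(y, s, x)                      # E-1: drift coefficient
    grad = score_net(y, s, x)                   # noise reduction gradient to be sampled
    fcr = fc - (g ** 2) * grad                  # E-2: inverse drift coefficient
    z4 = rng.standard_normal(y.shape)           # fourth Gaussian noise value
    return y - fcr * ds + g * np.sqrt(ds) * z4  # E-3: predicted sampling signal
```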
A more specific implementation will be described below, based on fig. 4, in connection with examples 1-1 and 1-2.
The training process of the first neural network in example 1-1 will be described below based on fig. 6.
S103, repairing the first voice signal and the second voice signal based on the second diffusion model to obtain a target voice signal.
Wherein the second diffusion model is a conditional diffusion model, and the repair process includes: and restoring the high-frequency component of the second voice signal by taking the first voice signal as the condition of the second diffusion model.
For example, the second diffusion model-based restoration module 140 is used to perform restoration processing on the first speech signal x(f, t) and the second speech signal y(f, t) to obtain a repaired speech signal r(f, t). The repaired speech signal r(f, t) is input into the feature inverse transformation module 150 to obtain the target speech signal.
As one possible example of S103, example 2 includes:
Step B-1: calculate a gradient of the second speech signal from the first speech signal and the second speech signal, where the gradient is used to characterize the probability distribution of the target speech signal, and the probability distribution of the target speech signal corresponds to the distribution of time-frequency points in a spectrogram of the target speech signal.
Step B-2: sample the gradient of the second speech signal based on a first Gaussian noise value to obtain the target speech signal, where the first Gaussian noise value is generated based on a first random seed.
In this scheme, Gaussian noise is superimposed according to the gradient in the sampling step; because the superimposed noise conforms to the gradient, that is, to the probability distribution of the target speech signal, the attenuated high-frequency component is enhanced and thereby repaired.
It can be appreciated that a conventional discriminative neural network cannot restore high-frequency components by superimposing Gaussian noise according to a probability distribution in the way a generative neural network can.
As one possible implementation of example 2, the gradient calculation is performed twice, and the correction flow is performed after the sampling process. Example 2-1, comprising:
Step D-1: input the first speech signal and the second speech signal into a second neural network to obtain a repair gradient to be sampled of the second speech signal, where the repair gradient to be sampled is used to characterize the probability distribution of the target speech signal.
Step D-2: predict a repair sampling signal according to the repair gradient to be sampled and the first Gaussian noise value.
Step D-3: input the repair sampling signal into the second neural network to obtain a repair gradient to be corrected of the repair sampling signal, where the repair gradient to be corrected is used to characterize the probability distribution of the target speech signal.
Step D-4: sampling the gradient of the second speech signal further includes correcting the repair sampling signal according to the repair gradient to be corrected to obtain a repair correction signal, where the repair correction signal is used to generate the target speech signal.
Here, step D-1 and step D-3 are taken as a specific example of step B-1, and step D-2 and step D-4 as a specific example of step B-2.
In this scheme, the repair gradient to be corrected is further calculated on the repair sampling signal predicted from the repair gradient to be sampled, and the repair sampling signal is corrected based on that gradient. The high-frequency repair effect is therefore better than that of a diffusion model that calculates the gradient only once, or of a diffusion model that only predicts without correction.
As a possible implementation of step D-2, the repair sampling signal is predicted by means of a drift coefficient and an inverse drift coefficient. Example 2-2, comprising:
Step F-1: calculate a repair drift coefficient from the second speech signal based on a stochastic differential equation of the second diffusion model.
Step F-2: calculate a repair inverse drift coefficient from the repair drift coefficient and the repair gradient to be sampled.
Step F-3: predict the repair sampling signal from the repair inverse drift coefficient, the second speech signal, and the first Gaussian noise value.
It will be appreciated that the inverse drift coefficient may be used to describe the path from the second speech signal toward the target speech signal during sampling. That is, time-frequency points that do not follow the probability distribution of the target speech signal are suppressed, and time-frequency points that do follow that distribution are generated by means of the first Gaussian noise.
A more specific implementation will be described below, based on fig. 5, in connection with examples 2-1 and 2-2.
The training process of the second neural network in example 2-1 will be described below based on fig. 7.
According to the embodiment of the application, on the one hand, noise reduction is performed, so that the target speech signal is clearer in auditory perception; on the other hand, the high-frequency component is repaired, so that the target speech signal is fuller in auditory perception. In addition, compared with a scheme that only performs noise reduction processing on far-field speech signals, the target speech signal can be made fuller in auditory perception.
As a possible further example of examples 1-1 and 1-2, the steps of examples 1-1 and 1-2 are iterated N times to achieve a better noise reduction effect. Example 3 includes:
In the case where the sampling number n = 1:
Step G-1: input the first speech signal into the first neural network to obtain the 1st noise reduction gradient to be sampled of the first speech signal.
Step G-2: calculate the 1st noise reduction drift coefficient from the first speech signal based on a stochastic differential equation of the first diffusion model.
Step G-3: calculate the 1st noise reduction inverse drift coefficient from the 1st noise reduction drift coefficient and the 1st noise reduction gradient to be sampled.
Step G-4: predict the 1st noise reduction sampling signal from the 1st noise reduction inverse drift coefficient, the first speech signal, and the 1st fourth Gaussian noise value.
Step G-5: input the 1st noise reduction sampling signal into the first neural network to obtain the 1st noise reduction gradient to be corrected of the 1st noise reduction sampling signal.
Step G-6: correct the 1st noise reduction sampling signal according to the 1st noise reduction gradient to be corrected to obtain the 1st noise reduction correction signal.
Or, in the case where the sampling number is n, with 2 ≤ n ≤ N and both n and N positive integers:
Step H-1: input the (n-1)th noise reduction correction signal and the first speech signal into the first neural network to obtain the nth noise reduction gradient to be sampled of the first speech signal.
Step H-2: calculate the nth noise reduction drift coefficient from the (n-1)th noise reduction correction signal and the first speech signal based on a stochastic differential equation of the first diffusion model.
Step H-3: calculate the nth noise reduction inverse drift coefficient from the nth noise reduction drift coefficient and the nth noise reduction gradient to be sampled.
Step H-4: predict the nth noise reduction sampling signal from the nth noise reduction inverse drift coefficient, the (n-1)th noise reduction correction signal, and the nth fourth Gaussian noise value.
Step H-5: input the nth noise reduction sampling signal and the first speech signal into the first neural network to obtain the nth noise reduction gradient to be corrected of the nth noise reduction sampling signal.
Step H-6: correct the nth noise reduction sampling signal according to the nth noise reduction gradient to be corrected to obtain the nth noise reduction correction signal.
Fig. 4 is a schematic diagram of a speech signal processing method 200 according to an embodiment of the present application. Fig. 4 may be taken as a possible further example of example 3. For example, the noise reduction module 130 based on the first diffusion model includes an initialization module 201 to a determination module 207 as shown in fig. 4.
The initialization module 201 initializes the sampling number n = 1 (representing the current, first sampling), the sampling time s = 0.999, and the sampling signal, and outputs the initialized parameters to the drift coefficient calculation module 202.
The sampling signal is initialized from the first speech signal x(f, t) described above, and thereafter represents the noise reduction sampling signal.
In the case where the sampling number n satisfies 2 ≤ n ≤ N, when the steps from the drift coefficient calculation module 202 to the determination module 207 are executed for the nth time, the input sampling signal is the (n-1)th noise reduction correction signal, which can be understood as the noise reduction correction signal obtained by the (n-1)th sampling.
The drift coefficient calculation module 202: its inputs are the sampling signal, the sampling time s, and the first speech signal x(f, t). The function of this module is to calculate the drift coefficient fc from the stochastic differential equation of the diffusion model. The drift coefficient calculated by this module is output to the inverse drift coefficient calculation module 204 to calculate the inverse drift coefficient.
The stochastic differential equation may be a forward stochastic differential equation of the diffusion model.
The steps performed by the drift coefficient calculation module 202 may be understood as one possible example of step G-2 or step H-2.
The gradient calculation module 203: its inputs are the sampling signal, the sampling time s, and the first speech signal x(f, t). The function of this module is to calculate the noise reduction gradient to be sampled, grad, using the trained first neural network model.
The steps performed by the gradient calculation module 203 may be understood as one possible example of step G-1 or step H-1.
Specifically, the training process of the first neural network model will be described in detail below in conjunction with fig. 6.
The gradient calculated by this module is output to the inverse drift coefficient calculation module 204 to calculate the inverse drift coefficient.
The inverse drift coefficient calculation module 204: its inputs are the drift coefficient fc and the gradient grad. It calculates the inverse drift coefficient as fcr = fc - g²·grad and outputs it to the prediction module 205 for prediction, where g is a predefined diffusion coefficient, which in the present application may be an empirically chosen constant such as 1.2.
The steps performed by the inverse drift coefficient calculation module 204 may be understood as one possible example of step G-3 or step H-3.
The prediction module 205, which may also be referred to as the sampling prediction module 205, predicts a new sampling signal from the inverse drift coefficient, for example by an update of the form y ← y - fcr·Δs + g·√Δs·z1, where Δs is the sampling-time step and z1 is Gaussian noise with a mean of zero and a variance of 1 (i.e., the fourth Gaussian noise value described above). It should be noted that generating the Gaussian noise requires a random seed, which is set, for example, when the terminal device is initialized.
This update replaces the previous sampling signal with the newly predicted one, yielding the noise reduction sampling signal obtained by the nth sampling, also referred to as the nth noise reduction sampling signal.
Exemplary sampling methods include, but are not limited to, ancestral sampling, Langevin dynamics sampling, and annealed Langevin dynamics sampling (or a combination of these). Whichever sampling method is used, random numbers must first be generated from a random seed (in the present application, Gaussian noise with a mean of 0 and a variance of 1), and sampling is then performed based on those random numbers.
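However the sampling is performed, fixing the random seed fixes the entire noise sequence and hence the sampling path, which is why the seed matters for reproducibility. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(seed=3)      # the random seed fixes the noise sequence
z = rng.standard_normal((257, 100))      # Gaussian noise, mean 0, variance 1

# Re-creating the generator with the same seed reproduces the same noise exactly,
# so repeated runs yield the same sampled signal; a different seed yields a
# different (but equally valid) sample.
rng2 = np.random.default_rng(seed=3)
assert np.allclose(z, rng2.standard_normal((257, 100)))
```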
Wherein the steps performed by the prediction module 205 may be understood as one possible example of step G-4 or step H-4.
After predicting the noise reduction sampling signal, the prediction module 205 outputs the prediction result to the gradient calculation module 203.
The gradient calculation module 203 (second pass): its inputs are the sampling signal, the sampling time s, and the first speech signal x(f, t), where the sampling signal here is the nth sampling signal updated in the prediction module 205. The function of this module is to calculate the noise reduction gradient to be corrected using the trained neural network model. The gradient calculated by this module is output to the correction module 206 for correction.
The steps performed by the gradient calculation module 203 may be understood as one possible example of step G-5 or step H-5.
The correction module 206: using the annealed Langevin dynamics sampling method, it corrects the newly predicted sampling signal according to the noise reduction gradient to be corrected, obtaining the nth correction signal. For the annealed Langevin dynamics sampling method, reference may be made to the description in the related art. Alternatively, a Langevin dynamics sampling method may be employed, for which reference may likewise be made to the related art.
When 1 ≤ n < N, the nth correction signal serves as the input of the drift coefficient calculation module 202 for the (n+1)th iteration; when n = N, it serves as the final output of the method.
The steps performed by the correction module 206 may be understood as one possible example of step G-6 or step H-6.
The determination module 207 determines whether n is equal to N (a preset total number of sampling steps, for example 10). If n = N, the correction signal is output as the second speech signal; otherwise the sampling number and the sampling time are updated (i.e., n is replaced by n+1 and the sampling time s is decreased accordingly), and the flow loops back to the drift coefficient calculation module 202.
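Tying modules 201 to 207 together, the following sketch shows the overall N-step predict-then-correct loop, reusing the predict_step sketch above. The langevin_correct helper, its step size eps, and the sampling-time schedule are assumptions chosen only to make the control flow concrete; they are not the application's disclosed parameters.

```python
import numpy as np

def langevin_correct(y, x, s, score_net, rng, eps=1e-3, steps=1):
    """Sketch of module 206: correct the predicted signal with a (possibly
    annealed) Langevin dynamics step using the gradient to be corrected."""
    for _ in range(steps):
        grad = score_net(y, s, x)   # noise reduction gradient to be corrected
        y = y + eps * grad + np.sqrt(2 * eps) * rng.standard_normal(y.shape)
    return y

def denoise(x, score_net, drift_fn, N=10, seed=3):
    """Sketch of the fig. 4 loop: N rounds of predict + correct.

    x plays the role of the first speech signal; the return value plays the
    role of the second speech signal. The fig. 5 repair loop is analogous,
    with the first speech signal added as a condition of the second network."""
    rng = np.random.default_rng(seed)
    y, s = x.copy(), 0.999                                # module 201: initialization
    for n in range(1, N + 1):
        y = predict_step(y, x, s, score_net, drift_fn)    # modules 202-205
        y = langevin_correct(y, x, s, score_net, rng)     # module 206
        s = s * (1.0 - 1.0 / N)                           # illustrative time schedule
    return y                                              # module 207: output at n == N
```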
The benefits of method 200 may be seen as those of the corresponding aspects of method 100.
As a possible further example of examples 2-1 and 2-2, the steps of examples 2-1 and 2-2 are iterated M times to achieve a better high-frequency repair effect. Example 4 includes:
In the case where the sampling number m = 1:
Step I-1: input the first speech signal and the second speech signal into the second neural network to obtain the 1st repair gradient to be sampled of the second speech signal.
Step I-2: calculate the 1st repair drift coefficient from the second speech signal based on a stochastic differential equation of the second diffusion model.
Step I-3: calculate the 1st repair inverse drift coefficient from the 1st repair drift coefficient and the 1st repair gradient to be sampled.
Step I-4: predict the 1st repair sampling signal from the 1st repair inverse drift coefficient, the second speech signal, and the first Gaussian noise value.
Step I-5: input the 1st repair sampling signal into the second neural network to obtain the 1st repair gradient to be corrected of the 1st repair sampling signal.
Step I-6: correct the 1st repair sampling signal according to the 1st repair gradient to be corrected to obtain the 1st repair correction signal.
Or, in the case where the sampling number is m, with 2 ≤ m ≤ M and both m and M positive integers:
Step J-1: input the first speech signal, the second speech signal, and the (m-1)th repair correction signal into the second neural network to obtain the mth repair gradient to be sampled of the second speech signal.
Step J-2: calculate the mth repair drift coefficient from the second speech signal and the (m-1)th repair correction signal based on a stochastic differential equation of the second diffusion model.
Step J-3: calculate the mth repair inverse drift coefficient from the mth repair drift coefficient and the mth repair gradient to be sampled.
Step J-4: predict the mth repair sampling signal from the mth repair inverse drift coefficient, the (m-1)th repair correction signal, and the first Gaussian noise value.
Step J-5: input the first speech signal, the second speech signal, and the mth repair sampling signal into the second neural network to obtain the mth repair gradient to be corrected of the mth repair sampling signal.
Step J-6: correct the mth repair sampling signal according to the mth repair gradient to be corrected to obtain the mth repair correction signal.
Fig. 5 is a schematic diagram of a speech signal processing method 300 according to an embodiment of the present application. Fig. 5 may be taken as a possible further example of example 4. For example, the second diffusion model-based repair module 140 includes an initialization module 301 through a determination module 307 as shown in fig. 5.
The initialization module 301 initializes the sampling number m = 1 (representing the current, first sampling), the sampling time s = 0.999, and the sampling signal, and outputs the initialized parameters to the drift coefficient calculation module 302.
The sampling signal is initialized from the second speech signal y(f, t) described above, and thereafter represents the repair sampling signal.
In the case where the sampling number m satisfies 2 ≤ m ≤ M, when the steps from the drift coefficient calculation module 302 to the determination module 307 are executed for the mth time, the input sampling signal is the (m-1)th repair correction signal, which can be understood as the repair correction signal obtained by the (m-1)th sampling.
The drift coefficient calculation module 302: its inputs are the sampling signal, the sampling time s, and the second speech signal y(f, t). The function of this module is to calculate the drift coefficient fc from the stochastic differential equation of the diffusion model. The drift coefficient calculated by this module is output to the inverse drift coefficient calculation module 304 to calculate the inverse drift coefficient.
The stochastic differential equation may be a forward stochastic differential equation of the diffusion model.
The steps performed by the drift coefficient calculation module 302 may be understood as one possible example of step I-2 or step J-2.
The gradient calculation module 303: its inputs are the sampling signal, the sampling time s, the first speech signal x(f, t), and the second speech signal y(f, t). The function of this module is to calculate the repair gradient to be sampled, grad, using the trained second neural network model.
The steps performed by the gradient calculation module 303 may be understood as one possible example of step I-1 or step J-1.
Specifically, the training process of the second neural network model will be described in detail below in conjunction with fig. 7.
The gradient calculated by this module is output to the inverse drift coefficient calculation module 304 to calculate the inverse drift coefficient.
The inverse drift coefficient calculation module 304: its inputs are the drift coefficient fc and the gradient grad. It calculates the inverse drift coefficient as fcr = fc - g²·grad and outputs it to the prediction module 305 for prediction.
The steps performed by the inverse drift coefficient calculation module 304 may be understood as one possible example of step I-3 or step J-3.
The prediction module 305, which may also be referred to as the sampling prediction module 305, predicts a new sampling signal from the inverse drift coefficient, for example by an update of the form r ← r - fcr·Δs + g·√Δs·z2, where Δs is the sampling-time step and z2 is Gaussian noise with a mean of zero and a variance of 1 (i.e., the first Gaussian noise value described above). It should be noted that generating the Gaussian noise requires a random seed, which is set, for example, when the terminal device is initialized.
This update replaces the previous sampling signal with the newly predicted one, yielding the repair sampling signal obtained by the mth sampling, also referred to as the mth repair sampling signal.
Exemplary sampling methods include, but are not limited to, ancestral sampling, Langevin dynamics sampling, and annealed Langevin dynamics sampling (or a combination of these). Whichever sampling method is used, random numbers must first be generated from a random seed (in the present application, Gaussian noise with a mean of 0 and a variance of 1), and sampling is then performed based on those random numbers.
Wherein the steps performed by the prediction module 305 may be understood as one possible example of step I-4 or step J-4.
After the prediction module 305 predicts the repair sampling signal, the prediction result is output to the gradient calculation module 303.
The gradient calculation module 303 (second pass): its inputs are the sampling signal, the sampling time s, the first speech signal x(f, t), and the second speech signal y(f, t), where the sampling signal here is the mth sampling signal updated in the prediction module 305. The function of this module is to calculate the repair gradient to be corrected using the trained neural network model. The gradient calculated by this module is output to the correction module 306 for correction.
The steps performed by the gradient calculation module 303 may be understood as one possible example of step I-5 or step J-5.
The correction module 306: using the annealed Langevin dynamics sampling method, it corrects the newly predicted sampling signal according to the repair gradient to be corrected, obtaining the mth correction signal. For the annealed Langevin dynamics sampling method, reference may be made to the description in the related art. Alternatively, a Langevin dynamics sampling method may be employed, for which reference may likewise be made to the related art.
When 1 ≤ m < M, the mth correction signal serves as the input of the drift coefficient calculation module 302 for the (m+1)th iteration; when m = M, it serves as the final output of the method.
The steps performed by the correction module 306 may be understood as one possible example of step I-6 or step J-6.
The determination module 307 determines whether m is equal to M (a preset total number of sampling steps, for example 10). If m = M, the correction signal is output as the repaired speech signal r(f, t); otherwise the sampling number and the sampling time are updated (i.e., m is replaced by m+1 and the sampling time s is decreased accordingly), and the flow loops back to the drift coefficient calculation module 302.
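Structurally, the fig. 5 loop differs from the fig. 4 loop mainly in its conditioning inputs, in line with steps J-1 and J-5. A minimal sketch of the two network interfaces (all names are assumptions):

```python
def grad_noise_reduction(score_net1, y_hat, s, x):
    # Fig. 4: the gradient is conditioned on the first speech signal x only.
    return score_net1(y_hat, s, x)

def grad_repair(score_net2, r_hat, s, x, y):
    # Fig. 5: the gradient is additionally conditioned on the second speech
    # signal y, while x serves as the reference information for repairing
    # the high-frequency components (the conditional diffusion model).
    return score_net2(r_hat, s, x, y)
```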
The benefits of method 300 may be seen as those of the corresponding aspects of method 100.
As a further example of example 1-1, example 5: the first neural network is trained based on first input data and first target data. The first input data includes a sample noisy speech signal and a sample noise reduction sampling signal; the sample noisy speech signal is generated from a first sample noise signal and a first sample speech signal, and the sample noise reduction sampling signal is generated based on the sample noisy speech signal. The first target data is used to characterize the probability distribution of the first sample speech signal.
In this scheme, the neural network model is trained with target data that characterizes the probability distribution of the sample speech signal contained in the sample noisy speech signal, so that the gradient output by the trained neural network model can characterize the probability distribution of the second speech signal.
The sample noise reduction sampling signal is a fifth Gaussian noise value, the fifth Gaussian noise value is generated according to the sample noisy speech signal, the sample speech signal and a sixth Gaussian noise value conforming to a standard normal distribution, the first target data is generated based on a standard deviation of the fifth Gaussian noise value and the sixth Gaussian noise value, and the sixth Gaussian noise value is generated based on a fourth random seed.
It can be understood that, because the random numbers generated from random seeds differ at different times, when the speech signal processing method provided by the present application performs noise reduction processing on the same segment of the to-be-processed speech signal at different times, the second speech signals output each time differ, or are not identical, and hence the finally output target speech signals also differ, or are not identical.
Fig. 6 shows a schematic diagram of a training process 400 of a deep neural network model according to an embodiment of the present application. The training process shown in fig. 6 may be understood as a specific example of the training process of the first neural network in the method 100, that is, a specific example of example 5.
The training flow diagram of the deep neural network model comprises the following modules.
The adding module 401 is configured to linearly add the sample noise and the sample clean speech signal Ya (k) according to a certain signal-to-noise ratio, so as to obtain a sample noisy speech signal Ys (k), where the signal is a signal in the time domain.
Wherein the sample noise is from a sample noise set and the sample clean speech signal is from a clean speech signal set. The sample noise set may be generated by simulation or may be actually recorded. The sample clean speech data set may be prerecorded.
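A minimal sketch of the adding module 401 (the SNR convention and function name are assumptions): the sample noise is scaled so that the mixture reaches a chosen signal-to-noise ratio and is then added linearly to the clean signal.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Linearly add noise to clean speech at a target SNR (sketch of module 401)."""
    noise = noise[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12          # guard against silent noise
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise                   # sample noisy speech signal Ys(k)
```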
The feature transformation module 402 performs feature transformation on the sample noisy speech signal Ys (k) to obtain a sample noisy speech signal Yx (f, t) in a time-frequency domain after the feature transformation; the sample clean speech signal Ya (k) is further subjected to feature transformation to obtain a sample clean speech signal Ya (f, t) in a time-frequency domain after the feature transformation.
For example, the feature transform is a short-time fourier transform, an amplitude transform, or the like.
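Taking the short-time Fourier transform as the example, the feature transformation of module 402 might look like the following sketch (sampling rate, window length, and overlap are placeholders):

```python
import numpy as np
from scipy.signal import stft

rng = np.random.default_rng(0)
Ys = rng.standard_normal(16000)   # stand-in for the time-domain signal Ys(k)
f, t, Yx = stft(Ys, fs=16000, nperseg=512, noverlap=384)
# Yx is the time-frequency signal Yx(f, t); an amplitude transform would
# additionally take np.abs(Yx), possibly followed by a compression such as log.
```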
The sample time generation module 403 requires no input information. Its function is to randomly generate a decimal between l_min and l_max as the sampling time, where l_min is a minimum value such as 0.003 and l_max is a maximum value such as 0.999. The sampling time generated by this module is output to the sample generation module 405.
The Gaussian noise generation module 404 is configured to generate a Gaussian noise value z3 (i.e., the sixth Gaussian noise value described above) with a mean of zero and a variance of 1.
The sample generation module 405 is configured to generate the sample noise reduction sampling signal from the sample noisy speech signal Yx(f, t), the sample clean speech signal Ya(f, t), the sampling time l, and the Gaussian noise value z3. The sampling signal is drawn from a Gaussian distribution whose mean is determined by the sample signals and the sampling time and whose standard deviation is determined by the sampling time, following the forward process of the first diffusion model.
The first neural network 406: the sample noisy speech signal Yx(f, t), the sample sampling time l, and the sample noise reduction sampling signal are input to the first neural network 406, which outputs the noise reduction sample gradient.
The first neural network may be formed from a variety of deep neural network models, such as a noise conditional score network (NCSN++), a convolutional neural network (convolutional neural networks, CNN), a convolutional recurrent neural network (convolutional recurrent neural network, CRNN), a U-shaped neural network (U-Net), and the like.
The loss function calculation module 407: the target data is generated from the sixth Gaussian noise value z3 and the standard deviation of the distribution of the sampling signal, and the loss function is calculated between this target data and the gradient predicted by the network.
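Putting modules 403 to 407 together, one plausible training step is sketched below. The mean/standard-deviation schedule, the score target, and the network interface follow common score-based diffusion practice and are assumptions; the application does not disclose the exact formulas.

```python
import numpy as np

rng = np.random.default_rng(seed=4)    # stands in for the fourth random seed

def training_step(Yx, Ya, score_net, l_min=0.003, l_max=0.999):
    """One sketch training step for the first neural network (flow 400)."""
    l = rng.uniform(l_min, l_max)              # module 403: sampling time
    z6 = rng.standard_normal(Ya.shape)         # module 404: sixth Gaussian noise value
    # Module 405 (assumed schedule): the mean drifts from the clean signal Ya
    # toward the noisy signal Yx as l grows; the standard deviation grows with l.
    alpha = np.exp(-l)
    sigma = np.sqrt(1.0 - np.exp(-2.0 * l))
    Yt = alpha * Ya + (1.0 - alpha) * Yx + sigma * z6  # sample noise reduction sampling signal
    grad = score_net(Yx, l, Yt)                # module 406: predicted gradient
    target = -z6 / sigma                       # target data from z6 and sigma
    return np.mean((grad - target) ** 2)       # module 407: loss (assumed MSE)
```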
The benefits of the training process 400 may be seen in the benefits of the corresponding scenario in the method 100.
As a further example of example 2-1, example 6: the second neural network is trained based on second input data and second target data. The second input data includes a sample noisy speech signal, a sample attenuated speech signal, and a sample repair sampling signal; the sample noisy speech signal is generated from a second sample noise signal and a second sample speech signal, the sample attenuated speech signal is generated by convolving the sample noisy speech signal with a preset distance room impulse response, and the sample repair sampling signal is generated based on the sample attenuated speech signal and the second sample speech signal. The second target data is used to characterize the probability distribution of the second sample speech signal. The preset distance room impulse response is used to simulate the propagation, from the sound source to the microphone, of sound emitted by a sound source whose distance is greater than or equal to a preset distance threshold.
In this scheme, the neural network model is trained with target data that characterizes the probability distribution of the sample speech signal used to generate the sample attenuated speech signal, so that the gradient output by the trained neural network model can characterize the probability distribution of the target speech signal.
The sample repair sampling signal is a second Gaussian noise value; the second Gaussian noise value is generated according to the sample attenuated speech signal, the sample speech signal, and a third Gaussian noise value conforming to a standard normal distribution; the second target data is generated based on a standard deviation of the second Gaussian noise value and the third Gaussian noise value; and the third Gaussian noise value is generated based on a second random seed.
It can be understood that, because the random numbers generated from random seeds differ at different times, when the speech signal processing method provided by the present application performs repair processing on the same segment of the to-be-processed speech signal at different times, the target speech signals output each time differ, or are not identical.
Fig. 7 is a schematic diagram of a training process 500 of a deep neural network model according to an embodiment of the present application. The training process shown in fig. 7 may be understood as a specific example of the training process of the second neural network in the method 100, that is, a specific example of example 6.
The training flow diagram of the deep neural network model comprises the following modules.
The adding module 501 is configured to linearly add the sample noise and the sample clean speech signal Ya (k) according to a certain signal-to-noise ratio to obtain a sample noisy speech signal Ys (k), where the signal is a signal in the time domain.
Wherein the sample noise is from a sample noise set and the sample clean speech signal is from a clean speech signal set. The sample noise set may be generated by simulation or may be actually recorded. The sample clean speech data set may be prerecorded.
The convolution module 502 is configured to perform convolution calculation on the sample noisy speech signal Ys (k) and the remote room impulse response (i.e. the remote RIR) to obtain a sample attenuated speech signal Yd (k) with attenuated high frequency components.
Wherein the remote RIR is from a remote RIR dataset comprising remote RIRs recorded in an oversized room, and also comprising remote RIRs in oversized rooms simulated using a simulation algorithm.
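A sketch of the convolution module 502, with a toy exponentially decaying impulse response standing in for a far-field RIR drawn from the dataset (all values assumed):

```python
import numpy as np
from scipy.signal import fftconvolve

rng = np.random.default_rng(1)
Ys = rng.standard_normal(16000)     # stand-in for the sample noisy signal Ys(k)
rir = rng.standard_normal(4000) * np.exp(-np.arange(4000) / 800.0)  # toy far-field RIR
Yd = fftconvolve(Ys, rir)[: len(Ys)]   # sample attenuated speech signal Yd(k)
```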
A feature transformation module 503, configured to perform feature transformation on the sample attenuated speech signal Yd (k) to obtain a sample attenuated speech signal Yx (f, t) in a time-frequency domain after the feature transformation; the sample clean speech signal Ya (k) is further subjected to feature transformation to obtain a sample clean speech signal Ya (f, t) in a time-frequency domain after the feature transformation.
The noise reduction module 504 based on the first diffusion model is configured to perform noise reduction processing on the sample noise-contained speech signal Yx (f, t), and output a sample noise-reduced speech signal Yy (f, t).
The sample time generation module 505 requires no input information. Its function is to randomly generate a decimal between l_min and l_max as the sampling time, where l_min is a minimum value such as 0.003 and l_max is a maximum value such as 0.999. The sampling time generated by this module is output to the sample generation module 507.
The Gaussian noise generation module 506 is configured to generate a Gaussian noise value z4 (i.e., the third Gaussian noise value described above) with a mean of zero and a variance of 1.
The sample generation module 507 is configured to generate the sample repair sampling signal from the sample attenuated speech signal Yx(f, t), the sample clean speech signal Ya(f, t), the sampling time l, and the Gaussian noise value z4. The sampling signal is drawn from a Gaussian distribution whose mean is determined by the sample signals and the sampling time and whose standard deviation is determined by the sampling time, following the forward process of the second diffusion model.
The second neural network 508: the sample attenuated speech signal Yx(f, t), the sample sampling time l, and the sample repair sampling signal are input to the second neural network 508, which outputs the repair sample gradient.
The second neural network here may be formed from a variety of deep neural network models, such as an NCSN++ network, a CNN, a CRNN, a U-Net, and the like.
The loss function calculation module 509: the target data is generated from the third Gaussian noise value z4 and the standard deviation of the distribution of the sampling signal, and the loss function is calculated between this target data and the gradient predicted by the network.
The benefits of the training process 500 may be seen from the benefits of the corresponding scenario in the method 100.
Referring to fig. 8, fig. 8 shows a schematic hardware structure of an electronic device 1000 according to an embodiment of the application.
The electronic device 1000 may be a headset, a cell phone, a smart screen, a tablet computer, a wearable electronic device, an in-vehicle electronic device, an Augmented Reality (AR) device, a Virtual Reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a Personal Digital Assistant (PDA), a projector, etc., and embodiments of the present application do not limit the specific type of the electronic device 1000.
Referring to fig. 8, the electronic device 1000 may include a processor 1010, an audio module 1020, a microphone 1020A, optionally including: speaker 1020B.
It should be understood that the illustrated structure of the embodiment of the present application does not constitute a specific limitation on the electronic device 1000. In other embodiments of the application, electronic device 1000 may include more or fewer components than shown, or certain components may be combined, or certain components may be split, or different arrangements of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 1010 may include one or more processing units, such as: the processor 1010 may include an application processor (application processor, AP), a modem processor, a graphics processor (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), a controller, a memory, a video codec, a digital signal processor (digital signal processor, DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc.
The controller may be a neural hub and a command center of the electronic device 1000, among others. The controller can generate operation control signals according to the instruction operation codes and the time sequence signals to finish the control of instruction fetching and instruction execution.
A memory may also be provided in the processor 1010 for storing instructions and data. In some embodiments, the memory in the processor 1010 is a cache memory. The memory may hold instructions or data that the processor 1010 has just used or recycled. If the processor 1010 needs to reuse the instruction or data, it can be called directly from the memory. Repeated accesses are avoided and the latency of the processor 1010 is reduced, thereby improving the efficiency of the system.
In the present application, the processor 1010 is configured to perform noise reduction on the to-be-processed speech signal based on a diffusion model and then repair the high-frequency components based on a diffusion model, so as to obtain the target speech signal. On the one hand, the noise reduction makes the target speech signal clearer in auditory perception; on the other hand, repairing the high-frequency components makes the target speech signal fuller in auditory perception. In addition, compared with a scheme that only performs noise reduction processing on far-field speech signals, the target speech signal can be made fuller in auditory perception.
In some embodiments, the processor 1010 may include one or more interfaces, such as an inter-integrated circuit (inter-integrated circuit, I2C) interface, an inter-integrated circuit sound (inter-integrated circuit sound, I2S) interface, a pulse code modulation (pulse code modulation, PCM) interface, a universal asynchronous receiver/transmitter (universal asynchronous receiver/transmitter, UART) interface, a mobile industry processor interface (mobile industry processor interface, MIPI), a general-purpose input/output (general-purpose input/output, GPIO) interface, a subscriber identity module (subscriber identity module, SIM) interface, and/or a universal serial bus (universal serial bus, USB) interface, among others.
It should be understood that the connection relationship between the modules illustrated in the embodiments of the present application is only illustrative, and does not limit the structure of the electronic device 1000. In other embodiments of the present application, the electronic device 1000 may also employ different interfacing manners in the above embodiments, or a combination of multiple interfacing manners.
The electronic device 1000 may implement audio functions such as music playing, recording, voice conversation, etc. through an audio module 1020, a speaker 1020B, a microphone 1020A, an application processor (not shown), etc.
In a voice call or recording scenario, microphone 1020A is used to record far-field voice signals.
Optionally, during the voice call, speaker 1020B is used to play the voice of the party to the user call; in the recording scene, if the user wants to listen to the recording content, the speaker 1020B plays the recording content.
Next, a software system of the electronic apparatus 1000 will be described.
By way of example, the electronic device 1000 may be a cell phone. The software system of the electronic device 1000 may employ a layered architecture, an event driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. In the embodiment of the application, an Android (Android) system with a layered architecture is taken as an example, and a software system of the electronic device 1000 is illustrated.
Fig. 9 shows a block diagram of a software system of an electronic device 1000 according to an embodiment of the application. Referring to fig. 9, the hierarchical architecture divides the software into several layers, each with distinct roles and branches. The layers communicate with each other through a software interface. In some embodiments, the Android system is divided into five layers, from top to bottom, an application layer (application layer), an application framework layer (framework layer), a hardware abstraction layer (hardware abstraction layer, HAL), a driver layer, and a hardware layer, respectively.
The application layer may include a series of application packages, such as a dialing application, gallery application, etc. (not shown). In the embodiment of the application, the application program package can comprise application programs such as recording and voice communication, and the application programs all need to record audio by using a microphone. The terminal equipment can perform the voice signal processing of the application on the audio recorded by the microphone, including noise reduction processing and high-frequency component restoration processing.
Or the application layer may further include other applications that need to perform the speech signal processing of the present application on the speech signal recorded by the microphone, which is not limited by the present application.
The framework layer provides an application programming interface (application programming interface, API) and programming framework for the application programs of the application layer. The application framework layer includes a number of predefined functions. In an embodiment of the application, the framework layer comprises a microphone service interface and a noise reduction and repair high frequency component service interface. The noise reduction and restoration high-frequency component service interface can provide an API and a programming framework for the application for acquiring the noise reduction and restoration high-frequency component service. The microphone service may be used to provide an API and programming framework for applications calling the microphone.
A Hardware Abstraction Layer (HAL) is an interface layer between the operating system kernel and the upper layer software that provides a virtual hardware platform for the operating system. In the embodiment of the application, the hardware abstraction layer can comprise a microphone hardware abstraction layer and a noise reduction and high-frequency restoration component algorithm. The microphone hardware abstraction layer may provide virtual hardware for microphone 1, microphone 2, or more microphone devices. The noise reduction and restoration high frequency component algorithm may include running codes and data implementing the speech signal processing method provided by the embodiments of the present application.
The driver layer is a layer between hardware and software. The driver layer includes drivers for various hardware, and may include a microphone device driver, a digital signal processor driver, and the like. The microphone device driver is used for driving the microphone sensor to collect sound signals and driving the audio signal processor to preprocess the sound signals to obtain audio digital signals. The digital signal processor driver is used for driving the digital signal processor to process the audio digital signal.
The hardware layer includes sensors and an audio signal processor. The sensor comprises a microphone 1 and a microphone 2. Microphones included in the sensor are in one-to-one correspondence with virtual microphones included in the microphone hardware abstraction layer. The audio signal processor may be used to convert sound signals collected by the microphone into audio digital signals. The digital signal processor may be used to process the audio digital signal. It should be noted that, the software structure schematic diagram of the electronic device shown in fig. 9 provided by the present application is only used as an example, and is not limited to specific module division in different layers of the Android operating system, and the description of the software structure of the Android operating system in the conventional technology may be referred to specifically.
The method in the embodiment of the present application is specifically described below with reference to the above hardware structure and system structure:
In response to enabling an application such as a sound recording, or a voice call, such an application may invoke the noise reduction and repair high frequency component service interface to obtain a noise reduction and repair high frequency component service providing application programming interface and programming framework.
In one aspect, the noise reduction and restoration high frequency component service may invoke a microphone service of the framework layer through which sound signals in the environment are collected. The microphone service may send an instruction to collect a sound signal to a microphone 1 sensor of the hardware layer by invoking the microphone 1 in the microphone hardware abstraction layer. The microphone hardware abstraction layer sends the instruction to the microphone device driver of the driver layer. The microphone device driver may activate the microphone 1 in accordance with the above instructions to obtain sound signals in the environment and generate digital audio signals via the audio signal processor.
On the other hand, the noise reduction and restoration high frequency component service may initialize a noise reduction and restoration high frequency component algorithm. The noise reduction and repair high frequency component algorithm may obtain an audio signal processor through a microphone hardware abstraction layer to generate a digital audio signal. Then, according to the voice signal processing method stored in the noise reduction and restoration high frequency component algorithm, the noise reduction and restoration high frequency component algorithm can process the acquired digital audio signal by using the digital signal processor so as to obtain a digital audio signal after noise reduction and restoration of the high frequency component.
In particular, how the digital audio signal is processed to obtain the noise-reduced and high-frequency-restored digital audio signal can be seen from the above-described flowcharts of the methods shown in fig. 3 to 7.
Finally, the noise reduction and restoration high-frequency component algorithm can transmit the digital audio signal, after noise reduction and restoration of the high-frequency components, back to the noise reduction and restoration high frequency component service, and then back to the application layer.
Embodiments of the present application provide a chip system including one or more processors configured to invoke from a memory and execute instructions stored in the memory, so that the method of the embodiments of the present application described above is performed. The chip system may be formed of a chip or may include a chip and other discrete devices.
The chip system may include an input circuit or interface for transmitting information or data, and an output circuit or interface for receiving information or data, among other things.
The application also provides a computer program product which, when executed by a processor, implements the method of any of the method embodiments of the application.
The computer program product may be stored in a memory and eventually converted to an executable object file that can be executed by a processor through preprocessing, compiling, assembling, and linking.
The application also provides a computer readable storage medium having stored thereon a computer program which when executed by a computer implements the method according to any of the method embodiments of the application. The computer program may be a high-level language program or an executable object program.
The computer readable storage medium may be volatile memory or nonvolatile memory, or may include both volatile memory and nonvolatile memory. The nonvolatile memory may be a read-only memory (read-only memory, ROM), a programmable ROM (programmable ROM, PROM), an erasable programmable ROM (erasable PROM, EPROM), an electrically erasable programmable ROM (electrically EPROM, EEPROM), or a flash memory. The volatile memory may be random access memory (random access memory, RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as static random access memory (static RAM, SRAM), dynamic random access memory (dynamic RAM, DRAM), synchronous dynamic random access memory (synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (double data rate SDRAM, DDR SDRAM), enhanced synchronous dynamic random access memory (enhanced SDRAM, ESDRAM), synchlink dynamic random access memory (synchlink DRAM, SLDRAM), and direct rambus random access memory (direct rambus RAM, DR RAM).
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, specific working processes and technical effects of the apparatus and device described above may refer to corresponding processes and technical effects in the foregoing method embodiments, which are not described in detail herein.
In the several embodiments provided by the present application, the disclosed systems, devices, and methods may be implemented in other manners. For example, some features of the method embodiments described above may be omitted, or not performed. The above-described apparatus embodiments are merely illustrative, the division of units is merely a logical function division, and there may be additional divisions in actual implementation, and multiple units or components may be combined or integrated into another system. In addition, the coupling between the elements or the coupling between the elements may be direct or indirect, including electrical, mechanical, or other forms of connection.
It should be understood that, in the various embodiments of the present application, the size of the sequence numbers of the processes does not imply an order of execution; the order of execution of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
It should be understood that references to "a plurality" in this disclosure refer to two or more. The term "and/or" herein is merely one association relationship describing the associated object, meaning that there may be three relationships, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship.
The terms (or numbers) of "first," "second," …, etc. in the embodiments of the present application are used for descriptive purposes only and are not to be construed as indicating or implying any relative importance or number of features indicated, for example, different "coordinates" or the like. Thus, features defining "first," "second," …, etc., may explicitly or implicitly include one or more features. In the description of embodiments of the application, "at least one (an item)" means one or more. The meaning of "plurality" is two or more. "at least one of (an) or the like" below means any combination of these items, including any combination of a single (an) or a plurality (an) of items.
In summary, the foregoing description is only a preferred embodiment of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (15)

1. A method for processing a voice signal, applied to an electronic device, the method comprising:
Acquiring a first voice signal to be processed;
noise reduction processing is carried out on the first voice signal based on a first diffusion model, and a second voice signal is obtained;
Repairing the first voice signal and the second voice signal based on a second diffusion model to obtain a target voice signal, wherein the second diffusion model is a conditional diffusion model, and the repairing comprises: and repairing the high-frequency component of the second voice signal by taking the first voice signal as the condition of the second diffusion model.
2. The method of claim 1, wherein the first speech signal is a far-field speech signal or a distance of a microphone of the electronic device from a sound source of the first speech signal is greater than or equal to a preset distance threshold.
3. The method of claim 1, wherein the repairing the first speech signal and the second speech signal based on the second diffusion model to obtain the target speech signal comprises:
Calculating a gradient of the second voice signal according to the first voice signal and the second voice signal, wherein the gradient is used for representing probability distribution of the target voice signal, and the probability distribution of the target voice signal corresponds to the distribution of time-frequency points in a spectrogram of the target voice signal;
And sampling the gradient of the second voice signal based on a first Gaussian noise value to obtain the target voice signal, wherein the first Gaussian noise value is generated based on a first random seed.
4. The method of claim 3, wherein,
Said calculating a gradient of said second speech signal from said first speech signal and said second speech signal comprises: inputting the first voice signal and the second voice signal into a second neural network to obtain a repair to-be-sampled gradient of the second voice signal, wherein the repair to-be-sampled gradient is used for representing probability distribution of the target voice signal;
The sampling the gradient of the second speech signal based on the first gaussian noise value includes: predicting a repair sampling signal according to the repair gradient to be sampled and the first Gaussian noise value;
The calculating the gradient of the second voice signal according to the first voice signal and the second voice signal further comprises: inputting the repair sampling signal into the second neural network to obtain a repair gradient to be corrected of the repair sampling signal, wherein the repair gradient to be corrected is used for representing probability distribution of the target voice signal;
The sampling processing of the gradient of the second voice signal further includes: and correcting the repair sampling signal according to the gradient to be corrected to obtain a repair correction signal, wherein the repair correction signal is used for generating the target voice signal.
5. The method of claim 4, wherein predicting a repair sample signal based on the repair to-be-sampled gradient and the first gaussian noise value comprises:
Calculating a repair drift coefficient according to the second voice signal based on a stochastic differential equation of the second diffusion model;
Calculating a repair inverse drift coefficient according to the repair drift coefficient and the repair gradient to be sampled;
and predicting to obtain the repair sampling signal according to the repair inverse drift coefficient, the second voice signal and the first Gaussian noise value.
6. The method of claim 4, wherein the second neural network is trained based on second input data and second target data, wherein the second input data comprises a sample noisy speech signal, a sample attenuated speech signal, and a sample repair sampling signal, the sample noisy speech signal being generated from a second sample noise signal and a second sample speech signal, the sample attenuated speech signal being generated by convolving the sample noisy speech signal with a preset distance room impulse response, the sample repair sampling signal being generated based on the sample attenuated speech signal and the second sample speech signal, the second target data being used to characterize a probability distribution of the second sample speech signal, and the preset distance room impulse response being used to simulate the propagation of sound emitted by a sound source to a microphone of the electronic device when the distance from the sound source is greater than or equal to a preset distance threshold.
7. The method of claim 6, wherein the sample repair sampling signal is a second Gaussian noise value, the second Gaussian noise value being generated from the sample attenuated speech signal, the second sample speech signal, and a third Gaussian noise value subject to a standard normal distribution, the second target data being generated based on a standard deviation of the second Gaussian noise value and the third Gaussian noise value, and the third Gaussian noise value being generated based on a second random seed.
8. The method of claim 1, wherein the noise reduction processing of the first speech signal based on the first diffusion model to obtain a second speech signal comprises:
calculating a gradient of the first voice signal, wherein the gradient is used for representing probability distribution of the second voice signal, and the probability distribution of the second voice signal corresponds to the distribution of time-frequency points in a spectrogram of the second voice signal;
and sampling the gradient of the first voice signal to obtain the second voice signal.
9. The method of claim 8, wherein
the calculating a gradient of the first speech signal comprises: inputting the first speech signal into a first neural network to obtain a to-be-sampled noise reduction gradient of the first speech signal, wherein the to-be-sampled noise reduction gradient is used to characterize the probability distribution of the second speech signal;
the sampling the gradient of the first speech signal comprises: predicting a noise reduction sampling signal according to the to-be-sampled noise reduction gradient;
the calculating a gradient of the first speech signal further comprises: inputting the noise reduction sampling signal into the first neural network to obtain a to-be-corrected noise reduction gradient of the noise reduction sampling signal, wherein the to-be-corrected noise reduction gradient is used to characterize the probability distribution of the second speech signal;
and the sampling the gradient of the first speech signal further comprises: correcting the noise reduction sampling signal according to the to-be-corrected noise reduction gradient to obtain a noise reduction correction signal, wherein the noise reduction correction signal is used to generate the second speech signal.
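Claims 8-10 mirror the repair stage's predictor-corrector structure for noise reduction, with the first neural network conditioned only on the first speech signal. Chaining the two stages gives the pipeline sketched below; the generic sampler, the step function, and all shapes are assumptions, not the patent's exact procedure.

```python
import torch

def run_diffusion(x_init, cond, score_net, step_fn, n_steps=30):
    """Generic reverse-diffusion sampler; step_fn is a predictor-corrector
    step such as the one sketched after claim 4."""
    x = x_init
    for i in range(n_steps, 0, -1):
        x = step_fn(x, i / n_steps, 1.0 / n_steps, cond, score_net)
    return x

def process_far_field(first_speech, denoise_net, repair_net, step_fn):
    """Two-stage flow from the claims: noise reduction with the first
    diffusion model, then high-frequency repair with the second model,
    conditioned on BOTH the pre-noise-reduction and noise-reduced signals."""
    x0 = torch.randn_like(first_speech)
    second_speech = run_diffusion(x0, first_speech, denoise_net, step_fn)   # claims 8-10
    x0 = torch.randn_like(first_speech)
    target_speech = run_diffusion(x0, (first_speech, second_speech),
                                  repair_net, step_fn)                      # claims 4-5
    return target_speech
```

Here `step_fn` could be, e.g., `functools.partial(pc_repair_step, g=diffusion_coeff)` built from the earlier sketches.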
10. The method of claim 9, wherein the predicting a noise reduction sampling signal according to the to-be-sampled noise reduction gradient comprises:
calculating a noise reduction drift coefficient according to the first speech signal based on a stochastic differential equation of the first diffusion model;
calculating a noise reduction reverse drift coefficient according to the noise reduction drift coefficient and the to-be-sampled noise reduction gradient;
and predicting the noise reduction sampling signal according to the noise reduction reverse drift coefficient, the first speech signal, and a fourth Gaussian noise value, the fourth Gaussian noise value being generated based on a third random seed.
11. The method of claim 9, wherein the first neural network is trained based on first input data and first target data; the first input data comprises a sample noisy speech signal and a sample noise reduction sampling signal; the sample noisy speech signal is generated from a first sample noise signal and a first sample speech signal; the sample noise reduction sampling signal is generated based on the sample noisy speech signal; and the first target data is used to characterize the probability distribution of the first sample speech signal.
12. The method of claim 11, wherein the sample noise reduction sampling signal is a fifth Gaussian noise value generated from the sample noisy speech signal, the first sample speech signal, and a sixth Gaussian noise value that follows a standard normal distribution; and the first target data is generated based on a standard deviation of the fifth Gaussian noise value and the sixth Gaussian noise value, the sixth Gaussian noise value being generated based on a fourth random seed.
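Claims 7 and 12 both build the target data from the standard deviation of the perturbation and the sampled Gaussian noise value, which is the shape of a denoising score matching objective. A hedged sketch of that loss; the sigma-squared weighting is a common convention and an assumption here.

```python
import torch

def score_matching_loss(score_net, x_t, t, cond, z, sigma_t):
    """Denoising score matching: regress the network output at the perturbed
    sample x_t onto -z / sigma_t, a target formed from the Gaussian noise
    value z and its standard deviation sigma_t (cf. claims 7 and 12)."""
    target = -z / sigma_t
    pred = score_net(x_t, t, cond)
    return ((sigma_t ** 2) * (pred - target) ** 2).mean()
```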
13. An electronic device, comprising: one or more processors and a memory;
wherein the memory is coupled to the one or more processors and is configured to store computer program code, the computer program code comprising computer instructions; and the one or more processors invoke the computer instructions to cause the electronic device to perform the method of any one of claims 1 to 12.
14. A chip system applied to an electronic device, the chip system comprising one or more processors configured to invoke computer instructions to cause the electronic device to perform the method of any one of claims 1 to 12.
15. A computer readable storage medium comprising instructions that, when run on an electronic device, cause the electronic device to perform the method of any one of claims 1 to 12.
CN202410350758.1A 2024-03-26 2024-03-26 Voice signal processing method and related equipment Pending CN118098260A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410350758.1A CN118098260A (en) 2024-03-26 2024-03-26 Voice signal processing method and related equipment

Publications (1)

Publication Number Publication Date
CN118098260A true CN118098260A (en) 2024-05-28

Family

ID=91154870

Country Status (1)

Country Link
CN (1) CN118098260A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090228272A1 (en) * 2007-11-12 2009-09-10 Tobias Herbig System for distinguishing desired audio signals from noise
JP2013175869A (en) * 2012-02-24 2013-09-05 Nippon Telegr & Teleph Corp <Ntt> Acoustic signal enhancement device, distance determination device, methods for the same, and program
CN108831495A (en) * 2018-06-04 2018-11-16 桂林电子科技大学 A kind of sound enhancement method applied to speech recognition under noise circumstance
US20190259381A1 (en) * 2018-02-14 2019-08-22 Cirrus Logic International Semiconductor Ltd. Noise reduction system and method for audio device with multiple microphones
CN110619885A (en) * 2019-08-15 2019-12-27 西北工业大学 Method for generating confrontation network voice enhancement based on deep complete convolution neural network
CN112712818A (en) * 2020-12-29 2021-04-27 苏州科达科技股份有限公司 Voice enhancement method, device and equipment
US20220392471A1 (en) * 2021-06-02 2022-12-08 Arizona Board Of Regents On Behalf Of Arizona State University Systems, methods, and apparatuses for restoring degraded speech via a modified diffusion model
CN117153181A (en) * 2023-02-10 2023-12-01 荣耀终端有限公司 Voice noise reduction method, device and storage medium
CN117610509A (en) * 2023-11-30 2024-02-27 北京理工大学 Text generation method based on diffusion language model

Similar Documents

Publication Title
CN110992974B (en) Speech recognition method, apparatus, device and computer readable storage medium
US10511908B1 (en) Audio denoising and normalization using image transforming neural network
CN113436643B (en) Training and application method, device and equipment of voice enhancement model and storage medium
CN111968658A (en) Voice signal enhancement method and device, electronic equipment and storage medium
CN107240396B (en) Speaker self-adaptation method, device, equipment and storage medium
CN114203163A (en) Audio signal processing method and device
CN109803059A (en) Audio-frequency processing method and device
CN111627455A (en) Audio data noise reduction method and device and computer readable storage medium
CN116030823B (en) Voice signal processing method and device, computer equipment and storage medium
US20240177726A1 (en) Speech enhancement
Kothapally et al. Skipconvgan: Monaural speech dereverberation using generative adversarial networks via complex time-frequency masking
CN114792524B (en) Audio data processing method, apparatus, program product, computer device and medium
Somayazulu et al. Self-supervised visual acoustic matching
CN111354367A (en) Voice processing method and device and computer storage medium
CN112466327B (en) Voice processing method and device and electronic equipment
US10079028B2 (en) Sound enhancement through reverberation matching
CN118098260A (en) Voice signal processing method and related equipment
Steinmetz et al. Randomized overdrive neural networks
WO2023287782A1 (en) Data augmentation for speech enhancement
CN115762546A (en) Audio data processing method, apparatus, device and medium
CN117953912A (en) Voice signal processing method and related equipment
CN117153178B (en) Audio signal processing method, device, electronic equipment and storage medium
CN113707163B (en) Speech processing method and device and model training method and device
CN110931038B (en) Voice enhancement method, device, equipment and storage medium
CN113516995B (en) Sound processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination