CN110580910B - Audio processing method, device, equipment and readable storage medium - Google Patents

Audio processing method, device, equipment and readable storage medium

Info

Publication number
CN110580910B
CN110580910B (application CN201810589891.7A)
Authority
CN
China
Prior art keywords
voice
frequency band
signal
speech
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810589891.7A
Other languages
Chinese (zh)
Other versions
CN110580910A (en)
Inventor
文仕学
潘逸倩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd filed Critical Beijing Sogou Technology Development Co Ltd
Priority to CN201810589891.7A priority Critical patent/CN110580910B/en
Publication of CN110580910A publication Critical patent/CN110580910A/en
Application granted granted Critical
Publication of CN110580910B publication Critical patent/CN110580910B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Filters That Use Time-Delay Elements (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The embodiment of the invention provides an audio processing method, an audio processing device, audio processing equipment and a readable storage medium, wherein the method comprises the following steps: training a voice enhancement model in advance according to weight information corresponding to the acquired voice frequency band errors; after receiving a mixed voice signal, carrying out voice enhancement on the mixed voice signal according to weight bias parameters corresponding to each voice frequency band in the voice enhancement model to obtain a target voice signal, wherein the weight bias parameters are obtained by training according to the weight information; and outputting according to the target voice signal. The embodiment of the invention solves the problem that existing voice enhancement models treat every voice frequency band identically and therefore achieve a poor noise reduction effect, and improves the voice enhancement effect.

Description

Audio processing method, device, equipment and readable storage medium
Technical Field
The present invention relates to the field of communication technology, and in particular, to an audio processing method, an audio processing apparatus, a device, and a readable storage medium.
Background
With the rapid development of communication technology, terminals such as mobile phones and tablet computers are becoming more popular, and great convenience is brought to life, study and work of people.
These terminals may collect voice signals through microphones and process the collected voice signals using voice enhancement techniques to reduce the effects of noise interference. Speech enhancement is a technique for extracting the useful speech signal from a noisy background, suppressing and reducing noise interference when the speech signal is disturbed or even submerged by various kinds of noise.
Disclosure of Invention
The technical problem to be solved by the embodiment of the invention is to provide an audio processing method for improving the voice enhancement effect.
Correspondingly, the embodiment of the invention also provides an audio processing device, equipment and a readable storage medium, which are used for ensuring the implementation and application of the method.
In order to solve the above problems, an embodiment of the present invention discloses an audio processing method, including: training a voice enhancement model according to weight information corresponding to the acquired voice frequency band errors in advance; after receiving a mixed voice signal, carrying out voice enhancement on the mixed voice signal according to weight bias parameters corresponding to each voice frequency band in the voice enhancement model to obtain a target voice signal, wherein the weight bias parameters are obtained by training according to the weight information; and outputting according to the target voice signal.
Optionally, the performing voice enhancement on the mixed voice signal according to the weight bias parameters corresponding to each voice frequency band in the voice enhancement model to obtain a target voice signal includes: extracting features of the mixed voice signals to obtain noisy voice data, wherein the noisy voice data comprise at least one voice frequency band data; inputting the noisy speech signal into the speech enhancement model; and carrying out noise reduction processing on each voice frequency band data in the voice data with noise according to the weight bias parameters corresponding to each voice frequency band through the voice enhancement model to obtain a target voice signal corresponding to a target user.
Optionally, training the speech enhancement model in advance according to the weight information corresponding to the obtained speech frequency band errors, including: acquiring weight information corresponding to each preset voice frequency band error according to the received voice signal; and performing model training according to the weight information corresponding to each voice frequency band error and the voice signal to obtain a voice enhancement model.
Optionally, performing model training according to the weight information corresponding to each voice frequency band error and the voice signal to obtain a voice enhancement model, including: adding a noise signal to the voice signal to generate a noisy voice signal; extracting the characteristics of the noisy speech signals to obtain speech characteristics corresponding to the noisy speech data; based on the voice characteristics, model training is carried out by adopting the noisy voice signals, the voice signals and weight information corresponding to each voice frequency band error, so as to obtain a voice enhancement model.
Optionally, the model training using the noisy speech signal, the speech signal, and the weight information corresponding to each speech frequency band error, to obtain a speech enhancement model includes: determining an output estimation signal corresponding to the noisy speech signal; determining an output prediction error corresponding to the output estimation signal according to the voice signal; performing self-adaptive processing on the output prediction error according to the weight information corresponding to the voice frequency band error to obtain a voice enhancement error corresponding to each voice frequency band; determining weight bias parameters corresponding to each voice frequency band according to the voice enhancement errors corresponding to each voice frequency band; and generating a voice enhancement model according to the weight bias parameters corresponding to each voice frequency band.
Optionally, the noise reduction processing is performed on each voice frequency band data in the voice data with noise according to the weight bias parameter corresponding to each voice frequency band, so as to obtain a target voice signal corresponding to the target user, including: determining a target weight bias parameter corresponding to each voice frequency band data in the voice data with noise based on the weight bias parameters corresponding to each voice frequency band; performing noise reduction processing according to the target weight bias parameters for each voice frequency band data in the voice data with noise to obtain noise reduction voice data corresponding to each voice frequency band; and generating a target voice signal corresponding to the target user according to the voice characteristics and the noise reduction voice data.
Optionally, outputting according to the target voice signal includes: performing voice output according to the target voice signal; and/or performing voice recognition on the target voice signal, generating a recognition result, and outputting the recognition result.
The embodiment of the invention also discloses an audio processing device, which comprises:
the model training module is used for training a voice enhancement model in advance according to the weight information corresponding to the acquired voice frequency band errors;
the voice enhancement module is used for carrying out voice enhancement on the mixed voice signal according to the weight bias parameters corresponding to each voice frequency band in the voice enhancement model after receiving the mixed voice signal to obtain a target voice signal;
and the output module is used for outputting according to the target voice signal.
Optionally, the voice enhancement module includes the following sub-modules:
The characteristic extraction sub-module is used for carrying out characteristic extraction on the mixed voice signal to obtain noisy voice data, wherein the noisy voice data comprises at least one voice frequency band data;
a signal input sub-module for inputting the noisy speech signal to the speech enhancement model;
And the noise reduction processing sub-module is used for carrying out noise reduction processing on each voice frequency band data in the voice data with noise according to the weight bias parameters corresponding to each voice frequency band through the voice enhancement model to obtain a target voice signal corresponding to a target user.
Optionally, the model training module includes the following sub-modules:
the weight information acquisition sub-module is used for acquiring weight information corresponding to each preset voice frequency band error according to the received voice signals;
And the model training sub-module is used for carrying out model training according to the weight information corresponding to each voice frequency band error and the voice signal to obtain a voice enhancement model.
Optionally, the model training submodule includes the following units:
a noise adding unit for adding a noise signal to the voice signal to generate a voice signal with noise;
The feature extraction unit is used for carrying out feature extraction on the noisy speech signal to obtain speech features corresponding to the noisy speech data;
and the model training unit is used for carrying out model training by adopting the noisy speech signal, the speech signal and weight information corresponding to each speech frequency band error based on the speech characteristics to obtain a speech enhancement model.
Optionally, the model training unit comprises the following subunits:
an estimated signal determining subunit, configured to determine an output estimated signal corresponding to the noisy speech signal;
A prediction error determining subunit, configured to determine an output prediction error corresponding to the output estimation signal according to the speech signal;
The self-adaptive processing subunit is used for carrying out self-adaptive processing on the output prediction error according to the weight information corresponding to the voice frequency band error to obtain a voice enhancement error corresponding to each voice frequency band;
The weight parameter determining subunit is used for determining weight bias parameters corresponding to each voice frequency band according to the voice enhancement errors corresponding to each voice frequency band;
and the model generation subunit is used for generating a voice enhancement model according to the weight bias parameters corresponding to each voice frequency band.
Optionally, the noise reduction processing submodule includes the following units:
The target weight bias parameter determining unit is used for determining a target weight bias parameter corresponding to each voice frequency band data in the voice data with noise based on the weight bias parameters corresponding to each voice frequency band;
the noise reduction processing unit is used for carrying out noise reduction processing on each voice frequency band data in the voice data with noise according to the target weight bias parameters to obtain noise reduction voice data corresponding to each voice frequency band;
and the target voice signal generation unit is used for generating a target voice signal corresponding to the target user according to the voice characteristics and the noise reduction voice data.
Optionally, the output module includes the following sub-modules:
The voice output sub-module is used for outputting voice according to the target voice signal; and/or
the voice recognition sub-module is used for carrying out voice recognition on the target voice signal, generating a recognition result, and outputting the recognition result.
The embodiment of the invention also discloses a device, which comprises a memory and one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by one or more processors, and the one or more programs comprise instructions for: training a voice enhancement model according to weight information corresponding to the acquired voice frequency band errors in advance; after receiving a mixed voice signal, carrying out voice enhancement on the mixed voice signal according to weight bias parameters corresponding to each voice frequency band in the voice enhancement model to obtain a target voice signal, wherein the weight bias parameters are obtained by training according to the weight information; and outputting according to the target voice signal.
Optionally, the performing voice enhancement on the mixed voice signal according to the weight bias parameters corresponding to each voice frequency band in the voice enhancement model to obtain a target voice signal includes: extracting features of the mixed voice signals to obtain noisy voice data, wherein the noisy voice data comprise at least one voice frequency band data; inputting the noisy speech signal into the speech enhancement model; and carrying out noise reduction processing on each voice frequency band data in the voice data with noise according to the weight bias parameters corresponding to each voice frequency band through the voice enhancement model to obtain a target voice signal corresponding to a target user.
Optionally, training the speech enhancement model in advance according to the weight information corresponding to the obtained speech frequency band errors, including: acquiring weight information corresponding to each preset voice frequency band error according to the received voice signal; and performing model training according to the weight information corresponding to each voice frequency band error and the voice signal to obtain a voice enhancement model.
Optionally, performing model training according to the weight information corresponding to each voice frequency band error and the voice signal to obtain a voice enhancement model, including: adding a noise signal to the voice signal to generate a noisy voice signal; extracting the characteristics of the noisy speech signals to obtain speech characteristics corresponding to the noisy speech data; based on the voice characteristics, model training is carried out by adopting the noisy voice signals, the voice signals and weight information corresponding to each voice frequency band error, so as to obtain a voice enhancement model.
Optionally, the model training using the noisy speech signal, the speech signal, and the weight information corresponding to each speech frequency band error, to obtain a speech enhancement model includes: determining an output estimation signal corresponding to the noisy speech signal; determining an output prediction error corresponding to the output estimation signal according to the voice signal; performing self-adaptive processing on the output prediction error according to the weight information corresponding to the voice frequency band error to obtain a voice enhancement error corresponding to each voice frequency band; determining weight bias parameters corresponding to each voice frequency band according to the voice enhancement errors corresponding to each voice frequency band; and generating a voice enhancement model according to the weight bias parameters corresponding to each voice frequency band.
Optionally, the noise reduction processing is performed on each voice frequency band data in the voice data with noise according to the weight bias parameter corresponding to each voice frequency band, so as to obtain a target voice signal corresponding to the target user, including: determining a target weight bias parameter corresponding to each voice frequency band data in the voice data with noise based on the weight bias parameters corresponding to each voice frequency band; performing noise reduction processing according to the target weight bias parameters for each voice frequency band data in the voice data with noise to obtain noise reduction voice data corresponding to each voice frequency band; and generating a target voice signal corresponding to the target user according to the voice characteristics and the noise reduction voice data.
Optionally, outputting according to the target voice signal includes: performing voice output according to the target voice signal; and/or performing voice recognition on the target voice signal, generating a recognition result, and outputting the recognition result.
The embodiment of the invention also discloses a readable storage medium, which enables the device to execute the audio processing method in one or more of the embodiments of the invention when the instructions in the storage medium are executed by the processor of the device.
The embodiment of the invention has the following advantages:
According to the embodiment of the invention, model training can be carried out according to the weight information corresponding to the errors of the voice frequency bands, so that the trained voice enhancement model contains the weight bias parameters corresponding to the voice frequency bands. After a mixed voice signal is received, voice enhancement can be carried out on the mixed voice signal according to the weight bias parameters corresponding to the voice frequency bands in the voice enhancement model, so that the emphasis of voice enhancement is placed on the voice frequency bands with larger voice energy in the mixed voice signal. This solves the problem that conventional voice enhancement models treat every voice frequency band identically and therefore achieve a poor noise reduction effect, improves the noise reduction effect of the voice enhancement model, and further improves the voice enhancement effect.
Drawings
FIG. 1 is a flow chart of steps of an embodiment of an audio processing method of the present invention;
FIG. 2 is a flowchart of the steps for pre-training a speech enhancement model in an alternative embodiment of the present invention;
FIG. 3 is a flow chart of steps of an alternative embodiment of an audio processing method of the present invention;
FIG. 4 is a block diagram of an embodiment of an audio processing apparatus of the present invention;
FIG. 5 is a block diagram illustrating an apparatus 600 for audio processing according to an exemplary embodiment;
Fig. 6 is a schematic structural view of an apparatus according to an embodiment of the present invention.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
At present, the conventional speech enhancement method generally performs noise reduction processing on the speech signals corresponding to each speech frequency band according to the same weight parameter. Specifically, in the training stage, the traditional voice enhancement method is based on the mean squared error (MSE) criterion: only the errors corresponding to the individual frequency bands of the voice signal are used to determine the mean squared error serving as the objective function of the voice enhancement model, i.e. model training treats the error of every voice frequency band equally. In the voice enhancement stage, the trained voice enhancement model then enhances every voice frequency band in the same way, i.e. the noise reduction degree corresponding to each voice frequency band is the same. This limits the noise reduction effect of the speech enhancement model, i.e. affects the speech enhancement effect.
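As an illustration (not part of the patent text), the conventional objective described above can be sketched as a plain mean squared error over a magnitude spectrum, in which every frequency bin contributes equally:

```python
import numpy as np

def mse_loss(estimate, clean):
    """Conventional speech-enhancement objective: plain mean squared error.

    `estimate` and `clean` are (frames, bins) magnitude spectra. Every
    frequency bin -- and hence every frequency band -- contributes with
    the same weight, so the model denoises all bands to the same degree,
    which is the limitation the text describes.
    """
    estimate = np.asarray(estimate, dtype=float)
    clean = np.asarray(clean, dtype=float)
    return float(np.mean((estimate - clean) ** 2))
```

Because every bin is weighted identically here, there is no way to tell the model that low-frequency bands matter more than noise-dominated high-frequency bands.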
One of the core ideas of the embodiment of the invention is to provide a new audio processing method in which voice enhancement is carried out according to the weight bias parameters corresponding to each voice frequency band in the voice enhancement model. This avoids the problem that treating every voice frequency band identically gives the voice enhancement model a poor noise reduction effect, and improves the voice enhancement effect.
Referring to fig. 1, a flowchart illustrating steps of an embodiment of an audio processing method according to the present invention may specifically include the following steps:
Step 102, training a voice enhancement model according to the weight information corresponding to the acquired voice frequency band errors in advance.
In the embodiment of the invention, model training can be performed in advance according to the weight information corresponding to the error of each voice frequency band, so that the trained voice enhancement model can contain the weight bias parameters corresponding to each voice frequency band. The voice frequency band error can be used for representing the error corresponding to the voice signal of one voice frequency band.
It should be noted that the voice frequency bands may be determined according to a preset sampling frequency. For example, the collected speech signal may be divided into 256 speech frequency bands for processing, weight information corresponding to the 256 speech frequency band errors may be preset based on these bands, and each speech frequency band error may correspond to a weight. Model training may then be performed according to the weight information corresponding to the 256 speech frequency band errors in the model training stage, so that the network parameters of the trained speech enhancement model include weight bias parameters corresponding to the 256 speech frequency bands.
Of course, the voice signal may be divided into other number of corresponding voice frequency bands for processing, for example, the voice signal may be divided into 512 voice frequency bands, etc., which is not limited in number in the embodiment of the present invention.
For ease of understanding, embodiments of the present invention are described below in conjunction with two simple examples.
As an example of the present invention, in the case of dividing a speech signal into 3 speech frequency bands for processing, the first speech frequency band may correspond to low frequencies, i.e. may be used to characterize the low-frequency part of the speech signal; the second voice frequency band may correspond to intermediate frequencies, i.e. may be used to characterize the intermediate-frequency part; and the third speech band may correspond to high frequencies, i.e. may be used to characterize the high-frequency part. For example, with a sampling frequency of 16000 hertz, the Nyquist theorem allows signals of at most 0-8000 hertz to be represented; the 0-2000 hertz portion can be treated as the low-frequency voice signal, the 2000-6000 hertz portion as the intermediate-frequency voice signal, and the 6000-8000 hertz portion as the high-frequency voice signal. That is, the first voice frequency band may be 0-2000 hertz, the second voice frequency band 2000-6000 hertz, and the third voice frequency band 6000-8000 hertz.
Similarly, in the case of dividing the speech signal into 4 speech frequency bands for processing, the first speech frequency band may correspond to a low frequency, i.e. may be used to characterize the low frequency speech signal, e.g. may be used to characterize the 0-2000 hz speech signal; the second voice frequency band can be used for corresponding middle and low frequencies, namely can be used for representing voice signals of the middle and low frequencies, such as can be used for representing voice signals of 2000-4000 hertz; the third voice frequency band can correspond to the middle and high frequencies, namely can be used for representing voice signals of the middle and high frequencies, such as 4000-6000 Hz voice signals; the fourth speech band may correspond to high frequencies, i.e. may be used to characterize high frequency speech signals, such as may be used to characterize 6000-8000 hz speech signals, etc.
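The 4-band division above can be made concrete with a small helper that maps a frequency to its band index; the band edges and names below are taken from the example (16 kHz sampling rate, 8 kHz Nyquist limit) and are illustrative rather than quoted from the patent:

```python
import numpy as np

# Band edges from the 4-band example: low, mid-low, mid-high, high.
BAND_EDGES_HZ = [0, 2000, 4000, 6000, 8000]

def band_index(freq_hz):
    """Return which of the 4 example bands (0 = low .. 3 = high) a
    frequency in hertz falls into."""
    idx = int(np.searchsorted(BAND_EDGES_HZ, freq_hz, side="right")) - 1
    # Clamp so the Nyquist frequency itself lands in the top band.
    return min(idx, len(BAND_EDGES_HZ) - 2)
```

For instance, a 3000 Hz component belongs to the second (mid-low) band, and a 7000 Hz component to the fourth (high) band.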
Specifically, in the training stage of the speech enhancement model, weight information corresponding to each preset speech frequency band error can be obtained, so that model training can be performed according to the weight information corresponding to each speech frequency band error and the received speech signal, and thus the speech enhancement model can be trained. The voice enhancement model can have weight bias parameters corresponding to each voice frequency band, so that the voice data corresponding to each voice frequency band can be subjected to noise reduction according to the weight bias parameters corresponding to each voice frequency band, and the problem of poor noise reduction effect caused by consistent noise reduction degree corresponding to each voice frequency band is solved.
As an example of the present invention, in combination with the above example, weight information corresponding to the 4 preset voice frequency band errors may be obtained, and model training may then be performed on the received voice signal for a preset number of training iterations based on the weight information corresponding to the 4 frequency band errors. The trained voice enhancement model may then include weight bias parameters corresponding to the 4 voice frequency bands, so as to focus voice enhancement on the voice frequency bands containing information important for recognition. For example, the weight corresponding to the first voice frequency band (0-2000 hertz) may be 2, the weights corresponding to the second voice frequency band (2000-4000 hertz) and the third voice frequency band (4000-6000 hertz) may be 1, and the weight corresponding to the fourth voice frequency band (6000-8000 hertz) may be 0.5, so that the voice enhancement model focuses its attention on noise reduction of the low-frequency portion with large voice energy and pays less attention to the high-frequency portion with large noise energy.
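A minimal sketch of such a band-weighted error (the patent does not specify this exact formula; the weights below are the ones from the example) scales each spectral bin's squared error by the weight of the band it belongs to:

```python
import numpy as np

# Per-band error weights from the example: emphasise the low-frequency
# band, de-emphasise the high-frequency band.
BAND_WEIGHTS = [2.0, 1.0, 1.0, 0.5]

def weighted_band_mse(estimate, clean, band_of_bin):
    """Band-weighted MSE over a magnitude spectrum.

    `band_of_bin` maps each spectral bin (column index) to a band index
    0-3; each bin's squared error is scaled by its band's weight before
    averaging, so errors in heavily weighted bands dominate the
    training objective.
    """
    w = np.asarray(BAND_WEIGHTS)[np.asarray(band_of_bin)]
    err = np.asarray(estimate, dtype=float) - np.asarray(clean, dtype=float)
    return float(np.mean(w * err ** 2))
```

With identical per-bin errors, a low-band error now costs four times as much as a high-band error (weight 2 versus 0.5), which is precisely the emphasis the example describes.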
Step 104, after receiving the mixed voice signal, performing voice enhancement on the mixed voice signal according to the weight bias parameters corresponding to each voice frequency band in the voice enhancement model to obtain a target voice signal.
In the embodiment of the invention, the weight bias parameters are obtained by training according to the weight information. In a specific implementation, at least two voice frequency bands may be determined according to the sampling frequency corresponding to the voice signal; for example, 256 voice frequency bands may be determined based on the sampling frequency, or 512 or 1024 voice frequency bands may be determined. In the voice enhancement stage, after receiving an input mixed voice signal, noise reduction processing can be performed on each voice frequency band signal contained in the mixed voice signal according to the weight bias parameters corresponding to each voice frequency band in the voice enhancement model, so as to remove the interference noise in each voice frequency band signal and obtain the target voice signal after voice enhancement. The mixed voice signal may include the voice signal of a target user, a noise signal to be removed, and the like. The voice signal of the target user may refer to the clean voice signal of the target user speaking, such as the time domain signal corresponding to the speech of the target speaker; the noise signal may refer to a signal corresponding to interference noise, for example a time domain signal corresponding to interfering speech spoken by other speakers, which is not limited in the embodiment of the present invention.
For example, when the frequency of the received mixed voice signal is between 1000 hz and 7000 hz, in combination with the above example, it may be determined that the mixed voice signal includes 3 voice frequency band signals of high frequency, intermediate frequency and low frequency, and then noise reduction processing may be performed on the low frequency, intermediate frequency and high frequency voice signals in the mixed voice signal according to weight bias parameters trained in advance in the voice enhancement model, so as to obtain the target voice signal. The target voice signal may include a clean voice signal after noise reduction in each voice frequency band, and may be used to represent a clean voice signal of a target user, for example, may refer to a clean voice signal corresponding to a voice of a target speaker, and so on.
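As a rough illustration of the per-band behaviour at inference time (a stand-in only: in the patent the band-dependent behaviour comes from trained weight bias parameters inside the network, not from fixed gains), a band-dependent suppression gain could be applied to a noisy magnitude spectrum like this:

```python
import numpy as np

def denoise_by_band(noisy_spec, band_gains, band_of_bin):
    """Apply a band-dependent suppression gain to a noisy magnitude
    spectrum: bins in strongly enhanced bands keep more of their
    magnitude, while bins in noise-dominated bands are attenuated.
    `band_gains` plays the role the trained per-band parameters play
    in the patent's model.
    """
    gains = np.asarray(band_gains, dtype=float)[np.asarray(band_of_bin)]
    return np.asarray(noisy_spec, dtype=float) * gains
```

A real system would derive these gains (or a full mask) from the enhancement network frame by frame; the point here is only that different bands receive different degrees of noise reduction.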
And step 106, outputting according to the target voice signal.
After the target voice signal is obtained, the embodiment of the invention can output according to the target voice signal. For example, speech output may be performed according to the target speech signal to output clean speech spoken by the user; for another example, voice recognition may be performed according to the target voice signal to identify a text corresponding to the clean voice spoken by the user, then a recognition result may be generated based on the identified text, and the recognition result may be output, that is, the target voice signal may be converted into corresponding text information through voice recognition, and then output according to the text information, for example, displaying the text on a screen of the device, displaying a search result corresponding to the text, and so on.
In summary, the embodiment of the invention can perform model training according to the weight information corresponding to the errors of each voice frequency band, so that the trained voice enhancement model contains the weight bias parameters corresponding to each voice frequency band. After a mixed voice signal is received, voice enhancement can be performed on the mixed voice signal according to these weight bias parameters, so that the emphasis of voice enhancement is placed on the voice frequency bands with larger voice energy in the mixed voice signal. This solves the problem of poor noise reduction caused by a conventional voice enhancement model treating every voice frequency band identically, improves the noise reduction effect of the voice enhancement model, and thereby improves the voice enhancement effect.
In a specific implementation, weight information corresponding to one or more voice frequency band errors can be preset based on an attention mechanism, and a deep learning technology can be used for training a voice enhancement model according to the preset weight information, so that the trained voice enhancement model can have weight bias parameters corresponding to each voice frequency band. Optionally, training the speech enhancement model in advance according to the weight information corresponding to the obtained speech frequency band errors may specifically include: acquiring weight information corresponding to each preset voice frequency band error according to the received voice signal; and performing model training according to the weight information corresponding to each voice frequency band error and the voice signal to obtain a voice enhancement model.
Referring to FIG. 2, a flowchart illustrating the steps of pre-training a speech enhancement model in an alternative embodiment of the present invention may specifically include the steps of:
step 202, acquiring weight information corresponding to each preset voice frequency band error according to the received voice signal.
Specifically, in the training phase of the speech enhancement model, the received voice signals may be used as training data with which to perform model training. The received voice signal may be a clean voice signal, for example, a currently input clean voice signal received in real time during voice input, or a prerecorded time domain signal of a section of clean voice, which is not limited in the embodiment of the present invention.
After receiving the voice signal, the embodiment of the invention can acquire the preset weight information corresponding to each voice frequency band error according to the received voice signal, so that the subsequent model training process can determine the voice enhancement error corresponding to each voice frequency band of the voice signal according to the weight information. The voice enhancement error may characterize the error between the estimated voice signal predicted by the voice enhancement model and the voice signal actually required to be output. The preset weight information corresponding to each voice frequency band error may include the weight parameters corresponding to each voice frequency band error, that is, the preset weight parameters corresponding to the errors of all voice frequency bands. The weight parameters corresponding to the voice frequency band errors may be the same or different. For example, the weight parameters corresponding to the preset 4 voice frequency band errors in the above example may all be 1; for another example, the preset weight parameter corresponding to the first voice frequency band error may be 2, the weight parameters corresponding to the second and third voice frequency band errors may be 1, the weight parameter corresponding to the fourth voice frequency band error may be 0.5, and so on, which is not limited in the embodiment of the present invention.
And 204, performing model training according to the weight information corresponding to each voice frequency band error and the voice signal to obtain a voice enhancement model.
In a specific implementation, a noise signal can be added to the received voice signal to generate a noisy voice signal, and feature extraction can then be performed on the noisy voice signal to obtain the voice features corresponding to it. Then, for the obtained voice features, model training can be carried out using the generated noisy voice signal, the received voice signal and the weight information corresponding to each voice frequency band error, so as to train the voice enhancement model.
In an optional embodiment of the present invention, performing model training according to the weight information corresponding to each voice frequency band error and the voice signal to obtain a voice enhancement model may include: adding a noise signal to the voice signal to generate a noisy voice signal; extracting the characteristics of the noisy speech signals to obtain speech characteristics corresponding to the noisy speech data; based on the voice characteristics, model training is carried out by adopting the noisy voice signals, the voice signals and weight information corresponding to each voice frequency band error, so as to obtain a voice enhancement model.
Specifically, in the training stage, the received clean voice signal can be subjected to noise adding, i.e. a noise signal can be added to the received voice signal to generate a noisy voice signal. The noise signals may include simulated noise signals, pre-collected noise signals, and the like. The simulated noise signal may be used to characterize noise synthesized in advance by speech synthesis techniques; the pre-collected noise signal may be used to characterize pre-collected real noise, such as a pre-recorded noise signal, etc.
And then, carrying out feature extraction by adopting a noisy voice signal added with a noise signal to obtain corresponding voice features, so that model training can be carried out by combining the voice features later to generate a voice enhancement model. The voice features may be used to characterize voice voiceprint features, and may specifically include time domain voice features and/or frequency domain voice features, which are not limited in this embodiment of the present invention. It should be noted that, the time domain speech feature may be used to characterize the speech feature in the time domain, and the frequency domain speech feature may be used to characterize the speech feature in the frequency domain.
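The noise-adding and feature-extraction steps can be sketched as follows (a minimal illustration: SNR-controlled mixing and log-power spectral features are common choices for this kind of training data, but the embodiment does not mandate them):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Add a noise signal to a clean voice signal at a target
    signal-to-noise ratio (in dB) to produce a noisy voice signal."""
    noise = noise[: len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

def log_power_features(signal, n_fft=512, hop=256):
    """Frame the signal, window each frame, and return log-power spectra
    as frequency-domain voice features."""
    window = np.hanning(n_fft)
    frames = [signal[i : i + n_fft] * window
              for i in range(0, len(signal) - n_fft + 1, hop)]
    spectra = np.abs(np.fft.rfft(np.stack(frames), axis=1)) ** 2
    return np.log(spectra + 1e-12)
```

The noisy signal and its features then serve as model input, with the original clean signal as the training target.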
In an optional implementation manner, model training is performed by using the noisy speech signal, the speech signal, and weight information corresponding to the error of each speech frequency band, so as to obtain a speech enhancement model, which may include: determining an output estimation signal corresponding to the noisy speech signal; determining an output prediction error corresponding to the output estimation signal according to the voice signal; performing self-adaptive processing on the output prediction error according to the weight information corresponding to the voice frequency band error to obtain a voice enhancement error corresponding to each voice frequency band; determining weight bias parameters corresponding to each voice frequency band according to the voice enhancement errors corresponding to each voice frequency band; and generating a voice enhancement model according to the weight bias parameters corresponding to each voice frequency band.
In a specific implementation, the noisy voice signal can be predicted based on the weight bias parameters determined by training, so that the neural network obtains an output estimation signal corresponding to the noisy voice signal. The output estimation signal may then be compared with the received voice signal, so that the output prediction error corresponding to the output estimation signal can be determined from the comparison result. The output prediction error may comprise an error corresponding to each voice frequency band. For example, in the case of representing the received voice signal y by the labeling vectors corresponding to 4 voice frequency bands, the noisy voice signal corresponding to the voice signal can be predicted according to the weight bias parameters determined by training to obtain the corresponding output estimation signal ŷ. This output estimation signal ŷ can then be compared with the voice signal y to obtain the output prediction error corresponding to ŷ, and the output prediction error may comprise 4 voice frequency band errors.
After determining the output prediction error, the adaptive processing can be performed on each voice frequency band error contained in the output prediction error based on the weight information corresponding to each voice frequency band error, so as to obtain a voice enhancement error corresponding to each voice frequency band. For example, in the case of fixing the weight information corresponding to each voice frequency band error, the voice frequency band error may be adaptively weighted based on the weight information corresponding to each voice frequency band error to obtain a voice enhancement error corresponding to each voice frequency band, e.g., a multiplication operation may be performed by using a mean square error corresponding to each voice frequency band error and a weight matrix corresponding to each voice frequency band error to obtain a voice enhancement error. The mean square error corresponding to the voice frequency band can be determined according to the voice frequency band error, for example, can be the square of the voice frequency band error; the weight matrix may represent weight information corresponding to each voice frequency band error, and may specifically include weight parameters corresponding to each voice frequency band error, for example, when the weight matrix is a square matrix, the weight parameters corresponding to each voice frequency band error may be recorded by using diagonal elements of the square matrix.
As an example of the present invention, the weight parameter corresponding to the first voice frequency band error may be recorded in the element of the first row and the first column of the weight matrix, the weight parameter corresponding to the second voice frequency band error may be recorded in the element of the second row and the second column of the weight matrix, the weight parameter corresponding to the third voice frequency band error may be recorded in the element of the third row and the third column of the weight matrix … …, and so on, the weight parameter corresponding to the nth voice frequency band error may be recorded in the element of the nth row and the nth column of the weight matrix, and N is an integer.
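The weighting described above can be sketched as follows, with the weight parameters recorded on the diagonal of a weight matrix (for simplicity each band's error is a single value here; a real model would aggregate over all bins in each band):

```python
import numpy as np

def weighted_band_loss(y_hat, y, band_weights):
    """Multiply the mean square error of each voice frequency band by its
    weight parameter, stored on the diagonal of a weight matrix, and sum."""
    mse_per_band = (np.asarray(y_hat, dtype=float)
                    - np.asarray(y, dtype=float)) ** 2
    W = np.diag(np.asarray(band_weights, dtype=float))
    return float(mse_per_band @ W @ np.ones(len(band_weights)))

# With the weights 2, 1, 1, 0.5 from the earlier example and unit errors
# in every band, the loss is 2 + 1 + 1 + 0.5 = 4.5.
loss = weighted_band_loss([1.0, 1.0, 1.0, 1.0],
                          [0.0, 0.0, 0.0, 0.0],
                          [2.0, 1.0, 1.0, 0.5])
```

Bands with larger diagonal entries contribute more to the loss, which is what biases training toward the bands deemed important.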
After determining the voice enhancement errors corresponding to the voice frequency bands, it can be judged whether the voice enhancement error corresponding to each voice frequency band exceeds a preset error range, so as to determine whether the mapping relation between the noisy voice signal and the voice signal has been trained. If the voice enhancement error exceeds the preset error range, the weight bias parameters of the neural network can be updated based on a preset stochastic gradient descent (SGD) algorithm. Then, according to the updated weight bias parameters, the output estimation signal corresponding to the noisy voice signal is updated, and adaptive processing is performed on the output prediction error corresponding to the updated output estimation signal, until the voice enhancement error corresponding to each voice frequency band is within the preset error range. If the voice enhancement error is within the preset error range, that is, when the predicted output estimation signal ŷ can characterize the received voice signal y, it can be determined that the mapping relation between the noisy voice signal and the voice signal has been trained, and a voice enhancement model based on the minimum mean square error criterion of a deep neural network can be generated from the weight bias parameters of the neural network.
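A toy version of this training loop, using a linear mapping in place of the deep neural network (the dimensions, learning rate and error tolerance are all illustrative assumptions), might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear "enhancement" mapping trained with stochastic gradient
# descent (SGD) until the band-weighted error enters the preset range.
n_bands, lr, tol = 4, 0.005, 1e-3
band_weights = np.array([2.0, 1.0, 1.0, 0.5])  # per-band error weights
w = rng.normal(size=(n_bands, n_bands))        # weight parameters
b = np.zeros(n_bands)                          # bias parameters
x = rng.normal(size=n_bands)                   # noisy-speech features (one frame)
y = rng.normal(size=n_bands)                   # clean-speech target

for step in range(50000):
    y_hat = x @ w + b                    # output estimation signal
    err = y_hat - y                      # output prediction error per band
    loss = float(np.sum(band_weights * err ** 2))
    if loss < tol:                       # within the preset error range
        break
    grad = 2.0 * band_weights * err      # gradient of the weighted loss
    w -= lr * np.outer(x, grad)          # SGD update of the weights
    b -= lr * grad                       # SGD update of the biases
```

Higher-weighted bands contribute larger gradients, so the updates concentrate on reducing their errors first, mirroring the attention-style bias described in the text.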
Therefore, in the voice enhancement stage, noise reduction processing can be carried out on the received mixed voice signal according to the weight bias parameters corresponding to each voice frequency band in the voice enhancement model, avoiding the problem of poor noise reduction caused by a voice enhancement model that treats every voice frequency band identically.
In an optional embodiment of the present invention, the performing, according to the weight bias parameters corresponding to each voice frequency band in the voice enhancement model, voice enhancement on the mixed voice signal to obtain a target voice signal may specifically include: extracting features of the mixed voice signals to obtain noisy voice data, wherein the noisy voice data comprise at least one voice frequency band data; inputting the noisy speech signal into the speech enhancement model; and carrying out noise reduction processing on each voice frequency band data in the voice data with noise according to the weight bias parameters corresponding to each voice frequency band through the voice enhancement model to obtain a target voice signal corresponding to a target user. Wherein the mixed speech signal may comprise a noise signal and a speech signal of said target user.
Referring to fig. 3, a flowchart illustrating steps of an alternative embodiment of an audio processing method of the present invention may specifically include the steps of:
Step 302, after receiving the mixed voice signal, extracting features of the mixed voice signal to obtain noisy voice data.
In the embodiment of the invention, the mixed voice signal can contain the voice signal of the target user to be reserved and the noise signal to be removed, for example, the mixed voice signal can comprise a clean voice signal corresponding to the speaking of the target user, an interference voice signal corresponding to the speaking of other users and the like.
Specifically, after the mixed voice signal is received, the mixed voice signal can be determined to be a signal needing to be subjected to voice enhancement processing, and then feature extraction can be performed on the mixed voice signal to obtain voice data with noise and voice features corresponding to the voice data with noise. The noisy speech data may refer to speech data with noise after the speech feature is extracted, and may include at least one speech frequency band data, that is, may include speech data of one or more speech frequency bands, for example, may include noise data to be removed, target speech data to be reserved, and the like.
Step 304, inputting the noisy speech signal into a speech enhancement model.
And 306, carrying out noise reduction processing on each voice frequency band data in the voice data with noise according to the weight bias parameters corresponding to each voice frequency band through a voice enhancement model to obtain a target voice signal corresponding to a target user.
After feature extraction, the embodiment of the invention can input the noisy voice data into the pre-trained voice enhancement model based on the extracted voice features, so as to remove the noise data contained in the noisy voice data through the voice enhancement model. Specifically, in the voice enhancement model, noise reduction processing may be performed on each voice frequency band data included in the noisy voice data according to the weight bias parameter corresponding to each voice frequency band, so as to remove the noise data of each frequency band in the noisy voice data while retaining the target voice data contained in it; a target voice signal corresponding to the target user may then be generated based on the retained target voice data.
In an optional embodiment of the present invention, noise reduction processing is performed on each voice frequency band data in the voice data with noise according to a weight bias parameter corresponding to each voice frequency band, so as to obtain a target voice signal corresponding to the target user, which may include: determining a target weight bias parameter corresponding to each voice frequency band data in the voice data with noise based on the weight bias parameters corresponding to each voice frequency band; performing noise reduction processing according to the target weight bias parameters for each voice frequency band data in the voice data with noise to obtain noise reduction voice data corresponding to each voice frequency band; and generating a target voice signal corresponding to the target user according to the voice characteristics and the noise reduction voice data.
In a specific implementation, the noisy voice data input into the voice enhancement model may be segmented according to the weight bias parameters corresponding to each voice frequency band in the voice enhancement model, and each voice frequency band data included in the noisy voice data may be determined. The target weight bias parameter corresponding to each voice frequency band data may then be determined based on the voice frequency band to which that data belongs; for example, the weight bias parameter corresponding to the voice frequency band to which each voice frequency band data belongs may be taken as its target weight bias parameter. Noise reduction processing can thus be carried out on each voice frequency band data in the noisy voice data according to its target weight bias parameter, that is, according to different weight bias parameters for different bands, to obtain the noise reduction voice data corresponding to each voice frequency band. This avoids the problem of limited noise reduction caused by applying the same weight parameter to all voice frequency band data in the noisy voice data.
After the noise reduction voice data corresponding to each voice frequency band is obtained, a target voice signal corresponding to a target user can be generated by adopting the noise reduction voice data corresponding to each voice frequency band based on voice characteristics.
As an example of the present invention, with reference to attention model training, the voice enhancement model may be trained according to the weight information corresponding to each voice frequency band error, so that the trained voice enhancement model may include at least two weight bias parameters. The noise reduction processing of each voice frequency band can correspond to one weight bias parameter, so that the voice enhancement model can, according to the weight bias parameters, focus on the frequency bands containing voice data that is important to the voice enhancement task. For example, when the weight bias parameters corresponding to 3 voice frequency bands of the voice enhancement model are 0.5, 2 and 3 respectively, the voice enhancement model can perform noise reduction processing on the noisy voice data belonging to the first voice frequency band according to the weight bias parameter 0.5 corresponding to the first voice frequency band to obtain noise reduction voice data A corresponding to the first voice frequency band; it can perform noise reduction processing on the noisy voice data belonging to the second voice frequency band according to the weight bias parameter 2 corresponding to the second voice frequency band to obtain noise reduction voice data B corresponding to the second voice frequency band; and it can perform noise reduction processing on the noisy voice data belonging to the third voice frequency band according to the weight bias parameter 3 corresponding to the third voice frequency band to obtain noise reduction voice data C corresponding to the third voice frequency band. Subsequently, the noise reduction voice data A, B and C may be synthesized based on the voice features to generate the voice-enhanced target voice signal.
As can be seen, the voice enhancement model in this example may perform noise reduction processing on the noisy voice data of the 3 voice frequency bands according to their respective weight bias parameters: it can focus on the third voice frequency band, whose weight bias parameter is 3, and pay less attention to the first voice frequency band, whose weight bias parameter is 0.5, thereby improving the voice noise reduction effect. The third voice frequency band may be a frequency band containing information very important for voice recognition, such as a low-frequency part with larger voice energy; the first voice frequency band may be a band containing only part of the information useful for voice recognition, such as a high-frequency part where noise energy is large, etc.
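As a rough illustration of per-band noise reduction with different weight bias parameters (the spectral-gating rule below is an invented stand-in for the trained model, used only to show each band being processed with its own parameter):

```python
import numpy as np

def enhance_by_band(noisy_bands, band_params):
    """Apply a simple noise gate per band, keeping more detail in bands
    with larger weight bias parameters (hypothetical gating rule)."""
    enhanced = []
    for band, p in zip(noisy_bands, band_params):
        floor = np.mean(band) / (1.0 + p)   # higher p -> lower noise floor
        enhanced.append(np.maximum(band - floor, 0.0))
    return enhanced

# Three bands with weight bias parameters 0.5, 2 and 3, as in the example.
bands = [np.abs(np.random.randn(32)),
         np.abs(np.random.randn(96)),
         np.abs(np.random.randn(96))]
denoised = enhance_by_band(bands, [0.5, 2.0, 3.0])
```

The per-band outputs (the A, B and C of the example) would then be synthesized back into a single enhanced signal.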
Step 308, outputting according to the target voice signal.
In an alternative embodiment, outputting according to the target voice signal may include: and outputting the voice according to the target voice signal.
Specifically, the embodiment of the invention can be applied to products for voice conversation in noisy environments, such as a telephone watch in a voice call scene, so that the parties to the call hear only the clean voice of the main speaker they are concerned with. For example, when a parent uses a telephone watch to call a child taking part in an activity, the audio processing method provided by the embodiment of the invention can enable the parent to hear only the clear voice of the child, reducing the influence of other children speaking, that is, the influence of noise interference.
Of course, the embodiment of the present invention may also be applied to other scenarios, such as a speech input scenario, a speech recognition scenario, etc., which is not limited by the embodiment of the present invention.
In another alternative embodiment, outputting according to the target voice signal may include: performing voice recognition on the target voice signal to generate a recognition result; and outputting the recognition result. Specifically, after the voice enhancement model outputs the voice-enhanced target voice signal, the target voice signal, i.e. the clean voice of the target speaker, can be used for voice recognition to recognize what the target speaker said. For example, when the target voice output by the voice enhancement model is "Hello everyone, my name is Li XX, nice to meet you all," voice recognition can be performed on that target voice. Output can then be performed according to the recognition result, such as outputting the text "Hello everyone, my name is Li XX, nice to meet you all" corresponding to the recognized voice, a personal profile of "Li XX," and so on.
It should be noted that, for simplicity of description, the method embodiments are shown as a series of acts, but it should be understood by those skilled in the art that the embodiments are not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred embodiments, and that the acts are not necessarily required by the embodiments of the invention.
Referring to fig. 4, a block diagram of an embodiment of an audio processing apparatus according to the present invention is shown, and may specifically include the following modules:
The model training module 402 is configured to train a speech enhancement model in advance according to weight information corresponding to the obtained speech frequency band errors;
the voice enhancement module 404 is configured to, after receiving the mixed voice signal, perform voice enhancement on the mixed voice signal according to weight bias parameters corresponding to each voice frequency band in the voice enhancement model, so as to obtain a target voice signal, where the weight bias parameters are obtained by training according to the weight information;
and an output module 406, configured to output according to the target voice signal.
In an alternative embodiment of the present invention, the speech enhancement module 404 may include the following sub-modules:
The characteristic extraction sub-module is used for carrying out characteristic extraction on the mixed voice signal to obtain noisy voice data, wherein the noisy voice data comprises at least one voice frequency band data;
a signal input sub-module for inputting the noisy speech signal to the speech enhancement model;
And the noise reduction processing sub-module is used for carrying out noise reduction processing on each voice frequency band data in the voice data with noise according to the weight bias parameters corresponding to each voice frequency band through the voice enhancement model to obtain a target voice signal corresponding to a target user.
In an alternative embodiment of the present invention, the model training module 402 may include the following sub-modules:
the weight information acquisition sub-module is used for acquiring weight information corresponding to each preset voice frequency band error according to the received voice signals;
And the model training sub-module is used for carrying out model training according to the weight information corresponding to each voice frequency band error and the voice signal to obtain a voice enhancement model.
In an alternative embodiment of the present invention, the model training sub-module may include the following units:
a noise adding unit for adding a noise signal to the voice signal to generate a voice signal with noise;
The feature extraction unit is used for carrying out feature extraction on the noisy speech signal to obtain speech features corresponding to the noisy speech data;
and the model training unit is used for carrying out model training by adopting the noisy speech signal, the speech signal and weight information corresponding to each speech frequency band error based on the speech characteristics to obtain a speech enhancement model.
In the embodiment of the present invention, optionally, the model training unit may specifically include the following sub-units:
an estimated signal determining subunit, configured to determine an output estimated signal corresponding to the noisy speech signal;
A prediction error determining subunit, configured to determine an output prediction error corresponding to the output estimation signal according to the speech signal;
The self-adaptive processing subunit is used for carrying out self-adaptive processing on the output prediction error according to the weight information corresponding to the voice frequency band error to obtain a voice enhancement error corresponding to each voice frequency band;
The weight parameter determining subunit is used for determining weight bias parameters corresponding to each voice frequency band according to the voice enhancement errors corresponding to each voice frequency band;
and the model generation subunit is used for generating a voice enhancement model according to the weight bias parameters corresponding to each voice frequency band.
In an alternative embodiment of the present invention, the noise reduction processing sub-module may include the following units:
The target weight bias parameter determining unit is used for determining a target weight bias parameter corresponding to each voice frequency band data in the voice data with noise based on the weight bias parameters corresponding to each voice frequency band;
the noise reduction processing unit is used for carrying out noise reduction processing on each voice frequency band data in the voice data with noise according to the target weight bias parameters to obtain noise reduction voice data corresponding to each voice frequency band;
and the target voice signal generation unit is used for generating a target voice signal corresponding to the target user according to the voice characteristics and the noise reduction voice data.
In an alternative embodiment of the present invention, the output module 406 may include the following sub-modules:
The voice output sub-module is used for outputting voice according to the target voice signal; and/or the number of the groups of groups,
The voice recognition sub-module is used for carrying out voice recognition on the target voice signal and generating a recognition result; and outputting the identification result.
For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.
Fig. 5 is a block diagram illustrating an apparatus 500 for audio processing according to an exemplary embodiment. For example, device 500 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, server, or the like.
Referring to fig. 5, device 500 may include one or more of the following components: a processing component 502, a memory 504, a power supply component 506, a multimedia component 508, an audio component 510, an input/output (I/O) interface 512, a sensor component 514, and a communication component 516.
The processing component 502 generally controls overall operation of the device 500, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 502 may include one or more processors 520 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 502 can include one or more modules that facilitate interactions between the processing component 502 and other components. For example, the processing component 502 can include a multimedia module to facilitate interaction between the multimedia component 508 and the processing component 502.
Memory 504 is configured to store various types of data to support operations at device 500. Examples of such data include instructions for any application or method operating on device 500, contact data, phonebook data, messages, pictures, video, and the like. The memory 504 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power supply component 506 provides power to the various components of the device 500. Power supply components 506 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for device 500.
The multimedia component 508 includes a screen that provides an output interface between the device 500 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action, but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 508 includes a front-facing camera and/or a rear-facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 500 is in an operating mode, such as a shooting mode or a video mode. Each of the front-facing and rear-facing cameras may be a fixed optical lens system or have focusing and optical zoom capability.
The audio component 510 is configured to output and/or input audio signals. For example, the audio component 510 includes a Microphone (MIC) configured to receive external audio signals when the device 500 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 504 or transmitted via the communication component 516. In some embodiments, the audio component 510 further comprises a speaker for outputting audio signals.
The I/O interface 512 provides an interface between the processing component 502 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly 514 includes one or more sensors for providing status assessments of various aspects of the device 500. For example, the sensor assembly 514 may detect the on/off state of the device 500 and the relative positioning of components, such as the display and keypad of the device 500; the sensor assembly 514 may also detect a change in position of the device 500 or a component of the device 500, the presence or absence of user contact with the device 500, the orientation or acceleration/deceleration of the device 500, and a change in temperature of the device 500. The sensor assembly 514 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 514 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 514 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 516 is configured to facilitate wired or wireless communication between the device 500 and other devices. The device 500 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 516 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 516 further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 500 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.
In an exemplary embodiment, a non-transitory computer-readable storage medium is also provided, such as the memory 504 including instructions executable by the processor 520 of the device 500 to perform the above-described method. For example, the non-transitory computer-readable storage medium may be a ROM, a random-access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
A non-transitory computer-readable storage medium stores instructions that, when executed by a processor of a device, cause the device to perform an audio processing method, the method comprising: training a voice enhancement model in advance according to weight information corresponding to acquired voice frequency band errors; after receiving a mixed voice signal, performing voice enhancement on the mixed voice signal according to weight bias parameters corresponding to each voice frequency band in the voice enhancement model to obtain a target voice signal, wherein the weight bias parameters are obtained by training according to the weight information; and outputting according to the target voice signal.
Fig. 6 is a schematic structural diagram of a device in an embodiment of the present invention. The device 600 may vary considerably in configuration or performance and may include one or more central processing units (CPUs) 622 (e.g., one or more processors), memory 632, and one or more storage media 630 (e.g., one or more mass storage devices) storing applications 642 or data 644. The memory 632 and the storage media 630 may be transitory or persistent storage. The programs stored on the storage media 630 may include one or more modules (not shown), each of which may include a series of instruction operations for the device. Still further, the central processing unit 622 may be configured to communicate with the storage media 630 and execute, on the device 600, the series of instruction operations stored in the storage media 630.
The device 600 may also include one or more power supplies 626, one or more wired or wireless network interfaces 650, one or more input/output interfaces 658, one or more keyboards 656, and/or one or more operating systems 641, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
In an exemplary embodiment, an apparatus is provided that includes one or more programs configured to be executed by one or more processors, the one or more programs including instructions for: training a voice enhancement model in advance according to weight information corresponding to acquired voice frequency band errors; after receiving a mixed voice signal, performing voice enhancement on the mixed voice signal according to weight bias parameters corresponding to each voice frequency band in the voice enhancement model to obtain a target voice signal, wherein the weight bias parameters are obtained by training according to the weight information; and outputting according to the target voice signal.
Optionally, performing voice enhancement on the mixed voice signal according to the weight bias parameters corresponding to each voice frequency band in the voice enhancement model to obtain a target voice signal includes: extracting features of the mixed voice signal to obtain noisy voice data, wherein the noisy voice data includes at least one piece of voice frequency band data; inputting the noisy voice signal into the voice enhancement model; and performing, through the voice enhancement model, noise reduction processing on each piece of voice frequency band data in the noisy voice data according to the weight bias parameters corresponding to each voice frequency band to obtain a target voice signal corresponding to a target user.
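The patent describes this enhancement step only functionally. As a rough, hypothetical sketch of what applying per-band parameters to noisy band features could look like, one might treat the weight bias parameters as multiplicative per-band gains over averaged FFT magnitudes. The function names, the frame size, and the band-averaging scheme here are all illustrative assumptions, not the claimed implementation:

```python
import numpy as np

def band_magnitudes(frame_spec, n_bands):
    """Average the FFT magnitude bins of one frame into coarse frequency bands."""
    return np.array([b.mean() for b in np.array_split(np.abs(frame_spec), n_bands)])

def enhance_frame(mixed_frame, weight_bias, n_fft=512):
    """One inference step on a single frame: extract the frame's band
    features, then scale each band by its weight bias parameter to
    estimate the target (denoised) band energies."""
    spec = np.fft.rfft(mixed_frame * np.hanning(n_fft))
    noisy_bands = band_magnitudes(spec, len(weight_bias))  # noisy band feature
    return noisy_bands * weight_bias                       # per-band denoising
```

In a real system the per-band scaling would be replaced by whatever model the training step produced; this sketch only shows the shape of the data flow (frame → band features → per-band processing).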
Optionally, training the voice enhancement model in advance according to the weight information corresponding to the acquired voice frequency band errors includes: acquiring weight information corresponding to each preset voice frequency band error according to a received voice signal; and performing model training according to the weight information corresponding to each voice frequency band error and the voice signal to obtain the voice enhancement model.
Optionally, performing model training according to the weight information corresponding to each voice frequency band error and the voice signal to obtain a voice enhancement model includes: adding a noise signal to the voice signal to generate a noisy voice signal; performing feature extraction on the noisy voice signal to obtain voice features corresponding to the noisy voice data; and, based on the voice features, performing model training using the noisy voice signal, the voice signal, and the weight information corresponding to each voice frequency band error to obtain the voice enhancement model.
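The "adding a noise signal to the voice signal" step can be sketched as mixing noise into clean speech at a chosen signal-to-noise ratio, a common convention the patent itself does not specify. The function name and the SNR-based scaling are illustrative assumptions:

```python
import numpy as np

def make_noisy_pair(clean, noise, snr_db=5.0):
    """Scale the noise so that mixing yields the requested SNR, then add
    it to the clean speech, producing a (noisy, clean) training pair."""
    noise = np.resize(noise, clean.shape)           # tile/trim noise to length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12           # guard against silence
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise, clean
```

Repeating this over many clean utterances and noise types would yield the paired corpus on which the feature extraction and model training described above operate.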
Optionally, performing model training using the noisy voice signal, the voice signal, and the weight information corresponding to each voice frequency band error to obtain a voice enhancement model includes: determining an output estimation signal corresponding to the noisy voice signal; determining, according to the voice signal, an output prediction error corresponding to the output estimation signal; adaptively processing the output prediction error according to the weight information corresponding to the voice frequency band errors to obtain a voice enhancement error corresponding to each voice frequency band; determining, according to the voice enhancement error corresponding to each voice frequency band, the weight bias parameter corresponding to each voice frequency band; and generating the voice enhancement model according to the weight bias parameters corresponding to each voice frequency band.
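The patent does not disclose the model family or the adaptive processing in concrete terms. As a minimal stand-in, the loop below fits one multiplicative gain per frequency band by gradient descent on a band-weighted squared prediction error, so the per-band "weight information" scales each band's contribution to the loss. The gain-per-band model, learning rate, and function name are all assumptions made for illustration:

```python
import numpy as np

def train_band_gains(noisy_feats, clean_feats, band_weights,
                     lr=0.1, epochs=500):
    """Fit one gain per frequency band (a stand-in for the patent's
    'weight bias parameters') by minimizing the band-weighted squared
    output prediction error.

    noisy_feats, clean_feats: (n_frames, n_bands) feature matrices.
    band_weights: (n_bands,) weights emphasizing chosen bands."""
    n_bands = noisy_feats.shape[1]
    gains = np.ones(n_bands)
    for _ in range(epochs):
        est = noisy_feats * gains                  # output estimation signal
        err = est - clean_feats                    # output prediction error
        # adaptive processing: each band's error is scaled by its weight
        grad = 2.0 * (band_weights * err * noisy_feats).mean(axis=0)
        gains -= lr * grad
    return gains
```

Bands with larger weights incur a larger penalty and therefore converge faster; with a real network the same weighting would simply multiply the per-band terms of the training loss.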
Optionally, performing noise reduction processing on each piece of voice frequency band data in the noisy voice data according to the weight bias parameter corresponding to each voice frequency band to obtain the target voice signal corresponding to the target user includes: determining, based on the weight bias parameters corresponding to each voice frequency band, a target weight bias parameter corresponding to each piece of voice frequency band data in the noisy voice data; performing noise reduction processing on each piece of voice frequency band data in the noisy voice data according to its target weight bias parameter to obtain noise-reduced voice data corresponding to each voice frequency band; and generating the target voice signal corresponding to the target user according to the voice features and the noise-reduced voice data.
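One conceivable way to realize the final step, turning per-band noise-reduced data back into a time-domain target signal, is a windowed overlap-add resynthesis that scales each frame's spectrum by the gain of the band each bin falls in and keeps the noisy phase. The overlap-add scheme and the uniform band-to-bin mapping are assumptions, not taken from the patent:

```python
import numpy as np

def resynthesize(mixed, band_gains, n_fft=512, hop=256):
    """Denoise per frequency band and rebuild a time-domain signal:
    apply each band's gain to its share of the FFT bins (keeping the
    noisy phase), then overlap-add the inverse FFTs of all frames."""
    win = np.hanning(n_fft)
    out = np.zeros(len(mixed))
    norm = np.zeros(len(mixed))
    # per-bin gain: repeat each band's gain over its slice of the bins
    bins = np.array_split(np.arange(n_fft // 2 + 1), len(band_gains))
    bin_gain = np.empty(n_fft // 2 + 1)
    for g, idx in zip(band_gains, bins):
        bin_gain[idx] = g
    for start in range(0, len(mixed) - n_fft + 1, hop):
        spec = np.fft.rfft(mixed[start:start + n_fft] * win)
        frame = np.fft.irfft(spec * bin_gain, n_fft)
        out[start:start + n_fft] += frame * win     # synthesis window
        norm[start:start + n_fft] += win ** 2       # overlap-add weight
    return out / np.maximum(norm, 1e-8)
```

With all gains equal to 1 this reconstructs the input (away from the edges), which is a convenient sanity check before plugging in learned per-band parameters.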
Optionally, outputting according to the target voice signal includes: performing voice output according to the target voice signal; and/or performing voice recognition on the target voice signal, generating a recognition result, and outputting the recognition result.
In this specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another.
It will be apparent to those skilled in the art that embodiments of the present invention may be provided as a method, an apparatus, or a computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that relational terms such as first and second are used herein solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such process, method, article, or terminal device. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or terminal device that comprises the element.
The foregoing has described in detail an audio processing method, apparatus, and device, and a readable storage medium. Specific examples are used herein to illustrate the principles and embodiments of the present invention; the above embodiments are described only to assist in understanding the method and core concept of the present invention. Meanwhile, those skilled in the art may make changes to the specific embodiments and the application scope in accordance with the ideas of the present invention. In view of the above, the contents of this description should not be construed as limiting the present invention.

Claims (13)

1. An audio processing method, comprising:
training a voice enhancement model in advance according to weight information corresponding to acquired voice frequency band errors;
after receiving a mixed voice signal, performing voice enhancement on the mixed voice signal according to weight bias parameters corresponding to each voice frequency band in the voice enhancement model to obtain a target voice signal, wherein the weight bias parameters are obtained by training according to the weight information; and
outputting according to the target voice signal;
wherein training the voice enhancement model in advance according to the weight information corresponding to the acquired voice frequency band errors includes:
acquiring weight information corresponding to each preset voice frequency band error according to a received voice signal; and
performing model training according to the weight information corresponding to each voice frequency band error and the voice signal to obtain the voice enhancement model;
wherein performing model training according to the weight information corresponding to each voice frequency band error and the voice signal to obtain the voice enhancement model includes:
adding a noise signal to the voice signal to generate a noisy voice signal;
performing feature extraction on the noisy voice signal to obtain voice features corresponding to the noisy voice data; and
based on the voice features, performing model training using the noisy voice signal, the voice signal, and the weight information corresponding to each voice frequency band error to obtain the voice enhancement model;
wherein performing model training using the noisy voice signal, the voice signal, and the weight information corresponding to each voice frequency band error to obtain the voice enhancement model includes:
determining an output estimation signal corresponding to the noisy voice signal;
determining, according to the voice signal, an output prediction error corresponding to the output estimation signal;
adaptively processing the output prediction error according to the weight information corresponding to the voice frequency band errors to obtain a voice enhancement error corresponding to each voice frequency band;
determining, according to the voice enhancement error corresponding to each voice frequency band, the weight bias parameter corresponding to each voice frequency band; and
generating the voice enhancement model according to the weight bias parameters corresponding to each voice frequency band.
2. The method of claim 1, wherein performing voice enhancement on the mixed voice signal according to the weight bias parameters corresponding to each voice frequency band in the voice enhancement model to obtain a target voice signal comprises:
extracting features of the mixed voice signal to obtain noisy voice data, wherein the noisy voice data includes at least one piece of voice frequency band data;
inputting the noisy voice signal into the voice enhancement model; and
performing, through the voice enhancement model, noise reduction processing on each piece of voice frequency band data in the noisy voice data according to the weight bias parameters corresponding to each voice frequency band to obtain a target voice signal corresponding to a target user.
3. The method of claim 2, wherein performing noise reduction processing on each piece of voice frequency band data in the noisy voice data according to the weight bias parameter corresponding to each voice frequency band to obtain the target voice signal corresponding to the target user comprises:
determining, based on the weight bias parameters corresponding to each voice frequency band, a target weight bias parameter corresponding to each piece of voice frequency band data in the noisy voice data;
performing noise reduction processing on each piece of voice frequency band data in the noisy voice data according to its target weight bias parameter to obtain noise-reduced voice data corresponding to each voice frequency band; and
generating the target voice signal corresponding to the target user according to the voice features and the noise-reduced voice data.
4. The method according to claim 1, 2, or 3, wherein outputting according to the target voice signal comprises:
performing voice output according to the target voice signal; and/or
performing voice recognition on the target voice signal to generate a recognition result, and outputting the recognition result.
5. An audio processing apparatus, comprising:
a model training module, configured to train a voice enhancement model in advance according to weight information corresponding to acquired voice frequency band errors;
a voice enhancement module, configured to, after a mixed voice signal is received, perform voice enhancement on the mixed voice signal according to weight bias parameters corresponding to each voice frequency band in the voice enhancement model to obtain a target voice signal; and
an output module, configured to output according to the target voice signal;
wherein the model training module comprises:
a weight information acquisition sub-module, configured to acquire weight information corresponding to each preset voice frequency band error according to a received voice signal; and
a model training sub-module, configured to perform model training according to the weight information corresponding to each voice frequency band error and the voice signal to obtain the voice enhancement model;
wherein the model training sub-module comprises:
a noise adding unit, configured to add a noise signal to the voice signal to generate a noisy voice signal;
a feature extraction unit, configured to perform feature extraction on the noisy voice signal to obtain voice features corresponding to the noisy voice data; and
a model training unit, configured to, based on the voice features, perform model training using the noisy voice signal, the voice signal, and the weight information corresponding to each voice frequency band error to obtain the voice enhancement model;
wherein the model training unit comprises:
an estimation signal determining subunit, configured to determine an output estimation signal corresponding to the noisy voice signal;
a prediction error determining subunit, configured to determine, according to the voice signal, an output prediction error corresponding to the output estimation signal;
an adaptive processing subunit, configured to adaptively process the output prediction error according to the weight information corresponding to the voice frequency band errors to obtain a voice enhancement error corresponding to each voice frequency band;
a weight parameter determining subunit, configured to determine, according to the voice enhancement error corresponding to each voice frequency band, the weight bias parameter corresponding to each voice frequency band; and
a model generating subunit, configured to generate the voice enhancement model according to the weight bias parameters corresponding to each voice frequency band.
6. The audio processing apparatus of claim 5, wherein the voice enhancement module comprises:
a feature extraction sub-module, configured to extract features of the mixed voice signal to obtain noisy voice data, wherein the noisy voice data includes at least one piece of voice frequency band data;
a signal input sub-module, configured to input the noisy voice signal into the voice enhancement model; and
a noise reduction processing sub-module, configured to perform, through the voice enhancement model, noise reduction processing on each piece of voice frequency band data in the noisy voice data according to the weight bias parameters corresponding to each voice frequency band to obtain a target voice signal corresponding to a target user.
7. The audio processing apparatus of claim 6, wherein the noise reduction processing sub-module comprises:
a target weight bias parameter determining unit, configured to determine, based on the weight bias parameters corresponding to each voice frequency band, a target weight bias parameter corresponding to each piece of voice frequency band data in the noisy voice data;
a noise reduction processing unit, configured to perform noise reduction processing on each piece of voice frequency band data in the noisy voice data according to its target weight bias parameter to obtain noise-reduced voice data corresponding to each voice frequency band; and
a target voice signal generating unit, configured to generate the target voice signal corresponding to the target user according to the voice features and the noise-reduced voice data.
8. The audio processing apparatus according to claim 5, 6, or 7, wherein the output module comprises:
a voice output sub-module, configured to perform voice output according to the target voice signal; and/or
a voice recognition sub-module, configured to perform voice recognition on the target voice signal to generate a recognition result, and output the recognition result.
9. An apparatus comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for:
training a voice enhancement model in advance according to weight information corresponding to acquired voice frequency band errors;
after receiving a mixed voice signal, performing voice enhancement on the mixed voice signal according to weight bias parameters corresponding to each voice frequency band in the voice enhancement model to obtain a target voice signal, wherein the weight bias parameters are obtained by training according to the weight information; and
outputting according to the target voice signal;
wherein training the voice enhancement model in advance according to the weight information corresponding to the acquired voice frequency band errors includes:
acquiring weight information corresponding to each preset voice frequency band error according to a received voice signal; and
performing model training according to the weight information corresponding to each voice frequency band error and the voice signal to obtain the voice enhancement model;
wherein performing model training according to the weight information corresponding to each voice frequency band error and the voice signal to obtain the voice enhancement model includes:
adding a noise signal to the voice signal to generate a noisy voice signal;
performing feature extraction on the noisy voice signal to obtain voice features corresponding to the noisy voice data; and
based on the voice features, performing model training using the noisy voice signal, the voice signal, and the weight information corresponding to each voice frequency band error to obtain the voice enhancement model;
wherein performing model training using the noisy voice signal, the voice signal, and the weight information corresponding to each voice frequency band error to obtain the voice enhancement model includes:
determining an output estimation signal corresponding to the noisy voice signal;
determining, according to the voice signal, an output prediction error corresponding to the output estimation signal;
adaptively processing the output prediction error according to the weight information corresponding to the voice frequency band errors to obtain a voice enhancement error corresponding to each voice frequency band;
determining, according to the voice enhancement error corresponding to each voice frequency band, the weight bias parameter corresponding to each voice frequency band; and
generating the voice enhancement model according to the weight bias parameters corresponding to each voice frequency band.
10. The apparatus of claim 9, wherein performing voice enhancement on the mixed voice signal according to the weight bias parameters corresponding to each voice frequency band in the voice enhancement model to obtain a target voice signal comprises:
extracting features of the mixed voice signal to obtain noisy voice data, wherein the noisy voice data includes at least one piece of voice frequency band data;
inputting the noisy voice signal into the voice enhancement model; and
performing, through the voice enhancement model, noise reduction processing on each piece of voice frequency band data in the noisy voice data according to the weight bias parameters corresponding to each voice frequency band to obtain a target voice signal corresponding to a target user.
11. The apparatus of claim 10, wherein performing noise reduction processing on each piece of voice frequency band data in the noisy voice data according to the weight bias parameter corresponding to each voice frequency band to obtain the target voice signal corresponding to the target user comprises:
determining, based on the weight bias parameters corresponding to each voice frequency band, a target weight bias parameter corresponding to each piece of voice frequency band data in the noisy voice data;
performing noise reduction processing on each piece of voice frequency band data in the noisy voice data according to its target weight bias parameter to obtain noise-reduced voice data corresponding to each voice frequency band; and
generating the target voice signal corresponding to the target user according to the voice features and the noise-reduced voice data.
12. The apparatus according to claim 9, 10, or 11, wherein outputting according to the target voice signal comprises:
performing voice output according to the target voice signal; and/or
performing voice recognition on the target voice signal to generate a recognition result, and outputting the recognition result.
13. A readable storage medium, wherein instructions in the storage medium, when executed by a processor of a device, enable the device to perform the audio processing method according to any one of claims 1 to 4.
CN201810589891.7A 2018-06-08 2018-06-08 Audio processing method, device, equipment and readable storage medium Active CN110580910B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810589891.7A CN110580910B (en) 2018-06-08 2018-06-08 Audio processing method, device, equipment and readable storage medium


Publications (2)

Publication Number Publication Date
CN110580910A CN110580910A (en) 2019-12-17
CN110580910B true CN110580910B (en) 2024-04-26

Family

ID=68809797

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810589891.7A Active CN110580910B (en) 2018-06-08 2018-06-08 Audio processing method, device, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN110580910B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112669870B (en) * 2020-12-24 2024-05-03 北京声智科技有限公司 Training method and device for voice enhancement model and electronic equipment
CN113707134B (en) * 2021-08-17 2024-05-17 北京搜狗科技发展有限公司 Model training method and device for model training

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2151822A1 (en) * 2008-08-05 2010-02-10 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus and method for processing an audio signal for speech enhancement using a feature extraction
CN102063899A (en) * 2010-10-27 2011-05-18 Nanjing University of Posts and Telecommunications Method for voice conversion under non-parallel text condition
AT509570A5 (en) * 2007-10-02 2011-09-15 Akg Acoustics Gmbh Method and apparatus for single-channel speech enhancement based on a latency-reduced hearing model
US8983844B1 (en) * 2012-07-31 2015-03-17 Amazon Technologies, Inc. Transmission of noise parameters for improving automatic speech recognition
CN104952458A (en) * 2015-06-09 2015-09-30 GRG Banking Equipment Co., Ltd. Noise suppression method, device and system
CN107332984A (en) * 2017-06-21 2017-11-07 Vivo Mobile Communication Co., Ltd. Speech processing method and mobile terminal
CN107437412A (en) * 2016-05-25 2017-12-05 Beijing Sogou Technology Development Co., Ltd. Acoustic model processing method, speech synthesis method, device and related equipment
CN107564538A (en) * 2017-09-18 2018-01-09 Wuhan University Clarity enhancement method and system for real-time speech communication

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Application of progressive learning speech enhancement method in speech recognition; Wen Shixue; Sun Lei; Du Jun; Journal of Chinese Computer Systems (01); full text *

Also Published As

Publication number Publication date
CN110580910A (en) 2019-12-17

Similar Documents

Publication Publication Date Title
CN108198569B (en) Audio processing method, device and equipment and readable storage medium
CN109801644B (en) Separation method, separation device, electronic equipment and readable medium for mixed sound signal
CN107705783B (en) Voice synthesis method and device
CN110097890B (en) Voice processing method and device for voice processing
CN108346433A (en) Audio processing method, device, equipment and readable storage medium
CN110931028B (en) Voice processing method and device and electronic equipment
CN111583944A (en) Sound changing method and device
CN113362812B (en) Voice recognition method and device and electronic equipment
CN113707134B (en) Model training method and device for model training
CN115273831A (en) Voice conversion model training method, voice conversion method and device
CN109256145B (en) Terminal-based audio processing method and device, terminal and readable storage medium
CN110970015B (en) Voice processing method and device and electronic equipment
US11354520B2 (en) Data processing method and apparatus providing translation based on acoustic model, and storage medium
CN110580910B (en) Audio processing method, device, equipment and readable storage medium
CN104851423B (en) Sound information processing method and device
CN107437412B (en) Acoustic model processing method, voice synthesis method, device and related equipment
CN109447258B (en) Neural network model optimization method and device, electronic device and storage medium
CN110660403B (en) Audio data processing method, device, equipment and readable storage medium
CN112331194A (en) Input method and device and electronic equipment
CN111667842B (en) Audio signal processing method and device
CN110503968B (en) Audio processing method, device, equipment and readable storage medium
CN112002313B (en) Interaction method and device, sound box, electronic equipment and storage medium
CN113409765A (en) Voice synthesis method and device for voice synthesis
CN111524505A (en) Voice processing method and device and electronic equipment
CN111063365B (en) Voice processing method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220727

Address after: 100084. Room 9, floor 01, cyber building, building 9, building 1, Zhongguancun East Road, Haidian District, Beijing

Applicant after: BEIJING SOGOU TECHNOLOGY DEVELOPMENT Co.,Ltd.

Address before: 100084. Room 9, floor 01, cyber building, building 9, building 1, Zhongguancun East Road, Haidian District, Beijing

Applicant before: BEIJING SOGOU TECHNOLOGY DEVELOPMENT Co.,Ltd.

Applicant before: SOGOU (HANGZHOU) INTELLIGENT TECHNOLOGY Co.,Ltd.

GR01 Patent grant