CN110503968B - Audio processing method, device, equipment and readable storage medium - Google Patents
- Publication number
- CN110503968B (application number CN201810481272.6A)
- Authority
- CN
- China
- Prior art keywords
- voice
- target
- voice signal
- signal
- network model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0224—Processing in the time domain
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0232—Processing in the frequency domain
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
Abstract
Embodiments of the invention provide an audio processing method, an audio processing apparatus, a device, and a readable storage medium. The method includes: acquiring an input mixed speech signal; performing speech enhancement on the mixed speech signal with a pre-trained residual network model to obtain a target speech signal; and producing output based on the target speech signal. Embodiments of the invention solve the vanishing-gradient problem of conventional neural-network-based speech enhancement methods and improve the speech enhancement effect.
Description
Technical Field
The present invention relates to the field of communication technology, and in particular to an audio processing method, an audio processing apparatus, a device, and a readable storage medium.
Background
With the rapid development of communication technology, terminals such as mobile phones and tablet computers have become increasingly popular, bringing great convenience to people's daily life, study, and work.
These terminals can collect speech signals through microphones and process the collected signals with speech enhancement techniques to reduce the effect of noise interference. Speech enhancement is the technique of extracting the useful speech signal from a noisy background, suppressing and reducing noise interference, when the speech signal is disturbed or even submerged by various kinds of noise.
Currently, terminals typically perform speech enhancement with conventional neural networks, such as deep neural networks (Deep Neural Network, DNN), convolutional neural networks (Convolutional Neural Network, CNN), and long short-term memory networks (Long Short-Term Memory, LSTM). However, speech enhancement methods based on conventional neural networks suffer from the vanishing-gradient problem. For a fully connected DNN, for example, the problem becomes more and more serious as the network deepens, i.e. as the number of network layers increases, becoming noticeable once the depth reaches about five layers. If the number of layers is increased further and the DNN is then used for speech enhancement, the enhancement performance no longer improves and may even degrade, harming the speech enhancement effect.
Disclosure of Invention
The technical problem addressed by the embodiments of the invention is to provide an audio processing method that improves the speech enhancement effect.
Correspondingly, embodiments of the invention also provide an audio processing apparatus, a device, and a readable storage medium to ensure the implementation and application of the method.
To solve the above problem, an embodiment of the invention discloses an audio processing method, including: acquiring an input mixed speech signal; performing speech enhancement on the mixed speech signal with a pre-trained residual network model to obtain a target speech signal; and producing output based on the target speech signal.
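Viewed as a pipeline, the three claimed steps can be sketched as follows. This is a minimal illustration only: the signal is a synthetic tone plus noise, and the "model" is a hypothetical placeholder callable standing in for the pre-trained residual network.

```python
import numpy as np

def acquire_mixed_signal():
    # Hypothetical stand-in for microphone capture: a clean tone plus noise.
    t = np.linspace(0, 1, 16000, endpoint=False)
    clean = np.sin(2 * np.pi * 440 * t)
    noise = 0.3 * np.random.default_rng(0).standard_normal(t.size)
    return clean + noise

def enhance(mixed, model):
    # `model` stands in for the pre-trained residual network; here it is
    # any callable mapping a noisy waveform to an enhanced one.
    return model(mixed)

def produce_output(target):
    # Placeholder for playback or downstream speech recognition.
    return target

mixed = acquire_mixed_signal()
identity_model = lambda x: x  # trivial placeholder, not a trained model
target = produce_output(enhance(mixed, identity_model))
```

The design point is simply that enhancement is a single forward pass through a trained model sitting between capture and output.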
Optionally, performing speech enhancement on the mixed speech signal with a pre-trained residual network model to obtain a target speech signal includes: performing feature extraction on the mixed speech signal to obtain speech features and speech data of a target user, where the mixed speech signal includes a noise signal and the target user's speech signal; and performing noise reduction on the speech data through the pre-trained residual network model according to the speech features, to obtain a target speech signal corresponding to the target user.
Optionally, the method further includes: pre-training a residual network model corresponding to the speech features. Performing noise reduction on the speech data through the pre-trained residual network model according to the speech features to obtain a target speech signal corresponding to the target user includes: determining the residual network model corresponding to the target user according to the target user's speech features; and performing noise reduction on the speech data through that residual network model to obtain the target speech signal.
Optionally, performing noise reduction on the speech data through the residual network model corresponding to the target user to obtain the target speech signal includes: determining the network weight information corresponding to each network layer in the residual network model; mapping the speech data according to the network weight information of each network layer to obtain mapped speech data; and generating the target speech signal based on the mapped speech data and the original speech data.
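The per-layer mapping followed by recombination with the original data is the defining residual pattern. A minimal numerical sketch, in which the weight matrices are random stand-ins for the trained "network weight information" of each layer:

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 64  # assumed feature dimension, illustrative only
layers = [rng.standard_normal((dim, dim)) * 0.01 for _ in range(3)]

def map_through_layers(x, layers):
    # Map the speech data through each layer's weight information.
    h = x
    for W in layers:
        h = np.maximum(W @ h, 0.0)  # linear map + ReLU activation
    return h

x = rng.standard_normal(dim)        # speech data extracted from the mixture
mapped = map_through_layers(x, layers)
target = x + mapped                 # residual: output = input + learned mapping
```

Generating the target from both the mapped data and the original data (the `x + mapped` step) is what distinguishes this from a plain feed-forward stack.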
Optionally, performing feature extraction on the mixed speech signal to obtain speech features and speech data of the target user includes: performing frequency-domain feature extraction on the mixed speech signal to obtain frequency-domain speech features and frequency-domain speech data of the target user. Generating the target speech signal based on the mapped speech data and the speech data includes: decoding the mapped speech data together with the frequency-domain speech data to obtain decoded speech data; and performing waveform reconstruction on the decoded speech data according to the frequency-domain speech features to obtain the target speech signal.
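One common way to realize this frequency-domain variant, offered here as an assumption rather than the patent's specified decomposition, is to treat the magnitude spectrum as the speech data passed through the model and the phase as the retained feature used for waveform reconstruction. With an identity placeholder in place of the trained model, reconstruction recovers the input:

```python
import numpy as np

rng = np.random.default_rng(2)
noisy = np.sin(2 * np.pi * 300 * np.arange(1024) / 16000.0)
noisy = noisy + 0.2 * rng.standard_normal(noisy.size)

# Frequency-domain extraction: magnitude as the speech data to denoise,
# phase as the feature retained for waveform reconstruction (assumed split).
spectrum = np.fft.rfft(noisy)
mag, phase = np.abs(spectrum), np.angle(spectrum)

# Placeholder for the residual model's denoised magnitude (identity here).
enhanced_mag = mag

# Waveform reconstruction: recombine magnitude with the retained phase.
target = np.fft.irfft(enhanced_mag * np.exp(1j * phase), n=noisy.size)
```

In a real system the transform would be a framed short-time transform rather than one whole-signal FFT; the whole-signal version is used only to keep the sketch short.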
Optionally, performing feature extraction on the mixed speech signal to obtain speech features and speech data of the target user includes: performing time-domain feature extraction on the mixed speech signal to obtain time-domain speech features and time-domain speech data of the target user. Generating the target speech signal based on the mapped speech data and the speech data includes: generating the target speech signal from the mapped speech data and the time-domain speech data.
Optionally, training the residual network model corresponding to the speech features includes: adding a noise signal to an input speech signal to generate a noisy speech signal; performing feature extraction on the noisy speech signal to obtain the speech features corresponding to the noisy speech signal; and performing model training with the noisy speech signal and the speech signal according to a preset residual network structure, generating the residual network model corresponding to the speech features.
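The noise-adding step can be illustrated as mixing a clean signal with noise at a chosen signal-to-noise ratio, producing (noisy, clean) training pairs. The `add_noise` helper and the 10 dB target below are illustrative assumptions, not values from the patent:

```python
import numpy as np

def add_noise(speech, noise, snr_db):
    # Scale the noise so the mixture has the requested signal-to-noise ratio.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(3)
clean = np.sin(2 * np.pi * 200 * np.arange(8000) / 8000.0)  # stand-in clean speech
noise = rng.standard_normal(clean.size)                     # stand-in noise
noisy = add_noise(clean, noise, snr_db=10.0)

# The realized SNR of the mixture matches the requested 10 dB.
realized = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
```

Each such pair supplies the model's input (features of `noisy`) and its regression target (`clean`) during training.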
Optionally, producing output based on the target speech signal includes: performing speech output according to the target speech signal; and/or performing speech recognition on the target speech signal to generate a recognition result, and outputting the recognition result.
An embodiment of the invention further discloses an audio processing apparatus, including:
a speech signal acquisition module, configured to acquire an input mixed speech signal;
a speech enhancement module, configured to perform speech enhancement on the mixed speech signal with a pre-trained residual network model to obtain a target speech signal; and
a speech signal output module, configured to produce output based on the target speech signal.
Optionally, the speech enhancement module includes:
a feature extraction submodule, configured to perform feature extraction on the mixed speech signal to obtain speech features and speech data of a target user, where the mixed speech signal includes a noise signal and the target user's speech signal; and
a noise reduction submodule, configured to perform noise reduction on the speech data through the pre-trained residual network model according to the speech features, to obtain a target speech signal corresponding to the target user.
Optionally, the apparatus may further include: a residual network model training module, configured to pre-train a residual network model corresponding to the speech features.
The noise reduction submodule includes: a residual network model determining unit, configured to determine the residual network model corresponding to the target user according to the target user's speech features; and a noise reduction unit, configured to perform noise reduction on the speech data through that residual network model to obtain the target speech signal.
Optionally, the noise reduction unit includes:
a network weight information determining subunit, configured to determine the network weight information corresponding to each network layer in the residual network model;
a mapping subunit, configured to map the speech data according to the network weight information of each network layer to obtain mapped speech data; and
a target speech signal generation subunit, configured to generate the target speech signal based on the mapped speech data and the speech data.
Optionally, the feature extraction submodule is specifically configured to perform frequency-domain feature extraction on the mixed speech signal to obtain frequency-domain speech features and frequency-domain speech data of the target user.
The target speech signal generation subunit is specifically configured to decode the mapped speech data together with the frequency-domain speech data to obtain decoded speech data, and to perform waveform reconstruction on the decoded speech data according to the frequency-domain speech features to obtain the target speech signal.
Optionally, the feature extraction submodule is specifically configured to perform time-domain feature extraction on the mixed speech signal to obtain time-domain speech features and time-domain speech data of the target user.
The target speech signal generation subunit is specifically configured to generate the target speech signal from the mapped speech data and the time-domain speech data.
Optionally, the residual network model training module includes:
a noise-adding submodule, configured to add a noise signal to an input speech signal to generate a noisy speech signal;
a feature extraction submodule, configured to perform feature extraction on the noisy speech signal to obtain the speech features corresponding to the noisy speech signal; and
a model training submodule, configured to perform model training with the noisy speech signal and the speech signal according to a preset residual network structure, generating the residual network model corresponding to the speech features.
Optionally, the speech signal output module includes:
a speech output submodule, configured to perform speech output according to the target speech signal; and/or
a speech recognition submodule, configured to perform speech recognition on the target speech signal, generate a recognition result, and output the recognition result.
An embodiment of the invention further discloses a device, including a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for: acquiring an input mixed speech signal; performing speech enhancement on the mixed speech signal with a pre-trained residual network model to obtain a target speech signal; and producing output based on the target speech signal.
Optionally, performing speech enhancement on the mixed speech signal with a pre-trained residual network model to obtain a target speech signal includes: performing feature extraction on the mixed speech signal to obtain speech features and speech data of a target user, where the mixed speech signal includes a noise signal and the target user's speech signal; and performing noise reduction on the speech data through the pre-trained residual network model according to the speech features, to obtain a target speech signal corresponding to the target user.
Optionally, execution of the one or more programs by the one or more processors includes instructions for: pre-training a residual network model corresponding to the speech features. Performing noise reduction on the speech data through the pre-trained residual network model according to the speech features to obtain a target speech signal corresponding to the target user includes: determining the residual network model corresponding to the target user according to the target user's speech features; and performing noise reduction on the speech data through that residual network model to obtain the target speech signal.
Optionally, performing noise reduction on the speech data through the residual network model corresponding to the target user to obtain the target speech signal includes: determining the network weight information corresponding to each network layer in the residual network model; mapping the speech data according to the network weight information of each network layer to obtain mapped speech data; and generating the target speech signal based on the mapped speech data and the original speech data.
Optionally, performing feature extraction on the mixed speech signal to obtain speech features and speech data of the target user includes: performing frequency-domain feature extraction on the mixed speech signal to obtain frequency-domain speech features and frequency-domain speech data of the target user. Generating the target speech signal based on the mapped speech data and the speech data includes: decoding the mapped speech data together with the frequency-domain speech data to obtain decoded speech data; and performing waveform reconstruction on the decoded speech data according to the frequency-domain speech features to obtain the target speech signal.
Optionally, performing feature extraction on the mixed speech signal to obtain speech features and speech data of the target user includes: performing time-domain feature extraction on the mixed speech signal to obtain time-domain speech features and time-domain speech data of the target user. Generating the target speech signal based on the mapped speech data and the speech data includes: generating the target speech signal from the mapped speech data and the time-domain speech data.
Optionally, training the residual network model corresponding to the speech features includes: adding a noise signal to an input speech signal to generate a noisy speech signal; performing feature extraction on the noisy speech signal to obtain the speech features corresponding to the noisy speech signal; and performing model training with the noisy speech signal and the speech signal according to a preset residual network structure, generating the residual network model corresponding to the speech features.
Optionally, producing output based on the target speech signal includes: performing speech output according to the target speech signal; and/or performing speech recognition on the target speech signal to generate a recognition result, and outputting the recognition result.
An embodiment of the invention further discloses a readable storage medium; when the instructions in the storage medium are executed by a processor of a device, the device is enabled to perform the audio processing method of one or more of the embodiments of the invention.
Embodiments of the invention have the following advantages:
The obtained mixed speech signal is speech-enhanced through a pre-trained residual network model, which avoids the degradation of enhancement quality caused by increasing network depth; that is, it solves the vanishing-gradient problem of conventional neural-network-based speech enhancement methods and improves the speech enhancement effect.
Drawings
FIG. 1 is a flow chart of steps of an embodiment of an audio processing method of the present invention;
FIG. 2 is a schematic illustration of speech enhancement using a pre-trained residual network model in one example of the present invention;
FIG. 3 is a flow chart of steps of an alternative embodiment of an audio processing method of the present invention;
FIG. 4 is a schematic representation of mixed speech collected in an example of the invention;
FIG. 5 is a block diagram of an embodiment of an audio processing apparatus of the present invention;
FIG. 6 is a block diagram illustrating an apparatus for audio processing according to an exemplary embodiment;
Fig. 7 is a schematic structural view of an apparatus according to an embodiment of the present invention.
Detailed Description
To make the above objects, features, and advantages of the present invention more readily apparent, the invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
Currently, existing speech enhancement methods generally train a model with a conventional neural network and use the trained neural network model for speech enhancement. Network depth is a major factor affecting the performance of conventional neural networks: as depth increases, the vanishing-gradient problem appears and grows more severe, so the trained model produces a relatively poor target speech signal, harming the speech enhancement effect.
One of the core ideas of the embodiments of the invention is a new audio processing method that uses a residual network model to perform speech enhancement on an input mixed speech signal, thereby solving the vanishing-gradient problem of existing speech enhancement methods based on conventional neural networks and improving the speech enhancement effect.
Referring to FIG. 1, a flowchart of the steps of an embodiment of an audio processing method according to the present invention is shown; the method may specifically include the following steps:
Step 102: obtain an input mixed speech signal.
An embodiment of the invention can acquire the input mixed speech signal during speech input. The mixed speech signal is a speech signal requiring speech enhancement and may include the target user's speech signal, a noise signal, and the like. The target user's speech signal refers to the clean speech of the target user speaking, such as the time-domain signal corresponding to the target speaker's speech; the noise signal refers to the signal corresponding to interfering noise, for example the time-domain signal corresponding to interfering speech spoken by other speakers, which the embodiments of the invention do not limit.
Step 104: perform speech enhancement on the mixed speech signal with a pre-trained residual network model to obtain a target speech signal.
In an embodiment of the invention, the obtained mixed speech signal is fed as input to the pre-trained residual network model, which performs speech enhancement on it, removing the interfering noise in the mixed speech signal and yielding the enhanced target speech signal. The target speech signal contains only the target user's clean speech and represents the signal corresponding to it, for example the clean speech signal corresponding to the target speaker's voice.
In an alternative embodiment, after the mixed speech signal is obtained, feature extraction may be performed on it to obtain speech features and speech data. The speech data refers to the noisy speech data remaining after speech feature extraction and may specifically include noise data and the target speech data to be retained. Noise reduction can then be performed on the speech data through the pre-trained residual network model according to the speech features, yielding the enhanced target speech signal. Note that the speech features may include time-domain speech features and/or frequency-domain speech features, which the embodiments of the invention do not limit; time-domain speech features characterize the speech in the time domain, and frequency-domain speech features characterize it in the frequency domain.
Step 106: produce output based on the target speech signal.
After the enhanced target speech signal is obtained, output can be produced from it. For example, speech output may be performed to play back the clean speech spoken by the user. Alternatively, speech recognition may be performed on the target speech signal to recognize the user's clean speech, convert the recognized speech into text information, and output accordingly, such as displaying the text on the device's screen or showing search results corresponding to it.
In summary, embodiments of the invention perform speech enhancement on the obtained mixed speech signal through a pre-trained residual network model, avoiding the degradation of enhancement quality caused by increasing network depth; that is, they solve the vanishing-gradient problem of conventional neural-network-based speech enhancement methods and improve the speech enhancement effect.
In a specific implementation, model training can be performed in advance according to the network structure of the residual network, based on the speech features of the speech signals, so that residual network models corresponding to various speech features are trained; speech enhancement can then use the pre-trained residual network model matching the speech features, ensuring the enhancement effect. Optionally, the audio processing method of the embodiments of the invention may further include: pre-training a residual network model corresponding to the speech features.
Specifically, in the model training stage, a noise signal can be added to an input speech signal to generate a noisy speech signal, and feature extraction is performed on the noisy speech signal to obtain the corresponding speech features. Then, for the obtained speech features, model training is performed with the generated noisy speech signal according to a preset residual network structure, producing the residual network model corresponding to those speech features. The input speech signal refers to a clean speech signal, which may specifically include a collected clean speech signal and/or a pre-synthesized clean speech signal; for example, it may be the currently input clean speech obtained in real time during speech input, a pre-recorded time-domain signal of a segment of clean speech, or a pre-synthesized time-domain signal of a segment of clean speech, which the embodiments of the invention do not limit.
In an optional embodiment of the invention, training the residual network model corresponding to the speech features may specifically include: adding a noise signal to an input speech signal to generate a noisy speech signal; performing feature extraction on the noisy speech signal to obtain the corresponding speech features; and performing model training with the noisy speech signal and the speech signal according to a preset residual network structure, generating the residual network model corresponding to the speech features. The preset residual network structure may be set in advance according to the network structure of a residual network, which the embodiments of the invention do not limit.
Specifically, in the training phase, the input clean speech signal may be noise-augmented; that is, a noise signal may be added to the input speech signal to generate a noisy speech signal. The noise signal may include a simulated noise signal, a pre-collected noise signal, and the like. The simulated noise signal characterizes noise previously synthesized by speech synthesis techniques; the pre-collected noise signal characterizes real noise collected in advance, such as a pre-recorded noise signal.
As an example of the present invention, when real noise has not been collected, noise addition processing may be performed on the input speech signal using a pre-synthesized artificial noise signal, so that model training is performed with the noisy speech signal generated after the noise addition processing; this avoids the high model training cost caused by collecting a large amount of real noise and reduces the training cost. Of course, when real noise has been collected, the noise signal corresponding to the collected real noise may be used to perform the noise addition processing on the input speech signal; as another example, the input speech signal may be noise-added using a portion of the collected real noise together with a synthesized artificial noise signal, etc., which is not particularly limited in this example.
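The patent does not specify how the noise signal is mixed into the clean speech; a minimal sketch of one common noise-addition scheme, assuming a target signal-to-noise ratio parameter (the function name `add_noise` and the sine/random stand-ins for clean speech and noise are illustrative, not from the patent):

```python
import numpy as np

def add_noise(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a noise signal into a clean speech signal at a target SNR (in dB)."""
    # Tile or truncate the noise to match the clean signal length.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]
    # Scale the noise so that 10*log10(P_clean / P_noise) equals snr_db.
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # stand-in for a clean speech signal
noise = rng.standard_normal(8000)                            # stand-in for real or synthetic noise
noisy = add_noise(clean, noise, snr_db=10.0)
```

Either pre-collected real noise or a synthesized artificial noise signal can be passed as `noise`, matching the two cases described above.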
Then, feature extraction is performed on the noisy speech signal to which the noise signal was added, to obtain the corresponding speech features, so that model training can be performed with a residual network in combination with the voiceprint features of the speech, obtaining a residual network model corresponding to the speech features. Specifically, as shown in fig. 2, according to a preset residual network structure, model training may be performed on the obtained speech features using the generated noisy speech signal and the input speech signal, so as to train a residual network model corresponding to each speech feature. The residual network model may include at least three network layers. In the model training process, the output of each network layer can serve both as the input of the next network layer and as a cross-layer input to later network layers; for example, the output of the first network layer can be used as the input of the second network layer and also as an input of the third or a later network layer, so as to update the weight parameters of each network layer in the residual network model, thereby alleviating the problem of gradient degradation and solving the problem of gradient vanishing.
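As a rough illustration of the cross-layer structure described above, the following is a minimal NumPy forward pass through three stacked residual blocks, in which each block adds its branch output F(x) back to its input before feeding the next layer. This is one standard residual-block arrangement consistent with the description; the layer sizes, weight initialization, and class names are hypothetical, since the patent does not specify the network at this level of detail:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class ResidualBlock:
    """Two weight layers whose branch output F(x) is added to the input: H(x) = F(x) + x."""
    def __init__(self, dim: int, rng):
        self.w1 = rng.standard_normal((dim, dim)) * 0.1
        self.w2 = rng.standard_normal((dim, dim)) * 0.1

    def forward(self, x: np.ndarray) -> np.ndarray:
        f = relu(x @ self.w1) @ self.w2  # residual branch F(x)
        return relu(f + x)               # cross-layer (identity) connection carries x forward

rng = np.random.default_rng(0)
blocks = [ResidualBlock(16, rng) for _ in range(3)]  # "at least three network layers"
x = rng.standard_normal((1, 16))
for b in blocks:  # each block's output is both the next block's input and part of its output
    x = b.forward(x)
```

Because the identity path passes gradients through unchanged, stacking such blocks is what allows deeper networks to train without vanishing gradients, as the paragraph above notes.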
Therefore, in the voice enhancement stage, i.e., when the trained residual network models are used for voice enhancement, the residual network model that currently needs to be used can be determined based on the voice features, and noise reduction processing can be performed through the determined residual network model on the voice data obtained after feature extraction, as shown in fig. 2, to obtain a target voice signal and output it. The voice data may be generated by feature extraction on an input mixed voice signal; for example, it may be frequency domain voice data obtained after frequency domain feature extraction of the mixed voice signal, or time domain voice data obtained after time domain feature extraction of the mixed voice signal, etc., which is not limited in the embodiment of the present invention.
Referring to fig. 3, a flowchart illustrating steps of an alternative embodiment of an audio processing method of the present invention may specifically include the steps of:
step 302, an input mixed speech signal is obtained.
Step 304, features of the mixed voice signal are extracted to obtain the voice features and voice data of the target user, wherein the mixed voice signal comprises a noise signal and the voice signal of the target user.
Specifically, in the voice enhancement stage, after an input mixed voice signal is detected, a signal requiring voice enhancement processing may be determined from the currently detected mixed voice signal, and the input mixed voice signal may be acquired so that the corresponding voice enhancement task is performed based on it. In the process of executing the voice enhancement task, feature extraction may be performed on the obtained mixed voice signal to obtain the voice features and voice data of the target user. The mixed voice signal may include the voice signal of the target user and a noise signal to be removed; for example, it may include a clean voice signal corresponding to the speech of the target user and interfering voice signals corresponding to the speech of other users.
Step 306, noise reduction processing is performed on the voice data through a pre-trained residual network model according to the voice features, to obtain a target voice signal corresponding to the target user.
In a specific implementation, the residual network models trained for different voice features may differ. For example, the residual network model trained for the voice features of user A can suppress the interfering voice signals of other users, such as the voice of user B speaking, while retaining the voice signal of user A speaking, so as to enhance the voice signal of user A; similarly, the residual network model trained for the voice features of user B can suppress the interfering voice signals of other users, such as the voice of user A speaking, while retaining the voice signal of user B speaking, so as to enhance the voice signal of user B. Therefore, before the noise reduction processing, the residual network model that currently needs to be used can be determined from the voice features of the target user in combination with voiceprint recognition technology, and the voice data can then be subjected to noise reduction processing through that model to obtain the target voice signal corresponding to the target user.
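The patent leaves the per-user model selection unspecified; below is a toy sketch of choosing a residual network model by voiceprint similarity. The enrolled voiceprints, the cosine-similarity criterion, and the user/model names are all assumptions for illustration:

```python
import numpy as np

def select_model(feature: np.ndarray, enrolled: dict) -> str:
    """Return the enrolled user whose voiceprint is closest (by cosine
    similarity) to the voice feature extracted from the mixed signal."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(enrolled, key=lambda user: cos(feature, enrolled[user][0]))

# Hypothetical registry: each user maps to (voiceprint embedding, model handle).
enrolled = {
    "user_a": (np.array([1.0, 0.0, 0.0]), "model_a"),
    "user_b": (np.array([0.0, 1.0, 0.0]), "model_b"),
}
who = select_model(np.array([0.9, 0.1, 0.0]), enrolled)  # nearest to user_a's voiceprint
model = enrolled[who][1]  # the residual network model to run noise reduction with
```

The chosen model would then suppress all other speakers and retain only the matched user's voice, as described above.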
In an optional embodiment of the present invention, performing noise reduction processing on the voice data through a pre-trained residual network model according to the voice features, to obtain a target voice signal corresponding to the target user, may include: determining a residual network model corresponding to the target user according to the voice features of the target user; and performing noise reduction processing on the voice data through the residual network model corresponding to the target user to obtain the target voice signal. The residual network model corresponding to the target user may refer to a residual network model trained in advance for the voice features of the target user.
Specifically, after the voice features of the target user are obtained, the residual network model trained in advance for those voice features can be determined, and noise reduction processing can be performed on the voice data through the determined residual network model to remove the noise data in the voice data while retaining the target voice data contained therein; a voice-enhanced target voice signal can then be generated based on the retained target voice data. The target voice data may be used to characterize the voice signal of the target user, such as the frequency domain data of the clean voice uttered by the target user, or the time domain data of that clean voice, etc.
In the embodiment of the present invention, optionally, noise reduction processing is performed on the voice data through a residual network model corresponding to a target user to obtain the target voice signal, which may specifically include: determining network weight information corresponding to each network layer in the residual error network model; mapping the voice data according to the network weight information corresponding to each network layer to obtain mapped voice data; and generating a target voice signal based on the mapped voice data and the voice data.
Specifically, after determining the residual network model, the embodiment of the invention can determine the network weight information corresponding to each network layer in the residual network model based on the residual network structure corresponding to the residual network model, and then can perform mapping processing on the voice data according to the network weight information corresponding to each network layer, namely, perform noise reduction processing on the voice data input to each network layer according to the network weight information corresponding to each network layer, so as to remove noise data contained in the voice data, and obtain mapped voice data. Wherein the network weight information may be used to determine a mapping relationship between the voice data and the mapped voice data. The mapped speech data may be used to characterize a clean speech signal obtained by removing a noise signal, e.g., a time domain signal that may characterize a clean speech of the noise signal, and, for example, a frequency domain signal that may characterize a clean speech of the noise signal, etc., which embodiments of the present invention are not limited in this respect. After the mapped voice data is obtained, the embodiment of the invention can process the mapped voice data and the voice data according to the residual network structure to generate the target voice signal corresponding to the target user.
As an example of the present invention, after the voice data x is obtained, the voice data x may be mapped according to the network weight information of each weight layer in the residual network model to obtain mapped voice data F(x), and the target voice data H(x) may then be generated from the mapped voice data F(x) and the voice data x according to the residual network structure, i.e., H(x) = F(x) + x. For example, when the voice data x is 5, if the generated target voice data H(x) is 5.1, the mapped voice data F(x) corresponding to the voice data is 0.1; if the generated target voice data H(x) is 5.2, the mapped voice data F(x) is 0.2. That is, when the target voice data H(x) changes from 5.1 to 5.2 (a change of about 2%), the mapped voice data F(x) changes from 0.1 to 0.2, an increase of 100%. A tiny change in the voice data is thus amplified in the residual branch, so the effect of adjusting the network weight information is clearly reflected, noise data in the voice data can be better suppressed, and the voice enhancement effect is improved.
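The arithmetic in this example follows directly from the residual relation H(x) = F(x) + x; a minimal check (the helper name `residual_branch` is illustrative only):

```python
def residual_branch(h: float, x: float) -> float:
    """In a residual block H(x) = F(x) + x, so the branch learns F(x) = H(x) - x."""
    return h - x

x = 5.0
f1 = residual_branch(5.1, x)      # H(x) = 5.1  ->  F(x) = 0.1
f2 = residual_branch(5.2, x)      # H(x) = 5.2  ->  F(x) = 0.2
relative_change = (f2 - f1) / f1  # F(x) grows from 0.1 to 0.2, i.e. by 100%
```

The residual branch carries only the small correction, so a roughly 2% change in H(x) appears as a 100% change in F(x), which is the amplification effect the example describes.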
In one implementation manner of the embodiment of the present invention, frequency domain feature extraction may be performed on the input mixed speech signal to obtain frequency domain speech features and the corresponding frequency domain speech data, so that speech enhancement processing is performed in the frequency domain according to the frequency domain speech features and the frequency domain speech data. Optionally, extracting features from the mixed voice signal to obtain the voice features and voice data of the target user may include: performing frequency domain feature extraction on the mixed voice signal to obtain the frequency domain voice features and frequency domain voice data of the target user. In this way, voice enhancement can be performed on the input mixed voice signal in the frequency domain through the pre-trained residual network model, according to the obtained frequency domain voice features and frequency domain voice data; that is, the voice enhancement task can be completed in the frequency domain.
The frequency domain voice data may be used to represent the noisy voice data in the frequency domain, and may include noise data in the frequency domain, target voice data, and the like. Optionally, generating a target voice signal based on the mapped voice data and the voice data includes: decoding the mapped voice data and the frequency domain voice data to obtain decoded voice data; and performing waveform reconstruction on the decoded voice data according to the frequency domain voice features to obtain the target voice signal. Specifically, after the frequency domain voice data is obtained, mapping processing can be performed on the frequency domain voice data according to the network weight information of each network layer, based on the frequency domain voice features, so as to remove the noise data in the frequency domain voice data and obtain mapped voice data. The mapped voice data and the frequency domain voice data can then be decoded to obtain the corresponding decoded voice data, and the time domain waveform corresponding to the decoded voice data can be reconstructed in combination with the voice features of the target user. In other words, waveform reconstruction is performed on the decoded voice data according to the extracted frequency domain voice features to generate the target voice signal in the time domain, so that the output target voice signal carries the voice features of the target user, which ensures the hearing experience after voice enhancement and improves the user experience.
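The patent does not give the frequency-domain transform or reconstruction formulas; below is a compact sketch of one plausible pipeline — STFT analysis, a mask standing in for the residual-network mapping, and overlap-add waveform reconstruction. All function names, the frame/hop sizes, and the toy binary mask are assumptions:

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    """Frequency domain feature extraction: windowed FFT frames of the signal."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.array([np.fft.rfft(f) for f in frames])

def istft(spec, n_fft=512, hop=256):
    """Waveform reconstruction: overlap-add of inverse-FFT frames."""
    win = np.hanning(n_fft)
    out = np.zeros(hop * (len(spec) - 1) + n_fft)
    norm = np.zeros_like(out)
    for i, frame in enumerate(spec):
        out[i * hop:i * hop + n_fft] += np.fft.irfft(frame, n_fft) * win
        norm[i * hop:i * hop + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)

def enhance(noisy, mask_fn):
    spec = stft(noisy)            # frequency domain voice data
    mask = mask_fn(np.abs(spec))  # stand-in for the residual-network mapping
    return istft(spec * mask)     # target voice signal in the time domain

t = np.arange(16000) / 16000.0
noisy = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.default_rng(0).standard_normal(16000)
enhanced = enhance(noisy, lambda mag: (mag > mag.mean()).astype(float))  # toy binary mask
```

In the patent's scheme the mask/mapping would come from the residual network model selected by the target user's voiceprint, rather than the toy threshold used here.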
Of course, based on the residual network model, the embodiment of the invention may also adopt other modes to perform voice enhancement on the mixed voice signal, such as performing the voice enhancement processing in the time domain. In an optional embodiment of the present invention, performing feature extraction on the mixed speech signal to obtain speech features and speech data includes: performing time domain feature extraction on the mixed voice signal to obtain time domain voice features and time domain voice data. In this way, voice enhancement can be performed on the input mixed voice signal in the time domain through a pre-trained residual network model, according to the time domain voice features and time domain voice data; that is, the voice enhancement task is completed in the time domain.
The time domain voice data may be used to represent the noisy voice data in the time domain, and may include noise data in the time domain, target voice data, and the like. Optionally, generating the target voice signal based on the mapped voice data and the voice data includes: generating the target voice signal using the mapped voice data and the time domain voice data. Specifically, after the time domain voice features are extracted, mapping processing can be performed on the time domain voice data according to the network weight information of each network layer, so as to remove the noise data in the time domain voice data and obtain mapped voice data; then, in combination with the time domain voice features, voice processing can be performed on the mapped voice data and the time domain voice data to generate a target voice signal corresponding to the target user, so that the target voice signal carries the voice features of the target user, ensuring the hearing experience after voice enhancement and improving the voice quality after voice enhancement.
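As a toy illustration of the time-domain variant, the sketch below uses a fixed filter as a stand-in for the learned mapping F(x) and adds its output back to the input waveform, mirroring H(x) = x + F(x). The kernel is chosen purely for illustration (it makes H(x) a 5-tap moving average) and is not taken from the patent:

```python
import numpy as np

def time_domain_enhance(noisy: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Time-domain sketch: a filter plays the role of the learned mapping F(x);
    the output is the residual combination H(x) = x + F(x)."""
    f_x = np.convolve(noisy, kernel, mode="same")  # stand-in for the network mapping F(x)
    return noisy + f_x                             # residual (cross-layer) addition

rng = np.random.default_rng(1)
noisy = rng.standard_normal(1024)
# Toy kernel: moving average minus identity, so x + F(x) smooths the waveform.
kernel = np.ones(5) / 5.0 - np.array([0.0, 0.0, 1.0, 0.0, 0.0])
enhanced = time_domain_enhance(noisy, kernel)
```

In the patent's scheme the mapping would be the trained residual network operating on time domain voice data, not a fixed kernel; the point of the sketch is only the residual combination of the mapped data with the input.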
Step 308, outputting according to the target voice signal.
In an alternative embodiment, outputting according to the target voice signal may include: outputting voice according to the target voice signal. Specifically, the embodiment of the invention can be applied to products used for voice conversation in noisy environments, such as a telephone watch in a voice call scene, so that the parties to the call hear only the clean voice of the main speaker they are attending to. For example, when a parent uses a telephone watch to call a child participating in an activity, the audio processing method provided by the embodiment of the invention can enable the parent to hear only the clear voice of their own child, reducing the influence of other children speaking, i.e., the influence of noise interference.
Of course, the embodiment of the present invention may also be applied to other scenarios, such as a speech input scenario, a speech recognition scenario, etc., which is not limited by the embodiment of the present invention.
In another alternative embodiment, outputting according to the target voice signal may include: performing voice recognition on the target voice signal to generate a recognition result; and outputting the identification result.
For example, the speech of the target speaker is the sentence in the first dashed box 41 in fig. 4: "Hello everyone, my name is Li XX, very happy to meet you."; and the noise is bird song, such as the "chirp chirp" in the second dashed box 42 in fig. 4. As shown in fig. 4, the speech spoken by the target speaker and the noise (i.e., the bird song) have a large number of overlapping portions on the time axis. At the beginning, the words "Hello everyone" spoken by the target speaker are not disturbed, because there is no bird song yet, so these words can be heard clearly; the later part, "my name is Li XX", is disturbed by the bird chirping, which may result in "my name is Li XX" not being heard clearly. At this time, if the audio processing method provided by the embodiment of the invention uses the end-to-end voice enhancement model, the interfering "chirp chirp" sounds can be removed, leaving only the target voice, namely "Hello everyone, my name is Li XX, very happy to meet you", so as to achieve the purpose of voice enhancement.
Then, the target voice signal after voice enhancement can be used for voice recognition, i.e., the clean voice of the target speaker is used for voice recognition so as to recognize what the target speaker said. For example, in combination with the above example, the target voice output by the voice enhancement model can be used for voice recognition, recognizing the sentence "Hello everyone, my name is Li XX, very happy to meet you", so that the voice recognition effect can be improved. Output can then be performed according to the recognition result, such as outputting the text "Hello everyone, my name is Li XX, very happy to meet you" corresponding to the recognized voice, or a personal photo of "Li XX", etc.
In summary, the embodiment of the invention can introduce the residual network structure into the voice enhancement task, thereby solving the problem of gradient disappearance in the voice enhancement task, further training to obtain the residual network model with deeper network depth, carrying out voice enhancement by adopting the residual network model, and improving the voice enhancement effect.
It should be noted that, for simplicity of description, the method embodiments are shown as a series of acts, but it should be understood by those skilled in the art that the embodiments are not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred embodiments, and that the acts are not necessarily required by the embodiments of the invention.
Referring to fig. 5, a block diagram of an embodiment of an audio processing apparatus according to the present invention is shown, and may specifically include the following modules:
A voice signal acquisition module 510 for acquiring an input mixed voice signal;
The voice enhancement module 520 is configured to perform voice enhancement on the mixed voice signal according to a pre-trained residual network model, so as to obtain a target voice signal;
The voice signal output module 530 is configured to output according to the target voice signal.
In an alternative embodiment of the present invention, the speech enhancement module 520 may include the following sub-modules:
The feature extraction sub-module is used for carrying out feature extraction on the mixed voice signal to obtain voice features and voice data of a target user, wherein the mixed voice signal comprises a noise signal and a voice signal of the target user;
And the noise reduction processing sub-module is used for carrying out noise reduction processing on the voice data through a pre-trained residual network model according to the voice characteristics to obtain a target voice signal corresponding to the target user.
In an embodiment of the present invention, optionally, the audio processing apparatus may further include a residual network model training module. The residual network model training module is used for training a residual network model corresponding to the voice characteristics in advance. The noise reduction processing submodule comprises the following units:
the residual network model determining unit is used for determining a residual network model corresponding to the target user according to the voice characteristics of the target user;
And the noise reduction processing unit is used for carrying out noise reduction processing on the voice data through the residual network model corresponding to the target user to obtain the target voice signal.
In an alternative embodiment of the present invention, the noise reduction processing unit may include the following sub-units:
A network weight information determining subunit, configured to determine network weight information corresponding to each network layer in the residual network model;
The mapping processing subunit is used for carrying out mapping processing on the voice data according to the network weight information corresponding to each network layer to obtain mapped voice data;
and the target voice signal generation subunit is used for generating a target voice signal based on the mapping voice data and the voice data.
In an optional embodiment of the present invention, the feature extraction submodule is specifically configured to perform frequency domain feature extraction on the mixed speech signal to obtain frequency domain speech features and frequency domain speech data of the target user. The target voice signal generation subunit is specifically configured to decode the mapped voice data and the frequency domain voice data to obtain decoded voice data; and carrying out waveform reconstruction on the decoded voice data according to the frequency domain voice characteristics to obtain a target voice signal.
In another optional embodiment of the present invention, the feature extraction submodule is specifically configured to perform time domain feature extraction on the mixed speech signal to obtain time domain speech features and time domain speech data of the target user. The target voice signal generation subunit is specifically configured to generate a target voice signal by using the mapped voice data and the time domain voice data.
In an alternative embodiment of the present invention, the residual network model training module may include the following sub-modules:
The noise adding submodule is used for adding a noise signal to an input voice signal and generating a voice signal with noise;
The characteristic extraction sub-module is used for carrying out characteristic extraction on the voice signal with noise to obtain voice characteristics corresponding to the voice signal with noise;
And the model training sub-module is used for carrying out model training according to the noise-carrying voice signal and the voice signal according to a preset residual error network structure, and generating a residual error network model corresponding to the voice characteristic.
In an alternative embodiment of the present invention, the voice signal output module 530 may include the following sub-modules:
The voice output sub-module is used for outputting voice according to the target voice signal; and/or
The voice recognition sub-module is used for carrying out voice recognition on the target voice signal and generating a recognition result; and outputting the identification result.
For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.
Fig. 6 is a block diagram illustrating an apparatus 600 for audio processing according to an exemplary embodiment. For example, device 600 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, server, or the like.
Referring to fig. 6, device 600 may include one or more of the following components: a processing component 602, a memory 604, a power component 606, a multimedia component 608, an audio component 610, an input/output (I/O) interface 612, a sensor component 614, and a communication component 616.
The processing component 602 generally controls overall operation of the device 600, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 602 may include one or more processors 620 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 602 can include one or more modules that facilitate interaction between the processing component 602 and other components. For example, the processing component 602 may include a multimedia module to facilitate interaction between the multimedia component 608 and the processing component 602.
The memory 604 is configured to store various types of data to support operations at the device 600. Examples of such data include instructions for any application or method operating on device 600, contact data, phonebook data, messages, pictures, videos, and the like. The memory 604 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power supply component 606 provides power to the various components of the device 600. The power components 606 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 600.
The multimedia component 608 includes a screen between the device 600 and the user that provides an output interface. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 608 includes a front camera and/or a rear camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 600 is in an operational mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component 610 is configured to output and/or input audio signals. For example, the audio component 610 includes a Microphone (MIC) configured to receive external audio signals when the device 600 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 604 or transmitted via the communication component 616. In some embodiments, audio component 610 further includes a speaker for outputting audio signals.
The I/O interface 612 provides an interface between the processing component 602 and peripheral interface modules, which may be a keyboard, click wheel, buttons, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly 614 includes one or more sensors for providing status assessment of various aspects of the device 600. For example, the sensor assembly 614 may detect the on/off state of the device 600, the relative positioning of the components, such as the display and keypad of the device 600, the sensor assembly 614 may also detect a change in position of the device 600 or a component of the device 600, the presence or absence of user contact with the device 600, the orientation or acceleration/deceleration of the device 600, and a change in temperature of the device 600. The sensor assembly 614 may include a proximity sensor configured to detect the presence of nearby objects in the absence of any physical contact. The sensor assembly 614 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 614 may also include an acceleration sensor, a gyroscopic sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 616 is configured to facilitate communication between the device 600 and other devices, either wired or wireless. The device 600 may access a wireless network based on a communication standard, such as WiFi,2G or 3G, or a combination thereof. In one exemplary embodiment, the communication component 616 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 616 further includes a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, ultra Wideband (UWB) technology, bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 600 may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 604, including instructions executable by processor 620 of device 600 to perform the above-described method. For example, the non-transitory computer readable storage medium may be ROM, random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
A non-transitory computer readable storage medium, which when executed by a processor of a device, causes the device to perform an audio processing method, the method comprising: acquiring an input mixed voice signal; performing voice enhancement on the mixed voice signal according to a pre-trained residual error network model to obtain a target voice signal; and outputting according to the target voice signal.
Fig. 7 is a schematic structural diagram of an apparatus according to an embodiment of the present invention. The device 700 may vary considerably in configuration or performance, and may include one or more central processing units (CPUs) 722 (e.g., one or more processors), memory 732, and one or more storage media 730 (e.g., one or more mass storage devices) storing applications 742 or data 744. The memory 732 and the storage medium 730 may be transitory or persistent storage. The program stored on the storage medium 730 may include one or more modules (not shown), each of which may include a series of instruction operations for the device. Further, the central processing unit 722 may be configured to communicate with the storage medium 730 and execute, on the device 700, the series of instruction operations in the storage medium 730.
The device 700 may also include one or more power supplies 726, one or more wired or wireless network interfaces 750, one or more input/output interfaces 758, one or more keyboards 756, and/or one or more operating systems 741 such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
In an exemplary embodiment, an apparatus includes a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, and include instructions for: acquiring an input mixed voice signal; performing voice enhancement on the mixed voice signal according to a pre-trained residual network model to obtain a target voice signal; and outputting according to the target voice signal.
Optionally, performing voice enhancement on the mixed voice signal according to a pre-trained residual network model to obtain a target voice signal includes: performing feature extraction on the mixed voice signal to obtain voice features and voice data of a target user, wherein the mixed voice signal includes a noise signal and the voice signal of the target user; and, according to the voice features, performing noise reduction processing on the voice data through the pre-trained residual network model to obtain a target voice signal corresponding to the target user.
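The two-step flow above (feature extraction, then noise reduction conditioned on the target user's voice features) can be sketched as follows. This is a minimal illustration, not the patented implementation; `extract_features` and `denoise_model` are hypothetical placeholders for the unspecified feature extractor and pre-trained residual network model:

```python
import numpy as np

def enhance(mixed, extract_features, denoise_model):
    """Sketch of the two-step enhancement described above.

    extract_features and denoise_model are stand-ins: the patent does
    not fix the concrete feature extractor or the residual network
    model, only the order of the two steps.
    """
    # Step 1: feature extraction yields the target user's voice
    # features (e.g. a speaker characterization) and the voice data.
    voice_features, voice_data = extract_features(mixed)
    # Step 2: denoise the voice data, conditioned on those features.
    return denoise_model(voice_features, voice_data)
```

Any callable pair with these shapes can be plugged in; the sketch only fixes the data flow between the two stages.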
Optionally, execution of the one or more programs by the one or more processors includes instructions for: pre-training a residual network model corresponding to the voice features. Performing noise reduction processing on the voice data through a pre-trained residual network model according to the voice features to obtain a target voice signal corresponding to the target user includes: determining a residual network model corresponding to the target user according to the voice features of the target user; and performing noise reduction processing on the voice data through the residual network model corresponding to the target user to obtain the target voice signal.
Optionally, performing noise reduction processing on the voice data through the residual network model corresponding to the target user to obtain the target voice signal includes: determining network weight information corresponding to each network layer in the residual network model; mapping the voice data according to the network weight information corresponding to each network layer to obtain mapped voice data; and generating a target voice signal based on the mapped voice data and the voice data.
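The mapping-then-combine step above resembles a residual (skip) connection: the voice data is mapped through each weighted layer, and the target signal is formed from the mapped data together with the original input. A toy numeric sketch, under the assumption of plain dense layers with ReLU (the patent does not specify the layer type or the exact combination rule):

```python
import numpy as np

def residual_denoise(voice_data, layer_weights):
    """Toy sketch of the layer-wise mapping plus residual combination.

    layer_weights is a list of (W, b) pairs, one per network layer;
    dense layers with ReLU are an assumption for illustration only.
    """
    x = voice_data
    for W, b in layer_weights:
        # Map the data through one weighted network layer.
        x = np.maximum(0.0, W @ x + b)
    # Combine the mapped voice data with the original voice data
    # (the skip path), mirroring the cross-layer connection.
    return x + voice_data
```

With identity weights the skip path simply doubles the input, which makes the role of the final combination easy to see.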
Optionally, performing feature extraction on the mixed voice signal to obtain the voice features and voice data of the target user includes: performing frequency-domain feature extraction on the mixed voice signal to obtain frequency-domain voice features and frequency-domain voice data of the target user. Generating a target voice signal based on the mapped voice data and the voice data includes: decoding the mapped voice data and the frequency-domain voice data to obtain decoded voice data; and performing waveform reconstruction on the decoded voice data according to the frequency-domain voice features to obtain the target voice signal.
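One common way to realize the frequency-domain path above — frequency-domain feature extraction, mapping/decoding, then waveform reconstruction — is a framed FFT with overlap-add resynthesis. The transform choice, window, and the `mask_fn` stand-in for the residual-network mapping and decoding are assumptions for illustration, not details fixed by the patent:

```python
import numpy as np

def freq_domain_enhance(mixed, mask_fn, n_fft=256, hop=128):
    """Sketch of the frequency-domain variant described above.

    mask_fn is a placeholder mapping applied to the magnitude
    spectrogram; the noisy phase is kept, and the waveform is
    reconstructed by overlap-add of inverse FFTs.
    """
    # Frequency-domain feature extraction: frame, window, FFT.
    frames = [mixed[i:i + n_fft] for i in range(0, len(mixed) - n_fft + 1, hop)]
    spec = np.fft.rfft(np.stack(frames) * np.hanning(n_fft), axis=1)
    mag, phase = np.abs(spec), np.angle(spec)
    # "Decoding": map the magnitude, recombine with the phase feature.
    enhanced = mask_fn(mag) * np.exp(1j * phase)
    # Waveform reconstruction by overlap-add of inverse FFT frames.
    out = np.zeros(len(mixed))
    for k, frame in enumerate(np.fft.irfft(enhanced, n=n_fft, axis=1)):
        out[k * hop:k * hop + n_fft] += frame
    return out
```

With an identity mask the middle of the signal is recovered up to the windowing gain, which is a quick sanity check that the analysis/resynthesis pair is consistent.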
Optionally, performing feature extraction on the mixed voice signal to obtain the voice features and voice data of the target user includes: performing time-domain feature extraction on the mixed voice signal to obtain time-domain voice features and time-domain voice data of the target user. Generating a target voice signal based on the mapped voice data and the voice data includes: generating the target voice signal from the mapped voice data and the time-domain voice data.
Optionally, training the residual network model corresponding to the voice features includes: adding a noise signal to an input voice signal to generate a noisy voice signal; performing feature extraction on the noisy voice signal to obtain voice features corresponding to the noisy voice signal; and performing model training on the noisy voice signal and the voice signal based on a preset residual network structure to generate a residual network model corresponding to the voice features.
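The first training step above (adding a noise signal to a clean input voice signal to generate a noisy voice signal) is often performed at a controlled signal-to-noise ratio. The SNR scaling below is an assumption for illustration; the patent only states that a noise signal is added:

```python
import numpy as np

def make_noisy(clean, noise, snr_db):
    """Mix a clean voice signal with noise at a target SNR (in dB).

    The target-SNR scaling is an assumed convention, not a detail
    taken from the patent.
    """
    noise = noise[:len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that 10*log10(clean_power / scaled_noise_power)
    # equals snr_db exactly.
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10.0)))
    return clean + scale * noise
```

Such (noisy, clean) pairs then serve as the input/target examples for training the model on a preset residual network structure.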
Optionally, outputting according to the target voice signal includes: performing voice output according to the target voice signal; and/or performing voice recognition on the target voice signal to generate a recognition result, and outputting the recognition result.
In this specification, the embodiments are described in a progressive manner, each embodiment focusing on its differences from the others; for identical or similar parts, reference may be made between the embodiments.
It will be apparent to those skilled in the art that embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the invention may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or terminal device. Without further limitation, an element preceded by the phrase "comprising a" does not exclude the presence of other like elements in the process, method, article, or terminal device that comprises the element.
The foregoing has described in detail an audio processing method, apparatus, device, and readable storage medium. Specific examples are set forth herein to illustrate the principles and embodiments of the present invention, and the above examples are provided only to assist in understanding the method and its core concepts. Meanwhile, those skilled in the art may make changes to the specific embodiments and the application scope in accordance with the ideas of the present invention; in view of the above, this description should not be construed as limiting the present invention.
Claims (22)
1. An audio processing method, comprising:
acquiring an input mixed voice signal;
performing voice enhancement on the mixed voice signal according to a pre-trained residual network model to obtain a target voice signal; the residual network model comprises at least three network layers, and in the residual network model training process, the output result of each network layer is input to the next network layer, or is input to other network layers in a cross-layer manner; and
outputting according to the target voice signal;
wherein performing voice enhancement on the mixed voice signal according to a pre-trained residual network model to obtain a target voice signal comprises:
performing feature extraction on the mixed voice signal to obtain voice features and voice data of a target user, wherein the mixed voice signal comprises a noise signal and the voice signal of the target user; and
according to the voice features, performing noise reduction processing on the voice data through the pre-trained residual network model to obtain a target voice signal corresponding to the target user.
2. The method as recited in claim 1, further comprising:
pre-training a residual network model corresponding to the voice features;
wherein performing noise reduction processing on the voice data through a pre-trained residual network model according to the voice features to obtain a target voice signal corresponding to the target user comprises:
determining a residual network model corresponding to the target user according to the voice features of the target user; and
performing noise reduction processing on the voice data through the residual network model corresponding to the target user to obtain the target voice signal.
3. The method according to claim 2, wherein performing noise reduction processing on the voice data through the residual network model corresponding to the target user to obtain the target voice signal comprises:
determining network weight information corresponding to each network layer in the residual network model;
mapping the voice data according to the network weight information corresponding to each network layer to obtain mapped voice data; and
generating a target voice signal based on the mapped voice data and the voice data.
4. The method of claim 3, wherein:
performing feature extraction on the mixed voice signal to obtain the voice features and voice data of the target user comprises: performing frequency-domain feature extraction on the mixed voice signal to obtain frequency-domain voice features and frequency-domain voice data of the target user; and
generating a target voice signal based on the mapped voice data and the voice data comprises: decoding the mapped voice data and the frequency-domain voice data to obtain decoded voice data; and performing waveform reconstruction on the decoded voice data according to the frequency-domain voice features to obtain the target voice signal.
5. The method of claim 3, wherein:
performing feature extraction on the mixed voice signal to obtain the voice features and voice data of the target user comprises: performing time-domain feature extraction on the mixed voice signal to obtain time-domain voice features and time-domain voice data of the target user; and
generating a target voice signal based on the mapped voice data and the voice data comprises: generating the target voice signal from the mapped voice data and the time-domain voice data.
6. The method according to any one of claims 2 to 5, wherein training the residual network model corresponding to the voice features comprises:
adding a noise signal to an input voice signal to generate a noisy voice signal;
performing feature extraction on the noisy voice signal to obtain voice features corresponding to the noisy voice signal; and
performing model training on the noisy voice signal and the voice signal based on a preset residual network structure to generate a residual network model corresponding to the voice features.
7. The method according to any one of claims 1 to 5, wherein outputting according to the target voice signal comprises:
performing voice output according to the target voice signal; and/or
performing voice recognition on the target voice signal to generate a recognition result, and outputting the recognition result.
8. An audio processing apparatus, comprising:
a voice signal acquisition module, configured to acquire an input mixed voice signal;
a voice enhancement module, configured to perform voice enhancement on the mixed voice signal according to a pre-trained residual network model to obtain a target voice signal; the residual network model comprises at least three network layers, and in the residual network model training process, the output result of each network layer is input to the next network layer, or is input to other network layers in a cross-layer manner; and
a voice signal output module, configured to output according to the target voice signal;
wherein the voice enhancement module comprises:
a feature extraction sub-module, configured to perform feature extraction on the mixed voice signal to obtain voice features and voice data of a target user, wherein the mixed voice signal comprises a noise signal and the voice signal of the target user; and
a noise reduction processing sub-module, configured to perform noise reduction processing on the voice data through the pre-trained residual network model according to the voice features to obtain a target voice signal corresponding to the target user.
9. The apparatus as recited in claim 8, further comprising:
a residual network model training module, configured to pre-train a residual network model corresponding to the voice features;
wherein the noise reduction processing sub-module comprises: a residual network model determining unit and a noise reduction processing unit;
the residual network model determining unit is configured to determine a residual network model corresponding to the target user according to the voice features of the target user; and
the noise reduction processing unit is configured to perform noise reduction processing on the voice data through the residual network model corresponding to the target user to obtain the target voice signal.
10. The apparatus of claim 9, wherein the noise reduction processing unit comprises:
a network weight information determining subunit, configured to determine network weight information corresponding to each network layer in the residual network model;
a mapping processing subunit, configured to map the voice data according to the network weight information corresponding to each network layer to obtain mapped voice data; and
a target voice signal generation subunit, configured to generate a target voice signal based on the mapped voice data and the voice data.
11. The apparatus of claim 10, wherein:
the feature extraction sub-module is specifically configured to perform frequency-domain feature extraction on the mixed voice signal to obtain frequency-domain voice features and frequency-domain voice data of the target user; and
the target voice signal generation subunit is specifically configured to decode the mapped voice data and the frequency-domain voice data to obtain decoded voice data, and to perform waveform reconstruction on the decoded voice data according to the frequency-domain voice features to obtain the target voice signal.
12. The apparatus of claim 10, wherein:
the feature extraction sub-module is specifically configured to perform time-domain feature extraction on the mixed voice signal to obtain time-domain voice features and time-domain voice data of the target user; and
the target voice signal generation subunit is specifically configured to generate the target voice signal from the mapped voice data and the time-domain voice data.
13. The apparatus according to any one of claims 9 to 12, wherein the residual network model training module comprises:
a noise adding sub-module, configured to add a noise signal to an input voice signal to generate a noisy voice signal;
a feature extraction sub-module, configured to perform feature extraction on the noisy voice signal to obtain voice features corresponding to the noisy voice signal; and
a model training sub-module, configured to perform model training on the noisy voice signal and the voice signal based on a preset residual network structure to generate a residual network model corresponding to the voice features.
14. The apparatus of any one of claims 8 to 12, wherein the voice signal output module comprises:
a voice output sub-module, configured to perform voice output according to the target voice signal; and/or
a voice recognition sub-module, configured to perform voice recognition on the target voice signal, generate a recognition result, and output the recognition result.
15. An apparatus comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for:
acquiring an input mixed voice signal;
performing voice enhancement on the mixed voice signal according to a pre-trained residual network model to obtain a target voice signal; the residual network model comprises at least three network layers, and in the residual network model training process, the output result of each network layer is input to the next network layer, or is input to other network layers in a cross-layer manner; and
outputting according to the target voice signal;
wherein performing voice enhancement on the mixed voice signal according to a pre-trained residual network model to obtain a target voice signal comprises:
performing feature extraction on the mixed voice signal to obtain voice features and voice data of a target user, wherein the mixed voice signal comprises a noise signal and the voice signal of the target user; and
according to the voice features, performing noise reduction processing on the voice data through the pre-trained residual network model to obtain a target voice signal corresponding to the target user.
16. The apparatus of claim 15, wherein execution of the one or more programs by the one or more processors further comprises instructions for:
pre-training a residual network model corresponding to the voice features;
wherein performing noise reduction processing on the voice data through a pre-trained residual network model according to the voice features to obtain a target voice signal corresponding to the target user comprises: determining a residual network model corresponding to the target user according to the voice features of the target user; and performing noise reduction processing on the voice data through the residual network model corresponding to the target user to obtain the target voice signal.
17. The apparatus of claim 16, wherein performing noise reduction processing on the voice data through the residual network model corresponding to the target user to obtain the target voice signal comprises:
determining network weight information corresponding to each network layer in the residual network model;
mapping the voice data according to the network weight information corresponding to each network layer to obtain mapped voice data; and
generating a target voice signal based on the mapped voice data and the voice data.
18. The apparatus of claim 17, wherein:
performing feature extraction on the mixed voice signal to obtain the voice features and voice data of the target user comprises: performing frequency-domain feature extraction on the mixed voice signal to obtain frequency-domain voice features and frequency-domain voice data of the target user; and
generating a target voice signal based on the mapped voice data and the voice data comprises: decoding the mapped voice data and the frequency-domain voice data to obtain decoded voice data; and performing waveform reconstruction on the decoded voice data according to the frequency-domain voice features to obtain the target voice signal.
19. The apparatus of claim 17, wherein:
performing feature extraction on the mixed voice signal to obtain the voice features and voice data of the target user comprises: performing time-domain feature extraction on the mixed voice signal to obtain time-domain voice features and time-domain voice data of the target user; and
generating a target voice signal based on the mapped voice data and the voice data comprises: generating the target voice signal from the mapped voice data and the time-domain voice data.
20. The apparatus according to any one of claims 16 to 19, wherein training the residual network model corresponding to the voice features comprises:
adding a noise signal to an input voice signal to generate a noisy voice signal;
performing feature extraction on the noisy voice signal to obtain voice features corresponding to the noisy voice signal; and
performing model training on the noisy voice signal and the voice signal based on a preset residual network structure to generate a residual network model corresponding to the voice features.
21. The apparatus according to any one of claims 15 to 19, wherein outputting according to the target voice signal comprises:
performing voice output according to the target voice signal; and/or
performing voice recognition on the target voice signal to generate a recognition result, and outputting the recognition result.
22. A readable storage medium, wherein instructions in the storage medium, when executed by a processor of a device, enable the device to perform the audio processing method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810481272.6A CN110503968B (en) | 2018-05-18 | 2018-05-18 | Audio processing method, device, equipment and readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110503968A CN110503968A (en) | 2019-11-26 |
CN110503968B true CN110503968B (en) | 2024-06-04 |
Family
ID=68583983
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810481272.6A Active CN110503968B (en) | 2018-05-18 | 2018-05-18 | Audio processing method, device, equipment and readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110503968B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111081223B (en) * | 2019-12-31 | 2023-10-13 | 广州市百果园信息技术有限公司 | Voice recognition method, device, equipment and storage medium |
CN113409803B (en) * | 2020-11-06 | 2024-01-23 | 腾讯科技(深圳)有限公司 | Voice signal processing method, device, storage medium and equipment |
CN112820300B (en) * | 2021-02-25 | 2023-12-19 | 北京小米松果电子有限公司 | Audio processing method and device, terminal and storage medium |
CN112992168B (en) * | 2021-02-26 | 2024-04-19 | 平安科技(深圳)有限公司 | Speech noise reducer training method, device, computer equipment and storage medium |
WO2022253003A1 (en) * | 2021-05-31 | 2022-12-08 | 华为技术有限公司 | Speech enhancement method and related device |
CN113611318A (en) * | 2021-06-29 | 2021-11-05 | 华为技术有限公司 | Audio data enhancement method and related equipment |
CN118447866A (en) * | 2023-09-13 | 2024-08-06 | 荣耀终端有限公司 | Audio processing method and electronic equipment |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102811310A (en) * | 2011-12-08 | 2012-12-05 | 苏州科达科技有限公司 | Method and system for controlling voice echo cancellation on network video camera |
CN104538028A (en) * | 2014-12-25 | 2015-04-22 | 清华大学 | Continuous voice recognition method based on deep long and short term memory recurrent neural network |
CN106779050A (en) * | 2016-11-24 | 2017-05-31 | 厦门中控生物识别信息技术有限公司 | The optimization method and device of a kind of convolutional neural networks |
CN106887225A (en) * | 2017-03-21 | 2017-06-23 | 百度在线网络技术(北京)有限公司 | Acoustic feature extracting method, device and terminal device based on convolutional neural networks |
CN107180628A (en) * | 2017-05-19 | 2017-09-19 | 百度在线网络技术(北京)有限公司 | Set up the method, the method for extracting acoustic feature, device of acoustic feature extraction model |
CN107274906A (en) * | 2017-06-28 | 2017-10-20 | 百度在线网络技术(北京)有限公司 | Voice information processing method, device, terminal and storage medium |
CN107293288A (en) * | 2017-06-09 | 2017-10-24 | 清华大学 | A kind of residual error shot and long term remembers the acoustic model modeling method of Recognition with Recurrent Neural Network |
CN107578775A (en) * | 2017-09-07 | 2018-01-12 | 四川大学 | A kind of multitask method of speech classification based on deep neural network |
CN107818779A (en) * | 2017-09-15 | 2018-03-20 | 北京理工大学 | A kind of infant's crying sound detection method, apparatus, equipment and medium |
CN108010515A (en) * | 2017-11-21 | 2018-05-08 | 清华大学 | A kind of speech terminals detection and awakening method and device |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8289870B2 (en) * | 2009-09-23 | 2012-10-16 | Avaya Inc. | Priority-based, dynamic optimization of utilized bandwidth |
US20150142446A1 (en) * | 2013-11-21 | 2015-05-21 | Global Analytics, Inc. | Credit Risk Decision Management System And Method Using Voice Analytics |
Non-Patent Citations (1)
Title |
---|
A survey of deep reinforcement learning: with a discussion of the development of computer Go; Zhao Dongbin; Shao Kun; Zhu Yuanheng; Li Dong; Chen Yaran; Wang Haitao; Liu Derong; Zhou Tong; Wang Chenghong; Control Theory & Applications; 2016-06-15 (06); full text *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110503968B (en) | Audio processing method, device, equipment and readable storage medium | |
CN109801644B (en) | Separation method, separation device, electronic equipment and readable medium for mixed sound signal | |
CN108198569B (en) | Audio processing method, device and equipment and readable storage medium | |
CN110232909B (en) | Audio processing method, device, equipment and readable storage medium | |
CN113362812B (en) | Voice recognition method and device and electronic equipment | |
CN113409764B (en) | Speech synthesis method and device for speech synthesis | |
CN109887515B (en) | Audio processing method and device, electronic equipment and storage medium | |
CN111583944A (en) | Sound changing method and device | |
CN110931028B (en) | Voice processing method and device and electronic equipment | |
CN115273831A (en) | Voice conversion model training method, voice conversion method and device | |
CN111696538A (en) | Voice processing method, apparatus and medium | |
CN110970015B (en) | Voice processing method and device and electronic equipment | |
CN104851423B (en) | Sound information processing method and device | |
CN110580910B (en) | Audio processing method, device, equipment and readable storage medium | |
CN111667842B (en) | Audio signal processing method and device | |
CN113810828A (en) | Audio signal processing method and device, readable storage medium and earphone | |
CN112002313A (en) | Interaction method and device, sound box, electronic equipment and storage medium | |
CN114356068B (en) | Data processing method and device and electronic equipment | |
CN113409765B (en) | Speech synthesis method and device for speech synthesis | |
CN111696566B (en) | Voice processing method, device and medium | |
CN108364631B (en) | Speech synthesis method and device | |
CN111524505A (en) | Voice processing method and device and electronic equipment | |
CN114582327B (en) | Speech recognition model training method, speech recognition method and electronic equipment | |
CN112116916B (en) | Method, device, medium and equipment for determining performance parameters of voice enhancement algorithm | |
CN113345451B (en) | Sound changing method and device and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
TA01 | Transfer of patent application right | |
Effective date of registration: 20220720 Address after: 100084. Room 9, floor 01, cyber building, building 9, building 1, Zhongguancun East Road, Haidian District, Beijing Applicant after: BEIJING SOGOU TECHNOLOGY DEVELOPMENT Co.,Ltd. Address before: 100084. Room 9, floor 01, cyber building, building 9, building 1, Zhongguancun East Road, Haidian District, Beijing Applicant before: BEIJING SOGOU TECHNOLOGY DEVELOPMENT Co.,Ltd. Applicant before: SOGOU (HANGZHOU) INTELLIGENT TECHNOLOGY Co.,Ltd. |
GR01 | Patent grant | |
TG01 | Patent term adjustment |