CN117012221A - Audio noise reduction method, computer device and storage medium - Google Patents


Info

Publication number
CN117012221A
CN117012221A (application No. CN202311130009.XA)
Authority
CN
China
Prior art keywords
noise
audio
reduced
sub
audio data
Prior art date
Legal status (assumption, not a legal conclusion)
Pending
Application number
CN202311130009.XA
Other languages
Chinese (zh)
Inventor
陈洲旋
Current Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202311130009.XA
Publication of CN117012221A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208: Noise filtering
    • G10L 21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L 21/0224: Processing in the time domain
    • G10L 21/0232: Processing in the frequency domain
    • G10L 19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/02: Speech or audio signals analysis-synthesis techniques using spectral analysis, e.g. transform vocoders or subband vocoders

Abstract

The application relates to an audio noise reduction method, a computer device, a storage medium and a computer program product. The method comprises the following steps: performing frequency division processing on audio data to be noise-reduced to obtain sub-audio data to be noise-reduced; acquiring the spectral features of the sub-audio data to be noise-reduced; performing noise reduction processing on the spectral features through a pre-trained audio noise reduction model to obtain noise-reduced spectral features of the sub-audio data; performing signal reconstruction processing on the noise-reduced spectral features to obtain noise-reduced sub-audio data corresponding to the sub-audio data to be noise-reduced; and combining the noise-reduced sub-audio data to obtain noise-reduced audio data corresponding to the audio data to be noise-reduced. The method can improve the audio noise reduction effect.

Description

Audio noise reduction method, computer device and storage medium
Technical Field
The present application relates to the field of audio processing technology, and in particular, to an audio noise reduction method, a computer device, a storage medium, and a computer program product.
Background
With the development of internet technology, song recording applications have emerged one after another. However, owing to non-professional equipment and recording environments, singing voice recorded by a song recording application is easily mixed with noise, such as microphone fricatives and environmental background noise. It is therefore important to reduce noise in such audio data.
In the conventional technology, noise reduction of audio data mainly relies on digital signal processing methods, such as statistical signal processing, to extract clean audio data from the audio data to be noise-reduced. However, such methods achieve a certain noise reduction effect only for stationary noise; for more complex and variable background noise (such as non-stationary noise), the result is far from ideal, leading to a poor overall audio noise reduction effect.
Disclosure of Invention
In view of the foregoing, it is desirable to provide an audio noise reduction method, a computer device, a computer-readable storage medium, and a computer program product that are capable of improving the audio noise reduction effect.
In a first aspect, the present application provides an audio noise reduction method. The method comprises the following steps:
performing frequency division processing on the audio data to be noise reduced to obtain sub audio data to be noise reduced of the audio data to be noise reduced;
acquiring the frequency spectrum characteristics of the sub-audio data to be noise reduced;
carrying out noise reduction treatment on the frequency spectrum characteristics through a pre-trained audio noise reduction model to obtain noise-reduced frequency spectrum characteristics of the sub-audio data to be noise-reduced;
performing signal reconstruction processing on the noise-reduced frequency spectrum characteristics to obtain noise-reduced sub-audio data corresponding to the sub-audio data to be noise-reduced;
and combining the noise-reduced sub-audio data to obtain noise-reduced audio data corresponding to the audio data to be noise-reduced.
In one embodiment, the noise reduction processing is performed on the spectral features through a pre-trained audio noise reduction model to obtain noise-reduced spectral features of the sub-audio data to be noise reduced, including:
performing feature extraction processing on the frequency spectrum features through a pre-trained audio noise reduction model to obtain processed frequency spectrum features of the sub-audio data to be noise reduced;
performing frequency-domain-related noise reduction processing on the processed frequency spectrum characteristics to obtain first noise-reduced frequency spectrum characteristics of the sub-audio data to be noise-reduced;
performing time-related noise reduction processing on the first noise-reduced frequency spectrum characteristics to obtain second noise-reduced frequency spectrum characteristics of the sub-audio data to be noise-reduced;
and carrying out feature extraction processing on the second noise-reduced frequency spectrum features to obtain the noise-reduced frequency spectrum features of the sub-audio data to be noise-reduced.
In one embodiment, the pre-trained audio noise reduction model includes a first convolution layer, a first hole convolution network, a second hole convolution network, and a second convolution layer that are sequentially connected, where the first hole convolution network and the second hole convolution network each include a plurality of sequentially connected sub-hole convolution networks, and each sub-hole convolution network includes a hole convolution layer, a normalization layer, and an activation layer that are sequentially connected.
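The hole (dilated) convolution structure described above can be illustrated with a minimal sketch. This is a hypothetical toy in plain numpy: the patent does not specify kernel sizes, dilation rates, or channel counts, so the values below are assumptions made purely to show how stacked dilations enlarge the receptive field.

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """Causal 1-D dilated (hole) convolution with left zero-padding."""
    k = len(kernel)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([
        sum(kernel[j] * xp[i + j * dilation] for j in range(k))
        for i in range(len(x))
    ])

# Stacking layers with dilations 1, 2, 4 (as a sub-hole convolution
# network might) grows the receptive field exponentially with depth.
x = np.random.randn(16)
h = dilated_conv1d(x, np.array([0.5, 0.5]), dilation=1)
h = dilated_conv1d(h, np.array([0.5, 0.5]), dilation=2)
h = dilated_conv1d(h, np.array([0.5, 0.5]), dilation=4)
```

In the patent's model, each hole convolution layer is followed by a normalization layer and an activation layer; those are omitted here for brevity.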
In one embodiment, the obtaining the spectral feature of the sub-audio data to be noise reduced includes:
carrying out framing treatment on the sub-audio data to be noise reduced to obtain an audio frame to be noise reduced of the sub-audio data to be noise reduced;
performing feature extraction processing on the audio frame to be noise-reduced to obtain the frequency spectrum features of the audio frame to be noise-reduced;
and combining the frequency spectrum characteristics of the audio frame to be noise reduced to obtain the frequency spectrum characteristics of the sub audio data to be noise reduced.
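The framing step above can be sketched as follows; the 512-sample frame length and 50% overlap are illustrative assumptions only, as the patent does not fix these parameters.

```python
import numpy as np

def frame_signal(x, frame_len=512, hop=256):
    """Split a waveform into overlapping frames (here 50% overlap)."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

frames = frame_signal(np.arange(2048, dtype=float))
# frames.shape == (7, 512); spectral features are then extracted per frame
```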
In one embodiment, the performing signal reconstruction processing on the denoised spectrum feature to obtain denoised sub-audio data corresponding to the sub-audio data to be denoised, includes:
from the denoised spectral features, confirming denoised spectral features of the audio frame to be denoised;
performing signal reconstruction processing on the noise-reduced frequency spectrum characteristics of the audio frame to be noise-reduced to obtain the audio frame after noise reduction of the audio frame to be noise-reduced;
and combining the noise-reduced audio frames to obtain noise-reduced sub-audio data corresponding to the sub-audio data to be noise-reduced.
In one embodiment, the pre-trained audio noise reduction model is obtained by:
acquiring a clean audio sample and a noisy audio sample corresponding to the clean audio sample;
inputting the spectral characteristics of the noisy sub-audio samples of the noisy audio sample into an audio noise reduction model to be trained, to obtain the noise-reduced spectral characteristics of the noisy sub-audio samples;
performing signal reconstruction processing on the noise-reduced spectral characteristics to obtain noise-reduced sub-audio samples corresponding to the noisy sub-audio samples;
combining the noise-reduced sub-audio samples to obtain a noise-reduced audio sample corresponding to the noisy audio sample;
and training the audio noise reduction model to be trained according to the difference between the spectral characteristics of the noise-reduced audio sample and the spectral characteristics of the clean audio sample, to obtain the pre-trained audio noise reduction model.
In one embodiment, the training the audio noise reduction model to be trained according to the difference between the spectral features of the noise reduced audio sample and the spectral features of the clean audio sample to obtain the pre-trained audio noise reduction model includes:
obtaining a loss value according to the difference between the spectral characteristics of the noise-reduced audio sample and the spectral characteristics of the clean audio sample;
training the audio noise reduction model to be trained according to the loss value until a training ending condition is reached;
and confirming the trained audio noise reduction model reaching the training ending condition as the pre-trained audio noise reduction model.
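The loss computation above can be sketched as below. The patent does not name a specific loss function; mean squared error between spectral features is shown here as one plausible choice, not the patent's own formulation.

```python
import numpy as np

def spectral_mse_loss(denoised_spec, clean_spec):
    """Mean squared error between two spectral feature arrays."""
    return float(np.mean(np.abs(denoised_spec - clean_spec) ** 2))

# The 'training ending condition' could be, e.g., the loss falling below
# a threshold or a fixed number of iterations being reached.
loss = spectral_mse_loss(np.array([1.0, 2.0]), np.array([1.0, 0.0]))
# loss == 2.0
```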
In one embodiment, the noisy audio samples corresponding to the clean audio samples are obtained by:
acquiring a noise audio sample and a preset signal-to-noise ratio;
and fusing, according to the preset signal-to-noise ratio, the clean audio sample with the noise audio sample to obtain the noisy audio sample corresponding to the clean audio sample.
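Fusion at a preset signal-to-noise ratio can be sketched as follows. This is a minimal illustration: the function name, signal lengths, and the 10 dB target are assumptions, not values taken from the patent.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so the mixture has the preset SNR (in dB),
    then add it to the clean signal."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    target_p_noise = p_clean / (10 ** (snr_db / 10.0))
    return clean + noise * np.sqrt(target_p_noise / p_noise)

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)   # stands in for a clean audio sample
noise = rng.standard_normal(16000)   # stands in for a noise audio sample
noisy = mix_at_snr(clean, noise, snr_db=10.0)
```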
In a second aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor which, when executing the computer program, performs the following steps:
performing frequency division processing on the audio data to be noise reduced to obtain sub audio data to be noise reduced of the audio data to be noise reduced;
acquiring the frequency spectrum characteristics of the sub-audio data to be noise reduced;
carrying out noise reduction treatment on the frequency spectrum characteristics through a pre-trained audio noise reduction model to obtain noise-reduced frequency spectrum characteristics of the sub-audio data to be noise-reduced;
performing signal reconstruction processing on the noise-reduced frequency spectrum characteristics to obtain noise-reduced sub-audio data corresponding to the sub-audio data to be noise-reduced;
and combining the noise-reduced sub-audio data to obtain noise-reduced audio data corresponding to the audio data to be noise-reduced.
In a third aspect, the present application also provides a computer-readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when executed by a processor, performs the following steps:
performing frequency division processing on the audio data to be noise reduced to obtain sub audio data to be noise reduced of the audio data to be noise reduced;
acquiring the frequency spectrum characteristics of the sub-audio data to be noise reduced;
carrying out noise reduction treatment on the frequency spectrum characteristics through a pre-trained audio noise reduction model to obtain noise-reduced frequency spectrum characteristics of the sub-audio data to be noise-reduced;
performing signal reconstruction processing on the noise-reduced frequency spectrum characteristics to obtain noise-reduced sub-audio data corresponding to the sub-audio data to be noise-reduced;
and combining the noise-reduced sub-audio data to obtain noise-reduced audio data corresponding to the audio data to be noise-reduced.
In a fourth aspect, the application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the following steps:
performing frequency division processing on the audio data to be noise reduced to obtain sub audio data to be noise reduced of the audio data to be noise reduced;
acquiring the frequency spectrum characteristics of the sub-audio data to be noise reduced;
carrying out noise reduction treatment on the frequency spectrum characteristics through a pre-trained audio noise reduction model to obtain noise-reduced frequency spectrum characteristics of the sub-audio data to be noise-reduced;
performing signal reconstruction processing on the noise-reduced frequency spectrum characteristics to obtain noise-reduced sub-audio data corresponding to the sub-audio data to be noise-reduced;
and combining the noise-reduced sub-audio data to obtain noise-reduced audio data corresponding to the audio data to be noise-reduced.
According to the audio noise reduction method, the computer device, the storage medium and the computer program product, frequency division processing is first performed on the audio data to be noise-reduced to obtain the sub-audio data to be noise-reduced; then the spectral features of the sub-audio data are obtained, and noise reduction processing is performed on them through a pre-trained audio noise reduction model to obtain noise-reduced spectral features; next, signal reconstruction processing is performed on the noise-reduced spectral features to obtain the noise-reduced sub-audio data corresponding to the sub-audio data to be noise-reduced; finally, the noise-reduced sub-audio data are combined to obtain the noise-reduced audio data corresponding to the audio data to be noise-reduced. By performing frequency division on the audio data to be noise-reduced, the human voice concentrated in the low frequency band and the noise in each frequency band can be captured better, which enables the pre-trained noise reduction model to apply targeted noise reduction processing to the spectral features of each piece of sub-audio data. This makes the noise reduction more comprehensive, avoids leaving noise unprocessed, and yields higher audio quality in the audio data reconstructed from the noise-reduced spectral features, thereby improving the noise reduction effect of the audio data to be noise-reduced.
Drawings
FIG. 1 is a flow chart of an audio noise reduction method according to one embodiment;
FIG. 2 is a flowchart illustrating steps of noise reduction processing of spectral features in one embodiment;
FIG. 3 is a schematic diagram of an audio noise reduction model in one embodiment;
FIG. 4 is a schematic diagram of an audio frame to be denoised in one embodiment;
FIG. 5 is a flow chart illustrating training steps of an audio noise reduction model in one embodiment;
FIG. 6 is a flow chart of an audio noise reduction method according to another embodiment;
FIG. 7 is a flow chart of a training method of an audio noise reduction model in one embodiment;
FIG. 8 is a flow chart of a training method of an audio noise reduction model according to another embodiment;
FIG. 9 is a flow chart of a method of audio noise reduction according to yet another embodiment;
fig. 10 is an internal structural view of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In one embodiment, as shown in fig. 1, an audio noise reduction method is provided. The method is described here as applied to a terminal; it is understood that the method may also be applied to a server, or to a system including the terminal and the server and implemented through interaction between the two. The terminal may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, internet of things devices, and portable wearable devices. The server may be implemented as a stand-alone server or as a server cluster composed of a plurality of servers. In this embodiment, the method includes the steps of:
Step S101, performing frequency division processing on the audio data to be noise reduced to obtain sub audio data to be noise reduced of the audio data to be noise reduced.
The audio data to be noise-reduced refers to the audio data on which noise reduction is to be performed, and is generally represented by a waveform signal, such as a time-domain waveform signal. In a practical scenario, the audio data to be noise-reduced may be singing voice data acquired through an audio acquisition device (such as a microphone), singing voice data recorded through a singing application carried on a terminal, singing voice data stored in a database, or singing voice data acquired through a server.
It should be noted that the audio data to be noise reduced may also refer to other forms of audio data, and the present application is mainly illustrated by taking the audio data to be noise reduced as singing voice data.
Wherein, the singing voice data contains noise (such as microphone fricatives, environmental background noise and the like) and accompaniment (such as musical instrument accompaniment, original song accompaniment and the like), and the accompaniment can be regarded as noise; therefore, the present application mainly eliminates noise and accompaniment in singing voice data.
Generally, in a singing voice recording scene, the singing voice of a user is subjected to dynamic compression, voice enhancement, reverberation and other effect processing, then to loudness equalization and mixing with the accompaniment, finally forming a complete singing work. However, besides the singing voice of the user, the microphone also picks up the accompaniment, environmental noise and the like, and under their influence it is difficult for the dynamic compression, voice enhancement and reverberation processing to achieve the ideal effect. In particular, in an accompaniment scene, the accompaniment picked up by the microphone (i.e., the first accompaniment), having been played out through a speaker and passed through the speaker and microphone paths, differs from the originally produced accompaniment (i.e., the second accompaniment). Furthermore, since the final singing voice is also mixed with the originally produced accompaniment, the final singing work ends up with the accompaniment superimposed twice in total.
The frequency division processing refers to decomposing the audio data to be noise-reduced into a plurality of pieces of sub-audio data to be noise-reduced, where the frequency band corresponding to each piece of sub-audio data is different; that is, each piece of sub-audio data is a sub-band, and different sub-bands correspond to different frequency bands. In a practical scenario, the frequency division processing may be implemented by a PQMF (pseudo-quadrature mirror filter), although other frequency division techniques, such as wavelet decomposition, may also be used.
On the one hand, the frequency division processing is performed on the audio data to be noise-reduced, so that the data size of each sub-audio data to be noise-reduced is smaller than that of the audio data to be noise-reduced, the data processing size of each audio noise-reduced model is reduced, the calculation complexity of the audio noise-reduced model is also reduced, and the audio noise-reduced model is lighter. On the other hand, considering that more energy is concentrated in the low frequency band in the human voice; in accompaniment, the frequency bands distributed by accompaniment of different musical instruments are different, and the waveform signal of the audio data to be noise reduced is decomposed into a plurality of sub-bands (namely the sub-audio data to be noise reduced), so that the sub-bands can be subjected to targeted noise reduction treatment, accompaniment and noise in the audio can be better eliminated, the definition, the intelligibility and the like of the audio are effectively improved, and the audio noise reduction effect is improved.
Specifically, the terminal acquires the audio data to be noise reduced through audio acquisition equipment deployed on the terminal or through independent audio acquisition equipment, or acquires the audio data to be noise reduced from a database or a server; and then, the terminal performs frequency division processing on the audio data to be noise-reduced by utilizing the signal frequency division instruction to obtain a plurality of sub audio data to be noise-reduced of the audio data to be noise-reduced.
For example, a user records singing voice through a singing application program deployed on a terminal, and triggers a singing voice noise reduction request; then, the terminal responds to the singing voice noise reduction request, acquires singing voice data recorded by a user, and decomposes the waveform signal of the singing voice data into a plurality of sub-bands, such as 4 sub-bands, by utilizing a pseudo-orthogonal mirror filter; or the terminal performs wavelet decomposition processing on the singing voice data to obtain a plurality of sub-bands.
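The frequency-division step above can be sketched as follows. Note this is a simplified stand-in: the patent describes a PQMF filter bank (or wavelet decomposition), whereas the sketch below partitions the FFT bins of the whole signal into four equal bands. It shares the key property that the sub-bands sum back to the original signal, but it is not a PQMF implementation.

```python
import numpy as np

def split_into_subbands(x, n_bands=4):
    """Illustrative frequency division: partition the rFFT bins into
    n_bands equal bands and transform each band back to the time domain."""
    spec = np.fft.rfft(x)
    edges = np.linspace(0, len(spec), n_bands + 1, dtype=int)
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band_spec = np.zeros_like(spec)
        band_spec[lo:hi] = spec[lo:hi]
        bands.append(np.fft.irfft(band_spec, n=len(x)))
    return bands

# A 440 Hz tone at a 16 kHz sampling rate, split into 4 sub-bands
x = np.sin(2 * np.pi * 440 * np.arange(4096) / 16000)
subbands = split_into_subbands(x, n_bands=4)
# Summing the sub-bands reconstructs the original signal
```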
Step S102, obtaining the frequency spectrum characteristics of the sub-audio data to be noise reduced.
The spectral features are used for representing the spectrum information of the sub-audio data to be noise-reduced, and specifically include amplitude features and phase features. In a practical scenario, the spectral features of the sub-audio data to be noise-reduced can be obtained by performing FFT (fast Fourier transform) processing on the sub-audio data, that is, by converting its time-domain waveform signal from the time domain to the frequency domain.
Specifically, the terminal performs feature extraction processing on the sub-audio data to be noise reduced, so that the frequency spectrum features of the sub-audio data to be noise reduced can be obtained. For example, the terminal performs fourier transform on the sub-audio data to be denoised to obtain the spectral features of the sub-audio data to be denoised, or the terminal inputs the sub-audio data to be denoised into a pre-trained spectral feature extraction model to perform feature extraction processing to obtain the spectral features of the sub-audio data to be denoised. Wherein the pre-trained spectral feature extraction model is a neural network model for extracting spectral features of the audio data.
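The Fourier-transform extraction of amplitude and phase features can be sketched as below (frame length and tone frequency are illustrative assumptions):

```python
import numpy as np

def spectral_features(frame):
    """Amplitude and phase features of one audio frame via the FFT."""
    spec = np.fft.rfft(frame)
    return np.abs(spec), np.angle(spec)

frame = np.cos(2 * np.pi * 4 * np.arange(64) / 64)   # pure tone at bin 4
mag, phase = spectral_features(frame)
```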
Step S103, carrying out noise reduction processing on the frequency spectrum characteristics through a pre-trained audio noise reduction model to obtain noise-reduced frequency spectrum characteristics of the sub-audio data to be noise-reduced.
The pre-trained audio noise reduction model is a neural network model for performing noise reduction processing on audio data, specifically a CNN (convolutional neural network), an RNN (recurrent neural network), a DNN (deep neural network), or the like. Of course, the pre-trained audio noise reduction model may also be another lightweight neural network model, such as the model shown in FIG. 3.
The noise reduction processing is performed on the spectrum features, which means that noise and accompaniment carried in the spectrum features are eliminated. The denoised spectral features are spectral features obtained after denoise the spectral features. It will be appreciated that for the same sub-audio data to be denoised, the noise in its denoised spectral characteristics is less than the noise in its spectral characteristics.
Specifically, the terminal acquires a clean audio sample and a noisy audio sample corresponding to the clean audio sample from the database, and iteratively trains an audio noise reduction model to be trained on these samples, taking the trained model as the pre-trained audio noise reduction model. Then, the terminal inputs the spectral features of the sub-audio data to be noise-reduced into the pre-trained audio noise reduction model, which performs several rounds of noise reduction processing on the spectral features to obtain the noise-reduced spectral features of the sub-audio data to be noise-reduced.
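One common way to view the model's noise reduction step, shown here as an illustrative assumption rather than the patent's exact formulation, is that the network predicts a per-frequency-bin gain mask in [0, 1] which is applied to the magnitude spectrum:

```python
import numpy as np

def apply_spectral_mask(mag, mask):
    """Apply a predicted per-bin gain mask to a magnitude spectrum.
    Bins dominated by noise get a gain near 0; clean bins near 1."""
    return np.clip(mask, 0.0, 1.0) * mag

mag = np.array([1.0, 4.0, 0.5])
mask = np.array([0.9, 1.0, 0.1])   # suppress the noisy third bin
denoised = apply_spectral_mask(mag, mask)
# denoised == [0.9, 4.0, 0.05]
```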
Step S104, carrying out signal reconstruction processing on the noise-reduced frequency spectrum characteristics to obtain noise-reduced sub-audio data corresponding to the sub-audio data to be noise-reduced.
The signal reconstruction processing refers to restoring the denoised spectral characteristics to the same format (i.e., waveform signal format) as the sub-audio data to be denoised, so as to obtain denoised sub-audio data corresponding to the sub-audio data to be denoised. It will be appreciated that the noise in the sub-audio data after noise reduction is less than the noise in the sub-audio data to be noise reduced, and the sharpness, signal fidelity, etc. is higher than the sub-audio data to be noise reduced.
Like the sub-audio data to be noise-reduced, the noise-reduced sub-audio data is also represented by a waveform signal, such as a time-domain waveform signal. In a practical scenario, the noise-reduced sub-audio data corresponding to the sub-audio data to be noise-reduced can be obtained by performing IFFT (inverse fast Fourier transform) processing on the noise-reduced spectral features, that is, by converting the noise-reduced spectral features back to the time domain.
Specifically, the terminal performs signal reconstruction processing on the denoised frequency spectrum characteristics through a signal reconstruction instruction, namely, the denoised frequency spectrum characteristics are restored to the same format as the data of the sub-audio to be denoised, so that the data of the sub-audio to be denoised, which has less noise and higher definition, is obtained.
For example, the terminal performs inverse fourier transform on the denoised spectrum feature, so as to obtain denoised sub-audio data corresponding to the sub-audio data to be denoised. Or the terminal inputs the noise-reduced frequency spectrum characteristics into a pre-trained signal reconstruction model, and performs signal reconstruction processing on the noise-reduced frequency spectrum characteristics through the signal reconstruction model, so that noise-reduced sub-audio data corresponding to the sub-audio data to be noise-reduced can be obtained. The pre-trained signal reconstruction model is a neural network model for reconstructing the signals of the spectral features of the audio data.
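The inverse-FFT reconstruction can be sketched as below. Reusing the noisy frame's phase together with the denoised magnitude is a common simplification; the patent does not specify how phase is handled, so this pairing is an assumption.

```python
import numpy as np

def reconstruct_frame(denoised_mag, phase):
    """Rebuild a time-domain frame from a denoised magnitude spectrum,
    reusing the phase of the original (noisy) frame."""
    spec = denoised_mag * np.exp(1j * phase)
    return np.fft.irfft(spec)

frame = np.sin(2 * np.pi * 8 * np.arange(128) / 128)
spec = np.fft.rfft(frame)
rebuilt = reconstruct_frame(np.abs(spec), np.angle(spec))
# With magnitude and phase unchanged, the frame round-trips exactly
```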
Step S105, the noise-reduced sub-audio data are combined to obtain noise-reduced audio data corresponding to the audio data to be noise-reduced.
The noise-reduced audio data refers to audio data with less noise and purer signal, such as higher definition, intelligibility, signal fidelity, etc., than the audio data to be noise-reduced.
Specifically, the terminal obtains the arrangement sequence of the noise-reduced sub-audio data, and combines the noise-reduced sub-audio data according to the arrangement sequence to obtain combined audio data with less noise and purer signals, and the combined audio data is used as noise-reduced audio data corresponding to the audio data to be noise-reduced. For example, the terminal combines several noise-reduced sub-bands of the noise-reduced sub-audio data together in a sub-band synthesis manner, and finally obtains noise-reduced audio data corresponding to the audio data to be noise-reduced.
Further, the terminal can input the noise-reduced sub-audio data into a pre-trained audio quality prediction model, and the audio quality value of the noise-reduced sub-audio data is predicted through the audio quality prediction model; under the condition that the audio quality values of the noise-reduced sub-audio data are larger than a preset quality value, combining all the noise-reduced sub-audio data to obtain noise-reduced audio data corresponding to the audio data to be noise-reduced; and under the condition that the audio quality value of at least one piece of noise-reduced sub-audio data is smaller than or equal to a preset quality value, taking the noise-reduced sub-audio data as the sub-audio data to be noise-reduced, and repeatedly executing the steps S102 to S104 until the audio quality value of the finally obtained noise-reduced sub-audio data is larger than the preset quality value.
In the audio noise reduction method, sub-audio data to be noise-reduced are obtained by performing frequency division processing on the audio data to be noise-reduced; the spectral features of the sub-audio data to be noise-reduced are then obtained, and noise reduction processing is performed on the spectral features through a pre-trained audio noise reduction model to obtain the noise-reduced spectral features of the sub-audio data to be noise-reduced; signal reconstruction processing is then performed on the noise-reduced spectral features to obtain the noise-reduced sub-audio data corresponding to the sub-audio data to be noise-reduced; finally, the noise-reduced sub-audio data are combined to obtain the noise-reduced audio data corresponding to the audio data to be noise-reduced. By frequency-dividing the audio data to be noise-reduced, human voice in the low frequency band and noise in each frequency band can be captured better, which helps the pre-trained noise reduction model perform the corresponding noise reduction processing on the spectral features of each piece of sub-audio data to be noise-reduced. This makes the noise reduction more comprehensive and avoids missing noise that should be processed, so that the noise-reduced audio data reconstructed from the noise-reduced spectral features of the sub-audio data has higher audio quality, improving the noise reduction effect on the audio data to be noise-reduced.
In one embodiment, as shown in fig. 2, the step S103 includes performing noise reduction processing on the spectral features through a pre-trained audio noise reduction model to obtain noise-reduced spectral features of the sub-audio data to be noise reduced, and specifically includes the following steps:
step S201, performing feature extraction processing on the frequency spectrum features through a pre-trained audio noise reduction model to obtain processed frequency spectrum features of the sub-audio data to be noise reduced.
The processed spectral feature is the feature obtained after feature extraction processing is performed on the spectral feature; here, the feature extraction processing specifically refers to convolution processing. It will be appreciated that after the feature extraction processing, the processed spectral features contain the key spectral features.
Step S202, performing noise reduction processing related to the frequency domain on the processed frequency spectrum characteristics to obtain first noise-reduced frequency spectrum characteristics of the sub-audio data to be noise reduced.
The noise reduction processing related to the frequency domain refers to noise reduction processing of the processed spectral features along the frequency axis. It mainly learns intra-frame information fully, so it has a better fidelity effect on harmonic components or overtones in singing voice data, and transient noise in the singing voice data (such as door sounds, keyboard clicks, mouth noises, and the like) can be effectively suppressed.
It can be understood that, considering that various harmonic components or overtones exist in singing voice data, the frequency-domain noise reduction processing mainly learns intra-frame information along the frequency axis; the larger field of view established on the frequency axis can better capture correlation in the frequency domain, so the processing has a better fidelity effect on the harmonic components or overtones in the singing voice data while suppressing transient noise in the singing voice data more strongly.
The first noise-reduced spectrum feature refers to a spectrum feature obtained by performing noise reduction processing related to a frequency domain on the processed spectrum feature. It will be appreciated that after the frequency domain dependent noise reduction process, the signal fidelity of the first post-noise reduction spectral feature is higher and contains less or no transient noise.
Step S203, performing a noise reduction process related to time on the first noise-reduced spectral feature to obtain a second noise-reduced spectral feature of the sub-audio data to be noise-reduced.
The noise reduction processing related to time refers to noise reduction processing of the first noise-reduced spectrum feature on a time axis, and is mainly to fully learn information between frames, so that the integrity of singing voice is better reserved, and the swallowing of sound is avoided.
It will be appreciated that singing voice data has strong temporal context; for example, a singer often lengthens vowels to make the singing more expressive. The time-dependent noise reduction processing essentially learns inter-frame information along the time axis, and the larger field of view established on the time axis can better capture temporal correlation, so that a word that is elongated and sustained for a longer period is retained continuously. This helps preserve the integrity of the singing voice and avoids swallowed sounds.
The second denoised spectral feature refers to a spectral feature obtained by performing a time-dependent denoising process on the first denoised spectral feature. It will be appreciated that after the time dependent noise reduction process, the singing voice of the second post-noise reduction spectral feature is more complete and the signal fidelity is further improved.
And step S204, performing feature extraction processing on the second noise-reduced frequency spectrum feature to obtain the noise-reduced frequency spectrum feature of the sub-audio data to be noise-reduced.
The denoised spectral features are spectral features obtained after feature extraction processing is performed on the second denoised spectral features. The feature extraction processing herein also refers to convolution processing; it can be understood that after the feature extraction processing, the noise-reduced spectrum features are further optimized, so that the signal fidelity, the signal definition and the like are further improved.
Specifically, the terminal inputs the frequency spectrum characteristics of the sub-audio data to be noise-reduced into a pre-trained audio noise reduction model, and firstly performs characteristic extraction processing on the frequency spectrum characteristics through the audio noise reduction model to obtain key frequency spectrum characteristics serving as the processed frequency spectrum characteristics of the sub-audio data to be noise-reduced; then, carrying out noise reduction processing related to the frequency domain on the processed frequency spectrum characteristics so as to inhibit transient noise and other noise in the processed frequency spectrum characteristics under the condition of better fidelity effect on harmonic components, overtones and the like in the audio data, thereby obtaining first noise-reduced frequency spectrum characteristics of the sub-audio data to be noise-reduced; then, carrying out noise reduction processing related to time on the first noise-reduced frequency spectrum characteristic so as to inhibit noise in the first noise-reduced frequency spectrum characteristic under the condition of better keeping the integrity of singing voice and avoiding swallowing, thereby obtaining a second noise-reduced frequency spectrum characteristic of the sub-audio data to be noise-reduced; and finally, carrying out feature extraction processing on the second denoised frequency spectrum feature to obtain a target frequency spectrum feature which is used as the denoised frequency spectrum feature of the sub-audio data to be denoised.
In this embodiment, after feature extraction processing is performed on the spectrum features, noise reduction processing is performed on the obtained processed spectrum features twice, so that the finally obtained noise-reduced spectrum features are purer, which is beneficial to further improving the noise reduction effect of the audio. Meanwhile, through the noise reduction processing related to the frequency domain, the correlation on the frequency domain can be better captured, and the distortion condition of harmonic components or overtones in the audio data can be effectively reduced; in addition, through the noise reduction processing related to time, the time correlation can be better captured, so that the singing integrity can be better reserved, and the sound swallowing is avoided.
In one embodiment, as shown in fig. 3, the pre-trained audio noise reduction model includes a first convolution layer, a first hole convolution network, a second hole convolution network, and a second convolution layer that are sequentially connected, where the first hole convolution network and the second hole convolution network each include a plurality of sequentially connected sub-hole convolution networks, each sub-hole convolution network including a hole convolution layer, a normalization layer, and an activation layer that are sequentially connected.
Referring to fig. 3, the first convolution layer and the second convolution layer are both normal convolution layers, and mainly perform spectral feature extraction. For example, the first convolution layer is mainly used for performing feature extraction processing on spectral features of the sub-audio data to be denoised, and the second convolution layer is mainly used for performing feature extraction processing on the second denoised spectral features of the sub-audio data to be denoised.
Referring to fig. 3, the first hole convolutional network includes a plurality of (e.g., 6) sequentially connected sub-hole convolutional networks, and the dilation rate of the hole convolutional layer in each sub-hole convolutional network increases in powers of 2, such as 1, 2, 4, 8, 16, and 32, so that the whole first hole convolutional network has a larger field of view on the frequency axis, which makes it easier to capture correlation in the frequency domain. The first hole convolutional network mainly performs noise reduction processing related to the frequency domain; for example, it is mainly used for performing frequency-domain noise reduction processing on the processed spectral features.
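The effect of the exponentially growing dilation rates can be quantified: with a kernel size of 3 (an assumption; the patent does not state the kernel size), a stack of dilated convolutions with dilation rates 1, 2, 4, 8, 16, and 32 sees 1 + 2 x (1 + 2 + 4 + 8 + 16 + 32) = 127 frequency bins:

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of stride-1 dilated convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d  # each layer widens the field by (k-1)*dilation
    return rf

rf = receptive_field(3, [1, 2, 4, 8, 16, 32])
print(rf)  # 127
```

For comparison, the same six layers without dilation (all rates 1) would see only 13 bins, which is why the exponential schedule is what gives the network its large field of view.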
With reference to fig. 3, in each sub-hole convolutional network, the hole convolutional layer (i.e., dilated convolution layer) mainly performs dilated convolution processing, the normalization layer (such as a BatchNorm layer) mainly performs normalization processing, and the activation layer (such as a PReLU layer) mainly performs activation processing.
Referring to fig. 3, the second hole convolutional network also includes a plurality of (e.g., 6) sequentially connected sub-hole convolutional networks, and the dilation rate of the hole convolutional layer in each sub-hole convolutional network likewise increases in powers of 2, for example, 1, 2, 4, 8, 16, and 32, so that the whole second hole convolutional network has a larger field of view on the time axis, which makes it easier to capture correlation in time. The second hole convolutional network mainly performs noise reduction processing related to time; for example, it is mainly used for performing time-dependent noise reduction processing on the first noise-reduced spectral features.
It should be noted that the first cavity convolution network and the second cavity convolution network have the same network structure and consistent parameter settings, so no additional processing such as feature dimension conversion is required during processing and no extra network structures need to be added, which keeps the network structure of the whole audio noise reduction model lightweight. Meanwhile, the whole audio noise reduction model adopts a convolutional neural network structure, can be parallelized, and can also be implemented with separable convolutions, so the audio noise reduction model is a lightweight network model capable of performing noise reduction in real time.
In a practical scenario, referring to fig. 3, in the case where the number of pieces of sub-audio data to be noise-reduced is 4, the first convolution layer is a two-dimensional convolution layer with 4 input channels and 8 output channels; the input channels and output channels of the first hole convolution network and the second hole convolution network are both 8; and the second convolution layer is a two-dimensional convolution layer with 8 input channels and 4 output channels. It can be appreciated that the network structure of the audio noise reduction model can be adaptively adjusted according to the actual scene.
Specifically, referring to fig. 3, the terminal inputs the spectral features of the sub-audio data to be noise reduced into a pre-trained audio noise reduction model, and firstly, performs feature extraction processing on the spectral features through a first convolution layer in the audio noise reduction model to obtain key spectral features in the spectral features, wherein the key spectral features are used as processed spectral features of the sub-audio data to be noise reduced; then inputting the processed spectral features into a first cavity convolution network in the audio noise reduction model, and carrying out noise reduction processing related to the frequency domain on the processed spectral features through the first cavity convolution network so as to suppress transient noise and other noise in the processed spectral features under the condition of a better fidelity effect on harmonic components, overtones and the like in the audio data, thereby obtaining first noise-reduced spectral features of the sub-audio data to be noise-reduced; then inputting the first noise-reduced spectral features into a second cavity convolution network in the audio noise reduction model, and performing time-dependent noise reduction processing on the first noise-reduced spectral features through the second cavity convolution network so as to suppress noise in the first noise-reduced spectral features under the condition of better preserving singing integrity and avoiding sound swallowing, thereby obtaining second noise-reduced spectral features of the sub-audio data to be noise-reduced; and finally, inputting the second noise-reduced spectral features into a second convolution layer in the audio noise reduction model, and performing feature extraction processing on the second noise-reduced spectral features through the second convolution layer to obtain target spectral features serving as the noise-reduced spectral features of the sub-audio data to be denoised.
In this embodiment, the first convolution layer, the first cavity convolution network, the second cavity convolution network and the second convolution layer in the audio noise reduction model perform noise reduction processing on the spectrum features for multiple times, so that noise in the spectrum features can be better eliminated, and the noise reduction model is convenient for obtaining purer noise-reduced audio data based on the noise-reduced spectrum features, thereby further improving the audio noise reduction effect. Meanwhile, the light-weight audio noise reduction model is used for carrying out noise reduction treatment on the frequency spectrum characteristics, so that the audio noise reduction time can be effectively shortened, and the audio noise reduction efficiency is improved.
In one embodiment, the step S102 includes obtaining the spectral features of the sub-audio data to be noise reduced, which specifically includes the following contents: carrying out framing treatment on the sub-audio data to be denoised to obtain an audio frame to be denoised of the sub-audio data to be denoised; performing feature extraction processing on the audio frame to be noise-reduced to obtain the frequency spectrum features of the audio frame to be noise-reduced; and combining the frequency spectrum characteristics of the audio frame to be denoised to obtain the frequency spectrum characteristics of the sub-audio data to be denoised.
The framing process refers to separating the sub-audio data to be noise reduced into multiple frames of audio frames to be noise reduced, where the time length of each frame of audio frame to be noise reduced is the same, for example, the time length of each frame of audio frame to be noise reduced is 20ms. The audio frame to be noise-reduced refers to an audio frame obtained after framing the sub-audio data to be noise-reduced. The feature extraction process herein refers to fourier transform.
The spectral features of the audio frame to be noise-reduced are used to represent the spectral information of the audio frame to be noise-reduced, and specifically comprise amplitude features and phase features. The spectral features of each piece of sub-audio data to be noise-reduced are jointly formed by the spectral features of its multiple frames of audio frames to be noise-reduced.
Specifically, the terminal performs framing processing on the sub-audio data to be denoised according to a frame length L (generally a power of 2, for example 1024) and a frame shift P (for example 0.5L), so as to divide the sub-audio data to be denoised into multiple frames of audio frames to be denoised. The terminal then performs Fourier transform on each frame of audio frame to be denoised, or inputs each frame into a pre-trained spectral feature extraction model for feature extraction processing, so as to obtain the spectral features of each frame of audio frame to be denoised; finally, the spectral features of each frame of audio frame to be denoised are combined, and the combined spectral features serve as the spectral features of the sub-audio data to be denoised.
It should be noted that, assuming the frame length is L, after Fourier transform is performed on audio data (such as sub-audio data to be noise-reduced or audio frames to be noise-reduced), there are L frequency points in total; because the frequency points are conjugate-symmetric, generally only L/2+1 frequency points are used.
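The framing parameters above (frame length L = 1024, frame shift P = 0.5L) and the L/2 + 1 usable frequency points can be checked numerically; the 16000-sample input is an arbitrary example:

```python
import numpy as np

L, P = 1024, 512                       # frame length and frame shift (0.5L)
x = np.random.randn(16000)             # stand-in for sub-audio data to be noise-reduced
num_frames = (len(x) - L) // P + 1
frames = np.stack([x[i * P : i * P + L] for i in range(num_frames)])
spec = np.fft.rfft(frames, axis=1)     # rfft keeps only L/2 + 1 bins (conjugate symmetry)
print(frames.shape, spec.shape)        # (30, 1024) (30, 513)
```

`np.fft.rfft` discards the conjugate-symmetric half directly, which is exactly the L/2 + 1 = 513 frequency points the text describes.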
For example, referring to fig. 4, the terminal performs framing processing on the input sub-audio data to be noise-reduced with a frame length of 20 ms and a frame shift of 10 ms, and then performs a windowing operation to obtain n frames of audio frames to be noise-reduced, namely, the 1st frame (0 to 20 ms), the 2nd frame (10 ms to 30 ms), the 3rd frame (20 ms to 40 ms), the 4th frame (30 ms to 50 ms), ..., and the nth frame. The terminal then performs Fourier transform on the n frames of audio frames to be noise-reduced to obtain their spectral features, and finally combines the spectral features of the n frames together to obtain the spectral features of the sub-audio data to be noise-reduced.
It should be noted that, after performing fourier transform on the audio frame to be noise reduced (or the sub-audio data to be noise reduced), the terminal may calculate, according to the real part and the imaginary part of the fourier transform result (in complex form), the amplitude feature and the phase feature of the audio frame to be noise reduced (or the sub-audio data to be noise reduced), so as to obtain the spectral feature of the audio frame to be noise reduced (or the sub-audio data to be noise reduced). Wherein, the fourier transform result is:
X = X_r + i·X_i

where X_r represents the real part and X_i represents the imaginary part. The corresponding amplitude feature is A = √(X_r² + X_i²), and the corresponding phase feature is α = arctan(X_i/X_r), where arctan represents the arctangent function.
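The amplitude and phase features follow directly from the real and imaginary parts; the sketch below uses `np.arctan2` rather than a plain arctangent so the phase lands in the correct quadrant:

```python
import numpy as np

X = np.fft.rfft(np.random.randn(1024))   # complex Fourier result X = X_r + i*X_i
Xr, Xi = X.real, X.imag
amplitude = np.sqrt(Xr**2 + Xi**2)       # A = sqrt(X_r^2 + X_i^2)
phase = np.arctan2(Xi, Xr)               # alpha = arctan(X_i / X_r), quadrant-aware
assert np.allclose(amplitude, np.abs(X))
assert np.allclose(amplitude * np.exp(1j * phase), X)  # A, alpha fully determine X
```

The final assertion confirms that the amplitude and phase pair carries the same information as the complex spectrum itself, which is why the two features together constitute the spectral features.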
In this embodiment, the sub-audio data to be noise-reduced is subjected to framing, so that the sub-audio data to be noise-reduced can be divided into multiple frames of audio frames to be noise-reduced, and feature extraction processing is performed on each frame of audio frame to be noise-reduced, so that the spectral features of the sub-audio data to be noise-reduced can be better obtained, and effective noise reduction processing is performed on the sub-audio data to be noise-reduced based on the spectral features of the sub-audio data to be noise-reduced.
In one embodiment, the step S104 performs signal reconstruction processing on the denoised spectrum feature to obtain denoised sub-audio data corresponding to the sub-audio data to be denoised, and specifically includes the following contents: from the denoised spectral features, confirming denoised spectral features of the audio frame to be denoised; performing signal reconstruction processing on the noise-reduced frequency spectrum characteristics of the audio frame to be noise-reduced to obtain a noise-reduced audio frame of the audio frame to be noise-reduced; and combining the noise-reduced audio frames to obtain noise-reduced sub-audio data corresponding to the sub-audio data to be noise-reduced.
The noise-reduced spectral features of the sub-audio data to be noise-reduced are formed by the noise-reduced spectral features of its multiple frames of audio frames to be noise-reduced. It will be appreciated that, for the same audio frame to be noise-reduced, the noise in its noise-reduced spectral features is less than the noise in its original spectral features.
The signal reconstruction processing specifically refers to inverse fourier transform, and is used for reconstructing a noise-reduced audio frame of the audio frame to be noise-reduced. It is understood that a denoised audio frame of an audio frame to be denoised refers to an audio frame that contains less noise and has higher signal fidelity than the audio frame to be denoised.
The noise-reduced sub-audio data corresponding to the sub-audio data to be noise-reduced consists of the noise-reduced audio frames of the multiple frames of audio frames to be noise-reduced of that sub-audio data.
Specifically, because the noise-reduced spectral features of the sub-audio data to be noise-reduced are formed by the noise-reduced spectral features of its multiple frames of audio frames to be noise-reduced, the terminal can identify the noise-reduced spectral features of each frame of audio frame to be noise-reduced from the noise-reduced spectral features of the sub-audio data to be noise-reduced. Inverse Fourier transform is then performed on the noise-reduced spectral features of each frame of audio frame to be noise-reduced, so as to transform them to the time domain and obtain the noise-reduced audio frame of each frame. Finally, the terminal combines the noise-reduced audio frames of the frames to obtain combined audio, which serves as the noise-reduced sub-audio data corresponding to the sub-audio data to be noise-reduced.
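A minimal sketch of the per-frame inverse Fourier transform followed by overlap-add combination; the periodic Hann analysis window at 50% overlap is an assumption (the patent does not specify a window), chosen because its overlapped copies sum to exactly 1:

```python
import numpy as np

L, P = 1024, 512
win = np.sin(np.pi * np.arange(L) / L) ** 2  # periodic Hann: overlaps sum to 1 at hop L/2

def analyze(x):
    """Window each frame and take its spectrum (the per-frame spectral features)."""
    n = (len(x) - L) // P + 1
    return np.stack([np.fft.rfft(win * x[i * P : i * P + L]) for i in range(n)])

def reconstruct(spec):
    """Inverse-transform each frame to the time domain and overlap-add the frames."""
    out = np.zeros(P * (len(spec) - 1) + L)
    for i, frame in enumerate(spec):
        out[i * P : i * P + L] += np.fft.irfft(frame, n=L)
    return out

x = np.random.randn(8192)
y = reconstruct(analyze(x))
assert np.allclose(y[P:-P], x[P:-P])  # interior samples are recovered exactly
```

In the patented method the spectra would be denoised between `analyze` and `reconstruct`; with unmodified spectra the interior of the signal round-trips exactly because sin² + cos² = 1 across overlapping windows.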
In this embodiment, the post-noise-reduction spectral features of the audio frames to be noise-reduced of each frame are confirmed from the post-noise-reduction spectral features of the sub-audio data to be noise-reduced, and then the post-noise-reduction spectral features of the audio frames to be noise-reduced of each frame are subjected to signal reconstruction processing, so that the post-noise-reduction audio frames of the audio frames to be noise-reduced of each frame can be better reconstructed, the audio quality of the post-noise-reduction sub-audio data obtained based on the combination of the post-noise-reduction audio frames is higher, and the audio noise reduction effect is further improved.
In one embodiment, as shown in fig. 5, the audio noise reduction method provided by the present application further includes a training step of an audio noise reduction model, and specifically includes the following steps:
step S501, a clean audio sample and a noisy audio sample corresponding to the clean audio sample are obtained.
Wherein, a clean audio sample refers to an audio sample without noise; a noisy audio sample refers to an audio sample containing noise, and specifically refers to a mixed audio sample obtained by performing fusion processing on a clean audio sample and a noise audio sample. It is understood that the noisy audio sample comprises the clean audio sample and noise.
Step S502, inputting the spectral features of the noisy sub-audio samples of the noisy audio sample into the audio noise reduction model to be trained to obtain the noise-reduced spectral features of the noisy sub-audio samples.
The noisy sub-audio samples of the noisy audio sample are obtained by performing frequency division processing on the noisy audio sample; the spectral features of the noisy sub-audio samples are obtained by performing feature extraction on the noisy sub-audio samples.
The audio noise reduction model to be trained may be any of various neural network models such as a CNN, RNN, or DNN, or may be the model shown in fig. 3.
Step S503, performing signal reconstruction processing on the noise-reduced spectral features of the noisy sub-audio sample to obtain a noise-reduced sub-audio sample corresponding to the noisy sub-audio sample.
Step S504, combining the noise-reduced sub-audio samples to obtain a noise-reduced audio sample corresponding to the noisy audio sample.
Step S505, training the audio noise reduction model to be trained according to the difference between the spectral features of the noise reduced audio sample and the spectral features of the clean audio sample to obtain a pre-trained audio noise reduction model.
Specifically, the terminal acquires a clean audio sample and the noisy audio sample corresponding to the clean audio sample from a database. It then performs frequency division processing on the noisy audio sample to obtain the noisy sub-audio samples of the noisy audio sample, and performs Fourier transform on the noisy sub-audio samples to obtain their spectral features. The spectral features of the noisy sub-audio samples are input into the audio noise reduction model to be trained, which performs a series of noise reduction processing on them to obtain the noise-reduced spectral features of the noisy sub-audio samples. Inverse Fourier transform is then performed on the noise-reduced spectral features to obtain the noise-reduced sub-audio samples corresponding to the noisy sub-audio samples, and the noise-reduced sub-audio samples are combined according to their arrangement order to obtain the noise-reduced audio sample corresponding to the noisy audio sample. Finally, Fourier transform is performed on the noise-reduced audio sample and the clean audio sample respectively to obtain their spectral features, and the audio noise reduction model to be trained is iteratively trained according to the difference between the spectral features of the noise-reduced audio sample and those of the clean audio sample, with the trained audio noise reduction model serving as the pre-trained audio noise reduction model.
For example, in the audio noise reduction model training process, the terminal takes a clean audio sample as a training target, and calculates a loss value according to the difference between the spectral characteristics of the noise-reduced audio sample and the spectral characteristics of the clean audio sample; and under the condition that the loss value is greater than or equal to a preset loss value, performing iterative training on the audio noise reduction model to be trained according to the loss value, so as to repeatedly execute the steps S502 to S505 until the loss value obtained according to the output result of the trained audio noise reduction model is smaller than the preset loss value, stopping training, and taking the trained audio noise reduction model as a pre-trained audio noise reduction model.
In this embodiment, the spectral features of the noisy sub-audio samples of the noisy audio sample are input into the audio noise reduction model to be trained to obtain the noise-reduced spectral features of the noisy sub-audio samples; noise-reduced sub-audio samples are then obtained from those noise-reduced spectral features and combined into a noise-reduced audio sample; finally, the audio noise reduction model to be trained is iteratively trained according to the difference between the spectral features of the noise-reduced audio sample and those of the clean audio sample, so that the spectral features of the noise-reduced audio sample produced by the model move ever closer to those of the clean audio sample. This helps improve the noise reduction capability of the audio noise reduction model and thus the audio noise reduction effect.
In one embodiment, the step S505 is performed to train the audio noise reduction model to be trained according to the difference between the spectral features of the noise reduced audio sample and the spectral features of the clean audio sample, so as to obtain a pre-trained audio noise reduction model, and specifically includes the following contents: obtaining a loss value according to the difference between the spectral characteristics of the noise-reduced audio sample and the spectral characteristics of the clean audio sample; training the audio noise reduction model to be trained according to the loss value until reaching the training ending condition; and confirming the trained audio noise reduction model reaching the training ending condition as a pre-trained audio noise reduction model.
The loss value may be the MSE (Mean Square Error) or may be based on the SI-SDR (Scale-Invariant Signal-to-Distortion Ratio).
The training ending condition may be that the loss value is smaller than the preset loss value, or that the total number of training iterations reaches a preset number of training iterations.
Specifically, the terminal calculates the MSE or SI-SDR according to the difference between the spectral features of the noise-reduced audio sample and the spectral features of the clean audio sample, and takes it as the loss value of the audio noise reduction model to be trained. The clean audio sample is then taken as the training target, and the audio noise reduction model to be trained is iteratively trained according to the loss value, continuously adjusting the model parameters, until the trained audio noise reduction model meets the training ending condition, for example, the loss value calculated based on the output of the trained audio noise reduction model is smaller than the preset loss value; the trained audio noise reduction model is then taken as the pre-trained audio noise reduction model.
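Both loss candidates can be written in a few lines; note that SI-SDR is a quality metric (higher is better), so as a training loss one would minimize its negative. The random vectors below are stand-ins for real spectral features:

```python
import numpy as np

def mse(est, ref):
    """Mean squared error between estimated and clean features."""
    return np.mean(np.abs(est - ref) ** 2)

def si_sdr(est, ref):
    """Scale-Invariant Signal-to-Distortion Ratio in dB (higher is better;
    as a training loss one minimizes its negative)."""
    target = (np.dot(est, ref) / np.dot(ref, ref)) * ref  # projection onto ref
    return 10 * np.log10(np.sum(target**2) / np.sum((est - target)**2))

np.random.seed(0)
ref = np.random.randn(4096)                # stand-in for clean features
noisy = ref + 0.1 * np.random.randn(4096)  # lightly corrupted estimate
print(mse(noisy, ref), si_sdr(noisy, ref))
```

Scale invariance means `si_sdr(2 * est, ref)` equals `si_sdr(est, ref)`, so the metric rewards the model for getting the signal shape right rather than its absolute level.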
In this embodiment, according to the difference between the spectral features of the noise-reduced audio sample and the spectral features of the clean audio sample, the audio noise reduction model to be trained is iteratively trained, so that the accuracy of the noise-reduced spectral features output by the trained audio noise reduction model can be improved, and the noise reduction effect of the audio noise reduction model is further improved.
In one embodiment, in step S501, a noisy audio sample corresponding to the clean audio sample is obtained, which specifically includes the following steps: acquiring a noise audio sample and a preset signal-to-noise ratio; and performing fusion processing on the clean audio sample and the noise audio sample according to the preset signal-to-noise ratio to obtain the noisy audio sample corresponding to the clean audio sample.
The noise audio sample comprises noise audio and accompaniment audio. The noise audio refers to various types of noise, such as square, road, conference room, restaurant, coffee shop, and keyboard-click noise. The accompaniment audio refers to various types of accompaniment, such as instrumental accompaniment (piano, guitar, drums, and the like) and original songs.
The preset signal-to-noise ratio is taken from a preset range of -5 dB to 20 dB, and can be selected according to the actual scene.
The noisy audio sample can be formed by mixing and superposing the clean audio sample with noise audio, with accompaniment audio, or with both noise audio and accompaniment audio.
Specifically, the terminal acquires noise audio and accompaniment audio from a database and takes them as noise audio samples; it then determines a preset signal-to-noise ratio from the preset range according to the current noise reduction scene, and performs mixed superposition processing on the clean audio sample and the noise audio samples according to that signal-to-noise ratio to obtain several types of noisy audio samples: a first noisy audio sample formed by mixing the clean audio sample with the noise audio, a second noisy audio sample formed by mixing the clean audio sample with the accompaniment audio, and a third noisy audio sample formed by mixing the clean audio sample with both the noise audio and the accompaniment audio.
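The mixed superposition at a preset signal-to-noise ratio can be sketched as follows; the sampling rate, the test signals, and the 10 dB target are illustrative assumptions, not values fixed by the application:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Mix a clean signal with noise at a target SNR (in dB).
    Both inputs are 1-D float arrays of equal length."""
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10*log10(clean_power / scaled_noise_power)
    # equals the requested snr_db.
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

# A noisy sample at 10 dB SNR, inside the -5 dB .. 20 dB range above:
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s at 16 kHz
noise = rng.standard_normal(16000)
noisy = mix_at_snr(clean, noise, snr_db=10.0)
```

Mixing the clean sample with noise, with accompaniment, or with both simply means calling this routine with the corresponding interference signal (or their sum).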
In this embodiment, fusion processing is performed on the clean audio sample and the noise audio sample according to a preset signal-to-noise ratio to obtain the noisy audio sample corresponding to the clean audio sample. This makes it possible to simulate more complicated and variable background noise (such as the complex, non-stationary noise in a real singing scene), so that an audio noise reduction model trained on such noisy audio samples has a stronger noise reduction capability for audio data carrying different types of noise, further improving the audio noise reduction effect of the model.
In one embodiment, as shown in fig. 6, another audio noise reduction method is provided. Taking its application to a terminal as an example, the method includes the following steps:
step S601, performing frequency division processing on the audio data to be noise reduced to obtain sub audio data to be noise reduced of the audio data to be noise reduced.
Step S602, framing the sub-audio data to be noise reduced to obtain an audio frame to be noise reduced of the sub-audio data to be noise reduced; and carrying out feature extraction processing on the audio frame to be noise-reduced to obtain the frequency spectrum features of the audio frame to be noise-reduced.
Step S603, combining the spectral features of the audio frame to be noise reduced to obtain the spectral features of the sub-audio data to be noise reduced.
Step S604, performing feature extraction processing on the spectral features through a pre-trained audio noise reduction model, so as to obtain processed spectral features of the sub-audio data to be noise reduced.
Step S605, performing noise reduction processing related to a frequency domain on the processed frequency spectrum characteristics to obtain first noise-reduced frequency spectrum characteristics of the sub-audio data to be noise reduced; and performing time-dependent noise reduction processing on the first noise-reduced frequency spectrum characteristic to obtain a second noise-reduced frequency spectrum characteristic of the sub-audio data to be noise-reduced.
Step S606, performing feature extraction processing on the second denoised spectral feature to obtain denoised spectral features of the sub-audio data to be denoised.
Step S607, confirming the noise-reduced spectrum feature of the audio frame to be noise-reduced from the noise-reduced spectrum features; and carrying out signal reconstruction processing on the noise-reduced frequency spectrum characteristics of the audio frame to be noise-reduced to obtain the audio frame after noise reduction of the audio frame to be noise-reduced.
Step S608, the denoised audio frames are combined to obtain denoised sub-audio data corresponding to the sub-audio data to be denoised.
Step S609, the noise-reduced sub-audio data are combined to obtain noise-reduced audio data corresponding to the audio data to be noise-reduced.
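A minimal sketch of the pipeline above (steps S601-S609), assuming a two-band FFT-based frequency division and a stand-in identity model; the application does not fix the filterbank design or the model internals, so both are purely illustrative:

```python
import numpy as np

def split_bands(audio, sr=16000, cutoff=4000):
    """Illustrative two-band frequency division (step S601) using an FFT
    brick-wall split; the patent does not specify the filterbank."""
    spec = np.fft.rfft(audio)
    edge = int(cutoff * len(audio) / sr)
    low, high = spec.copy(), spec.copy()
    low[edge:] = 0    # keep only the low band
    high[:edge] = 0   # keep only the high band
    return [np.fft.irfft(low, len(audio)), np.fft.irfft(high, len(audio))]

def denoise(audio, model):
    """Steps S601-S609: split into sub-bands, transform, denoise each band's
    spectrum, reconstruct, and sum the bands back (sub-band synthesis)."""
    out = np.zeros_like(audio)
    for band in split_bands(audio):
        spec = np.fft.rfft(band)                        # spectral features
        denoised_spec = model(spec)                     # pre-trained model (stub)
        out += np.fft.irfft(denoised_spec, len(audio))  # signal reconstruction
    return out

# With an identity "model", the analysis/synthesis chain is lossless,
# so the pipeline reconstructs the input exactly.
audio = np.random.default_rng(2).standard_normal(16000)
restored = denoise(audio, model=lambda s: s)
```

Replacing the identity lambda with a trained noise reduction model yields the noise-reduced audio data of step S609.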
In this embodiment, performing frequency division processing on the audio data to be noise-reduced makes it possible to better capture human voice in the low frequency band and noise in each frequency band. This allows the subsequent pre-trained audio noise reduction model to perform targeted noise reduction on the spectral features of each piece of sub-audio data to be noise-reduced, enabling more comprehensive noise reduction and avoiding leaving noise unprocessed. As a result, the noise-reduced audio data reconstructed from the noise-reduced spectral features of the sub-audio data has higher audio quality, improving the noise reduction effect on the audio data to be noise-reduced.
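The framing and spectral feature extraction of steps S602-S603 above can be sketched with standard windowed short-time Fourier analysis; the frame length, hop size, and window below are illustrative choices, not values fixed by this application:

```python
import numpy as np

def spectral_features(sub_audio, frame_len=512, hop=256):
    """Frame a sub-band signal (step S602) and take the complex spectrum of
    each frame, then stack the per-frame spectra into the spectral features
    of the whole sub-audio segment (step S603)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(sub_audio) - frame_len) // hop
    frames = np.stack([sub_audio[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft of a real frame of length 512 gives 257 frequency bins per frame.
    return np.fft.rfft(frames, axis=1)
```

The inverse operation in step S607 would apply an inverse FFT per frame followed by overlap-add to rebuild the noise-reduced audio frames.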
In one embodiment, as shown in fig. 7, a training method for an audio noise reduction model is provided. Taking its application to a terminal as an example, the method includes the following steps:
Step S701, a clean audio sample, a noise audio sample and a preset signal-to-noise ratio are obtained.
Step S702, fusion processing is carried out on the clean audio sample and the noise audio sample according to a preset signal-to-noise ratio, and a noisy audio sample corresponding to the clean audio sample is obtained.
Step S703, performing frequency division processing on the noisy audio sample to obtain a noisy sub-audio sample of the noisy audio sample.
Step S704, obtaining the spectral feature of the noisy sub-audio sample.
Step S705, inputting the spectral features of the noisy sub-audio sample to the audio noise reduction model to be trained, and obtaining the noise-reduced spectral features of the noisy sub-audio sample.
Step S706, performing signal reconstruction processing on the noise-reduced spectral features of the noisy sub-audio sample to obtain a noise-reduced sub-audio sample corresponding to the noisy sub-audio sample.
Step S707, combining the noise-reduced sub-audio samples to obtain a noise-reduced audio sample corresponding to the noisy audio sample.
Step S708, obtaining a loss value according to the difference between the spectral features of the noise-reduced audio sample and the spectral features of the clean audio sample.
Step S709, training the audio noise reduction model to be trained according to the loss value until reaching the training ending condition.
Step S710, confirming the trained audio noise reduction model reaching the training end condition as a pre-trained audio noise reduction model.
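The training loop of steps S708-S710 can be illustrated with a toy model: a per-frequency-bin spectral gain trained to minimize the MSE until the loss falls below a preset loss value. The gain model, data, and update rule are purely assumptions for illustration; the model in this application is a neural network:

```python
import numpy as np

rng = np.random.default_rng(3)
clean_spec = np.abs(rng.standard_normal(257))
# Noisy spectrum = clean + noise floor (the +0.01 keeps every bin nonzero).
noisy_spec = clean_spec + 0.5 * np.abs(rng.standard_normal(257)) + 0.01

gain = np.ones(257)          # one learnable gain per frequency bin
loss_threshold = 1e-4        # preset loss value
for step in range(1000):
    denoised = gain * noisy_spec                      # model output
    loss = np.mean((denoised - clean_spec) ** 2)      # loss value (step S708)
    if loss < loss_threshold:                         # training ending condition
        break                                         # (step S709)
    # Damped per-bin Newton step for this quadratic toy objective;
    # it halves the per-bin error on every iteration.
    gain -= 0.5 * (denoised - clean_spec) / noisy_spec
```

On exit, the converged `gain` plays the role of the pre-trained model of step S710.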
In this embodiment, the audio noise reduction model to be trained is iteratively trained on the clean audio sample and the noisy audio sample corresponding to it, so that the spectral features of the noise-reduced audio sample obtained from the model move continuously closer to the spectral features of the clean audio sample, which is favorable for improving the noise reduction capability of the model and further improving the audio noise reduction effect.
In one embodiment, to more clearly illustrate the audio noise reduction method provided by the embodiments of the present application, a specific embodiment is described below. Referring to fig. 8, the present application further provides another training method for an audio noise reduction model, which may be applied to a terminal and specifically includes the following:
referring to fig. 8, in a singing voice noise reduction scene, the terminal acquires accompaniment audio samples, noise audio samples, and clean audio samples from a database, and, according to a preset signal-to-noise ratio, performs mixed superposition processing on the clean audio sample with the accompaniment audio sample, with the noise audio sample, or with both, so as to obtain several types of noisy audio samples. Then, frequency division processing is performed on each noisy audio sample to obtain a plurality of sub-bands (namely noisy sub-audio samples); a Fourier transform is applied to each sub-band to obtain its spectral features; the spectral features of each sub-band are input into the audio noise reduction model to be trained to obtain noise-reduced spectral features for each sub-band; an inverse Fourier transform is applied to the noise-reduced spectral features of each sub-band to obtain a plurality of reconstructed sub-bands; and the reconstructed sub-bands are combined by sub-band synthesis to obtain a noise-reduced audio sample. A Fourier transform is then applied to the clean audio sample and the noise-reduced audio sample to obtain their respective spectral features. Finally, taking the clean audio sample as the training target, a loss value is calculated from the difference between the spectral features of the noise-reduced audio sample and those of the clean audio sample; while the loss value is greater than or equal to a preset loss value, the audio noise reduction model to be trained is iteratively trained according to the loss value, and training stops once the loss value obtained from the output of the trained model is smaller than the preset loss value, at which point the trained audio noise reduction model is taken as the pre-trained audio noise reduction model.
In addition, referring to fig. 9, the present application also provides another audio noise reduction method, which can be applied to a terminal, and specifically includes the following contents:
referring to fig. 9, the terminal acquires singing voice data recorded by a user as the audio data to be noise-reduced; performs frequency division processing on it to obtain a plurality of sub-bands (namely the sub-audio data to be noise-reduced); applies a Fourier transform to each sub-band to obtain its spectral features; inputs the spectral features of each sub-band into the pre-trained audio noise reduction model for noise reduction processing to obtain noise-reduced spectral features for each sub-band; applies an inverse Fourier transform to the noise-reduced spectral features of each sub-band to obtain a plurality of reconstructed sub-bands; and combines the reconstructed sub-bands by sub-band synthesis to obtain the noise-reduced audio data, thereby obtaining noise-reduced singing voice data.
In this embodiment, on the one hand, the audio noise reduction model to be trained is iteratively trained on clean audio samples and the noisy audio samples corresponding to them, so that the spectral features of the noise-reduced audio samples obtained from the model move continuously closer to the spectral features of the clean audio samples, which is favorable for improving the noise reduction capability of the model and further improves the audio noise reduction effect. On the other hand, the pre-trained audio noise reduction model performs targeted noise reduction on each sub-band, enabling more comprehensive noise reduction and avoiding leaving noise unprocessed, so that the reconstructed noise-reduced audio data is cleaner; this effectively improves clarity, intelligibility, signal fidelity, and the like, further improving the noise reduction effect on the noise-reduced audio data.
It should be understood that, although the steps in the flowcharts related to the embodiments described above are sequentially shown as indicated by arrows, these steps are not necessarily sequentially performed in the order indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of steps or a plurality of stages, which are not necessarily performed at the same time, but may be performed at different times, and the order of the steps or stages is not necessarily performed sequentially, but may be performed alternately or alternately with at least some of the other steps or stages.
In one embodiment, a computer device is provided, which may be a terminal, and an internal structure diagram thereof may be as shown in fig. 10. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input means. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface, the display unit and the input device are connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement an audio noise reduction method. The display unit of the computer device is used for forming a visual picture, and can be a display screen, a projection device or a virtual reality imaging device. The display screen can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be a key, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the structure shown in FIG. 10 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the method embodiments described above.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, high density embedded nonvolatile Memory, resistive random access Memory (ReRAM), magnetic random access Memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric Memory (Ferroelectric Random Access Memory, FRAM), phase change Memory (Phase Change Memory, PCM), graphene Memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can be in the form of a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like. The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The foregoing examples illustrate only a few embodiments of the application and are described in detail herein without thereby limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of the application should be assessed as that of the appended claims.

Claims (10)

1. A method of audio noise reduction, the method comprising:
performing frequency division processing on the audio data to be noise reduced to obtain sub audio data to be noise reduced of the audio data to be noise reduced;
acquiring the frequency spectrum characteristics of the sub-audio data to be noise reduced;
carrying out noise reduction treatment on the frequency spectrum characteristics through a pre-trained audio noise reduction model to obtain noise-reduced frequency spectrum characteristics of the sub-audio data to be noise-reduced;
Performing signal reconstruction processing on the noise-reduced frequency spectrum characteristics to obtain noise-reduced sub-audio data corresponding to the sub-audio data to be noise-reduced;
and combining the noise-reduced sub-audio data to obtain noise-reduced audio data corresponding to the audio data to be noise-reduced.
2. The method according to claim 1, wherein the noise reduction processing is performed on the spectral features by using a pre-trained audio noise reduction model to obtain noise-reduced spectral features of the sub-audio data to be noise reduced, including:
performing feature extraction processing on the frequency spectrum features through a pre-trained audio noise reduction model to obtain processed frequency spectrum features of the sub-audio data to be noise reduced;
carrying out noise reduction processing related to a frequency domain on the processed frequency spectrum characteristics to obtain first noise-reduced frequency spectrum characteristics of the sub-audio data to be noise-reduced;
performing noise reduction processing related to time on the first noise-reduced frequency spectrum characteristic to obtain a second noise-reduced frequency spectrum characteristic of the sub-audio data to be noise-reduced;
and carrying out feature extraction processing on the second noise-reduced frequency spectrum features to obtain the noise-reduced frequency spectrum features of the sub-audio data to be noise-reduced.
3. The method of claim 2, wherein the pre-trained audio noise reduction model comprises a first convolutional layer, a first hole convolutional network, a second hole convolutional network, and a second convolutional layer connected in sequence, the first hole convolutional network and the second hole convolutional network each comprise a plurality of sub-hole convolutional networks connected in sequence, each sub-hole convolutional network comprising a hole convolutional layer, a normalization layer, and an activation layer connected in sequence.
4. The method of claim 1, wherein the obtaining spectral features of the sub-audio data to be denoised comprises:
carrying out framing treatment on the sub-audio data to be noise reduced to obtain an audio frame to be noise reduced of the sub-audio data to be noise reduced;
performing feature extraction processing on the audio frame to be noise-reduced to obtain the frequency spectrum features of the audio frame to be noise-reduced;
and combining the frequency spectrum characteristics of the audio frame to be noise reduced to obtain the frequency spectrum characteristics of the sub audio data to be noise reduced.
5. The method of claim 4, wherein the performing signal reconstruction processing on the denoised spectral feature to obtain denoised sub-audio data corresponding to the sub-audio data to be denoised comprises:
From the denoised spectral features, confirming denoised spectral features of the audio frame to be denoised;
performing signal reconstruction processing on the noise-reduced frequency spectrum characteristics of the audio frame to be noise-reduced to obtain the audio frame after noise reduction of the audio frame to be noise-reduced;
and combining the noise-reduced audio frames to obtain noise-reduced sub-audio data corresponding to the sub-audio data to be noise-reduced.
6. The method of claim 1, wherein the pre-trained audio noise reduction model is derived by:
acquiring a clean audio sample and a noisy audio sample corresponding to the clean audio sample;
inputting the spectral characteristics of the noisy sub-audio sample of the noisy audio sample to an audio noise reduction model to be trained to obtain the noise-reduced spectral characteristics of the noisy sub-audio sample;
performing signal reconstruction processing on the noise-reduced spectral features of the noisy sub-audio sample to obtain a noise-reduced sub-audio sample corresponding to the noisy sub-audio sample;
combining the noise-reduced sub-audio samples to obtain a noise-reduced audio sample corresponding to the noisy audio sample;
and training the audio noise reduction model to be trained according to the difference between the spectral features of the noise-reduced audio sample and the spectral features of the clean audio sample to obtain the pre-trained audio noise reduction model.
7. The method of claim 6, wherein training the audio noise reduction model to be trained based on differences between spectral features of the noise reduced audio samples and spectral features of the clean audio samples to obtain the pre-trained audio noise reduction model comprises:
obtaining a loss value according to the difference between the spectral characteristics of the noise-reduced audio sample and the spectral characteristics of the clean audio sample;
training the audio noise reduction model to be trained according to the loss value until reaching a training ending condition;
and confirming the trained audio noise reduction model reaching the training ending condition as the pre-trained audio noise reduction model.
8. The method of claim 6, wherein the noisy audio samples corresponding to the clean audio samples are obtained by:
acquiring a noise audio sample and a preset signal-to-noise ratio;
and performing fusion processing on the clean audio sample and the noise audio sample according to the preset signal-to-noise ratio to obtain the noisy audio sample corresponding to the clean audio sample.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 8 when the computer program is executed.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 8.
CN202311130009.XA 2023-08-31 2023-08-31 Audio noise reduction method, computer device and storage medium Pending CN117012221A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311130009.XA CN117012221A (en) 2023-08-31 2023-08-31 Audio noise reduction method, computer device and storage medium


Publications (1)

Publication Number Publication Date
CN117012221A true CN117012221A (en) 2023-11-07



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination