Specific embodiment
To enable those skilled in the art to better understand the present solution, the technical schemes in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings.
In recent years, with the rapid development of voice communication technology and users' ever-higher demands on voice quality, research on voice denoising based on supervised deep neural network learning has accelerated. Voice denoising refers to separating a target voice signal from background noise, thereby eliminating or suppressing that noise. In one approach, a large number of real target voice signals and noise signals are mixed at random and used as the input of a neural network; after supervised training, the neural network can automatically learn to output the target voice signal from the training samples, improving the denoising effect. However, as the sample rate of the target voice increases, the computation load of the neural network grows continually, preventing wide application.
For example, when the target voice signal is a person's singing voice, the sample rate of the audio signal is usually 44.1 kHz. Song denoising is a relatively special scenario within voice denoising: traditional voice denoising techniques perform poorly on high-sample-rate audio, and existing deep-neural-network-based voice denoising methods tend to have too many network parameters, and therefore too large a computation load, to be applied to the song denoising scenario.
In view of the above problem, the inventors found through long-term research that, for a segment of an audio signal, the non-stationarity of the signal input to a neural network model increases the amount of computation, and directly inputting the audio signal into a convolutional neural network model further increases that amount while yielding little denoising benefit. To reduce the computation load of the neural network and improve the denoising effect, the inventors found that the spectral energy of the audio can be converted from the linear frequency domain to the Bark domain, the Bark-domain representation of the spectral energy can be used as the Bark feature, and the Bark feature can then be used as the input of the neural network. Furthermore, building the neural network from a novel separation-gate convolutional layer structure enlarges the learning field of view of the convolution kernels while limiting the growth in computation, and increases the nonlinearity of the neural network model, improving the audio denoising effect.
Accordingly, the embodiments of the present application provide a method, apparatus, electronic device, and storage medium for audio denoising: the Bark feature of a first audio is input into a pre-trained target neural network model, the amplitude parameters of the voice signal are then calculated, and the target voice is obtained based on the amplitude parameters, reducing the computation load of the neural network model and reducing the background noise in the voice information.
To facilitate a detailed description of the present solution, the separation-gate convolutional neural network model provided by the embodiments of the present application is first described below with reference to the accompanying drawings.
Referring to Fig. 1, which is a schematic network structure diagram of an illustrative separation-gate convolutional layer in the voice denoising method provided by the embodiments of the present application. The separation-gate convolutional layer includes four two-dimensional convolutional layers, a first activation function module, and a second activation function module. In one implementation, the four layers are a first causal convolutional layer, a second causal convolutional layer, a third convolutional layer, and a fourth convolutional layer. The first activation function module is connected to the third convolutional layer, and the second activation function module is connected to the fourth convolutional layer. The convolution kernel size of the first causal convolutional layer may be kw*1, the kernel size of the second causal convolutional layer may be 1*kh, and the kernel sizes of the third and fourth convolutional layers may be identical. The kernel channel numbers of the four two-dimensional convolutional layers are the same; that is, the first causal convolutional layer, the second causal convolutional layer, the third convolutional layer, and the fourth convolutional layer all have the same number of kernel channels (for example, c, as shown in Fig. 1). With the third convolutional layer connected to the first activation function module and the fourth convolutional layer connected to the second activation function module, the output of the first activation function module can be multiplied by the output of the second activation function module to obtain the final output of the separation-gate convolutional layer.
The first activation function module may use ReLU (Rectified Linear Unit), and the second activation function module may use the Sigmoid function. Optionally, in actual implementations, the first and second activation function modules may also use other functions; no limitation is imposed here.
Optionally, the embodiments of the present application do not limit the specific values of kw, kh, and c. In one implementation, by adjusting these three parameters, the separation-gate convolutional layer can learn the voice feature information of the input more effectively, and thus better recognize the needed target voice or remove the background noise from the noisy audio.
Optionally, in one implementation, as shown in Fig. 1, the four two-dimensional convolutional layers may respectively be a causal convolutional layer (kw*1, c), a causal convolutional layer (1*kh, c), and two separated convolutional layers (1*1, c), where kw and kh are the kernel sizes of the causal convolutional layers and c is the number of kernel channels. By separating a convolution kernel into two strip-shaped kernels (i.e., the two separated convolutional layers (1*1, c) shown in Fig. 1), the learning field of view of the convolutional layer can be enlarged, while the separated structure reduces the convolutional layer's computation load.
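The separated, gated structure described above can be sketched in a few lines of numpy. This is a simplified illustration, not the patent's actual network: the c-channel 1*1 convolutions of the third and fourth layers are stood in for by the hypothetical scalar weights w3 and w4, and the kernel values are toy choices.

```python
import numpy as np

def causal_conv_time(x, k):
    # x: (T, F); k: (kw,) kernel applied along time, causally
    # (output at frame t depends only on frames <= t)
    T, F = x.shape
    xp = np.vstack([np.zeros((len(k) - 1, F)), x])  # left-pad in time
    return np.stack([np.tensordot(k, xp[t:t + len(k)], axes=(0, 0))
                     for t in range(T)])

def conv_freq(x, k):
    # 1*kh convolution along the frequency axis ("same" padding)
    return np.stack([np.convolve(row, k, mode="same") for row in x])

def gated_separable_block(x, k_time, k_freq, w3, w4):
    # Separated kernels: kw*1 causal conv, then 1*kh conv
    z = conv_freq(causal_conv_time(x, k_time), k_freq)
    a = np.maximum(w3 * z, 0.0)           # first activation module: ReLU
    b = 1.0 / (1.0 + np.exp(-w4 * z))     # second activation module: Sigmoid
    return a * b                          # element-wise gating product

x = np.random.default_rng(0).standard_normal((10, 8))
y = gated_separable_block(x, np.array([0.5, 0.5]),
                          np.array([0.25, 0.5, 0.25]), 1.0, 1.0)
```

The causal time convolution sees only past frames, while splitting the kw*kh kernel into kw*1 and 1*kh strips keeps the cost linear rather than quadratic in kernel size.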
To facilitate a better understanding of the present solution, the Bark feature involved in the embodiments of the present application is first explained below.
The Bark domain is a psychoacoustic scale of sound. Because of the special structure of the human cochlea, the human auditory system exhibits a series of critical bands. A critical band is a frequency band of sound within which masking readily occurs: a voice signal inside a critical band is easily masked by another signal of higher energy and nearby frequency, so that the human auditory system cannot perceive it. In one approach, if the frequency dimension of a voice signal is divided into critical bands, each critical band becomes one Bark, and the voice signal is thereby transformed from the linear frequency domain into the Bark domain. Optionally, the embodiments of the present application transform the voice signal from the linear frequency domain into the Bark domain using the following formula:

Bark(f) = 13*arctan(0.00076*f) + 3.5*arctan((f/7500)^2),
where arctan is the arctangent function, f is the linear frequency of the voice signal, and Bark(f) is the Bark-domain representation of the voice signal.
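As a sketch, the linear-frequency-to-Bark mapping can be implemented directly. The constants (13, 0.00076, 3.5, 7500) follow the widely used Zwicker-style approximation and are an assumption here; the embodiment's own constants are not reproduced in the surviving text.

```python
import numpy as np

def bark(f):
    # Zwicker-style critical-band (Bark) scale; f in Hz
    f = np.asarray(f, dtype=float)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

# Map a few linear frequencies up to the 44.1 kHz Nyquist limit
freqs = np.array([0.0, 1000.0, 10000.0, 22050.0])
bands = bark(freqs)
```

The mapping is monotonic, so each linear-frequency bin falls into exactly one critical band, which is what makes the matrix conversion below possible.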
Optionally, after the voice signal is transformed from the linear frequency dimension into the Bark-domain dimension, the audio spectral energy feature of the voice signal in the linear frequency dimension needs to be converted into the Bark feature of the Bark-domain dimension. In one approach, the value obtained after the short-time Fourier transform of the audio (i.e., the above voice signal), namely the spectral feature of the audio (also called its frequency-domain representation), can be expressed as:

stft(t,f) = x(t,f) + i*y(t,f),

where stft(t,f) denotes the spectral feature in the frequency domain and consists of a vector, i.e., the x + yi in the formula; x represents the real part of the spectral feature and y represents its imaginary part.
Further, the linear spectral energy of the audio can be calculated by the following formula:

stft_energy(t,f) = x(t,f)^2 + y(t,f)^2.
The linear spectral energy feature can then be converted into the Bark feature as:

Bark_feature = mat_mul(stft_energy, stft2bark_matrix),

where mat_mul denotes matrix multiplication and stft2bark_matrix denotes the transition matrix of the Bark feature.
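The energy-to-Bark conversion above is a single matrix multiply. The sketch below builds a toy stft2bark_matrix in which each linear-frequency bin contributes to the one Bark band it falls in; a real transition matrix might use overlapping band weights, so this construction is an illustrative assumption.

```python
import numpy as np

n_freq, n_bark = 1025, 48          # bins from a 2048-point STFT; 48 Bark dims
sr, n_fft = 44100, 2048
f = np.arange(n_freq) * sr / n_fft                       # bin center frequencies
bark_of_f = 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

# Assign each bin to one of n_bark bands proportionally to its Bark value
band = np.minimum((bark_of_f / bark_of_f[-1] * n_bark).astype(int), n_bark - 1)
stft2bark_matrix = np.zeros((n_freq, n_bark))
stft2bark_matrix[np.arange(n_freq), band] = 1.0

stft_energy = np.random.default_rng(1).random((100, n_freq))  # (frames, bins)
bark_feature = stft_energy @ stft2bark_matrix                 # mat_mul
```

With 0/1 band assignments the conversion simply pools energy per critical band, so total energy per frame is preserved.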
Optionally, what the neural network outputs after learning the Bark feature is the ratio bark_mask of the Bark value of the target voice (e.g., a song) to the Bark value of the noisy audio. Using the same conversion principle, the present application obtains the mask, i.e., the ratio of the spectral amplitude of the target voice to the spectral amplitude of the noisy audio in the linear frequency dimension, with the conversion formula:

Mask = mat_mul(bark_mask, bark2stft_matrix),

where bark2stft_matrix is the inverse conversion matrix of the Bark feature.
In one implementation of the embodiments of the present application, a target neural network model including at least one separation-gate convolutional layer and at least one long short-term memory layer is used to remove the background noise signal from the noisy audio. Inputting the Bark feature of the above audio into the target neural network model yields the denoised audio feature, i.e., the audio feature of the target voice. Optionally, the separation-gate convolutional layer is used to output the texture features of the corresponding target voice signal from the noisy audio features, and the long short-term memory layer is used to output the denoised audio feature from those texture features, i.e., the Bark-domain spectral feature of the target voice (including spectral amplitude and spectral energy). Here, the long short-term memory layer is a Long Short-Term Memory network (LSTM). The LSTM used in the present application is briefly described below with reference to the accompanying drawings.
Referring to Fig. 2, which is a structural schematic diagram of the long short-term memory network suitable for the voice denoising method provided by the embodiments of the present application. As shown in Fig. 2, the LSTM includes three control gates: a forget gate, an input gate, and an output gate. The activation function σ in each gate denotes the sigmoid activation function. The sigmoid activation function processes the previous layer's output h_(t-1) together with the current input X_t, and the data to be forgotten from the previous cell state C_(t-1) can be determined by the following formula:

f_t = σ(W_f·[h_(t-1), X_t] + b_f),

where a value of f_t equal to 0 indicates complete forgetting and a value of 1 indicates complete retention.
Further, the sigmoid activation function determines which information to accept, and tanh generates a new candidate value C~_t. Combining the two, the cell state C_(t-1) can be updated by the following formulas:

i_t = σ(W_i·[h_(t-1), X_t] + b_i),
C~_t = tanh(W_C·[h_(t-1), X_t] + b_C),
C_t = f_t·C_(t-1) + i_t·C~_t.
Further, the activation function determines which part of the information to output, tanh generates the new output candidate value, and the final output h_t of the hidden layer is obtained:

o_t = σ(W_o·[h_(t-1), X_t] + b_o),
h_t = o_t·tanh(C_t).
Optionally, the LSTM may include multiple layers of the structure shown in Fig. 2. Each layer takes the previous layer's hidden-layer output, the state vector, and the currently input data as input, and updates the next layer's hidden-layer output and state vector, so that key information from the past can be preserved for predicting future information.
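The gate equations above can be collected into a single numpy step function. The stacked weight layout and the layer sizes below are illustration choices, not the embodiment's parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W: (4*hidden, hidden+input), b: (4*hidden,), gates stacked as f, i, g, o
    hid = h_prev.size
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[:hid])               # forget gate f_t
    i = sigmoid(z[hid:2 * hid])        # input gate i_t
    g = np.tanh(z[2 * hid:3 * hid])    # candidate cell value (tanh branch)
    o = sigmoid(z[3 * hid:])           # output gate o_t
    c = f * c_prev + i * g             # updated cell state C_t
    h = o * np.tanh(c)                 # hidden-layer output h_t
    return h, c

rng = np.random.default_rng(2)
hid, inp = 4, 3
W = rng.standard_normal((4 * hid, hid + inp)) * 0.1
h, c = lstm_step(rng.standard_normal(inp), np.zeros(hid),
                 np.zeros(hid), W, np.zeros(4 * hid))
```

Chaining lstm_step over the frames of a Bark-feature sequence reproduces the "keep past key information" behavior the text describes.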
The present embodiments are specifically described below with reference to the accompanying drawings.

Referring to Fig. 3, which shows a flowchart of a voice denoising method provided by the embodiments of the present application. This embodiment provides a voice denoising method applicable to the above electronic device, the method including:
Step S110: obtain a first audio, the first audio being an audio in which a voice signal and a background noise signal are mixed.
The first audio may be an audio in which a voice signal and a background noise signal at a target sample rate are mixed. Optionally, the target sample rate may be a high sample rate, such as 44.1 kHz or 48 kHz, or a non-high sample rate, such as 11.025 kHz, 22.05 kHz, or 24 kHz; this embodiment does not limit the specific value of the target sample rate. The voice signal characterizes a clean voice signal or a voice signal doped with little noise. In one implementation, the voice signal may come from a segment of audio, such as a piece of singing or a recorded segment of speech; alternatively, the voice signal may come from a video, i.e., it may be a voice signal intercepted from a video. The source of the voice signal is not specifically limited.
Optionally, the first audio in the embodiments of the present application may be a song (the sample rate is usually 44.1 kHz).
The background noise signal refers to a sound signal that interferes with the voice signal; it may originate from ambient sound or from electromagnetic interference in the surrounding environment, among other sources. Background noise can sharply degrade the performance of many speech processing systems and severely affect the user experience. Understandably, the first audio inevitably contains a background noise signal. Therefore, to reduce the influence of the background noise signal on the voice signal and improve the user experience, this embodiment obtains the first audio and processes it accordingly so as to reduce its background noise signal.
Optionally, to improve the voice functionality of the electronic device, the electronic device may monitor audio signals in real time; in that case, the electronic device may identify any segment of audio (including the audio data in a video) as the first audio, so that the background noise of the first audio can be reduced in real time.
The electronic device may obtain the first audio in several ways.
In one implementation, the electronic device may obtain, through an audio system program, the audio data of a third-party client program that includes audio data, and thereby obtain the first audio. For example, the audio system program may obtain the game audio generated by a gaming application while it is running, the singing audio of a singing application while it is running, the playback audio of a video-playing application while it is running, or the startup sound of the electronic device during startup; optionally, taking any of the above audio as the first audio realizes obtaining the first audio.
Alternatively, the electronic device may obtain audio data from a network in real time as the first audio, for example, taking the voice-over of an advertisement on a certain website as the first audio. Optionally, the electronic device may also take remotely downloaded audio data as the first audio, or record a segment of a user's voice as the first audio. The source and format of the first audio are unrestricted and will not be enumerated here.
Step S120: preprocess the first audio to obtain the Bark feature of the first audio.
In one implementation, the preprocessing in the embodiments of the present application may refer to transforming the first audio from the linear frequency dimension into the frequency-domain dimension for processing. Specifically, the spectral feature of the first audio is transformed from the linear frequency dimension into the Bark frequency domain to obtain the Bark feature of the first audio. The spectral feature includes the spectral energy feature and the spectral amplitude feature of the first audio; optionally, the value of the spectral energy feature equals the square of the value of the spectral amplitude feature. The Bark feature of the first audio can thus be understood as the representation of the first audio's spectral energy feature in the Bark frequency domain.
Understandably, the voice signal of the first audio is a non-stationary signal. By preprocessing the first audio, the linearity of the Bark feature can be reduced, so that after the Bark feature is input into the pre-trained target neural network model, the background noise signal of the first audio can be removed more efficiently.
Step S130: input the Bark feature into a pre-trained target neural network model to obtain the Bark feature ratio parameter output by the target neural network model.
As described above, the target neural network model in the embodiments of the present application includes at least one separation-gate convolutional layer and at least one long short-term memory layer. It should be noted that the embodiments of the present application do not restrict the specific numbers or ordering of the separation-gate convolutional layers and long short-term memory layers, which can be set according to the actual situation.
For example, in one implementation the target neural network model may include three separation-gate convolutional layers and two long short-term memory layers. In this case, to obtain a better denoising effect, this implementation designs the loss function and adopts the adaptive moment estimation method (ADAM), so that the loss function reduces the amplitude distortion of the first audio's voice signal after the Bark feature is input into the target neural network model; the target neural network model then learns the newly obtained first audio according to the above network structure of three separation-gate convolutional layers and two long short-term memory layers, combined with adaptive moment estimation, so as to better reduce the background noise signal in the first audio. Specifically, the adaptive moment estimation involved in the embodiments of the present application uses momentum factors BETA1 = 0.9 and BETA2 = 0.999; the base learning rate (LEARNING_RATE) is set to 0.001, and for every 300,000 additional iterations the learning rate falls to 0.3 of its value. In this embodiment, the training batch size (BATCH_SIZE) is set to 32, i.e., 32 training audios are input per network training iteration, and samples may be drawn repeatedly. Training ultimately runs about 1,000,000 iterations, so that the loss converges near its minimum.
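Under the stated schedule (base rate 0.001, falling to 0.3 of its value every 300,000 iterations), the learning rate can be sketched as below. That the decay compounds per 300,000-iteration step is an assumption based on the wording.

```python
def learning_rate(iteration, base_lr=0.001, decay=0.3, step=300_000):
    # Base rate multiplied by `decay` once per completed `step` iterations
    return base_lr * decay ** (iteration // step)
```

For example, the rate stays at 0.001 through iteration 299,999, drops to 0.0003 at 300,000, and to 0.00009 at 600,000.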
Optionally, the Bark feature ratio parameter can characterize the proportion that the spectral amplitude feature of the voice signal occupies in the Bark frequency domain, i.e., the ratio of the spectral amplitude feature of the voice signal to the spectral amplitude feature of the first audio (which includes both the voice signal and the background noise signal). In one implementation, the Bark feature of the first audio is input into the separation-gate convolutional layer, the output of the separation-gate convolutional layer is input into the long short-term memory layer, and the Bark feature ratio parameter can then be output (the target neural network model can automatically learn the Bark feature ratio parameter of the voice signal).
For example, in a specific application scenario, assume that a segment of noisy audio (i.e., the first audio) has a spectral amplitude of 1, and that the whole audio is composed of voice (i.e., the voice signal) with a spectral amplitude of 0.8 and noise (i.e., the background noise signal) with a spectral amplitude of 0.2. Then, when the Bark feature corresponding to this noisy audio is input into the above target neural network model, the model can output the voice signal with spectral amplitude 0.8; that is, the target neural network model can "pick out" the voice signal from the noisy Bark feature, obtaining the Bark feature ratio parameter of the voice signal (here, 0.8).
Step S140: calculate the amplitude parameters of the voice signal based on the Bark feature ratio parameter.
In these cases, the amplitude parameters of the voice signal can be calculated based on the Bark feature ratio parameter. The amplitude parameters denote the spectral amplitude parameters of the voice signal; specifically, they include the voice signal's spectral amplitude ratio in the linear frequency domain, its spectral amplitude, and its spectral amplitude ratio in the Bark frequency domain. By calculating the amplitude parameters of the voice signal, the denoised voice signal can be converted from the Bark frequency domain back to the linear frequency domain, so that the waveform of the voice signal can also be restored, facilitating output of the voice signal.
Step S150: obtain the target voice based on the amplitude parameters.

The target voice refers to the voice signal obtained after denoising the first audio. Optionally, after the amplitude parameters are calculated based on the Bark feature ratio parameter, the target voice can be obtained based on the amplitude parameters, i.e., the voice signal obtained by removing the noise of the first audio.
In the voice denoising method provided in this embodiment, the Bark feature of a first audio in which a voice signal and a background noise signal are mixed is input into a pre-trained target neural network model; the Bark feature characterizing the voice signal is picked out, yielding the Bark feature ratio parameter that characterizes the proportion of the voice signal's spectral amplitude feature in the Bark frequency domain; the amplitude parameters of the voice signal are then calculated based on the Bark feature ratio parameter, and the target voice (the voice signal remaining after the background noise signal in the first audio is removed) is obtained based on the amplitude parameters. Denoising is thus achieved in a manner in which the target neural network model directly screens for the voice signal in the first audio, reducing the computation load of the neural network model.
Referring to Fig. 4, which shows a flowchart of a voice denoising method provided by another embodiment of the present application. This embodiment provides a voice denoising method applicable to the above electronic device, the method including:
Step S210: obtain a training sample set.
It should be noted that the embodiments of the present application may use the obtained training sample set to train, in advance, a target neural network model capable of recognizing voice signals and thereby achieving denoising; with this model, the noise signal in noisy audio can be better filtered out to obtain the voice signal.
The training sample set of the embodiments of the present application includes voice signals of a preset duration and background noise signals of a preset duration. Optionally, the preset duration may be any continuous or discrete length of time, and the preset durations of the voice signals and the background noise signals may be equal or unequal. For example, the preset duration of the voice signals may be 20 hours while that of the background noise signals is 10 hours; or the preset durations of the voice signals and background noise signals may both be 15 hours; no specific limitation is imposed.
Optionally, target songs of different timbres within a continuous preset duration may serve as the voice signals of the preset duration, or target songs of different timbres within a discontinuous preset duration (i.e., a preset duration with interruptions) may serve as the voice signals of the preset duration. Similarly, different types of background noise within a continuous preset duration may serve as the background noise signals of the preset duration, or different types of background noise within a discontinuous preset duration may serve as the background noise signals of the preset duration.
In one implementation, the preset duration may be obtained according to a preset acquisition mode, for example, acquiring in integer multiples of an hour; optionally, the voice signals and background noise signals may also be obtained at random, with the respective acquisition lengths of the voice signals and background noise signals taken as their preset durations.
In one implementation, the electronic device may take user-selected audio with a chronological order as the voice signals and background noise signals of the preset duration; it may also randomly grab audio data from a network as the voice signals and background noise signals of the preset duration; or it may take the audio data produced while an audio-class application of the electronic device is running as the voice signals and background noise signals of the preset duration. It is worth noting that the acquisition modes and content sources of the voice signals and background noise signals of the preset duration are unrestricted and can be selected according to the actual situation.
Step S220: superimpose the voice signals and background noise signals in the time domain according to a preset signal-to-noise ratio, input the superimposed training sample set into a machine learning model, and train the machine learning model to obtain the target neural network model.

Understandably, any segment of voice data without any denoising processing inevitably contains background noise; that is, it has a signal-to-noise ratio. The signal-to-noise ratio (SNR), also called the signal/noise ratio, refers to the ratio of signal to noise in an electronic device or electronic system. To increase the denoising accuracy of the target neural network model in the embodiments of the present application, so that the denoising algorithm adapts to audio data of different signal-to-noise ratios, the voice signals and background noise signals of the preset duration can be superimposed in the time domain according to the preset signal-to-noise ratio, the superimposed training sample set can be input into the machine learning model, and the machine learning model can be trained to obtain the target neural network model.
Optionally, the preset signal-to-noise ratio may be a random number between 0 and 20; the specific value is unrestricted.
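One common way to superimpose speech and noise at a chosen SNR is to scale the noise by the power ratio before adding, as sketched below. The embodiment does not give its mixing formula, so this power-based scaling, and treating the preset SNR as decibels, are assumptions.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    # Scale `noise` so that 10*log10(P_speech / P_noise) == snr_db,
    # then superimpose the two signals in the time domain.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

rng = np.random.default_rng(3)
speech = rng.standard_normal(44100)          # 1 s of "clean" signal (toy data)
noise = rng.standard_normal(44100)
snr_db = rng.uniform(0, 20)                  # random SNR in [0, 20]
mixed = mix_at_snr(speech, noise, snr_db)
```

Drawing snr_db at random per training pair is what exposes the model to noisy audio of many signal-to-noise ratios.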
The machine learning model may be a linear model, a kernel method or support vector machine, a decision tree, Boosting, or a neural network (including fully connected neural networks, convolutional neural networks, recurrent neural networks, etc.). For the specific training method of each machine learning model, reference can be made to its working principle in the prior art, which is not repeated here.
It should be noted that when the voice signals and background noise signals are superimposed in the time domain according to the preset signal-to-noise ratio, the durations of the voice signal and background noise signal taken from the training sample set are equal; for example, a 2.5-second voice signal and a 2.5-second background noise signal are chosen from the training sample set, so that the trained neural network model can adapt to noisy audio of more signal-to-noise ratios.
Step S230: obtain the first audio.

For a specific description of obtaining the first audio, reference can be made to the description of step S110 in the previous embodiment, which is not repeated here.
Step S240: perform framing and windowing on the first audio signal.

Since the first audio signal is a non-stationary signal, it needs to be framed and windowed. In one implementation, the embodiments of the present application use a Hanning window with a window length of 40 ms (milliseconds) and a sliding step of 10 ms. This embodiment does not specifically limit the window function used; other window functions, such as a rectangular window, may also be used.
In a specific application scenario, if the audio sample rate of the voice signal is 44.1 kHz, the Hanning window length is 1764 audio points and the sliding step is 441 audio points. Optionally, such a window length can improve the overall operation speed of the target neural network model on the premise that the voice signal is not distorted. Framing and windowing the first audio signal avoids abrupt changes between frames.
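The 40 ms / 10 ms Hanning-window framing at 44.1 kHz can be sketched as follows. The helper frame_signal is hypothetical, and trailing samples that do not fill a whole window are simply dropped.

```python
import numpy as np

sr = 44100
win_len = int(0.040 * sr)   # 40 ms Hanning window -> 1764 audio points
hop = int(0.010 * sr)       # 10 ms sliding step   -> 441 audio points
window = np.hanning(win_len)

def frame_signal(x, win, hop):
    # Split x into overlapping frames, each multiplied by the window
    n = 1 + (len(x) - len(win)) // hop
    return np.stack([x[i * hop:i * hop + len(win)] * win for i in range(n)])

x = np.random.default_rng(4).standard_normal(sr)   # 1 s of toy audio
frames = frame_signal(x, window, hop)
```

The tapered window ends are what suppress the inter-frame discontinuities mentioned above.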
Step S250: perform a short-time Fourier transform on the first audio signal in each window to obtain the Bark feature of the first audio.

Optionally, a short-time Fourier transform is performed on the first audio signal in each window, and the spectral energy feature of the first audio is transformed from the linear frequency domain into the Bark frequency domain, thereby obtaining the Bark feature of the first audio. Specifically, the embodiments of the present application set the number of short-time Fourier transform points to 2048, so that 1025 frequency-dimension values (i.e., stft values) are obtained after the transform. The dimension of the Bark feature taken in this embodiment is 48, so the dimension of the transition matrix stft2bark_matrix that transforms stft_energy into the Bark feature is 1025*48.
It should be noted that while performing the short-time Fourier transform on the signal in each window, the phase value of the first audio can also be calculated, with the specific calculation formula:

phase(t,f) = arctan(y(t,f)/x(t,f)),

where arctan is the arctangent function.
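A sketch of the per-window analysis under the embodiment's settings (a 2048-point transform over a 1764-point frame). np.arctan2 is used in place of a plain arctangent so the phase lands in the correct quadrant; the variable names are illustrative.

```python
import numpy as np

n_fft = 2048
frame = np.random.default_rng(5).standard_normal(1764)  # one windowed frame

# Zero-padded 2048-point real FFT of the frame -> 1025 frequency bins
spec = np.fft.rfft(frame, n=n_fft)
x, y = spec.real, spec.imag          # stft(t,f) = x + i*y for this frame
stft_energy = x ** 2 + y ** 2        # linear spectral energy
stft_mag = np.sqrt(stft_energy)      # spectral magnitude
phase = np.arctan2(y, x)             # phase value of the first audio
```

The 1025 bins per frame are exactly the stft values that the 1025*48 transition matrix then compresses into the 48-dimensional Bark feature.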
Step S260: input the Bark feature into the pre-trained target neural network model to obtain the spectral amplitude ratio of the voice signal in the Bark frequency domain.

In one implementation, as described above, the target neural network model in the embodiments of the present application may include three separation-gate convolutional layers and two long short-term memory layers. When the Bark feature is input into the pre-trained target neural network model, it is first input into the separation-gate convolutional layers, the output of the separation-gate convolutional layers is then input into the long short-term memory layers, and the spectral amplitude ratio (bark_mask) of the voice signal in the Bark frequency domain is output.
After the Bark feature enters the separation-gate convolutional layers, the processing steps of each separation-gate convolutional layer may include: inputting the input data (for the first separation-gate convolutional layer, the input data is the Bark feature) into a first causal convolutional layer; inputting the output of the first causal convolutional layer into a second causal convolutional layer; inputting the output of the second causal convolutional layer into a third convolutional layer and a fourth convolutional layer respectively; inputting the output of the third convolutional layer into a first activation function module, and inputting the output of the fourth convolutional layer into a second activation function module; and multiplying the output of the first activation function module by the output of the second activation function module to obtain the output of the separation-gate convolutional layer.
Step S270: performing inverse Bark feature conversion on the spectral magnitude ratio in the Bark frequency domain, to obtain the spectral magnitude ratio of the voice signal in the linear frequency domain.
In one manner, inverse Bark feature conversion can be performed on the spectral magnitude ratio (bark_mask) in the Bark frequency domain through the formula:

mask = mat_mul(bark_mask, bark2stft_matrix),

where the dimension of the inverse Bark conversion matrix bark2stft_matrix is 48*1025, yielding the spectral magnitude ratio (mask) of the voice signal in the linear frequency domain. Converting the ratio back to the linear frequency domain makes it possible to synthesize the sound wave of the voice signal and thus check the effect of the denoised voice signal.
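The inverse conversion is a single matrix product. In this sketch the 48*1025 bark2stft_matrix is filled with hypothetical column-normalized weights so that the linear-domain mask stays a ratio between 0 and 1; the embodiment fixes only the matrix shape, not its values.

```python
import numpy as np

rng = np.random.default_rng(2)
bark_mask = rng.random((83, 48))                 # per-frame ratios in 48 Bark bands

# Hypothetical inverse conversion matrix; only its 48*1025 shape is fixed.
bark2stft_matrix = rng.random((48, 1025))
bark2stft_matrix /= bark2stft_matrix.sum(axis=0, keepdims=True)

# mask = mat_mul(bark_mask, bark2stft_matrix): back to the 1025 linear bins
mask = bark_mask @ bark2stft_matrix
print(mask.shape)                                # (83, 1025)
```

Because each column of bark2stft_matrix sums to 1, every linear-bin mask value is a weighted average of Bark-band ratios and therefore remains in [0, 1].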
Step S280: calculating the spectral magnitude of the voice signal based on the spectral magnitude ratio in the linear frequency domain and the spectral energy of the first audio in the linear frequency domain.

The spectral energy of the first audio in the linear frequency domain can be calculated by the following formula:

stft_energy = real(stft)^2 + imag(stft)^2.

In one manner, the spectral magnitude of the voice signal can be calculated from the spectral magnitude ratio (mask) in the linear frequency domain and the spectral magnitude stft_mag of the first audio in the linear frequency domain, where stft_mag is the square root of the spectral energy stft_energy. The specific calculation formula is as follows:

stft_mag_speech = mask * stft_mag.
Step S290: obtaining the target voice based on the spectral magnitude and the phase value of the first audio.

In one manner, an inverse Fourier transform can be performed on the above phase value and spectral magnitude to obtain the target voice.
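Steps S280 and S290 can be sketched together in NumPy: the mask scales the magnitude recovered from the spectral energy, the phase value of the first audio is re-attached, and a naive inverse STFT with overlap-add returns a time-domain waveform. The hop size and the absence of a synthesis window are simplifying assumptions.

```python
import numpy as np

def reconstruct(mask, stft_energy, stft_phase, n_fft=2048, hop=512):
    """Apply the linear-domain mask and invert the STFT by overlap-add."""
    stft_mag = np.sqrt(stft_energy)               # magnitude from spectral energy
    speech_mag = mask * stft_mag                  # masked (denoised) magnitude
    spec = speech_mag * np.exp(1j * stft_phase)   # re-attach the phase of the first audio
    frames = np.fft.irfft(spec, n=n_fft, axis=1)  # per-frame inverse Fourier transform
    out = np.zeros(hop * (len(frames) - 1) + n_fft)
    for i, frame in enumerate(frames):            # naive overlap-add
        out[i * hop:i * hop + n_fft] += frame
    return out

rng = np.random.default_rng(3)
stft_energy = rng.random((83, 1025))
stft_phase = rng.uniform(-np.pi, np.pi, (83, 1025))
mask = np.ones((83, 1025))                        # all-pass mask for illustration
y = reconstruct(mask, stft_energy, stft_phase)
print(y.shape)                                    # (44032,)
```

A production inverse STFT would also apply a synthesis window and normalize for the analysis/synthesis window overlap.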
The present embodiment is illustrated below by taking Fig. 5 as an example:
As shown in Fig. 5, which is a schematic flow diagram of the method for denoising a noisy song provided by the embodiment of this application. Optionally, the voice signal is a singing voice signal, and the first audio is a noisy song. A short-time Fourier transform is performed on the noisy song to obtain the spectral energy (stft_energy) of the noisy song in the linear frequency domain; Bark feature conversion is then performed on the spectral energy of the noisy song to obtain the Bark feature (Bark_feature) of the noisy song in the Bark frequency domain; the Bark feature is then input into the pre-trained neural network model, which includes 3 separation-gate convolutional layers and two long short-term memory layers, and the model outputs the spectral magnitude ratio (Bark_mask) of the singing voice signal of the noisy song in the Bark spectrum; inverse feature conversion is then performed on the spectral magnitude ratio (Bark_mask) to obtain the spectral magnitude ratio (mask) of the singing voice signal in the linear frequency domain; the spectral magnitude (stft_mag) of the singing voice signal is then calculated based on the spectral magnitude ratio (mask) and the spectral energy (stft_energy) of the noisy song in the linear frequency domain, which was computed when the short-time Fourier transform was performed on the noisy song.
It is worth noting that the phase (stft_phase) of the noisy song in the linear frequency domain was already obtained when the short-time Fourier transform was performed on the noisy song, so the time-domain waveform of the denoised singing voice signal can be synthesized from the spectral magnitude (stft_mag) of the singing voice signal and the phase (stft_phase) of the noisy song, thereby obtaining the singing voice signal, namely the target song shown in Fig. 5, in which the background noise is significantly reduced compared with the noisy song.
In the method of voice denoising provided in this embodiment, a training sample set is obtained, the voice signals and background noise signals in the training sample set are superimposed in the time domain according to a preset signal-to-noise ratio, the superimposed training sample set is input into a machine learning model, and the machine learning model is trained to obtain the target neural network model; the first audio is then obtained, framing and windowing are performed on the first audio, a short-time Fourier transform is performed on the first audio signal in each window, and the Bark feature of the first audio is obtained; the Bark feature is then input into the pre-trained target neural network model to obtain the spectral magnitude ratio of the voice signal in the Bark frequency domain; inverse Bark feature conversion is then performed on the spectral magnitude ratio in the Bark frequency domain to obtain the spectral magnitude ratio of the voice signal in the linear frequency domain; the spectral magnitude of the voice signal is then calculated based on the spectral magnitude ratio in the linear frequency domain and the spectral energy of the first audio in the linear frequency domain; and the target voice is finally obtained based on the spectral magnitude and the phase value of the first audio. The Bark feature of the input is thus processed by a novel separation-gate convolutional structure, so that the calculation amount and complexity of the neural network are significantly reduced while the denoising effect is guaranteed, improving the user experience.
Referring to Fig. 6, which is a structural block diagram of a device of voice denoising provided by an embodiment of this application, this embodiment provides a device 300 of voice denoising that runs on an electronic device. The device 300 includes: a first acquisition module 310, a preprocessing module 320, a first computing module 330, a second computing module 340 and a second acquisition module 350.
The first acquisition module 310 is configured to obtain a first audio, the first audio being an audio mixed with a voice signal and a background noise signal.
In one manner, the device 300 may further include a sample set acquiring unit and a model acquiring unit. The sample set acquiring unit may be configured to obtain a training sample set, which may include voice signals and background noise signals of a preset duration, the voice signals and background noise signals being superimposed in the time domain according to a preset signal-to-noise ratio. The model acquiring unit is configured to input the training sample set into a machine learning model and train the machine learning model to obtain the target neural network model.
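Superimposing a voice signal and a background noise signal in the time domain at a preset signal-to-noise ratio, as the sample set acquiring unit does, can be sketched as follows; the function name and the 5 dB example value are illustrative.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Superimpose noise on speech in the time domain at a preset SNR (in dB)."""
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # scale the noise so that 10*log10(p_speech / p_scaled_noise) == snr_db
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

rng = np.random.default_rng(4)
speech = rng.standard_normal(44100)               # stand-in for a clean voice signal
noise = rng.standard_normal(44100)                # stand-in for background noise
noisy = mix_at_snr(speech, noise, snr_db=5.0)
print(noisy.shape)                                # (44100,)
```

Sweeping snr_db over a range of values is a common way to build a training set that covers both mildly and heavily noisy conditions.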
The preprocessing module 320 is configured to preprocess the first audio and convert the spectral energy feature of the first audio from the linear frequency domain to the Bark frequency domain, obtaining the Bark feature of the first audio.
In one manner, the preprocessing module 320 may include a first processing unit and a second processing unit. The first processing unit may be configured to perform framing and windowing on the first audio signal; the second processing unit may be configured to perform a short-time Fourier transform on the first audio signal in each window and convert the spectral energy feature of the first audio from the linear frequency domain to the Bark frequency domain, obtaining the Bark feature of the first audio.
Optionally, the preprocessing module 320 may include a computing unit configured to calculate the phase value of the first audio.
The first computing module 330 is configured to input the Bark feature into the pre-trained target neural network model to obtain the Bark feature ratio parameter output by the target neural network model, the Bark feature ratio parameter characterizing the proportion of the spectral magnitude feature of the voice signal in the Bark frequency domain.
In one manner, the first computing module 330 may specifically be configured to input the Bark feature into the pre-trained target neural network model to obtain the spectral magnitude ratio of the voice signal in the Bark frequency domain.
Optionally, the target neural network model in the embodiment of this application may include three separation-gate convolutional layers and two long short-term memory layers.
The second computing module 340 is configured to calculate the magnitude parameter of the voice signal based on the Bark feature ratio parameter.
In one manner, the second computing module 340 may include a first computing unit and a second computing unit. The first computing unit may be configured to perform inverse Bark feature conversion on the spectral magnitude ratio in the Bark frequency domain to obtain the spectral magnitude ratio of the voice signal in the linear frequency domain; the second computing unit may be configured to calculate the spectral magnitude of the voice signal based on the spectral magnitude ratio in the linear frequency domain and the spectral energy of the first audio in the linear frequency domain.
The second acquisition module 350 is configured to obtain the target voice based on the magnitude parameter.
In one manner, the second acquisition module 350 may obtain the target voice based on the spectral magnitude and the phase value. Specifically, the second acquisition module 350 may perform an inverse Fourier transform on the phase value and the spectral magnitude to obtain the target voice.
In the device of voice denoising provided in this embodiment, a first audio is obtained, the first audio being an audio mixed with a voice signal and a background noise signal; the first audio is then preprocessed, and the spectral energy feature of the first audio is converted from the linear frequency domain to the Bark frequency domain, obtaining the Bark feature of the first audio; the Bark feature is then input into the pre-trained target neural network model to obtain the Bark feature ratio parameter output by the target neural network model, the Bark feature ratio parameter characterizing the proportion of the spectral magnitude feature of the voice signal in the Bark frequency domain; the magnitude parameter of the voice signal is then calculated based on the Bark feature ratio parameter; and the target voice is then obtained based on the magnitude parameter. By inputting the Bark feature of the first audio into the pre-trained target neural network model, calculating the magnitude parameter of the voice signal, and obtaining the target voice based on the magnitude parameter, the device reduces the calculation amount of the neural network model and reduces the background noise in the voice information.
It is apparent to those skilled in the art that, for convenience and simplicity of description, for the specific working process of the device and modules described above, reference can be made to the corresponding processes in the foregoing method embodiments, and details are not described herein again.
In the several embodiments provided in this application, the mutual coupling, direct coupling or communication connection of the modules shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or modules may be electrical, mechanical or in other forms.
In addition, the functional modules in the embodiments of this application may be integrated into one processing module, or each module may exist physically alone, or two or more modules may be integrated into one module. The above integrated module may be implemented in the form of hardware or in the form of a software function module.
Referring to Fig. 7, based on the above method and device of voice denoising, the embodiment of this application further provides an electronic device 12 that can execute the method of voice denoising described above. The electronic device 12 includes a memory 122 and one or more (only one is shown in the figure) processors 124 coupled to each other, with a communication line connecting the memory 122 and the processor 124. A program that can execute the content in the foregoing embodiments is stored in the memory 122, and the processor 124 can execute the program stored in the memory 122.
The processor 124 may include one or more processing cores. The processor 124 uses various interfaces and lines to connect the various parts of the entire electronic device 12, and executes the various functions of the electronic device 12 and processes data by running or executing the instructions, programs, code sets or instruction sets stored in the memory 122 and calling the data stored in the memory 122. Optionally, the processor 124 may be implemented in hardware in at least one of the forms of digital signal processing (Digital Signal Processing, DSP), field-programmable gate array (Field-Programmable Gate Array, FPGA) and programmable logic array (Programmable Logic Array, PLA). The processor 124 may integrate one or a combination of a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU), a modem and the like. The CPU mainly handles the operating system, the user interface, application programs and the like; the GPU is responsible for rendering and drawing display content; and the modem handles wireless communication. It can be understood that the above modem may also not be integrated into the processor 124 and may instead be implemented separately through a communication chip.
The memory 122 may include random access memory (Random Access Memory, RAM) or read-only memory (Read-Only Memory, ROM). The memory 122 may be configured to store instructions, programs, code, code sets or instruction sets. The memory 122 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing the operating system, instructions for implementing at least one function (such as a touch function, a sound playing function or an image playing function), instructions for implementing the foregoing embodiments, and the like. The data storage area may also store data created by the electronic device 12 during use (such as a phone book, audio and video data, and chat record data).
Referring to Fig. 8, which shows a structural block diagram of a computer-readable storage medium provided by an embodiment of this application. Program code is stored in the computer-readable storage medium 400, and the program code can be called by a processor to execute the method described in the above method embodiments.
The computer-readable storage medium 400 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read-only memory), an EPROM, a hard disk or a ROM. Optionally, the computer-readable storage medium 400 includes a non-transitory computer-readable storage medium. The computer-readable storage medium 400 has storage space for the program code 410 that executes any of the method steps in the above methods. The program code can be read from or written into one or more computer program products. The program code 410 may, for example, be compressed in a suitable form.
In the description of this specification, the description of the reference terms "one embodiment", "some embodiments", "an example", "a specific example", "some examples" and the like means that the specific features, structures, materials or characteristics described in conjunction with the embodiment or example are included in at least one embodiment or example of this application. In this specification, the schematic expressions of the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, as long as they do not contradict each other, those skilled in the art may combine the features of the different embodiments or examples described in this specification.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of the technical features can be equivalently replaced; and these modifications or replacements do not make the essence of the corresponding technical solutions depart from the spirit and scope of the technical solutions of the embodiments of this application.