Specific embodiment
To enable those skilled in the art to better understand the present solution, the technical schemes in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings.
In recent years, with the rapid development of voice communication technology and users' ever-higher demands on voice quality, research on voice denoising based on supervised deep neural network learning has accelerated. Voice denoising refers to separating a target voice signal from background noise, thereby eliminating or suppressing that noise. In one approach, a large number of real target voice signals and noise signals are mixed at random and used as the input of a neural network; after supervised training, the neural network can automatically learn to output the target voice signal from the training samples, improving the denoising effect. However, as the sample rate of the target voice increases, the computation load of the neural network grows continually, preventing wide application.
For example, when the target voice signal is a person's singing voice, the sample rate of the audio signal is usually 44.1 kHz. Song denoising is a relatively special scenario within voice denoising: traditional voice denoising techniques perform poorly on high-sample-rate audio, and existing deep-neural-network-based voice denoising methods tend to have too many network parameters, and therefore too large a computation load, to be applied to the song denoising scenario.
In view of the above problem, the inventors found through long-term research that, for a segment of an audio signal, the non-stationarity of the signal input to a neural network model increases the amount of computation, and directly inputting the audio signal into a convolutional neural network model further increases that amount while yielding little denoising benefit. To reduce the computation load of the neural network and improve the denoising effect, the inventors found that the spectral energy of the audio can be converted from the linear frequency domain to the Bark domain, the Bark-domain representation of the spectral energy can be used as the Bark feature, and the Bark feature can then be used as the input of the neural network. Furthermore, building the neural network from a novel separation-gate convolutional layer structure enlarges the learning field of view of the convolution kernels while limiting the growth in computation, and increases the nonlinearity of the neural network model, improving the audio denoising effect.
Accordingly, the embodiments of the present application provide a method, apparatus, electronic device, and storage medium for audio denoising: the Bark feature of a first audio is input into a pre-trained target neural network model, the amplitude parameters of the voice signal are then calculated, and the target voice is obtained based on the amplitude parameters, reducing the computation load of the neural network model and reducing the background noise in the voice information.
To facilitate a detailed description of the present solution, the separation-gate convolutional neural network model provided by the embodiments of the present application is first described below with reference to the accompanying drawings.
Referring to Fig. 1, which is a schematic network structure diagram of an illustrative separation-gate convolutional layer in the voice denoising method provided by the embodiments of the present application. The separation-gate convolutional layer includes four two-dimensional convolutional layers, a first activation function module, and a second activation function module. In one implementation, the four layers are a first causal convolutional layer, a second causal convolutional layer, a third convolutional layer, and a fourth convolutional layer. The first activation function module is connected to the third convolutional layer, and the second activation function module is connected to the fourth convolutional layer. The convolution kernel size of the first causal convolutional layer may be kw*1, the kernel size of the second causal convolutional layer may be 1*kh, and the kernel sizes of the third and fourth convolutional layers may be identical. The kernel channel numbers of the four two-dimensional convolutional layers are the same; that is, the first causal convolutional layer, the second causal convolutional layer, the third convolutional layer, and the fourth convolutional layer all have the same number of kernel channels (for example, c, as shown in Fig. 1). With the third convolutional layer connected to the first activation function module and the fourth convolutional layer connected to the second activation function module, the output of the first activation function module can be multiplied by the output of the second activation function module to obtain the final output of the separation-gate convolutional layer.
The first activation function module may use ReLU (Rectified Linear Unit), and the second activation function module may use the Sigmoid function. Optionally, in actual implementations, the first and second activation function modules may also use other functions; no limitation is imposed here.
Optionally, the embodiments of the present application do not limit the specific values of kw, kh, and c. In one implementation, by adjusting these three parameters, the separation-gate convolutional layer can learn the voice feature information of the input more effectively, and thus better recognize the needed target voice or remove the background noise from the noisy audio.
Optionally, in one implementation, as shown in Fig. 1, the four two-dimensional convolutional layers may respectively be a causal convolutional layer (kw*1, c), a causal convolutional layer (1*kh, c), and two separated convolutional layers (1*1, c), where kw and kh are the kernel sizes of the causal convolutional layers and c is the number of kernel channels. By separating a convolution kernel into two strip-shaped kernels (i.e., the two separated convolutional layers (1*1, c) shown in Fig. 1), the learning field of view of the convolutional layer can be enlarged, while the separated structure reduces the convolutional layer's computation load.
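The separated, gated structure described above can be sketched in a few lines of numpy. This is a simplified illustration, not the patent's actual network: the c-channel 1*1 convolutions of the third and fourth layers are stood in for by the hypothetical scalar weights w3 and w4, and the kernel values are toy choices.

```python
import numpy as np

def causal_conv_time(x, k):
    # x: (T, F); k: (kw,) kernel applied along time, causally
    # (output at frame t depends only on frames <= t)
    T, F = x.shape
    xp = np.vstack([np.zeros((len(k) - 1, F)), x])  # left-pad in time
    return np.stack([np.tensordot(k, xp[t:t + len(k)], axes=(0, 0))
                     for t in range(T)])

def conv_freq(x, k):
    # 1*kh convolution along the frequency axis ("same" padding)
    return np.stack([np.convolve(row, k, mode="same") for row in x])

def gated_separable_block(x, k_time, k_freq, w3, w4):
    # Separated kernels: kw*1 causal conv, then 1*kh conv
    z = conv_freq(causal_conv_time(x, k_time), k_freq)
    a = np.maximum(w3 * z, 0.0)           # first activation module: ReLU
    b = 1.0 / (1.0 + np.exp(-w4 * z))     # second activation module: Sigmoid
    return a * b                          # element-wise gating product

x = np.random.default_rng(0).standard_normal((10, 8))
y = gated_separable_block(x, np.array([0.5, 0.5]),
                          np.array([0.25, 0.5, 0.25]), 1.0, 1.0)
```

The causal time convolution sees only past frames, while splitting the kw*kh kernel into kw*1 and 1*kh strips keeps the cost linear rather than quadratic in kernel size.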
To facilitate a better understanding of the present solution, the Bark feature involved in the embodiments of the present application is first explained below.
The Bark domain is a psychoacoustic scale of sound. Because of the special structure of the human cochlea, the human auditory system exhibits a series of critical bands. A critical band is a frequency band of sound within which masking readily occurs: a voice signal inside a critical band is easily masked by another signal of higher energy and nearby frequency, so that the human auditory system cannot perceive it. In one approach, if the frequency dimension of a voice signal is divided into critical bands, each critical band becomes one Bark, and the voice signal is thereby transformed from the linear frequency domain into the Bark domain. Optionally, the embodiments of the present application transform the voice signal from the linear frequency domain into the Bark domain using the following formula:

Bark(f) = 13*arctan(0.00076*f) + 3.5*arctan((f/7500)^2),
where arctan is the arctangent function, f is the linear frequency of the voice signal, and Bark(f) is the Bark-domain representation of the voice signal.
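As a sketch, the linear-frequency-to-Bark mapping can be implemented directly. The constants (13, 0.00076, 3.5, 7500) follow the widely used Zwicker-style approximation and are an assumption here; the embodiment's own constants are not reproduced in the surviving text.

```python
import numpy as np

def bark(f):
    # Zwicker-style critical-band (Bark) scale; f in Hz
    f = np.asarray(f, dtype=float)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

# Map a few linear frequencies up to the 44.1 kHz Nyquist limit
freqs = np.array([0.0, 1000.0, 10000.0, 22050.0])
bands = bark(freqs)
```

The mapping is monotonic, so each linear-frequency bin falls into exactly one critical band, which is what makes the matrix conversion below possible.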
Optionally, after the voice signal is transformed from the linear frequency dimension into the Bark-domain dimension, the audio spectral energy feature of the voice signal in the linear frequency dimension needs to be converted into the Bark feature of the Bark-domain dimension. In one approach, the value obtained after the short-time Fourier transform of the audio (i.e., the above voice signal), namely the spectral feature of the audio (also called its frequency-domain representation), can be expressed as:

stft(t,f) = x(t,f) + i*y(t,f),

where stft(t,f) denotes the spectral feature in the frequency domain and consists of a vector, i.e., the x + yi in the formula; x represents the real part of the spectral feature and y represents its imaginary part.
Further, the linear spectral energy of the audio can be calculated by the following formula:

stft_energy(t,f) = x(t,f)^2 + y(t,f)^2.
The linear spectral energy feature can then be converted into the Bark feature as:

Bark_feature = mat_mul(stft_energy, stft2bark_matrix),

where mat_mul denotes matrix multiplication and stft2bark_matrix denotes the transition matrix of the Bark feature.
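The energy-to-Bark conversion above is a single matrix multiply. The sketch below builds a toy stft2bark_matrix in which each linear-frequency bin contributes to the one Bark band it falls in; a real transition matrix might use overlapping band weights, so this construction is an illustrative assumption.

```python
import numpy as np

n_freq, n_bark = 1025, 48          # bins from a 2048-point STFT; 48 Bark dims
sr, n_fft = 44100, 2048
f = np.arange(n_freq) * sr / n_fft                       # bin center frequencies
bark_of_f = 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

# Assign each bin to one of n_bark bands proportionally to its Bark value
band = np.minimum((bark_of_f / bark_of_f[-1] * n_bark).astype(int), n_bark - 1)
stft2bark_matrix = np.zeros((n_freq, n_bark))
stft2bark_matrix[np.arange(n_freq), band] = 1.0

stft_energy = np.random.default_rng(1).random((100, n_freq))  # (frames, bins)
bark_feature = stft_energy @ stft2bark_matrix                 # mat_mul
```

With 0/1 band assignments the conversion simply pools energy per critical band, so total energy per frame is preserved.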
Optionally, what the neural network outputs after learning the Bark feature is the ratio bark_mask of the Bark value of the target voice (e.g., a song) to the Bark value of the noisy audio. Using the same conversion principle, the present application obtains the mask, i.e., the ratio of the spectral amplitude of the target voice to the spectral amplitude of the noisy audio in the linear frequency dimension, with the conversion formula:

Mask = mat_mul(bark_mask, bark2stft_matrix),

where bark2stft_matrix is the inverse conversion matrix of the Bark feature.
In one implementation of the embodiments of the present application, a target neural network model including at least one separation-gate convolutional layer and at least one long short-term memory layer is used to remove the background noise signal from the noisy audio. Inputting the Bark feature of the above audio into the target neural network model yields the denoised audio feature, i.e., the audio feature of the target voice. Optionally, the separation-gate convolutional layer is used to output the texture features of the corresponding target voice signal from the noisy audio features, and the long short-term memory layer is used to output the denoised audio feature from those texture features, i.e., the Bark-domain spectral feature of the target voice (including spectral amplitude and spectral energy). Here, the long short-term memory layer is a Long Short-Term Memory network (LSTM). The LSTM used in the present application is briefly described below with reference to the accompanying drawings.
Referring to Fig. 2, which is a structural schematic diagram of the long short-term memory network suitable for the voice denoising method provided by the embodiments of the present application. As shown in Fig. 2, the LSTM includes three control gates: a forget gate, an input gate, and an output gate. The activation function σ in each gate denotes the sigmoid activation function. The sigmoid activation function processes the previous layer's output h_(t-1) together with the current input X_t, and the data to be forgotten from the previous cell state C_(t-1) can be determined by the following formula:

f_t = σ(W_f·[h_(t-1), X_t] + b_f),

where a value of f_t equal to 0 indicates complete forgetting and a value of 1 indicates complete retention.
Further, the sigmoid activation function determines which information to accept, and tanh generates a new candidate value C~_t. Combining the two, the cell state C_(t-1) can be updated by the following formulas:

i_t = σ(W_i·[h_(t-1), X_t] + b_i),
C~_t = tanh(W_C·[h_(t-1), X_t] + b_C),
C_t = f_t·C_(t-1) + i_t·C~_t.
Further, the activation function determines which part of the information to output, tanh generates the new output candidate value, and the final output h_t of the hidden layer is obtained:

o_t = σ(W_o·[h_(t-1), X_t] + b_o),
h_t = o_t·tanh(C_t).
Optionally, the LSTM may include multiple layers of the structure shown in Fig. 2. Each layer takes the previous layer's hidden-layer output, the state vector, and the currently input data as input, and updates the next layer's hidden-layer output and state vector, so that key information from the past can be preserved for predicting future information.
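The gate equations above can be collected into a single numpy step function. The stacked weight layout and the layer sizes below are illustration choices, not the embodiment's parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W: (4*hidden, hidden+input), b: (4*hidden,), gates stacked as f, i, g, o
    hid = h_prev.size
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[:hid])               # forget gate f_t
    i = sigmoid(z[hid:2 * hid])        # input gate i_t
    g = np.tanh(z[2 * hid:3 * hid])    # candidate cell value (tanh branch)
    o = sigmoid(z[3 * hid:])           # output gate o_t
    c = f * c_prev + i * g             # updated cell state C_t
    h = o * np.tanh(c)                 # hidden-layer output h_t
    return h, c

rng = np.random.default_rng(2)
hid, inp = 4, 3
W = rng.standard_normal((4 * hid, hid + inp)) * 0.1
h, c = lstm_step(rng.standard_normal(inp), np.zeros(hid),
                 np.zeros(hid), W, np.zeros(4 * hid))
```

Chaining lstm_step over the frames of a Bark-feature sequence reproduces the "keep past key information" behavior the text describes.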
The present embodiments are specifically described below with reference to the accompanying drawings.

Referring to Fig. 3, which shows a flowchart of a voice denoising method provided by the embodiments of the present application. This embodiment provides a voice denoising method applicable to the above electronic device, the method including:
Step S110: obtain a first audio, the first audio being an audio in which a voice signal and a background noise signal are mixed.
The first audio may be an audio in which a voice signal and a background noise signal at a target sample rate are mixed. Optionally, the target sample rate may be a high sample rate, such as 44.1 kHz or 48 kHz, or a non-high sample rate, such as 11.025 kHz, 22.05 kHz, or 24 kHz; this embodiment does not limit the specific value of the target sample rate. The voice signal characterizes a clean voice signal or a voice signal doped with little noise. In one implementation, the voice signal may come from a segment of audio, such as a piece of singing or a recorded segment of speech; alternatively, the voice signal may come from a video, i.e., it may be a voice signal intercepted from a video. The source of the voice signal is not specifically limited.
Optionally, the first audio in the embodiments of the present application may be a song (the sample rate is usually 44.1 kHz).
The background noise signal refers to a sound signal that interferes with the voice signal; it may originate from ambient sound or from electromagnetic interference in the surrounding environment, among other sources. Background noise can sharply degrade the performance of many speech processing systems and severely affect the user experience. Understandably, the first audio inevitably contains a background noise signal. Therefore, to reduce the influence of the background noise signal on the voice signal and improve the user experience, this embodiment obtains the first audio and processes it accordingly so as to reduce its background noise signal.
Optionally, to improve the voice functionality of the electronic device, the electronic device may monitor audio signals in real time; in that case, the electronic device may identify any segment of audio (including the audio data in a video) as the first audio, so that the background noise of the first audio can be reduced in real time.
The electronic device may obtain the first audio in several ways.
In one implementation, the electronic device may obtain, through an audio system program, the audio data of a third-party client program that includes audio data, and thereby obtain the first audio. For example, the audio system program may obtain the game audio generated by a gaming application while it is running, the singing audio of a singing application while it is running, the playback audio of a video-playing application while it is running, or the startup sound of the electronic device during startup; optionally, taking any of the above audio as the first audio realizes obtaining the first audio.
Alternatively, the electronic device may obtain audio data from a network in real time as the first audio, for example, taking the voice-over of an advertisement on a certain website as the first audio. Optionally, the electronic device may also take remotely downloaded audio data as the first audio, or record a segment of a user's voice as the first audio. The source and format of the first audio are unrestricted and will not be enumerated here.
Step S120: preprocess the first audio to obtain the Bark feature of the first audio.
In one implementation, the preprocessing in the embodiments of the present application may refer to transforming the first audio from the linear frequency dimension into the frequency-domain dimension for processing. Specifically, the spectral feature of the first audio is transformed from the linear frequency dimension into the Bark frequency domain to obtain the Bark feature of the first audio. The spectral feature includes the spectral energy feature and the spectral amplitude feature of the first audio; optionally, the value of the spectral energy feature equals the square of the value of the spectral amplitude feature. The Bark feature of the first audio can thus be understood as the representation of the first audio's spectral energy feature in the Bark frequency domain.
Understandably, the voice signal of the first audio is a non-stationary signal. By preprocessing the first audio, the linearity of the Bark feature can be reduced, so that after the Bark feature is input into the pre-trained target neural network model, the background noise signal of the first audio can be removed more efficiently.
Step S130: input the Bark feature into a pre-trained target neural network model to obtain the Bark feature ratio parameter output by the target neural network model.
As described above, the target neural network model in the embodiments of the present application includes at least one separation-gate convolutional layer and at least one long short-term memory layer. It should be noted that the embodiments of the present application do not restrict the specific numbers or ordering of the separation-gate convolutional layers and long short-term memory layers, which can be set according to the actual situation.
For example, in one implementation the target neural network model may include three separation-gate convolutional layers and two long short-term memory layers. In this case, to obtain a better denoising effect, this implementation designs the loss function and adopts the adaptive moment estimation method (ADAM), so that the loss function reduces the amplitude distortion of the first audio's voice signal after the Bark feature is input into the target neural network model; the target neural network model then learns the newly obtained first audio according to the above network structure of three separation-gate convolutional layers and two long short-term memory layers, combined with adaptive moment estimation, so as to better reduce the background noise signal in the first audio. Specifically, the adaptive moment estimation involved in the embodiments of the present application uses momentum factors BETA1 = 0.9 and BETA2 = 0.999; the base learning rate (LEARNING_RATE) is set to 0.001, and for every 300,000 additional iterations the learning rate falls to 0.3 of its value. In this embodiment, the training batch size (BATCH_SIZE) is set to 32, i.e., 32 training audios are input per network training iteration, and samples may be drawn repeatedly. Training ultimately runs about 1,000,000 iterations, so that the loss converges near its minimum.
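Under the stated schedule (base rate 0.001, falling to 0.3 of its value every 300,000 iterations), the learning rate can be sketched as below. That the decay compounds per 300,000-iteration step is an assumption based on the wording.

```python
def learning_rate(iteration, base_lr=0.001, decay=0.3, step=300_000):
    # Base rate multiplied by `decay` once per completed `step` iterations
    return base_lr * decay ** (iteration // step)
```

For example, the rate stays at 0.001 through iteration 299,999, drops to 0.0003 at 300,000, and to 0.00009 at 600,000.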
Optionally, the Bark feature ratio parameter can characterize the proportion that the spectral amplitude feature of the voice signal occupies in the Bark frequency domain, i.e., the ratio of the spectral amplitude feature of the voice signal to the spectral amplitude feature of the first audio (which includes both the voice signal and the background noise signal). In one implementation, the Bark feature of the first audio is input into the separation-gate convolutional layer, the output of the separation-gate convolutional layer is input into the long short-term memory layer, and the Bark feature ratio parameter can then be output (the target neural network model can automatically learn the Bark feature ratio parameter of the voice signal).
For example, in a specific application scenario, assume that a segment of noisy audio (i.e., the first audio) has a spectral amplitude of 1, and that the whole audio is composed of voice (i.e., the voice signal) with a spectral amplitude of 0.8 and noise (i.e., the background noise signal) with a spectral amplitude of 0.2. Then, when the Bark feature corresponding to this noisy audio is input into the above target neural network model, the model can output the voice signal with spectral amplitude 0.8; that is, the target neural network model can "pick out" the voice signal from the noisy Bark feature, obtaining the Bark feature ratio parameter of the voice signal (here, 0.8).
Step S140: calculate the amplitude parameters of the voice signal based on the Bark feature ratio parameter.
In these cases, the amplitude parameters of the voice signal can be calculated based on the Bark feature ratio parameter. The amplitude parameters denote the spectral amplitude parameters of the voice signal; specifically, they include the voice signal's spectral amplitude ratio in the linear frequency domain, its spectral amplitude, and its spectral amplitude ratio in the Bark frequency domain. By calculating the amplitude parameters of the voice signal, the denoised voice signal can be converted from the Bark frequency domain back to the linear frequency domain, so that the waveform of the voice signal can also be restored, facilitating output of the voice signal.
Step S150: obtain the target voice based on the amplitude parameters.

The target voice refers to the voice signal obtained after denoising the first audio. Optionally, after the amplitude parameters are calculated based on the Bark feature ratio parameter, the target voice can be obtained based on the amplitude parameters, i.e., the voice signal obtained by removing the noise of the first audio.
In the voice denoising method provided in this embodiment, the Bark feature of a first audio in which a voice signal and a background noise signal are mixed is input into a pre-trained target neural network model; the Bark feature characterizing the voice signal is picked out, yielding the Bark feature ratio parameter that characterizes the proportion of the voice signal's spectral amplitude feature in the Bark frequency domain; the amplitude parameters of the voice signal are then calculated based on the Bark feature ratio parameter, and the target voice (the voice signal remaining after the background noise signal in the first audio is removed) is obtained based on the amplitude parameters. Denoising is thus achieved in a manner in which the target neural network model directly screens for the voice signal in the first audio, reducing the computation load of the neural network model.
Referring to Fig. 4, which shows a flowchart of a voice denoising method provided by another embodiment of the present application. This embodiment provides a voice denoising method applicable to the above electronic device, the method including:
Step S210: obtain a training sample set.
It should be noted that the embodiments of the present application may use the obtained training sample set to train, in advance, a target neural network model capable of recognizing voice signals and thereby achieving denoising; with this model, the noise signal in noisy audio can be better filtered out to obtain the voice signal.
The training sample set of the embodiments of the present application includes voice signals of a preset duration and background noise signals of a preset duration. Optionally, the preset duration may be any continuous or discrete length of time, and the preset durations of the voice signals and the background noise signals may be equal or unequal. For example, the preset duration of the voice signals may be 20 hours while that of the background noise signals is 10 hours; or the preset durations of the voice signals and background noise signals may both be 15 hours; no specific limitation is imposed.
Optionally, target songs of different timbres within a continuous preset duration may serve as the voice signals of the preset duration, or target songs of different timbres within a discontinuous preset duration (i.e., a preset duration with interruptions) may serve as the voice signals of the preset duration. Similarly, different types of background noise within a continuous preset duration may serve as the background noise signals of the preset duration, or different types of background noise within a discontinuous preset duration may serve as the background noise signals of the preset duration.
In one implementation, the preset duration may be obtained according to a preset acquisition mode, for example, acquiring in integer multiples of an hour; optionally, the voice signals and background noise signals may also be obtained at random, with the respective acquisition lengths of the voice signals and background noise signals taken as their preset durations.
In one implementation, the electronic device may take user-selected audio with a chronological order as the voice signals and background noise signals of the preset duration; it may also randomly grab audio data from a network as the voice signals and background noise signals of the preset duration; or it may take the audio data produced while an audio-class application of the electronic device is running as the voice signals and background noise signals of the preset duration. It is worth noting that the acquisition modes and content sources of the voice signals and background noise signals of the preset duration are unrestricted and can be selected according to the actual situation.
Step S220: superimpose the voice signals and background noise signals in the time domain according to a preset signal-to-noise ratio, input the superimposed training sample set into a machine learning model, and train the machine learning model to obtain the target neural network model.

Understandably, any segment of voice data without any denoising processing inevitably contains background noise; that is, it has a signal-to-noise ratio. The signal-to-noise ratio (SNR), also called the signal/noise ratio, refers to the ratio of signal to noise in an electronic device or electronic system. To increase the denoising accuracy of the target neural network model in the embodiments of the present application, so that the denoising algorithm adapts to audio data of different signal-to-noise ratios, the voice signals and background noise signals of the preset duration can be superimposed in the time domain according to the preset signal-to-noise ratio, the superimposed training sample set can be input into the machine learning model, and the machine learning model can be trained to obtain the target neural network model.
Optionally, the preset signal-to-noise ratio may be a random number between 0 and 20; the specific value is unrestricted.
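One common way to superimpose speech and noise at a chosen SNR is to scale the noise by the power ratio before adding, as sketched below. The embodiment does not give its mixing formula, so this power-based scaling, and treating the preset SNR as decibels, are assumptions.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    # Scale `noise` so that 10*log10(P_speech / P_noise) == snr_db,
    # then superimpose the two signals in the time domain.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

rng = np.random.default_rng(3)
speech = rng.standard_normal(44100)          # 1 s of "clean" signal (toy data)
noise = rng.standard_normal(44100)
snr_db = rng.uniform(0, 20)                  # random SNR in [0, 20]
mixed = mix_at_snr(speech, noise, snr_db)
```

Drawing snr_db at random per training pair is what exposes the model to noisy audio of many signal-to-noise ratios.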
The machine learning model may be a linear model, a kernel method or support vector machine, a decision tree, Boosting, or a neural network (including fully connected neural networks, convolutional neural networks, recurrent neural networks, etc.). For the specific training method of each machine learning model, reference can be made to its working principle in the prior art, which is not repeated here.
It should be noted that when the voice signals and background noise signals are superimposed in the time domain according to the preset signal-to-noise ratio, the durations of the voice signal and background noise signal taken from the training sample set are equal; for example, a 2.5-second voice signal and a 2.5-second background noise signal are chosen from the training sample set, so that the trained neural network model can adapt to noisy audio of more signal-to-noise ratios.
Step S230: obtain the first audio.

For a specific description of obtaining the first audio, reference can be made to the description of step S110 in the previous embodiment, which is not repeated here.
Step S240: perform framing and windowing on the first audio signal.

Since the first audio signal is a non-stationary signal, it needs to be framed and windowed. In one implementation, the embodiments of the present application use a Hanning window with a window length of 40 ms (milliseconds) and a sliding step of 10 ms. This embodiment does not specifically limit the window function used; other window functions, such as a rectangular window, may also be used.
In a specific application scenario, if the audio sample rate of the voice signal is 44.1 kHz, the Hanning window length is 1764 audio points and the sliding step is 441 audio points. Optionally, such a window length can improve the overall operation speed of the target neural network model on the premise that the voice signal is not distorted. Framing and windowing the first audio signal avoids abrupt changes between frames.
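The 40 ms / 10 ms Hanning-window framing at 44.1 kHz can be sketched as follows. The helper frame_signal is hypothetical, and trailing samples that do not fill a whole window are simply dropped.

```python
import numpy as np

sr = 44100
win_len = int(0.040 * sr)   # 40 ms Hanning window -> 1764 audio points
hop = int(0.010 * sr)       # 10 ms sliding step   -> 441 audio points
window = np.hanning(win_len)

def frame_signal(x, win, hop):
    # Split x into overlapping frames, each multiplied by the window
    n = 1 + (len(x) - len(win)) // hop
    return np.stack([x[i * hop:i * hop + len(win)] * win for i in range(n)])

x = np.random.default_rng(4).standard_normal(sr)   # 1 s of toy audio
frames = frame_signal(x, window, hop)
```

The tapered window ends are what suppress the inter-frame discontinuities mentioned above.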
Step S250: perform a short-time Fourier transform on the first audio signal in each window to obtain the Bark feature of the first audio.

Optionally, a short-time Fourier transform is performed on the first audio signal in each window, and the spectral energy feature of the first audio is transformed from the linear frequency domain into the Bark frequency domain, thereby obtaining the Bark feature of the first audio. Specifically, the embodiments of the present application set the number of short-time Fourier transform points to 2048, so that 1025 frequency-dimension values (i.e., stft values) are obtained after the transform. The dimension of the Bark feature taken in this embodiment is 48, so the dimension of the transition matrix stft2bark_matrix that transforms stft_energy into the Bark feature is 1025*48.
It should be noted that while performing the short-time Fourier transform on the signal in each window, the phase value of the first audio can also be calculated, with the specific calculation formula:

phase(t,f) = arctan(y(t,f)/x(t,f)),

where arctan is the arctangent function.
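A sketch of the per-window analysis under the embodiment's settings (a 2048-point transform over a 1764-point frame). np.arctan2 is used in place of a plain arctangent so the phase lands in the correct quadrant; the variable names are illustrative.

```python
import numpy as np

n_fft = 2048
frame = np.random.default_rng(5).standard_normal(1764)  # one windowed frame

# Zero-padded 2048-point real FFT of the frame -> 1025 frequency bins
spec = np.fft.rfft(frame, n=n_fft)
x, y = spec.real, spec.imag          # stft(t,f) = x + i*y for this frame
stft_energy = x ** 2 + y ** 2        # linear spectral energy
stft_mag = np.sqrt(stft_energy)      # spectral magnitude
phase = np.arctan2(y, x)             # phase value of the first audio
```

The 1025 bins per frame are exactly the stft values that the 1025*48 transition matrix then compresses into the 48-dimensional Bark feature.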
Step S260: input the Bark feature into the pre-trained target neural network model to obtain the spectral amplitude ratio of the voice signal in the Bark frequency domain.

In one implementation, as described above, the target neural network model in the embodiments of the present application may include three separation-gate convolutional layers and two long short-term memory layers. When the Bark feature is input into the pre-trained target neural network model, it is first input into the separation-gate convolutional layers, the output of the separation-gate convolutional layers is then input into the long short-term memory layers, and the spectral amplitude ratio (bark_mask) of the voice signal in the Bark frequency domain is output.
After the Bark feature enters the separation-gate convolutional layers, the processing steps of each separation-gate convolutional layer may include: inputting the input data (for the first separation-gate convolutional layer, the input data is the Bark feature) into a first causal convolutional layer; inputting the output of the first causal convolutional layer into a second causal convolutional layer; inputting the output of the second causal convolutional layer into a third convolutional layer and a fourth convolutional layer respectively; inputting the output of the third convolutional layer into a first activation function module, and inputting the output of the fourth convolutional layer into a second activation function module; and multiplying the output of the first activation function module by the output of the second activation function module to obtain the output of the separation-gate convolutional layer.
Step S270: performing inverse Bark feature conversion on the spectral magnitude ratio in the Bark frequency domain, to obtain the spectral magnitude ratio of the voice signal in the linear frequency domain.
In one manner, inverse Bark feature conversion can be performed on the spectral magnitude ratio (bark_mask) in the Bark frequency domain through the formula:

mask = mat_mul(bark_mask, bark2stft_matrix),

where the dimension of the inverse Bark conversion matrix bark2stft_matrix is 48*1025, yielding the spectral magnitude ratio (mask) of the voice signal in the linear frequency domain. Converting the ratio back to the linear frequency domain makes it possible to synthesize the sound wave of the voice signal and thus check the effect of the denoised voice signal.
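The inverse conversion is a single matrix product. In this sketch the 48*1025 bark2stft_matrix is filled with hypothetical column-normalized weights so that the linear-domain mask stays a ratio between 0 and 1; the embodiment fixes only the matrix shape, not its values.

```python
import numpy as np

rng = np.random.default_rng(2)
bark_mask = rng.random((83, 48))                 # per-frame ratios in 48 Bark bands

# Hypothetical inverse conversion matrix; only its 48*1025 shape is fixed.
bark2stft_matrix = rng.random((48, 1025))
bark2stft_matrix /= bark2stft_matrix.sum(axis=0, keepdims=True)

# mask = mat_mul(bark_mask, bark2stft_matrix): back to the 1025 linear bins
mask = bark_mask @ bark2stft_matrix
print(mask.shape)                                # (83, 1025)
```

Because each column of bark2stft_matrix sums to 1, every linear-bin mask value is a weighted average of Bark-band ratios and therefore remains in [0, 1].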
Step S280: calculating the spectral magnitude of the voice signal based on the spectral magnitude ratio in the linear frequency domain and the spectral energy of the first audio in the linear frequency domain.

The spectral energy of the first audio in the linear frequency domain can be calculated by the following formula:

stft_energy = real(stft)^2 + imag(stft)^2.

In one manner, the spectral magnitude of the voice signal can be calculated from the spectral magnitude ratio (mask) in the linear frequency domain and the spectral magnitude stft_mag of the first audio in the linear frequency domain, where stft_mag is the square root of the spectral energy stft_energy. The specific calculation formula is as follows:

stft_mag_speech = mask * stft_mag.
Step S290: obtaining the target voice based on the spectral magnitude and the phase value of the first audio.

In one manner, an inverse Fourier transform can be performed on the above phase value and spectral magnitude to obtain the target voice.
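Steps S280 and S290 can be sketched together in NumPy: the mask scales the magnitude recovered from the spectral energy, the phase value of the first audio is re-attached, and a naive inverse STFT with overlap-add returns a time-domain waveform. The hop size and the absence of a synthesis window are simplifying assumptions.

```python
import numpy as np

def reconstruct(mask, stft_energy, stft_phase, n_fft=2048, hop=512):
    """Apply the linear-domain mask and invert the STFT by overlap-add."""
    stft_mag = np.sqrt(stft_energy)               # magnitude from spectral energy
    speech_mag = mask * stft_mag                  # masked (denoised) magnitude
    spec = speech_mag * np.exp(1j * stft_phase)   # re-attach the phase of the first audio
    frames = np.fft.irfft(spec, n=n_fft, axis=1)  # per-frame inverse Fourier transform
    out = np.zeros(hop * (len(frames) - 1) + n_fft)
    for i, frame in enumerate(frames):            # naive overlap-add
        out[i * hop:i * hop + n_fft] += frame
    return out

rng = np.random.default_rng(3)
stft_energy = rng.random((83, 1025))
stft_phase = rng.uniform(-np.pi, np.pi, (83, 1025))
mask = np.ones((83, 1025))                        # all-pass mask for illustration
y = reconstruct(mask, stft_energy, stft_phase)
print(y.shape)                                    # (44032,)
```

A production inverse STFT would also apply a synthesis window and normalize for the analysis/synthesis window overlap.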
The present embodiment is illustrated below by taking Fig. 5 as an example:
As shown in Fig. 5, which is a schematic flow diagram of the method for denoising a noisy song provided by the embodiment of this application. Optionally, the voice signal is a singing voice signal, and the first audio is a noisy song. A short-time Fourier transform is performed on the noisy song to obtain the spectral energy (stft_energy) of the noisy song in the linear frequency domain; Bark feature conversion is then performed on the spectral energy of the noisy song to obtain the Bark feature (Bark_feature) of the noisy song in the Bark frequency domain; the Bark feature is then input into the pre-trained neural network model, which includes 3 separation-gate convolutional layers and two long short-term memory layers, and the model outputs the spectral magnitude ratio (Bark_mask) of the singing voice signal of the noisy song in the Bark spectrum; inverse feature conversion is then performed on the spectral magnitude ratio (Bark_mask) to obtain the spectral magnitude ratio (mask) of the singing voice signal in the linear frequency domain; the spectral magnitude (stft_mag) of the singing voice signal is then calculated based on the spectral magnitude ratio (mask) and the spectral energy (stft_energy) of the noisy song in the linear frequency domain, which was computed when the short-time Fourier transform was performed on the noisy song.
It is worth noting that the phase (stft_phase) of the noisy song in the linear frequency domain was already obtained when the short-time Fourier transform was performed on the noisy song, so the time-domain waveform of the denoised singing voice signal can be synthesized from the spectral magnitude (stft_mag) of the singing voice signal and the phase (stft_phase) of the noisy song, thereby obtaining the singing voice signal, namely the target song shown in Fig. 5, in which the background noise is significantly reduced compared with the noisy song.
In the method of voice denoising provided in this embodiment, a training sample set is obtained, the voice signals and background noise signals in the training sample set are superimposed in the time domain according to a preset signal-to-noise ratio, the superimposed training sample set is input into a machine learning model, and the machine learning model is trained to obtain the target neural network model; the first audio is then obtained, framing and windowing are performed on the first audio, a short-time Fourier transform is performed on the first audio signal in each window, and the Bark feature of the first audio is obtained; the Bark feature is then input into the pre-trained target neural network model to obtain the spectral magnitude ratio of the voice signal in the Bark frequency domain; inverse Bark feature conversion is then performed on the spectral magnitude ratio in the Bark frequency domain to obtain the spectral magnitude ratio of the voice signal in the linear frequency domain; the spectral magnitude of the voice signal is then calculated based on the spectral magnitude ratio in the linear frequency domain and the spectral energy of the first audio in the linear frequency domain; and the target voice is finally obtained based on the spectral magnitude and the phase value of the first audio. The Bark feature of the input is thus processed by a novel separation-gate convolutional structure, so that the calculation amount and complexity of the neural network are significantly reduced while the denoising effect is guaranteed, improving the user experience.
Referring to Fig. 6, which is a structural block diagram of a device of voice denoising provided by an embodiment of this application, this embodiment provides a device 300 of voice denoising that runs on an electronic device. The device 300 includes: a first acquisition module 310, a preprocessing module 320, a first computing module 330, a second computing module 340 and a second acquisition module 350.
The first acquisition module 310 is configured to obtain a first audio, the first audio being an audio mixed with a voice signal and a background noise signal.
In one manner, the device 300 may further include a sample set acquiring unit and a model acquiring unit. The sample set acquiring unit may be configured to obtain a training sample set, which may include voice signals and background noise signals of a preset duration, the voice signals and background noise signals being superimposed in the time domain according to a preset signal-to-noise ratio. The model acquiring unit is configured to input the training sample set into a machine learning model and train the machine learning model to obtain the target neural network model.
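Superimposing a voice signal and a background noise signal in the time domain at a preset signal-to-noise ratio, as the sample set acquiring unit does, can be sketched as follows; the function name and the 5 dB example value are illustrative.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Superimpose noise on speech in the time domain at a preset SNR (in dB)."""
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # scale the noise so that 10*log10(p_speech / p_scaled_noise) == snr_db
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

rng = np.random.default_rng(4)
speech = rng.standard_normal(44100)               # stand-in for a clean voice signal
noise = rng.standard_normal(44100)                # stand-in for background noise
noisy = mix_at_snr(speech, noise, snr_db=5.0)
print(noisy.shape)                                # (44100,)
```

Sweeping snr_db over a range of values is a common way to build a training set that covers both mildly and heavily noisy conditions.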
The preprocessing module 320 is configured to preprocess the first audio and convert the spectral energy feature of the first audio from the linear frequency domain to the Bark frequency domain, obtaining the Bark feature of the first audio.
In one manner, the preprocessing module 320 may include a first processing unit and a second processing unit. The first processing unit may be configured to perform framing and windowing on the first audio signal; the second processing unit may be configured to perform a short-time Fourier transform on the first audio signal in each window and convert the spectral energy feature of the first audio from the linear frequency domain to the Bark frequency domain, obtaining the Bark feature of the first audio.
Optionally, the preprocessing module 320 may include a computing unit configured to calculate the phase value of the first audio.
The first computing module 330 is configured to input the Bark feature into the pre-trained target neural network model to obtain the Bark feature ratio parameter output by the target neural network model, the Bark feature ratio parameter characterizing the proportion of the spectral magnitude feature of the voice signal in the Bark frequency domain.
In one manner, the first computing module 330 may specifically be configured to input the Bark feature into the pre-trained target neural network model to obtain the spectral magnitude ratio of the voice signal in the Bark frequency domain.
Optionally, the target neural network model in the embodiment of this application may include three separation-gate convolutional layers and two long short-term memory layers.
The second computing module 340 is configured to calculate the magnitude parameter of the voice signal based on the Bark feature ratio parameter.
In one manner, the second computing module 340 may include a first computing unit and a second computing unit. The first computing unit may be configured to perform inverse Bark feature conversion on the spectral magnitude ratio in the Bark frequency domain to obtain the spectral magnitude ratio of the voice signal in the linear frequency domain; the second computing unit may be configured to calculate the spectral magnitude of the voice signal based on the spectral magnitude ratio in the linear frequency domain and the spectral energy of the first audio in the linear frequency domain.
The second acquisition module 350 is configured to obtain the target voice based on the magnitude parameter.
In one manner, the second acquisition module 350 may obtain the target voice based on the spectral magnitude and the phase value. Specifically, the second acquisition module 350 may perform an inverse Fourier transform on the phase value and the spectral magnitude to obtain the target voice.
In the device of voice denoising provided in this embodiment, a first audio is obtained, the first audio being an audio mixed with a voice signal and a background noise signal; the first audio is then preprocessed, and the spectral energy feature of the first audio is converted from the linear frequency domain to the Bark frequency domain, obtaining the Bark feature of the first audio; the Bark feature is then input into the pre-trained target neural network model to obtain the Bark feature ratio parameter output by the target neural network model, the Bark feature ratio parameter characterizing the proportion of the spectral magnitude feature of the voice signal in the Bark frequency domain; the magnitude parameter of the voice signal is then calculated based on the Bark feature ratio parameter; and the target voice is then obtained based on the magnitude parameter. By inputting the Bark feature of the first audio into the pre-trained target neural network model, calculating the magnitude parameter of the voice signal, and obtaining the target voice based on the magnitude parameter, the device reduces the calculation amount of the neural network model and reduces the background noise in the voice information.
It is apparent to those skilled in the art that, for convenience and simplicity of description, for the specific working process of the device and modules described above, reference can be made to the corresponding processes in the foregoing method embodiments, and details are not described herein again.
In the several embodiments provided in this application, the mutual coupling, direct coupling or communication connection of the modules shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or modules may be electrical, mechanical or in other forms.
In addition, the functional modules in the embodiments of this application may be integrated into one processing module, or each module may exist physically alone, or two or more modules may be integrated into one module. The above integrated module may be implemented in the form of hardware or in the form of a software function module.
Referring to Fig. 7, based on the above method and device of voice denoising, the embodiment of this application further provides an electronic device 12 that can execute the method of voice denoising described above. The electronic device 12 includes a memory 122 and one or more (only one is shown in the figure) processors 124 coupled to each other, with a communication line connecting the memory 122 and the processor 124. A program that can execute the content in the foregoing embodiments is stored in the memory 122, and the processor 124 can execute the program stored in the memory 122.
The processor 124 may include one or more processing cores. The processor 124 uses various interfaces and lines to connect the various parts of the entire electronic device 12, and executes the various functions of the electronic device 12 and processes data by running or executing the instructions, programs, code sets or instruction sets stored in the memory 122 and calling the data stored in the memory 122. Optionally, the processor 124 may be implemented in hardware in at least one of the forms of digital signal processing (Digital Signal Processing, DSP), field-programmable gate array (Field-Programmable Gate Array, FPGA) and programmable logic array (Programmable Logic Array, PLA). The processor 124 may integrate one or a combination of a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU), a modem and the like. The CPU mainly handles the operating system, the user interface, application programs and the like; the GPU is responsible for rendering and drawing display content; and the modem handles wireless communication. It can be understood that the above modem may also not be integrated into the processor 124 and may instead be implemented separately through a communication chip.
The memory 122 may include random access memory (Random Access Memory, RAM) or read-only memory (Read-Only Memory, ROM). The memory 122 may be configured to store instructions, programs, code, code sets or instruction sets. The memory 122 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing the operating system, instructions for implementing at least one function (such as a touch function, a sound playing function or an image playing function), instructions for implementing the foregoing embodiments, and the like. The data storage area may also store data created by the electronic device 12 during use (such as a phone book, audio and video data, and chat record data).
Referring to Fig. 8, which shows a structural block diagram of a computer-readable storage medium provided by an embodiment of this application. Program code is stored in the computer-readable storage medium 400, and the program code can be called by a processor to execute the method described in the above method embodiments.
The computer-readable storage medium 400 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read-only memory), an EPROM, a hard disk or a ROM. Optionally, the computer-readable storage medium 400 includes a non-transitory computer-readable storage medium. The computer-readable storage medium 400 has storage space for the program code 410 that executes any of the method steps in the above methods. The program code can be read from or written into one or more computer program products. The program code 410 may, for example, be compressed in a suitable form.
In the description of this specification, the description of the reference terms "one embodiment", "some embodiments", "an example", "a specific example", "some examples" and the like means that the specific features, structures, materials or characteristics described in conjunction with the embodiment or example are included in at least one embodiment or example of this application. In this specification, the schematic expressions of the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, as long as they do not contradict each other, those skilled in the art may combine the features of the different embodiments or examples described in this specification.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of the technical features can be equivalently replaced; and these modifications or replacements do not make the essence of the corresponding technical solutions depart from the spirit and scope of the technical solutions of the embodiments of this application.