CN106971707A - The method and system and intelligent terminal of voice de-noising based on output offset noise - Google Patents


Info

Publication number
CN106971707A
Authority
CN
China
Prior art keywords
voice
judged
sound
intensity
estimation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610024759.2A
Other languages
Chinese (zh)
Inventor
祝铭明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yutou Technology Hangzhou Co Ltd
Original Assignee
Yutou Technology Hangzhou Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yutou Technology Hangzhou Co Ltd filed Critical Yutou Technology Hangzhou Co Ltd
Priority to CN201610024759.2A
Publication of CN106971707A
Pending legal status

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10K: SOUND-PRODUCING DEVICES; METHODS OR DEVICES FOR PROTECTING AGAINST, OR FOR DAMPING, NOISE OR OTHER ACOUSTIC WAVES IN GENERAL; ACOUSTICS NOT OTHERWISE PROVIDED FOR
    • G10K11/00: Methods or devices for transmitting, conducting or directing sound in general; Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
    • G10K11/16: Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
    • G10K11/175: Methods or devices for protecting against, or for damping, noise or other acoustic waves in general using interference effects; Masking sound
    • G10K11/178: Methods or devices for protecting against, or for damping, noise or other acoustic waves in general using interference effects, by electro-acoustically regenerating the original acoustic waves in anti-phase
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/20: Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses a method and system for voice de-noising based on outputting counteracting (offset) noise, and an intelligent terminal, belonging to the technical field of voice recognition. The method includes: collecting the voice input from outside; obtaining the sound intensity of the voice, matching the sound intensity of the voice against counteracting noises of a plurality of different sound intensities, obtaining the counteracting noise whose sound intensity is identical to that of the voice, and outputting the counteracting noise; collecting the voice input from outside, judging whether the sound intensity of the voice is higher than a preset intensity threshold, and confirming the voice as a voice to be judged when the sound intensity is higher than the intensity threshold; generating, according to the spectrum of the voice to be judged, an estimation mark for each frequency band of the voice to be judged, the estimation mark being used to represent the salience of the voice on the harmonic structure; generating a probability model of the pure voice corresponding to the voice to be judged; and, using each estimation mark as the weight exponent of the corresponding frequency band of the voice to be judged, processing according to the probability model to obtain a pure voice estimate associated with the voice.

Description

The method and system and intelligent terminal of voice de-noising based on output offset noise
Technical field
The present invention relates to the technical field of voice recognition, and more particularly to a method and system for voice de-noising based on outputting a counteracting ("offset") noise, and to an intelligent terminal.
Background technology
In the prior art, a speech recognition function must often be used in intelligent terminals that support voice operation: the voiceprint and sentence of the speaker are recognized to obtain an instruction that the intelligent terminal can execute, and the corresponding operation is then performed according to that instruction. However, in application scenarios where interference from non-speaker noise is strong (for example in a space with more than one speaker, or in a space with strong ambient noise), the ambient noise blends with the speaker's voice command, making voice recognition increasingly difficult and substantially reducing recognition accuracy.
In the prior art, when the ambient noise is relatively small, some existing methods (for example spectral subtraction and Wiener filtering) can be used to filter noise during speech recognition, and achieve a fairly significant effect. However, in application environments with larger ambient noise, the prior art offers no technical scheme able to extract pure voice from an environment with strong ambient noise.
The content of the invention
In view of the above problems in the prior art, a technical scheme comprising a method and system for voice de-noising based on outputting counteracting noise, and an intelligent terminal, is now provided, specifically including:
A method of voice de-noising based on outputting counteracting noise, applicable to an intelligent terminal, wherein counteracting noises of a plurality of different sound intensities trained in advance are provided, comprising the following steps:
Step S1, collecting the voice input from outside;
Step S2, obtaining the sound intensity of the voice, matching the sound intensity of the voice against the counteracting noises of the plurality of different sound intensities, obtaining the counteracting noise whose sound intensity is identical to that of the voice, and outputting the counteracting noise;
Step S3, collecting the voice input from outside, judging whether the sound intensity of the voice is higher than a preset intensity threshold, confirming the voice as a voice to be judged when the sound intensity is higher than the intensity threshold, and turning to step S4;
Step S4, generating, according to the spectrum of the voice to be judged, an estimation mark for each frequency band of the voice to be judged, the estimation mark being used to represent the salience of the voice on the harmonic structure;
Step S5, generating a probability model of the pure voice corresponding to the voice to be judged;
Step S6, using each estimation mark as the weight exponent of the corresponding frequency band of the voice to be judged, and processing according to the probability model to obtain a pure voice estimate associated with the voice.
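The intensity-matching and gating logic of steps S1 to S3 above can be sketched as follows. This is a minimal illustration under stated assumptions: the patent does not specify an intensity measure, so a frame RMS level in dB is assumed, and all function names, the noise-bank layout, and the threshold value are hypothetical.

```python
import numpy as np

def sound_intensity_db(frame):
    """Root-mean-square intensity of one frame, in dB (assumed measure)."""
    rms = np.sqrt(np.mean(np.square(frame)) + 1e-12)
    return 20.0 * np.log10(rms + 1e-12)

def match_counteracting_noise(frame, noise_bank):
    """Step S2 (sketch): pick the stored counteracting noise whose trained
    intensity is closest to the intensity of the input frame."""
    level = sound_intensity_db(frame)
    return min(noise_bank, key=lambda entry: abs(entry[0] - level))[1]

def is_voice_to_be_judged(frame, threshold_db=-20.0):
    """Step S3 (sketch): only frames above the preset intensity threshold
    are treated as commands addressed to the terminal."""
    return sound_intensity_db(frame) > threshold_db
```

A real implementation would output the matched counteracting noise through a loudspeaker in anti-phase; here only the selection is shown.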
Preferably, in the method of voice de-noising based on outputting counteracting noise, the estimation mark generated in step S4 includes a first estimation mark; or
the estimation mark generated in step S4 includes a first estimation mark and a second estimation mark.
Preferably, in the method of voice de-noising based on outputting counteracting noise, the step of generating the first estimation mark in step S4 specifically includes:
Step S41a, extracting, according to the spectrum of the voice to be judged, the harmonic structure corresponding to the voice to be judged;
Step S42a, performing regularization processing on the monitoring values associated with the log-spectral domain of the harmonic structure, and performing smoothing processing on the regularized monitoring values according to the mel scale;
Step S43a, performing further regularization processing on the smoothed monitoring values so that the average of the monitoring values is 1;
Step S44a, generating, according to the monitoring values, the first estimation mark corresponding to each frequency band of the voice to be judged.
Preferably, in the method of voice de-noising based on outputting counteracting noise, the method of obtaining the pure voice estimate by processing according to the first estimation mark in step S6 specifically includes:
Step S61a, processing to obtain the posterior probability of the minimum mean square error estimation associated with the voice to be judged;
Step S62a, using each first estimation mark as the weight exponent of the corresponding frequency band of the voice to be judged, and weighting the posterior probability associated with the voice to be judged according to the probability model, to obtain the pure voice estimate.
Preferably, in the method of voice de-noising based on outputting counteracting noise, the step of generating the second estimation mark in step S4 specifically includes:
Step S41b, extracting, according to the spectrum of the voice to be judged, the harmonic structure corresponding to the voice to be judged;
Step S42b, performing regularization processing on the monitoring values associated with the log-spectral domain of the harmonic structure, and performing smoothing processing on the regularized monitoring values according to the mel scale;
Step S43b, performing regularization processing on the smoothed monitoring values so that they lie in the range from 0 to 1;
Step S44b, generating, according to the monitoring values, the second estimation mark corresponding to each frequency band of the voice to be judged.
Preferably, in the method of voice de-noising based on outputting counteracting noise, after step S6 is performed, the following step is further performed according to the second estimation mark:
for each frequency band of the voice to be judged, using each corresponding second estimation mark as a weight, performing linear interpolation between the monitoring value and the pure voice estimate, and processing to obtain the corresponding output value.
A system of voice de-noising based on outputting counteracting noise, applicable to an intelligent terminal, comprising:
a collecting unit for collecting the voice input from outside;
a memory cell for storing the counteracting noises of the plurality of different sound intensities trained in advance;
a matching unit, connected to the collecting unit and the memory cell respectively, for matching the sound intensity of the voice against the counteracting noises of the plurality of different sound intensities and obtaining the counteracting noise whose sound intensity is identical to that of the voice;
an output unit, connected to the matching unit, for outputting the counteracting noise whose sound intensity is identical to that of the voice;
a judging unit, connected to the collecting unit, in which an intensity threshold is preset, for judging whether the sound intensity of the voice input from outside is higher than the intensity threshold and outputting a corresponding judged result;
a first processing unit, connected to the judging unit, for confirming the voice as a voice to be judged according to the judged result when the sound intensity of the voice is higher than the intensity threshold, and generating, according to the spectrum of the voice to be judged, an estimation mark for each frequency band of the voice to be judged, the estimation mark being used to represent the salience of the voice on the harmonic structure;
a model generation unit, connected to the first processing unit, for generating the probability model of the pure voice corresponding to the voice to be judged;
a second processing unit, connected to the model generation unit, for using each estimation mark as the weight exponent of the corresponding frequency band of the voice to be judged and processing according to the probability model to obtain the pure voice estimate associated with the voice.
Preferably, in the system of voice de-noising based on outputting counteracting noise, the estimation mark includes a first estimation mark; or
the estimation mark includes a first estimation mark and a second estimation mark.
Preferably, in the system of voice de-noising based on outputting counteracting noise, the first processing unit specifically includes:
an extraction module for extracting, according to the spectrum of the voice to be judged, the harmonic structure corresponding to the voice to be judged;
a first processing module, connected to the extraction module, for performing regularization processing on the monitoring values associated with the log-spectral domain of the harmonic structure and performing smoothing processing on the regularized monitoring values according to the mel scale;
a second processing module, connected to the first processing module, for performing further regularization processing on the smoothed monitoring values so that the average of the monitoring values is 1;
a first generation module, connected to the second processing module, for generating, according to the monitoring values, the first estimation mark corresponding to each frequency band of the voice to be judged.
Preferably, in the system of voice de-noising based on outputting counteracting noise, the second processing unit specifically includes:
a third processing module for processing to obtain the posterior probability of the minimum mean square error estimation associated with the voice to be judged;
a fourth processing module, connected to the third processing module, for using each first estimation mark as the weight exponent of the corresponding frequency band of the voice to be judged and weighting the posterior probability associated with the voice to be judged according to the probability model, to obtain the pure voice estimate.
Preferably, in the system of voice de-noising based on outputting counteracting noise, the first processing unit includes:
a fifth processing module, connected to the first processing unit, for performing regularization processing on the smoothed monitoring values so that they lie in the range from 0 to 1;
a second generation module, connected to the fifth processing module, for generating, according to the monitoring values, the second estimation mark corresponding to each frequency band of the voice to be judged.
Preferably, the system of voice de-noising based on outputting counteracting noise further includes:
a third processing unit, connected to the second processing unit, for, for each frequency band of the voice to be judged, using each corresponding second estimation mark as a weight, performing linear interpolation between the monitoring value and the pure voice estimate, and processing to obtain the corresponding output value.
An intelligent terminal, wherein the above method of voice de-noising based on outputting counteracting noise is used.
An intelligent terminal, wherein the above system of voice de-noising based on outputting counteracting noise is included.
The beneficial effects of the above technical scheme are:
1) a method of voice de-noising based on outputting counteracting noise is provided which, according to the sound intensity of the external voice, obtains a counteracting noise of the same sound intensity and outputs the counteracting noise to cancel the external noise, achieving the purpose of removing the ambient noise and excluding interference, and then obtains the pure voice estimate from the external voice, to improve the accuracy of speech recognition;
2) a system of voice de-noising based on outputting counteracting noise is provided, able to support and realize the above method of voice de-noising based on outputting counteracting noise.
Brief description of the drawings
Fig. 1 is an overall flow diagram of a method of voice de-noising based on outputting counteracting noise in a preferred embodiment of the present invention;
Figs. 2-4 are step-by-step flow diagrams of the method of voice de-noising based on outputting counteracting noise, on the basis of Fig. 1, in preferred embodiments of the present invention;
Fig. 5 is an overall structural diagram of a system of voice de-noising based on outputting counteracting noise in a preferred embodiment of the present invention;
Figs. 6-7 are structural diagrams of sub-units of the system of voice de-noising based on outputting counteracting noise, on the basis of Fig. 5, in preferred embodiments of the present invention.
Embodiment
The technical schemes in the embodiments of the present invention will be described clearly and completely below in combination with the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work belong to the scope of protection of the invention.
It should be noted that, where there is no conflict, the embodiments of the present invention and the features in the embodiments may be combined with each other.
The invention is further described below in combination with the accompanying drawings and specific embodiments, but not as a limitation of the invention.
Typically, a speech recognition system applicable to an intelligent terminal includes two parts, a front end and a back end. The front end applies certain voice conversion techniques to extract corresponding characteristic quantities from the voice input by the speaker, and the back end carries out speech recognition on these extracted characteristic quantities using a recognition model trained in advance, to determine the content contained in the sentence input by the speaker. The present technical scheme is an improvement made to the front end of the speech recognition system in the prior art, i.e. an improvement of the process of extracting characteristic quantities from the voice input from outside, intended to reduce the influence of ambient noise on that process.
Therefore, in view of the above problems in the prior art, a method of voice de-noising based on outputting counteracting noise is provided in a preferred embodiment of the present invention. It is applicable to an intelligent terminal, for example an intelligent robot supporting voice operation.
In this technical scheme, the "voice input from outside" and the "voice to be judged" are the speaker's voice with ambient noise superimposed. "Pure voice" refers to the speaker's voice with the ambient noise eliminated. The so-called "pure voice estimate" refers to the pure voice obtained by estimation from the voice to be judged (i.e. the voice including ambient noise). The "spectrum" refers to the power spectrum or amplitude spectrum of the voice.
The technical scheme of the present invention hereinafter is expanded on the basis of the prior art, i.e. it is obtained by improving the noise cancellation technique realized with MMSE (Minimum Mean Square Error) estimation.
Therefore, before the technical scheme of the present invention is described, the MMSE-based noise cancellation technique is described first: given an initial speech value y (corresponding to the voice superimposed with ambient noise above), the pure speech value x is modeled by a probability distribution model p(x|y), and an estimate of the pure voice x is estimated from the probability distribution model p(x|y). MMSE estimation is then used as the basic technique in the estimation of the subsequent stage.
Then, in the MMSE estimation technique, the speaker's voice is first collected and recorded with a microphone as the observed voice; the observed voice is then converted into a digital signal by means of A/D conversion, and converted, through framing and DFT (Discrete Fourier Transform), into a spectrum for each frame of voice. Next, each frame's spectrum is passed through a mel filter bank (a filter bank in which band-pass filters are arranged at equal intervals on the mel scale) and its logarithm is taken; it is thereby converted into a mel logarithmic spectrum and output.
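The front-end chain just described (framing, DFT, mel filter bank, logarithm) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the number of bands (24), FFT size (256), and sampling rate (8 kHz) are assumed values chosen only for the example.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_bands, n_fft, sr):
    """Triangular band-pass filters spaced at equal intervals on the mel scale."""
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_bands + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fb = np.zeros((n_bands, n_fft // 2 + 1))
    for b in range(n_bands):
        lo, mid, hi = bins[b], bins[b + 1], bins[b + 2]
        for i in range(lo, mid):
            fb[b, i] = (i - lo) / max(mid - lo, 1)   # rising edge
        for i in range(mid, hi):
            fb[b, i] = (hi - i) / max(hi - mid, 1)   # falling edge
    return fb

def mel_log_spectrum(frames, n_fft=256, sr=8000, n_bands=24):
    """Per-frame power spectrum -> mel filter bank -> logarithm."""
    spec = np.abs(np.fft.rfft(frames, n=n_fft, axis=-1)) ** 2
    fb = mel_filterbank(n_bands, n_fft, sr)
    return np.log(spec @ fb.T + 1e-10)
```

Each row of the result is one frame's mel log-spectrum vector y, the quantity modeled in the formulas below.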
In the prior art, based on the output mel logarithmic spectrum, the pure voice estimate of each frame can be generated, and the corresponding pure voice estimate can be output.
The MMSE estimation technique performs MMSE estimation on the probability distribution model formed above, and the pure voice estimate can be generated. It is noted that the saved probability distribution model is a GMM (Gaussian Mixture Model) in the mel log-spectral domain, i.e. a model generated for each phoneme based on prior learning. The pure voice estimate generated by the MMSE estimation is then used as a vector in the mel log-spectral domain.
Then, a specific characteristic quantity can be extracted from the output pure voice estimate, for example the corresponding mel-frequency cepstral coefficients (MFCC), and this characteristic quantity is sent to the back end. In the back end, by using other configured voice recognition means such as an HMM (Hidden Markov Model) acoustic model or an N-gram language model, the content contained in the speaker's sentence is specified based on the characteristic quantity received from the front end.
Then in the prior art, for a frequency band d (a frequency band on the mel scale) in frame t of the above speech value y, the speech value y_d(t) in the mel log-spectral domain can be expressed as the following function (1) of the pure speech value x_d(t) and the noise value n_d(t):
y_d(t) = x_d(t) + log(1 + exp(n_d(t) - x_d(t)))   (1)
Ignoring the frame index t in the above formula (1), and expressing formula (1) in vector form, the following formula (2) can be obtained:
y = x + g   (2)
In the above formula (2), the mismatch vector g is given, for each frequency band d, by the mismatch function G indicated in the following formula (3):
g_d = G_d(x, n) = log(1 + exp(n_d - x_d))   (3)
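The mismatch relation of formulas (1) and (3) has a simple interpretation that can be checked numerically: in the log-spectral domain, y = x + log(1 + exp(n - x)) is exactly the log of the sum of the speech and noise powers, log(exp(x) + exp(n)). The values below are illustrative.

```python
import numpy as np

def mismatch_G(x, n):
    """G_d(x, n) = log(1 + exp(n_d - x_d)), elementwise, as in formula (3)."""
    return np.log1p(np.exp(n - x))

x = np.array([1.0, 2.5, -0.3])   # clean log-spectrum (illustrative)
n = np.array([0.2, 0.1, 0.5])    # noise log-spectrum (illustrative)
y = x + mismatch_G(x, n)
# The noisy log-spectrum equals the log of the summed powers:
assert np.allclose(y, np.log(np.exp(x) + np.exp(n)))
```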
The above pure voice x can then be modeled as the K-mixture GMM indicated in the following formula (4):
p(x) = Σ_{k=1}^{K} γ_k · N(x; μ_{x,k}, Σ_{x,k})   (4)
In the above formula (4), γ_k, μ_{x,k} and Σ_{x,k} respectively indicate the prior probability, mean vector and covariance matrix of the k-th normal distribution.
Then, by applying a linear Taylor expansion on the basis of the above formulas (1)-(4), the mismatch vector g can be modeled as the K-mixture GMM indicated in the following formula (5):
p(g) = Σ_{k=1}^{K} γ_k · N(g; μ_{g,k}, Σ_{g,k})   (5)
The mean vector μ_{g,k} in the above formula (5) can be represented by the following formula (6), and the covariance matrix Σ_{g,k} can be represented by the following formula (7):
μ_{g,k} ≈ log(1 + exp(μ_n - μ_{x,k})) = G(μ_{x,k}, μ_n)   (6)
Σ_{g,k} ≈ F(μ_{x,k}, μ_n)^2 · (Σ_{x,k} + Σ_n)   (7)
The auxiliary function F in the above formula (7) can be defined as the following formula (8):
F_d(x, n) = (1 + exp(x_d - n_d))^(-1)   (8)
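A useful observation about formula (8), checked numerically below: F is the partial derivative of the mismatch function G of formula (3) with respect to the noise value n, which is why it appears as the linearization weight in formulas (7) and (12). The test values are illustrative.

```python
import numpy as np

def G(x, n):
    """Mismatch function of formula (3), scalar form."""
    return np.log1p(np.exp(n - x))

def F(x, n):
    """Auxiliary function of formula (8), scalar form."""
    return 1.0 / (1.0 + np.exp(x - n))

# F equals dG/dn, verified by a central finite difference:
x, n, eps = 1.2, 0.4, 1e-6
numeric = (G(x, n + eps) - G(x, n - eps)) / (2 * eps)
assert abs(numeric - F(x, n)) < 1e-6
```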
Therefore, the above pure voice estimate x̂ is obtained by the processing of the following formula (9-1):
x̂ = y - Σ_{k=1}^{K} ρ_k(y) · G(μ_{x,k}, μ_n)   (9-1)
Correspondingly, the method of obtaining the pure voice estimate x̂ by direct estimation from the speech value y can also be given by the following formula (9-2):
x̂ = Σ_{k=1}^{K} ρ_k(y) · μ_{x,k}   (9-2)
Here, the posterior probability ρ_k in the above formulas (9-1) and (9-2) is given by the following formula (10):
ρ_k(y) = γ_k · N(y; μ_{y,k}, Σ_{y,k}) / Σ_{k'=1}^{K} γ_{k'} · N(y; μ_{y,k'}, Σ_{y,k'})   (10)
In the above formula (10), the mean vector μ_{y,k} can be represented by the following formula (11), and the covariance matrix Σ_{y,k} can be represented by the following formula (12):
μ_{y,k} ≈ μ_{x,k} + G(μ_{x,k}, μ_n)   (11)
Σ_{y,k} ≈ (1 - F(μ_{x,k}, μ_n))^2 · Σ_{x,k} + F(μ_{x,k}, μ_n)^2 · Σ_n   (12)
Then in the prior art, in the above formulas (11)-(12), the speech model parameters [μ_{x,k}, Σ_{x,k}] can be obtained from prior training data, while the noise model parameters [μ_n, Σ_n] are set by a model-based noise compensation part on the basis of observations given to the MMSE estimating part in non-speech segments.
In other words, the process of the above MMSE estimation is a process of approximating the pure voice estimate x̂ as the sum of the K mean vectors μ_{x,k} of the probability distributions, weighted with the posterior probabilities ρ_k(y) as weights.
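The posterior-weighted sum just described can be illustrated with a toy scalar example of formulas (9-2) and (10). All numbers and names are illustrative; a real system would use vectors in the mel log-spectral domain and means/covariances from formulas (11)-(12).

```python
import numpy as np

def normal_pdf(y, mu, var):
    return np.exp(-0.5 * (y - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def mmse_estimate(y, priors, mu_y, var_y, mu_x):
    """x_hat = sum_k rho_k(y) * mu_{x,k}, with rho_k from formula (10)."""
    lik = priors * normal_pdf(y, mu_y, var_y)
    rho = lik / lik.sum()
    return np.dot(rho, mu_x), rho

priors = np.array([0.5, 0.5])
mu_y = np.array([0.0, 4.0])     # noisy-speech means per mixture (illustrative)
var_y = np.array([1.0, 1.0])
mu_x = np.array([-1.0, 3.0])    # clean-speech means per mixture (illustrative)
x_hat, rho = mmse_estimate(3.9, priors, mu_y, var_y, mu_x)
assert rho[1] > rho[0]          # the observation lies near the second mixture
```

With the observation near the second mixture's noisy mean, almost all posterior mass lands on k=2, so x̂ is close to μ_{x,2} = 3.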
Then, in a preferred embodiment of the present invention, the method of voice de-noising based on outputting counteracting noise described above, in which counteracting noises of a plurality of different sound intensities are trained in advance, is specifically as shown in Fig. 1 and comprises the following steps:
Step S1, collecting the voice input from outside;
Step S2, obtaining the sound intensity of the voice, matching the sound intensity of the voice against the counteracting noises of the plurality of different sound intensities, obtaining the counteracting noise whose sound intensity is identical to that of the voice, and outputting the counteracting noise;
Step S3, collecting the voice input from outside, judging whether the sound intensity of the voice is higher than a preset intensity threshold, confirming the voice as a voice to be judged when the sound intensity is higher than the intensity threshold, and turning to step S4;
Step S4, generating, according to the spectrum of the voice to be judged, an estimation mark for each frequency band of the voice to be judged, the estimation mark being used to represent the salience of the voice on the harmonic structure;
Step S5, generating a probability model of the pure voice corresponding to the voice to be judged;
Step S6, using each estimation mark as the weight exponent of the corresponding frequency band of the voice to be judged, and processing according to the probability model to obtain the pure voice estimate associated with the voice.
In the present embodiment, by obtaining, according to the sound intensity of the external voice, the counteracting noise of the same sound intensity, and outputting the counteracting noise to cancel the external noise, the purpose of removing the ambient noise and excluding interference is achieved; the pure voice estimate is then obtained from the external voice, to improve the accuracy of speech recognition.
In a specific embodiment, the external voice (the speaker's voice) is collected first, and whether the sound intensity of the collected voice is greater than a preset intensity threshold is judged. The main purpose of this judgement is to exclude scenes in which the speaker does not actually intend to voice-control the intelligent terminal, for example a scene in which the speaker talks with others in a low voice, or a sentence the speaker lets slip. Therefore, only when the sound intensity of the voice spoken by the speaker is relatively strong (greater than the preset intensity threshold) can the speaker be considered to be sending a voice instruction to the intelligent terminal; only then does the intelligent terminal need to start speech recognition, and carry out the voice de-noising based on outputting counteracting noise before the speech recognition. This judgement thus avoids keeping the functional modules for speech recognition and for voice de-noising based on outputting counteracting noise in working state at all times, and can save the power consumption of the intelligent terminal.
In this embodiment, when the sound intensity of the speaker's voice is greater than the above preset intensity threshold, step S4 is performed, i.e. an estimation mark for each frequency band of the voice to be judged is generated according to the spectrum of the voice to be judged. In this embodiment, the estimation mark is used to represent the salience of the voice on the harmonic structure.
In this embodiment, a probability model of the pure voice corresponding to the voice to be judged is generated, and with each estimation mark used as the weight exponent of the corresponding frequency band of the voice to be judged, the pure voice estimate associated with the voice is obtained by processing according to the probability model.
In a preferred embodiment of the present invention, the estimation mark generated in the above step S4 includes a first estimation mark; or
the estimation mark generated in the above step S4 includes a first estimation mark and a second estimation mark.
In a preferred embodiment of the present invention, as shown in Fig. 2, the step of generating the first estimation mark in the above step S4 specifically includes:
Step S41a, extracting, according to the spectrum of the voice to be judged, the harmonic structure corresponding to the voice to be judged;
Step S42a, performing regularization processing on the monitoring values associated with the log-spectral domain of the harmonic structure, and performing smoothing processing on the regularized monitoring values according to the mel scale;
Step S43a, performing further regularization processing on the smoothed monitoring values so that the average of the monitoring values is 1;
Step S44a, generating, according to the monitoring values, the first estimation mark corresponding to each frequency band of the voice to be judged.
In a preferred embodiment of the present invention, as shown in Fig. 3, in the above step S6, the method of obtaining the pure voice estimate by processing according to the first estimation mark specifically includes:
Step S61a, processing to obtain the posterior probability of the minimum mean square error estimation associated with the voice to be judged;
Step S62a, using each first estimation mark as the weight exponent of the corresponding frequency band of the voice to be judged, and weighting the posterior probability associated with the voice to be judged according to the probability model, to obtain the pure voice estimate.
In a preferred embodiment of the present invention, as shown in Fig. 4, the step of generating the second estimation mark in the above step S4 specifically includes:
Step S41b, extracting, according to the spectrum of the voice to be judged, the harmonic structure corresponding to the voice to be judged;
Step S42b, performing regularization processing on the monitoring values associated with the log-spectral domain of the harmonic structure, and performing smoothing processing on the regularized monitoring values according to the mel scale;
Step S43b, performing regularization processing on the smoothed monitoring values so that they lie in the range from 0 to 1;
Step S44b, generating, according to the monitoring values, the second estimation mark corresponding to each frequency band of the voice to be judged.
In a preferred embodiment of the present invention, after step S6 is performed, the following step is further executed according to the second estimation marks:
For each frequency band of the voice to be judged, using each corresponding second estimation mark as a weight, performing linear interpolation between the monitoring value and the pure voice estimate to obtain the corresponding output value.
A first embodiment of the technical solution of the present invention is given below:
In the existing MMSE, the pure voice estimate is given by the above formulas (9-1) and (9-2), and the posterior probability ρ_k(y) in each formula is given by the above formula (10).
In this embodiment, in the formulas (9-1) and (9-2) giving the pure voice estimate, CW-MMSE uses the posterior probability ρ'_k(y) weighted by the estimation marks α_d, rather than the posterior probability ρ_k(y), as the weight. Formula (13) below indicates the posterior probability ρ'_k(y) used in this embodiment:

ρ'_k(y) = γ_k · N'(y; μ_k, Σ_k) / Σ_{k'} γ_{k'} · N'(y; μ_{k'}, Σ_{k'})    (13)

where γ_k is the prior weight of the k-th mixture component.
In this embodiment, the normal distribution in the above formula (13) can be represented by formula (14) below, which assumes a diagonal covariance. In formula (14), D represents the number of dimensions of the full-band distribution:

N'(y; μ_k, Σ_k) = Π_{d=1}^{D} N(y_d; μ_{k,d}, σ²_{k,d})^{α_d}    (14)

The above formula (14) expresses that each factor of the normal distribution N' (which is used to calculate the posterior probability ρ'_k(y)) is raised to the power of the estimation mark α_d, which serves as a weight exponent. The so-called estimation mark is a mark that represents the reliability estimate of a frequency band. In general, the reliability of a frequency band is assessed from the viewpoint of how much the signal has been degraded by ambient noise. In the technical solution of the present invention, the estimation mark is defined as follows:
Since it is known in advance that the spectrum of the vowels contained in ordinary human speech typically has a harmonic structure, in an environment without ambient noise the harmonic structure of a vowel is maintained over the whole frequency band of the spectrum of the collected voice. Conversely, under strong wideband noise the harmonic structure of a vowel is lost in many frequency bands, and is only maintained in frequency bands where the speech power is concentrated, such as the formant bands. Therefore, in the technical solution of the present invention, it is assumed that degradation caused by ambient noise rarely occurs in frequency bands with an obvious harmonic structure, and the conspicuousness of the harmonic structure is defined as the estimation mark of the frequency band.
The estimation marks in the technical solution of the present invention are generated using the LPW (Local Peak Weight). The LPW is obtained by, for example, removing the large-scale variation, including the formant information, from the spectral energy distribution of the collected voice, extracting only the regular crests and troughs corresponding to the harmonic structure, and regularizing their values. In the technical solution of the present invention, the LPW of each frame is generated by performing the following process:
First, the spectrum of frame t of the collected voice is processed, and the cepstrum of its logarithmic spectrum is obtained by a discrete cosine transform. Then, among the terms of the obtained cepstrum, only the terms in the quefrency range corresponding to the harmonic structure of vowels are retained, and the other terms are deleted. After that, an inverse discrete cosine transform is applied to the processed cepstrum to convert it back to the log-spectral domain. Finally, regularization processing is performed on the converted spectrum so that its mean becomes 1, thereby obtaining the LPW.
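The LPW construction just described (DCT of the log spectrum, liftering to the harmonic quefrency range, inverse DCT, normalization to mean 1) can be sketched in numpy. The quefrency range `lo..hi`, the orthonormal DCT implementation, and all names are illustrative assumptions, not the patent's exact procedure:

```python
import numpy as np

def _dct_matrix(n):
    # Orthonormal DCT-II matrix (numpy-only stand-in for a DCT routine);
    # orthonormality means the inverse transform is just the transpose.
    k = np.arange(n)[:, None]
    M = np.cos(np.pi * (np.arange(n)[None, :] + 0.5) * k / n)
    M *= np.sqrt(2.0 / n)
    M[0] /= np.sqrt(2.0)
    return M

def local_peak_weight(spectrum, lo=5, hi=20):
    """Sketch of LPW generation for one frame's power spectrum."""
    D = _dct_matrix(len(spectrum))
    cep = D @ np.log(spectrum)       # log spectrum -> cepstrum
    cep_lift = np.zeros_like(cep)
    cep_lift[lo:hi] = cep[lo:hi]     # keep only the harmonic-range terms
    log_lpw = D.T @ cep_lift         # inverse DCT back to the log-spectral domain
    lpw = np.exp(log_lpw)
    return lpw / lpw.mean()          # regularize so the mean becomes 1
```

Because the liftering removes both the very low quefrencies (spectral envelope, formants) and the high quefrencies (fine noise), what survives is the regular crest/trough pattern of the harmonics.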
Next, the LPW is smoothed on the mel scale to obtain the corresponding Mel LPW. In a preferred embodiment of the present invention, the values of the LPW can be smoothed by a group of mel filters to obtain one corresponding value for each mel frequency band. The so-called mel filters are a filter bank in which bandpass filters are arranged at equal intervals on the mel scale. Each mel frequency band thus provides a corresponding Mel LPW value. The size of a Mel LPW value corresponds to the conspicuousness of the harmonic structure in the high-resolution spectral band, and each mel frequency band corresponds to one Mel LPW value.
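The mel-filter smoothing step can be sketched as a weighted average under triangular windows spaced evenly on the mel scale. The filter design (triangular windows, treating the spectral bin index as a frequency for the mel warping) is an assumption for illustration:

```python
import numpy as np

def mel_smooth(lpw, n_mel=24):
    """Average the LPW values under a bank of triangular mel-spaced
    windows, yielding one Mel LPW value per mel band."""
    n = len(lpw)
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # Band edges in bin units: equal spacing on the mel axis.
    edges = inv(np.linspace(0.0, mel(n - 1), n_mel + 2))
    x = np.arange(n)
    out = np.empty(n_mel)
    for b in range(n_mel):
        lo, c, hi = edges[b], edges[b + 1], edges[b + 2]
        # Triangular window rising lo->c and falling c->hi.
        w = np.maximum(0.0, np.minimum((x - lo) / max(c - lo, 1e-9),
                                       (hi - x) / max(hi - c, 1e-9)))
        out[b] = (w * lpw).sum() / max(w.sum(), 1e-9)   # weighted average
    return out
```

A flat LPW maps to a flat Mel LPW; peaks that survive in a spectral region raise the Mel LPW value of the bands covering that region.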
In the technical solution of the present invention, the above Mel LPW values can serve as the estimation marks of the corresponding frequency bands. Specifically, the estimation marks α_d in the above formula (14) can be generated by the following process:
First, the dynamic range of the Mel LPW is compressed by using a suitable scaling function, such as a sigmoid function. In formula (15) below, the Mel LPW value w_d of each frequency band is converted into α'_d. Formula (15) indicates the manner of converting the Mel LPW value w_d into α'_d by using the sigmoid function:

α'_d = 1.0 / (1.0 + exp(-a·(w_d - 1.0)))    (15)
In the above formula (15), a is a tuning parameter, which can be set to an appropriate value.
Then, regularization processing is performed on the compressed values α'_d so that their mean becomes 1. Formula (16) below indicates the method of regularizing α'_d to obtain the estimation marks α_d:

α_d = α'_d / ((1/D)·Σ_{d'=1}^{D} α'_{d'})    (16)
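Formulas (15) and (16) together amount to a sigmoid compression followed by a mean normalization, which can be written directly; the default value of the tuning parameter `a` here is an illustrative choice, not taken from the patent:

```python
import numpy as np

def first_estimation_marks(mel_lpw, a=2.0):
    """Generate the first estimation marks alpha_d from Mel LPW values w_d:
    sigmoid dynamic-range compression (formula 15), then normalization so
    the marks average to 1 (formula 16)."""
    alpha_p = 1.0 / (1.0 + np.exp(-a * (mel_lpw - 1.0)))  # formula (15)
    return alpha_p / alpha_p.mean()                        # formula (16)
```

Bands whose Mel LPW exceeds 1 (obvious harmonics) end up with marks above 1, and degraded bands end up below 1, which is exactly the behaviour described for formula (14).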
When the frame t of a voiced portion has an obvious harmonic structure of a vowel in a spectral band, the estimation mark α_d of the corresponding frequency band d becomes greater than 1. In this case, for the frequency band d, the normal distribution N' in the above formula (14) becomes larger, and the posterior probability ρ'_k(y) of the frequency band d becomes larger. Therefore the contribution of the mel frequency bands corresponding to the spectral bands in which the harmonic structure of vowels is significant becomes larger.
On the contrary, when the harmonic structure of a vowel is lost in a spectral band of the frame t of a voiced portion, the estimation mark α_d of the corresponding frequency band d becomes less than 1. Then, for the frequency band d, the normal distribution N' in the above formula (14) becomes smaller, and the posterior probability ρ'_k(y) of the frequency band d becomes smaller. Therefore the contribution of the mel frequency bands corresponding to the spectral bands in which the harmonic structure of vowels is lost becomes smaller.
A second embodiment of the technical solution of the present invention is given below:
If the collected voice is equivalent to pure voice (i.e., the voice of a speaker collected in an environment with almost no ambient noise, or the case where the speaker is very close to the voice acquisition device, such as a microphone), then no processing needs to be performed on it, and directly outputting the collected voice is the optimal choice. However, if speech processing is performed according to the method of voice de-noising based on output offset noise in the technical solution of the present invention, even in these cases the pure voice is still estimated from the collected voice, and a voice estimate whose quality is worse than the pure voice may therefore be output.
Therefore, this embodiment proposes a method capable of performing linear interpolation between the pure voice estimate and the collected voice, in which the estimation marks participate in the calculation as weights.
In this embodiment, the output value in the frequency band d is obtained by the linear interpolation function of formula (17) below:

x̄_d = β_d·y_d + (1 - β_d)·x̂_d    (17)

In the above formula (17), x̂_d represents the pure voice estimate in the frequency band d, β_d represents the confidence index for the frequency band d, y_d represents the value of the collected voice in the frequency band d, and x̄_d represents the output value in the frequency band d. In the above formula (17), the linear interpolation function is weighted using the estimation mark β_d as the weight, which takes a value from 0 to 1. It can be seen from the linear interpolation function that as β_d approaches 1, the output value x̄_d approaches the value y_d of the collected voice; correspondingly, as β_d approaches 0, the output value x̄_d approaches the pure voice estimate x̂_d.
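The per-band interpolation of formula (17) is a one-liner; a minimal sketch, assuming all three inputs are mel log-spectral vectors of equal length:

```python
import numpy as np

def interpolate_output(y, x_clean, beta):
    """Formula (17): per-band linear interpolation between the observed
    value y_d and the pure voice estimate, weighted by the second
    estimation mark beta_d in [0, 1]. beta near 1 keeps the observation;
    beta near 0 keeps the clean estimate."""
    return beta * y + (1.0 - beta) * x_clean
```

This is what prevents the second-embodiment concern above: when the input is already clean, beta approaches 1 everywhere and the observation passes through almost untouched.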
In the technical solution of the present invention, the above estimation marks are generated by performing regularization processing on the Mel LPW values. The estimation mark β_d in the above formula (17) can be generated by the following process:
First, the Mel LPW values for the frame t are obtained; then, by using an appropriate scaling function such as a sigmoid function, the Mel LPW value w_d is regularized so that w_d takes a value from 0 to 1, where 1 is the maximum. Formula (18) below indicates the manner of regularizing the Mel LPW value w_d by using the sigmoid function to obtain the estimation mark β_d:

β_d = 1.0 / (1.0 + exp(-a·(w_d - 1.0 - b)))    (18)
In the above formula (18), a and b are tuning parameters, which can be preset to appropriate values according to actual conditions.
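Formula (18) is a shifted sigmoid whose output already lies in (0, 1), so no further normalization is needed. The default values of the tuning parameters `a` and `b` below are illustrative assumptions:

```python
import numpy as np

def second_estimation_marks(mel_lpw, a=2.0, b=0.0):
    """Formula (18): map the Mel LPW values w_d through a shifted sigmoid
    so that each second estimation mark beta_d lies in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a * (mel_lpw - 1.0 - b)))
```

The offset `b` moves the crossover point: with b = 0 a band sits at beta = 0.5 exactly when its Mel LPW equals the mean value 1.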
When the frame t of a voiced portion has an obvious harmonic structure of a vowel in a spectral band, the estimation mark β_d of the corresponding frequency band d is close to 1. Then the output value in the frequency band d, being the result of the linear interpolation indicated in the above formula (17), is closer to the value y_d of the collected voice than to the pure voice estimate.
On the contrary, when the harmonic structure of a vowel is lost in a spectral band of the frame t of a voiced portion, the estimation mark β_d of the corresponding frequency band d is close to 0. Then the output value in the frequency band d, being the result of the linear interpolation indicated in formula (17), is closer to the pure voice estimate than to the observed value y_d.
In a preferred embodiment of the present invention, the above first embodiment and second embodiment can be applied jointly, for example by the following process:
First, the spectrum Y of a frame of the collected voice is obtained, the harmonic structure of the spectrum Y is extracted to generate the LPW, and the Mel LPW is generated from the LPW. Then regularization processing is performed on the Mel LPW by an appropriate method to generate the estimation mark α for each frequency band, where the mean of the estimation marks α is 1. At the same time, regularization processing is performed on the Mel LPW to generate the estimation mark β for each frequency band, where the values of the estimation marks β are distributed from 0 to 1. The generated estimation marks α and β are output respectively.
After that, the spectrum Y corresponding to the frame is converted into the mel logarithmic spectrum y and output. The pure voice is estimated by using the output mel logarithmic spectrum y and the above estimation marks α. Specifically, the posterior probability of the MMSE estimation is weighted using the above estimation marks α as the weights, and the pure voice estimate is output.
Then, for each frequency band, linear interpolation is performed between the vector of the mel logarithmic spectrum y and the vector of the above pure voice estimate (in the mel log-spectral domain). In the calculation of the linear interpolation, the above estimation marks β serve as the weights, and the output value is finally calculated.
Finally, specific feature quantities are extracted according to the obtained output values, and the extracted feature quantities are sent to the back end. The above steps are repeated for each frame of the collected voice, and when the last frame is reached, the processing ends.
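The combined per-frame flow of the two embodiments can be sketched as a loop. Here `estimate_clean` and `make_marks` are hypothetical stand-ins, injected as parameters, for the alpha-weighted CW-MMSE estimator and the mark generators, so that only the control flow is shown:

```python
import numpy as np

def process_frames(frames, estimate_clean, make_marks):
    """High-level sketch of the joint first+second embodiment:
    per frame, generate the marks, estimate the clean speech with the
    alpha weights, then interpolate with the beta weights."""
    outputs = []
    for y in frames:                        # y: mel log spectrum of one frame
        alpha, beta = make_marks(y)         # marks derived from the frame's Mel LPW
        x_clean = estimate_clean(y, alpha)  # alpha-weighted clean estimate
        outputs.append(beta * y + (1.0 - beta) * x_clean)  # formula (17)
    return np.array(outputs)
```

Feature extraction for the back end would then run on each row of the returned array.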
In a preferred embodiment of the present invention, based on the above method of voice de-noising based on output offset noise, a system of voice de-noising based on output offset noise is now provided, applicable to an intelligent terminal, the structure of which is specifically shown in Fig. 5 and includes:
Collecting unit 1, for collecting the voice of an external input;
Storage unit 7, for storing a plurality of pre-trained counteracting noises of different sound intensities;
Matching unit 8, connected to the collecting unit 1 and the storage unit 7 respectively, for matching the sound intensity against the plurality of counteracting noises of different sound intensities, and obtaining the counteracting noise whose sound intensity is identical to that of the voice;
Output unit 9, connected to the matching unit 8, for outputting the counteracting noise whose sound intensity is identical to that of the voice;
Judging unit 2, connected to the collecting unit 1, in which an intensity threshold is preset, for judging whether the sound intensity of the voice of the external input is higher than the intensity threshold and outputting the corresponding judged result;
First processing unit 3, connected to the judging unit 2, for confirming the voice as the voice to be judged when, according to the judged result, the sound intensity of the voice is higher than the intensity threshold, and for generating, according to the frequency spectrum of the voice to be judged, the estimation marks of each frequency band of the corresponding voice to be judged, the estimation marks being used to represent the conspicuousness of the voice on the harmonic structure;
Model generation unit 6, connected to the first processing unit 3, for generating the probabilistic model of the pure voice corresponding to the voice to be judged;
Second processing unit 5, connected to the model generation unit 6, for using each estimation mark as the weight index of the corresponding frequency band of the voice to be judged, and processing according to the probabilistic model to obtain the pure voice estimate associated with the voice.
In this embodiment, the sound intensity of the external voice is obtained by the collecting unit 1; the matching unit 8 matches the sound intensity of the external voice against the plurality of pre-trained counteracting noises of different sound intensities in the storage unit 7 to obtain the counteracting noise of the same sound intensity; and the output unit 9 outputs the counteracting noise, so that it cancels the external noise, thereby achieving the purpose of removing ambient noise and eliminating interference. The pure voice estimate is then obtained from the external voice, so as to improve the accuracy of speech recognition.
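The matching unit's behaviour can be sketched as a nearest-intensity lookup. Using RMS as the intensity measure and representing the storage unit as a list of (intensity, noise) pairs are assumptions made for illustration:

```python
import numpy as np

def match_cancel_noise(voice, noise_bank):
    """Sketch of matching unit 8: measure the input's sound intensity
    (here RMS) and return the stored, pre-trained counteracting noise
    whose trained intensity is closest to it."""
    level = np.sqrt(np.mean(voice ** 2))               # measured intensity
    best = min(noise_bank, key=lambda item: abs(item[0] - level))
    return best[1]                                     # the matched noise
```

The output unit would then play this matched noise so that it cancels the ambient noise before the estimation marks are computed.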
In a preferred embodiment of the present invention, in the above system of voice de-noising based on output offset noise, the estimation marks may include the first estimation marks; or
the estimation marks may include the first estimation marks and the second estimation marks.
In a preferred embodiment of the present invention, in the above system of voice de-noising based on output offset noise, as shown in Fig. 6, the above first processing unit 3 specifically includes:
Extraction module 31, for extracting, according to the frequency spectrum of the voice to be judged, the harmonic structure corresponding to the voice to be judged;
First processing module 32, connected to the extraction module 31, for performing regularization processing on the monitoring values associated with the log-spectral domain of the harmonic structure, and performing smoothing processing on the regularized monitoring values according to the mel scale;
Second processing module 33, connected to the first processing module 32, for performing further regularization processing on the smoothed monitoring values so that the mean of the monitoring values is 1;
First generation module 34, connected to the second processing module 33, for generating, according to the monitoring values, the first estimation mark of each frequency band of the corresponding voice to be judged.
In a preferred embodiment of the present invention, in the above system of voice de-noising based on output offset noise, as shown in Fig. 7, the above second processing unit 5 specifically includes:
Third processing module 51, for processing to obtain the posterior probability of the Minimum Mean Squared Error estimation associated with the voice to be judged;
Fourth processing module 52, connected to the third processing module 51, for using each first estimation mark as the weight index of the corresponding frequency band of the voice to be judged, and weighting the posterior probability associated with the voice to be judged according to the probabilistic model, so as to obtain the pure voice estimate.
In a preferred embodiment of the present invention, in the above system of voice de-noising based on output offset noise, still as shown in Fig. 6, the first processing unit 3 includes:
Fifth processing module 35, connected to the first processing module 32, for performing corresponding regularization processing on the smoothed monitoring values so that they range from 0 to 1;
Second generation module 36, connected to the fifth processing module 35, for generating, according to the monitoring values, the second estimation mark of each frequency band of the corresponding voice to be judged.
In a preferred embodiment of the present invention, in the above system of voice de-noising based on output offset noise, still as shown in Fig. 5, the system further includes:
Third processing unit 4, connected to the second processing unit 5, for, with respect to each frequency band of the voice to be judged, using each corresponding second estimation mark as a weight, and performing linear interpolation between the monitoring value and the pure voice estimate to obtain the corresponding output value.
In a preferred embodiment of the present invention, an intelligent terminal is also provided, which uses the above method of voice de-noising based on output offset noise.
In a preferred embodiment of the present invention, an intelligent terminal is also provided, which includes the above system of voice de-noising based on output offset noise.
The above are only the preferred embodiments of the present invention and do not thereby limit the embodiments and protection scope of the present invention. Those skilled in the art should appreciate that all schemes obtained by equivalent substitutions and obvious variations made using the description and drawings of the present invention shall be included within the protection scope of the present invention.

Claims (14)

1. A method of voice de-noising based on output offset noise, applicable to an intelligent terminal, characterized in that a plurality of pre-trained counteracting noises of different sound intensities are provided, and the method comprises the following steps:
Step S1, collecting the voice of an external input;
Step S2, obtaining the sound intensity of the voice, matching the sound intensity of the voice against the plurality of counteracting noises of different sound intensities, obtaining the counteracting noise whose sound intensity is identical to that of the voice, and outputting the counteracting noise;
Step S3, collecting the voice of the external input, judging whether the sound intensity of the voice is higher than a preset intensity threshold, confirming the voice as the voice to be judged when the sound intensity is higher than the intensity threshold, and turning to step S4;
Step S4, generating, according to the frequency spectrum of the voice to be judged, the estimation marks of each frequency band of the corresponding voice to be judged, the estimation marks being used to represent the conspicuousness of the voice on the harmonic structure;
Step S5, generating the probabilistic model of the pure voice corresponding to the voice to be judged;
Step S6, using each estimation mark as the weight index of the corresponding frequency band of the voice to be judged, and processing according to the probabilistic model to obtain the pure voice estimate associated with the voice.
2. the method for the voice de-noising as claimed in claim 1 based on output offset noise, its feature exists In the estimation mark generated in the step S4 includes the first estimation mark;Or
The estimation mark generated in the step S4 includes the first estimation mark and the second estimation mark.
3. the method for the voice de-noising as claimed in claim 2 based on output offset noise, its feature exists In in the step S4, the step of generating the first estimation mark specifically includes:
Step S41a, according to the frequency spectrum of the voice to be judged, extracts and corresponds to the language to be judged The harmonic structure of sound;
Step S42a, is carried out at regularization to the monitoring value being associated with the number spectral domain of the harmonic structure Reason, and smoothing processing is performed to the monitoring value handled by regularization according to melscale;
Step S43a, further regularization processing is carried out to the monitoring value Jing Guo smoothing processing, with The average for making the monitoring value is 1;
Step S24a, each frequency band of voice to be judged according to the monitoring value generates correspondence The first estimation mark.
4. the method for the voice de-noising as claimed in claim 3 based on output offset noise, its feature exists In, in the step S6, according to described first estimation mark processing obtain the pure voice estimate Method is specifically included:
Step S61a, processing obtains being associated with the posteriority of the Minimum Mean Squared Error estimation of the voice to be judged Probability;
Step S62a, using each first estimation mark as described in the corresponding voice to be judged The weight index of frequency band, it is general to the posteriority for being associated with the voice to be judged according to the probabilistic model Rate is weighted, to obtain the pure voice estimate.
5. the method for the voice de-noising as claimed in claim 3 based on output offset noise, its feature exists In in the step S4, the step of generating the second estimation mark specifically includes:
Step S41b, according to the frequency spectrum of the voice to be judged, extracts and corresponds to the language to be judged The harmonic structure of sound;
Step S42b, is carried out at regularization to the monitoring value being associated with the number spectral domain of the harmonic structure Reason, and smoothing processing is performed to the monitoring value handled by regularization according to melscale;
Step S43b, is carried out at corresponding regularization to the monitoring value Jing Guo smoothing processing from 0 to 1 Reason;
Step S44b, each frequency band of voice to be judged according to the monitoring value generates correspondence The second estimation mark.
6. the method for the voice de-noising as claimed in claim 5 based on output offset noise, its feature exists In performing after the step S6, following step continued executing with always according to the described second estimation mark:
, will each corresponding second estimation mark conduct for each frequency band of the voice to be judged Weight, is obtained pair with performing linear interpolation between the monitoring value and the pure voice estimate and handling The output valve answered.
7. A system of voice de-noising based on output offset noise, applicable to an intelligent terminal, characterized by including:
a collecting unit, for collecting the voice of an external input;
a storage unit, for storing a plurality of pre-trained counteracting noises of different sound intensities;
a matching unit, connected to the collecting unit and the storage unit respectively, for matching the sound intensity against the plurality of counteracting noises of different sound intensities, and obtaining the counteracting noise whose sound intensity is identical to that of the voice;
an output unit, connected to the matching unit, for outputting the counteracting noise whose sound intensity is identical to that of the voice;
a judging unit, connected to the collecting unit, in which an intensity threshold is preset, for judging whether the sound intensity of the voice of the external input is higher than the intensity threshold and outputting the corresponding judged result;
a first processing unit, connected to the judging unit, for confirming the voice as the voice to be judged when, according to the judged result, the sound intensity of the voice is higher than the intensity threshold, and for generating, according to the frequency spectrum of the voice to be judged, the estimation marks of each frequency band of the corresponding voice to be judged, the estimation marks being used to represent the conspicuousness of the voice on the harmonic structure;
a model generation unit, connected to the first processing unit, for generating the probabilistic model of the pure voice corresponding to the voice to be judged;
a second processing unit, connected to the model generation unit, for using each estimation mark as the weight index of the corresponding frequency band of the voice to be judged, and processing according to the probabilistic model to obtain the pure voice estimate associated with the voice.
8. The system of voice de-noising based on output offset noise as claimed in claim 7, characterized in that the estimation marks include first estimation marks; or
the estimation marks include first estimation marks and second estimation marks.
9. The system of voice de-noising based on output offset noise as claimed in claim 8, characterized in that the first processing unit specifically includes:
an extraction module, for extracting, according to the frequency spectrum of the voice to be judged, the harmonic structure corresponding to the voice to be judged;
a first processing module, connected to the extraction module, for performing regularization processing on the monitoring values associated with the log-spectral domain of the harmonic structure, and performing smoothing processing on the regularized monitoring values according to the mel scale;
a second processing module, connected to the first processing module, for performing further regularization processing on the smoothed monitoring values so that the mean of the monitoring values is 1;
a first generation module, connected to the second processing module, for generating, according to the monitoring values, the first estimation mark of each frequency band of the corresponding voice to be judged.
10. The system of voice de-noising based on output offset noise as claimed in claim 9, characterized in that the second processing unit specifically includes:
a third processing module, for processing to obtain the posterior probability of the Minimum Mean Squared Error estimation associated with the voice to be judged;
a fourth processing module, connected to the third processing module, for using each first estimation mark as the weight index of the corresponding frequency band of the voice to be judged, and weighting the posterior probability associated with the voice to be judged according to the probabilistic model, so as to obtain the pure voice estimate.
11. The system of voice de-noising based on output offset noise as claimed in claim 9, characterized in that the first processing unit includes:
a fifth processing module, connected to the first processing module, for performing corresponding regularization processing on the smoothed monitoring values so that they range from 0 to 1;
a second generation module, connected to the fifth processing module, for generating, according to the monitoring values, the second estimation mark of each frequency band of the corresponding voice to be judged.
12. The system of voice de-noising based on output offset noise as claimed in claim 11, characterized by further including:
a third processing unit, connected to the second processing unit, for, with respect to each frequency band of the voice to be judged, using each corresponding second estimation mark as a weight, and performing linear interpolation between the monitoring value and the pure voice estimate to obtain the corresponding output value.
13. An intelligent terminal, characterized by using the method of voice de-noising based on output offset noise as claimed in any one of claims 1-6.
14. An intelligent terminal, characterized by including the system of voice de-noising based on output offset noise as claimed in any one of claims 7-12.
CN201610024759.2A 2016-01-14 2016-01-14 The method and system and intelligent terminal of voice de-noising based on output offset noise Pending CN106971707A (en)


Publications (1)

Publication Number Publication Date
CN106971707A true CN106971707A (en) 2017-07-21


Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109360580A (en) * 2018-12-11 2019-02-19 珠海市微半导体有限公司 A kind of iterated denoising device and clean robot based on speech recognition

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101261829A (en) * 2007-03-08 2008-09-10 上海西门子医疗器械有限公司 Noise restraint method for execution device of electronic medical treatment system and its electronic medical treatment system
CN101271686A (en) * 2007-03-22 2008-09-24 三星电子株式会社 Method and apparatus for estimating noise by using harmonics of voice signal
CN101416237A (en) * 2006-05-01 2009-04-22 日本电信电话株式会社 Method and apparatus for removing voice reverberation based on probability model of source and room acoustics
CN102800324A (en) * 2012-07-30 2012-11-28 东莞宇龙通信科技有限公司 Audio processing system and method for mobile terminals
CN103310798A (en) * 2012-03-07 2013-09-18 国际商业机器公司 System and method for noise reduction

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109360580A (en) * 2018-12-11 2019-02-19 珠海市微半导体有限公司 Iterative denoising device and cleaning robot based on speech recognition
CN109360580B (en) * 2018-12-11 2022-01-04 珠海一微半导体股份有限公司 Iterative denoising device and cleaning robot based on speech recognition

Similar Documents

Publication Publication Date Title
CN106971741A Voice denoising method and system with real-time speech separation
CN105513605B Speech enhancement system and speech enhancement method for a mobile microphone
CN103310798B (en) Noise-reduction method and device
Prasad et al. Improved cepstral mean and variance normalization using Bayesian framework
KR100745976B1 (en) Method and apparatus for classifying voice and non-voice using sound model
US9355642B2 (en) Speaker recognition method through emotional model synthesis based on neighbors preserving principle
CN102968990B Speaker identification method and system
Hui et al. Convolutional maxout neural networks for speech separation
CN112017682B (en) Single-channel voice simultaneous noise reduction and reverberation removal system
CN103794207A (en) Dual-mode voice identity recognition method
CN110070880A Method for establishing and applying a joint statistical model for classification
CN103021405A (en) Voice signal dynamic feature extraction method based on MUSIC and modulation spectrum filter
CN102789779A (en) Speech recognition system and recognition method thereof
CN106373559A Robust feature extraction method based on log-spectrum noise-to-signal weighting
CN111161713A (en) Voice gender identification method and device and computing equipment
US9087513B2 (en) Noise reduction method, program product, and apparatus
CN112017658A (en) Operation control system based on intelligent human-computer interaction
CN106971733A Voiceprint recognition method, system, and intelligent terminal based on voice denoising
Nakatani et al. Logmax observation model with MFCC-based spectral prior for reduction of highly nonstationary ambient noise
Ghai et al. A Study on the Effect of Pitch on LPCC and PLPC Features for Children's ASR in Comparison to MFCC.
CN106971707A Voice denoising method, system, and intelligent terminal based on output-offset noise
CN108022588B (en) Robust speech recognition method based on dual-feature model
CN106971739A Voice denoising method, system, and intelligent terminal
Du et al. Cepstral shape normalization (CSN) for robust speech recognition
TWI749547B (en) Speech enhancement system based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170721