CN106971741A - Method and system for voice de-noising that separates speech in real time - Google Patents
Method and system for voice de-noising that separates speech in real time
- Publication number
- CN106971741A (application CN201610024317.8A)
- Authority
- CN
- China
- Prior art keywords
- voice
- judged
- estimation
- processing
- noise
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 54
- 238000012545 processing Methods 0.000 claims description 108
- 238000012544 monitoring process Methods 0.000 claims description 47
- 238000001228 spectrum Methods 0.000 claims description 41
- 238000009499 grossing Methods 0.000 claims description 21
- 239000000284 extract Substances 0.000 claims description 19
- 238000012163 sequencing technique Methods 0.000 claims description 14
- 230000003595 spectral effect Effects 0.000 claims description 9
- 238000005520 cutting process Methods 0.000 claims description 7
- 238000000605 extraction Methods 0.000 claims description 7
- 238000004519 manufacturing process Methods 0.000 claims description 4
- 238000012986 modification Methods 0.000 claims description 4
- 230000004048 modification Effects 0.000 claims description 4
- 238000003672 processing method Methods 0.000 claims 1
- 230000008569 process Effects 0.000 abstract description 12
- 230000000694 effects Effects 0.000 abstract description 5
- 238000009826 distribution Methods 0.000 description 11
- 230000006870 function Effects 0.000 description 11
- 230000008859 change Effects 0.000 description 3
- 238000010586 diagram Methods 0.000 description 3
- 238000005516 engineering process Methods 0.000 description 2
- 238000001914 filtration Methods 0.000 description 2
- 230000006872 improvement Effects 0.000 description 2
- 239000011159 matrix material Substances 0.000 description 2
- 239000000203 mixture Substances 0.000 description 2
- 230000009467 reduction Effects 0.000 description 2
- 238000000926 separation method Methods 0.000 description 2
- 238000012549 training Methods 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000015556 catabolic process Effects 0.000 description 1
- 238000006243 chemical reaction Methods 0.000 description 1
- 230000006854 communication Effects 0.000 description 1
- 230000007850 degeneration Effects 0.000 description 1
- 238000006731 degradation reaction Methods 0.000 description 1
- 238000011156 evaluation Methods 0.000 description 1
- 238000009432 framing Methods 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
- 230000001755 vocal effect Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Quality & Reliability (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Circuit For Audible Band Transducer (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a method and system for voice de-noising that separates speech in real time, belonging to the technical field of speech recognition. In the method, the externally input sound source is divided into speech segments of a preset duration, and each segment is matched against a characteristic model to separate the noise from the speech-carrying signal; the characteristic model is then updated in real time with the identified noise, so that continuously changing external noise can be recognized in real time. At the same time, the identified noise is used as a reference sample to generate a probabilistic model of the clean speech of the speech to be judged, and the speech-carrying signal is processed against this model to obtain a clean-speech estimate. The method improves the removal of ambient noise, better excludes strong ambient-noise interference during speech recognition, and improves recognition accuracy.
Description
Technical field
The present invention relates to the technical field of speech recognition, and more particularly to a method and system for voice de-noising that separates speech in real time.
Background technology
In the prior art, intelligent terminals that support voice operation often need a speech-recognition function: by recognizing the voiceprint and the sentence of a speaker, the terminal obtains an instruction it can execute and then performs the corresponding operation. Because the sound produced by a speaker picks up extraneous noise and speaking-induced noise on its way to the terminal, the received signal mainly consists of two parts: one part carries the speaker's voice together with the noise produced while speaking, and the other part is noise added to the sound by the environment during transmission. In application scenarios where the interference from non-speaker noise is strong (for example, a space with many speakers, or a space with loud ambient noise), the ambient noise blends with the speaker's voice commands, making speech recognition increasingly difficult and markedly reducing recognition accuracy.

Because extraneous noise changes continuously, a characteristic model that matches the constantly changing noise is needed in order to recognize the noise and filter it out; however, no ideal technical scheme for such noise filtering currently exists.
Summary of the invention
In view of the above problems in the prior art, a method and system for voice de-noising that separates speech in real time, applicable to an intelligent terminal, are now provided. The technical scheme specifically includes:
A method for voice de-noising that separates speech in real time, applicable to an intelligent terminal provided with a preset characteristic model, comprising the following steps:
Step S1: collecting the sound source of external input, and storing it;
Step S2: dividing the sound source, according to the time order of reception, into a plurality of speech segments of a preset duration;
Step S3: extracting one speech segment according to the time order, matching the segment against the characteristic model to obtain the noise that matches the characteristic model and the speech-carrying signal that does not match the initial characteristic model, and generating a matching identifier for the segment, the matching identifier indicating that the segment has been matched against the characteristic model;
Step S4: adding the noise as a noise sample for the characteristic model, and updating the characteristic model according to the noise sample to form a new characteristic model;
Step S5: judging whether the sound intensity of the speech-carrying signal is higher than a preset intensity threshold, confirming the signal as speech to be judged when its sound intensity is higher than the intensity threshold, and turning to step S6;
Step S6: generating, according to the spectrum of the speech to be judged, an estimation mark for each frequency band of the speech to be judged, the estimation mark indicating the prominence of the speech on the harmonic structure;
Step S7: taking the noise as a reference sample, and generating, from the reference sample and the speech to be judged, a probabilistic model of the clean speech corresponding to the speech to be judged;
Step S8: taking each estimation mark as a weight exponent for the corresponding frequency band of the speech to be judged, and processing according to the probabilistic model to obtain a clean-speech estimate associated with the speech;
Step S9: extracting, according to the time order, a speech segment that has not yet been given a matching identifier, matching it against the characteristic model to obtain the noise that matches the new characteristic model and the speech-carrying signal that does not match the characteristic model, generating the matching identifier of the segment, and returning to step S4.
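The loop of steps S1-S9 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the segment length, intensity threshold, matching routine, model update, and clean-speech estimator are all placeholders supplied by the caller.

```python
import math

SEGMENT_LEN = 160          # samples per preset time period (hypothetical)
INTENSITY_THRESHOLD = 0.1  # preset intensity threshold (hypothetical)

def segment(source, seg_len=SEGMENT_LEN):
    """Step S2: split the sound source into segments of a preset duration."""
    return [source[i:i + seg_len] for i in range(0, len(source), seg_len)]

def rms(seg):
    """Sound intensity of a segment (root mean square)."""
    return math.sqrt(sum(s * s for s in seg) / len(seg)) if seg else 0.0

def denoise_stream(source, match, update_model, estimate_clean, model):
    """Steps S3-S9: match each segment against the characteristic model,
    update the model with the separated noise, and estimate clean speech
    for segments whose intensity exceeds the threshold."""
    outputs = []
    for seg in segment(source):
        noise, speech = match(seg, model)              # step S3
        model = update_model(model, noise)             # step S4
        if rms(speech) > INTENSITY_THRESHOLD:          # step S5
            outputs.append(estimate_clean(speech, noise, model))  # S6-S8
    return outputs, model
```

With trivial stubs for the three callbacks, a 320-sample source yields two processed segments, one per preset time period.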
Preferably, in the method for voice de-noising that separates speech in real time, the estimation mark generated in step S6 includes a first estimation mark; or
the estimation mark generated in step S6 includes a first estimation mark and a second estimation mark.
Preferably, in the method for voice de-noising that separates speech in real time, the step of generating the first estimation mark in step S6 specifically includes:
Step S61a: extracting, according to the spectrum of the speech to be judged, the harmonic structure corresponding to the speech to be judged;
Step S62a: performing regularization processing on the monitoring value associated with the log-spectral domain of the harmonic structure, and performing smoothing on the regularized monitoring value according to the mel scale;
Step S63a: performing further regularization processing on the smoothed monitoring value so that the mean of the monitoring value is 1;
Step S64a: generating, according to the monitoring value, the first estimation mark corresponding to each frequency band of the speech to be judged.
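Steps S62a-S64a can be sketched as below; this is an assumption-laden illustration in which a simple moving average across bands stands in for mel-scale smoothing, and the harmonic-structure extraction of step S61a is taken as already done (its output being the per-band monitoring values).

```python
def smooth(values, radius=1):
    """Step S62a (sketch): moving-average smoothing across bands,
    standing in for smoothing on the mel scale."""
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - radius), min(len(values), i + radius + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

def first_estimation_marks(monitoring_values):
    """Steps S63a-S64a (sketch): regularize the smoothed monitoring
    values so that their mean is 1; the result is one first estimation
    mark per frequency band."""
    smoothed = smooth(monitoring_values)
    mean = sum(smoothed) / len(smoothed)
    return [v / mean for v in smoothed]
```

By construction the marks average to 1, so bands with prominent harmonic structure get marks above 1 and weak bands get marks below 1.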
Preferably, in the method for voice de-noising that separates speech in real time, the method of obtaining the clean-speech estimate in step S8 from the first estimation mark specifically includes:
Step S81a: processing to obtain the posterior probability associated with the minimum mean square error estimation of the speech to be judged;
Step S82a: taking each first estimation mark as the weight exponent of the corresponding frequency band of the speech to be judged, and weighting the posterior probability associated with the speech to be judged according to the probabilistic model, to obtain the clean-speech estimate.
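One plausible reading of steps S81a-S82a, sketched under the explicit assumption that the per-band first estimation mark is applied as an exponent to the MMSE posterior probabilities before the posterior-weighted sum of the Gaussian means (the patent text does not spell out the exact weighting formula):

```python
def weighted_clean_estimate(posteriors, means, marks):
    """Sketch of steps S81a-S82a: per band d, raise each posterior
    rho_k to the power of that band's first estimation mark,
    renormalize, and take the weighted sum of the mean vectors mu_{x,k}.

    posteriors: K posterior probabilities rho_k(y)
    means:      K x D mean vectors mu_{x,k}
    marks:      D first estimation marks (weight exponents)
    """
    estimate = []
    for d in range(len(marks)):
        weights = [p ** marks[d] for p in posteriors]
        total = sum(weights)
        estimate.append(sum(w / total * means[k][d]
                            for k, w in enumerate(weights)))
    return estimate
```

With a mark of 1 this reduces to the plain posterior-weighted sum; a mark of 0 flattens the posteriors, so the exponent controls how strongly the harmonic-structure evidence sharpens the mixture weights in each band.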
Preferably, in the method for voice de-noising that separates speech in real time, the step of generating the second estimation mark in step S6 specifically includes:
Step S61b: extracting, according to the spectrum of the speech to be judged, the harmonic structure corresponding to the speech to be judged;
Step S62b: performing regularization processing on the monitoring value associated with the log-spectral domain of the harmonic structure, and performing smoothing on the regularized monitoring value according to the mel scale;
Step S63b: performing corresponding regularization processing from 0 to 1 on the smoothed monitoring value;
Step S64b: generating, according to the monitoring value, the second estimation mark corresponding to each frequency band of the speech to be judged.
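Step S63b only says the smoothed values are regularized "from 0 to 1"; a minimal sketch, assuming min-max scaling (smoothing as in steps S61b-S62b is taken as already applied):

```python
def second_estimation_marks(monitoring_values):
    """Sketch of steps S63b-S64b: regularize the smoothed monitoring
    values into the range [0, 1] by min-max scaling (an assumption --
    the patent does not fix the regularization formula)."""
    lo, hi = min(monitoring_values), max(monitoring_values)
    if hi == lo:
        return [0.0 for _ in monitoring_values]
    return [(v - lo) / (hi - lo) for v in monitoring_values]
```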
Preferably, in the method for voice de-noising that separates speech in real time, after step S8 is performed, the following step is further performed according to the second estimation mark:
for each frequency band of the speech to be judged, taking the corresponding second estimation mark as a weight, performing linear interpolation between the monitoring value and the clean-speech estimate, and processing to obtain the corresponding output value.
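The interpolation step admits a direct sketch. Which endpoint the mark favours is an assumption; here a mark of 1 selects the clean-speech estimate and a mark of 0 keeps the monitoring value:

```python
def interpolate_output(monitoring, clean_estimate, second_marks):
    """Per band: output = m * clean + (1 - m) * monitoring, with the
    second estimation mark m in [0, 1] as the interpolation weight
    (the direction of the weighting is an assumption)."""
    return [m * c + (1.0 - m) * v
            for v, c, m in zip(monitoring, clean_estimate, second_marks)]
```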
A system for voice de-noising that separates speech in real time, applicable to an intelligent terminal, including:
a collecting unit, for collecting the sound source of external input;
a memory unit, connected to the collecting unit, for storing the sound source;
a cutting unit, connected to the memory unit, for dividing the sound source, according to the time order of reception, into a plurality of speech segments of a preset duration;
a separating unit, connected to the cutting unit and provided with a preset characteristic model, for extracting one speech segment according to the time order, matching the segment against the characteristic model to obtain the noise that matches the characteristic model and the speech-carrying signal that does not match the initial characteristic model, and generating a matching identifier for the segment, the matching identifier indicating that the segment has been matched against the characteristic model;
a model updating unit, connected to the separating unit, for adding the noise as a noise sample for the characteristic model, updating the characteristic model according to the noise sample to form a new characteristic model, and sending the new characteristic model to the separating unit;
a judging unit, connected to the separating unit, in which an intensity threshold is preset, for judging whether the sound intensity of the speech-carrying signal is higher than the intensity threshold, and outputting the corresponding judgment result;
a first processing unit, connected to the judging unit, for confirming the signal as speech to be judged when, according to the judgment result, its sound intensity is higher than the intensity threshold, and for generating, according to the spectrum of the speech to be judged, the estimation mark of each frequency band of the speech to be judged, the estimation mark indicating the prominence of the speech on the harmonic structure;
a model generating unit, connected to the first processing unit and the separating unit respectively, for taking the noise as a reference sample, and generating, from the reference sample and the speech to be judged, the probabilistic model of the clean speech corresponding to the speech to be judged;
a second processing unit, connected to the model generating unit and the separating unit respectively, for taking each estimation mark as the weight exponent of the corresponding frequency band of the speech to be judged, and processing according to the probabilistic model to obtain the clean-speech estimate associated with the speech.
Preferably, in the system for voice de-noising that separates speech in real time, the estimation mark includes a first estimation mark; or
the estimation mark includes a first estimation mark and a second estimation mark.
Preferably, in the system for voice de-noising that separates speech in real time, the first processing unit specifically includes:
an extraction module, for extracting, according to the spectrum of the speech to be judged, the harmonic structure corresponding to the speech to be judged;
a first processing module, connected to the extraction module, for performing regularization processing on the monitoring value associated with the log-spectral domain of the harmonic structure, and performing smoothing on the regularized monitoring value according to the mel scale;
a second processing module, connected to the first processing module, for performing further regularization processing on the smoothed monitoring value so that the mean of the monitoring value is 1;
a first generating module, connected to the second processing module, for generating, according to the monitoring value, the first estimation mark of each frequency band of the speech to be judged.
Preferably, in the system for voice de-noising that separates speech in real time, the second processing unit specifically includes:
a third processing module, for processing to obtain the posterior probability associated with the minimum mean square error estimation of the speech to be judged;
a fourth processing module, connected to the third processing module, for taking each first estimation mark as the weight exponent of the corresponding frequency band of the speech to be judged, and weighting the posterior probability associated with the speech to be judged according to the probabilistic model, to obtain the clean-speech estimate.
Preferably, in the system for voice de-noising that separates speech in real time, the first processing unit further includes:
a fifth processing module, connected to the first processing unit, for performing corresponding regularization processing from 0 to 1 on the smoothed monitoring value;
a second generating module, connected to the fifth processing module, for generating, according to the monitoring value, the second estimation mark of each frequency band of the speech to be judged.
Preferably, the system for voice de-noising that separates speech in real time further includes:
a third processing unit, connected to the second processing unit, for taking, for each frequency band of the speech to be judged, the corresponding second estimation mark as a weight, performing linear interpolation between the monitoring value and the clean-speech estimate, and processing to obtain the corresponding output value.
The beneficial effects of the above technical scheme are:
1) a method for voice de-noising that separates speech in real time is provided, in which the externally input sound source is divided into speech segments of a preset duration, each segment is matched against a characteristic model to separate the noise from the speech-carrying signal, and the characteristic model is updated in real time with the identified noise, so that continuously changing external noise can be recognized in real time; at the same time, the noise is used as a reference sample to generate a probabilistic model of the clean speech of the speech to be judged, and the speech-carrying signal is processed to obtain a clean-speech estimate, which improves the removal of ambient noise, better excludes strong ambient-noise interference during speech recognition, and improves recognition accuracy;
2) a system for voice de-noising that separates speech in real time is provided, which supports realizing the above method.
Brief description of the drawings
Fig. 1 is an overall flow diagram of a method for voice de-noising that separates speech in real time, in a preferred embodiment of the present invention;
Figs. 2-4 are flow diagrams of sub-steps of the method for voice de-noising that separates speech in real time, on the basis of Fig. 1, in preferred embodiments of the present invention;
Fig. 5 is an overall structural diagram of a system for voice de-noising that separates speech in real time, in a preferred embodiment of the present invention;
Figs. 6-7 are unit-level structural diagrams of the system for voice de-noising that separates speech in real time, on the basis of Fig. 5, in preferred embodiments of the present invention.
Embodiment
The technical scheme in the embodiments of the present invention will now be described clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative work fall within the scope of protection of the invention.
It should be noted that, in case of no conflict, the embodiments of the present invention and the features in the embodiments may be combined with each other.
The invention will be further described below with reference to the accompanying drawings and specific embodiments, which are not to be taken as limiting the invention.
Typically, a speech-recognition system applicable to an intelligent terminal includes two parts, a front end and a back end. The front end applies certain speech-conversion techniques to extract corresponding feature quantities from the speech input by the speaker; the back end then performs speech recognition on the extracted feature quantities, using a recognition model trained in advance, to determine the content contained in the sentence input by the speaker. The technical scheme of the present invention is an improvement of the front end of such a speech-recognition system, i.e. an improvement of the process of extracting feature quantities from externally input speech, intended to reduce the influence of ambient noise on that process.
Therefore, in view of the above problems in the prior art, a preferred embodiment of the invention now provides a method for voice de-noising that separates speech in real time, applicable to an intelligent terminal, for example an intelligent robot that supports voice operation.
In this technical scheme, the "externally input speech" and the "speech to be judged" are the speech of a speaker superimposed with ambient noise. "Clean speech" refers to the speech of the speaker with the ambient noise removed. The so-called "clean-speech estimate" refers to the clean speech obtained by estimation from the above speech to be judged (i.e. the speech including ambient noise). "Spectrum" refers to the power spectrum or amplitude spectrum of the speech.
The technical scheme of the present invention described hereinafter is developed from the prior art, i.e. it is an improvement of the noise-elimination technique realized by MMSE (Minimum Mean Square Error) estimation.
Therefore, before describing the technical scheme of the invention, the MMSE-based noise-elimination technique is described first: given an observed speech value y (corresponding to the speech superimposed with ambient noise above), the clean speech value x is modeled by a probability distribution model p(x | y), and an estimate of the clean speech x is derived from p(x | y). MMSE estimation is then used as the basic technique in the estimation of the subsequent stages.
In the MMSE estimation technique, the speech of the speaker is first collected and recorded with a microphone as the observed speech; the observed speech is then converted into a digital signal by A/D conversion, and converted, through framing and DFT (Discrete Fourier Transform), into the spectrum of each frame of speech. Next, the spectrum of each frame is passed through a mel filter bank (a filter bank in which band-pass filters are arranged at equal intervals on the mel scale) and its logarithm is taken, producing the mel log-spectrum as output.
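The front-end pipeline just described (framing, DFT, filter bank, logarithm) can be sketched in pure Python. The pooling "filter bank" below is a toy stand-in for real triangular mel filters, and the frame and band sizes are arbitrary illustration values:

```python
import cmath
import math

def frame_signal(signal, frame_len, hop):
    """Framing: split the signal into overlapping frames."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def power_spectrum(frame):
    """Naive DFT of one frame, returning the power in each bin
    up to the Nyquist bin."""
    N = len(frame)
    spec = []
    for k in range(N // 2 + 1):
        s = sum(x * cmath.exp(-2j * math.pi * k * n / N)
                for n, x in enumerate(frame))
        spec.append(abs(s) ** 2)
    return spec

def mel_log_spectrum(frame, n_bands=4):
    """Toy mel log-spectrum: pool adjacent power bins into bands and
    take the log -- a stand-in for a real triangular mel filter bank."""
    spec = power_spectrum(frame)
    band = max(1, len(spec) // n_bands)
    return [math.log(sum(spec[i:i + band]) + 1e-12)
            for i in range(0, band * n_bands, band)]
```

A real front end would use an FFT and properly spaced mel filters; the sketch only fixes the order of operations the text describes.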
In the prior art, based on the output mel log-spectrum, a clean-speech estimate of each frame can be generated and output.
The MMSE estimation technique performs MMSE estimation on the probability distribution model formed above, and the clean-speech estimate can thereby be generated. It is noted that the saved probability distribution model is a GMM (Gaussian Mixture Model) in the mel log-spectral domain, i.e. a model generated for each phoneme by prior learning. The clean-speech estimate generated by MMSE estimation is then used as a vector in the mel log-spectral domain.
Then, specific feature quantities can be extracted from the output clean-speech estimate, for example mel-frequency cepstral coefficients (MFCC), and the feature quantities are sent to the back end. In the back end, the content contained in the speaker's sentence is determined from the feature quantities received from the front end, using configured recognition models such as an HMM (Hidden Markov Model), an acoustic model, or an N-gram language model.
In the prior art, the speech value y_d(t) in the mel log-spectral domain, for frequency band d (a band on the mel scale) of frame t of the above speech value y, can be expressed as the following function (1) of the clean speech value x_d(t) and the noise value n_d(t):
y_d(t) = x_d(t) + log(1 + exp(n_d(t) - x_d(t)))   (1)
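Formula (1) can be checked numerically: the noisy log-spectral value never falls below the clean value, reduces to it when the noise is much weaker, and approaches the noise value when the noise dominates. A small sketch:

```python
import math

def noisy_log_spectrum(x_d, n_d):
    """Formula (1): y_d = x_d + log(1 + exp(n_d - x_d))."""
    return x_d + math.log(1.0 + math.exp(n_d - x_d))

# noise far below speech  -> y_d is approximately x_d
# noise far above speech  -> y_d is approximately n_d
# equal levels            -> y_d = x_d + log(2)
```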
Omitting the frame index t in formula (1) and writing formula (1) in vector form, the following formula (2) is obtained:
y = x + g   (2)
In formula (2), each component g_d of the mismatch vector g is given by the mismatch function G indicated in the following formula (3):
g_d = G_d(x, n) = log(1 + exp(n_d - x_d))   (3)
The clean speech x can then be modeled as the K-mixture GMM indicated in the following formula (4) (reconstructed here in its standard form, the original equation image being unavailable):
p(x) = Σ_{k=1..K} γ_k · N(x; μ_{x,k}, Σ_{x,k})   (4)
In formula (4), γ_k, μ_{x,k} and Σ_{x,k} respectively denote the prior probability, mean vector and covariance matrix of the k-th normal distribution.
The mismatch vector g can then be modeled, on the basis of formulas (1)-(4) and a linear Taylor expansion, as the K-mixture GMM indicated in the following formula (5) (reconstructed in its standard form):
p(g) = Σ_{k=1..K} γ_k · N(g; μ_{g,k}, Σ_{g,k})   (5)
The mean vector μ_{g,k} in formula (5) can be represented by the following formula (6), and the covariance matrix Σ_{g,k} by the following formula (7):
μ_{g,k} ≅ log(1 + exp(μ_n - μ_{x,k})) = G(μ_{x,k}, μ_n)   (6)
Σ_{g,k} ≅ F(μ_{x,k}, μ_n)² · (Σ_{x,k} + Σ_n)   (7)
The auxiliary function F in formula (7) is defined as the following formula (8):
F_d(x, n) = (1 + exp(x_d - n_d))⁻¹   (8)
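The per-band mismatch function G of formulas (3)/(6) and the auxiliary function F of formula (8) are simple enough to sketch directly; the derivative of G with respect to the noise value equals F, which is why F² weights the noise variance in formula (7):

```python
import math

def G(x_d, n_d):
    """Mismatch function, formulas (3)/(6): log(1 + exp(n_d - x_d))."""
    return math.log(1.0 + math.exp(n_d - x_d))

def F(x_d, n_d):
    """Auxiliary function, formula (8): (1 + exp(x_d - n_d))^-1,
    i.e. the partial derivative of G with respect to n_d."""
    return 1.0 / (1.0 + math.exp(x_d - n_d))
```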
Therefore, the above clean-speech estimate x̂ is obtained by processing according to the following formula (9-1) (reconstructed from formula (2), since x̂ = y − ĝ):
x̂ = y − Σ_{k=1..K} ρ_k(y) · G(μ_{x,k}, μ_n)   (9-1)
Correspondingly, a method of directly estimating the clean-speech estimate x̂ from the speech value y can also be given by the following formula (9-2):
x̂ = Σ_{k=1..K} ρ_k(y) · μ_{x,k}   (9-2)
Here, the posterior probability ρ_k in formulas (9-1) and (9-2) is given by the following formula (10) (reconstructed in its standard form):
ρ_k(y) = γ_k · N(y; μ_{y,k}, Σ_{y,k}) / Σ_{j=1..K} γ_j · N(y; μ_{y,j}, Σ_{y,j})   (10)
In formula (10), the mean vector μ_{y,k} can be represented by the following formula (11), and the covariance matrix Σ_{y,k} by the following formula (12):
μ_{y,k} ≅ μ_{x,k} + G(μ_{x,k}, μ_n)   (11)
Σ_{y,k} ≅ {1 − F(μ_{x,k}, μ_n)}² · Σ_{x,k} + F(μ_{x,k}, μ_n)² · Σ_n   (12)
In the prior art, in formulas (11)-(12), the speech model parameters [μ_{x,k}, Σ_{x,k}] can be obtained from prior training data, while the noise model parameters [μ_n, Σ_n] are set by the model-based noise compensation part 512, based on the observations in non-speech segments given to the MMSE estimating part 514.
In other words, the process of the above MMSE estimation is the process of approximating the clean-speech estimate x̂ as the sum of the K mean vectors μ_{x,k} of the probability distributions, weighted by the posterior probabilities ρ_k(y).
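For scalar observations (an illustrative simplification of the vector formulas), the posterior of formula (10) and the direct estimate of formula (9-2) can be sketched as:

```python
import math

def gaussian(y, mu, var):
    """Scalar normal density N(y; mu, var)."""
    return math.exp(-(y - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posteriors(y, priors, means_y, vars_y):
    """Formula (10), scalar case: rho_k(y) for each mixture component."""
    numer = [g * gaussian(y, m, v)
             for g, m, v in zip(priors, means_y, vars_y)]
    total = sum(numer)
    return [n / total for n in numer]

def mmse_estimate(y, priors, means_x, means_y, vars_y):
    """Formula (9-2), scalar case: posterior-weighted sum of the
    clean-speech means mu_{x,k}."""
    rho = posteriors(y, priors, means_y, vars_y)
    return sum(r * m for r, m in zip(rho, means_x))
```

When the observation lies close to one component's noisy mean μ_{y,k}, that component's posterior dominates and the estimate collapses onto its clean mean μ_{x,k}.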
Then, in a preferred embodiment of the invention, the steps of the above method for voice de-noising that separates speech in real time are as shown in Fig. 1. A preset characteristic model is provided, and the method comprises the following steps:
Step S1: collecting the sound source of external input, and storing it;
Step S2: dividing the sound source, according to the time order of reception, into a plurality of speech segments of a preset duration;
Step S3: extracting one speech segment according to the time order, matching the segment against the characteristic model to obtain the noise that matches the characteristic model and the speech-carrying signal that does not match the initial characteristic model, and generating a matching identifier for the segment, which indicates that the segment has been matched against the characteristic model;
Step S4: adding the noise as a noise sample for the characteristic model, and updating the characteristic model according to the noise sample to form a new characteristic model;
Step S5: judging whether the sound intensity of the speech-carrying signal is higher than a preset intensity threshold, confirming the signal as speech to be judged when its intensity is higher than the threshold, and turning to step S6;
Step S6: generating, according to the spectrum of the speech to be judged, the estimation mark of each frequency band of the speech to be judged, the estimation mark indicating the prominence of the speech on the harmonic structure;
Step S7: taking the noise as a reference sample, and generating, from the reference sample and the speech to be judged, the probabilistic model of the clean speech corresponding to the speech to be judged;
Step S8: taking each estimation mark as the weight exponent of the corresponding frequency band of the speech to be judged, and processing according to the probabilistic model to obtain the clean-speech estimate associated with the speech;
Step S9: extracting, according to the time order, a speech segment that has not yet been given a matching identifier, matching it against the characteristic model to obtain the noise that matches the new characteristic model and the speech-carrying signal that does not match the characteristic model, generating the matching identifier of the segment, and returning to step S4.
In this embodiment, the method for voice de-noising that separates speech in real time divides the externally input sound source into speech segments of a preset duration, matches each segment against the characteristic model to separate the noise from the speech-carrying signal, and updates the characteristic model in real time with the identified noise, thereby recognizing in real time the continuously changing external noise. At the same time, the noise is used as a reference sample to generate a probabilistic model of the clean speech of the speech to be judged, and the speech-carrying signal is processed against this model to obtain a clean-speech estimate. This improves the removal of ambient noise, better excludes strong ambient-noise interference during speech recognition, and improves recognition accuracy.
In a specific embodiment, the external voice (the voice of a speaker) is first collected, and it is judged whether the sound intensity of the collected voice is greater than a preset intensity threshold. The main purpose of this judgement is to exclude scenes in which the speaker never intended to voice-control the intelligent terminal, for example when the speaker talks quietly with other people, or lets a sentence slip out. Therefore, only when the voice spoken by the speaker is relatively strong (greater than the preset intensity threshold) is the speaker considered to be issuing a voice instruction to the intelligent terminal; only then does the intelligent terminal need to begin speech recognition, preceded by the voice-separating denoising. This judgement thus prevents the speech-recognition and real-time voice-separating denoising modules of the intelligent terminal from being in a working state at all times, and saves power on the intelligent terminal.
In this embodiment, when the sound intensity of the speaker's voice is greater than the above preset intensity threshold, Step S6 is performed: according to the frequency spectrum of the voice to be judged, an estimation mark is generated for each frequency band of the voice to be judged. In this embodiment, the estimation mark represents the conspicuousness of the voice on the harmonic structure.
In this embodiment, a probabilistic model of the pure voice corresponding to the voice to be judged is generated, and, with each estimation mark as the weight index of the corresponding frequency band of the voice to be judged, processing according to the probabilistic model yields a pure voice estimate associated with the voice.
In a preferred embodiment of the present invention, in the above Step S6, the generated estimation mark includes a first estimation mark; or
in the above Step S6, the generated estimation mark includes a first estimation mark and a second estimation mark.
In a preferred embodiment of the present invention, as shown in Fig. 2, the step of generating the first estimation mark in the above Step S6 specifically includes:
Step S61a: according to the frequency spectrum of the voice to be judged, extracting the harmonic structure corresponding to the voice to be judged;
Step S62a: performing regularization processing on the monitoring values associated with the log-spectral domain of the harmonic structure, and performing smoothing processing on the regularized monitoring values according to the mel scale;
Step S63a: performing further regularization processing on the smoothed monitoring values, so that the average of the monitoring values is 1;
Step S64a: generating, according to the monitoring values, the first estimation mark of each frequency band of the corresponding voice to be judged.
In a preferred embodiment of the present invention, as shown in Fig. 3, in the above Step S8, the method of obtaining the pure voice estimate by processing according to the first estimation mark specifically includes:
Step S81a: processing to obtain the posterior probabilities of the minimum mean square error estimation associated with the voice to be judged;
Step S82a: with each first estimation mark as the weight index of the corresponding frequency band of the voice to be judged, weighting the posterior probabilities associated with the voice to be judged according to the probabilistic model, to obtain the pure voice estimate.
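Steps S81a and S82a can be sketched numerically as follows. This is a generic diagonal-covariance Gaussian-mixture version in which each band's log-likelihood is scaled by the first estimation mark before the posteriors are formed, not the patent's exact formulas; all variable names and numbers are illustrative:

```python
import numpy as np

def weighted_posteriors(y, means, variances, priors, alpha):
    """Posterior of each mixture component k for observation y, with each
    dimension's log-likelihood weighted by the per-band estimation mark
    alpha (step S82a). Diagonal covariances are assumed."""
    # log N(y_d; mu_kd, var_kd) for every component k and band d, shape (K, D)
    log_norm = -0.5 * (np.log(2 * np.pi * variances)
                       + (y - means) ** 2 / variances)
    # alpha_d acts as an exponent on each per-band density
    log_joint = np.log(priors) + (alpha * log_norm).sum(axis=1)
    log_joint -= log_joint.max()              # numerical stability
    post = np.exp(log_joint)
    return post / post.sum()

def mmse_estimate(y, means, variances, priors, alpha):
    """Clean-speech estimate as the posterior-weighted sum of component
    means (a generic MMSE-style sketch)."""
    rho = weighted_posteriors(y, means, variances, priors, alpha)
    return rho @ means

# illustrative two-component model in two bands
means = np.array([[0.0, 0.0], [5.0, 5.0]])
rho = weighted_posteriors(np.array([0.1, -0.2]), means,
                          np.ones((2, 2)), np.array([0.5, 0.5]), np.ones(2))
x_hat = rho @ means
```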
In a preferred embodiment of the present invention, as shown in Fig. 4, the step of generating the second estimation mark in the above Step S6 specifically includes:
Step S61b: according to the frequency spectrum of the voice to be judged, extracting the harmonic structure corresponding to the voice to be judged;
Step S62b: performing regularization processing on the monitoring values associated with the log-spectral domain of the harmonic structure, and performing smoothing processing on the regularized monitoring values according to the mel scale;
Step S63b: performing corresponding regularization processing on the smoothed monitoring values so that they take values from 0 to 1;
Step S64b: generating, according to the monitoring values, the second estimation mark of each frequency band of the corresponding voice to be judged.
In a preferred embodiment of the present invention, after Step S8 is performed, the following step is further performed according to the second estimation mark:
for each frequency band of the voice to be judged, with each corresponding second estimation mark as a weight, performing linear interpolation between the monitoring value and the pure voice estimate to obtain the corresponding output value.
An embodiment of the technical solution of the present invention is given below:
In conventional MMSE, the pure voice estimate is given by the above formulas (9-1) and (9-2), and the posterior probability ρk(y) in each formula is given by the above formula (10).
In this embodiment, in the formulas (9-1) and (9-2) giving the pure voice estimate, CW-MMSE uses as the weight the posterior probability ρ'k(y) weighted by the estimation mark αd, rather than the posterior probability ρk(y). Formula (13) below indicates the posterior probability ρ'k(y) used in this embodiment:
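The body of formula (13) was lost in extraction of this text. From the surrounding description (mixture posteriors computed with the α-weighted density N′ of formula (14)), a plausible reconstruction, with γk denoting the prior weight of mixture component k, is:

```latex
\rho'_k(y) \;=\; \frac{\gamma_k\, N'\!\left(y;\, \mu_k, \Sigma_k\right)}
                      {\sum_{k'} \gamma_{k'}\, N'\!\left(y;\, \mu_{k'}, \Sigma_{k'}\right)}
\tag{13}
```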
In this embodiment, the normal distribution in the above formula (13) can be represented by formula (14) below, which assumes a diagonal covariance. In formula (14), D represents the number of dimensions of the distribution:
The above formula (14) indicates that each dimension of the normal distribution N' (used to calculate the posterior probability ρ'k(y)) is raised to a power using the estimation mark αd as the weighting exponent. The so-called estimation mark is, in essence, a mark representing an evaluation of a frequency band. Usually, a frequency band is evaluated from the viewpoint of the signal degradation caused by ambient noise. In the technical solution of the present invention, the estimation mark is defined as follows:
It is known in advance that the frequency spectrum of the vowels contained in ordinary human speech typically has a harmonic structure; in an environment without ambient noise, the harmonic structure of a vowel is maintained across the whole frequency band of the spectrum of the collected voice. Conversely, under strong broadband noise, the harmonic structure of the vowel is lost in many frequency bands, and is only maintained in the bands where the speech power is concentrated, such as the resonance peak (formant) bands. Therefore, in the technical solution of the present invention, it is assumed that degradation caused by ambient noise rarely occurs in frequency bands with an obvious harmonic structure, and the conspicuousness of the harmonic structure is defined as the estimation mark of the frequency band.
The estimation mark in the technical solution of the present invention is generated using LPW (Local Peak Weight). The LPW approach removes, for example, the gross spectral variation including formant information from the spectral energy distribution of the collected voice, extracts only the regular crests and troughs corresponding to the harmonic structure, and regularizes their values. In the technical solution of the present invention, the LPW of each frame is generated by performing the following process:
First, the frequency spectrum of frame t of the collected voice is processed, and the cepstrum is obtained by applying a discrete cosine transform to its logarithmic spectrum. Then, among the terms of the obtained cepstrum, only the terms in the domain corresponding to the harmonic structure of vowels are retained, and the others are deleted. Thereafter, an inverse discrete cosine transform is applied to the processed cepstrum, converting it back to the log-spectral domain. Finally, regularization processing is performed on the converted spectrum so that its average becomes 1, thereby obtaining the LPW.
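A minimal sketch of this LPW procedure, assuming a magnitude spectrum as input; the quefrency bounds `lo`/`hi` delimiting the vowel harmonic-structure region are hypothetical tuning values:

```python
import numpy as np
from scipy.fft import dct, idct

def local_peak_weight(spectrum, lo=20, hi=120):
    """Cepstrally lifter the log spectrum so that only the quefrency range
    of vowel harmonic structure survives, then normalize to mean 1."""
    log_spec = np.log(spectrum + 1e-10)
    cep = dct(log_spec, norm='ortho')          # log spectrum -> cepstrum
    lifter = np.zeros_like(cep)
    lifter[lo:hi] = cep[lo:hi]                 # keep harmonic-structure terms only
    lpw = np.exp(idct(lifter, norm='ortho'))   # back out of the log domain
    return lpw / lpw.mean()                    # regularize so the average is 1

# a fake spectrum with harmonic-like ripple every 16 bins
bins = np.arange(257)
spec = 1.0 + 0.9 * np.cos(2 * np.pi * bins / 16.0) ** 2
w = local_peak_weight(spec)
```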
Next, the LPW is smoothed on the mel scale to obtain the corresponding mel LPW. In a preferred embodiment of the present invention, the LPW values can be smoothed by a bank of mel filters to obtain one value for each mel band. A mel filter bank is a filter bank in which band-pass filters are arranged at equal intervals on the mel scale; each mel band yields a corresponding mel LPW value. The magnitude of a mel LPW value corresponds to the conspicuousness of the harmonic structure in the corresponding high-resolution spectral band, and each mel band corresponds to one mel LPW value. In the technical solution of the present invention, the mel LPW value can be used as the estimation mark of the corresponding frequency band.
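The mel smoothing can be sketched with a standard triangular mel filter bank (a common construction, not necessarily the patent's exact filters); the band count and sampling rate below are illustrative:

```python
import numpy as np

def mel_filterbank(n_mels, n_fft_bins, sr):
    """Triangular band-pass filters arranged at equal intervals on the mel
    scale, returned as an (n_mels, n_fft_bins) matrix."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = inv_mel(np.linspace(0.0, mel(sr / 2.0), n_mels + 2))
    bin_freqs = np.linspace(0.0, sr / 2.0, n_fft_bins)
    fb = np.zeros((n_mels, n_fft_bins))
    for m in range(n_mels):
        left, center, right = edges[m], edges[m + 1], edges[m + 2]
        up = (bin_freqs - left) / (center - left)
        down = (right - bin_freqs) / (right - center)
        fb[m] = np.clip(np.minimum(up, down), 0.0, None)
    return fb

def mel_lpw(lpw, fb):
    """Smooth the LPW with the filter bank: one value per mel band."""
    return (fb @ lpw) / (fb.sum(axis=1) + 1e-10)

fb = mel_filterbank(24, 257, 16000)
mlpw = mel_lpw(np.ones(257), fb)   # a flat LPW smooths to (almost) all ones
```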
Specifically, the estimation mark αd in the above formula (14) can be generated by the following process:
First, the dynamic range of the mel LPW is compressed by using a suitable scaling function, such as a sigmoid function. In formula (15) below, the mel LPW value wd of each frequency band is converted into α'd. Formula (15) indicates the conversion of the mel LPW value wd into α'd by means of the sigmoid function:
α'd = 1.0 / (1.0 + exp(-a·(wd - 1.0)))    (15)
In the above formula (15), a is a tuning parameter, and an appropriate value can be set for it.
Then, regularization processing is applied to the compressed values α'd so that their average becomes 1. Formula (16) below indicates the method of regularizing α'd to obtain the estimation mark αd:
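The body of formula (16) did not survive extraction; per the text, it regularizes α'd so that the marks average 1. A sketch implementing formula (15) followed by that mean normalization (the value of the tuning parameter a is illustrative):

```python
import numpy as np

def estimation_mark_alpha(mel_lpw_values, a=2.0):
    """Compress the dynamic range of the mel LPW with the sigmoid of
    formula (15), then regularize so that the marks average 1 (the role
    described for the lost formula (16))."""
    compressed = 1.0 / (1.0 + np.exp(-a * (mel_lpw_values - 1.0)))  # (15)
    return compressed / compressed.mean()                           # (16)

alpha = estimation_mark_alpha(np.array([0.5, 1.0, 1.5, 2.0]))
```

Bands with conspicuous harmonic structure (mel LPW above 1) receive marks above 1, and degraded bands receive marks below 1, matching the behavior described below.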
When the harmonic structure of a vowel is present in an obvious spectral band in frame t of a voiced portion, the estimation mark αd of the corresponding band d becomes greater than 1. Then, for band d, the normal distribution N' in the above formula (14) becomes larger, and the posterior probability ρ'k(y) for band d becomes larger. Therefore, the contribution of the mel band corresponding to the spectral band in which the harmonic structure of the vowel is conspicuous becomes larger.
Conversely, when the harmonic structure of the vowel is lost in a spectral band in frame t of a voiced portion, the estimation mark αd of the corresponding band d becomes less than 1. Then, for band d, the normal distribution N' in the above formula (14) becomes smaller, and the posterior probability ρ'k(y) for band d becomes smaller. Therefore, the contribution of the mel band corresponding to the spectral band in which the harmonic structure of the vowel is lost becomes smaller.
A second embodiment of the technical solution of the present invention is given below:
If the collected voice is equivalent to pure voice (i.e., the voice of the speaker is collected in an environment with almost no ambient noise, or the speaker is very close to the voice collecting device, such as a microphone), then no processing is needed at all, and directly outputting the collected voice is the best choice. However, if speech processing is performed according to the method for real-time voice-separating speech denoising of the technical solution of the present invention, then even in such cases the pure voice is still estimated from the collected voice, and the output voice estimate may therefore be worse than the pure voice itself.
Therefore, this embodiment proposes a method that realizes linear interpolation between the pure voice estimate and the collected voice, in which the estimation mark participates in the calculation as a weight.
In this embodiment, the output value in band d is obtained by the linear interpolation function of formula (17) below:
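The body of formula (17) was lost in extraction of this text. The description around it (the output approaches yd as βd approaches 1, and approaches the pure voice estimate as βd approaches 0) fixes it, writing x̂d for the pure voice estimate and x̄d for the output value in band d, as:

```latex
\bar{x}_d \;=\; \beta_d\, y_d \;+\; \left(1 - \beta_d\right) \hat{x}_d
\tag{17}
```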
In the above formula (17), the pure voice estimate in band d appears together with βd, the confidence index for band d, and yd, the value of the collected voice in band d; the left-hand side is the output value in band d. In the above formula (17), the linear interpolation function is weighted by the estimation mark βd, which takes values from 0 to 1. It can be seen from the linear interpolation function that as βd approaches 1, the output value approaches the value yd of the collected voice; correspondingly, as βd approaches 0, the output value approaches the pure voice estimate.
In the technical solution of the present invention, the above estimation mark is generated by regularization processing of the mel LPW values. The estimation mark βd in the above formula (17) can be generated by the following process:
First, the mel LPW value for frame t is obtained; the mel LPW value wd is then regularized by using an appropriate scaling function, for example the sigmoid function, so that wd is mapped to a value from 0 to 1, where 1 is the maximum. Formula (18) below indicates how the mel LPW value wd is regularized using the sigmoid function to obtain the estimation mark βd:
βd = 1.0 / (1.0 + exp(-a·(wd - 1.0 - b)))    (18)
In the above formula (18), a and b are tuning parameters, and appropriate values can be preset for them according to actual conditions.
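A sketch of formula (18) together with the per-band linear interpolation described for formula (17); the tuning values a and b are illustrative only:

```python
import numpy as np

def estimation_mark_beta(mel_lpw_values, a=2.0, b=0.2):
    """Formula (18): squash the mel LPW values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a * (mel_lpw_values - 1.0 - b)))

def interpolate_output(y, x_hat, beta):
    """The interpolation described for formula (17): beta near 1 keeps the
    observed value y, beta near 0 keeps the pure voice estimate x_hat."""
    return beta * y + (1.0 - beta) * x_hat

out = interpolate_output(np.array([1.0]), np.array([0.0]), np.array([0.75]))
b_hi = estimation_mark_beta(np.array([2.0]))[0]   # conspicuous harmonic band
b_lo = estimation_mark_beta(np.array([0.2]))[0]   # degraded band
```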
When the harmonic structure of a vowel is present in an obvious spectral band in frame t of a voiced portion, the estimation mark βd of the corresponding band d is close to 1. The output value in band d, being the result of the linear interpolation indicated in the above formula (17), is then closer to the value yd of the collected voice than to the pure voice estimate.
Conversely, when the harmonic structure of the vowel is lost in a spectral band in frame t of a voiced portion, the estimation mark βd of the corresponding band d is close to 0. The output value in band d, being the result of the linear interpolation indicated in formula (17), is then closer to the pure voice estimate than to the observed value yd.
In a preferred embodiment of the present invention, the above first embodiment and second embodiment can be applied in combination, for example by the following process:
First, the frequency spectrum Y of a frame of the collected voice is obtained, the harmonic structure of the spectrum Y is extracted to generate the LPW, and the mel LPW is generated from the LPW. The mel LPW is then regularized by an appropriate method to generate the estimation mark α for each frequency band, the average of the estimation marks α being 1. At the same time, the mel LPW is regularized to generate the estimation mark β for each frequency band, the estimation marks β taking values from 0 to 1. The generated estimation marks α and β are output respectively.
Thereafter, the spectrum Y of the frame is converted into the mel logarithmic spectrum y and output. The pure voice is estimated by using the output mel logarithmic spectrum y and the above estimation mark α; specifically, the posterior probabilities of the MMSE estimation are weighted with the estimation mark α as the weight, and the pure voice estimate is output.
Then, for each frequency band, linear interpolation is performed between the vector of the mel logarithmic spectrum y and the above pure voice estimate (a vector in the mel log-spectral domain), with the above estimation mark β as the weight in the calculation, and the output value is finally obtained.
Finally, the specific feature quantities are extracted from the obtained output value, and the extracted feature quantities are sent to the back end. The above steps are repeated for each frame of the collected voice, and the processing ends when the last frame is reached.
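The combined per-frame procedure can be sketched compactly as below, assuming a diagonal-covariance Gaussian mixture in the mel log-spectral domain and sigmoid scaling functions; every function name and parameter value is illustrative, not the patented implementation:

```python
import numpy as np

def combined_frame_process(mel_y, mel_lpw, gmm_means, gmm_vars, gmm_priors,
                           a=2.0, b=0.2):
    """One frame of the combined embodiment: derive alpha (average 1) and
    beta (0..1) from the mel LPW, estimate the pure voice with the
    alpha-weighted posteriors, then beta-interpolate toward the observation."""
    sig = lambda x: 1.0 / (1.0 + np.exp(-x))
    alpha = sig(a * (mel_lpw - 1.0))
    alpha /= alpha.mean()                        # estimation mark alpha
    beta = sig(a * (mel_lpw - 1.0 - b))          # estimation mark beta
    log_n = -0.5 * (np.log(2 * np.pi * gmm_vars)
                    + (mel_y - gmm_means) ** 2 / gmm_vars)
    log_post = np.log(gmm_priors) + (alpha * log_n).sum(axis=1)
    post = np.exp(log_post - log_post.max())
    post /= post.sum()                           # alpha-weighted posteriors
    x_hat = post @ gmm_means                     # pure voice estimate
    return beta * mel_y + (1.0 - beta) * x_hat   # per-band interpolation

# toy two-band, two-component model; observation sits on component 0
means = np.array([[0.0, 0.0], [4.0, 4.0]])
out = combined_frame_process(np.array([0.0, 0.0]), np.ones(2), means,
                             np.ones((2, 2)), np.array([0.5, 0.5]))
```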
In a preferred embodiment of the present invention, based on the above method for real-time voice-separating speech denoising, a system for real-time voice-separating speech denoising is now provided, applicable to an intelligent terminal. Its structure, as shown in Fig. 5, specifically includes:
a collecting unit 1, for collecting an externally input sound source;
a memory cell 9, connected to the collecting unit 1, for storing the sound source;
a cutting unit 8, connected to the memory cell 9, for dividing the sound source into voices of a plurality of preset time periods according to the time order of reception;
a separative element 7, connected to the cutting unit 8 and provided with a preset characteristic model, for extracting one voice in time order, matching the voice against the characteristic model to obtain the noise matching the characteristic model and the voice carrying speech that does not match the initial characteristic model, and generating the matching identifier of the voice, the matching identifier indicating that matching of the voice against the characteristic model is complete;
a model modification unit 10, connected to the separative element 7, for adding the noise as a noise sample of the characteristic model, updating the characteristic model according to the noise sample to form a new characteristic model, and sending the new characteristic model to the separative element 7;
a judging unit 2, connected to the separative element 7, in which an intensity threshold is preset, for judging whether the sound intensity of the voice carrying speech is higher than the intensity threshold and outputting a corresponding judgement result;
a first processing unit 3, connected to the judging unit 2, for confirming, according to the judgement result, the voice as a voice to be judged when the sound intensity of the voice is higher than the intensity threshold, and generating, according to the frequency spectrum of the voice to be judged, the estimation mark of each frequency band of the corresponding voice to be judged, the estimation mark representing the conspicuousness of the voice on the harmonic structure;
a model generation unit 6, connected to the first processing unit 3 and the separative element 7 respectively, for generating, with the noise as a reference sample, the probabilistic model of the pure voice corresponding to the voice to be judged according to the reference sample and the voice to be judged;
a second processing unit 5, connected to the model generation unit 6 and the separative element 7 respectively, for obtaining, with each estimation mark as the weight index of the corresponding frequency band of the voice to be judged, the pure voice estimate associated with the voice by processing according to the probabilistic model.
In the present embodiment, the cutting unit 8 divides the externally input sound source into voices of a plurality of preset time periods, and the separative element 7 matches each voice against the characteristic model, separating the noise from the voice carrying speech. The model modification unit 10 updates the characteristic model in real time according to the identified noise, so as to identify constantly changing external noise in real time. At the same time, the model generation unit 6 uses the noise as a reference sample to generate the probabilistic model of the pure voice of the voice to be judged, and the second processing unit 5 processes the voice carrying speech to obtain the pure voice estimate. This improves the removal of ambient noise, better excludes the interference of strong ambient noise during speech recognition, and improves the accuracy of speech recognition.
In a preferred embodiment of the present invention, in the above system for real-time voice-separating speech denoising, the estimation mark can include a first estimation mark; or
the estimation mark can include a first estimation mark and a second estimation mark.
In a preferred embodiment of the present invention, in the above system for real-time voice-separating speech denoising, as shown in Fig. 6, the above first processing unit 3 specifically includes:
an extraction module 31, for extracting, according to the frequency spectrum of the voice to be judged, the harmonic structure corresponding to the voice to be judged;
a first processing module 32, connected to the extraction module 31, for performing regularization processing on the monitoring values associated with the log-spectral domain of the harmonic structure, and performing smoothing processing on the regularized monitoring values according to the mel scale;
a second processing module 33, connected to the first processing module 32, for performing further regularization processing on the smoothed monitoring values, so that the average of the monitoring values is 1;
a first generation module 34, connected to the second processing module 33, for generating, according to the monitoring values, the first estimation mark of each frequency band of the corresponding voice to be judged.
In a preferred embodiment of the present invention, in the above system for real-time voice-separating speech denoising, as shown in Fig. 7, the above second processing unit 5 specifically includes:
a third processing module 51, for processing to obtain the posterior probabilities of the minimum mean square error estimation associated with the voice to be judged;
a fourth processing module 52, connected to the third processing module 51, for weighting, with each first estimation mark as the weight index of the corresponding frequency band of the voice to be judged, the posterior probabilities associated with the voice to be judged according to the probabilistic model, to obtain the pure voice estimate.
In a preferred embodiment of the present invention, in the above system for real-time voice-separating speech denoising, still as shown in Fig. 6, the first processing unit 3 also includes:
a fifth processing module 35, connected to the first processing module 32, for performing corresponding regularization processing on the smoothed monitoring values so that they take values from 0 to 1;
a second generation module 36, connected to the fifth processing module 35, for generating, according to the monitoring values, the second estimation mark of each frequency band of the corresponding voice to be judged.
In a preferred embodiment of the present invention, in the above system for real-time voice-separating speech denoising, still as shown in Fig. 5, the system also includes:
a third processing unit 4, connected to the second processing unit 5, for performing, for each frequency band of the voice to be judged and with each corresponding second estimation mark as a weight, linear interpolation between the monitoring value and the pure voice estimate to obtain the corresponding output value.
In a preferred embodiment of the present invention, an intelligent terminal is also provided, in which the above method for real-time voice-separating speech denoising is applied.
In a preferred embodiment of the present invention, an intelligent terminal is also provided, including the above system for real-time voice-separating speech denoising.
The above are only preferred embodiments of the present invention and do not thereby limit the embodiments or the protection scope of the present invention. Those skilled in the art should appreciate that all schemes obtained by equivalent substitution and obvious variation on the basis of the description and drawings of the present invention shall fall within the protection scope of the present invention.
Claims (12)
1. A method for real-time voice-separating speech denoising, applicable to an intelligent terminal, characterized in that a preset characteristic model is provided, and the method comprises the steps of:
Step S1: collecting an externally input sound source and storing it;
Step S2: dividing the sound source into voices of a plurality of preset time periods according to the time order of reception;
Step S3: extracting one said voice in said time order, matching the voice against the characteristic model to obtain the noise matching the characteristic model and the voice carrying speech that does not match the initial characteristic model, and generating the matching identifier of the voice, the matching identifier indicating that matching of the voice against the characteristic model is complete;
Step S4: adding the noise as a noise sample of the characteristic model, and updating the characteristic model according to the noise sample to form a new said characteristic model;
Step S5: judging whether the sound intensity of the voice carrying speech is higher than a preset intensity threshold, and, when the sound intensity is higher than the intensity threshold, confirming the voice as a voice to be judged and turning to Step S6;
Step S6: generating, according to the frequency spectrum of the voice to be judged, the estimation mark of each frequency band of the corresponding voice to be judged, the estimation mark representing the conspicuousness of the voice on the harmonic structure;
Step S7: with the noise as a reference sample, generating the probabilistic model of the pure voice corresponding to the voice to be judged according to the reference sample and the voice to be judged;
Step S8: with each said estimation mark as the weight index of the corresponding frequency band of the voice to be judged, obtaining the pure voice estimate associated with the voice by processing according to the probabilistic model;
Step S9: extracting a voice not yet identified in said time order, matching the voice against the characteristic model to obtain the noise matching the new characteristic model and the voice carrying speech that does not match the initial characteristic model, generating the matching identifier of the voice, and returning to Step S4.
2. The method for real-time voice-separating speech denoising as claimed in claim 1, characterized in that the estimation mark generated in Step S6 includes a first estimation mark; or
the estimation mark generated in Step S6 includes a first estimation mark and a second estimation mark.
3. The method for real-time voice-separating speech denoising as claimed in claim 2, characterized in that, in Step S6, the step of generating the first estimation mark specifically includes:
Step S61a: according to the frequency spectrum of the voice to be judged, extracting the harmonic structure corresponding to the voice to be judged;
Step S62a: performing regularization processing on the monitoring values associated with the log-spectral domain of the harmonic structure, and performing smoothing processing on the regularized monitoring values according to the mel scale;
Step S63a: performing further regularization processing on the smoothed monitoring values, so that the average of the monitoring values is 1;
Step S64a: generating, according to the monitoring values, the first estimation mark of each frequency band of the corresponding voice to be judged.
4. The method for real-time voice-separating speech denoising as claimed in claim 3, characterized in that, in Step S8, the method of obtaining the pure voice estimate by processing according to the first estimation mark specifically includes:
Step S81a: processing to obtain the posterior probabilities of the minimum mean square error estimation associated with the voice to be judged;
Step S82a: with each first estimation mark as the weight index of the corresponding frequency band of the voice to be judged, weighting the posterior probabilities associated with the voice to be judged according to the probabilistic model, to obtain the pure voice estimate.
5. The method for real-time voice-separating speech denoising as claimed in claim 3, characterized in that, in Step S6, the step of generating the second estimation mark specifically includes:
Step S61b: according to the frequency spectrum of the voice to be judged, extracting the harmonic structure corresponding to the voice to be judged;
Step S62b: performing regularization processing on the monitoring values associated with the log-spectral domain of the harmonic structure, and performing smoothing processing on the regularized monitoring values according to the mel scale;
Step S63b: performing corresponding regularization processing on the smoothed monitoring values so that they take values from 0 to 1;
Step S64b: generating, according to the monitoring values, the second estimation mark of each frequency band of the corresponding voice to be judged.
6. The method for real-time voice-separating speech denoising as claimed in claim 5, characterized in that, after Step S8 is performed, the following step is further performed according to the second estimation mark:
for each frequency band of the voice to be judged, with each corresponding second estimation mark as a weight, performing linear interpolation between the monitoring value and the pure voice estimate to obtain the corresponding output value.
7. A system for real-time voice-separating speech denoising, applicable to an intelligent terminal, characterized by including:
a collecting unit, for collecting an externally input sound source;
a memory cell, connected to the collecting unit, for storing the sound source;
a cutting unit, connected to the memory cell, for dividing the sound source into voices of a plurality of preset time periods according to the time order of reception;
a separative element, connected to the cutting unit and provided with a preset characteristic model, for extracting one said voice in said time order, matching the voice against the characteristic model to obtain the noise matching the characteristic model and the voice carrying speech that does not match the initial characteristic model, and generating the matching identifier of the voice, the matching identifier indicating that matching of the voice against the characteristic model is complete;
a model modification unit, connected to the separative element, for adding the noise as a noise sample of the characteristic model, updating the characteristic model according to the noise sample to form a new said characteristic model, and sending the new characteristic model to the separative element;
a judging unit, connected to the separative element, in which an intensity threshold is preset, for judging whether the sound intensity of the voice carrying speech is higher than the intensity threshold and outputting a corresponding judgement result;
a first processing unit, connected to the judging unit, for confirming, according to the judgement result, the voice as a voice to be judged when the sound intensity of the voice is higher than the intensity threshold, and generating, according to the frequency spectrum of the voice to be judged, the estimation mark of each frequency band of the corresponding voice to be judged, the estimation mark representing the conspicuousness of the voice on the harmonic structure;
a model generation unit, connected to the first processing unit and the separative element respectively, for generating, with the noise as a reference sample, the probabilistic model of the pure voice corresponding to the voice to be judged according to the reference sample and the voice to be judged;
a second processing unit, connected to the model generation unit and the separative element, for obtaining, with each said estimation mark as the weight index of the corresponding frequency band of the voice to be judged, the pure voice estimate associated with the voice by processing according to the probabilistic model.
8. The system for voice noise reduction for separating voice in real time as claimed in claim 7, characterized in that the estimation mark comprises a first estimation mark; or
the estimation mark comprises a first estimation mark and a second estimation mark.
9. The system for voice noise reduction for separating voice in real time as claimed in claim 8, characterized in that the first processing unit specifically comprises:
an extraction module, for extracting, according to the spectrum of the voice to be judged, the harmonic structure corresponding to the voice to be judged;
a first processing module, connected to the extraction module, for performing regularization on the monitoring values in the spectral domain associated with the harmonic structure, and performing smoothing on the regularized monitoring values according to the mel scale;
a second processing module, connected to the first processing module, for performing further regularization on the smoothed monitoring values so that the mean of the monitoring values is 1;
a first generating module, connected to the second processing module, for generating, according to the monitoring values, the first estimation mark corresponding to each frequency band of the voice to be judged.
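The three-stage processing chain in claim 9 (regularize the spectral monitoring values, smooth them on the mel scale, then rescale so their mean is 1) can be sketched as below. The claim does not specify either the regularization formula or the smoothing kernel, so the max-normalization and the `smooth_bands`-wide moving average in mel order are assumptions made for illustration.

```python
import numpy as np

def hz_to_mel(f):
    # Standard mel-scale mapping (one common convention)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def first_estimation_marks(monitor, freqs, smooth_bands=5):
    """Sketch of the claim-9 pipeline: regularize the spectral monitoring
    values, smooth them along the mel scale, then rescale so the mean is 1.
    `smooth_bands` (moving-average width) is an assumed parameter."""
    v = np.asarray(monitor, dtype=float)
    v = v / (np.max(np.abs(v)) + 1e-12)              # first regularization
    order = np.argsort(hz_to_mel(np.asarray(freqs, dtype=float)))
    kernel = np.ones(smooth_bands) / smooth_bands
    smoothed = np.empty_like(v)
    smoothed[order] = np.convolve(v[order], kernel, mode="same")
    return smoothed / (np.mean(smoothed) + 1e-12)    # force mean to 1
```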
10. The system for voice noise reduction for separating voice in real time as claimed in claim 9, characterized in that the second processing unit specifically comprises:
a third processing module, for obtaining by processing the posterior probability of the minimum mean-square error estimate associated with the voice to be judged;
a fourth processing module, connected to the third processing module, for using each first estimation mark as a weighting index of the corresponding frequency band of the voice to be judged, and weighting, according to the probabilistic model, the posterior probability associated with the voice to be judged to obtain the pure voice estimate.
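The weighting step of claim 10 can be sketched per frequency band as follows. The claim calls the first estimation mark a "weighting index", which is read here as an exponent applied to the posterior probability; that reading, and the element-wise combination with the MMSE estimate, are assumptions, since the claim gives no explicit formula.

```python
import numpy as np

def weighted_pure_voice_estimate(posterior, mmse_est, marks):
    """Sketch of claim 10: the first estimation mark acts as a weighting
    exponent on the per-band posterior probability, and the re-weighted
    posterior scales the per-band MMSE estimate of the pure voice.
    All inputs hold one value per frequency band."""
    # Mark = 1 keeps the posterior as-is; mark = 0 neutralizes it (weight 1).
    w = np.asarray(posterior, dtype=float) ** np.asarray(marks, dtype=float)
    return w * np.asarray(mmse_est, dtype=float)
```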
11. The system for voice noise reduction for separating voice in real time as claimed in claim 9, characterized in that the first processing unit further comprises:
a fifth processing module, connected to the first processing module, for performing corresponding regularization on the smoothed monitoring values so that they range from 0 to 1;
a second generating module, connected to the fifth processing module, for generating, according to the monitoring values, the second estimation mark corresponding to each frequency band of the voice to be judged.
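The 0-to-1 regularization in claim 11 can be sketched as below. Min-max rescaling is an assumption; the claim says only "corresponding regularization processing from 0 to 1" without naming the mapping.

```python
import numpy as np

def second_estimation_marks(smoothed_monitor):
    """Sketch of claim 11: rescale the smoothed monitoring values into the
    range [0, 1] (min-max regularization assumed)."""
    v = np.asarray(smoothed_monitor, dtype=float)
    lo, hi = v.min(), v.max()
    # Degenerate case: a flat band profile maps to all zeros.
    return (v - lo) / (hi - lo) if hi > lo else np.zeros_like(v)
```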
12. The system for voice noise reduction for separating voice in real time as claimed in claim 11, characterized by further comprising:
a third processing unit, connected to the second processing unit, for using, for each frequency band of the voice to be judged, each corresponding second estimation mark as a weight and performing linear interpolation between the monitoring value and the pure voice estimate to obtain the corresponding output value.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610024317.8A CN106971741B (en) | 2016-01-14 | 2016-01-14 | Method and system for voice noise reduction for separating voice in real time |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106971741A true CN106971741A (en) | 2017-07-21 |
CN106971741B CN106971741B (en) | 2020-12-01 |
Family
ID=59335200
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610024317.8A Active CN106971741B (en) | 2016-01-14 | 2016-01-14 | Method and system for voice noise reduction for separating voice in real time |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106971741B (en) |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1313983A (en) * | 1999-06-15 | 2001-09-19 | 松下电器产业株式会社 | Noise signal encoder and voice signal encoder |
US20030093269A1 (en) * | 2001-11-15 | 2003-05-15 | Hagai Attias | Method and apparatus for denoising and deverberation using variational inference and strong speech models |
CN101031956A (en) * | 2004-07-22 | 2007-09-05 | 索福特迈克斯有限公司 | Headset for separation of speech signals in a noisy environment |
US7343284B1 (en) * | 2003-07-17 | 2008-03-11 | Nortel Networks Limited | Method and system for speech processing for enhancement and detection |
CN101790752A (en) * | 2007-09-28 | 2010-07-28 | 高通股份有限公司 | Multiple microphone voice activity detector |
CN101814290A (en) * | 2009-02-25 | 2010-08-25 | 三星电子株式会社 | Method for enhancing robustness of voice recognition system |
CN102664006A (en) * | 2012-04-14 | 2012-09-12 | 中国人民解放军国防科学技术大学 | Abnormal voice detecting method based on time-domain and frequency-domain analysis |
CN103165129A (en) * | 2011-12-13 | 2013-06-19 | 北京百度网讯科技有限公司 | Method and system for optimizing voice recognition acoustic model |
CN103177721A (en) * | 2011-12-26 | 2013-06-26 | 中国电信股份有限公司 | Voice recognition method and system |
CN103310798A (en) * | 2012-03-07 | 2013-09-18 | 国际商业机器公司 | System and method for noise reduction |
CN103477386A (en) * | 2011-02-14 | 2013-12-25 | 弗兰霍菲尔运输应用研究公司 | Noise generation in audio codecs |
US20150243284A1 (en) * | 2014-02-27 | 2015-08-27 | Qualcomm Incorporated | Systems and methods for speaker dictionary based speech modeling |
Non-Patent Citations (2)
Title |
---|
YI ZHANG: "Modulation domain blind speech separation in noisy environment", 《SCIENCEDIRECT》 *
LI RANJUN: "Monaural Speech Separation Based on Computational Auditory Scene Analysis", 《China Masters' Theses Full-text Database, Information Science and Technology》 *
Cited By (39)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107613236A (en) * | 2017-09-28 | 2018-01-19 | 努比亚技术有限公司 | A kind of video record method and terminal, storage medium |
CN107613236B (en) * | 2017-09-28 | 2021-01-05 | 盐城市聚龙湖商务集聚区发展有限公司 | Audio and video recording method, terminal and storage medium |
CN108053835A (en) * | 2017-11-13 | 2018-05-18 | 河海大学 | A kind of noise estimation method based on passage Taylor series |
CN111465982A (en) * | 2017-12-12 | 2020-07-28 | 索尼公司 | Signal processing device and method, training device and method, and program |
US11894008B2 (en) | 2017-12-12 | 2024-02-06 | Sony Corporation | Signal processing apparatus, training apparatus, and method |
CN108133700A (en) * | 2017-12-20 | 2018-06-08 | 南京航空航天大学 | A kind of acoustics black hole vibration and noise reducing device |
CN108133700B (en) * | 2017-12-20 | 2020-09-25 | 南京航空航天大学 | Acoustic black hole vibration and noise reduction device |
CN108877823A (en) * | 2018-07-27 | 2018-11-23 | 三星电子(中国)研发中心 | Sound enhancement method and device |
CN108877823B (en) * | 2018-07-27 | 2020-12-18 | 三星电子(中国)研发中心 | Speech enhancement method and device |
CN109243429B (en) * | 2018-11-21 | 2021-12-10 | 苏州奇梦者网络科技有限公司 | Voice modeling method and device |
CN109243429A (en) * | 2018-11-21 | 2019-01-18 | 苏州奇梦者网络科技有限公司 | A kind of pronunciation modeling method and device |
CN111415653B (en) * | 2018-12-18 | 2023-08-01 | 百度在线网络技术(北京)有限公司 | Method and device for recognizing speech |
CN111415653A (en) * | 2018-12-18 | 2020-07-14 | 百度在线网络技术(北京)有限公司 | Method and apparatus for recognizing speech |
CN109801644A (en) * | 2018-12-20 | 2019-05-24 | 北京达佳互联信息技术有限公司 | Separation method, device, electronic equipment and the readable medium of mixed sound signal |
US11430427B2 (en) | 2018-12-20 | 2022-08-30 | Beijing Dajia Internet Information Technology Co., Ltd. | Method and electronic device for separating mixed sound signal |
CN109785845A (en) * | 2019-01-28 | 2019-05-21 | 百度在线网络技术(北京)有限公司 | Method of speech processing, device and equipment |
CN109785845B (en) * | 2019-01-28 | 2021-08-03 | 百度在线网络技术(北京)有限公司 | Voice processing method, device and equipment |
US11200899B2 (en) | 2019-01-28 | 2021-12-14 | Baidu Online Network Technology (Beijing) Co., Ltd. | Voice processing method, apparatus and device |
CN111292758A (en) * | 2019-03-12 | 2020-06-16 | 展讯通信(上海)有限公司 | Voice activity detection method and device and readable storage medium |
CN111292758B (en) * | 2019-03-12 | 2022-10-25 | 展讯通信(上海)有限公司 | Voice activity detection method and device and readable storage medium |
TWI745968B (en) * | 2019-05-20 | 2021-11-11 | 仁寶電腦工業股份有限公司 | Noise reduction method and noise reduction device and noise reduction system using the same |
WO2020238681A1 (en) * | 2019-05-31 | 2020-12-03 | 京东数字科技控股有限公司 | Audio processing method and device, and man-machine interactive system |
CN110677716A (en) * | 2019-08-20 | 2020-01-10 | 咪咕音乐有限公司 | Audio processing method, electronic device, and storage medium |
CN110648680B (en) * | 2019-09-23 | 2024-05-14 | 腾讯科技(深圳)有限公司 | Voice data processing method and device, electronic equipment and readable storage medium |
CN110648680A (en) * | 2019-09-23 | 2020-01-03 | 腾讯科技(深圳)有限公司 | Voice data processing method and device, electronic equipment and readable storage medium |
CN111243619A (en) * | 2020-01-06 | 2020-06-05 | 平安科技(深圳)有限公司 | Training method and device for voice signal segmentation model and computer equipment |
CN111243619B (en) * | 2020-01-06 | 2023-09-22 | 平安科技(深圳)有限公司 | Training method and device for speech signal segmentation model and computer equipment |
CN111312221B (en) * | 2020-01-20 | 2022-07-22 | 宁波舜韵电子有限公司 | Intelligent range hood based on voice control |
CN111312221A (en) * | 2020-01-20 | 2020-06-19 | 宁波舜韵电子有限公司 | Intelligent range hood based on voice control |
CN113347519B (en) * | 2020-02-18 | 2022-06-17 | 宏碁股份有限公司 | Method for eliminating specific object voice and ear-wearing type sound signal device using same |
CN113347519A (en) * | 2020-02-18 | 2021-09-03 | 宏碁股份有限公司 | Method for eliminating specific object voice and ear-wearing type sound signal device using same |
CN111640422A (en) * | 2020-05-13 | 2020-09-08 | 广州国音智能科技有限公司 | Voice and human voice separation method and device, terminal and storage medium |
CN111883159A (en) * | 2020-08-05 | 2020-11-03 | 龙马智芯(珠海横琴)科技有限公司 | Voice processing method and device |
CN112000047A (en) * | 2020-09-07 | 2020-11-27 | 广东众科智能科技股份有限公司 | Remote intelligent monitoring system |
CN112951259A (en) * | 2021-03-01 | 2021-06-11 | 杭州网易云音乐科技有限公司 | Audio noise reduction method and device, electronic equipment and computer readable storage medium |
CN115527547B (en) * | 2022-04-29 | 2023-06-16 | 荣耀终端有限公司 | Noise processing method and electronic equipment |
CN115527547A (en) * | 2022-04-29 | 2022-12-27 | 荣耀终端有限公司 | Noise processing method and electronic equipment |
CN115394310B (en) * | 2022-08-19 | 2023-04-07 | 中邮消费金融有限公司 | Neural network-based background voice removing method and system |
CN115394310A (en) * | 2022-08-19 | 2022-11-25 | 中邮消费金融有限公司 | Neural network-based background voice removing method and system |
Also Published As
Publication number | Publication date |
---|---|
CN106971741B (en) | 2020-12-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106971741A (en) | Method and system for voice noise reduction for separating voice in real time | |
CN108597496B (en) | Voice generation method and device based on generation type countermeasure network | |
McLaren et al. | Advances in deep neural network approaches to speaker recognition | |
US9355642B2 (en) | Speaker recognition method through emotional model synthesis based on neighbors preserving principle | |
Xiao et al. | Normalization of the speech modulation spectra for robust speech recognition | |
CN106952643A (en) | A kind of sound pick-up outfit clustering method based on Gaussian mean super vector and spectral clustering | |
CN102968990B (en) | Speaker identifying method and system | |
CN104882144A (en) | Animal voice identification method based on double sound spectrogram characteristics | |
CN104008751A (en) | Speaker recognition method based on BP neural network | |
CN112017682B (en) | Single-channel voice simultaneous noise reduction and reverberation removal system | |
CN109192200B (en) | Speech recognition method | |
CN105206270A (en) | Isolated digit speech recognition classification system and method combining principal component analysis (PCA) with restricted Boltzmann machine (RBM) | |
CN109036460A (en) | Method of speech processing and device based on multi-model neural network | |
CN105225672A (en) | Merge the system and method for the directed noise suppression of dual microphone of fundamental frequency information | |
CN110428853A (en) | Voice activity detection method, Voice activity detection device and electronic equipment | |
CN110189746A (en) | A kind of method for recognizing speech applied to earth-space communication | |
CN113763965A (en) | Speaker identification method with multiple attention characteristics fused | |
CN112017658A (en) | Operation control system based on intelligent human-computer interaction | |
CN114387997B (en) | Voice emotion recognition method based on deep learning | |
Cui et al. | A study of variable-parameter Gaussian mixture hidden Markov modeling for noisy speech recognition | |
CN104157294B (en) | A kind of Robust speech recognition method of market for farm products element information collection | |
CN111899750A (en) | Speech enhancement algorithm combining cochlear speech features and hopping deep neural network | |
Nakatani et al. | Logmax observation model with MFCC-based spectral prior for reduction of highly nonstationary ambient noise | |
CN106971733A (en) | The method and system and intelligent terminal of Application on Voiceprint Recognition based on voice de-noising | |
CN106971707A (en) | The method and system and intelligent terminal of voice de-noising based on output offset noise |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||