CN101533642A - Method for processing voice signal and device - Google Patents

Method for processing voice signal and device

Info

Publication number
CN101533642A
CN101533642A CN200910078331A
Authority
CN
China
Prior art keywords
cepstrum
channel
voice signal
logarithm
logarithm cepstrum
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN200910078331A
Other languages
Chinese (zh)
Other versions
CN101533642B (en)
Inventor
张晨
冯宇红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mid Star Technology Ltd By Share Ltd
Original Assignee
Vimicro Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Vimicro Corp filed Critical Vimicro Corp
Priority to CN2009100783316A priority Critical patent/CN101533642B/en
Publication of CN101533642A publication Critical patent/CN101533642A/en
Application granted granted Critical
Publication of CN101533642B publication Critical patent/CN101533642B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention provides a voice signal processing method and device for solving the problem of channel interference in voice signals. The method comprises the following steps: in the log-cepstrum domain, extracting cepstral coefficients of the currently observed voice signal to obtain the log-cepstrum of the observed voice; estimating the log-cepstrum of the transmission channel from the statistical mean of the log-cepstrum of voice signals that have not passed through the channel; and subtracting the estimated transmission-channel log-cepstrum from the log-cepstrum of the observed voice to obtain the log-cepstrum of the current voice signal that has not passed through the channel, which is the result of separating the voice signal from the channel interference. The invention can eliminate channel interference in voice signals and enhance resistance to transmission-channel interference during speech recognition feature extraction, thereby improving the recognition rate.

Description

Method for processing voice signal and device
Technical field
The present invention relates to the technical field of speech recognition, and in particular to a voice signal processing method and device.
Background technology
At present, after more than half a century of research worldwide, speech recognition technology has gradually entered the practical stage. In recent years speech chips have been used more and more widely, mainly in: voice dialing in telephone communication, voice identity authentication, voice input, voice control in automobiles, industrial control and the medical field, voice interaction interfaces of personal digital assistants (PDA), intelligent toys, household remote controls, and the like.
A speech recognition process mainly comprises pre-processing of the voice signal, extraction of speech recognition features, and pattern matching based on the extracted features. Among these, the most important step in recognizing a voice signal is the extraction of speech recognition features, and the extracted characteristic parameters must satisfy the following requirements: (1) the extracted characteristic parameters can represent the speech features effectively and have good discriminability; (2) the parameters of different orders have good mutual independence; (3) the characteristic parameters are convenient to calculate, preferably with efficient algorithms, so as to guarantee real-time implementation of speech recognition.
However, in current speech recognition systems, the influence of the transmission channel that carries the voice signal causes a certain change in the characteristics of the voice signal, leading to a decline in recognition performance. This problem manifests itself to different degrees for different transmission channels. Therefore, in order to suppress or offset the signal distortion introduced by the transmission channel, measures need to be taken to eliminate the channel interference.
Summary of the invention
The technical problem to be solved by the present invention is to provide a voice signal processing method and device, so as to solve the problem of channel interference in voice signals.
In order to solve the above problem, the invention discloses a voice signal processing method, comprising:
in the log-cepstrum domain, performing cepstral coefficient extraction on the currently observed voice signal to obtain the log-cepstrum of the observed voice;
estimating the transmission-channel log-cepstrum from the statistical mean of the log-cepstrum of voice signals that have not passed through the channel;
subtracting the estimated value of the transmission-channel log-cepstrum from the log-cepstrum of the observed voice to obtain the log-cepstrum of the current voice signal that has not passed through the channel; the log-cepstrum of the voice signal that has not passed through the channel is the result of separating the voice signal from the channel interference.
Wherein, estimating the transmission-channel log-cepstrum from the statistical mean of the log-cepstrum of voice signals that have not passed through the channel specifically comprises:
calculating E[Tc(K)] = E[Sc(K) - RefCep(K)]; wherein Tc(K) denotes the log-cepstrum of the transmission channel, Sc(K) denotes the log-cepstrum of the observed voice, E[X] denotes the statistical mean of X, and RefCep(K) denotes the statistical mean of the log-cepstrum of voice signals that have not passed through the channel;
when a voice signal is present on the transmission channel, approximating E[Tc(K)] by low-pass filtering the above formula, obtaining TranCep(K)_j = TranCep(K)_{j-1}·(1-α1) + (Sc(K) - RefCep(K))·α1; wherein TranCep(K) denotes the estimated value of the transmission-channel log-cepstrum, j is the frame index, and α1 is a smoothing factor.
Preferably, the method further comprises: when no voice signal is present on the transmission channel, approximating E[Tc(K)] by low-pass filtering the above formula, obtaining TranCep(K)_j = TranCep(K)_{j-1}·(1-α2) + Sc(K)·α2; wherein α1 and α2 take different values.
Preferably, the method further comprises: using the signal-to-noise ratio of the observed voice signal, combining the two formulas that calculate TranCep(K) with α1 and α2 into:
TranCep(K)_j = TranCep(K)_{j-1}·(1-α3) + (Sc(K) - RefCep(K))·β1 + Sc(K)·β2
wherein β1 + β2 = α3, and β1 and β2 are determined according to the signal-to-noise ratio.
Preferably, the method further comprises:
updating, according to the formula RefCep(K)_{j+1} = RefCep(K)_j·(1-γ) + Xc(K)·γ and using the log-cepstrum Xc(K) of the current voice signal that has not passed through the channel, the statistical mean RefCep(K) of the log-cepstrum of voice signals that have not passed through the channel; wherein γ is a small quantity and γ < α.
The present invention also provides a voice signal processing device, comprising:
a cepstral coefficient extraction unit, configured to perform, in the log-cepstrum domain, cepstral coefficient extraction on the currently observed voice signal to obtain the log-cepstrum of the observed voice;
a channel log-cepstrum estimation unit, configured to estimate the transmission-channel log-cepstrum from the statistical mean of the log-cepstrum of voice signals that have not passed through the channel;
an interference separation unit, configured to subtract the estimated value of the transmission-channel log-cepstrum from the log-cepstrum of the observed voice to obtain the log-cepstrum of the current voice signal that has not passed through the channel; the log-cepstrum of the voice signal that has not passed through the channel is the result of separating the voice signal from the channel interference.
Wherein, the channel log-cepstrum estimation unit comprises:
a mean calculation subunit, configured to calculate E[Tc(K)] = E[Sc(K) - RefCep(K)]; wherein Tc(K) denotes the log-cepstrum of the transmission channel, Sc(K) denotes the log-cepstrum of the observed voice, E[X] denotes the statistical mean of X, and RefCep(K) denotes the statistical mean of the log-cepstrum of voice signals that have not passed through the channel;
a first estimation subunit, configured to approximate E[Tc(K)] by low-pass filtering the above formula when a voice signal is present on the transmission channel, obtaining
TranCep(K)_j = TranCep(K)_{j-1}·(1-α1) + (Sc(K) - RefCep(K))·α1
wherein TranCep(K) denotes the estimated value of the transmission-channel log-cepstrum, j is the frame index, and α1 is a smoothing factor.
Preferably, the channel log-cepstrum estimation unit further comprises: a second estimation subunit, configured to approximate E[Tc(K)] by low-pass filtering the above formula when no voice signal is present on the transmission channel, obtaining TranCep(K)_j = TranCep(K)_{j-1}·(1-α2) + Sc(K)·α2; wherein α1 and α2 take different values.
Preferably, the channel log-cepstrum estimation unit further comprises:
a combined estimation subunit, configured to use the signal-to-noise ratio of the observed voice signal to combine the two formulas that calculate TranCep(K) with α1 and α2 into:
TranCep(K)_j = TranCep(K)_{j-1}·(1-α3) + (Sc(K) - RefCep(K))·β1 + Sc(K)·β2
wherein β1 + β2 = α3, and β1 and β2 are determined according to the signal-to-noise ratio.
Preferably, the device further comprises:
an updating unit, configured to update, according to the formula RefCep(K)_{j+1} = RefCep(K)_j·(1-γ) + Xc(K)·γ and using the log-cepstrum Xc(K) of the current voice signal that has not passed through the channel, the statistical mean RefCep(K) of the log-cepstrum of voice signals that have not passed through the channel; wherein γ is a small quantity and γ < α.
Compared with the prior art, the present invention has the following advantages:
First, in the process of extracting speech recognition features, the present invention converts the observed voice signal into the log-cepstrum and estimates the log-cepstrum of the transmission channel by means of the statistical mean of the log-cepstrum of voice signals that have not passed through the channel; the estimated transmission-channel log-cepstrum is then subtracted from the log-cepstrum of the observed voice, so that the voice signal and the transmission-channel interference are separated in the cepstrum domain and the log-cepstrum of the voice signal that has not passed through the channel is extracted. This method can eliminate the interference of the transmission channel with the voice signal and improve resistance to transmission-channel interference in the speech recognition feature extraction process, thereby improving the recognition rate.
Moreover, in estimating the log-cepstrum of the transmission channel, a low-pass filtering method is adopted, so that an approximate mean can be calculated using only the current frame and a few preceding frames; the real-time requirement of speech recognition feature extraction can therefore be met.
Second, the estimation method for the transmission-channel log-cepstrum provided by the invention can treat voice segments (i.e., when a voice signal is present on the transmission channel) and non-speech segments (i.e., when no voice signal is present on the transmission channel) differently, that is, adopt different estimation equations for each, so that the transmission-channel log-cepstrum of non-speech segments is estimated more accurately, further improving resistance to channel interference.
Third, according to the characteristics of the actual speaker, the present invention uses the log-cepstrum of the current voice signal that has not passed through the channel, calculated in the processing of every frame, to update the statistical mean of the log-cepstrum of voice signals that have not passed through the channel (whose initial value is a constant), so that this statistical mean comes closer to the speaker's personal characteristics.
Description of drawings
Fig. 1 is a flowchart of a voice signal processing method according to Embodiment 1 of the invention;
Fig. 2 is a flowchart of the speech recognition feature extraction described in an embodiment of the invention;
Fig. 3 is a structural diagram of a voice signal processing device according to a device embodiment of the invention;
Fig. 4 is a structural diagram of the channel log-cepstrum estimation unit U32 in Fig. 3;
Fig. 5 is another structural diagram of the channel log-cepstrum estimation unit U32 in Fig. 3;
Fig. 6 is a structural diagram of a voice signal processing device according to another device embodiment of the invention.
Detailed description of the embodiments
To make the above objects, features and advantages of the present invention more apparent and understandable, the present invention is described in further detail below in conjunction with the drawings and specific embodiments.
The invention provides a voice signal processing method applicable to general channel transmission, where a general channel satisfies: the channel is a convolutive channel; the channel characteristics are relatively stable and change slowly; and the cepstral features of the voice signal tend to be constant in long-term statistics. Therefore, for general channel transmission the following relations hold:
Assume the voice signal that has not passed through the channel (i.e., the voice signal under an ideally equalized channel) is x(n) and the transmission channel is t(n). Then, by the property of a convolutive channel, the observed voice signal s(n) is:
s(n) = x(n) * t(n)   (* denotes convolution)   (1.1)
In the frequency domain, formula (1.1) becomes:
S(i) = X(i)·T(i)
In the log-cepstrum domain, formula (1.1) becomes:
Sc(K) = Xc(K) + Tc(K)   (1.2)
That is, in the log-cepstrum domain, the log-cepstrum Sc(K) of the observed voice equals the log-cepstrum Xc(K) of the voice signal that has not passed through the channel plus the log-cepstrum Tc(K) of the transmission channel, where K is the cepstral index.
The present invention uses exactly formula (1.2): by processing the observed voice signal in the cepstrum domain, the voice signal is separated from the transmission-channel interference, thereby eliminating the interference of the transmission channel with the voice signal and extracting the log-cepstrum of the voice signal that has not passed through the channel, i.e., the equalized cepstral features of the voice signal.
The principle by which the present invention eliminates channel interference is as follows:
According to formula (1.2), the voice signal that has not passed through the channel (i.e., the equalized voice signal) in the log-cepstrum domain is:
Xc(K) = Sc(K) - Tc(K)   (1.3)
Here Sc(K) can be calculated from the observed signal, so the key to extracting Xc(K) is estimating the log-cepstrum Tc(K) of the transmission channel.
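As a brief illustration, a minimal numerical check of relation (1.2), assuming circular convolution and real log-magnitude cepstra rather than the full MFCC chain (the signal lengths and random test data are arbitrary):

```python
import numpy as np

def log_cepstrum(spectrum):
    # Real log-cepstrum: inverse FFT of the log-magnitude spectrum.
    return np.fft.ifft(np.log(np.abs(spectrum) + 1e-12)).real

n = 256
rng = np.random.default_rng(0)
x = rng.standard_normal(n)                        # "clean" signal x(n)
t = np.zeros(n)
t[:8] = rng.standard_normal(8)                    # short channel impulse response t(n)

X, T = np.fft.fft(x), np.fft.fft(t)
S = X * T                                         # circular convolution <=> spectral product, cf. (1.1)

Xc, Tc, Sc = log_cepstrum(X), log_cepstrum(T), log_cepstrum(S)
print(np.allclose(Sc, Xc + Tc))                   # True: Sc(K) = Xc(K) + Tc(K), cf. (1.2)
```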
The method of eliminating channel interference is described in detail below through embodiments.
Embodiment 1:
Referring to Fig. 1, a flowchart of the voice signal processing method described in this embodiment.
S101: in the log-cepstrum domain, perform cepstral coefficient extraction on the currently observed voice signal to obtain the log-cepstrum Sc(K) of the observed voice.
Cepstral coefficient extraction is a common procedure in speech recognition processing, and Mel-scale Frequency Cepstral Coefficients (MFCC) are one of the characteristic parameters most commonly used in speech recognition. MFCC simulate the auditory characteristics of the human ear and can reflect human perception of speech; they extract the speaker's personal characteristics from the speaker's voice signal and have achieved high recognition rates in practical speech recognition applications.
This embodiment can adopt the standard MFCC extraction algorithm: the time-domain signal is first converted to the frequency domain with an FFT (Fast Fourier Transform), its log energy spectrum is then convolved with a triangular filter bank distributed on the Mel scale, and finally a discrete cosine transform (DCT) is applied to the vector formed by the filter outputs and the first N coefficients are taken. Since this algorithm is well known, it is not described in detail here.
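A minimal sketch of this extraction chain (a common variant that applies the logarithm after the Mel filter bank; the frame length, filter count and N = 12 coefficients are assumptions rather than values fixed by the patent):

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters whose centers are evenly spaced on the Mel scale.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def mfcc_frame(frame, sr, n_filters=26, n_coeffs=12):
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2   # power spectrum
    energies = mel_filterbank(n_filters, len(frame), sr) @ spec       # Mel filter bank energies
    return dct(np.log(energies + 1e-10), type=2, norm='ortho')[:n_coeffs]

# Example: one 25 ms frame of a synthetic 8 kHz signal
sr = 8000
frame = np.sin(2 * np.pi * 440.0 * np.arange(int(0.025 * sr)) / sr)
Sc = mfcc_frame(frame, sr)   # observed log-cepstrum Sc(K), K = 1..12
```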
S102: estimate the transmission-channel log-cepstrum from the statistical mean of the log-cepstrum of voice signals that have not passed through the channel.
This step estimates the log-cepstrum Tc(K) of the transmission channel; the estimation method adopted by this embodiment is as follows:
In the first step, formula (1.3) is used to derive the statistical mean E[Tc(K)] of the transmission-channel log-cepstrum Tc(K), specifically:
Let E[X] denote the statistical mean of X, where X may stand for Xc(K), Sc(K) or Tc(K) in the formulas.
According to formula (1.3),
E[Xc(K)] = E[Sc(K)] - E[Tc(K)]
that is: E[Tc(K)] = E[Sc(K)] - E[Xc(K)] = E[Sc(K)] - RefCep(K)
= E[Sc(K) - RefCep(K)]   (1.4)
Here RefCep(K) denotes the statistical mean of the log-cepstrum of voice signals that have not passed through the channel, i.e. E[Xc(K)] = RefCep(K). RefCep(K) is obtained in advance through long-term statistics of the log-cepstrum feature vectors of voice signals under an (ideal) equalized channel, with K = 1..N, where N is typically 12. Since RefCep(K) is a constant, its mean is still a constant, i.e. E[RefCep(K)] = RefCep(K).
In the second step, a low-pass filtering method is adopted to approximate the mean E[Tc(K)], yielding the estimated value of the transmission-channel log-cepstrum.
In formula (1.4), E[Sc(K) - RefCep(K)] requires long-term data statistics before E[Tc(K)] can be obtained, so this embodiment estimates E[Tc(K)] by an approximation.
To satisfy the real-time requirement, this embodiment approximates the mean in formula (1.4) by low-pass filtering. Low-pass filtering lets low-frequency components pass while attenuating components above the cutoff frequency, i.e., it removes high-frequency disturbances, so that the sampling rate can be reduced while avoiding aliasing. There are many low-pass filtering methods, and no limitation is imposed here. This embodiment uses first-order IIR (infinite impulse response) low-pass filtering, obtaining
TranCep(K)_j = TranCep(K)_{j-1}·(1-α1) + (Sc(K) - RefCep(K))·α1   (1.5)
where TranCep(K) denotes the estimated value of the transmission-channel log-cepstrum, j is the frame index, and α1 is a smoothing factor.
The physical meaning of formula (1.5) is to extract the slowly varying part of the MFCC coefficients, which approaches the mean; the result TranCep(K) of formula (1.5) can therefore be used to approximate the mean E[Tc(K)].
As formula (1.5) shows, this embodiment can calculate an approximation of the mean of the transmission-channel log-cepstrum using only the current frame and the previous frame, and uses this approximate mean as the estimate of Tc(K), so the real-time requirement of speech recognition feature extraction can be met.
S103: subtract the estimated value TranCep(K) of the transmission-channel log-cepstrum from the log-cepstrum Sc(K) of the observed voice to obtain the log-cepstrum Xc(K) of the current voice signal that has not passed through the channel; Xc(K) is the result of separating the voice signal from the channel interference.
S101 above has calculated Sc(K) and S102 has calculated the estimate of Tc(K), so Xc(K) can be obtained according to formula (1.3). When the voice signal has passed through channel transmission, the log-cepstrum of the voice signal after the channel interference is eliminated is Xc(K).
In summary, the computation of the log-cepstrum Xc(K) of the voice signal that has not passed through the channel is as follows:
Xc(K) = Sc(K) - TranCep(K)   (1.6)
TranCep(K)_j = TranCep(K)_{j-1}·(1-α1) + (Sc(K) - RefCep(K))·α1   (1.5)
where the initial value of TranCep(K) is 0 and RefCep(K) is obtained by prior statistics.
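A per-frame sketch of (1.5) and (1.6), assuming Sc is the MFCC vector of the current frame; the smoothing factor value α1 = 0.02 and the all-zero placeholder for RefCep(K) are assumptions, not values given by the patent:

```python
import numpy as np

class ChannelCompensator:
    # Per-frame channel compensation following formulas (1.5) and (1.6).
    def __init__(self, ref_cep, alpha1=0.02):
        self.ref_cep = np.asarray(ref_cep, dtype=float)   # RefCep(K), from prior statistics
        self.tran_cep = np.zeros_like(self.ref_cep)       # TranCep(K), initial value 0
        self.alpha1 = alpha1                              # smoothing factor α1

    def process(self, sc):
        # (1.5): first-order IIR smoothing of (Sc - RefCep) approximates E[Tc(K)]
        self.tran_cep = (1 - self.alpha1) * self.tran_cep + self.alpha1 * (sc - self.ref_cep)
        # (1.6): subtract the channel estimate to obtain the equalized cepstrum Xc(K)
        return sc - self.tran_cep

# Usage: feed the observed MFCC vectors frame by frame
ref_cep = np.zeros(12)                # placeholder for the pre-computed RefCep(K)
comp = ChannelCompensator(ref_cep)
frames = np.random.randn(100, 12)     # stand-in for the observed Sc(K) of 100 frames
equalized = np.array([comp.process(sc) for sc in frames])
```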
The above method can eliminate the interference of the transmission channel with the voice signal and improve resistance to transmission-channel interference in the speech recognition feature extraction process, thereby improving the recognition rate.
Embodiment 2:
The method of Embodiment 1 above considers only the situation in which a voice signal is present on the transmission channel (i.e., a voice segment). For the situation in which no voice signal is present on the transmission channel (i.e., a non-speech segment), the transmission-channel log-cepstrum cannot be estimated with formula (1.5); the following formula should be used instead:
TranCep(K)_j = TranCep(K)_{j-1}·(1-α2) + Sc(K)·α2   (1.7)
where TranCep(K) denotes the estimated value of the transmission-channel log-cepstrum, j is the frame index, and α2 is also a smoothing factor but takes a different value from α1.
In the prior art, many anti-interference methods do not consider the handling of non-speech segments, for example blind equalization based on LMS (Least Mean Square) in the frequency domain and blind equalization based on LMS in the cepstrum domain. Both are blind equalization methods that use the LMS algorithm to minimize the error between the observed voice features and reference voice features, thereby obtaining convergent equalized speech feature parameters. The first method works in the spectral domain and the second in the cepstrum domain; LMS-based blind equalization in the cepstrum domain gives a smaller computational load and better convergence. In non-speech segments, however, blind equalization algorithms may converge incorrectly and thus affect the extraction of speech recognition features. To address this problem, this embodiment treats voice segments and non-speech segments differently, that is, it adopts different estimation equations for the transmission-channel log-cepstrum, so that the transmission-channel log-cepstrum of non-speech segments is estimated more accurately and resistance to channel interference is further improved.
Preferably, this embodiment can also use the signal-to-noise ratio (SNR) of the observed voice signal to combine the two formulas (1.5) and (1.7) that calculate TranCep(K) with α1 and α2 into:
TranCep(K)_j = TranCep(K)_{j-1}·(1-α3) + (Sc(K) - RefCep(K))·β1 + Sc(K)·β2   (1.8)
where α3 is also a smoothing factor; the relation between α3 and α1, α2 is: in voice segments α3 equals α1, and in non-speech segments α3 equals α2.
β1 + β2 = α3, with β1 and β2 determined by the signal-to-noise ratio. The signal-to-noise ratio is the ratio of the original signal component to the noise caused by the equipment itself, environmental interference and other factors; it is usually denoted "SNR" or "S/N", is generally expressed in decibels (dB), and the higher it is the better. β1 and β2 satisfy: when the SNR is high, β1 >> β2; when the SNR is low, β1 << β2, as detailed in the following table:
SNR (dB)   20        15        10        5         0         -5        -10
β1         100%·α3   90%·α3    80%·α3    70%·α3    50%·α3    20%·α3    0
β2         0         10%·α3    20%·α3    30%·α3    50%·α3    80%·α3    100%·α3
Table 1
In summary, the computation of the log-cepstrum Xc(K) of the voice signal that has not passed through the channel is:
Xc(K) = Sc(K) - TranCep(K)   (1.6)
TranCep(K)_j = TranCep(K)_{j-1}·(1-α3) + (Sc(K) - RefCep(K))·β1 + Sc(K)·β2   (1.8)
According to Table 1, if SNR ≥ 0 dB then α3 = α1, otherwise α3 = α2; that is, SNR = 0 dB is the critical point between voice segments and non-speech segments.
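A sketch of the SNR-dependent update (1.8) using the weights of Table 1; the linear interpolation between the listed SNR points and the example values α1 = 0.02, α2 = 0.05 are assumptions:

```python
import numpy as np

# Table 1: fraction of α3 assigned to β1 at each SNR (dB); β2 = α3 - β1
SNR_POINTS = np.array([-10.0, -5.0, 0.0, 5.0, 10.0, 15.0, 20.0])
BETA1_FRAC = np.array([0.0, 0.2, 0.5, 0.7, 0.8, 0.9, 1.0])

def channel_update(tran_cep, sc, ref_cep, snr_db, alpha1=0.02, alpha2=0.05):
    # α3 follows the voice/non-speech split at SNR = 0 dB
    alpha3 = alpha1 if snr_db >= 0.0 else alpha2
    beta1 = alpha3 * np.interp(snr_db, SNR_POINTS, BETA1_FRAC)  # interpolated between table points
    beta2 = alpha3 - beta1                                      # so that β1 + β2 = α3
    # (1.8): combined voice-segment / non-speech-segment estimate of TranCep(K)
    return (1 - alpha3) * tran_cep + beta1 * (sc - ref_cep) + beta2 * sc
```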
Embodiment 3:
In the above calculations, the statistical mean RefCep(K) of the log-cepstrum of voice signals that have not passed through the channel is a constant obtained by prior statistics and represents only a generic average. To bring this value closer to each speaker's personal characteristics, this embodiment updates RefCep(K) in the computation of every frame according to the actual speaker's characteristics, using the log-cepstrum Xc(K) of the current voice signal that has not passed through the channel, as follows:
RefCep(K)_{j+1} = RefCep(K)_j·(1-γ) + Xc(K)·γ   (1.9)
where γ is a small quantity and γ < α.
That is, for each speaker's voice signal, the initial value of RefCep(K) is the constant obtained by statistics; after Xc(K) of the current frame has been calculated, it is used to update RefCep(K) according to formula (1.9), and the updated RefCep(K) is used in the calculation of the next frame. In this way, the updated RefCep(K) differs from speaker to speaker and comes closer to each speaker's personal characteristics, which can improve the speech recognition rate.
In practical applications, for better results, the update can be performed when the estimated transmission-channel log-cepstrum TranCep(K) has converged and the signal-to-noise ratio SNR is relatively high.
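A sketch of the adaptive update (1.9), gated as suggested above by convergence of TranCep(K) and a sufficiently high SNR; the value of γ, the SNR threshold and the convergence test are assumptions:

```python
import numpy as np

def update_ref_cep(ref_cep, xc, tran_cep, prev_tran_cep, snr_db,
                   gamma=0.001, snr_min=10.0, conv_tol=1e-3):
    # Update only when the channel estimate has converged and the SNR is high enough.
    converged = np.max(np.abs(tran_cep - prev_tran_cep)) < conv_tol
    if converged and snr_db >= snr_min:
        # (1.9): RefCep(K)_{j+1} = RefCep(K)_j·(1-γ) + Xc(K)·γ, with γ a small quantity
        ref_cep = (1 - gamma) * ref_cep + gamma * xc
    return ref_cep
```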
Based on the three embodiments above, the procedure for extracting speech recognition features with the present invention is shown in Fig. 2.
S201: perform voice enhancement on the observed voice signal s(n) to obtain the enhanced voice signal s'(n).
This step is a pre-processing step. The purpose of voice enhancement is to extract as pure an original voice as possible from the noisy voice signal; many enhancement algorithms are in common use, such as spectral subtraction or Wiener filtering, which this embodiment does not elaborate on.
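As an illustration of one such enhancement step, a minimal single-frame magnitude spectral-subtraction sketch (the over-subtraction factor, the spectral floor and the way the noise magnitude estimate is obtained are assumptions, not details given by the patent):

```python
import numpy as np

def spectral_subtraction(frame, noise_mag, over_sub=1.0, floor=0.02):
    # noise_mag: estimated noise magnitude spectrum, same length as np.fft.rfft(frame)
    spec = np.fft.rfft(frame * np.hanning(len(frame)))
    mag, phase = np.abs(spec), np.angle(spec)
    clean_mag = np.maximum(mag - over_sub * noise_mag, floor * mag)   # subtract noise, keep a spectral floor
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=len(frame))
```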
S202: perform MFCC coefficient extraction on the enhanced voice signal s'(n) to obtain the log-cepstrum Sc(K) of the observed voice.
S203: use the long-term mean RefCep(K) of the voice log-cepstrum features and the signal-to-noise ratio of the observed signal to eliminate the channel interference and obtain the equalized cepstral features of the voice.
RefCep(K) here is the statistical mean of the log-cepstrum of voice signals that have not passed through the channel referred to above, and the equalized cepstral features of the voice are the extracted speech recognition features, which are used in the subsequent pattern matching and recognition process.
Based on the above, a test example is given below to compare and illustrate the performance of the channel interference elimination method of the present invention. The test uses the HTK toolkit as the speech recognition tool, with the standard MFCC coefficients and their first- and second-order derivatives as characteristic parameters. The test sequences are divided into three groups A, B and C; each group contains 50 digit strings of 8 digits each, i.e., 400 digits per group. Group A is collected over the same channel as the training data, group B is collected over a channel different from the training data with a relatively high signal-to-noise ratio, and group C is collected over a channel different from the training data with a lower signal-to-noise ratio.
The following five test cases were considered:
1. no anti-interference method is used;
2. the existing LMS blind equalization algorithm is adopted;
3. the example of the invention constituted by formulas (1.5) and (1.6) is adopted;
4. the example of the invention constituted by formulas (1.6) and (1.8) is adopted;
5. the example of the invention constituted by formula (1.9) is adopted.
Speech recognition tests on the three groups of sequences A, B and C were carried out for the five cases above. The recognition results are shown in the table below (note: the error-rate reduction is relative to test 1):
Table 2 (recognition results; reproduced only as an image in the original publication)
As can be seen from the table, the anti-interference method provided by the invention yields a clear improvement on test sequences collected over channels different from the training data, and compared with the existing method the error rate is further reduced.
In view of the above method embodiments, the present invention also provides corresponding device embodiments.
Referring to Fig. 3, a structural diagram of a voice signal processing device according to an embodiment. The device mainly comprises:
a cepstral coefficient extraction unit U31, configured to perform, in the log-cepstrum domain, cepstral coefficient extraction on the currently observed voice signal to obtain the log-cepstrum of the observed voice;
a channel log-cepstrum estimation unit U32, configured to estimate the transmission-channel log-cepstrum from the statistical mean of the log-cepstrum of voice signals that have not passed through the channel;
an interference separation unit U33, configured to subtract the estimated value of the transmission-channel log-cepstrum from the log-cepstrum of the observed voice to obtain the log-cepstrum of the current voice signal that has not passed through the channel; the log-cepstrum of the voice signal that has not passed through the channel is the result of separating the voice signal from the channel interference.
Referring to Fig. 4, the channel log-cepstrum estimation unit U32 may further comprise:
a mean calculation subunit U321, configured to calculate E[Tc(K)] = E[Sc(K) - RefCep(K)];
wherein Tc(K) denotes the log-cepstrum of the transmission channel, Sc(K) denotes the log-cepstrum of the observed voice, E[X] denotes the statistical mean of X, and RefCep(K) denotes the statistical mean of the log-cepstrum of voice signals that have not passed through the channel;
a first estimation subunit U322, configured to approximate E[Tc(K)] by low-pass filtering the above formula when a voice signal is present on the transmission channel, obtaining
TranCep(K)_j = TranCep(K)_{j-1}·(1-α1) + (Sc(K) - RefCep(K))·α1
wherein TranCep(K) denotes the estimated value of the transmission-channel log-cepstrum, j is the frame index, and α1 is a smoothing factor.
Preferably, referring to Fig. 5, the channel log-cepstrum estimation unit U32 may further comprise:
a second estimation subunit U323, configured to approximate E[Tc(K)] by low-pass filtering the above formula when no voice signal is present on the transmission channel, obtaining TranCep(K)_j = TranCep(K)_{j-1}·(1-α2) + Sc(K)·α2; wherein α1 and α2 take different values.
Preferably, the channel log-cepstrum estimation unit U32 may further comprise:
a combined estimation subunit, configured to use the signal-to-noise ratio of the observed voice signal to combine the two formulas that calculate TranCep(K) with α1 and α2 into:
TranCep(K)_j = TranCep(K)_{j-1}·(1-α3) + (Sc(K) - RefCep(K))·β1 + Sc(K)·β2
wherein β1 + β2 = α3, and β1 and β2 are determined according to the signal-to-noise ratio.
Preferably, referring to Fig. 6, the device may further comprise:
an updating unit U34, configured to update, according to the formula RefCep(K)_{j+1} = RefCep(K)_j·(1-γ) + Xc(K)·γ and using the log-cepstrum Xc(K) of the current voice signal that has not passed through the channel, the statistical mean RefCep(K) of the log-cepstrum of voice signals that have not passed through the channel; wherein γ is a small quantity and γ < α.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another. Since the device embodiments are substantially similar to the method embodiments, their description is relatively brief, and for the relevant details reference may be made to the description of the method embodiments.
The method and device provided by the present invention for eliminating the influence of the transmission channel on voice signals have been described in detail above. Specific examples have been used herein to explain the principle and implementation of the invention, and the description of the above embodiments is only intended to help understand the method of the invention and its core idea. Meanwhile, those of ordinary skill in the art may make changes to the specific implementation and the scope of application in accordance with the idea of the invention. In summary, the content of this description should not be construed as limiting the invention.

Claims (10)

1. A voice signal processing method, characterized by comprising:
in the log-cepstrum domain, performing cepstral coefficient extraction on the currently observed voice signal to obtain the log-cepstrum of the observed voice;
estimating the transmission-channel log-cepstrum from the statistical mean of the log-cepstrum of voice signals that have not passed through the channel;
subtracting the estimated value of the transmission-channel log-cepstrum from the log-cepstrum of the observed voice to obtain the log-cepstrum of the current voice signal that has not passed through the channel; the log-cepstrum of the voice signal that has not passed through the channel being the result of separating the voice signal from the channel interference.
2. The method according to claim 1, characterized in that estimating the transmission-channel log-cepstrum from the statistical mean of the log-cepstrum of voice signals that have not passed through the channel specifically comprises:
calculating E[Tc(K)] = E[Sc(K) - RefCep(K)];
wherein Tc(K) denotes the log-cepstrum of the transmission channel, Sc(K) denotes the log-cepstrum of the observed voice, E[X] denotes the statistical mean of X, and RefCep(K) denotes the statistical mean of the log-cepstrum of voice signals that have not passed through the channel;
when a voice signal is present on the transmission channel, approximating E[Tc(K)] by low-pass filtering the above formula, obtaining TranCep(K)_j = TranCep(K)_{j-1}·(1-α1) + (Sc(K) - RefCep(K))·α1;
wherein TranCep(K) denotes the estimated value of the transmission-channel log-cepstrum, j is the frame index, and α1 is a smoothing factor.
3. The method according to claim 2, characterized by further comprising:
when no voice signal is present on the transmission channel, approximating E[Tc(K)] by low-pass filtering the above formula, obtaining TranCep(K)_j = TranCep(K)_{j-1}·(1-α2) + Sc(K)·α2;
wherein α1 and α2 take different values.
4. The method according to claim 3, characterized by further comprising:
using the signal-to-noise ratio of the observed voice signal, combining the two formulas that calculate TranCep(K) with α1 and α2 into:
TranCep(K)_j = TranCep(K)_{j-1}·(1-α3) + (Sc(K) - RefCep(K))·β1 + Sc(K)·β2
wherein β1 + β2 = α3, and β1 and β2 are determined according to the signal-to-noise ratio.
5. The method according to any one of claims 2 to 4, characterized by further comprising:
updating, according to the formula RefCep(K)_{j+1} = RefCep(K)_j·(1-γ) + Xc(K)·γ and using the log-cepstrum Xc(K) of the current voice signal that has not passed through the channel, the statistical mean RefCep(K) of the log-cepstrum of voice signals that have not passed through the channel;
wherein γ is a small quantity and γ < α.
6. A voice signal processing device, characterized by comprising:
a cepstral coefficient extraction unit, configured to perform, in the log-cepstrum domain, cepstral coefficient extraction on the currently observed voice signal to obtain the log-cepstrum of the observed voice;
a channel log-cepstrum estimation unit, configured to estimate the transmission-channel log-cepstrum from the statistical mean of the log-cepstrum of voice signals that have not passed through the channel;
an interference separation unit, configured to subtract the estimated value of the transmission-channel log-cepstrum from the log-cepstrum of the observed voice to obtain the log-cepstrum of the current voice signal that has not passed through the channel; the log-cepstrum of the voice signal that has not passed through the channel being the result of separating the voice signal from the channel interference.
7. The device according to claim 6, characterized in that the channel log-cepstrum estimation unit comprises:
a mean calculation subunit, configured to calculate E[Tc(K)] = E[Sc(K) - RefCep(K)];
wherein Tc(K) denotes the log-cepstrum of the transmission channel, Sc(K) denotes the log-cepstrum of the observed voice, E[X] denotes the statistical mean of X, and RefCep(K) denotes the statistical mean of the log-cepstrum of voice signals that have not passed through the channel;
a first estimation subunit, configured to approximate E[Tc(K)] by low-pass filtering the above formula when a voice signal is present on the transmission channel, obtaining
TranCep(K)_j = TranCep(K)_{j-1}·(1-α1) + (Sc(K) - RefCep(K))·α1
wherein TranCep(K) denotes the estimated value of the transmission-channel log-cepstrum, j is the frame index, and α1 is a smoothing factor.
8. The device according to claim 7, characterized in that the channel log-cepstrum estimation unit further comprises:
a second estimation subunit, configured to approximate E[Tc(K)] by low-pass filtering the above formula when no voice signal is present on the transmission channel, obtaining TranCep(K)_j = TranCep(K)_{j-1}·(1-α2) + Sc(K)·α2; wherein α1 and α2 take different values.
9. The device according to claim 8, characterized in that the channel log-cepstrum estimation unit further comprises:
a combined estimation subunit, configured to use the signal-to-noise ratio of the observed voice signal to combine the two formulas that calculate TranCep(K) with α1 and α2 into:
TranCep(K)_j = TranCep(K)_{j-1}·(1-α3) + (Sc(K) - RefCep(K))·β1 + Sc(K)·β2
wherein β1 + β2 = α3, and β1 and β2 are determined according to the signal-to-noise ratio.
10. The device according to any one of claims 7 to 9, characterized in that the device further comprises:
an updating unit, configured to update, according to the formula RefCep(K)_{j+1} = RefCep(K)_j·(1-γ) + Xc(K)·γ and using the log-cepstrum Xc(K) of the current voice signal that has not passed through the channel, the statistical mean RefCep(K) of the log-cepstrum of voice signals that have not passed through the channel; wherein γ is a small quantity and γ < α.
CN2009100783316A 2009-02-25 2009-02-25 Method for processing voice signal and device Active CN101533642B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009100783316A CN101533642B (en) 2009-02-25 2009-02-25 Method for processing voice signal and device

Publications (2)

Publication Number Publication Date
CN101533642A true CN101533642A (en) 2009-09-16
CN101533642B CN101533642B (en) 2013-02-13

Family

ID=41104196

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009100783316A Active CN101533642B (en) 2009-02-25 2009-02-25 Method for processing voice signal and device

Country Status (1)

Country Link
CN (1) CN101533642B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102947883A (en) * 2010-01-29 2013-02-27 循环逻辑有限责任公司 Method and apparatus for canonical nonlinear analysis of audio signals
CN108848044A (en) * 2018-06-25 2018-11-20 电子科技大学 A kind of extracting method of channel fine feature
CN109599118A (en) * 2019-01-24 2019-04-09 宁波大学 A kind of voice playback detection method of robustness
CN113077787A (en) * 2020-12-22 2021-07-06 珠海市杰理科技股份有限公司 Voice data identification method, device, chip and readable storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1162838C (en) * 2002-07-12 2004-08-18 清华大学 Speech intensifying-characteristic weighing-logrithmic spectrum addition method for anti-noise speech recognization
US7146315B2 (en) * 2002-08-30 2006-12-05 Siemens Corporate Research, Inc. Multichannel voice detection in adverse environments
CN1182513C (en) * 2003-02-21 2004-12-29 清华大学 Antinoise voice recognition method based on weighted local energy

Also Published As

Publication number Publication date
CN101533642B (en) 2013-02-13

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20171221

Address after: 100083 Haidian District, Xueyuan Road, No. 35, the world building, the second floor of the building on the ground floor, No. 16

Patentee after: Zhongxing Technology Co., Ltd.

Address before: 100083 Haidian District, Xueyuan Road, No. 35, the world building, the second floor of the building on the ground floor, No. 16

Patentee before: Beijing Vimicro Corporation

CP01 Change in the name or title of a patent holder
CP01 Change in the name or title of a patent holder

Address after: 100083 Haidian District, Xueyuan Road, No. 35, the world building, the second floor of the building on the ground floor, No. 16

Patentee after: Mid Star Technology Limited by Share Ltd

Address before: 100083 Haidian District, Xueyuan Road, No. 35, the world building, the second floor of the building on the ground floor, No. 16

Patentee before: Zhongxing Technology Co., Ltd.