CN101533642B - Method for processing voice signal and device - Google Patents


Info

Publication number
CN101533642B
Authority
CN
China
Prior art keywords
cepstrum
channel
voice signal
logarithm
logarithm cepstrum
Legal status
Active
Application number
CN2009100783316A
Other languages
Chinese (zh)
Other versions
CN101533642A
Inventor
张晨
冯宇红
Current Assignee
Mid Star Technology Ltd By Share Ltd
Original Assignee
Vimicro Corp
Application filed by Vimicro Corp
Priority to CN2009100783316A
Publication of CN101533642A
Application granted
Publication of CN101533642B
Legal status: Active
Anticipated expiration

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention provides a speech signal processing method and device that address the problem of channel disturbance in speech signals. The method comprises the following steps. First, cepstral coefficients are extracted in the log cepstrum domain from the currently observed speech signal, giving the log cepstrum of the observed speech. Second, the statistical mean of the log cepstrum of the channel-free speech signal (speech that has not passed through the channel) is used together with low-pass filtering to approximate the mean of the channel log cepstrum, which gives an estimate of the transmission-channel log cepstrum. Third, this estimate is subtracted from the log cepstrum of the observed speech to give the current log cepstrum of the channel-free speech signal. That result separates the speech signal from the channel disturbance. The invention removes channel disturbance from speech signals and makes speech recognition feature extraction more robust to transmission-channel disturbance, which improves the recognition rate.

Description

Speech signal processing method and device
Technical field
The present invention relates to the field of speech recognition technology, and in particular to a speech signal processing method and device.
Background
After more than half a century of research worldwide, speech recognition technology has gradually entered the practical stage. Speech chips have become increasingly widespread in recent years. Applications mainly include:
- voice dialing in telephony;
- voice-based identity authentication;
- voice input;
- voice control in automobiles, industrial control and medical applications;
- voice interaction interfaces for personal digital assistants (PDAs), smart toys, home remote controls, and so on.
The speech recognition process has three main parts: preprocessing of the speech signal, extraction of speech recognition features, and pattern matching based on the extracted features. Feature extraction is the most important of these. The extracted feature parameters must meet the following requirements: (1) they represent speech characteristics effectively and discriminate well; (2) parameters of different orders are largely independent of one another; (3) they are convenient to compute, preferably with an efficient algorithm, so that speech recognition can run in real time.
However, in current speech recognition systems, the transmission channel that carries the speech signal changes the signal's characteristics, and this degrades recognition performance. How severe the problem is depends on the transmission channel. To suppress or cancel the signal distortion introduced by the transmission channel, channel disturbance must therefore be eliminated.
Summary of the invention
The technical problem addressed by the present invention is to provide a speech signal processing method and device that solve the problem of channel disturbance in speech signals.
To solve this problem, the invention discloses a speech signal processing method, comprising:
In the log cepstrum domain, extracting cepstral coefficients from the currently observed speech signal to obtain the log cepstrum of the observed speech;
Based on the statistical mean of the log cepstrum of the channel-free speech signal, approximating the mean of the channel log cepstrum by low-pass filtering to obtain an estimate of the transmission-channel log cepstrum;
Subtracting the estimate of the transmission-channel log cepstrum from the log cepstrum of the observed speech to obtain the current log cepstrum of the channel-free speech signal. This log cepstrum is the result of separating the speech signal from the channel disturbance.
Specifically, obtaining the estimate of the transmission-channel log cepstrum from the statistical mean of the channel-free log cepstrum by low-pass filtering comprises:
Computing $E[\mathrm{Tc}(K)] = E[\mathrm{Sc}(K) - \mathrm{RefCep}(K)]$, where:
- Tc(K) is the log cepstrum of the transmission channel;
- Sc(K) is the log cepstrum of the observed speech;
- E[X] denotes the statistical mean of X;
- RefCep(K) is the statistical mean of the log cepstrum of the channel-free speech signal;
- K is the cepstral coefficient index.
When a speech signal is present on the transmission channel, approximating E[Tc(K)] by low-pass filtering the above expression, which gives $\mathrm{TranCep}_j(K) = \mathrm{TranCep}_{j-1}(K)(1-\alpha_1) + (\mathrm{Sc}(K)-\mathrm{RefCep}(K))\alpha_1$. Here TranCep(K) is the estimate of the transmission-channel log cepstrum, j is the frame index, and $\alpha_1$ is a smoothing factor.
Preferably, the method further comprises: when no speech signal is present on the transmission channel, approximating E[Tc(K)] by low-pass filtering, which gives $\mathrm{TranCep}_j(K) = \mathrm{TranCep}_{j-1}(K)(1-\alpha_2) + \mathrm{Sc}(K)\alpha_2$. Here $\alpha_2$ is a smoothing factor whose value differs from that of $\alpha_1$.
Preferably, the method further comprises: using the signal-to-noise ratio (SNR) of the observed speech signal to combine the two formulas for TranCep(K) (the one with $\alpha_1$ and the one with $\alpha_2$) as follows:
$$\mathrm{TranCep}_j(K) = \mathrm{TranCep}_{j-1}(K)(1-\alpha_3) + (\mathrm{Sc}(K)-\mathrm{RefCep}(K))\beta_1 + \mathrm{Sc}(K)\beta_2$$
Here $\beta_1 + \beta_2 = \alpha_3$, and $\beta_1$ and $\beta_2$ are determined by the SNR.
Preferably, the method further comprises:
Updating the statistical mean RefCep(K) of the channel-free log cepstrum with the current channel-free log cepstrum Xc(K), according to $\mathrm{RefCep}_{j+1}(K) = \mathrm{RefCep}_j(K)(1-\gamma) + \mathrm{Xc}(K)\gamma$. Here $\gamma < \alpha_3$, and $\alpha_3$ is a smoothing factor that equals $\alpha_1$ in speech segments and $\alpha_2$ in non-speech segments.
The present invention also provides a speech signal processing device, comprising:
A cepstral coefficient extraction unit, configured to extract cepstral coefficients in the log cepstrum domain from the currently observed speech signal to obtain the log cepstrum of the observed speech;
A channel log cepstrum estimation unit, configured to approximate the mean of the channel log cepstrum by low-pass filtering, based on the statistical mean of the channel-free log cepstrum, to obtain an estimate of the transmission-channel log cepstrum;
An interference separation unit, configured to subtract the estimate of the transmission-channel log cepstrum from the log cepstrum of the observed speech to obtain the current log cepstrum of the channel-free speech signal. This log cepstrum is the result of separating the speech signal from the channel disturbance.
The channel log cepstrum estimation unit comprises:
A mean computation subunit, configured to compute $E[\mathrm{Tc}(K)] = E[\mathrm{Sc}(K) - \mathrm{RefCep}(K)]$, where:
- Tc(K) is the log cepstrum of the transmission channel;
- Sc(K) is the log cepstrum of the observed speech;
- E[X] denotes the statistical mean of X;
- RefCep(K) is the statistical mean of the channel-free log cepstrum;
- K is the cepstral coefficient index.
A first estimation subunit, configured to approximate E[Tc(K)] by low-pass filtering the above expression when a speech signal is present on the transmission channel, which gives:
$$\mathrm{TranCep}_j(K) = \mathrm{TranCep}_{j-1}(K)(1-\alpha_1) + (\mathrm{Sc}(K)-\mathrm{RefCep}(K))\alpha_1$$
Here TranCep(K) is the estimate of the transmission-channel log cepstrum, j is the frame index, and $\alpha_1$ is a smoothing factor.
Preferably, the channel log cepstrum estimation unit further comprises a second estimation subunit. When no speech signal is present on the transmission channel, this subunit approximates E[Tc(K)] by low-pass filtering, which gives $\mathrm{TranCep}_j(K) = \mathrm{TranCep}_{j-1}(K)(1-\alpha_2) + \mathrm{Sc}(K)\alpha_2$. Here $\alpha_2$ is a smoothing factor whose value differs from that of $\alpha_1$.
Preferably, the channel log cepstrum estimation unit further comprises:
A comprehensive estimation subunit, configured to use the SNR of the observed speech signal to combine the two formulas for TranCep(K) (the one with $\alpha_1$ and the one with $\alpha_2$) as follows:
$$\mathrm{TranCep}_j(K) = \mathrm{TranCep}_{j-1}(K)(1-\alpha_3) + (\mathrm{Sc}(K)-\mathrm{RefCep}(K))\beta_1 + \mathrm{Sc}(K)\beta_2$$
Here $\beta_1 + \beta_2 = \alpha_3$, and $\beta_1$ and $\beta_2$ are determined by the SNR.
Preferably, the device further comprises:
An updating unit, configured to update the statistical mean RefCep(K) of the channel-free log cepstrum with the current channel-free log cepstrum Xc(K), according to $\mathrm{RefCep}_{j+1}(K) = \mathrm{RefCep}_j(K)(1-\gamma) + \mathrm{Xc}(K)\gamma$. Here $\gamma < \alpha_3$, and $\alpha_3$ is a smoothing factor that equals $\alpha_1$ in speech segments and $\alpha_2$ in non-speech segments.
Compared with the prior art, the present invention has the following advantages:
First, while extracting speech recognition features, the invention converts the observed speech signal into a log cepstrum. It then uses the statistical mean of the channel-free log cepstrum and low-pass filtering to approximate the mean of the channel log cepstrum, which estimates the log cepstrum of the transmission channel. This estimate is subtracted from the log cepstrum of the observed speech. In this way the speech signal is separated from the transmission-channel disturbance in the cepstrum domain, and the channel-free log cepstrum is extracted. The method removes the transmission channel's disturbance from the speech signal and makes feature extraction more robust to that disturbance, which improves the recognition rate.
Moreover, the channel log cepstrum is estimated by low-pass filtering, so an approximate mean can be computed from the current frame and the preceding frames alone. This meets the requirement of real-time feature extraction.
Second, the channel log cepstrum estimation method of the invention can treat speech segments (a speech signal is present on the transmission channel) and non-speech segments (no speech signal is present) differently, with a different estimation formula for each. The channel log cepstrum of non-speech segments is therefore estimated more accurately, which further improves robustness to channel disturbance.
Third, to reflect the characteristics of the actual speaker, the invention updates the statistical mean of the channel-free log cepstrum (whose initial value is a constant) in every frame, using the current channel-free log cepstrum computed for that frame. This brings the mean closer to the speaker's individual characteristics.
Description of drawings
Fig. 1 is a flowchart of a speech signal processing method according to Embodiment 1 of the invention;
Fig. 2 is a flowchart of speech recognition feature extraction according to an embodiment of the invention;
Fig. 3 is a structural diagram of a speech signal processing device according to a device embodiment of the invention;
Fig. 4 is a structural diagram of the channel log cepstrum estimation unit U32 in Fig. 3;
Fig. 5 is another structural diagram of the channel log cepstrum estimation unit U32 in Fig. 3;
Fig. 6 is a structural diagram of a speech signal processing device according to another device embodiment of the invention.
Embodiment
To make the above objectives, features and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and specific embodiments.
The invention provides a speech signal processing method for typical transmission conditions. Under these conditions:
- the channel is a convolutional channel;
- its characteristics are fairly stable and change slowly;
- over long-term statistics, the cepstral features of the speech signal tend toward a constant.
For typical transmission, the following relations therefore hold:
Let x(n) denote the channel-free speech signal (the speech signal over an ideal, equalized channel) and t(n) the transmission channel. By the property of a convolutional channel, the observed speech signal s(n) is:
$$s(n) = x(n) * t(n) \qquad (1.1)$$
where * denotes convolution. In the frequency domain, formula (1.1) becomes:
$$S(i) = X(i)\,T(i)$$
and in the log cepstrum domain it becomes:
$$\mathrm{Sc}(K) = \mathrm{Xc}(K) + \mathrm{Tc}(K) \qquad (1.2)$$
The logarithm turns the product of spectra into a sum, and the cepstral transform is linear. So in the log cepstrum domain, the log cepstrum Sc(K) of the observed speech equals the log cepstrum Xc(K) of the channel-free speech plus the log cepstrum Tc(K) of the transmission channel, where K is the cepstral coefficient index.
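The additivity in (1.2) can be checked numerically. This is a sketch under stated assumptions, not the patent's feature extractor:
- it uses the real cepstrum (inverse DFT of the log magnitude spectrum) rather than MFCCs;
- it models the channel as a circular convolution within one frame;
- it uses a naive stdlib DFT.

The sample values and the helper names (`dft`, `idft`, `circular_convolve`, `real_cepstrum`) are hypothetical.

```python
import cmath
import math
from typing import List, Sequence


def dft(x: Sequence[complex]) -> List[complex]:
    """Naive O(N^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n))
            for k in range(n)]


def idft(spec: Sequence[complex]) -> List[complex]:
    """Inverse of dft()."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * m / n) for k in range(n)) / n
            for m in range(n)]


def circular_convolve(x: Sequence[float], t: Sequence[float]) -> List[float]:
    """s(n) = x(n) * t(n), taken circularly over one frame of length N."""
    n = len(x)
    return [sum(x[m] * t[(i - m) % n] for m in range(n)) for i in range(n)]


def real_cepstrum(x: Sequence[float]) -> List[float]:
    """Inverse DFT of the log magnitude spectrum (assumes no zero spectral bins)."""
    log_mag = [math.log(abs(v)) for v in dft(x)]
    return [c.real for c in idft(log_mag)]


# Illustrative clean frame x(n) and short channel impulse response t(n).
x = [1.0, 0.5, -0.3, 0.2, 0.1, 0.0, -0.1, 0.05]
t = [1.0, 0.4, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0]
s = circular_convolve(x, t)

sc, xc, tc = real_cepstrum(s), real_cepstrum(x), real_cepstrum(t)
max_err = max(abs(a - (b + c)) for a, b, c in zip(sc, xc, tc))
print(f"max |Sc - (Xc + Tc)| = {max_err:.2e}")
```

For MFCCs the relation holds only approximately. The Mel filter bank sums spectral energy inside each band before the logarithm, so a channel that is not flat within a band does not separate exactly.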
The present invention exploits formula (1.2). It processes the observed speech signal in the cepstrum domain to separate the speech signal from the transmission-channel disturbance. This removes the channel's disturbance and extracts the channel-free log cepstrum, that is, the equalized cepstral features of the speech signal.
The principle by which the invention eliminates channel disturbance is as follows.
From formula (1.2), the channel-free (equalized) speech signal in the log cepstrum domain is:
$$\mathrm{Xc}(K) = \mathrm{Sc}(K) - \mathrm{Tc}(K) \qquad (1.3)$$
Sc(K) can be computed from the observed signal. The key to extracting Xc(K) is therefore to estimate the log cepstrum Tc(K) of the transmission channel.
The method for eliminating channel disturbance is described in detail below through embodiments.
Embodiment 1:
Fig. 1 is a flowchart of the speech signal processing method of this embodiment.
S101: In the log cepstrum domain, extract cepstral coefficients from the currently observed speech signal to obtain the log cepstrum Sc(K) of the observed speech.
Cepstral coefficient extraction is a general step in speech recognition processing. Mel-frequency cepstral coefficients (MFCCs) are among the most commonly used feature parameters in speech recognition. MFCCs model the auditory properties of the human ear and reflect how people perceive speech. They capture the speaker's individual characteristics from the speech signal and have achieved high recognition rates in practical applications.
This embodiment may use the standard MFCC extraction algorithm, which has three stages:
1. An FFT (Fast Fourier Transform) converts the time-domain signal to the frequency domain.
2. A bank of triangular filters spaced on the Mel scale filters the energy spectrum, and the logarithm of each filter output is taken.
3. A discrete cosine transform (DCT) is applied to the vector of log filter outputs, and the first N coefficients are kept.

The algorithm is well known and is not described in detail here.
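Below is a compact stdlib-only sketch of the standard MFCC steps just described (window, spectrum, Mel filter bank, logarithm, DCT). It is an illustration, not the patent's implementation. The following are all assumptions:
- an 8 kHz sample rate and a 256-sample frame;
- a Hamming window and 24 filters;
- the Mel formula 2595·log10(1 + f/700);
- keeping c1..c12 (N = 12, the typical value given in this description) and dropping c0;
- a naive DFT in place of an FFT.

```python
import cmath
import math
import random
from typing import List, Sequence


def hz_to_mel(f: float) -> float:
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m: float) -> float:
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame: Sequence[float]) -> List[float]:
    """Hamming-windowed |DFT|^2 for bins 0..N/2 (naive DFT, stdlib only)."""
    n = len(frame)
    xw = [v * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1))) for i, v in enumerate(frame)]
    return [abs(sum(xw[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n))) ** 2
            for k in range(n // 2 + 1)]


def mel_filterbank(num_filters: int, n_fft: int, sample_rate: float) -> List[List[float]]:
    """Triangular filters whose edges are equally spaced on the Mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * i / (num_filters + 1)) * n_fft / sample_rate
             for i in range(num_filters + 2)]  # fractional FFT-bin positions
    bank = []
    for j in range(1, num_filters + 1):
        left, centre, right = edges[j - 1], edges[j], edges[j + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            if left < k <= centre:
                row.append((k - left) / (centre - left))
            elif centre < k < right:
                row.append((right - k) / (right - centre))
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mfcc(frame: Sequence[float], sample_rate: float = 8000.0,
         num_filters: int = 24, num_ceps: int = 12) -> List[float]:
    """Return c1..c_num_ceps: the DCT-II of the log Mel filter-bank energies."""
    spec = power_spectrum(frame)
    log_e = [math.log(max(sum(w * p for w, p in zip(row, spec)), 1e-10))
             for row in mel_filterbank(num_filters, len(frame), sample_rate)]
    return [sum(e * math.cos(math.pi * k * (i + 0.5) / num_filters) for i, e in enumerate(log_e))
            for k in range(1, num_ceps + 1)]


rng = random.Random(0)
frame = [math.sin(2 * math.pi * 440.5 * i / 8000.0) + 0.1 * rng.gauss(0.0, 1.0)
         for i in range(256)]
print([round(c, 3) for c in mfcc(frame)])
```

Scaling the frame by a constant gain (a perfectly flat channel) shifts every log filter-bank energy by the same amount. That shift lands entirely in c0, so c1..c12 are unchanged. A non-flat channel shifts all coefficients, and that shift is the Tc(K) term the method estimates.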
S102: Based on the statistical mean of the channel-free log cepstrum, approximate the mean of the channel log cepstrum by low-pass filtering to obtain an estimate of the transmission-channel log cepstrum.
This step estimates the log cepstrum Tc(K) of the transmission channel. This embodiment uses the following method:
Step 1: Use formula (1.3) to compute the statistical mean E[Tc(K)] of the transmission-channel log cepstrum Tc(K), as follows.
Let E[X] denote the statistical mean of X, where X may stand for Xc(K), Sc(K) or Tc(K) in the formulas.
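From formula (1.3):
$$E[\mathrm{Xc}(K)] = E[\mathrm{Sc}(K)] - E[\mathrm{Tc}(K)]$$
that is,
$$E[\mathrm{Tc}(K)] = E[\mathrm{Sc}(K)] - E[\mathrm{Xc}(K)] = E[\mathrm{Sc}(K)] - \mathrm{RefCep}(K) = E[\mathrm{Sc}(K) - \mathrm{RefCep}(K)] \qquad (1.4)$$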
Here RefCep(K) denotes the statistical mean of the channel-free log cepstrum, i.e. E[Xc(K)] = RefCep(K). RefCep(K) is obtained in advance from long-term statistics of log cepstral feature vectors of speech recorded over an equalized channel (the ideal case). K = 1, ..., N, where N is typically 12. Because RefCep(K) is a constant, its mean is the same constant, i.e. E[RefCep(K)] = RefCep(K).
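Below is a minimal sketch of how RefCep(K) could be obtained offline. It assumes RefCep(K) is simply the per-coefficient average of log cepstral vectors recorded over an equalized channel. The description does not specify the statistics procedure, and the function name and training frames are hypothetical.

```python
from typing import List, Sequence


def long_term_mean(cepstra: Sequence[Sequence[float]]) -> List[float]:
    """Per-coefficient mean over frames; list index 0 holds K = 1."""
    if not cepstra:
        raise ValueError("need at least one cepstral frame")
    n = len(cepstra[0])
    totals = [0.0] * n
    for frame in cepstra:
        if len(frame) != n:
            raise ValueError("all frames must have the same number of coefficients")
        for k, value in enumerate(frame):
            totals[k] += value
    return [total / len(cepstra) for total in totals]


# Hypothetical clean-channel training frames with N = 3 coefficients each.
frames = [[1.0, -2.0, 0.5], [3.0, 0.0, 0.5], [2.0, -1.0, -1.0]]
ref_cep = long_term_mean(frames)
print(ref_cep)  # [2.0, -1.0, 0.0]
```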
Step 2: Approximate the mean E[Tc(K)] by low-pass filtering to obtain an estimate of the transmission-channel log cepstrum.
In formula (1.4), E[Sc(K) - RefCep(K)], and therefore E[Tc(K)], can only be obtained from long-term data statistics. This embodiment therefore estimates E[Tc(K)] by approximation.
To meet the real-time requirement, this embodiment approximates the expectation E[X] in formula (1.4) by low-pass filtering. A low-pass filter passes low-frequency components and attenuates components above its cutoff frequency. Applied along the frame index, it keeps the slowly varying part of each cepstral coefficient and suppresses fast frame-to-frame fluctuations. Many low-pass filtering methods exist, and the invention is not limited to any of them. This embodiment uses a first-order IIR (infinite impulse response) low-pass filter, which gives:
$$\mathrm{TranCep}_j(K) = \mathrm{TranCep}_{j-1}(K)(1-\alpha_1) + (\mathrm{Sc}(K)-\mathrm{RefCep}(K))\alpha_1 \qquad (1.5)$$
Here TranCep(K) is the estimate of the transmission-channel log cepstrum, j is the frame index, and $\alpha_1$ is a smoothing factor.
Physically, formula (1.5) extracts the slowly varying part of the MFCC coefficients, which approaches their mean. The result TranCep(K) of formula (1.5) can therefore be used to approximate the mean E[Tc(K)].
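To make this claim concrete, the recursion (1.5) can be unrolled and its expectation taken. The derivation assumes a stationary channel and adds a frame subscript j to Sc(K) for clarity.

```latex
\[
\begin{aligned}
\mathrm{TranCep}_j(K)
  &= (1-\alpha_1)\,\mathrm{TranCep}_{j-1}(K) + \alpha_1\bigl(\mathrm{Sc}_j(K)-\mathrm{RefCep}(K)\bigr)\\
  &= (1-\alpha_1)^{j}\,\mathrm{TranCep}_0(K)
     + \alpha_1\sum_{i=0}^{j-1}(1-\alpha_1)^{i}\bigl(\mathrm{Sc}_{j-i}(K)-\mathrm{RefCep}(K)\bigr)
\end{aligned}
\]
% With TranCep_0(K) = 0 and E[Sc_j(K) - RefCep(K)] = E[Tc(K)] from (1.4):
\[
E\bigl[\mathrm{TranCep}_j(K)\bigr]
  = \alpha_1\sum_{i=0}^{j-1}(1-\alpha_1)^{i}\,E[\mathrm{Tc}(K)]
  = \bigl(1-(1-\alpha_1)^{j}\bigr)\,E[\mathrm{Tc}(K)]
  \;\longrightarrow\; E[\mathrm{Tc}(K)] \quad (j \to \infty)
\]
```

The weights (1-α1)^i decay geometrically, so the filter's effective memory is about 1/α1 frames. As an illustration, assume α1 = 0.01 and a 10 ms frame shift (neither value is given in the description). The memory is then about 100 frames, or roughly 1 s of speech. A smaller α1 gives a smoother channel estimate that adapts more slowly.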
Formula (1.5) needs only the current frame and the estimate from the previous frame to compute an approximate mean of the transmission-channel log cepstrum. This embodiment uses that approximate mean as the estimate of Tc(K), so it meets the requirement of real-time feature extraction.
S103: Subtract the estimate TranCep(K) of the transmission-channel log cepstrum from the log cepstrum Sc(K) of the observed speech to obtain the current channel-free log cepstrum Xc(K). Xc(K) is the result of separating the speech signal from the channel disturbance.
S101 computed Sc(K) and S102 computed an estimate of Tc(K), so Xc(K) follows from formula (1.3). Xc(K) is the log cepstrum of the transmitted speech signal after channel disturbance has been removed.
In summary, the channel-free log cepstrum Xc(K) is computed as follows:
$$\mathrm{Xc}(K) = \mathrm{Sc}(K) - \mathrm{TranCep}(K) \qquad (1.6)$$
$$\mathrm{TranCep}_j(K) = \mathrm{TranCep}_{j-1}(K)(1-\alpha_1) + (\mathrm{Sc}(K)-\mathrm{RefCep}(K))\alpha_1 \qquad (1.5)$$
Here the initial value of TranCep(K) is 0, and RefCep(K) is obtained from prior statistics.
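Formulas (1.5) and (1.6) can be sketched as a small per-frame processor. The sketch makes three assumptions:
- RefCep(K) is supplied from prior statistics;
- α1 = 0.01 is an illustrative value;
- Xc(K) for frame j uses the TranCep(K) just updated from that same frame (the description does not state the ordering).

The simulation adds a hypothetical constant channel offset to clean cepstra that alternate around RefCep(K).

```python
from typing import List, Sequence


class SpeechChannelNormalizer:
    """Per-frame channel removal following formulas (1.5) and (1.6)."""

    def __init__(self, ref_cep: Sequence[float], alpha1: float = 0.01) -> None:
        if not 0.0 < alpha1 <= 1.0:
            raise ValueError("alpha1 must lie in (0, 1]")
        self.ref_cep = list(ref_cep)
        self.alpha1 = alpha1
        self.tran_cep = [0.0] * len(self.ref_cep)  # TranCep(K) starts at 0

    def process(self, sc: Sequence[float]) -> List[float]:
        """Update TranCep(K) with this frame (1.5), then return Xc(K) = Sc(K) - TranCep(K) (1.6)."""
        a = self.alpha1
        self.tran_cep = [t * (1.0 - a) + (s - r) * a
                         for t, s, r in zip(self.tran_cep, sc, self.ref_cep)]
        return [s - t for s, t in zip(sc, self.tran_cep)]


# Simulated frames: clean cepstra alternate around RefCep = 0, plus a fixed channel offset.
normalizer = SpeechChannelNormalizer(ref_cep=[0.0, 0.0], alpha1=0.01)
channel = [0.5, -0.2]
for j in range(2000):
    clean = [1.0, -1.0] if j % 2 == 0 else [-1.0, 1.0]
    xc = normalizer.process([x + c for x, c in zip(clean, channel)])
print([round(v, 3) for v in normalizer.tran_cep], [round(v, 3) for v in xc])
```

After enough frames, TranCep(K) settles near the channel offset. Xc(K) then tracks the clean cepstrum up to a small ripple of about α1/(2 - α1) per unit of frame-to-frame swing.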
Said method can be eliminated transmission channel to the interference of voice signal, improves the ability of anti-transmission-channel interference in the speech recognition features leaching process, thereby improves discrimination.
Embodiment 2:
The method of Embodiment 1 considers only the case where a speech signal is present on the transmission channel (speech segments). When no speech signal is present (non-speech segments), the channel log cepstrum should be estimated with the following formula instead of formula (1.5):
$$\mathrm{TranCep}_j(K) = \mathrm{TranCep}_{j-1}(K)(1-\alpha_2) + \mathrm{Sc}(K)\alpha_2 \qquad (1.7)$$
Here TranCep(K) is the estimate of the transmission-channel log cepstrum and j is the frame index. $\alpha_2$ is also a smoothing factor, but its value differs from that of $\alpha_1$.
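Below is a sketch of the per-frame choice between (1.5) and (1.7). The `is_speech` flag is hypothetical: it stands for the output of a speech/non-speech detector, which the description does not specify. The smoothing factors are illustrative values.

```python
from typing import List, Sequence


def update_channel_estimate(tran_cep: Sequence[float], sc: Sequence[float],
                            ref_cep: Sequence[float], is_speech: bool,
                            alpha1: float, alpha2: float) -> List[float]:
    """One frame of TranCep(K): formula (1.5) in speech, formula (1.7) in non-speech."""
    if is_speech:
        return [t * (1.0 - alpha1) + (s - r) * alpha1
                for t, s, r in zip(tran_cep, sc, ref_cep)]
    # Non-speech: track the observed cepstrum directly, without subtracting RefCep(K).
    return [t * (1.0 - alpha2) + s * alpha2 for t, s in zip(tran_cep, sc)]


speech = update_channel_estimate([0.0], [2.0], [0.5], True, alpha1=0.1, alpha2=0.2)
silence = update_channel_estimate([0.0], [2.0], [0.5], False, alpha1=0.1, alpha2=0.2)
print(speech, silence)
```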
In the prior art, many channel-disturbance elimination methods do not handle non-speech segments. Examples are LMS (Least Mean Squares) blind equalization in the frequency domain and LMS blind equalization in the cepstrum domain. Both use the LMS algorithm to minimize the error between the observed speech features and reference speech features, so that the equalized speech feature parameters converge. The first method works in the spectral domain and the second in the cepstrum domain; LMS blind equalization in the cepstrum domain needs less computation and converges better. In non-speech segments, however, blind equalization algorithms may converge to the wrong result, which harms speech recognition feature extraction. To address this, the present embodiment treats speech and non-speech segments differently, with a separate channel log cepstrum estimation formula for each. The channel log cepstrum of non-speech segments is then estimated more accurately, which further improves robustness to channel disturbance.
Preferably, this embodiment may also use the signal-to-noise ratio (SNR) of the observed speech signal to combine formulas (1.5) and (1.7), which compute TranCep(K) with $\alpha_1$ and $\alpha_2$ respectively, as follows:
$$\mathrm{TranCep}_j(K) = \mathrm{TranCep}_{j-1}(K)(1-\alpha_3) + (\mathrm{Sc}(K)-\mathrm{RefCep}(K))\beta_1 + \mathrm{Sc}(K)\beta_2 \qquad (1.8)$$
Here $\alpha_3$ is also a smoothing factor. It equals $\alpha_1$ in speech segments and $\alpha_2$ in non-speech segments.
$\beta_1 + \beta_2 = \alpha_3$, and $\beta_1$ and $\beta_2$ are determined by the SNR. The SNR is the ratio of the original signal to the noise introduced by the equipment itself, environmental interference and similar causes. It is usually written "SNR" or "S/N" and expressed in decibels (dB); a higher SNR means a cleaner signal. $\beta_1$ and $\beta_2$ satisfy $\beta_1 \gg \beta_2$ when the SNR is high and $\beta_1 \ll \beta_2$ when the SNR is low, as detailed in the following table:
SNR (dB)   20        15       10       5        0        -5       -10
β1         100%·α3   90%·α3   80%·α3   70%·α3   50%·α3   20%·α3   0
β2         0         10%·α3   20%·α3   30%·α3   50%·α3   80%·α3   100%·α3
Table 1
In summary, the channel-free log cepstrum Xc(K) is computed as:
$$\mathrm{Xc}(K) = \mathrm{Sc}(K) - \mathrm{TranCep}(K) \qquad (1.6)$$
$$\mathrm{TranCep}_j(K) = \mathrm{TranCep}_{j-1}(K)(1-\alpha_3) + (\mathrm{Sc}(K)-\mathrm{RefCep}(K))\beta_1 + \mathrm{Sc}(K)\beta_2 \qquad (1.8)$$
According to Table 1, $\alpha_3 = \alpha_1$ if SNR ≥ 0 dB and $\alpha_3 = \alpha_2$ otherwise. In other words, an SNR of 0 dB is the boundary between speech and non-speech segments.
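Below is a sketch of Table 1 and formula (1.8). The table rows come from Table 1. Two behaviours are assumptions, since the description lists only seven SNR points:
- linear interpolation between the tabulated points;
- clamping outside the -10 to 20 dB range.

The helper names `beta_weights` and `combined_update` are hypothetical.

```python
import bisect
from typing import List, Sequence, Tuple

# Table 1: beta1 as a fraction of alpha3 at each tabulated SNR (dB), in ascending SNR order.
SNR_POINTS_DB = [-10.0, -5.0, 0.0, 5.0, 10.0, 15.0, 20.0]
BETA1_FRACTION = [0.0, 0.2, 0.5, 0.7, 0.8, 0.9, 1.0]


def beta_weights(snr_db: float, alpha3: float) -> Tuple[float, float]:
    """Return (beta1, beta2) with beta1 + beta2 = alpha3; linear between table points (assumed)."""
    if snr_db <= SNR_POINTS_DB[0]:
        frac = BETA1_FRACTION[0]
    elif snr_db >= SNR_POINTS_DB[-1]:
        frac = BETA1_FRACTION[-1]
    else:
        i = bisect.bisect_right(SNR_POINTS_DB, snr_db)
        x0, x1 = SNR_POINTS_DB[i - 1], SNR_POINTS_DB[i]
        y0, y1 = BETA1_FRACTION[i - 1], BETA1_FRACTION[i]
        frac = y0 + (y1 - y0) * (snr_db - x0) / (x1 - x0)
    beta1 = frac * alpha3
    return beta1, alpha3 - beta1


def combined_update(tran_cep: Sequence[float], sc: Sequence[float], ref_cep: Sequence[float],
                    snr_db: float, alpha1: float, alpha2: float) -> List[float]:
    """One frame of formula (1.8); alpha3 = alpha1 if SNR >= 0 dB, else alpha2."""
    alpha3 = alpha1 if snr_db >= 0.0 else alpha2
    beta1, beta2 = beta_weights(snr_db, alpha3)
    return [t * (1.0 - alpha3) + (s - r) * beta1 + s * beta2
            for t, s, r in zip(tran_cep, sc, ref_cep)]


# At 20 dB this reduces to formula (1.5); at -10 dB it reduces to formula (1.7).
print(combined_update([0.3], [1.0], [0.4], 20.0, alpha1=0.05, alpha2=0.2))
print(combined_update([0.3], [1.0], [0.4], -10.0, alpha1=0.05, alpha2=0.2))
```

The two printed cases show how (1.8) reduces to the earlier formulas:
- At high SNR, β1 = α3 and β2 = 0, so (1.8) is exactly (1.5).
- At low SNR, β1 = 0 and β2 = α3, so it is exactly (1.7).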
Embodiment 3:
In the computation above, the statistical mean RefCep(K) of the channel-free log cepstrum is a constant obtained from prior statistics, so it represents only a general average. This embodiment brings the value closer to each speaker's individual characteristics. In every frame, it updates RefCep(K) with the channel-free log cepstrum Xc(K) just computed for that frame, as follows:
$$\mathrm{RefCep}_{j+1}(K) = \mathrm{RefCep}_j(K)(1-\gamma) + \mathrm{Xc}(K)\gamma \qquad (1.9)$$
Here γ is a small positive quantity with $\gamma < \alpha_3$.
For each speaker's speech signal, RefCep(K) starts from the constant obtained by statistics. After Xc(K) of the current frame has been computed, it updates RefCep(K) according to formula (1.9), and the updated RefCep(K) is used for the next frame. Different speakers therefore end up with different values of RefCep(K), each closer to that speaker's individual characteristics, which can improve the recognition rate.
In practice, for better results, RefCep(K) may be updated only after the channel estimate TranCep(K) has converged and while the SNR is high.
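Below is a sketch of formula (1.9) together with the gating suggested above. Two choices are hypothetical; the description says only to update after convergence and at high SNR:
- the convergence test: the largest per-coefficient change of TranCep(K) between consecutive frames must be below `converge_tol`;
- the `min_snr_db` threshold.

```python
from typing import List, Sequence


def update_ref_cep(ref_cep: Sequence[float], xc: Sequence[float], gamma: float) -> List[float]:
    """Formula (1.9): RefCep_{j+1}(K) = RefCep_j(K)(1 - gamma) + Xc(K) gamma."""
    return [r * (1.0 - gamma) + x * gamma for r, x in zip(ref_cep, xc)]


def maybe_update_ref_cep(ref_cep: Sequence[float], xc: Sequence[float], gamma: float,
                         alpha3: float, tran_prev: Sequence[float], tran_cur: Sequence[float],
                         snr_db: float, converge_tol: float = 1e-3,
                         min_snr_db: float = 15.0) -> List[float]:
    """Apply (1.9) only when TranCep(K) looks converged and the SNR is high.

    The convergence test and min_snr_db are hypothetical, not taken from the patent.
    """
    if not 0.0 < gamma < alpha3:
        raise ValueError("formula (1.9) requires 0 < gamma < alpha3")
    converged = max(abs(c - p) for c, p in zip(tran_cur, tran_prev)) < converge_tol
    if converged and snr_db >= min_snr_db:
        return update_ref_cep(ref_cep, xc, gamma)
    return list(ref_cep)


ref = [1.0, -0.5]
xc = [3.0, 0.5]
print(maybe_update_ref_cep(ref, xc, 0.001, 0.02, [0.20, 0.10], [0.2001, 0.10], 20.0))
print(maybe_update_ref_cep(ref, xc, 0.001, 0.02, [0.20, 0.10], [0.25, 0.10], 20.0))
```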
Fig. 2 shows how the three embodiments above combine into the invention's method for extracting speech recognition features.
S201: Perform speech enhancement on the observed speech signal s(n) to obtain the enhanced speech signal s'(n).
This is a preprocessing step. Speech enhancement aims to extract speech that is as clean as possible from the noisy speech signal. Many enhancement algorithms are in common use, such as spectral subtraction and Wiener filtering, and this embodiment does not describe them in detail.
S202: Extract MFCC coefficients from the enhanced speech signal s'(n) to obtain the log cepstrum Sc(K) of the observed speech.
S203: Remove channel disturbance using the long-term mean RefCep(K) of the speech log cepstral features and the SNR of the observed signal, which gives the equalized cepstral features of the speech.
Here RefCep(K) is the statistical mean of the channel-free log cepstrum described above. The equalized cepstral features are the extracted speech recognition features, which the subsequent pattern matching stage uses for recognition.
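The class below sketches S203 by chaining (1.8), (1.6) and (1.9) for one stream of MFCC frames. It rests on several assumptions:
- the caller supplies the per-frame SNR (the description does not say how it is estimated);
- the smoothing factors and γ are illustrative values;
- Table 1 is interpolated linearly;
- RefCep(K) is updated only when the frame SNR is at least 15 dB, as a stand-in for the "high SNR" condition;
- the enhancement (S201) and MFCC (S202) steps have already run.

```python
import bisect
from typing import List, Sequence

# Table 1: beta1 as a fraction of alpha3 (linear interpolation between rows is an assumption).
_SNR_DB = [-10.0, -5.0, 0.0, 5.0, 10.0, 15.0, 20.0]
_BETA1_FRAC = [0.0, 0.2, 0.5, 0.7, 0.8, 0.9, 1.0]


def _beta1_fraction(snr_db: float) -> float:
    if snr_db <= _SNR_DB[0]:
        return _BETA1_FRAC[0]
    if snr_db >= _SNR_DB[-1]:
        return _BETA1_FRAC[-1]
    i = bisect.bisect_right(_SNR_DB, snr_db)
    x0, x1 = _SNR_DB[i - 1], _SNR_DB[i]
    y0, y1 = _BETA1_FRAC[i - 1], _BETA1_FRAC[i]
    return y0 + (y1 - y0) * (snr_db - x0) / (x1 - x0)


class ChannelCompensator:
    """Step S203 for one stream of MFCC frames: formulas (1.8), (1.6) and (1.9)."""

    def __init__(self, ref_cep: Sequence[float], alpha1: float = 0.02, alpha2: float = 0.1,
                 gamma: float = 0.001, min_update_snr_db: float = 15.0) -> None:
        if not 0.0 < gamma < min(alpha1, alpha2):
            raise ValueError("formula (1.9) requires gamma < alpha3")
        self.ref_cep = list(ref_cep)               # RefCep(K), initialised from prior statistics
        self.tran_cep = [0.0] * len(self.ref_cep)  # TranCep(K), initial value 0
        self.alpha1, self.alpha2, self.gamma = alpha1, alpha2, gamma
        self.min_update_snr_db = min_update_snr_db

    def process(self, sc: Sequence[float], snr_db: float) -> List[float]:
        """Return the equalized cepstrum Xc(K) for one frame Sc(K) with frame SNR snr_db."""
        alpha3 = self.alpha1 if snr_db >= 0.0 else self.alpha2
        beta1 = _beta1_fraction(snr_db) * alpha3
        beta2 = alpha3 - beta1
        self.tran_cep = [t * (1.0 - alpha3) + (s - r) * beta1 + s * beta2  # (1.8)
                         for t, s, r in zip(self.tran_cep, sc, self.ref_cep)]
        xc = [s - t for s, t in zip(sc, self.tran_cep)]                    # (1.6)
        if snr_db >= self.min_update_snr_db:                               # gate is an assumption
            g = self.gamma
            self.ref_cep = [r * (1.0 - g) + x * g for r, x in zip(self.ref_cep, xc)]  # (1.9)
        return xc


comp = ChannelCompensator(ref_cep=[0.0, 0.0])
channel = [0.5, -0.2]
for j in range(3000):
    clean = [1.0, -1.0] if j % 2 == 0 else [-1.0, 1.0]
    xc = comp.process([x + c for x, c in zip(clean, channel)], snr_db=20.0)
print([round(v, 3) for v in comp.tran_cep], [round(v, 3) for v in xc])
```

In this simulation, TranCep(K) ends slightly below the true offset. RefCep(K) absorbs a small part of the offset while the estimate converges, and keeping γ much smaller than α3 limits this drift.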
The performance of the invention's channel-disturbance elimination method is illustrated below by a test example.

Test setup:
- Recognition tool: the HTK toolkit.
- Feature parameters: standard MFCCs plus their first- and second-order derivatives.
- Test sequences: three groups, A, B and C. Each group has 50 digit strings of 8 digits, i.e. 400 digits per group.
- Group A was collected over the same channel as the training data.
- Group B was collected over a different channel at relatively high SNR.
- Group C was collected over a different channel at relatively low SNR.

Five test conditions were used:
1. no channel-disturbance elimination;
2. the existing LMS blind equalization algorithm;
3. the example of the invention formed by formulas (1.5) and (1.6);
4. the example of the invention formed by formulas (1.6) and (1.8);
5. the example of the invention formed by formula (1.9).
Speech recognition tests were run on sequence groups A, B and C under each of the five conditions. The results are shown in the table below; the error-rate reduction is measured relative to condition 1.
[Recognition results table (image not reproduced)]
Table 2
The table shows that the channel-disturbance elimination method of the invention clearly improves recognition on test sequences collected over channels different from the training channel. Compared with the existing method, it also lowers the error rate further.
The invention also provides device embodiments corresponding to the method embodiments above.
Fig. 3 is a structural diagram of the speech signal processing device of this embodiment. The device mainly comprises:
Cepstral coefficient extraction unit U31, configured to extract cepstral coefficients in the log cepstrum domain from the currently observed speech signal to obtain the log cepstrum of the observed speech;
Channel log cepstrum estimation unit U32, configured to approximate the mean of the channel log cepstrum by low-pass filtering, based on the statistical mean of the channel-free log cepstrum, to obtain an estimate of the transmission-channel log cepstrum;
Interference separation unit U33, configured to subtract the estimate of the transmission-channel log cepstrum from the log cepstrum of the observed speech to obtain the current channel-free log cepstrum. This log cepstrum is the result of separating the speech signal from the channel disturbance.
Wherein, with reference to Fig. 4, described channel logarithm cepstrum evaluation unit U32 may further include:
Mean value computation subelement U321 is used for calculating E[Tc (K)]=E[Sc (K)-RefCep (K)];
Wherein, the logarithm cepstrum of Tc (K) expression transmission channel, the logarithm cepstrum of Sc (K) expression observation voice; E[X] expression calculates the average statistical of X; RefCep (K) represents the not average statistical of the voice signal logarithm cepstrum of channel, and K is cepstrum parameter;
First estimation subunit U322, configured to approximate $E[T_c(K)]$, when speech is present on the transmission channel, by low-pass filtering the above formula, obtaining
$\mathrm{TranCep}(K)_j = \mathrm{TranCep}(K)_{j-1}(1-\alpha_1) + \left(S_c(K) - \mathrm{RefCep}(K)\right)\alpha_1$
where $\mathrm{TranCep}(K)$ denotes the estimate of the transmission-channel log cepstrum, $j$ is the frame index, and $\alpha_1$ is a smoothing factor.
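The recursion above is a first-order IIR low-pass filter (an exponential moving average) applied to $S_c(K) - \mathrm{RefCep}(K)$. A minimal sketch of one update step, with a toy loop showing that the estimate converges to a constant channel; the function name, `alpha1 = 0.05` and the synthetic data are illustrative assumptions:

```python
import numpy as np

def update_channel_speech(tran_prev, sc, ref_cep, alpha1):
    """One speech-frame step (sketch):
    TranCep_j = TranCep_{j-1} * (1 - alpha1) + (Sc - RefCep) * alpha1,
    i.e. a one-pole low-pass filter on Sc - RefCep.
    """
    return tran_prev * (1.0 - alpha1) + (sc - ref_cep) * alpha1

K = 13
channel = np.linspace(0.5, -0.5, K)   # hypothetical true channel cepstrum
ref_cep = np.zeros(K)                  # hypothetical reference mean
tran = np.zeros(K)                     # initial channel estimate
for _ in range(200):
    sc = ref_cep + channel             # noiseless observation: Sc - RefCep == channel
    tran = update_channel_speech(tran, sc, ref_cep, alpha1=0.05)
print(np.max(np.abs(tran - channel)) < 1e-3)  # True
```

A larger $\alpha_1$ tracks channel changes faster but averages over fewer frames (the effective memory is roughly $1/\alpha_1$ frames), so the estimate is noisier.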
Preferably, as shown in Fig. 5, the channel log-cepstrum estimation unit U32 may further include:
Second estimation subunit U323, configured to approximate $E[T_c(K)]$, when no speech is present on the transmission channel, by low-pass filtering the above formula for $E[T_c(K)]$, obtaining $\mathrm{TranCep}(K)_j = \mathrm{TranCep}(K)_{j-1}(1-\alpha_2) + S_c(K)\,\alpha_2$, where $\alpha_2$ is a smoothing factor whose value differs from that of $\alpha_1$.
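Choosing between the two recursions requires a per-frame speech/non-speech decision, which the text does not specify. The sketch below assumes a boolean `is_speech` flag supplied by some external voice activity detector; the function name and the smoothing values in the example are hypothetical:

```python
import numpy as np

def update_channel(tran_prev, sc, ref_cep, is_speech, alpha1, alpha2):
    """Pick the channel recursion from an (assumed) external speech flag."""
    if is_speech:
        # Speech present: low-pass filter Sc - RefCep with alpha1.
        return tran_prev * (1.0 - alpha1) + (sc - ref_cep) * alpha1
    # No speech: low-pass filter Sc directly with alpha2.
    return tran_prev * (1.0 - alpha2) + sc * alpha2

tran = np.zeros(3)
sc = np.array([1.0, 2.0, 3.0])
ref_cep = np.full(3, 0.5)
speech_step = update_channel(tran, sc, ref_cep, True, alpha1=0.1, alpha2=0.02)
silence_step = update_channel(tran, sc, ref_cep, False, alpha1=0.1, alpha2=0.02)
print(speech_step, silence_step)
```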
Preferably, the channel log-cepstrum estimation unit U32 may further include:
Combined estimation subunit, configured to use the signal-to-noise ratio (SNR) of the observed speech signal to merge the two formulas that compute $\mathrm{TranCep}(K)$ with $\alpha_1$ and $\alpha_2$ into:
$\mathrm{TranCep}(K)_j = \mathrm{TranCep}(K)_{j-1}(1-\alpha_3) + \left(S_c(K) - \mathrm{RefCep}(K)\right)\beta_1 + S_c(K)\,\beta_2$
where $\beta_1 + \beta_2 = \alpha_3$, $\beta_1$ and $\beta_2$ are determined according to the SNR, and $\alpha_3$ is a smoothing factor equal to $\alpha_1$ in speech segments and to $\alpha_2$ in non-speech segments.
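The text states only that $\beta_1$ and $\beta_2$ depend on the SNR and sum to $\alpha_3$. The sketch below assumes one possible rule: a linear ramp that maps frame SNR between 0 dB and 20 dB onto a weight $w \in [0, 1]$, with $\beta_1 = w\alpha_3$ and $\beta_2 = (1-w)\alpha_3$. The ramp limits and function names are hypothetical. At the extremes the combined formula reduces to the speech-only and non-speech-only recursions.

```python
import numpy as np

def snr_weight(snr_db, lo_db=0.0, hi_db=20.0):
    """Hypothetical mapping of frame SNR (dB) to a weight w in [0, 1]."""
    return float(np.clip((snr_db - lo_db) / (hi_db - lo_db), 0.0, 1.0))

def update_channel_combined(tran_prev, sc, ref_cep, snr_db, is_speech, alpha1, alpha2):
    """TranCep_j = TranCep_{j-1}(1 - a3) + (Sc - RefCep) b1 + Sc b2, with b1 + b2 = a3."""
    alpha3 = alpha1 if is_speech else alpha2   # a3 = a1 in speech, a2 in non-speech
    w = snr_weight(snr_db)
    beta1 = w * alpha3                          # high SNR: favour the speech formula
    beta2 = (1.0 - w) * alpha3                  # low SNR: favour the non-speech formula
    return tran_prev * (1.0 - alpha3) + (sc - ref_cep) * beta1 + sc * beta2

tran = np.array([0.1, -0.2])
sc = np.array([1.0, 0.5])
ref = np.array([0.3, 0.3])
high = update_channel_combined(tran, sc, ref, snr_db=30.0, is_speech=True, alpha1=0.1, alpha2=0.02)
low = update_channel_combined(tran, sc, ref, snr_db=-5.0, is_speech=False, alpha1=0.1, alpha2=0.02)
```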
Preferably, as shown in Fig. 6, the device may further include:
Updating unit U34, configured to update the statistical mean $\mathrm{RefCep}(K)$ of the channel-free speech log cepstrum with the current channel-free speech log cepstrum $X_c(K)$, according to $\mathrm{RefCep}(K)_{j+1} = \mathrm{RefCep}(K)_j(1-\gamma) + X_c(K)\,\gamma$, where $\gamma < \alpha_3$ and $\alpha_3$ is a smoothing factor equal to $\alpha_1$ in speech segments and to $\alpha_2$ in non-speech segments.
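Putting the pieces together, a per-frame loop might estimate the channel, subtract it, and then slowly adapt $\mathrm{RefCep}(K)$ from the compensated output. This is a sketch under several assumptions the text does not fix: the speech flags `vad` come from an external detector, the channel estimate is updated before it is subtracted, $\mathrm{RefCep}(K)$ is updated on every frame, and the default smoothing values are placeholders. Requiring $\gamma < \min(\alpha_1, \alpha_2)$ guarantees $\gamma < \alpha_3$ on every frame.

```python
import numpy as np

def process_frames(sc_frames, vad, ref_cep0, alpha1=0.05, alpha2=0.02, gamma=0.005):
    """Per-frame channel removal with a slowly adapting reference mean (sketch).

    sc_frames: (T, K) observed log cepstra; vad: length-T speech flags (assumed).
    Returns the channel-compensated cepstra Xc, shape (T, K).
    """
    if not gamma < min(alpha1, alpha2):
        raise ValueError("gamma must be smaller than the channel smoothing factors")
    ref_cep = np.array(ref_cep0, dtype=float)
    tran = np.zeros_like(ref_cep)                  # channel estimate TranCep(K)
    out = np.empty_like(sc_frames, dtype=float)
    for j, (sc, is_speech) in enumerate(zip(sc_frames, vad)):
        alpha3 = alpha1 if is_speech else alpha2
        if is_speech:
            tran = tran * (1 - alpha3) + (sc - ref_cep) * alpha3
        else:
            tran = tran * (1 - alpha3) + sc * alpha3
        xc = sc - tran                                   # channel-free log cepstrum Xc(K)
        ref_cep = ref_cep * (1 - gamma) + xc * gamma     # RefCep(K)_{j+1}
        out[j] = xc
    return out

rng = np.random.default_rng(2)
frames = rng.normal(size=(50, 13)) + 0.3   # synthetic cepstra with a constant offset
vad = [True] * 40 + [False] * 10
xc = process_frames(frames, vad, ref_cep0=np.zeros(13))
print(xc.shape)  # (50, 13)
```

Keeping $\gamma$ smaller than $\alpha_3$ presumably makes the reference mean adapt more slowly than the channel estimate, so the two recursions do not chase each other.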
The embodiments in this specification are described progressively: each embodiment focuses on its differences from the others, and the embodiments may be cross-referenced for the parts they share. Because the device embodiment is essentially similar to the method embodiment, it is described relatively briefly; for related details, refer to the description of the method embodiment.
The method and device provided by the present invention for eliminating the influence of the transmission channel on speech signals have been described in detail above. Specific examples have been used herein to set forth the principle and embodiments of the invention; the description of the above embodiments is intended only to help understand the method of the invention and its core idea. Meanwhile, those of ordinary skill in the art may vary the specific embodiments and the scope of application according to the idea of the invention. In summary, this description should not be construed as limiting the invention.

Claims (10)

1. A speech signal processing method, characterized by comprising:
Extracting, in the log-cepstrum domain, cepstral coefficients of the currently observed speech signal to obtain the log cepstrum of the observed speech;
Obtaining an estimate of the log cepstrum of the transmission channel from the statistical mean of the log cepstrum of the channel-free speech signal, the mean being approximated by low-pass filtering;
Subtracting the estimate of the transmission-channel log cepstrum from the log cepstrum of the observed speech to obtain the log cepstrum of the current channel-free speech signal, this channel-free log cepstrum being the result of separating the speech signal from the channel interference.
2. The method according to claim 1, characterized in that obtaining the estimate of the transmission-channel log cepstrum from the statistical mean of the channel-free speech log cepstrum, the mean being approximated by low-pass filtering, specifically comprises:
Computing $E[T_c(K)] = E[S_c(K) - \mathrm{RefCep}(K)]$;
where $T_c(K)$ denotes the log cepstrum of the transmission channel, $S_c(K)$ the log cepstrum of the observed speech, $E[X]$ the statistical mean of $X$, $\mathrm{RefCep}(K)$ the statistical mean of the log cepstrum of the channel-free speech signal, and $K$ the cepstral parameter;
When speech is present on the transmission channel, approximating $E[T_c(K)]$ by low-pass filtering the above formula, obtaining $\mathrm{TranCep}(K)_j = \mathrm{TranCep}(K)_{j-1}(1-\alpha_1) + \left(S_c(K) - \mathrm{RefCep}(K)\right)\alpha_1$;
where $\mathrm{TranCep}(K)$ denotes the estimate of the transmission-channel log cepstrum, $j$ is the frame index, and $\alpha_1$ is a smoothing factor.
3. The method according to claim 2, characterized by further comprising:
When no speech is present on the transmission channel, approximating $E[T_c(K)]$ by low-pass filtering the above formula for $E[T_c(K)]$, obtaining $\mathrm{TranCep}(K)_j = \mathrm{TranCep}(K)_{j-1}(1-\alpha_2) + S_c(K)\,\alpha_2$;
where $\alpha_1$ and $\alpha_2$ take different values, and $\alpha_2$ is a smoothing factor.
4. The method according to claim 3, characterized by further comprising:
Using the signal-to-noise ratio (SNR) of the observed speech signal to merge the two formulas that compute $\mathrm{TranCep}(K)$ with $\alpha_1$ and $\alpha_2$ into:
$\mathrm{TranCep}(K)_j = \mathrm{TranCep}(K)_{j-1}(1-\alpha_3) + \left(S_c(K) - \mathrm{RefCep}(K)\right)\beta_1 + S_c(K)\,\beta_2$
where $\beta_1 + \beta_2 = \alpha_3$, $\beta_1$ and $\beta_2$ are determined according to the SNR, and $\alpha_3$ is a smoothing factor equal to $\alpha_1$ in speech segments and to $\alpha_2$ in non-speech segments.
5. The method according to claim 4, characterized by further comprising:
Updating the statistical mean $\mathrm{RefCep}(K)$ of the channel-free speech log cepstrum with the current channel-free speech log cepstrum $X_c(K)$, according to $\mathrm{RefCep}(K)_{j+1} = \mathrm{RefCep}(K)_j(1-\gamma) + X_c(K)\,\gamma$;
where $\gamma < \alpha_3$, and $\alpha_3$ is a smoothing factor equal to $\alpha_1$ in speech segments and to $\alpha_2$ in non-speech segments.
6. A speech signal processing device, characterized by comprising:
A cepstrum coefficient extraction unit, configured to extract, in the log-cepstrum domain, cepstral coefficients of the currently observed speech signal to obtain the log cepstrum of the observed speech;
A channel log-cepstrum estimation unit, configured to obtain an estimate of the log cepstrum of the transmission channel from the statistical mean of the log cepstrum of the channel-free speech signal, the mean being approximated by low-pass filtering;
An interference separation unit, configured to subtract the estimate of the transmission-channel log cepstrum from the log cepstrum of the observed speech to obtain the log cepstrum of the current channel-free speech signal, this channel-free log cepstrum being the result of separating the speech signal from the channel interference.
7. The device according to claim 6, characterized in that the channel log-cepstrum estimation unit comprises:
A mean computation subunit, configured to compute $E[T_c(K)] = E[S_c(K) - \mathrm{RefCep}(K)]$;
where $T_c(K)$ denotes the log cepstrum of the transmission channel, $S_c(K)$ the log cepstrum of the observed speech, $E[X]$ the statistical mean of $X$, $\mathrm{RefCep}(K)$ the statistical mean of the log cepstrum of the channel-free speech signal, and $K$ the cepstral parameter;
A first estimation subunit, configured to approximate $E[T_c(K)]$, when speech is present on the transmission channel, by low-pass filtering the above formula, obtaining
$\mathrm{TranCep}(K)_j = \mathrm{TranCep}(K)_{j-1}(1-\alpha_1) + \left(S_c(K) - \mathrm{RefCep}(K)\right)\alpha_1$
where $\mathrm{TranCep}(K)$ denotes the estimate of the transmission-channel log cepstrum, $j$ is the frame index, and $\alpha_1$ is a smoothing factor.
8. The device according to claim 7, characterized in that the channel log-cepstrum estimation unit further comprises:
A second estimation subunit, configured to approximate $E[T_c(K)]$, when no speech is present on the transmission channel, by low-pass filtering the above formula for $E[T_c(K)]$, obtaining $\mathrm{TranCep}(K)_j = \mathrm{TranCep}(K)_{j-1}(1-\alpha_2) + S_c(K)\,\alpha_2$, where $\alpha_1$ and $\alpha_2$ take different values, and $\alpha_2$ is a smoothing factor.
9. The device according to claim 8, characterized in that the channel log-cepstrum estimation unit further comprises:
A combined estimation subunit, configured to use the signal-to-noise ratio (SNR) of the observed speech signal to merge the two formulas that compute $\mathrm{TranCep}(K)$ with $\alpha_1$ and $\alpha_2$ into:
$\mathrm{TranCep}(K)_j = \mathrm{TranCep}(K)_{j-1}(1-\alpha_3) + \left(S_c(K) - \mathrm{RefCep}(K)\right)\beta_1 + S_c(K)\,\beta_2$
where $\beta_1 + \beta_2 = \alpha_3$, $\beta_1$ and $\beta_2$ are determined according to the SNR, and $\alpha_3$ is a smoothing factor equal to $\alpha_1$ in speech segments and to $\alpha_2$ in non-speech segments.
10. The device according to claim 9, characterized in that the device further comprises:
An updating unit, configured to update the statistical mean $\mathrm{RefCep}(K)$ of the channel-free speech log cepstrum with the current channel-free speech log cepstrum $X_c(K)$, according to $\mathrm{RefCep}(K)_{j+1} = \mathrm{RefCep}(K)_j(1-\gamma) + X_c(K)\,\gamma$, where $\gamma < \alpha_3$, and $\alpha_3$ is a smoothing factor equal to $\alpha_1$ in speech segments and to $\alpha_2$ in non-speech segments.
CN2009100783316A 2009-02-25 2009-02-25 Method for processing voice signal and device Active CN101533642B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009100783316A CN101533642B (en) 2009-02-25 2009-02-25 Method for processing voice signal and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2009100783316A CN101533642B (en) 2009-02-25 2009-02-25 Method for processing voice signal and device

Publications (2)

Publication Number Publication Date
CN101533642A CN101533642A (en) 2009-09-16
CN101533642B CN101533642B (en) 2013-02-13

Family

ID=41104196

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009100783316A Active CN101533642B (en) 2009-02-25 2009-02-25 Method for processing voice signal and device

Country Status (1)

Country Link
CN (1) CN101533642B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2529371A4 (en) * 2010-01-29 2014-04-23 Circular Logic Llc Method and apparatus for canonical nonlinear analysis of audio signals
CN108848044A (en) * 2018-06-25 2018-11-20 电子科技大学 A kind of extracting method of channel fine feature
CN109599118A (en) * 2019-01-24 2019-04-09 宁波大学 A kind of voice playback detection method of robustness
CN113077787A (en) * 2020-12-22 2021-07-06 珠海市杰理科技股份有限公司 Voice data identification method, device, chip and readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1397929A (en) * 2002-07-12 2003-02-19 清华大学 Speech intensifying-characteristic weighing-logrithmic spectrum addition method for anti-noise speech recognization
CN1431650A (en) * 2003-02-21 2003-07-23 清华大学 Antinoise voice recognition method based on weighted local energy
CN1679083A (en) * 2002-08-30 2005-10-05 西门子共同研究公司 Multichannel voice detection in adverse environments

Also Published As

Publication number Publication date
CN101533642A (en) 2009-09-16

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20171221

Address after: 100083 Haidian District, Xueyuan Road, No. 35, the world building, the second floor of the building on the ground floor, No. 16

Patentee after: Zhongxing Technology Co., Ltd.

Address before: 100083 Haidian District, Xueyuan Road, No. 35, the world building, the second floor of the building on the ground floor, No. 16

Patentee before: Beijing Vimicro Corporation

CP01 Change in the name or title of a patent holder

Address after: 100083 Haidian District, Xueyuan Road, No. 35, the world building, the second floor of the building on the ground floor, No. 16

Patentee after: Mid Star Technology Limited by Share Ltd

Address before: 100083 Haidian District, Xueyuan Road, No. 35, the world building, the second floor of the building on the ground floor, No. 16

Patentee before: Zhongxing Technology Co., Ltd.