CN101533642B

CN101533642B - Method for processing voice signal and device

Info

Publication number: CN101533642B
Application number: CN2009100783316A
Authority: CN
Inventors: 张晨; 冯宇红
Original assignee: Vimicro Corp
Current assignee: Mid Star Technology Ltd By Share Ltd
Priority date: 2009-02-25
Filing date: 2009-02-25
Publication date: 2013-02-13
Anticipated expiration: 2029-02-25
Also published as: CN101533642A

Abstract

The invention provides a method and a device for processing speech signals, aimed at solving the problem of channel interference in speech signals. The method comprises: in the cepstral domain, extracting cepstral coefficients from the currently observed speech signal to obtain the cepstrum of the observed speech; based on the statistical mean of the cepstrum of speech that has not passed through the channel, using low-pass filtering to approximate the mean and thereby obtain an estimate of the transmission channel cepstrum; and subtracting the estimate of the transmission channel cepstrum from the cepstrum of the observed speech to obtain the cepstrum of the current speech signal that has not passed through the channel, which is the result of separating the speech signal from the channel interference. The invention can eliminate channel interference in speech signals and improve resistance to transmission-channel interference during speech recognition feature extraction, thereby improving the recognition rate.

Description

A speech signal processing method and device

Technical field

The present invention relates to the field of speech recognition technology, and in particular to a speech signal processing method and device.

Background technology

After more than half a century of research worldwide, speech recognition technology has gradually begun to enter the practical stage. In recent years speech chips have been used more and more widely, mainly for: voice dialing in telephone communication; voice identity authentication; voice input; voice control in automobiles, industrial control, and the medical field; and voice interaction interfaces for personal digital assistants (PDAs), smart toys, household remote controls, and the like.

The speech recognition process mainly comprises preprocessing of the speech signal, extraction of speech recognition features, and pattern matching based on the extracted features. The most important step is feature extraction, and the extracted feature parameters must satisfy the following requirements: (1) they represent speech characteristics effectively and discriminate well; (2) the parameters of different orders are largely independent of one another; (3) they are easy to compute, preferably with an efficient algorithm, so that speech recognition can run in real time.

However, in current speech recognition systems, the transmission channel that carries the speech signal alters the characteristics of the signal, which degrades recognition performance. The problem appears to different degrees on different transmission channels. Therefore, to suppress or cancel the signal distortion introduced by the transmission channel, measures must be taken to eliminate channel interference.

Summary of the invention

The technical problem to be solved by the present invention is to provide a speech signal processing method and device that solve the problem of channel interference in speech signals.

To solve the above problem, the invention discloses a speech signal processing method, comprising:

in the log cepstral domain, extracting cepstral coefficients from the currently observed speech signal to obtain the log cepstrum of the observed speech;

based on the statistical mean of the log cepstrum of speech that has not passed through the channel, using low-pass filtering to approximate the mean and thereby obtain an estimate of the transmission channel log cepstrum;

subtracting the estimate of the transmission channel log cepstrum from the log cepstrum of the observed speech to obtain the log cepstrum of the current speech signal that has not passed through the channel; this log cepstrum is the result of separating the speech signal from the channel interference.

Wherein, the step of using low-pass filtering to approximate the mean, based on the statistical mean of the log cepstrum of speech that has not passed through the channel, to obtain the estimate of the transmission channel log cepstrum specifically comprises:

calculating E[Tc(K)] = E[Sc(K) - RefCep(K)], where Tc(K) denotes the log cepstrum of the transmission channel; Sc(K) denotes the log cepstrum of the observed speech; E[X] denotes the statistical mean of X; RefCep(K) denotes the statistical mean of the log cepstrum of speech that has not passed through the channel; and K is the cepstral coefficient index;

when a speech signal is present on the transmission channel, applying low-pass filtering to the above formula to approximate E[Tc(K)], giving TranCep(K)_j = TranCep(K)_{j-1}(1-α₁) + (Sc(K)-RefCep(K))α₁, where TranCep(K) denotes the estimate of the transmission channel log cepstrum, j is the frame number, and α₁ is a smoothing factor.

Preferably, the method further comprises: when no speech signal is present on the transmission channel, applying low-pass filtering to the above formula for E[Tc(K)] to approximate E[Tc(K)], giving TranCep(K)_j = TranCep(K)_{j-1}(1-α₂) + Sc(K)α₂, where α₂ is a smoothing factor whose value differs from that of α₁.

Preferably, the method further comprises: using the signal-to-noise ratio (SNR) of the observed speech signal to combine the two formulas that compute TranCep(K) with α₁ and α₂ into the following:

TranCep(K)_j = TranCep(K)_{j-1}(1-α₃) + (Sc(K)-RefCep(K))β₁ + Sc(K)β₂;

where β₁ + β₂ = α₃, and β₁ and β₂ are determined according to the signal-to-noise ratio.

Preferably, the method further comprises:

updating the statistical mean RefCep(K) of the log cepstrum of speech that has not passed through the channel, using the log cepstrum Xc(K) of the current speech signal that has not passed through the channel, according to the formula RefCep(K)_{j+1} = RefCep(K)_j(1-γ) + Xc(K)γ, where γ < α₃ and α₃ is a smoothing factor related to α₁ and α₂ as follows: in speech segments α₃ is α₁; in non-speech segments α₃ is α₂.

The present invention also provides a speech signal processing device, comprising:

a cepstral coefficient extraction unit, configured to extract cepstral coefficients from the currently observed speech signal in the log cepstral domain to obtain the log cepstrum of the observed speech;

a channel log cepstrum estimation unit, configured to use low-pass filtering to approximate the mean, based on the statistical mean of the log cepstrum of speech that has not passed through the channel, and thereby obtain an estimate of the transmission channel log cepstrum;

an interference separation unit, configured to subtract the estimate of the transmission channel log cepstrum from the log cepstrum of the observed speech to obtain the log cepstrum of the current speech signal that has not passed through the channel; this log cepstrum is the result of separating the speech signal from the channel interference.

Wherein, the channel log cepstrum estimation unit comprises:

a mean calculation subunit, configured to calculate E[Tc(K)] = E[Sc(K) - RefCep(K)], where Tc(K) denotes the log cepstrum of the transmission channel; Sc(K) denotes the log cepstrum of the observed speech; E[X] denotes the statistical mean of X; RefCep(K) denotes the statistical mean of the log cepstrum of speech that has not passed through the channel; and K is the cepstral coefficient index;

a first estimation subunit, configured to apply low-pass filtering to the above formula to approximate E[Tc(K)] when a speech signal is present on the transmission channel, giving

TranCep(K)_j = TranCep(K)_{j-1}(1-α₁) + (Sc(K)-RefCep(K))α₁;

where TranCep(K) denotes the estimate of the transmission channel log cepstrum, j is the frame number, and α₁ is a smoothing factor.

Preferably, the channel log cepstrum estimation unit further comprises: a second estimation subunit, configured to apply low-pass filtering to the above formula for E[Tc(K)] to approximate E[Tc(K)] when no speech signal is present on the transmission channel, giving TranCep(K)_j = TranCep(K)_{j-1}(1-α₂) + Sc(K)α₂, where α₂ is a smoothing factor whose value differs from that of α₁.

Preferably, the channel log cepstrum estimation unit further comprises:

a comprehensive estimation subunit, configured to use the signal-to-noise ratio (SNR) of the observed speech signal to combine the two formulas that compute TranCep(K) with α₁ and α₂ into the following:

TranCep(K)_j = TranCep(K)_{j-1}(1-α₃) + (Sc(K)-RefCep(K))β₁ + Sc(K)β₂;

Preferably, the device further comprises:

an updating unit, configured to update the statistical mean RefCep(K) of the log cepstrum of speech that has not passed through the channel, using the log cepstrum Xc(K) of the current speech signal that has not passed through the channel, according to the formula RefCep(K)_{j+1} = RefCep(K)_j(1-γ) + Xc(K)γ, where γ < α₃ and α₃ is a smoothing factor related to α₁ and α₂ as follows: in speech segments α₃ is α₁; in non-speech segments α₃ is α₂.

Compared with the prior art, the present invention has the following advantages:

First, while extracting speech recognition features, the present invention converts the observed speech signal into a log cepstrum and, based on the statistical mean of the log cepstrum of speech that has not passed through the channel, uses low-pass filtering to approximate the mean and so estimates the log cepstrum of the transmission channel. It then subtracts the estimate of the transmission channel log cepstrum from the log cepstrum of the observed speech, thereby separating the speech signal from the transmission channel interference in the cepstral domain and extracting the log cepstrum of the speech that has not passed through the channel. This method can eliminate the interference of the transmission channel on the speech signal and improve resistance to transmission-channel interference during speech recognition feature extraction, thereby improving the recognition rate.

Moreover, because the log cepstrum of the transmission channel is estimated by low-pass filtering, the approximate mean can be computed from the signal of the current frame and the preceding frames alone, which satisfies the requirement of extracting speech recognition features in real time.

Second, the method provided by the invention for estimating the transmission channel log cepstrum can process speech segments (where a speech signal is present on the transmission channel) and non-speech segments (where no speech signal is present on the transmission channel) differently, using a different estimation formula for each. The transmission channel log cepstrum in non-speech segments is therefore estimated more accurately, which further improves resistance to channel interference.

Third, to reflect the characteristics of the actual speaker, the present invention uses the log cepstrum of the current speech signal that has not passed through the channel, computed for every frame, to update the statistical mean of the log cepstrum of speech that has not passed through the channel (whose initial value is a constant), so that the statistical mean moves closer to the speaker's individual characteristics.

Description of drawings

Fig. 1 is a flowchart of the speech signal processing method of Embodiment one of the present invention;

Fig. 2 is a flowchart of speech recognition feature extraction according to an embodiment of the present invention;

Fig. 3 is a structural diagram of the speech signal processing device of a device embodiment of the present invention;

Fig. 4 is a structural diagram of the channel log cepstrum estimation unit U32 in Fig. 3;

Fig. 5 is another structural diagram of the channel log cepstrum estimation unit U32 in Fig. 3;

Fig. 6 is a structural diagram of the speech signal processing device of another device embodiment of the present invention.

Detailed description of the embodiments

To make the above objects, features, and advantages of the present invention easier to understand, the invention is described in further detail below with reference to the drawings and specific embodiments.

The invention provides a speech signal processing method suited to general transmission conditions, where a general channel satisfies the following: the channel is a convolutional channel; the channel characteristics are fairly stable and change slowly; and the cepstral features of the speech signal tend toward a constant in long-term statistics. For general transmission, therefore, the following relations hold:

Let x(n) be the speech signal that has not passed through the channel (that is, the speech signal under an ideally equalized channel) and t(n) the transmission channel. By the properties of a convolutional channel, the observed speech signal s(n) is:

s(n) = x(n) * t(n), where * denotes convolution (1.1)

In the frequency domain, formula (1.1) becomes:

S(i) = X(i)T(i)

In the log cepstral domain, formula (1.1) becomes:

Sc(K) = Xc(K) + Tc(K) (1.2)

That is, in the log cepstral domain, the log cepstrum Sc(K) of the observed speech equals the log cepstrum Xc(K) of the speech that has not passed through the channel plus the log cepstrum Tc(K) of the transmission channel, where K is the cepstral coefficient index.
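The step from formula (1.1) to formula (1.2) can be checked numerically. The following is a minimal sketch, assuming a plain real cepstrum (inverse FFT of the log magnitude spectrum) in place of the MFCC features used later in the text; the signals `x` and `t`, the FFT length, and the function name `real_cepstrum` are illustrative assumptions. With a Mel filter bank in the chain, the additive relation would hold only approximately.

```python
import numpy as np

def real_cepstrum(signal, n_fft):
    """Real cepstrum: inverse FFT of the log magnitude spectrum."""
    spectrum = np.fft.fft(signal, n_fft)
    return np.fft.ifft(np.log(np.abs(spectrum))).real

rng = np.random.default_rng(0)
x = rng.standard_normal(256)      # stand-in for the channel-free speech x(n)
t = np.array([1.0, 0.5, 0.25])    # stand-in for the channel impulse response t(n)
s = np.convolve(x, t)             # observed signal, formula (1.1)

n_fft = 512                       # at least len(s), so S(i) = X(i)T(i) holds exactly
xc = real_cepstrum(x, n_fft)
tc = real_cepstrum(t, n_fft)
sc = real_cepstrum(s, n_fft)

# Formula (1.2): convolution in time becomes addition in the log cepstral domain.
max_error = np.max(np.abs(sc - (xc + tc)))
print(max_error)
```

The printed difference should be at the level of floating-point rounding, which is what makes subtracting a channel estimate in the cepstral domain a reasonable way to undo a convolutional channel.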

The present invention uses formula (1.2): by processing the observed speech signal in the cepstral domain, it separates the speech signal from the transmission channel interference, thereby eliminating the interference of the transmission channel on the speech signal and extracting the log cepstrum of the speech that has not passed through the channel, i.e. the equalized cepstral features of the speech signal.

The principle by which the present invention eliminates channel interference is as follows:

From formula (1.2), the speech signal that has not passed through the channel (the equalized speech signal) is, in the log cepstral domain:

Xc(K) = Sc(K) - Tc(K) (1.3)

Sc(K) can be computed from the observed signal, so the key to extracting Xc(K) is estimating the log cepstrum Tc(K) of the transmission channel.

The method of eliminating channel interference is described in detail below through embodiments.

Embodiment one:

Referring to Fig. 1, which is a flowchart of the speech signal processing method of this embodiment.

S101: in the log cepstral domain, extract cepstral coefficients from the currently observed speech signal to obtain the log cepstrum Sc(K) of the observed speech;

Cepstral coefficient extraction is a standard step in speech recognition processing, and Mel-frequency cepstral coefficients (MFCC) are among the most commonly used feature parameters in speech recognition. MFCC models the auditory characteristics of the human ear and reflects how people perceive speech; it extracts the speaker's individual characteristics from the speaker's speech signal and has achieved high recognition rates in practical speech recognition applications.

This embodiment may use the standard MFCC extraction algorithm. The algorithm first uses an FFT (Fast Fourier Transform) to convert the time-domain signal to the frequency domain, then filters its log energy spectrum with a bank of triangular filters distributed on the Mel scale, and finally applies a discrete cosine transform (DCT) to the vector formed by the filter outputs and keeps the first N coefficients. Because this algorithm is well known, it is not described in detail here.
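For orientation, the MFCC chain described above can be sketched with NumPy alone. This is an illustrative sketch under stated assumptions, not the extraction code of the patent: the sample rate, FFT length, number of filters, the Hamming window, and the choice to keep coefficients 1 to N (dropping the zeroth) are all assumptions, and the log is taken after the Mel filter bank, which is the ordering most implementations use.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the Mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):       # rising edge of the triangle
            bank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):      # falling edge of the triangle
            bank[m - 1, k] = (right - k) / (right - center)
    return bank

def mfcc(frame, sample_rate=8000, n_fft=256, n_filters=24, n_coeffs=12):
    """Return cepstral coefficients K = 1..n_coeffs for one frame."""
    windowed = frame * np.hamming(len(frame))
    power = np.abs(np.fft.rfft(windowed, n_fft)) ** 2          # FFT to frequency domain
    energies = mel_filterbank(n_filters, n_fft, sample_rate) @ power
    log_energies = np.log(np.maximum(energies, 1e-10))         # floor avoids log(0)
    # DCT-II over the filter outputs, written out as a cosine matrix.
    n = np.arange(n_filters)
    k = np.arange(1, n_coeffs + 1)[:, None]
    dct_basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))
    return dct_basis @ log_energies

# One 25 ms frame of a 440 Hz tone at 8 kHz (hypothetical input).
frame = np.sin(2 * np.pi * 440.0 * np.arange(200) / 8000.0)
sc = mfcc(frame)   # plays the role of Sc(K), K = 1..12
```

The vector `sc` corresponds to the Sc(K) used in the later formulas, with N = 12 matching the typical value mentioned in the text.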

S102: based on the statistical mean of the log cepstrum of speech that has not passed through the channel, use low-pass filtering to approximate the mean and thereby obtain an estimate of the transmission channel log cepstrum;

This step estimates the log cepstrum Tc(K) of the transmission channel. The estimation method used in this embodiment is as follows:

First step: use formula (1.3) to calculate the statistical mean E[Tc(K)] of the transmission channel log cepstrum Tc(K), specifically:

Let E[X] denote the statistical mean of X, where X may stand for Xc(K), Sc(K), or Tc(K) in the formulas;

From formula (1.3),

E[Xc(K)] = E[Sc(K)] - E[Tc(K)]

That is: E[Tc(K)] = E[Sc(K)] - E[Xc(K)] = E[Sc(K)] - RefCep(K)

= E[Sc(K) - RefCep(K)] (1.4)

where RefCep(K) denotes the statistical mean of the log cepstrum of speech that has not passed through the channel, i.e. E[Xc(K)] = RefCep(K). RefCep(K) is obtained in advance through long-term statistics over the log cepstral feature vectors of speech under an equalized channel (the ideal case), with K = 1 to N, where N is typically 12. Because RefCep(K) is a constant, its mean is also that constant, i.e. E[RefCep(K)] = RefCep(K).

Second step: use low-pass filtering to approximate the mean E[Tc(K)] and so obtain the estimate of the transmission channel log cepstrum;

In formula (1.4), E[Sc(K) - RefCep(K)] can only be obtained from long-term data statistics before E[Tc(K)] can be derived, so this embodiment estimates the value of E[Tc(K)] by approximation.

To meet the real-time requirement, this embodiment uses low-pass filtering to approximate E[X] in formula (1.4). Low-pass filtering passes low-frequency signals but attenuates signals whose frequency is above the cutoff frequency, that is, it removes high-frequency interference, which allows the sampling frequency to be reduced without frequency aliasing. There are many low-pass filtering methods, and no limitation is imposed here. This embodiment uses first-order IIR (infinite impulse response) low-pass filtering, giving

TranCep(K)_j = TranCep(K)_{j-1}(1-α₁) + (Sc(K)-RefCep(K))α₁ (1.5)

The physical meaning of formula (1.5) is to extract the slowly varying part of the MFCC coefficients, which approaches the mean, so the result TranCep(K) of formula (1.5) can be used to approximate the mean E[Tc(K)].

As formula (1.5) shows, this embodiment needs only the signal of the current frame and the previous frame to compute the approximate mean of the transmission channel log cepstrum, and takes that approximate mean as the estimate of Tc(K). It can therefore satisfy the requirement of extracting speech recognition features in real time.
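To see why the recursion in formula (1.5) behaves like a running mean, it helps to unroll it: starting from 0, the filter output is an exponentially weighted sum of past inputs, with the newest frame weighted by α₁ and older frames by α₁(1-α₁)^i. The sketch below checks this for a single cepstral coefficient; the input sequence `d`, the value of `alpha`, and the function name `lowpass_step` are illustrative assumptions.

```python
import numpy as np

def lowpass_step(prev, value, alpha):
    """One step of a first-order IIR low-pass: y_j = y_{j-1}(1 - alpha) + value * alpha."""
    return prev * (1.0 - alpha) + value * alpha

rng = np.random.default_rng(1)
d = 0.3 + rng.standard_normal(500)   # stand-in for Sc(K) - RefCep(K), one coefficient
alpha = 0.05                         # hypothetical smoothing factor

# Recursive form: only the previous output and the current frame are needed.
y = 0.0
for value in d:
    y = lowpass_step(y, value, alpha)

# Equivalent explicit form: exponentially weighted sum, newest frame first.
weights = alpha * (1.0 - alpha) ** np.arange(len(d))
explicit = np.sum(weights * d[::-1])

print(y, explicit)
```

Both forms give the same number up to rounding, but the recursive one stores a single value per coefficient, which is what makes it suitable for real-time use.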

S103: subtract the estimate TranCep(K) of the transmission channel log cepstrum from the log cepstrum Sc(K) of the observed speech to obtain the log cepstrum Xc(K) of the current speech signal that has not passed through the channel; Xc(K) is the result of separating the speech signal from the channel interference.

S101 computes Sc(K) and S102 computes the estimate of Tc(K), so Xc(K) follows from formula (1.3). When the speech signal passes through a transmission channel, Xc(K) is the speech log cepstrum after channel interference has been eliminated.

In summary, the log cepstrum Xc(K) of the speech that has not passed through the channel is computed as follows:

Xc(K) = Sc(K) - TranCep(K) (1.6)

TranCep(K)_j = TranCep(K)_{j-1}(1-α₁) + (Sc(K)-RefCep(K))α₁ (1.5)

where the initial value of TranCep(K) is 0 and RefCep(K) is obtained from prior statistics.

This method can eliminate the interference of the transmission channel on the speech signal and improve resistance to transmission-channel interference during speech recognition feature extraction, thereby improving the recognition rate.
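Formulas (1.5) and (1.6) together amount to a short per-frame loop. The following is a minimal sketch on synthetic data, assuming the channel estimate is updated before it is subtracted from the same frame; the function name `equalize_stream`, the value of `alpha1`, and the synthetic reference mean and channel offset are illustrative assumptions.

```python
import numpy as np

def equalize_stream(sc_frames, ref_cep, alpha1=0.01):
    """Apply formulas (1.5) and (1.6) frame by frame."""
    tran_cep = np.zeros_like(ref_cep)          # TranCep(K) starts at 0
    xc_frames = np.empty_like(sc_frames)
    for j, sc in enumerate(sc_frames):
        tran_cep = tran_cep * (1.0 - alpha1) + (sc - ref_cep) * alpha1   # (1.5)
        xc_frames[j] = sc - tran_cep                                      # (1.6)
    return xc_frames, tran_cep

# Synthetic example: clean cepstra fluctuate around RefCep(K), and a
# constant channel offset Tc(K) is added as in formula (1.2).
rng = np.random.default_rng(2)
ref_cep = np.linspace(1.0, -1.0, 12)
channel = np.linspace(0.5, -0.5, 12)
clean = ref_cep + rng.standard_normal((3000, 12))
observed = clean + channel

equalized, estimate = equalize_stream(observed, ref_cep)
print(np.max(np.abs(estimate - channel)))   # residual error of the channel estimate
```

With a small `alpha1` the estimate settles slowly but fluctuates little; a larger value tracks channel changes faster at the cost of a noisier estimate.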

Embodiment two:

The method of Embodiment one considers only the case where a speech signal is present on the transmission channel (speech segments). When no speech signal is present on the transmission channel (non-speech segments), the transmission channel log cepstrum should not be estimated with formula (1.5); the following formula should be used instead:

TranCep(K)_j = TranCep(K)_{j-1}(1-α₂) + Sc(K)α₂ (1.7)

where TranCep(K) denotes the estimate of the transmission channel log cepstrum, j is the frame number, and α₂ is also a smoothing factor, with a value different from that of α₁.
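Switching between formula (1.5) and formula (1.7) can be sketched as a single update function. This is a minimal sketch: the `is_speech` flag is assumed to come from some external speech/non-speech decision, and the function name and the values of `alpha1` and `alpha2` are hypothetical.

```python
import numpy as np

def update_tran_cep(tran_cep_prev, sc, ref_cep, is_speech, alpha1=0.01, alpha2=0.05):
    """Update the channel estimate with (1.5) in speech frames and (1.7) otherwise."""
    if is_speech:
        # Speech segment: subtract the reference mean before smoothing.
        return tran_cep_prev * (1.0 - alpha1) + (sc - ref_cep) * alpha1   # (1.5)
    # Non-speech segment: smooth the observed cepstrum directly.
    return tran_cep_prev * (1.0 - alpha2) + sc * alpha2                    # (1.7)

# Hypothetical frames: one speech frame followed by one non-speech frame.
ref_cep = np.ones(12)
tran_cep = np.zeros(12)
tran_cep = update_tran_cep(tran_cep, np.full(12, 2.0), ref_cep, is_speech=True)
tran_cep = update_tran_cep(tran_cep, np.full(12, 0.5), ref_cep, is_speech=False)
```

The only difference between the two branches is whether RefCep(K) is subtracted and which smoothing factor is applied.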

In the prior art, many interference removal methods do not consider the handling of non-speech segments, for example the frequency-domain LMS (least mean square) blind equalization method and the cepstral-domain LMS blind equalization method. Both are blind equalization methods that use the LMS algorithm to minimize the error between the observed speech features and reference speech features, thereby obtaining converged, equalized speech feature parameters. The first method works in the spectral domain and the second in the cepstral domain; LMS-based blind equalization in the cepstral domain requires less computation and converges better. In non-speech segments, however, a blind equalization algorithm may converge incorrectly, which affects the extraction of speech recognition features. To address this problem, this embodiment processes speech segments and non-speech segments differently, using a different estimation formula for the transmission channel log cepstrum in each, so that the transmission channel log cepstrum in non-speech segments is estimated more accurately and resistance to channel interference is further improved.

Preferably, this embodiment can also use the signal-to-noise ratio (SNR) of the observed speech signal to combine the two formulas (1.5) and (1.7), which compute TranCep(K) with α₁ and α₂, into the following:

TranCep(K)_j = TranCep(K)_{j-1}(1-α₃) + (Sc(K)-RefCep(K))β₁ + Sc(K)β₂ (1.8)

where α₃ is also a smoothing factor, related to α₁ and α₂ as follows: in speech segments α₃ is α₁; in non-speech segments α₃ is α₂.

β₁ + β₂ = α₃, and β₁ and β₂ are determined according to the signal-to-noise ratio. The signal-to-noise ratio is the ratio of the original signal to the noise introduced by the equipment itself, environmental interference, and similar causes. It is usually written "SNR" or "S/N" and generally expressed in decibels (dB); the higher the SNR, the better. β₁ and β₂ satisfy: when the SNR is high, β₁ >> β₂; when the SNR is low, β₁ << β₂. See the following table for details:

SNR (dB)    20         15        10        5         0         -5        -10
β₁          100% α₃    90% α₃    80% α₃    70% α₃    50% α₃    20% α₃    0
β₂          0          10% α₃    20% α₃    30% α₃    50% α₃    80% α₃    100% α₃

Table 1

In summary, the log cepstrum Xc(K) of the speech that has not passed through the channel is computed as:

Xc(K) = Sc(K) - TranCep(K) (1.6)

where TranCep(K) is estimated with formula (1.8). According to Table 1, if SNR >= 0 dB then α₃ = α₁; otherwise α₃ = α₂. That is, an SNR of 0 dB is the boundary between speech segments and non-speech segments.
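The weighting in Table 1 and the combined update of formula (1.8) can be sketched as follows. Table 1 lists β₁ and β₂ only at seven SNR values; this sketch assumes linear interpolation between the listed points and clamping outside the 20 dB to -10 dB range, which the text does not specify. The function names and the values of `alpha1` and `alpha2` are also hypothetical.

```python
import numpy as np

# Table 1, in increasing SNR order: share of alpha3 assigned to beta1.
SNR_POINTS_DB = np.array([-10.0, -5.0, 0.0, 5.0, 10.0, 15.0, 20.0])
BETA1_SHARE = np.array([0.0, 0.2, 0.5, 0.7, 0.8, 0.9, 1.0])

def smoothing_weights(snr_db, alpha1, alpha2):
    """Return (alpha3, beta1, beta2) with beta1 + beta2 = alpha3."""
    alpha3 = alpha1 if snr_db >= 0.0 else alpha2    # 0 dB is the speech/non-speech boundary
    share = float(np.interp(snr_db, SNR_POINTS_DB, BETA1_SHARE))  # assumed interpolation
    beta1 = share * alpha3
    beta2 = alpha3 - beta1
    return alpha3, beta1, beta2

def update_tran_cep_snr(tran_cep_prev, sc, ref_cep, snr_db, alpha1=0.01, alpha2=0.05):
    """One step of formula (1.8)."""
    alpha3, beta1, beta2 = smoothing_weights(snr_db, alpha1, alpha2)
    return tran_cep_prev * (1.0 - alpha3) + (sc - ref_cep) * beta1 + sc * beta2

# At 20 dB the update reduces to formula (1.5); at -10 dB it reduces to (1.7).
print(smoothing_weights(20.0, 0.01, 0.05))
print(smoothing_weights(-10.0, 0.01, 0.05))
```

Keeping the table as data rather than branching on each row makes it easy to swap in a different interpolation rule or a nearest-row lookup.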

Embodiment three:

In the computation above, the statistical mean RefCep(K) of the log cepstrum of speech that has not passed through the channel is a constant obtained from prior statistics and represents only a generally applicable mean. To bring this value closer to the individual characteristics of each speaker, this embodiment uses the log cepstrum Xc(K) of the current speech signal that has not passed through the channel, computed for every frame, to update RefCep(K), as follows:

RefCep(K)_{j+1} = RefCep(K)_j(1-γ) + Xc(K)γ (1.9)

where γ is a small quantity and γ < α₃.

That is, for each speaker's speech signal, the initial value of RefCep(K) is the constant obtained from statistics. After Xc(K) has been computed for the current frame, it is used to update RefCep(K) according to formula (1.9), and the updated RefCep(K) is used in the computation for the next frame. In this way different speakers yield different updated values of RefCep(K), and because RefCep(K) is closer to the speaker's individual characteristics, the speech recognition rate can be improved.

In practical applications, for better results, the update can be performed only once the estimate TranCep(K) of the transmission channel log cepstrum has converged and the signal-to-noise ratio SNR is relatively high.
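The gated update of formula (1.9) might look like the sketch below. The text only says that the update is best applied once TranCep(K) has converged and the SNR is relatively high; the convergence test (largest change between successive channel estimates), both thresholds, and the function name are assumptions made for illustration.

```python
import numpy as np

def update_ref_cep(ref_cep, xc, gamma, snr_db, tran_cep, tran_cep_prev,
                   snr_threshold_db=10.0, convergence_tol=1e-3):
    """Apply formula (1.9) only when the channel estimate is stable and SNR is high."""
    converged = np.max(np.abs(tran_cep - tran_cep_prev)) < convergence_tol
    if converged and snr_db >= snr_threshold_db:
        return ref_cep * (1.0 - gamma) + xc * gamma    # (1.9)
    return ref_cep                                     # otherwise keep the old mean

# Hypothetical frame: stable channel estimate and 20 dB SNR, so the update is applied.
ref_cep = np.zeros(12)
xc = np.ones(12)
stable = np.full(12, 0.3)
ref_cep = update_ref_cep(ref_cep, xc, gamma=0.001, snr_db=20.0,
                         tran_cep=stable, tran_cep_prev=stable)
```

Because γ is kept smaller than α₃, the reference mean drifts more slowly than the channel estimate.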

Based on the three embodiments above, the method of extracting speech recognition features with the present invention is shown in Fig. 2.

S201: perform speech enhancement on the observed speech signal s(n) to obtain the enhanced speech signal s'(n);

This is a preprocessing step. The purpose of speech enhancement is to extract speech that is as clean as possible from the noisy speech signal. Many enhancement algorithms are in common use, such as spectral subtraction or Wiener filtering, and this embodiment does not elaborate on them.

S202: extract MFCC coefficients from the enhanced speech signal s'(n) to obtain the log cepstrum Sc(K) of the observed speech;

S203: using the long-term mean RefCep(K) of the speech log cepstral features and the signal-to-noise ratio of the observed signal, eliminate the channel interference to obtain the equalized cepstral features of the speech.

RefCep(K) is the statistical mean of the log cepstrum of speech that has not passed through the channel, described above. The equalized cepstral features of the speech are the extracted speech recognition features, which are used in the subsequent pattern matching and recognition process.
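Step S203 can be pictured as a small stateful object that is fed one cepstral vector per frame. The sketch below combines formulas (1.8), (1.6), and (1.9) under stated assumptions: speech enhancement (S201) and MFCC extraction (S202) are taken to happen upstream, `snr_db` comes from an external estimator, Table 1 is linearly interpolated, and the class name, the smoothing values, and the `update_reference` switch are hypothetical.

```python
import numpy as np

# Table 1, in increasing SNR order: share of alpha3 assigned to beta1.
SNR_POINTS_DB = np.array([-10.0, -5.0, 0.0, 5.0, 10.0, 15.0, 20.0])
BETA1_SHARE = np.array([0.0, 0.2, 0.5, 0.7, 0.8, 0.9, 1.0])

class ChannelEqualizer:
    """Per-frame channel interference removal in the log cepstral domain."""

    def __init__(self, ref_cep, alpha1=0.01, alpha2=0.05, gamma=0.001):
        self.ref_cep = np.asarray(ref_cep, dtype=float).copy()   # RefCep(K)
        self.tran_cep = np.zeros_like(self.ref_cep)              # TranCep(K), starts at 0
        self.alpha1, self.alpha2, self.gamma = alpha1, alpha2, gamma

    def process(self, sc, snr_db, update_reference=False):
        alpha3 = self.alpha1 if snr_db >= 0.0 else self.alpha2
        beta1 = alpha3 * float(np.interp(snr_db, SNR_POINTS_DB, BETA1_SHARE))
        beta2 = alpha3 - beta1
        self.tran_cep = (self.tran_cep * (1.0 - alpha3)
                         + (sc - self.ref_cep) * beta1 + sc * beta2)     # (1.8)
        xc = sc - self.tran_cep                                           # (1.6)
        if update_reference:
            # Adapt the reference mean toward the current speaker, (1.9).
            self.ref_cep = self.ref_cep * (1.0 - self.gamma) + xc * self.gamma
        return xc

# Hypothetical usage: constant channel offset, high SNR, fixed reference mean.
ref = np.linspace(1.0, -1.0, 12)
offset = np.full(12, 0.4)
equalizer = ChannelEqualizer(ref)
for _ in range(3000):
    features = equalizer.process(ref + offset, snr_db=20.0)
```

In this toy run the channel estimate settles on the offset and the returned features settle on the reference mean, which is the behaviour formulas (1.3) and (1.4) predict for a constant channel.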

Based on the above, to illustrate the performance of the channel interference elimination method of the present invention, a comparison is given below using a test example. The test uses the HTK toolkit as the speech recognition tool, with standard MFCC coefficients and their first-order and second-order derivatives as feature parameters. The test sequences are divided into three groups, A, B, and C. Each group contains 50 digit strings of 8 digits each, i.e. 400 digits per group. Group A was collected over the same channel as the training data; group B was collected over a different channel from the training data at a relatively high SNR; group C was collected over a different channel from the training data at a relatively low SNR.

The following 5 cases were tested:

1. no interference removal method is used;

2. the existing LMS blind equalization algorithm is used;

3. the example formed by formulas (1.5) and (1.6) of the present invention is used;

4. the example formed by formulas (1.6) and (1.8) of the present invention is used;

5. the example formed by formula (1.9) of the present invention is used;

Speech recognition tests were run on the three groups A, B, and C for each of the 5 cases above. The recognition results are shown in the following table (note: the error rate reduction is measured relative to test 1):

Table 2

As the table data show, the interference removal method provided by the present invention gives a clear improvement on test sequences collected over a channel different from that of the training data. Compared with the existing method, the method of the present invention reduces the error rate further.

Corresponding to the method embodiments described above, the present invention also provides device embodiments.

Referring to Fig. 3, which is a structural diagram of the speech signal processing device of an embodiment. The device mainly comprises:

a cepstral coefficient extraction unit U31, configured to extract cepstral coefficients from the currently observed speech signal in the log cepstral domain to obtain the log cepstrum of the observed speech;

a channel log cepstrum estimation unit U32, configured to use low-pass filtering to approximate the mean, based on the statistical mean of the log cepstrum of speech that has not passed through the channel, and thereby obtain an estimate of the transmission channel log cepstrum;

an interference separation unit U33, configured to subtract the estimate of the transmission channel log cepstrum from the log cepstrum of the observed speech to obtain the log cepstrum of the current speech signal that has not passed through the channel; this log cepstrum is the result of separating the speech signal from the channel interference.

Referring to Fig. 4, the channel log cepstrum estimation unit U32 may further comprise:

a mean calculation subunit U321, configured to calculate E[Tc(K)] = E[Sc(K) - RefCep(K)];

where Tc(K) denotes the log cepstrum of the transmission channel; Sc(K) denotes the log cepstrum of the observed speech; E[X] denotes the statistical mean of X; RefCep(K) denotes the statistical mean of the log cepstrum of speech that has not passed through the channel; and K is the cepstral coefficient index;

a first estimation subunit U322, configured to apply low-pass filtering to the above formula to approximate E[Tc(K)] when a speech signal is present on the transmission channel, giving

TranCep(K)_j = TranCep(K)_{j-1}(1-α₁) + (Sc(K)-RefCep(K))α₁;

Preferably, referring to Fig. 5, the channel log cepstrum estimation unit U32 may further comprise:

a second estimation subunit U323, configured to apply low-pass filtering to the above formula for E[Tc(K)] to approximate E[Tc(K)] when no speech signal is present on the transmission channel, giving TranCep(K)_j = TranCep(K)_{j-1}(1-α₂) + Sc(K)α₂, where α₂ is a smoothing factor whose value differs from that of α₁.

Preferably, described channel logarithm cepstrum evaluation unit U32 can also comprise:

Preferably, referring to Fig. 6, the device may further comprise:

an updating unit U34, configured to update the statistical mean RefCep(K) of the log cepstrum of speech that has not passed through the channel, using the log cepstrum Xc(K) of the current speech signal that has not passed through the channel, according to the formula RefCep(K)_{j+1} = RefCep(K)_j(1-γ) + Xc(K)γ, where γ < α₃ and α₃ is a smoothing factor related to α₁ and α₂ as follows: in speech segments α₃ is α₁; in non-speech segments α₃ is α₂.

The embodiments in this specification are described progressively: each embodiment emphasizes its differences from the others, and the same or similar parts of the embodiments may be understood by cross-reference. Because the device embodiments are substantially similar to the method embodiments, they are described briefly; for the relevant details, see the corresponding description of the method embodiments.

The method and device provided by the present invention for eliminating the influence of the transmission channel on speech signals have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the embodiments above is intended only to help in understanding the method of the invention and its core idea. At the same time, a person of ordinary skill in the art may, following the idea of the invention, make changes to the specific implementation and scope of application. In summary, the content of this specification should not be construed as limiting the invention.

Claims

1. A speech signal processing method, characterized by comprising:

2. The method according to claim 1, characterized in that the step of using low-pass filtering to approximate the mean, based on the statistical mean of the log cepstrum of speech that has not passed through the channel, to obtain the estimate of the transmission channel log cepstrum specifically comprises:

calculating E[Tc(K)] = E[Sc(K) - RefCep(K)];

when a speech signal is present on the transmission channel, applying low-pass filtering to the above formula to approximate E[Tc(K)], giving TranCep(K)_j = TranCep(K)_{j-1}(1-α₁) + (Sc(K)-RefCep(K))α₁

3. The method according to claim 2, characterized by further comprising:

when no speech signal is present on the transmission channel, applying low-pass filtering to the above formula for E[Tc(K)] to approximate E[Tc(K)], giving TranCep(K)_j = TranCep(K)_{j-1}(1-α₂) + Sc(K)α₂

where α₂ is a smoothing factor whose value differs from that of α₁.

4. The method according to claim 3, characterized by further comprising:

using the signal-to-noise ratio (SNR) of the observed speech signal to combine the two formulas that compute TranCep(K) with α₁ and α₂ into the following:

TranCep(K)_j = TranCep(K)_{j-1}(1-α₃) + (Sc(K)-RefCep(K))β₁ + Sc(K)β₂;

where β₁ + β₂ = α₃, β₁ and β₂ are determined according to the signal-to-noise ratio, and α₃ is a smoothing factor related to α₁ and α₂ as follows: in speech segments α₃ is α₁; in non-speech segments α₃ is α₂.

5. The method according to claim 4, characterized by further comprising:

updating the statistical mean RefCep(K) of the log cepstrum of speech that has not passed through the channel, using the log cepstrum Xc(K) of the current speech signal that has not passed through the channel, according to the formula RefCep(K)_{j+1} = RefCep(K)_j(1-γ) + Xc(K)γ;

where γ < α₃ and α₃ is a smoothing factor related to α₁ and α₂ as follows: in speech segments α₃ is α₁; in non-speech segments α₃ is α₂.

6. A speech signal processing device, characterized by comprising:

7. The device according to claim 6, characterized in that the channel log cepstrum estimation unit comprises:

a mean calculation subunit, configured to calculate E[Tc(K)] = E[Sc(K) - RefCep(K)];

TranCep(K)_j = TranCep(K)_{j-1}(1-α₁) + (Sc(K)-RefCep(K))α₁;

8. The device according to claim 7, characterized in that the channel log cepstrum estimation unit further comprises:

a second estimation subunit, configured to apply low-pass filtering to the above formula for E[Tc(K)] to approximate E[Tc(K)] when no speech signal is present on the transmission channel, giving TranCep(K)_j = TranCep(K)_{j-1}(1-α₂) + Sc(K)α₂, where α₂ is a smoothing factor whose value differs from that of α₁.

9. The device according to claim 8, characterized in that the channel log cepstrum estimation unit further comprises:

TranCep(K)_j = TranCep(K)_{j-1}(1-α₃) + (Sc(K)-RefCep(K))β₁ + Sc(K)β₂;

10. The device according to claim 9, characterized in that the device further comprises: