CN109979481A - Speech feature extraction algorithm for dynamic-partition inverse discrete cosine transform cepstrum coefficients based on correlation coefficients - Google Patents
Speech feature extraction algorithm for dynamic-partition inverse discrete cosine transform cepstrum coefficients based on correlation coefficients
- Publication number
- CN109979481A (Application CN201910181526.7A)
- Authority
- CN
- China
- Prior art keywords
- discrete cosine
- cosine transform
- inverse discrete
- coefficient
- cepstrum coefficient
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/45—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Probability & Statistics with Applications (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Complex Calculations (AREA)
Abstract
The invention discloses a speech feature extraction algorithm for dynamic-partition inverse discrete cosine transform (IDCT) cepstrum coefficients based on correlation coefficients, comprising the following steps. S1: pre-process the audio signal. S2: transform the pre-processed audio signal from the time domain to the frequency domain. S3: using a clustering algorithm, compute the similarity between adjacent columns of the IDCT cepstrum coefficient matrix obtained in step S2, and merge the adjacent pair whose summed correlation-coefficient vector is largest; iterate this procedure until the matrix is merged into 14 columns, yielding 14 classes. The resulting dynamic-partition IDCT cepstrum coefficients based on correlation coefficients are the speech features. The invention remedies the prior art's failure to exploit the inter-class similarity inherent in the signal after step S2, giving the method wider applicability and higher recognition accuracy in speaker identification.
Description
Technical field
The invention belongs to the field of speech feature extraction. It applies unsupervised clustering analysis to speech feature extraction, and in particular relates to a speech feature extraction algorithm for dynamic-partition inverse discrete cosine transform cepstrum coefficients based on correlation coefficients.
Background art
Speaker recognition technology comprises two parts: feature extraction and model-based identification. Feature extraction is the key step in speaker recognition and directly determines the overall performance of a speech recognition system. Ordinarily, after a speech signal has been pre-processed by framing and windowing, it yields a large volume of high-dimensional data, so when extracting speaker features the data dimensionality must be reduced by removing redundant information from the original speech. Existing methods filter the speech signal with a Mel-scale triangular filter bank, converting it into speech feature vectors that satisfy the characteristic-parameter requirements, approximate the perceptual characteristics of the human auditory system, and to some extent enhance the speech signal while suppressing non-speech signals. Common characteristic parameters include:

Linear prediction analysis coefficients, which simulate the human mechanism of sound production and are obtained by analysing a cascaded short-tube model of the vocal tract. Perceptual linear prediction coefficients, which apply an auditory model to spectral analysis: the input speech signal is processed through a model of human hearing in place of the time-domain signal used in linear predictive coding (LPC), yielding characteristic parameters equivalent to the prediction polynomial of the LPC all-pole model. Tandem and Bottleneck features, two classes of features extracted with neural networks. Filter-bank (Fbank) features, equivalent to MFCC with the final discrete cosine transform removed, which therefore retain more of the original speech data than MFCC features. Linear prediction cepstrum coefficients, important characteristic parameters that discard the vocal-excitation information of the channel-model signal generation process and represent the formant characteristics with a dozen or so cepstrum coefficients. MFCC, the most widely used speech characteristic parameter: the speech is first pre-processed by pre-emphasis, framing, windowing, and the fast Fourier transform; the energy spectrum is then filtered by a bank of Mel-scale triangular filters; the logarithmic energy of each filter output is computed and passed through a discrete cosine transform (DCT) to obtain the MFCC coefficients; finally, dynamic difference parameters are extracted to give the Mel cepstrum coefficients. In 2012, S.Al-Rawahya et al., referring to the MFCC feature extraction method, applied equal division in the frequency domain to the DCT cepstrum coefficients obtained after speech pre-processing and proposed the Histogram DCT cepstrum coefficient method.
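The MFCC pipeline summarised above (pre-emphasis, framing, windowing, Fourier transform, Mel-scale triangular filtering, log energy, DCT) can be sketched as follows. This is an illustrative reimplementation, not code from the patent; the sampling rate, frame and hop sizes, filter count, and coefficient count are common defaults rather than values fixed by this document.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, fs=16000, frame_len=400, hop=160, n_fft=512,
         n_filters=26, n_ceps=13, preemph=0.97):
    # Pre-emphasis: y(n) = x(n) - a*x(n-1)
    sig = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    # Framing with a Hamming window
    n_frames = 1 + (len(sig) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(frame_len)
    # Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular Mel-scale filter bank
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log filter-bank energies, then DCT -> cepstral coefficients
    feats = np.log(power @ fbank.T + 1e-10)
    return dct(feats, type=2, axis=1, norm='ortho')[:, :n_ceps]
```

With 1 s of 16 kHz audio this yields a (frames × 13) coefficient matrix, the kind of high-dimensional frame-wise data the background section describes.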
We observe that equal frequency-domain division of the cepstrum coefficients ignores the dynamic characteristics between adjacent columns of the speech data itself. On this basis, the present invention therefore proposes a new speech feature extraction algorithm, namely a dynamic-partition inverse discrete cosine transform cepstrum coefficient method based on correlation coefficients. Combined with unsupervised learning, it uses hierarchical clustering to cluster the speech data according to the similarity of its dynamic characteristics, thereby extracting dynamic feature vectors that better describe the characteristics of the speech.
In their 2012 research, S.Al-Rawahya et al. identified the DCT cepstrum as a new feature and proposed a speech feature extraction algorithm based on equal-frequency-domain DCT cepstrum coefficients. The pre-processed audio signal is transformed into the frequency domain: the convolution becomes a product of frequency-domain spectra; taking the logarithm turns the resulting components into a sum, giving the discrete cosine transform cepstrum coefficients (DCT cepstrum coefficients). The DCT cepstrum coefficients measure the periodicity of the recorded frequency range with nonlinear increments: the frequency-domain feature intervals are divided every 50 Hz between 0 Hz and 600 Hz, and every 100 Hz between 600 Hz and 1000 Hz, a process that can be regarded as counting the number of periods in the frequency range of the speech signal. It is simpler and faster than MFCC feature extraction.
The Pearson correlation coefficient, also known as the Pearson product-moment correlation coefficient (PPMCC or PCC), measures the correlation between two variables X and Y; its value lies between -1 and 1.
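As a minimal sketch, the Pearson coefficient defined here can be computed directly from its product-moment form; this is the standard definition, not code specific to the patent.

```python
import numpy as np

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / np.sqrt((xc @ xc) * (yc @ yc)))
```

Perfectly linearly related vectors give +1 (or -1 for a negative relation), matching the stated range of the coefficient.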
Summary of the invention
The purpose of the present invention is chiefly to address the inaccuracy of frequency division in speech feature extraction algorithms based on equal-frequency-domain-division inverse discrete cosine transform cepstrum coefficients, by proposing a speech feature extraction algorithm for dynamic-partition inverse discrete cosine transform cepstrum coefficients based on correlation coefficients. The technical means adopted by the invention are as follows:

A speech feature extraction algorithm for dynamic-partition inverse discrete cosine transform cepstrum coefficients based on correlation coefficients, comprising the following steps:
S1. Pre-process the audio signal:

Apply pre-emphasis, framing, and windowing to the audio signal in sequence.

Pre-processing eliminates factors that degrade audio quality, such as aliasing, higher-harmonic distortion, and high-frequency effects introduced both by the human vocal organs themselves and by the equipment used to capture the audio signal. It ensures that the signal obtained by subsequent processing is more uniform and smooth, provides good parameters for speech feature extraction, and improves the quality of subsequent processing.
S2. Transform the pre-processed audio signal from the time domain to the frequency domain:

The pre-processed audio signal is transformed into the frequency domain: the convolution becomes a product of frequency-domain spectra; taking the logarithm turns the resulting components into a sum, giving the inverse discrete cosine transform cepstrum coefficients (IDCT cepstrum coefficients). The specific process follows the formula:

C(q) = IDCT{ log |DCT{x(k)}| }

where DCT and IDCT denote the discrete cosine transform and the inverse discrete cosine transform respectively, x(k) is the pre-processed audio signal, and C(q) is the transformed output signal, i.e. the inverse discrete cosine transform cepstrum coefficients.

The inverse discrete cosine transform cepstrum coefficients form a data matrix. Because of the intrinsic frequency attributes of speech, all column attributes are identical during hierarchical clustering and the relative positions of the columns cannot change, so we compute the similarity between each pair of adjacent columns and merge the most similar adjacent pair, clustering step by step.
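The S2 formula C(q) = IDCT{ log |DCT{x(k)}| } can be sketched for one pre-processed frame as follows; the choice of orthonormal type-II transforms and the small logarithm floor are implementation assumptions not fixed by the patent.

```python
import numpy as np
from scipy.fftpack import dct, idct

def idct_cepstrum(frame):
    """C(q) = IDCT{ log |DCT{x(k)}| } for one pre-processed frame."""
    spec = dct(frame, type=2, norm='ortho')
    # A small floor avoids log(0) at exactly-zero spectral values.
    return idct(np.log(np.abs(spec) + 1e-12), type=2, norm='ortho')
```

Applying this per frame and stacking the results gives the m*n cepstrum coefficient matrix that step S3 clusters column by column.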
S3. Using a clustering algorithm, compute the similarity between adjacent columns of the inverse discrete cosine transform cepstrum coefficient matrix obtained in step S2, and merge the adjacent pair for which the summed correlation-coefficient vector is largest. Iterate this procedure until the matrix is merged into 14 columns, yielding 14 classes; the resulting dynamic-partition inverse discrete cosine transform cepstrum coefficients based on correlation coefficients are the speech features.
The pre-emphasis is realised by a digital filter; the specific process follows the formula:

Y(n) = X(n) - aX(n-1)

where Y(n) is the output signal after pre-emphasis, X(n) is the input audio signal, a is the pre-emphasis coefficient, and n is the time index.

The average power spectrum of the audio signal is shaped by the glottal excitation and by mouth and nose radiation: above roughly 800 Hz it rolls off at about 6 dB/oct (per octave), so the higher the frequency, the smaller the corresponding component. For this reason the high-frequency portion of the audio signal is boosted before analysis.
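The pre-emphasis filter above can be sketched as follows; the default a = 0.97 is the value the specific embodiment later adopts.

```python
import numpy as np

def preemphasis(x, a=0.97):
    """Y(n) = X(n) - a*X(n-1); the embodiment uses a = 0.97."""
    x = np.asarray(x, dtype=float)
    # First sample passes through unchanged (no X(-1) exists).
    return np.append(x[0], x[1:] - a * x[:-1])
```

The filter is a first-order high-pass, which boosts the high-frequency components attenuated by the roughly 6 dB/oct roll-off described above.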
Speech analysis throughout relies on "short-time analysis". An audio signal is time-varying, but within a short interval (generally 10 to 30 ms) its time-varying characteristics remain essentially constant, i.e. relatively stable, so it can be regarded as a quasi-stationary process: the audio signal is short-time stationary. Any analysis and processing of an audio signal must therefore be built on a "short-time" basis, i.e. by performing "short-time analysis": the audio signal is analysed in segments, each segment being called a "frame", with a frame length generally of 10 to 30 ms. In this way the analysis of the whole audio signal yields a time series of characteristic parameters composed of the parameters of each frame.

The framing segments the pre-emphasised output signal into frames of 20 ms each.

Windowing is applied after framing. Its purpose can be understood as making the speech signal more continuous globally and avoiding the Gibbs effect, so that a speech signal that was originally aperiodic exhibits some of the characteristics of a periodic function. The windowing uses a Hamming window.
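The 20 ms framing and Hamming windowing can be sketched as below. The sampling rate and the non-overlapping frame layout are assumptions, since the text fixes only the frame length and the window type.

```python
import numpy as np

def frame_and_window(x, fs=16000, frame_ms=20):
    """Split a signal into 20 ms frames and apply a Hamming window.

    fs and the non-overlapping layout are assumptions; the patent
    specifies only the 20 ms frame length and the Hamming window.
    """
    n = int(fs * frame_ms / 1000)  # samples per frame
    n_frames = len(x) // n
    frames = np.asarray(x[:n_frames * n], dtype=float).reshape(n_frames, n)
    return frames * np.hamming(n)
```

At 16 kHz each 20 ms frame holds 320 samples; the window tapers the frame edges, which is what suppresses the Gibbs effect described above.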
The transform is a cepstrum transform.

The clustering algorithm is a hierarchical clustering algorithm.
The similarity computation is the Pearson product-moment correlation coefficient; the specific steps of step S3 are then as follows:

Matrix A denotes the m*n-dimensional inverse discrete cosine transform cepstrum coefficients of a single person obtained in step S2. Each column vector V_1, V_2, ..., V_n of the inverse discrete cosine transform cepstrum coefficients is regarded as one of n classes, and the Pearson correlation coefficient of V_i and V_{i+1} is computed as r(V_i, V_{i+1}) = cov(V_i, V_{i+1}) / (σ(V_i)·σ(V_{i+1})).
The specific steps of the cluster analysis are as follows:

First clustering:

l_1 = r(V_1, V_2)
l_2 = r(V_2, V_3)
l_3 = r(V_3, V_4)
…
l_{n-1} = r(V_{n-1}, V_n)

Let the correlation-coefficient vector of the columns after clustering the first speaker's inverse discrete cosine transform cepstrum coefficients be p_1 = (l_1, l_2, l_3, ..., l_{n-1}); correspondingly, the vector for the M-th speaker is p_M. Sum the correlation-coefficient vectors of all speakers to obtain L = (L_1, ..., L_{n-1}).

Let i = argmin(L_1, ..., L_{n-1}); then the clustering result is:

(V_1), (V_2), ..., (V_i + V_{i+1}), ..., (V_n)

Update all speakers' inverse discrete cosine transform cepstrum coefficient correlation-coefficient vectors:

l_{i-1} = r(V_{i-1}, (V_i + V_{i+1}))
l_i = r((V_i + V_{i+1}), V_{i+2})
l_{i+1} = l_{i+2}
…
l_{n-2} = l_{n-1}

Delete l_{n-1}.
Second clustering:

Let j = argmin(L_1, ..., L_{n-2}); then the clustering result is:

(V_1), (V_2), ..., (V_i + V_{i+1}), ..., (V_j + V_{j+1}), ..., (V_n)

Update again:

l_{j-1} = r(V_{j-1}, (V_j + V_{j+1}))
l_j = r((V_j + V_{j+1}), V_{j+2})
l_{j+1} = l_{j+2}
…
l_{n-3} = l_{n-2}

Delete l_{n-2}.

Hierarchical clustering proceeds in this way until the final clustering result is 14 classes. The resulting dynamic-partition inverse discrete cosine transform cepstrum coefficients based on correlation coefficients are the speech features; these features are fed into a GMM model for identification to judge the feasibility of the algorithm.
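The iterative adjacent-column merging of step S3 can be sketched for a single speaker as follows. Merging by summation follows the (V_i + V_{i+1}) notation, and selecting the most-correlated adjacent pair follows the textual description of S3; the cross-speaker summation of the p_M vectors is omitted here for brevity, so this is an illustrative single-speaker sketch, not the patent's exact procedure.

```python
import numpy as np

def pearson(x, y):
    xc, yc = x - x.mean(), y - y.mean()
    denom = np.sqrt((xc @ xc) * (yc @ yc))
    return float(xc @ yc / denom) if denom > 0 else 0.0

def dynamic_partition(A, n_classes=14):
    """Merge the most-correlated ADJACENT column pair of the m x n
    cepstrum matrix A repeatedly until n_classes columns remain.

    Only adjacent pairs are candidates, since the relative positions
    of the columns cannot change; merged columns are summed,
    following the (V_i + V_{i+1}) notation.
    """
    cols = [A[:, j].astype(float) for j in range(A.shape[1])]
    while len(cols) > n_classes:
        r = [pearson(cols[j], cols[j + 1]) for j in range(len(cols) - 1)]
        i = int(np.argmax(r))  # most similar adjacent pair
        cols[i] = cols[i] + cols.pop(i + 1)
    return np.column_stack(cols)
```

Because merged columns are summed, each row's total is preserved while the column count shrinks to the 14 classes used as the speech feature.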
Compared with the prior art, the present invention has the following advantages:

First, through an in-depth analysis of the properties of the speech feature extraction algorithm based on equal-frequency-domain-division DCT cepstrum coefficients, the invention remedies the prior art's failure to exploit the inter-class similarity inherent in the signal after step S2, giving the invention wider applicability and higher recognition accuracy in speaker identification.

Second, the invention applies unsupervised clustering analysis to speech feature extraction, so that the invention has the advantages of a simple procedure, fast speed, and low consumption of computing resources.
Description of the drawings

To explain the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative labour.

Fig. 1 is a flow chart of the speech feature extraction algorithm for dynamic-partition inverse discrete cosine transform cepstrum coefficients based on correlation coefficients in a specific embodiment of the invention.

Fig. 2 is a schematic diagram of the cluster analysis process of the inverse discrete cosine transform cepstrum coefficients in a specific embodiment of the invention.
Specific embodiment
To make the objects, technical solutions, and advantages of the embodiments of the invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the invention, not all of them. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the invention without creative work shall fall within the scope of protection of the invention.
As shown in Fig. 1, a speech feature extraction algorithm for dynamic-partition inverse discrete cosine transform cepstrum coefficients based on similarity computation comprises the following steps:

S1. Pre-process the audio signal:

Apply pre-emphasis, framing, and windowing to the audio signal in sequence.

The pre-emphasis is realised by a digital filter; the specific process follows the formula:

Y(n) = X(n) - aX(n-1)

where Y(n) is the output signal after pre-emphasis, X(n) is the input audio signal, a is the pre-emphasis coefficient, and n is the time index; here a takes the value 0.97.

The framing segments the pre-emphasised output signal into frames of 20 ms each.

The windowing uses a Hamming window.
S2. Transform the pre-processed audio signal from the time domain to the frequency domain:

The pre-processed audio signal is transformed into the frequency domain: the convolution becomes a product of frequency-domain spectra; taking the logarithm turns the resulting components into a sum, giving the inverse discrete cosine transform cepstrum coefficients (IDCT cepstrum coefficients). The specific process follows the formula:

C(q) = IDCT{ log |DCT{x(k)}| }

where DCT and IDCT denote the discrete cosine transform and the inverse discrete cosine transform respectively, x(k) is the pre-processed audio signal, and C(q) is the transformed output signal, i.e. the inverse discrete cosine transform cepstrum coefficients.
S3. Using a clustering algorithm, compute the similarity between adjacent columns of the inverse discrete cosine transform cepstrum coefficient matrix obtained in step S2, and merge the adjacent pair for which the summed correlation-coefficient vector is largest. Iterate this procedure until the matrix is merged into 14 columns, yielding 14 classes; the resulting dynamic-partition inverse discrete cosine transform cepstrum coefficients based on correlation coefficients are the speech features. The specific steps are as follows:

Matrix A denotes the m*n-dimensional inverse discrete cosine transform cepstrum coefficients of a single person obtained in step S2. As shown in Fig. 2, each column vector V_1, V_2, ..., V_n of the inverse discrete cosine transform cepstrum coefficients is regarded as one of n classes, and the Pearson correlation coefficient of V_i and V_{i+1} is computed as r(V_i, V_{i+1}) = cov(V_i, V_{i+1}) / (σ(V_i)·σ(V_{i+1})).
The specific steps of the cluster analysis are as follows:

First clustering:

l_1 = r(V_1, V_2)
l_2 = r(V_2, V_3)
l_3 = r(V_3, V_4)
…
l_{n-1} = r(V_{n-1}, V_n)

Let the correlation-coefficient vector of the columns after clustering the first speaker's inverse discrete cosine transform cepstrum coefficients be p_1 = (l_1, l_2, l_3, ..., l_{n-1}); correspondingly, the vector for the M-th speaker is p_M. Sum the correlation-coefficient vectors of all speakers to obtain L = (L_1, ..., L_{n-1}).

Let i = argmin(L_1, ..., L_{n-1}); then the clustering result is:

(V_1), (V_2), ..., (V_i + V_{i+1}), ..., (V_n)

Update all speakers' inverse discrete cosine transform cepstrum coefficient correlation-coefficient vectors:

l_{i-1} = r(V_{i-1}, (V_i + V_{i+1}))
l_i = r((V_i + V_{i+1}), V_{i+2})
l_{i+1} = l_{i+2}
…
l_{n-2} = l_{n-1}

Delete l_{n-1}.
Second clustering:

Let j = argmin(L_1, ..., L_{n-2}); then the clustering result is:

(V_1), (V_2), ..., (V_i + V_{i+1}), ..., (V_j + V_{j+1}), ..., (V_n)

Update again:

l_{j-1} = r(V_{j-1}, (V_j + V_{j+1}))
l_j = r((V_j + V_{j+1}), V_{j+2})
l_{j+1} = l_{j+2}
…
l_{n-3} = l_{n-2}

Delete l_{n-2}.

Hierarchical clustering proceeds in this way until the final clustering result is 14 classes. The resulting dynamic-partition inverse discrete cosine transform cepstrum coefficients based on correlation coefficients are the speech features; these features are fed into a GMM model for identification to judge the feasibility of the algorithm.
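Feeding the resulting features into a GMM for identification might look like the following sketch. scikit-learn's GaussianMixture, the component count, and the synthetic stand-in features are all assumptions for illustration; the patent does not specify a GMM implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Stand-in 14-dimensional feature matrices for two "speakers";
# real input would be the dynamic-partition IDCT cepstrum features.
feats_a = rng.normal(0.0, 1.0, size=(500, 14))
feats_b = rng.normal(5.0, 1.0, size=(500, 14))

# One GMM per enrolled speaker, fit on that speaker's features.
gmm_a = GaussianMixture(n_components=4, random_state=0).fit(feats_a)
gmm_b = GaussianMixture(n_components=4, random_state=0).fit(feats_b)

# Identification: choose the model with the higher average
# log-likelihood on the test utterance's feature matrix.
test = rng.normal(0.0, 1.0, size=(50, 14))
scores = {"A": gmm_a.score(test), "B": gmm_b.score(test)}
best = max(scores, key=scores.get)
```

Comparing recognition accuracy under such per-speaker GMMs is one plausible way to judge the feasibility of the extracted features, as the embodiment suggests.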
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features replaced by equivalents; such modifications or replacements do not remove the essence of the corresponding technical solutions from the scope of the technical solutions of the embodiments of the invention.
Claims (7)
1. A speech feature extraction algorithm for dynamic-partition inverse discrete cosine transform cepstrum coefficients based on correlation coefficients, characterised by the following steps:

S1. Pre-process the audio signal:

Apply pre-emphasis, framing, and windowing to the audio signal in sequence.

S2. Transform the pre-processed audio signal from the time domain to the frequency domain:

The pre-processed audio signal is transformed into the frequency domain: the convolution becomes a product of frequency-domain spectra; taking the logarithm turns the resulting components into a sum, giving the inverse discrete cosine transform cepstrum coefficients. The specific process follows the formula

C(q) = IDCT{ log |DCT{x(k)}| }

where DCT and IDCT denote the discrete cosine transform and the inverse discrete cosine transform respectively, x(k) is the pre-processed speech signal, and C(q) is the transformed output signal, i.e. the inverse discrete cosine transform cepstrum coefficients;

S3. Using a clustering algorithm, compute the similarity between adjacent columns of the inverse discrete cosine transform cepstrum coefficient matrix obtained in step S2, and merge the adjacent pair for which the summed correlation-coefficient vector is largest; iterate this procedure until the matrix is merged into 14 columns, obtaining 14 classes; the resulting dynamic-partition inverse discrete cosine transform cepstrum coefficients based on correlation coefficients are the speech features.
2. The extraction algorithm according to claim 1, characterised in that the pre-emphasis is realised by a digital filter; the specific process follows the formula

Y(n) = X(n) - aX(n-1)

where Y(n) is the output signal after pre-emphasis, X(n) is the input audio signal, a is the pre-emphasis coefficient, and n is the time index.

3. The extraction algorithm according to claim 1, characterised in that the framing segments the pre-emphasised output signal into frames of 20 ms each.

4. The extraction algorithm according to claim 1, characterised in that the windowing uses a Hamming window.

5. The extraction algorithm according to claim 1, characterised in that the transform is a cepstrum transform.

6. The extraction algorithm according to claim 1, characterised in that the clustering algorithm is a hierarchical clustering algorithm.

7. The extraction algorithm according to claim 1, characterised in that the similarity computation is the Pearson product-moment correlation coefficient; the specific steps of step S3 are then as follows:
Matrix A denotes the m*n-dimensional inverse discrete cosine transform cepstrum coefficients of a single person obtained in step S2. Each column vector V_1, V_2, ..., V_n of the inverse discrete cosine transform cepstrum coefficients is regarded as one of n classes, and the Pearson correlation coefficient r(V_i, V_{i+1}) of V_i and V_{i+1} is computed.

The specific steps of the cluster analysis are as follows:

First clustering:

l_1 = r(V_1, V_2)
l_2 = r(V_2, V_3)
l_3 = r(V_3, V_4)
…
l_{n-1} = r(V_{n-1}, V_n)

Let the correlation-coefficient vector of the columns after clustering the first speaker's inverse discrete cosine transform cepstrum coefficients be p_1 = (l_1, l_2, l_3, ..., l_{n-1}); correspondingly, the vector for the M-th speaker is p_M. Sum the correlation-coefficient vectors of all speakers to obtain L = (L_1, ..., L_{n-1}).

Let i = argmin(L_1, ..., L_{n-1}); then the clustering result is:

(V_1), (V_2), ..., (V_i + V_{i+1}), ..., (V_n)

Update all speakers' inverse discrete cosine transform cepstrum coefficient correlation-coefficient vectors:

l_{i-1} = r(V_{i-1}, (V_i + V_{i+1}))
l_i = r((V_i + V_{i+1}), V_{i+2})
l_{i+1} = l_{i+2}
…
l_{n-2} = l_{n-1}

Delete l_{n-1}.

Second clustering:

Let j = argmin(L_1, ..., L_{n-2}); then the clustering result is:

(V_1), (V_2), ..., (V_i + V_{i+1}), ..., (V_j + V_{j+1}), ..., (V_n)

Update again:

l_{j-1} = r(V_{j-1}, (V_j + V_{j+1}))
l_j = r((V_j + V_{j+1}), V_{j+2})
l_{j+1} = l_{j+2}
…
l_{n-3} = l_{n-2}

Delete l_{n-2}.

Hierarchical clustering proceeds in this way until the final clustering result is 14 classes; the resulting dynamic-partition inverse discrete cosine transform cepstrum coefficients based on correlation coefficients are the speech features, and these features are fed into a GMM model for identification to judge the feasibility of the algorithm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910181526.7A CN109979481A (en) | 2019-03-11 | 2019-03-11 | Speech feature extraction algorithm for dynamic-partition inverse discrete cosine transform cepstrum coefficients based on correlation coefficients |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910181526.7A CN109979481A (en) | 2019-03-11 | 2019-03-11 | Speech feature extraction algorithm for dynamic-partition inverse discrete cosine transform cepstrum coefficients based on correlation coefficients |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109979481A true CN109979481A (en) | 2019-07-05 |
Family
ID=67078590
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910181526.7A Pending CN109979481A (en) | 2019-03-11 | 2019-03-11 | Speech feature extraction algorithm for dynamic-partition inverse discrete cosine transform cepstrum coefficients based on correlation coefficients |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109979481A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101458950A (en) * | 2007-12-14 | 2009-06-17 | Anyka (Guangzhou) Software Technology Co., Ltd. | Method for eliminating interference from A/D converter noise to digital recording |
US9606530B2 (en) * | 2013-05-17 | 2017-03-28 | International Business Machines Corporation | Decision support system for order prioritization |
CN106971712A (en) * | 2016-01-14 | 2017-07-21 | Yutou Technology (Hangzhou) Co., Ltd. | A kind of adaptive rapid voiceprint recognition method and system |
CN107293308A (en) * | 2016-04-01 | 2017-10-24 | Tencent Technology (Shenzhen) Co., Ltd. | A kind of audio processing method and device |
CN109065071A (en) * | 2018-08-31 | 2018-12-21 | University of Electronic Science and Technology of China | A kind of song clustering method based on an iterative k-means algorithm |
CN109256127A (en) * | 2018-11-15 | 2019-01-22 | Jiangnan University | A kind of robust feature extraction method based on a nonlinear power-transformation Gammachirp filter |
Non-Patent Citations (5)
Title |
---|
S. Al-Rawahy et al., "Text-independent speaker identification system based on the histogram of DCT-cepstrum coefficients", International Journal of Knowledge-Based and Intelligent Engineering Systems * |
Wei Han et al., "An efficient MFCC extraction method in speech recognition", ISCAS 2006 * |
Tian Huiping, "An evaluation method combining the analytic hierarchy process and cluster analysis", East China Economic Management * |
Miao Yuanwu, "Data analysis based on hierarchical clustering", China Master's Theses Full-text Database, Information Science and Technology Series * |
Hu Wenjing, "A change-point identification method based on hierarchical cluster analysis", China Master's Theses Full-text Database, Information Science and Technology Series * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP6783001B2 (en) | Speech feature extraction algorithm based on dynamic division of cepstrum coefficients of inverse discrete cosine transform | |
Ali et al. | Automatic speech recognition technique for Bangla words | |
Kumar et al. | Design of an automatic speaker recognition system using MFCC, vector quantization and LBG algorithm | |
CN108305616A (en) | A kind of audio scene recognition method and device based on long feature extraction in short-term | |
CN110942766A (en) | Audio event detection method, system, mobile terminal and storage medium | |
Almaadeed et al. | Text-independent speaker identification using vowel formants | |
US20080167862A1 (en) | Pitch Dependent Speech Recognition Engine | |
CN110648684B (en) | Bone conduction voice enhancement waveform generation method based on WaveNet | |
Rajesh Kumar et al. | Optimization-enabled deep convolutional network for the generation of normal speech from non-audible murmur based on multi-kernel-based features | |
Mehta et al. | Comparative study of MFCC and LPC for Marathi isolated word recognition system | |
Gamit et al. | Isolated words recognition using mfcc lpc and neural network | |
CN114298019A (en) | Emotion recognition method, emotion recognition apparatus, emotion recognition device, storage medium, and program product | |
Nancy et al. | Audio based emotion recognition using mel frequency cepstral coefficient and support vector machine | |
Luo et al. | Emotional Voice Conversion Using Neural Networks with Different Temporal Scales of F0 based on Wavelet Transform. | |
Yadav et al. | Non-Uniform Spectral Smoothing for Robust Children's Speech Recognition. | |
Gaudani et al. | Comparative study of robust feature extraction techniques for ASR for limited resource Hindi language | |
Deiv et al. | Automatic gender identification for hindi speech recognition | |
Lekshmi et al. | An acoustic model and linguistic analysis for Malayalam disyllabic words: a low resource language | |
Muthamizh Selvan et al. | Spectral histogram of oriented gradients (SHOGs) for Tamil language male/female speaker classification | |
CN109979481A (en) | A kind of speech feature extraction algorithm of the dynamic partition inverse discrete cosine transform cepstrum coefficient based on related coefficient | |
Musaev et al. | Advanced feature extraction method for speaker identification using a classification algorithm | |
CN113689885A (en) | Intelligent auxiliary guide system based on voice signal processing | |
Tripathi et al. | VOP detection for read and conversation speech using CWT coefficients and phone boundaries | |
Iswarya et al. | Speech query recognition for Tamil language using wavelet and wavelet packets | |
Ma et al. | A Euclidean metric based voice feature extraction method using IDCT cepstrum coefficient |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20190705 |