CN109767756A - Speech feature extraction algorithm based on dynamically partitioned inverse discrete cosine transform cepstrum coefficients - Google Patents

Speech feature extraction algorithm based on dynamically partitioned inverse discrete cosine transform cepstrum coefficients

Info

Publication number
CN109767756A
CN109767756A (application CN201910087494.4A; granted as CN109767756B)
Authority
CN
China
Prior art keywords
cosine transform
discrete cosine
audio signal
cepstrum coefficient
inverse discrete
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910087494.4A
Other languages
Chinese (zh)
Other versions
CN109767756B (en)
Inventor
左毅
马赫
李铁山
贺培超
刘君霞
艾佳琪
肖杨
于仁海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Maritime University
Original Assignee
Dalian Maritime University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Maritime University filed Critical Dalian Maritime University
Priority to CN201910087494.4A priority Critical patent/CN109767756B/en
Publication of CN109767756A publication Critical patent/CN109767756A/en
Priority to JP2019186806A priority patent/JP6783001B2/en
Application granted granted Critical
Publication of CN109767756B publication Critical patent/CN109767756B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention discloses a speech feature extraction algorithm based on dynamically partitioned inverse discrete cosine transform (IDCT) cepstrum coefficients, comprising the following steps: S1, pre-process the audio signal by pre-emphasis, framing, and windowing; S2, transform the pre-processed audio signal from the time domain to the frequency domain; S3, use a clustering analysis algorithm to compute the similarity between the IDCT cepstrum coefficients obtained in step S2 and successively merge the two adjacent classes with the highest similarity, iterating until 24 classes remain. The resulting dynamically partitioned IDCT cepstrum coefficients are the speech features. The invention remedies the shortcoming of the prior art, which does not fully exploit the dynamic characteristics of speech when performing the frequency-domain transform, giving the invention wider applicability and higher recognition accuracy in speaker identification.

Description

Speech feature extraction algorithm based on dynamically partitioned inverse discrete cosine transform cepstrum coefficients
Technical field
The invention belongs to the field of speech feature extraction and applies unsupervised clustering analysis to speech feature extraction; in particular, it relates to a speech feature extraction algorithm based on dynamically partitioned inverse discrete cosine transform cepstrum coefficients.
Background technique
Speaker recognition technology comprises two parts: feature extraction and recognition modeling. Feature extraction is the key step in speaker recognition and directly determines the overall performance of the recognition system. In general, after a speech signal has been pre-processed by framing and windowing it yields high-dimensional data, so when speaker features are extracted the data dimension must be reduced by removing redundant information from the original speech. Existing methods use triangular filtering to convert the speech signal into feature vectors that meet the requirements on characteristic parameters, approximate the perceptual characteristics of human hearing, and to some extent enhance the speech signal while suppressing non-speech signals. Common characteristic parameters include: linear prediction coefficients (LPC), obtained by simulating the human phonation mechanism and analyzing a cascade model of short vocal-tract tubes; perceptual linear prediction (PLP) coefficients, based on an auditory model applied in spectral analysis, in which the input speech is processed by a model of human hearing in place of the time-domain signal used in linear predictive coding, and the all-pole model equivalent to LPC predicts the polynomial coefficients; Tandem and Bottleneck features, two classes of features extracted with neural networks; filter-bank (Fbank) features, which are equivalent to MFCC without the final discrete cosine transform and therefore retain more of the original speech data; and linear prediction cepstral coefficients (LPCC), which are based on a vocal-tract model, discard the excitation information of the signal-generation process, and represent the formants, an important characteristic, with a dozen or so cepstral coefficients.
MFCC is the most widely used speech characteristic parameter. It is extracted by first pre-processing the speech with pre-emphasis, framing, windowing, and the fast Fourier transform; the energy spectrum is then filtered by a bank of Mel-scale triangular filters, the log energy of each filter output is computed, and the MFCC coefficients are obtained through the discrete cosine transform (DCT); finally the Mel-scale cepstral parameters are computed and dynamic difference parameters are extracted, giving the Mel-frequency cepstral coefficients. In 2012, S. Al-Rawahy et al., drawing on the MFCC extraction method, divided the DCT cepstrum coefficients obtained after speech pre-processing into equal-width frequency regions and proposed the histogram-of-DCT-cepstrum-coefficients method. We found that equal-width frequency partitioning of the cepstrum coefficients ignores the dynamic characteristics of the speech data. On this basis the present invention proposes a new speech feature extraction algorithm: a method based on dynamically partitioned inverse discrete cosine transform cepstrum coefficients that, combining unsupervised learning, applies hierarchical clustering to group the speech data by the similarity of its dynamic characteristics, so as to extract dynamic feature vectors that better describe the speech.
In existing research, the most widely applied speech recognition technique uses MFCC as the speech feature vector, combined with machine-learning methods such as Gaussian mixture models (GMM), hidden Markov models (HMM), and support vector machines (SVM) for speaker pattern matching. The MFCC extraction process is: first pre-process the speech with pre-emphasis, framing, windowing, and the fast Fourier transform; then filter the energy spectrum with a bank of Mel-scale triangular filters; compute the log energy of each filter output and apply the discrete cosine transform (DCT) to obtain the MFCC coefficients; finally compute the Mel-scale cepstral parameters and extract the dynamic difference parameters, i.e. the Mel-frequency cepstral coefficients (MFCC).
In 2012, S. Al-Rawahy et al. identified the DCT cepstrum as a new feature and proposed a speech feature extraction algorithm based on equal-width frequency-domain DCT cepstrum coefficients. The pre-processed audio signal is transformed to the frequency domain, so that convolution in the time domain becomes multiplication of spectra; taking the logarithm turns the product into a sum, yielding the discrete cosine transform cepstrum coefficients (DCT cepstrum coefficients). The DCT cepstrum coefficients record the periodicity of the frequency range in non-linear increments: the frequency-domain feature intervals are divided every 50 Hz between 0 Hz and 600 Hz and every 100 Hz between 600 Hz and 1000 Hz, a process that can be regarded as counting the number of periods over the frequency ranges of the speech signal. The method is simpler and faster than MFCC feature extraction.
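For illustration, the fixed equal-width division described above can be sketched as follows. This is our own reading of the 50 Hz / 100 Hz scheme; the function name and the edge convention are assumptions, not part of the prior-art method as published.

```python
import numpy as np

def equal_width_edges():
    # Bin edges for the fixed division described above:
    # 50 Hz steps over 0-600 Hz, then 100 Hz steps over 600-1000 Hz.
    low = np.arange(0, 600, 50)        # 0, 50, ..., 550
    high = np.arange(600, 1001, 100)   # 600, 700, ..., 1000
    return np.concatenate([low, high])

edges = equal_width_edges()
# 12 bins of 50 Hz plus 4 bins of 100 Hz -> 17 edges in total
```

The dynamic partitioning proposed by the invention replaces this fixed grid with boundaries learned by clustering.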
Summary of the invention
The purpose of the present invention is to address the inaccurate frequency division in speech feature extraction algorithms based on equal-width frequency-domain partitioning of inverse discrete cosine transform cepstrum coefficients, by proposing a speech feature extraction algorithm based on dynamically partitioned inverse discrete cosine transform cepstrum coefficients. The technical means adopted by the present invention are as follows:
A speech feature extraction algorithm based on dynamically partitioned inverse discrete cosine transform cepstrum coefficients, comprising the following steps:
S1, pre-process the audio signal:
Apply pre-emphasis, framing, and windowing to the audio signal in turn;
Pre-processing eliminates the influence on audio quality of factors such as aliasing, higher-harmonic distortion, and high-frequency attenuation introduced by the human vocal organs themselves and by the equipment used to capture the audio signal. It ensures that the signal obtained in subsequent processing is more uniform and smooth, provides good parameters for speech feature extraction, and improves the quality of subsequent processing.
S2, transform the pre-processed audio signal from the time domain to the frequency domain:
Transform the pre-processed audio signal to the frequency domain, so that convolution in the time domain becomes multiplication of spectra; taking the logarithm turns the resulting product into a sum, yielding the inverse discrete cosine transform cepstrum coefficients (IDCT cepstrum coefficients). The specific process follows the formula:
C(q) = IDCT{ log |DCT{x(k)}| };
where DCT and IDCT denote the discrete cosine transform and the inverse discrete cosine transform respectively, x(k) is the input audio signal, i.e. the pre-processed audio signal, and C(q) is the output speech signal, i.e. the inverse discrete cosine transform cepstrum coefficients;
The inverse discrete cosine transform cepstrum coefficients form a data matrix. Because of the inherent frequency attribute of speech, all columns are attributes of the same kind during hierarchical clustering, so clustering proceeds by computing the similarity of adjacent columns.
S3, use a clustering analysis algorithm to compute the similarity between the inverse discrete cosine transform cepstrum coefficients obtained in step S2, and successively merge the two adjacent classes with the highest similarity; iterate this process until 24 classes remain. The resulting dynamically partitioned inverse discrete cosine transform cepstrum coefficients (DD-IDCT cepstrum coefficients) are the speech features.
The pre-emphasis is realized with a digital filter; the specific process follows the formula:
Y(n) = X(n) - aX(n-1);
where Y(n) is the output signal after pre-emphasis, X(n) is the input audio signal, a is the pre-emphasis coefficient, and n is the time index.
The average power spectrum of an audio signal is affected by glottal excitation and mouth-nose radiation: above roughly 800 Hz the high-frequency end falls off at about 6 dB/oct (per octave), so the higher the frequency, the smaller the corresponding component. The high-frequency part is therefore boosted before the audio signal is analyzed.
Speech analysis relies throughout on "short-time analysis". An audio signal is time-varying, but over a short interval (generally 10 to 30 ms) its characteristics remain essentially constant, i.e. relatively stable, so it can be regarded as a quasi-stationary process; in other words, the audio signal is short-time stationary. Any analysis and processing of an audio signal must therefore be built on a "short-time" basis: the signal is divided into segments, each of which is analyzed for its characteristic parameters. Each segment is called a "frame", with a frame length generally of 10 to 30 ms. For the whole audio signal, this yields a time series of characteristic parameters composed of the parameters of each frame.
The framing segments the pre-emphasized output signal into 20 ms frames.
Windowing is applied after framing. Its purpose is to make the speech signal more continuous globally and to avoid the Gibbs effect, so that the originally aperiodic speech signal exhibits some of the characteristics of a periodic function. The windowing uses a Hamming window.
The transform is the cepstrum transform.
The clustering analysis algorithm is hierarchical clustering.
The similarity computation uses the Euclidean distance.
Compared with the prior art, the present invention has the following advantages:
First, through an in-depth analysis of the properties of the speech feature extraction algorithm based on equal-width frequency-domain DCT cepstrum coefficients, the present invention remedies the prior art's failure to fully exploit the dynamic characteristics of speech in the frequency-domain transform, which gives the invention wider applicability and higher recognition accuracy in speaker identification.
Second, the present invention applies unsupervised clustering analysis to speech feature extraction, making the process concise and fast while occupying few computing resources.
Description of the drawings
To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from them without creative effort.
Fig. 1 is the flow chart of the speech feature extraction algorithm based on dynamically partitioned inverse discrete cosine transform cepstrum coefficients in a specific embodiment of the invention.
Fig. 2 is the clustering tree (dendrogram) in a specific embodiment of the invention.
Specific embodiment
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.
As shown in Fig. 1, a speech feature extraction algorithm based on dynamically partitioned inverse discrete cosine transform cepstrum coefficients comprises the following steps:
S1, pre-process the audio signal:
Apply pre-emphasis, framing, and windowing to the audio signal in turn;
The pre-emphasis is realized with a digital filter; the specific process follows the formula:
Y(n) = X(n) - aX(n-1);
where Y(n) is the output signal after pre-emphasis, X(n) is the input audio signal, a is the pre-emphasis coefficient, and n is the time index; here a takes the value 0.97.
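As a sketch, the pre-emphasis filter Y(n) = X(n) - aX(n-1) with a = 0.97 might be implemented as follows. The vectorized form and the choice to pass the first sample through unchanged are our own; the patent does not specify the boundary handling.

```python
import numpy as np

def preemphasis(x, a=0.97):
    # Y(n) = X(n) - a * X(n-1); Y(0) = X(0) by convention
    x = np.asarray(x, dtype=float)
    y = np.copy(x)
    y[1:] -= a * x[:-1]
    return y

y = preemphasis([1.0, 1.0, 1.0, 1.0])
# each sample after the first becomes 1 - 0.97 = 0.03,
# illustrating how the filter boosts changes (high frequencies)
```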
The framing segments the pre-emphasized output signal into 20 ms frames.
The windowing uses a Hamming window.
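The framing and windowing of S1 could be sketched like this. The 8 kHz sampling rate and the use of non-overlapping frames are assumptions for illustration; the patent fixes only the 20 ms frame length and the Hamming window.

```python
import numpy as np

def frame_and_window(signal, fs, frame_ms=20):
    # Split the pre-emphasized signal into 20 ms frames and apply a Hamming window
    n = int(fs * frame_ms / 1000)            # samples per frame
    num = len(signal) // n                   # whole frames only; the tail is dropped
    frames = np.reshape(np.asarray(signal[:num * n], dtype=float), (num, n))
    return frames * np.hamming(n)            # broadcast the window over every frame

frames = frame_and_window(np.ones(1600), fs=8000)  # 8 kHz -> 160-sample frames
```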
S2, transform the pre-processed audio signal from the time domain to the frequency domain:
Transform the pre-processed audio signal to the frequency domain, so that convolution in the time domain becomes multiplication of spectra; taking the logarithm turns the resulting product into a sum, yielding the inverse discrete cosine transform cepstrum coefficients (IDCT cepstrum coefficients). The specific process follows the formula:
C(q) = IDCT{ log |DCT{x(k)}| };
where DCT and IDCT denote the discrete cosine transform and the inverse discrete cosine transform respectively, x(k) is the input audio signal, i.e. the pre-processed audio signal, and C(q) is the output speech signal, i.e. the inverse discrete cosine transform cepstrum coefficients; the transform is the cepstrum transform.
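The transform of S2, C(q) = IDCT{log|DCT{x(k)}|}, can be sketched per frame with SciPy's DCT routines. The orthonormal normalization and the small epsilon guarding against log(0) are our additions; the patent does not specify either.

```python
import numpy as np
from scipy.fft import dct, idct

def idct_cepstrum(frame, eps=1e-10):
    # C(q) = IDCT( log |DCT( x(k) )| ), computed on one windowed frame
    spectrum = np.abs(dct(frame, norm='ortho'))
    return idct(np.log(spectrum + eps), norm='ortho')

frame = np.sin(2 * np.pi * np.arange(160) / 16)  # a toy 160-sample frame
c = idct_cepstrum(frame)
```

Stacking the coefficient vectors of all frames (and speakers) gives the data matrix A on which step S3 operates.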
S3, use a clustering analysis algorithm to compute the similarity between the inverse discrete cosine transform cepstrum coefficients obtained in step S2, and successively merge the two adjacent classes with the highest similarity; iterate this process until 24 classes remain. The resulting dynamically partitioned inverse discrete cosine transform cepstrum coefficients are the speech features. The specific steps are as follows:
Matrix A represents the n-dimensional inverse discrete cosine transform cepstrum coefficients of m speakers obtained in step S2. As shown in Fig. 2, each column vector V1, V2, …, Vn of the inverse discrete cosine transform cepstrum coefficients is regarded as one of n classes, and the Euclidean distance between Vi and Vj is Dis(Vi, Vj) = sqrt( Σk (Vik - Vjk)² ). The specific steps of the clustering analysis are as follows:
First clustering:
l1 = Dis(V1, V2)
l2 = Dis(V2, V3)
…
ln-1 = Dis(Vn-1, Vn)
If i = arg min(l1, l2, l3, …, ln-1), the clustering result is
(V1), (V2), …, (Vi + Vi+1), …, (Vn).
Update:
li-1 = Dis(Vi-1, (Vi + Vi+1))
li = Dis((Vi + Vi+1), Vi+2)
li+1 = li+2
…
ln-2 = ln-1
Delete ln-1.
Second clustering:
If j = arg min(l1, l2, l3, …, ln-2), the clustering result is
(V1), (V2), …, (Vi + Vi+1), …, (Vj + Vj+1), …, (Vn).
Update again:
lj-1 = Dis(Vj-1, (Vj + Vj+1))
lj = Dis((Vj + Vj+1), Vj+2)
lj+1 = lj+2
…
ln-3 = ln-2
Delete ln-2.
Proceeding in this way, the hierarchical clustering continues until the final result is 24 classes; the resulting dynamically partitioned inverse discrete cosine transform cepstrum coefficients are the speech features. These features are fed into a GMM model for recognition to judge the feasibility of the algorithm.
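The merging procedure above can be sketched as follows. Representing a merged class by the sum of its columns matches the (Vi + Vi+1) notation in the steps, and treating the columns of A as the class vectors is our reading of the coefficient matrix; both are assumptions rather than details the patent spells out.

```python
import numpy as np

def cluster_adjacent_columns(A, target=24):
    # Start with each column as its own class, then repeatedly merge the
    # adjacent pair at minimum Euclidean distance until `target` classes remain.
    groups = [A[:, j].astype(float) for j in range(A.shape[1])]
    while len(groups) > target:
        d = [np.linalg.norm(groups[k] - groups[k + 1])   # lk = Dis(Vk, Vk+1)
             for k in range(len(groups) - 1)]
        i = int(np.argmin(d))                            # i = arg min(l1, ..., ln-1)
        groups[i] = groups[i] + groups[i + 1]            # merge into (Vi + Vi+1)
        del groups[i + 1]                                # distances re-derived next pass
    return np.stack(groups, axis=1)

A = np.random.default_rng(0).normal(size=(5, 40))        # toy coefficient matrix
features = cluster_adjacent_columns(A, target=24)        # 40 columns -> 24 classes
```

Recomputing all adjacent distances each pass is simpler than the in-place index updates of the steps above, at the cost of some redundant work.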
The clustering analysis algorithm is hierarchical clustering.
The similarity computation uses the Euclidean distance.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some or all of their technical features can be replaced by equivalents; such modifications or replacements do not take the essence of the corresponding technical solutions outside the scope of the technical solutions of the embodiments of the present invention.

Claims (7)

1. A speech feature extraction algorithm based on dynamically partitioned inverse discrete cosine transform cepstrum coefficients, characterized by the following steps:
S1, pre-process the audio signal:
Apply pre-emphasis, framing, and windowing to the audio signal in turn;
S2, transform the pre-processed audio signal from the time domain to the frequency domain:
Transform the pre-processed audio signal to the frequency domain, so that convolution in the time domain becomes multiplication of spectra; taking the logarithm turns the resulting product into a sum, yielding the inverse discrete cosine transform cepstrum coefficients. The specific process follows the formula
C(q) = IDCT{ log |DCT{x(k)}| };
where DCT and IDCT denote the discrete cosine transform and the inverse discrete cosine transform respectively, x(k) is the input audio signal, i.e. the pre-processed audio signal, and C(q) is the output speech signal, i.e. the inverse discrete cosine transform cepstrum coefficients;
S3, use a clustering analysis algorithm to compute the similarity between the inverse discrete cosine transform cepstrum coefficients obtained in step S2, and successively merge the two adjacent classes with the highest similarity; iterate this process until 24 classes remain; the resulting dynamically partitioned inverse discrete cosine transform cepstrum coefficients are the speech features.
2. The extraction algorithm according to claim 1, characterized in that the pre-emphasis is realized with a digital filter; the specific process follows the formula:
Y(n) = X(n) - aX(n-1);
where Y(n) is the output signal after pre-emphasis, X(n) is the input audio signal, a is the pre-emphasis coefficient, and n is the time index.
3. The extraction algorithm according to claim 1, characterized in that the framing segments the pre-emphasized output signal into 20 ms frames.
4. The extraction algorithm according to claim 1, characterized in that the windowing uses a Hamming window.
5. The extraction algorithm according to claim 1, characterized in that the transform is the cepstrum transform.
6. The extraction algorithm according to claim 1, characterized in that the clustering analysis algorithm is hierarchical clustering.
7. The extraction algorithm according to claim 1, characterized in that the similarity computation uses the Euclidean distance.
CN201910087494.4A 2019-01-29 2019-01-29 Sound characteristic extraction algorithm based on dynamic segmentation inverse discrete cosine transform cepstrum coefficient Active CN109767756B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910087494.4A CN109767756B (en) 2019-01-29 2019-01-29 Sound characteristic extraction algorithm based on dynamic segmentation inverse discrete cosine transform cepstrum coefficient
JP2019186806A JP6783001B2 (en) 2019-01-29 2019-10-10 Speech feature extraction algorithm based on dynamic division of cepstrum coefficients of inverse discrete cosine transform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910087494.4A CN109767756B (en) 2019-01-29 2019-01-29 Sound characteristic extraction algorithm based on dynamic segmentation inverse discrete cosine transform cepstrum coefficient

Publications (2)

Publication Number Publication Date
CN109767756A true CN109767756A (en) 2019-05-17
CN109767756B CN109767756B (en) 2021-07-16

Family

ID=66455625

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910087494.4A Active CN109767756B (en) 2019-01-29 2019-01-29 Sound characteristic extraction algorithm based on dynamic segmentation inverse discrete cosine transform cepstrum coefficient

Country Status (2)

Country Link
JP (1) JP6783001B2 (en)
CN (1) CN109767756B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110197657A (en) * 2019-05-22 2019-09-03 大连海事大学 A kind of dynamic speech feature extracting method based on cosine similarity
CN110299134A (en) * 2019-07-01 2019-10-01 中科软科技股份有限公司 A kind of audio-frequency processing method and system
CN110488675A (en) * 2019-07-12 2019-11-22 国网上海市电力公司 A kind of substation's Abstraction of Sound Signal Characteristics based on dynamic time warpping algorithm
CN112180762A (en) * 2020-09-29 2021-01-05 瑞声新能源发展(常州)有限公司科教城分公司 Nonlinear signal system construction method, apparatus, device and medium
CN112581939A (en) * 2020-12-06 2021-03-30 中国南方电网有限责任公司 Intelligent voice analysis method applied to power dispatching normative evaluation
CN113449626A (en) * 2021-06-23 2021-09-28 中国科学院上海高等研究院 Hidden Markov model vibration signal analysis method and device, storage medium and terminal

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112669874B (en) * 2020-12-16 2023-08-15 西安电子科技大学 Speech feature extraction method based on quantum Fourier transform
CN113793614B (en) * 2021-08-24 2024-02-09 南昌大学 Speech feature fusion speaker recognition method based on independent vector analysis
CN114783462A (en) * 2022-05-11 2022-07-22 安徽理工大学 Mine hoist fault source positioning analysis method based on CS-MUSIC

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101458950A (en) * 2007-12-14 2009-06-17 安凯(广州)软件技术有限公司 Method for eliminating interference from A/D converter noise to digital recording
US9606530B2 (en) * 2013-05-17 2017-03-28 International Business Machines Corporation Decision support system for order prioritization
CN106971712A (en) * 2016-01-14 2017-07-21 芋头科技(杭州)有限公司 A kind of adaptive rapid voiceprint recognition methods and system
CN107293308A (en) * 2016-04-01 2017-10-24 腾讯科技(深圳)有限公司 A kind of audio-frequency processing method and device
CN109065071A (en) * 2018-08-31 2018-12-21 电子科技大学 A kind of song clusters method based on Iterative k-means Algorithm
CN109256127A (en) * 2018-11-15 2019-01-22 江南大学 A kind of Robust feature extracting method based on non-linear power transformation Gammachirp filter

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101458950A (en) * 2007-12-14 2009-06-17 安凯(广州)软件技术有限公司 Method for eliminating interference from A/D converter noise to digital recording
US9606530B2 (en) * 2013-05-17 2017-03-28 International Business Machines Corporation Decision support system for order prioritization
CN106971712A (en) * 2016-01-14 2017-07-21 芋头科技(杭州)有限公司 A kind of adaptive rapid voiceprint recognition methods and system
CN107293308A (en) * 2016-04-01 2017-10-24 腾讯科技(深圳)有限公司 A kind of audio-frequency processing method and device
CN109065071A (en) * 2018-08-31 2018-12-21 电子科技大学 A kind of song clusters method based on Iterative k-means Algorithm
CN109256127A (en) * 2018-11-15 2019-01-22 江南大学 A kind of Robust feature extracting method based on non-linear power transformation Gammachirp filter

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
S.AL-RAWAHY ET AL.: "《Text-independent speaker identification system based on the histogram of DCT-cepstrum coefficients》", 《INTERNATIONAL JOURNAL OF KNOWLEDGE-BASED IN INTELLIGENT ENGINEERING SYSTEMS》 *
WEI HAN ET AL.: "《An efficient MFCC extraction method in speech recognition》", 《ISCAS 2006》 *
TIAN HUIPING: "Evaluation method combining the analytic hierarchy process with cluster analysis", East China Economic Management *
MIAO YUANWU: "Data analysis based on hierarchical clustering", China Masters' Theses Full-text Database, Information Science and Technology *
HU WENJING: "Change-point identification method based on hierarchical cluster analysis", China Masters' Theses Full-text Database, Information Science and Technology *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110197657A (en) * 2019-05-22 2019-09-03 大连海事大学 A kind of dynamic speech feature extracting method based on cosine similarity
CN110197657B (en) * 2019-05-22 2022-03-11 大连海事大学 Dynamic sound feature extraction method based on cosine similarity
CN110299134A (en) * 2019-07-01 2019-10-01 中科软科技股份有限公司 A kind of audio-frequency processing method and system
CN110488675A (en) * 2019-07-12 2019-11-22 国网上海市电力公司 A kind of substation's Abstraction of Sound Signal Characteristics based on dynamic time warpping algorithm
CN112180762A (en) * 2020-09-29 2021-01-05 瑞声新能源发展(常州)有限公司科教城分公司 Nonlinear signal system construction method, apparatus, device and medium
CN112581939A (en) * 2020-12-06 2021-03-30 中国南方电网有限责任公司 Intelligent voice analysis method applied to power dispatching normative evaluation
CN113449626A (en) * 2021-06-23 2021-09-28 中国科学院上海高等研究院 Hidden Markov model vibration signal analysis method and device, storage medium and terminal
CN113449626B (en) * 2021-06-23 2023-11-07 中国科学院上海高等研究院 Method and device for analyzing vibration signal of hidden Markov model, storage medium and terminal

Also Published As

Publication number Publication date
JP2020140193A (en) 2020-09-03
JP6783001B2 (en) 2020-11-11
CN109767756B (en) 2021-07-16

Similar Documents

Publication Publication Date Title
CN109767756A (en) A kind of speech feature extraction algorithm based on dynamic partition inverse discrete cosine transform cepstrum coefficient
CN112017644B (en) Sound transformation system, method and application
CN103928023B (en) A kind of speech assessment method and system
Deshwal et al. Feature extraction methods in language identification: a survey
Kumar et al. Design of an automatic speaker recognition system using MFCC, vector quantization and LBG algorithm
Ali et al. Automatic speech recognition technique for Bangla words
CN110942766A (en) Audio event detection method, system, mobile terminal and storage medium
Ryant et al. Highly accurate mandarin tone classification in the absence of pitch information
Linh et al. MFCC-DTW algorithm for speech recognition in an intelligent wheelchair
Nawas et al. Speaker recognition using random forest
Goyal et al. A comparison of Laryngeal effect in the dialects of Punjabi language
CN114283822A (en) Many-to-one voice conversion method based on gamma pass frequency cepstrum coefficient
CN114495969A (en) Voice recognition method integrating voice enhancement
Rabiee et al. Persian accents identification using an adaptive neural network
Sinha et al. Empirical analysis of linguistic and paralinguistic information for automatic dialect classification
Nancy et al. Audio based emotion recognition using Mel frequency Cepstral coefficient and support vector machine
Luo et al. Emotional Voice Conversion Using Neural Networks with Different Temporal Scales of F0 based on Wavelet Transform.
Gaudani et al. Comparative study of robust feature extraction techniques for ASR for limited resource Hindi language
Deiv et al. Automatic gender identification for hindi speech recognition
Lekshmi et al. An acoustic model and linguistic analysis for Malayalam disyllabic words: a low resource language
Tailor et al. Deep learning approach for spoken digit recognition in Gujarati language
Muthamizh Selvan et al. Spectral histogram of oriented gradients (SHOGs) for Tamil language male/female speaker classification
Bansal et al. Automatic speech recognition by cuckoo search optimization based artificial neural network classifier
Laleye et al. Automatic text-independent syllable segmentation using singularity exponents and rényi entropy
CN109979481A (en) A kind of speech feature extraction algorithm of the dynamic partition inverse discrete cosine transform cepstrum coefficient based on related coefficient

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant