CN104992707A - Cleft palate voice glottal stop automatic identification algorithm and device - Google Patents


Info

Publication number
CN104992707A
Authority
CN
China
Legal status: Pending (assumed; not a legal conclusion)
Application number
CN201510257555.9A
Other languages
Chinese (zh)
Inventor
何凌
谭洁
尹恒
刘奇
郭春丽
严苗
Current Assignee
Sichuan University
Original Assignee
Sichuan University
Application filed by Sichuan University
Priority to CN201510257555.9A
Publication of CN104992707A
Legal status: Pending


Abstract

The invention discloses an automatic identification algorithm and device for glottal stops in cleft palate speech, relates to the technical field of speech analysis and recognition, and aims to provide an automatic glottal stop identification method and device. A computer is used to automatically identify glottal stops in cleft palate speech, providing patients and speech therapists with effective and objective auxiliary diagnosis, and facilitating the wide adoption of cleft palate speech assessment and speech therapy. According to the key technical points of the invention, the method comprises the steps of: 1, collecting the syllable speech signal to be tested; 2, performing initial/final segmentation on the syllable speech signal and retaining the initial speech signal; 3, extracting feature values of the initial speech signal; and 4, feeding the feature values into a trained recognition model, which judges from the feature values whether a glottal stop is present in the syllable speech signal.

Description

Automatic identification algorithm and device for glottal stops in cleft palate speech
Technical field
The present invention relates to the technical field of speech analysis and recognition, and in particular to an automatic identification algorithm and device for glottal stops in cleft palate speech.
Background technology
Cleft lip and palate are the most common congenital craniofacial anomalies, and China has the largest cleft population in the world. Unlike cleft lip, whose main impact is the defect in facial appearance, cleft palate causes defects and deformities of the palatal bone and soft tissue of varying degree, leading to dysfunctions of speech and language, sucking, and feeding that severely affect the patient's quality of life. Even after primary palatoplasty, a large number of patients still have speech disorders of varying degree. The treatment of cleft palate speech disorders is an important step in the sequential treatment of cleft palate.
At present, the assessment of cleft palate speech is realized through the perceptual judgment of professional speech therapists, a method affected by factors such as the therapist's clinical experience and subjective state.
The clinical manifestations of cleft palate speech mainly comprise resonance disorders and articulation disorders. The main clinical manifestations of resonance disorders are hypernasality, nasal emission, and the like; the main clinical manifestations of articulation disorders are consonant omission, compensation, weakening, and substitution. Among these, compensatory articulation is one of the most common abnormal articulation patterns in cleft palate patients. Its mechanism is that when a cleft palate patient produces a consonant, oral airflow is diverted into the nasal cavity through the incompletely closed velopharyngeal port, producing nasal emission and insufficient intraoral pressure, so the patient learns a compensatory pronunciation that exploits the airflow in the pharyngeal cavity before it reaches the oral cavity. The glottal stop is the most common clinical form of compensatory articulation; it has the greatest impact on speech intelligibility and can occur on all pressure consonants. Perceptually, the patient's voice quality is "hard and short" and indistinct, and in the long term it can cause vocal cord thickening, vocal nodules, and a rough, hoarse voice. Because compensatory articulation is closely related to velopharyngeal function and directly reflects its degree, its accurate evaluation has important clinical significance.
Summary of the invention
The technical problem to be solved by the present invention is, in view of the above problems, to provide an automatic glottal stop identification method and device that uses a computer to automatically identify glottal stops in cleft palate speech, providing patients and speech therapists with effective and objective auxiliary diagnosis and facilitating the wide adoption of cleft palate speech assessment and speech therapy.
The cleft palate speech glottal stop automatic identification algorithm provided by the invention comprises:
Step 1: collecting the syllable speech signal to be tested;
Step 2: performing initial/final segmentation on the syllable speech signal and retaining the initial speech signal;
Step 3: extracting feature values of the initial speech signal;
Step 4: feeding the feature values into a trained recognition model, the recognition model judging from the feature values whether a glottal stop is present in the syllable speech signal.
Step 2 further comprises:
Step 21: windowing and framing the syllable speech signal to obtain speech frames x_i[n], i = 1, 2, ..., M;
Step 22: computing the short-time energy E_i and short-time zero-crossing rate Z_i of each speech frame;
Step 23: computing the energy difference and zero-crossing-rate difference of adjacent frames: e(i) = E_{i+1} - E_i and z(i) = Z_{i+1} - Z_i, i = 1, 2, ..., M-1;
Step 24: comparing each energy difference e(i) with a threshold T1 and each zero-crossing-rate difference z(i) with a threshold T2; when e(i) >= T1 and z(i) <= T2 hold simultaneously, denoting that frame index i by I; the speech frames x_i[n], i = 1, 2, ..., I, are then taken as the initial speech signal of the syllable speech signal.
The initial speech signal feature values extracted in step 3 comprise one or more of the following: the spectral energy concentration feature, the MFCC acoustic feature, the critical-band short-time power spectrum feature, the wavelet transform and information entropy feature, and the wavelet packet transform and information entropy feature; wherein,
Extracting the spectral energy concentration feature of the initial speech signal: the first to fifth spectral energy concentration values of every initial speech frame are computed; the mean of the first value over all initial speech frames is taken as the first spectral energy concentration feature of the initial speech signal, and by analogy the second to fifth spectral energy concentration features of the initial speech signal are computed;
Extracting the MFCC acoustic feature of the initial speech signal: the MFCC acoustic feature of every initial speech frame is computed with 12 MFCC coefficients, giving 12 MFCC values per initial speech frame; the mean of the first MFCC value over all initial speech frames is taken as the first MFCC feature of the initial speech signal, and by analogy the second to twelfth MFCC features of the initial speech signal are computed;
Extracting the critical-band short-time power spectrum feature of the initial speech signal: a short-time Fourier transform is applied to every initial speech frame to obtain its short-time power spectrum; according to the critical-band partition rule, the short-time power spectrum of every initial speech frame is divided into 20 critical bands; the power in the first critical band is summed over all initial speech frames to obtain the first critical-band short-time power spectrum feature of the initial speech signal, and by analogy the second to twentieth critical-band short-time power spectrum features are obtained;
Extracting the wavelet transform and information entropy feature of the initial speech signal: a three-level wavelet transform is applied to every initial speech frame, the signals after the three-level wavelet decomposition are reconstructed to obtain 4 reconstructed signals, and the information entropy of each reconstructed signal is computed; the mean of the information entropy of the first reconstructed signal over all initial speech frames is taken as the first wavelet transform and information entropy feature of the initial speech signal, and by analogy the second to fourth wavelet transform and information entropy features are computed;
Extracting the wavelet packet transform and information entropy feature of the initial speech signal: a three-level wavelet packet transform is applied to every initial speech frame, the signals after the three-level wavelet packet decomposition are reconstructed to obtain 8 reconstructed signals, and the information entropy of each reconstructed signal is computed; the mean of the information entropy of the first reconstructed signal over all initial speech frames is taken as the first wavelet packet transform and information entropy feature of the initial speech signal, and by analogy the second to eighth wavelet packet transform and information entropy features are computed.
Step 4 further comprises:
Choosing a number of syllable speech signals known to contain glottal stops to form a positive training sample set, and choosing a number of syllable speech signals known not to contain glottal stops to form a negative training sample set;
Extracting, for each sample of the two training sample sets, the spectral energy concentration feature, MFCC acoustic feature, critical-band short-time power spectrum feature, wavelet transform and information entropy feature, and wavelet packet transform and information entropy feature;
Obtaining the initial speech signal feature values of the syllable speech signal to be tested from step 3;
Computing the distance between the initial speech signal feature values of this syllable speech signal to be tested and each training sample:

D_1 = \sum_{l=1}^{5} a (x_l - y_l)^2 + \sum_{l=6}^{17} b (x_l - y_l)^2 + \sum_{l=18}^{37} c (x_l - y_l)^2 + \sum_{l=38}^{41} d (x_l - y_l)^2 + \sum_{l=42}^{49} e (x_l - y_l)^2

Choosing the several training samples whose feature-value distance from the syllable speech signal to be tested is shortest; when most of them belong to the positive training sample set, the syllable speech signal to be tested is judged to contain a glottal stop;
wherein: x_l, l = 1 ~ 5, are the first to fifth spectral energy concentration features of the syllable speech signal to be tested;
x_l, l = 6 ~ 17, are the first to twelfth MFCC acoustic features of the syllable speech signal to be tested;
x_l, l = 18 ~ 37, are the first to twentieth critical-band short-time power spectrum features of the syllable speech signal to be tested;
x_l, l = 38 ~ 41, are the first to fourth wavelet transform and information entropy features of the syllable speech signal to be tested;
x_l, l = 42 ~ 49, are the first to eighth wavelet packet transform and information entropy features of the syllable speech signal to be tested;
y_l, l = 1 ~ 5, are the first to fifth spectral energy concentration features of a training sample;
y_l, l = 6 ~ 17, are the first to twelfth MFCC acoustic features of a training sample;
y_l, l = 18 ~ 37, are the first to twentieth critical-band short-time power spectrum features of a training sample;
y_l, l = 38 ~ 41, are the first to fourth wavelet transform and information entropy features of a training sample;
y_l, l = 42 ~ 49, are the first to eighth wavelet packet transform and information entropy features of a training sample;
a, b, c, d, e are weights.
Preferably, the method for obtaining the values of the weights comprises:
Choosing a number of syllable speech signals known to contain glottal stops to form a positive sample space, and choosing a number of syllable speech signals known not to contain glottal stops to form a negative sample space;
Extracting, for each sample of the two sample spaces, the spectral energy concentration feature, MFCC acoustic feature, critical-band short-time power spectrum feature, wavelet transform and information entropy feature, and wavelet packet transform and information entropy feature;
With the spectral energy concentration features of the samples of the two sample spaces as the samples of a KNN recognition model, the recognition accuracy of that KNN model is a;
With the MFCC acoustic features of the samples of the two sample spaces as the samples of a KNN recognition model, the recognition accuracy of that KNN model is b;
With the critical-band short-time power spectrum features of the samples of the two sample spaces as the samples of a KNN recognition model, the recognition accuracy of that KNN model is c;
With the wavelet transform and information entropy features of the samples of the two sample spaces as the samples of a KNN recognition model, the recognition accuracy of that KNN model is d;
With the wavelet packet transform and information entropy features of the samples of the two sample spaces as the samples of a KNN recognition model, the recognition accuracy of that KNN model is e.
In summary, owing to the adoption of the above technical scheme, the beneficial effects of the invention are:
1. The invention realizes automatic computer identification of glottal stops in cleft palate speech.
2. An improved KNN classification model is proposed, with recognition accuracy up to 93.1%.
Brief description of the drawings
Examples of the present invention will be described with reference to the accompanying drawings, in which:
Fig. 1 is the algorithm flow chart of the present invention.
Fig. 2 is the flow chart of critical-band short-time power spectrum feature extraction in the present invention.
Fig. 3 is the flow chart of wavelet/wavelet packet transform and information entropy feature extraction in the present invention.
Fig. 4 is the tree structure diagram of the three-level wavelet transform in the present invention.
Fig. 5 is the flow chart of computing the wavelet transform and information entropy feature for each speech frame in the present invention.
Fig. 6 is the tree structure diagram of the three-level wavelet packet transform in the present invention.
Fig. 7 is the flow chart of computing the wavelet packet transform and information entropy feature for each speech frame in the present invention.
Embodiment
All features disclosed in this specification, or the steps of all methods or processes disclosed, may be combined in any manner, except for mutually exclusive features and/or steps.
Any feature disclosed in this specification, unless specifically stated otherwise, may be replaced by an equivalent or alternative feature serving a similar purpose. That is, unless specifically stated otherwise, each feature is only one example of a series of equivalent or similar features.
As shown in Fig. 1, the input cleft palate speech is first preprocessed by framing and windowing. Because glottal stops occur only in the initial portion of a syllable, the algorithm first performs initial/final segmentation, and the automatic identification algorithm operates only on the speech frames of the initial portion.
Feature extraction is then performed on the speech signal of the initial portion.
In this algorithm, the pattern recognition classifiers adopt the K-nearest-neighbor (KNN: k-Nearest Neighbor) classification algorithm, an improved KNN classification algorithm, and the support vector machine (SVM: Support Vector Machines) classification algorithm to realize automatic identification of the two classes of speech signal, with or without glottal stop.
The automatic recognition systems based on KNN, improved KNN, and SVM each divide into two major parts: model training and testing. In the training stage, speech signals known to contain or not contain glottal stops are preprocessed and their acoustic feature values extracted; these feature values serve as training samples to train the pattern recognition classifier (respectively: the KNN, improved KNN, or SVM classifier) so that it acquires recognition capability. In the testing stage, the same acoustic feature values are extracted from the preprocessed input speech signal under test, and the trained recognition model automatically discriminates between the two classes, with or without glottal stop.
The implementation of each step is elaborated below:
1 Framing and windowing of the speech signal
The production of a speech signal depends on the coordinated action of the vocal organs and yields a quasi-periodic vibration signal. A speech signal is a non-stationary random signal, but it is generally considered to be short-time stationary over a range of about 10 ~ 30 ms.
In cleft palate speech, glottal stops occur only in the initial portion. In this algorithm, the initial and final of a syllable are segmented to obtain the speech signal of the initial portion, and the automatic identification algorithm operates only on the initial speech signal. Most initials are short in duration: under normal circumstances, the duration of unaspirated stops lies in the range 0 ~ 32 ms; the duration of fricatives lies between 90 ms ~ 220.3 ms; the durations of unaspirated affricates, aspirated stops, and aspirated affricates lie between 0 ~ 220.3 ms; and the duration of voiced initials lies between 0 ~ 124 ms. Considering that some initials are very short, the duration of each speech frame is chosen as 10 ms, with a frame shift of 1/2 frame length.
The analysis window adopted in this algorithm is the Hamming window; in the time domain, the speech signal is multiplied by the window function to obtain the framed, windowed signal. Since the sampling frequency of the speech signal is 16000 Hz, each frame is 160 samples long and the frame shift is 80 samples.
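For reference, a minimal framing and windowing sketch in Python (NumPy) under the stated parameters (10 ms frames, 50% frame shift, Hamming window, 16 kHz sampling); function and argument names are illustrative, not from the patent:

```python
import numpy as np

def frame_signal(x, fs=16000, frame_ms=10, hop_ratio=0.5):
    """Split a speech signal into overlapping Hamming-windowed frames:
    10 ms frames (160 samples at 16 kHz) with a 50% (80-sample) shift."""
    frame_len = int(fs * frame_ms / 1000)   # 160 samples
    hop = int(frame_len * hop_ratio)        # 80 samples
    window = np.hamming(frame_len)
    n_frames = (len(x) - frame_len) // hop + 1
    frames = np.stack([x[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    return frames                           # shape: (M, 160)
```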
2 Segmentation of initials and finals
In Mandarin, a Chinese character is pronounced as one syllable. A complete syllable comprises an initial part and a final part. Initials are consonants and, by manner of articulation, divide into stops, affricates, fricatives, nasals, and laterals; Mandarin has 21 initials, most of which are voiceless, only some being voiced. Finals are composed of simple and compound vowels; vowel production involves vocal cord vibration and belongs to voiced sound.
Because the articulation characteristics of initials and finals are distinctly different, the algorithm performs initial/final segmentation at the abrupt-change points of the short-time energy and short-time zero-crossing rate parameters, the abrupt-change point being the initial/final boundary. The algorithm steps are as follows:
(1) Let the input speech signal of one Chinese character be x, with total signal length L. Frame and window this speech signal with a frame length of 10 ms (160 samples) and a frame shift of 5 ms (80 samples), obtaining the speech frames x_i[n], n = 1, 2, ..., 160, i = 1, 2, ..., M, where the number of frames M follows from L, the frame length, and the frame shift (floor, denoting rounding down, is used in computing M).
(2) For every speech frame x_i[n], compute the short-time energy E_i and the short-time zero-crossing rate Z_i:

E_i = \sum_{n=1}^{160} x_i^2[n];

Z_i = \frac{1}{2} \sum_{n=1}^{160} \left| \mathrm{sgn}(x_i[n]) - \mathrm{sgn}(x_i[n-1]) \right|;

where sgn is the sign function, that is:

\mathrm{sgn}(c) = \begin{cases} 1, & c \geq 0 \\ -1, & c < 0 \end{cases}
(3) Compute the energy difference e(i) and zero-crossing-rate difference z(i) of adjacent frames, as follows:

e(i) = E_{i+1} - E_i, i = 1, 2, ..., M-1
z(i) = Z_{i+1} - Z_i, i = 1, 2, ..., M-1

Compare each value of the energy difference e(i) and the zero-crossing-rate difference z(i) with the thresholds T1 and T2. When e(i) >= T1 and z(i) <= T2 hold simultaneously, denote that i by I. Then frames I and I+1 straddle the initial/final boundary of the speech signal, and the first I frames of the speech signal constitute the initial portion of the syllable. The values of T1 and T2 were determined empirically over many experiments: T1 = 0.015, T2 = 8.
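A minimal sketch of this segmentation step, assuming frames produced by the framing sketch above and signal amplitudes normalized so that the empirical thresholds apply:

```python
import numpy as np

def split_initial(frames, T1=0.015, T2=8):
    """Initial/final segmentation: per-frame short-time energy and
    zero-crossing rate, adjacent-frame differences, and the first frame
    index I where e(i) >= T1 and z(i) <= T2 hold simultaneously."""
    E = np.sum(frames ** 2, axis=1)                          # E_i
    sgn = np.where(frames >= 0, 1, -1)
    Z = 0.5 * np.sum(np.abs(np.diff(sgn, axis=1)), axis=1)   # Z_i
    e = np.diff(E)                                           # e(i) = E_{i+1} - E_i
    z = np.diff(Z)                                           # z(i) = Z_{i+1} - Z_i
    hits = np.where((e >= T1) & (z <= T2))[0]
    I = hits[0] + 1 if hits.size else len(frames)            # boundary frame index
    return frames[:I]                                        # initial-portion frames
```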
3 Feature extraction
3.1 Spectral energy concentration acoustic feature F
The articulators of cleft palate patients are normal; cleft palate speech arises mainly in the resonance system. Under the classic source-filter model, the sound source excitation system of cleft palate patients is normal, and the abnormality of phonation occurs at the vocal tract filter and the oral radiation. Formant parameters are the typical acoustic features of the vocal tract filter system, and formants are important parameters characterizing vowels; since it is the initial (consonant) of the syllable that is processed here, the spectral energy concentration is adopted here as the acoustic feature of the initial. The spectral energy concentration parameter is similar in physical meaning to the formant parameter, and its computation method is the same. Linear predictive coding (LPC: Linear Predictive Coding) is adopted here to estimate the first to fifth spectral energy concentrations. Using the initial/final segmentation algorithm of the previous section, the initial speech signal x_i[n], i = 1, 2, ..., I, is obtained. For every initial speech frame x_i[n], the first to fifth spectral energy concentrations F_i = [f_{i,1}, f_{i,2}, f_{i,3}, f_{i,4}, f_{i,5}], i = 1, 2, ..., I, are computed. Averaging each of the first to fifth spectral energy concentrations over all speech frames of the initial portion gives the spectral energy concentration feature of the initial speech signal:

F = [f_1, f_2, f_3, f_4, f_5].
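A sketch of the per-frame LPC estimation, in the style of formant estimation from LPC pole angles; the LPC order of 12 is an assumption, as the patent only states that LPC is used:

```python
import numpy as np
import librosa

def spectral_peaks_lpc(frame, fs=16000, order=12, n_peaks=5):
    """Estimate the first five spectral energy concentrations of one frame
    from the angles of the LPC polynomial roots (formant-like peaks)."""
    a = librosa.lpc(frame.astype(float), order=order)    # LPC coefficients
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]                    # upper half-plane only
    freqs = np.sort(np.angle(roots) * fs / (2 * np.pi))  # pole angles -> Hz
    return freqs[:n_peaks]                               # [f_{i,1} ... f_{i,5}]
```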
3.2 MFCC acoustic feature
Mel-frequency cepstral coefficients (MFCC: Mel-Frequency Cepstral Coefficients) are based on the auditory characteristics of the human ear. The MFCC acoustic feature realizes the separation of the sound source excitation signal from the vocal tract response through homomorphic processing of the speech signal. In this algorithm, the number of MFCC coefficients is chosen as 12.
Using the initial/final segmentation algorithm of the previous section, the initial speech signal x_i[n], i = 1, 2, ..., I, is obtained. For every initial speech frame x_i[n], the MFCC feature M_i = [m_{i,1}, m_{i,2}, ..., m_{i,12}], i = 1, 2, ..., I, is computed. Averaging the MFCC parameters over all speech frames of the initial portion gives the MFCC acoustic feature of the initial speech signal:

M = [m_1, m_2, \dots, m_{12}].
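A sketch using librosa on the initial-portion signal, with the text's 10 ms frames and 5 ms hop; the mel filterbank size (n_mels=26) and other librosa settings are assumptions not given in the patent:

```python
import numpy as np
import librosa

def initial_mfcc(initial_signal, fs=16000):
    """12-coefficient MFCCs per frame, averaged over all initial frames."""
    mfcc = librosa.feature.mfcc(y=initial_signal.astype(float), sr=fs,
                                n_mfcc=12, n_fft=160, hop_length=80,
                                n_mels=26)
    return mfcc.mean(axis=1)    # M = [m_1, ..., m_12]
```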
3.3 Acoustic feature PSCB based on critical bands and the short-time power spectrum
This algorithm proposes an acoustic feature based on critical bands and short-time power (PSCB: Power Spectrum in Critical Bands). Its algorithm flow is shown in Fig. 2:
Using the initial/final segmentation algorithm of the previous section, the initial speech signal x_i[n], i = 1, 2, ..., I, is obtained. A short-time Fourier transform is applied to every initial speech frame x_i[n], with the discrete Fourier transform taken over 8192 points:

X_i[k] = \sum_{n=0}^{N-1} x_i[n] e^{-j \frac{2\pi}{N} kn}

From the short-time Fourier transform, the short-time power spectrum of every initial speech frame is computed:

S_i[k] = X_i[k] \cdot X_i^*[k] = |X_i[k]|^2

The short-time power spectrum of the whole initial speech signal is then the matrix:

S = \begin{bmatrix} S_1[k] \\ S_2[k] \\ \vdots \\ S_I[k] \end{bmatrix}.
Critical bands are divided according to the auditory characteristics of the human ear and are a standard well known in the art. The frequencies and bandwidths of the critical bands are shown in Table 1.
Table 1. Frequencies and bandwidths of the critical bands (Hz)
Band  Low (Hz)  High (Hz)  Bandwidth    Band  Low (Hz)  High (Hz)  Bandwidth
0     0         100        100          11    1480      1720       240
1     100       200        100          12    1720      2000       280
2     200       300        100          13    2000      2320       320
3     300       400        100          14    2320      2700       380
4     400       510        110          15    2700      3150       450
5     510       630        120          16    3150      3700       550
6     630       770        140          17    3700      4400       700
7     770       920        150          18    4400      5300       900
8     920       1080       160          19    5300      6400       1100
9     1080      1270       190          20    6400      7700       1300
10    1270      1480       210
For the short-time power spectrum matrix S of the initial speech signal, the S matrix is partitioned by frequency according to the frequencies and bandwidths of the critical bands, into 20 bands in total. The power sum p_j, j = 1, 2, ..., 20, within each band is computed, finally giving the critical-band short-time power acoustic feature of the initial speech signal: PSCB = (p_1, p_2, ..., p_20).
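A sketch of the PSCB computation; since Table 1 lists 21 rows while the feature uses 20 bands, treating bands 1-20 (100-7700 Hz) as the 20 bands and dropping the 0-100 Hz row is an assumption:

```python
import numpy as np

# Critical-band edges (Hz) from Table 1, bands 1-20
EDGES = [100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480,
         1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700]

def pscb(frames, fs=16000, nfft=8192):
    """8192-point short-time power spectra of the initial frames, with the
    power summed within each critical band and accumulated over all frames."""
    spec = np.fft.rfft(frames, n=nfft, axis=1)      # X_i[k]
    power = np.abs(spec) ** 2                       # S_i[k] = |X_i[k]|^2
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    p = np.zeros(20)
    for j in range(20):
        band = (freqs >= EDGES[j]) & (freqs < EDGES[j + 1])
        p[j] = power[:, band].sum()                 # p_j over all frames
    return p                                        # PSCB = (p_1, ..., p_20)
```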
3.4 Acoustic features based on the wavelet and wavelet packet transforms and information entropy
In wavelet analysis, signal analysis is a kind of multiresolution analysis realized through filter banks. Each decomposition stage splits its input signal into a low-frequency rough approximation and a high-frequency detail part. Signal reconstruction is the inverse process of decomposition. As the wavelet scale changes, a coarse-to-fine multiscale analysis of the signal can be realized. Based on multiresolution theory, Mallat proposed a fast algorithm for the wavelet transform, called the Mallat algorithm. This algorithm adopts the Mallat algorithm to realize the decomposition and reconstruction of wavelets and wavelet packets.
This algorithm proposes acoustic features based on the wavelet and wavelet packet transforms and information entropy (WTE: Wavelet Transform based Entropy; WPE: Wavelet Packet based Entropy). The algorithm flow is shown in Fig. 3.
WTE: Using the initial/final segmentation algorithm of the previous section, the initial speech signal x_i[n], i = 1, 2, ..., I, is obtained. A 3-level wavelet decomposition is applied to every speech frame (the wavelet decomposition tree is shown in Fig. 4), the leaf nodes of the decomposition are reconstructed, and the information entropy of each reconstructed signal is computed (the process is shown in Fig. 5), with the formulas:

g_1 = -\sum_h c_{3,0}^2(h) \log\{c_{3,0}^2(h)\};
g_2 = -\sum_h c_{3,1}^2(h) \log\{c_{3,1}^2(h)\};
g_3 = -\sum_h c_{2,1}^2(h) \log\{c_{2,1}^2(h)\};
g_4 = -\sum_h c_{1,1}^2(h) \log\{c_{1,1}^2(h)\};

where c_{j,k}(h) denotes the signal reconstructed from node k at level j.
WPE: Using the initial/final segmentation algorithm of the previous section, the initial speech signal x_i[n], i = 1, 2, ..., I, is obtained. A 3-level wavelet packet decomposition is applied to every speech frame (the wavelet packet tree structure is shown in Fig. 6), and the signals after the third-level wavelet packet decomposition are reconstructed. As in the WTE algorithm, the information entropy of each reconstructed signal is computed (the process is shown in Fig. 7), with the formula: e_w = -\sum_r d_{3,w}^2(r) \log\{d_{3,w}^2(r)\}, w = 0, 1, ..., 7.
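A sketch of both features using PyWavelets; the db4 mother wavelet is an assumption (the patent does not name one), and the small epsilon guarding log(0) is likewise an implementation assumption:

```python
import numpy as np
import pywt

def _entropy(sig):
    """Information entropy in the text's form -sum s^2 log(s^2)."""
    e = sig ** 2
    return -np.sum(e * np.log(e + 1e-12))   # epsilon guards log(0)

def wte(frame, wavelet="db4"):
    """WTE: 3-level wavelet decomposition (Mallat algorithm via pywt),
    each leaf node reconstructed separately, entropy of each reconstruction."""
    coeffs = pywt.wavedec(frame, wavelet, level=3)   # [cA3, cD3, cD2, cD1]
    ents = []
    for j in range(len(coeffs)):
        parts = [c if i == j else np.zeros_like(c) for i, c in enumerate(coeffs)]
        rec = pywt.waverec(parts, wavelet)[: len(frame)]
        ents.append(_entropy(rec))
    return ents                                      # [g_1, g_2, g_3, g_4]

def wpe(frame, wavelet="db4"):
    """WPE: 3-level wavelet packet decomposition, each of the 8 third-level
    nodes reconstructed separately, entropy of each reconstruction."""
    wp = pywt.WaveletPacket(data=frame, wavelet=wavelet, maxlevel=3)
    ents = []
    for node in wp.get_level(3, order="natural"):
        single = pywt.WaveletPacket(data=None, wavelet=wavelet, maxlevel=3)
        single[node.path] = node.data
        rec = single.reconstruct(update=False)[: len(frame)]
        ents.append(_entropy(rec))
    return ents                                      # [e_0, ..., e_7]
```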
4 Pattern recognition algorithms
4.1 The classic KNN classification algorithm
The KNN algorithm is one of the classic pattern recognition methods. Its basic idea is: for a sample under test, find the K training samples closest to it in feature space, then perform statistics and analysis on those K training samples and find the class with the largest count or the highest similarity; the sample under test is identified as belonging to that class.
In the KNN recognizer used here, the number of nearest neighbors K is 5. Its calculation steps are as follows:
(1) Collect syllable speech signals known to contain glottal stops and syllable speech signals known not to contain glottal stops as training samples, the signals with glottal stops forming one sample set and the signals without glottal stops forming the other, each class denoted C_i (i = 1, 2).
(2) For the sample under test and the training samples, compute the same acoustic feature value: one of the feature values enumerated in Section 3.
(3) Compute the distance between the sample under test and all training samples, with the distance formula:

D = \sum_{l=1}^{N} (x_l - y_l)^2

where x is the feature value of the sample under test, y is the feature value of a training sample, and N is the number of feature values.
(4) Sort the distances from the sample under test to all training samples and take the 5 training samples closest to it; among the classes of these 5 training samples, the class C_i with the largest count is the class of the sample under test.
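A minimal sketch of this classification step, with class labels 1 (glottal stop) and 0 (no glottal stop) as an illustrative encoding:

```python
import numpy as np

def knn_classify(x, train_X, train_y, K=5):
    """Classic KNN: squared-Euclidean distance from the test feature vector
    x to every training sample, then a majority vote over the K nearest."""
    d = np.sum((train_X - x) ** 2, axis=1)   # D = sum_l (x_l - y_l)^2
    nearest = train_y[np.argsort(d)[:K]]
    return int(np.sum(nearest) > K / 2)      # majority class
```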
4.2 The improved KNN classification algorithm
This algorithm improves the KNN algorithm by proposing to weight the feature values by feature group.
(1) For the training samples and the sample under test, compute the same acoustic feature values: F, MFCC, PSCB, WTE, WPE. Concatenate these five acoustic feature values in order into one vector as the feature value. For each initial speech signal, the dimensions of the parameters are: F: 5 dimensions, MFCC: 12 dimensions, PSCB: 20 dimensions, WTE: 4 dimensions, WPE: 8 dimensions.
(2) Compute the distance between the sample under test and all training samples. When computing the distance from the sample under test to each training sample, a different weight is assigned to each acoustic feature: F is assigned weight a, MFCC weight b, PSCB weight c, WTE weight d, and WPE weight e. The distance formula is improved to:

D_1 = \sum_{l=1}^{5} a (x_l - y_l)^2 + \sum_{l=6}^{17} b (x_l - y_l)^2 + \sum_{l=18}^{37} c (x_l - y_l)^2 + \sum_{l=38}^{41} d (x_l - y_l)^2 + \sum_{l=42}^{49} e (x_l - y_l)^2

(3) The weight corresponding to each acoustic feature is preferably the accuracy that the feature alone attains, with a KNN classifier, in discriminating the two classes with or without glottal stop. That is, with the spectral energy concentration features of the samples of the two sample spaces as the samples of a KNN recognition model, the recognition accuracy of that KNN model is a; with the MFCC acoustic features as the samples, the accuracy is b; with the critical-band short-time power spectrum features as the samples, the accuracy is c; with the wavelet transform and information entropy features as the samples, the accuracy is d; and with the wavelet packet transform and information entropy features as the samples, the accuracy is e.
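A sketch of the improved distance over the 49-dimensional concatenated vector [F, MFCC, PSCB, WTE, WPE]; the slice boundaries follow the index ranges in the formula:

```python
import numpy as np

# Feature-group slices in the 49-dim vector: F, MFCC, PSCB, WTE, WPE
GROUPS = [slice(0, 5), slice(5, 17), slice(17, 37), slice(37, 41), slice(41, 49)]

def weighted_distance(x, y, w):
    """Improved distance D_1: each group's squared differences scaled by its
    weight w = (a, b, c, d, e), the per-group KNN accuracies described above."""
    return sum(wk * np.sum((x[g] - y[g]) ** 2) for wk, g in zip(w, GROUPS))
```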
4.3 SVM pattern recognition algorithm
The support vector machine (Support Vector Machines, SVM) pattern classification algorithm is widely used in speech signal processing. Based on the structural risk minimization principle, SVM constructs an optimal decision hyperplane that maximizes the distance from the hyperplane to the nearest samples of the two classes on either side, thereby providing good generalization ability for classification problems. SVM is very effective for two-class classification problems. Common SVM kernel functions include polynomial functions, radial basis functions, and multilayer perceptrons. The Gaussian kernel is the most commonly used radial basis function and has quite high flexibility; some studies also show that this kernel achieves good results in speech signal processing. The Gaussian kernel is used here to realize the discrimination of the two classes, with or without glottal stop. Its calculation steps are as follows:
(1) For the sample under test and the training samples, compute the same acoustic feature value, e.g. the spectral energy concentration feature F.
(2) Train the SVM model with the spectral energy concentration features of the training samples.
(3) Input the spectral energy concentration feature of the test sample into the trained SVM to obtain the computer's automatic discrimination result.
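A sketch of this step with scikit-learn's SVC and the Gaussian (RBF) kernel the text specifies; the C and gamma hyperparameters are left at sklearn defaults, an assumption not stated in the patent:

```python
from sklearn.svm import SVC

def train_svm(train_X, train_y):
    """Train an RBF-kernel SVM on feature vectors (e.g. F per sample)."""
    clf = SVC(kernel="rbf")
    clf.fit(train_X, train_y)
    return clf

# usage: pred = train_svm(X, y).predict(test_X)  # 1 = glottal stop present
```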
The training samples in this algorithm comprise the cleft palate speech of child cleft palate patients aged 4-11. Recording was carried out in a professional recording studio, and during recording the speakers were asked to keep their most natural and habitual manner of articulation. With the speaker's lips about 5 cm from a Creative HS300 digital microphone, the syllables of the "Mandarin articulation test of the Speech Therapy Center, West China Hospital of Stomatology, Sichuan University" were produced at a rate of about one syllable every 2 s. The cleft palate speech database used here comprises 28 recordings of female child patients and 30 recordings of male child patients. The collected cleft palate speech was judged perceptually and independently by 3 professional speech therapists, who judged for each syllable (Chinese character) whether a glottal stop occurred in the initial portion.
5 Accuracy verification experiments
The present invention uses 10 rounds of k-fold cross validation (k-fold cross validation), with k = 10, to verify the recognition accuracy of each class of model in Section 4. 300 syllable speech signals, some containing glottal stops and some not (judged perceptually by professional speech therapists, who determined for each syllable (Chinese character) whether a glottal stop occurred in the initial portion), are taken as the standard samples. The various feature values of the standard samples are extracted according to the preceding methods.
5.1 Verification of the classic KNN classification algorithm
The 300 standard samples are randomly divided into ten parts; in turn, 9 parts serve as training samples and the remaining part as test samples.
The classic KNN classification algorithm is used to identify whether each test sample contains a glottal stop; the recognition results are compared with the perceptual judgments of the professional speech therapists, the number of correct recognitions in the test part is counted, and the accuracy is computed.
The second part is then taken as the test samples, with the other 9 parts as training samples, and the recognition accuracy is computed; by analogy, each of the remaining eight parts is taken in turn as the test samples, with the other 9 parts as training samples, and the recognition accuracy computed.
After one such traversal, 10 accuracies are obtained, and their mean is computed.
The 300 standard samples are then randomly divided into ten parts again, each part taken in turn as the test samples with the remaining nine parts as training samples, yielding 10 accuracies whose mean is computed. Likewise, 8 more such random divisions and mean-accuracy computations are performed. This finally yields 10 accuracy means; averaging these 10 means gives the accuracy of the recognition model.
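A sketch of this validation scheme with scikit-learn's RepeatedKFold, using the classic KNN (K = 5) as a stand-in for the model under test:

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold
from sklearn.neighbors import KNeighborsClassifier

def repeated_cv_accuracy(X, y, n_repeats=10):
    """10 random ten-fold partitions; each fold is tested once and the fold
    accuracies are averaged. The mean of the 10 per-repetition means equals
    the overall mean over all 100 folds, so a single mean suffices here."""
    rkf = RepeatedKFold(n_splits=10, n_repeats=n_repeats)
    accs = []
    for train_idx, test_idx in rkf.split(X):
        clf = KNeighborsClassifier(n_neighbors=5).fit(X[train_idx], y[train_idx])
        accs.append(clf.score(X[test_idx], y[test_idx]))
    return float(np.mean(accs))
```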
5.2 Verification of the improved KNN classification algorithm
Similar to the method of Section 5.1, with the difference that the sample features are replaced with the feature values of the improved KNN classification model and the recognition model is replaced with the improved KNN classification model. The accuracy of this model is computed.
5.3 Verification of the SVM pattern recognition algorithm
Similar to the method of Section 5.1, with the recognition model replaced by the SVM recognition model. The accuracy of this model is computed.
Finally, the recognition accuracies of all classes of recognition model are obtained; see Table 2. The accuracy of the improved KNN classification model is the highest.
Table 2. Automatic recognition accuracy for cleft palate speech with or without glottal stop
The present invention is not limited to the foregoing embodiments. The present invention extends to any new feature disclosed in this specification and any new combination, and to the steps of any new method or process disclosed and any new combination.

Claims (10)

1. A cleft palate speech glottal stop automatic identification algorithm, characterized by comprising:
Step 1: collecting the syllable speech signal to be tested;
Step 2: performing initial/final segmentation on said syllable speech signal and retaining the initial speech signal;
Step 3: extracting feature values of said initial speech signal;
Step 4: feeding said feature values into a trained recognition model, the recognition model judging from said feature values whether a glottal stop is present in said syllable speech signal.
2. The cleft palate speech glottal stop automatic identification algorithm according to claim 1, characterized in that said step 2 further comprises:
Step 21: windowing and framing the syllable speech signal to obtain speech frames x_i[n], i = 1, 2, ..., M, n = 1, 2, ..., N, where N is the frame length;
Step 22: computing the short-time energy E_i and short-time zero-crossing rate Z_i of each speech frame;
Step 23: computing the energy difference and zero-crossing-rate difference of adjacent frames: e(i) = E_{i+1} - E_i and z(i) = Z_{i+1} - Z_i, i = 1, 2, ..., M-1;
Step 24: comparing each energy difference e(i) with a threshold T1 and each zero-crossing-rate difference z(i) with a threshold T2; when e(i) >= T1 and z(i) <= T2 hold simultaneously, denoting that frame index i by I; the speech frames x_i[n], i = 1, 2, ..., I, being the initial speech signal of the syllable speech signal.
3. The cleft palate speech glottal stop automatic identification algorithm according to claim 1, characterized in that the initial speech signal feature values extracted in said step 3 comprise one or more of the following: the spectral energy concentration feature, the MFCC acoustic feature, the critical-band short-time power spectrum feature, the wavelet transform and information entropy feature, and the wavelet packet transform and information entropy feature; wherein,
extracting the spectral energy concentration feature of the initial speech signal: the first to fifth spectral energy concentration values of every initial speech frame are computed; the mean of the first value over all initial speech frames is taken as the first spectral energy concentration feature of the initial speech signal, and by analogy the second to fifth spectral energy concentration features of the initial speech signal are computed;
extracting the MFCC acoustic feature of the initial speech signal: the MFCC acoustic feature of every initial speech frame is computed with 12 MFCC coefficients, giving 12 MFCC values per initial speech frame; the mean of the first MFCC value over all initial speech frames is taken as the first MFCC feature of the initial speech signal, and by analogy the second to twelfth MFCC features of the initial speech signal are computed;
extracting the critical-band short-time power spectrum feature of the initial speech signal: a short-time Fourier transform is applied to every initial speech frame to obtain its short-time power spectrum; according to the critical-band partition rule, the short-time power spectrum of every initial speech frame is divided into 20 critical bands; the power in the first critical band is summed over all initial speech frames to obtain the first critical-band short-time power spectrum feature of the initial speech signal, and by analogy the second to twentieth critical-band short-time power spectrum features are obtained;
extracting the wavelet transform and information entropy feature of the initial speech signal: a three-level wavelet transform is applied to every initial speech frame, the signals after the three-level wavelet decomposition are reconstructed to obtain 4 reconstructed signals, and the information entropy of each reconstructed signal is computed; the mean of the information entropy of the first reconstructed signal over all initial speech frames is taken as the first wavelet transform and information entropy feature of the initial speech signal, and by analogy the second to fourth wavelet transform and information entropy features are computed;
extracting the wavelet packet transform and information entropy feature of the initial speech signal: a three-level wavelet packet transform is applied to every initial speech frame, the signals after the three-level wavelet packet decomposition are reconstructed to obtain 8 reconstructed signals, and the information entropy of each reconstructed signal is computed; the mean of the information entropy of the first reconstructed signal over all initial speech frames is taken as the first wavelet packet transform and information entropy feature of the initial speech signal, and by analogy the second to eighth wavelet packet transform and information entropy features are computed.
4. The cleft palate speech glottal stop automatic identification algorithm according to claim 3, characterized in that step 4 further comprises:
choosing a number of syllable speech signals known to contain glottal stops to form a positive training sample set, and choosing a number of syllable speech signals known not to contain glottal stops to form a negative training sample set;
extracting, for each sample of the two training sample sets, the spectral energy concentration feature, MFCC acoustic feature, critical-band short-time power spectrum feature, wavelet transform and information entropy feature, and wavelet packet transform and information entropy feature;
obtaining the initial speech signal feature values of the syllable speech signal to be tested extracted in step 3;
computing the distance between the initial speech signal feature values of this syllable speech signal to be tested and each training sample:

D_1 = \sum_{l=1}^{5} a (x_l - y_l)^2 + \sum_{l=6}^{17} b (x_l - y_l)^2 + \sum_{l=18}^{37} c (x_l - y_l)^2 + \sum_{l=38}^{41} d (x_l - y_l)^2 + \sum_{l=42}^{49} e (x_l - y_l)^2;

choosing the several training samples whose feature-value distance from the syllable speech signal to be tested is shortest; when most of them belong to the positive training sample set, the syllable speech signal to be tested is judged to contain a glottal stop;
wherein: x_l, l = 1 ~ 5, are the first to fifth spectral energy concentration features of the syllable speech signal to be tested;
x_l, l = 6 ~ 17, are the first to twelfth MFCC acoustic features of the syllable speech signal to be tested;
x_l, l = 18 ~ 37, are the first to twentieth critical-band short-time power spectrum features of the syllable speech signal to be tested;
x_l, l = 38 ~ 41, are the first to fourth wavelet transform and information entropy features of the syllable speech signal to be tested;
x_l, l = 42 ~ 49, are the first to eighth wavelet packet transform and information entropy features of the syllable speech signal to be tested;
y_l, l = 1 ~ 5, are the first to fifth spectral energy concentration features of a training sample;
y_l, l = 6 ~ 17, are the first to twelfth MFCC acoustic features of a training sample;
y_l, l = 18 ~ 37, are the first to twentieth critical-band short-time power spectrum features of a training sample;
y_l, l = 38 ~ 41, are the first to fourth wavelet transform and information entropy features of a training sample;
y_l, l = 42 ~ 49, are the first to eighth wavelet packet transform and information entropy features of a training sample;
a, b, c, d, e are weights.
5. The cleft palate speech glottal stop automatic identification algorithm according to claim 4, characterized in that the method for obtaining the values of said weights comprises:
choosing a number of syllable speech signals known to contain glottal stops to form a positive sample space, and choosing a number of syllable speech signals known not to contain glottal stops to form a negative sample space;
extracting, for each sample of the two sample spaces, the spectral energy concentration feature, MFCC acoustic feature, critical-band short-time power spectrum feature, wavelet transform and information entropy feature, and wavelet packet transform and information entropy feature;
with the spectral energy concentration features of the samples of the two sample spaces as the samples of a KNN recognition model, the recognition accuracy of that KNN model being a;
with the MFCC acoustic features of the samples of the two sample spaces as the samples of a KNN recognition model, the recognition accuracy of that KNN model being b;
with the critical-band short-time power spectrum features of the samples of the two sample spaces as the samples of a KNN recognition model, the recognition accuracy of that KNN model being c;
with the wavelet transform and information entropy features of the samples of the two sample spaces as the samples of a KNN recognition model, the recognition accuracy of that KNN model being d;
with the wavelet packet transform and information entropy features of the samples of the two sample spaces as the samples of a KNN recognition model, the recognition accuracy of that KNN model being e.
6. A cleft palate speech glottal stop automatic identification device, characterized by comprising:
a speech collection unit for collecting the syllable speech signal to be tested;
an initial extraction unit for performing initial/final segmentation on said syllable speech signal and retaining the initial speech signal;
an initial feature extraction unit for extracting feature values of said initial speech signal;
a recognition unit for feeding said feature values into a trained recognition model, the recognition model judging from said feature values whether a glottal stop is present in said syllable speech signal.
7. The cleft palate speech glottal stop automatic identification device according to claim 6, characterized in that said initial extraction unit further comprises:
a windowing and framing subunit for windowing and framing the syllable speech signal to obtain speech frames x_i[n], i = 1, 2, ..., M, n = 1, 2, ..., N, where N is the frame length;
a short-time energy computation unit for computing the short-time energy E_i of each speech frame;
a short-time zero-crossing rate computation unit for computing the short-time zero-crossing rate Z_i of each speech frame;
an energy difference computation unit for computing the energy difference of adjacent frames: e(i) = E_{i+1} - E_i, i = 1, 2, ..., M-1;
a zero-crossing-rate difference computation unit for computing the zero-crossing-rate difference of adjacent frames: z(i) = Z_{i+1} - Z_i, i = 1, 2, ..., M-1;
a comparison unit for comparing each energy difference e(i) with a threshold T1 and each zero-crossing-rate difference z(i) with a threshold T2; when e(i) >= T1 and z(i) <= T2 hold simultaneously, denoting that frame index i by I, the speech frames x_i[n], i = 1, 2, ..., I, being the initial speech signal of the syllable speech signal.
8. The cleft palate speech glottal stop automatic identification device according to claim 6, characterized in that the initial feature extraction unit comprises one or more of the following subunits:
a spectral energy concentration feature extraction subunit for computing the first to fifth spectral energy concentration values of every initial speech frame; taking the mean of the first value over all initial speech frames as the first spectral energy concentration feature of the initial speech signal, and by analogy computing the second to fifth spectral energy concentration features of the initial speech signal;
an MFCC acoustic feature extraction subunit for computing the MFCC acoustic feature of every initial speech frame with 12 MFCC coefficients, giving 12 MFCC values per initial speech frame; taking the mean of the first MFCC value over all initial speech frames as the first MFCC feature of the initial speech signal, and by analogy computing the second to twelfth MFCC features of the initial speech signal;
a critical-band short-time power spectrum feature extraction subunit for applying a short-time Fourier transform to every initial speech frame to obtain its short-time power spectrum; dividing the short-time power spectrum of every initial speech frame into 20 critical bands according to the critical-band partition rule; summing the power in the first critical band over all initial speech frames to obtain the first critical-band short-time power spectrum feature of the initial speech signal, and by analogy obtaining the second to twentieth critical-band short-time power spectrum features;
a wavelet transform and information entropy feature extraction subunit for applying a three-level wavelet transform to every initial speech frame, reconstructing the signals after the three-level wavelet decomposition to obtain 4 reconstructed signals, and computing the information entropy of each reconstructed signal; taking the mean of the information entropy of the first reconstructed signal over all initial speech frames as the first wavelet transform and information entropy feature of the initial speech signal, and by analogy computing the second to fourth wavelet transform and information entropy features;
a wavelet packet transform and information entropy feature extraction subunit for applying a three-level wavelet packet transform to every initial speech frame, reconstructing the signals after the three-level wavelet packet decomposition to obtain 8 reconstructed signals, and computing the information entropy of each reconstructed signal; taking the mean of the information entropy of the first reconstructed signal over all initial speech frames as the first wavelet packet transform and information entropy feature of the initial speech signal, and by analogy computing the second to eighth wavelet packet transform and information entropy features.
9. a kind of cleft palate speech glottal stop automatic identification equipment according to claim 8, it is characterized in that, recognition unit comprises further:
Sample space collects unit, for choosing the syllable verbal audio signal some composition true training sample set of known packets containing glottal stop, chooses the known false training sample set of the some compositions of syllable verbal audio signal not comprising glottal stop;
Sample characteristics extraction unit, for extracting the spectrum energy strengthening segment eigenwert of each training sample of two training sample sets, MFCC acoustic feature value, critical bands short-time rating spectroscopic eigenvalue, wavelet transformation and Information Entropy Features value and wavelet package transforms and Information Entropy Features value;
Syllable verbal audio signal characteristic value acquiring unit to be measured, for receiving the initial consonant phonic signal character value of the syllable verbal audio signal to be measured that initial consonant characteristics extraction unit extracts;
Metrics calculation unit, the distance for the initial consonant phonic signal character value and each training sample that calculate this syllable verbal audio signal to be measured:
D 1 = &Sigma; l = 1 5 a ( x l - y 1 ) 2 + &Sigma; l = 6 17 b ( x l - y l ) 2 + &Sigma; l = 18 37 c ( x l - y l ) 2 + &Sigma; l = 38 41 d ( x l - y l ) 2 + &Sigma; l = 42 49 e ( x l - y l ) 2 ;
Choose the shortest some training samples of initial consonant phonic signal character value distance from syllable verbal audio signal to be measured, wherein belong to the training sample of true training sample set maximum time then think in described syllable verbal audio signal to be measured containing glottal stop;
Wherein: x_l, for l = 1 to 5, are the first through fifth spectral energy enhancement segment feature values of the syllable voice signal under test;
x_l, for l = 6 to 17, are the first through twelfth MFCC acoustic feature values of the syllable voice signal under test;
x_l, for l = 18 to 37, are the first through twentieth critical-band short-time power spectrum feature values of the syllable voice signal under test;
x_l, for l = 38 to 41, are the first through fourth wavelet transform and information entropy feature values of the syllable voice signal under test;
x_l, for l = 42 to 49, are the first through eighth wavelet packet transform and information entropy feature values of the syllable voice signal under test;
y_l, for l = 1 to 5, are the first through fifth spectral energy enhancement segment feature values of the training sample;
y_l, for l = 6 to 17, are the first through twelfth MFCC acoustic feature values of the training sample;
y_l, for l = 18 to 37, are the first through twentieth critical-band short-time power spectrum feature values of the training sample;
y_l, for l = 38 to 41, are the first through fourth wavelet transform and information entropy feature values of the training sample;
y_l, for l = 42 to 49, are the first through eighth wavelet packet transform and information entropy feature values of the training sample;
a, b, c, d, e are weights.
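For concreteness, a minimal sketch of the weighted distance and the KNN vote defined in this claim follows. The feature-group index ranges match the claim; k = 5 and the helper names are illustrative assumptions, since the claim only requires "a number of" nearest training samples.

```python
import numpy as np

# 0-based (start, stop) slices of the five feature groups in the
# 49-dimensional vector, matching the index ranges l = 1..5, 6..17,
# 18..37, 38..41 and 42..49 in the claim.
GROUPS = [(0, 5), (5, 17), (17, 37), (37, 41), (41, 49)]

def weighted_distance(x, y, weights):
    """x, y: length-49 feature vectors; weights: (a, b, c, d, e)."""
    return sum(w * np.sum((x[s:e] - y[s:e]) ** 2)
               for w, (s, e) in zip(weights, GROUPS))

def has_glottal_stop(x, samples, labels, weights, k=5):
    """samples: (n, 49) training features; labels[i] is True when sample i
    belongs to the true (glottal stop) training sample set."""
    d = [weighted_distance(x, s, weights) for s in samples]
    nearest = np.argsort(d)[:k]
    # The syllable is judged to contain a glottal stop when most of the
    # k nearest training samples come from the true training sample set.
    return np.asarray(labels)[nearest].sum() > k / 2
```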
10. The cleft palate speech glottal stop automatic identification device according to claim 9, characterized in that the method for obtaining the values of the weights comprises:
Selecting a number of syllable voice signals known to contain a glottal stop to form a true sample space, and selecting a number of syllable voice signals known not to contain a glottal stop to form a false sample space;
Extracting the spectral energy enhancement segment feature values, MFCC acoustic feature values, critical-band short-time power spectrum feature values, wavelet transform and information entropy feature values, and wavelet packet transform and information entropy feature values of each sample of the two sample spaces;
Using the spectral energy enhancement segment feature values of the samples of the two sample spaces as the samples of a KNN recognition model, the recognition accuracy of the KNN recognition model is taken as a;
Using the MFCC acoustic feature values of the samples of the two sample spaces as the samples of a KNN recognition model, the recognition accuracy of the KNN recognition model is taken as b;
Using the critical-band short-time power spectrum feature values of the samples of the two sample spaces as the samples of a KNN recognition model, the recognition accuracy of the KNN recognition model is taken as c;
Using the wavelet transform and information entropy feature values of the samples of the two sample spaces as the samples of a KNN recognition model, the recognition accuracy of the KNN recognition model is taken as d;
Using the wavelet packet transform and information entropy feature values of the samples of the two sample spaces as the samples of a KNN recognition model, the recognition accuracy of the KNN recognition model is taken as e.
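A hedged sketch of this weight selection procedure: each weight is the recognition accuracy of a plain KNN model run on one feature group alone. scikit-learn's KNeighborsClassifier and the cross-validated accuracy estimate are assumptions; the claim only states that the accuracy obtained with each feature group is taken as the corresponding weight.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

GROUPS = [(0, 5), (5, 17), (17, 37), (37, 41), (41, 49)]  # as above

def estimate_weights(samples, labels, k=5):
    """samples: (n, 49) feature array drawn from the true and false sample
    spaces; labels: 1 for glottal stop, 0 otherwise. Returns (a, ..., e)."""
    weights = []
    for start, stop in GROUPS:
        knn = KNeighborsClassifier(n_neighbors=k)
        # Recognition accuracy of KNN using this feature group only
        # (cross-validation is one way to estimate the claim's
        # "recognition correct rate").
        acc = cross_val_score(knn, samples[:, start:stop], labels,
                              cv=5, scoring='accuracy').mean()
        weights.append(acc)
    return tuple(weights)
```

With weights chosen this way, the feature groups that separate glottal stop samples best dominate the distance D_1 in the classification step.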
CN201510257555.9A 2015-05-19 2015-05-19 Cleft palate voice glottal stop automatic identification algorithm and device Pending CN104992707A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510257555.9A CN104992707A (en) 2015-05-19 2015-05-19 Cleft palate voice glottal stop automatic identification algorithm and device

Publications (1)

Publication Number Publication Date
CN104992707A (en) 2015-10-21

Family

ID=54304510

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510257555.9A Pending CN104992707A (en) 2015-05-19 2015-05-19 Cleft palate voice glottal stop automatic identification algorithm and device

Country Status (1)

Country Link
CN (1) CN104992707A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101290766A (en) * 2007-04-20 2008-10-22 西北民族大学 Syllable splitting method of Tibetan language of Anduo
CN101825489A (en) * 2010-01-29 2010-09-08 浙江大学 Method for separating OLTC (On-Load Tap Changer) vibration signals of power transformer
CN101829689A (en) * 2010-03-31 2010-09-15 北京科技大学 Drift fault recognition method of hot-rolling strip steel based on sound signals
CN103308919A (en) * 2012-03-12 2013-09-18 中国科学院声学研究所 Fish identification method and system based on wavelet packet multi-scale information entropy
CN102800316A (en) * 2012-08-30 2012-11-28 重庆大学 Optimal codebook design method for voiceprint recognition system based on nerve network
CN102968986A (en) * 2012-11-07 2013-03-13 华南理工大学 Overlapped voice and single voice distinguishing method based on long time characteristics and short time characteristics
CN103400580A (en) * 2013-07-23 2013-11-20 华南理工大学 Method for estimating importance degree of speaker in multiuser session voice
CN104021785A (en) * 2014-05-28 2014-09-03 华南理工大学 Method of extracting speech of most important guest in meeting

Non-Patent Citations (12)

* Cited by examiner, † Cited by third party
Title
He Ling, Yuan Yanan, Yin Heng, Zhang Yatong, Zhang Jin, Liu Qi, Li Yang: "Research on automatic recognition of hypernasality grades in cleft palate speech", Journal of Sichuan University (Engineering Science Edition) *
Xiang Biao: "Research on voice prompt technology based on the fusion of ultrasonic and visual information", China Master's Theses Full-text Database, Information Science and Technology *
Tang Nana: "Research on noise-robust speech recognition methods based on robust PLPC", China Master's Theses Full-text Database, Information Science and Technology *
Xia Dongdong: "Research on speech enhancement algorithms in non-stationary environments", China Master's Theses Full-text Database, Information Science and Technology *
Yin Heng, He Ling, Zhang Jin, Li Yang: "Automatic recognition of hypernasality in cleft palate patients based on nonlinear parameters", Computer Engineering and Design *
Zhang Yanyan, Zhang Rong: "A method for initial/final segmentation and initial classification based on time-domain parameters", National Conference on Man-Machine Speech Communication *
Lin Zhimin: "Research on key technologies of echo cancellation and loudness compensation in digital hearing aids", China Master's Theses Full-text Database, Information Science and Technology *
Wang Guomin: "Cleft Lip and Palate Repair and Speech Therapy", 31 January 2013 *
Wang Pan, Shen Jizhong, Shi Jinhe: "P300 feature extraction algorithm based on wavelet transform and time-domain energy entropy", Chinese Journal of Scientific Instrument *
Zhao Li: "Speech Signal Processing", 30 June 2009 *
Chen Pandi: "Automatic recognition algorithm for consonant omission in cleft palate speech based on HMM and LPCC", Information & Computer *
Gu Yaqiang: "Research on key technologies of speaker-independent speech recognition", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105286798B (en) * 2015-11-04 2018-07-20 深圳市福生医疗器械有限公司 Velopharyngeal closure detection device and detection method
CN105286798A (en) * 2015-11-04 2016-02-03 深圳市福生医疗器械有限公司 Velopharyngeal closure detection device and method
CN105679332B (en) * 2016-03-09 2019-06-11 四川大学 A kind of cleft palate speech sound mother automatic segmentation method and system
CN105679332A (en) * 2016-03-09 2016-06-15 四川大学 Cleft palate speech initial and final automatic segmentation method and system
CN107274886A (en) * 2016-04-06 2017-10-20 中兴通讯股份有限公司 A kind of audio recognition method and device
CN107274886B (en) * 2016-04-06 2021-10-15 中兴通讯股份有限公司 Voice recognition method and device
CN107293302A (en) * 2017-06-27 2017-10-24 苏州大学 A kind of sparse spectrum signature extracting method being used in voice lie detection system
CN108596897A (en) * 2018-04-27 2018-09-28 四川大学 Fully automatic detection method for velopharyngeal closure under nasopharyngoscopy based on image processing
CN108596897B (en) * 2018-04-27 2021-08-20 四川大学 Image processing-based full-automatic detection method for nasopharyngoscope hypopharynx closing degree
CN108596898B (en) * 2018-04-27 2021-08-24 四川大学 Semi-automatic detection method for nasopharyngoscope hypopharynx closing degree based on image processing
CN108596898A (en) * 2018-04-27 2018-09-28 四川大学 Semi-automatic detection method for velopharyngeal closure under nasopharyngoscopy based on image processing
CN111883169A (en) * 2019-12-12 2020-11-03 马上消费金融股份有限公司 Audio file cutting position processing method and device
CN111354375A (en) * 2020-02-25 2020-06-30 咪咕文化科技有限公司 Cry classification method, device, server and readable storage medium

Similar Documents

Publication Publication Date Title
CN104992707A (en) Cleft palate voice glottal stop automatic identification algorithm and device
Sroka et al. Human and machine consonant recognition
CN104050965A (en) English phonetic pronunciation quality evaluation system with emotion recognition function and method thereof
CN105825852A (en) Oral English reading test scoring method
Quintas et al. Automatic Prediction of Speech Intelligibility Based on X-Vectors in the Context of Head and Neck Cancer.
Ryant et al. Highly accurate mandarin tone classification in the absence of pitch information
CN103405217A (en) System and method for multi-dimensional measurement of dysarthria based on real-time articulation modeling technology
CN110942784A (en) Snore classification system based on support vector machine
CN109300339A (en) A kind of exercising method and system of Oral English Practice
CN112397074A (en) Voiceprint recognition method based on MFCC (Mel frequency cepstrum coefficient) and vector element learning
JP2023018658A (en) Difficult airway evaluation method and device based on machine learning voice technology
Nieto et al. Pattern recognition of hypernasality in voice of patients with Cleft and Lip Palate
Cai et al. The best input feature when using convolutional neural network for cough recognition
Neto et al. Feature estimation for vocal fold edema detection using short-term cepstral analysis
CN114550701A (en) Deep neural network-based Chinese electronic larynx voice conversion device and method
Lv et al. Objective evaluation method of broadcasting vocal timbre based on feature selection
Baquirin et al. Artificial neural network (ANN) in a small dataset to determine neutrality in the pronunciation of english as a foreign language in filipino call center agents: Neutrality classification of Filipino call center agent's pronunciation
Nwe et al. Stress classification using subband based features
CN113129923A (en) Multi-dimensional singing playing analysis evaluation method and system in art evaluation
CN106297805A (en) A kind of method for distinguishing speek person based on respiratory characteristic
Sahoo et al. Detection of speech-based physical load using transfer learning approach
Gomathy et al. Gender clustering and classification algorithms in speech processing: a comprehensive performance analysis
Liu et al. Hypernasality detection in cleft palate speech based on natural computation
Koolagudi et al. Spectral features for emotion classification
Gore et al. Disease detection using voice analysis

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20151021