CN102543063B - Method for estimating speech speed of multiple speakers based on segmentation and clustering of speakers - Google Patents

Method for estimating speech speed of multiple speakers based on segmentation and clustering of speakers Download PDF

Info

Publication number
CN102543063B
Authority
CN
China
Prior art keywords
speaker
voice
energy
speech
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2011104035773A
Other languages
Chinese (zh)
Other versions
CN102543063A (en)
Inventor
李艳雄
徐鑫
贺前华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN2011104035773A
Publication of CN102543063A
Application granted
Publication of CN102543063B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention discloses a method for estimating the speech rate of multiple speakers based on speaker segmentation and clustering. The method comprises the following steps: first, a speech stream is read in; the speaker change points in the stream are detected, and the stream is divided into a plurality of speech segments at those change points; the segments are then clustered by speaker, and the segments of the same speaker are spliced together in order, yielding the number of speakers and the speech of each speaker; finally, the duration of each speaker's speech and the number of words it contains are estimated, from which each speaker's speech rate is obtained. Compared with speech-recognition-based methods that estimate a single speaker's rate, the method can not only estimate the speech rates of multiple speakers but also does so faster.

Description

Method for estimating the speech rate of multiple speakers based on speaker segmentation and clustering
Technical field
The present invention relates to speech signal processing and pattern recognition technology, and in particular to a method for estimating the speech rate of multiple speakers based on speaker segmentation and clustering.
Background technology
With the development of speech processing technology, the object of speech processing is gradually shifting from single-speaker speech to multi-speaker speech (for example, conference speech and conversational speech), and estimating the speech rates of multiple speakers, so that the parameters of a speech processing system (for example, a speech recognition system) can be adapted to each speaker's rate, is becoming increasingly important. In addition, during recording in studios or laboratories, speakers (for example, announcers, hosts, and customer-service staff) usually gauge their speech rate subjectively from experience, which is often not accurate enough. Although the speech rate can be estimated by manual annotation after the recording ends, this is very time-consuming and is hardly feasible when the amount of data is large. A method that automatically estimates the speech rates of multiple speakers is therefore of great importance.
Existing speech-rate estimation methods all target single-speaker speech: they can estimate only a single speaker's rate and cannot estimate the rates of multiple speakers. Moreover, existing methods mainly estimate the rate from speech recognition results: a speech recognizer first identifies the phoneme sequence and the time point of each phoneme in the input speech, then the word sequence and the time point of each word, from which the speaker's rate is estimated.
The shortcomings of the above speech-rate estimation methods are:
(1) Only the rate of a single speaker's speech can be estimated. When the input contains several speakers, it is processed as if it came from one speaker, and no per-speaker rate estimates are obtained.
(2) The estimation is slow. These methods first run speech recognition on the input and then estimate the rate from the recognized phoneme and word sequences. This requires training a large number of phoneme models (typically hidden Markov models) and performing heavy computation at recognition time (feature extraction, evaluation of acoustic-model and language-model output probabilities, and so on), so the methods are slow and ill-suited to real-time processing.
Summary of the invention
The objective of the invention is to overcome the above defects of the prior art by providing a method for estimating the speech rate of multiple speakers based on speaker segmentation and clustering: the speech stream is first divided into speech segments by speaker segmentation and clustering, and the segments of the same speaker are spliced together in order; the number of words in each speaker's speech and its duration are then estimated separately, realizing speech-rate estimation for multiple speakers.
The technical solution adopted by the present invention comprises the following steps:
1) Reading in the speech stream: read in a speech stream containing the speech of multiple speakers.
2) Speaker segmentation: detect the speaker change points in the speech stream and divide the stream into a plurality of speech segments at those change points.
3) Speaker clustering: cluster the speech segments by speaker with a spectral clustering algorithm and splice the segments of the same speaker together in order, obtaining the number of speakers and the speech of each speaker.
4) Speech-rate estimation: extract the energy envelope from each speaker's speech and determine the number of syllables from the local maximum points of the envelope, thereby estimating each speaker's speech rate.
The speaker segmentation of step 2) comprises:
2.1) finding the silent segments and speech segments in the speech stream using a threshold-based silence detection algorithm;
2.2) splicing the speech segments in order into one long speech segment, and extracting from it audio features comprising the Mel-frequency cepstral coefficients (MFCCs) and their first-order differences (Delta-MFCCs);
2.3) using the extracted audio features to judge, under the Bayesian information criterion, the similarity between adjacent data windows in the long segment, thereby detecting the speaker change points;
2.4) dividing the speech stream into a plurality of speech segments at those change points, each segment containing only one speaker.
The threshold-based silence detection algorithm of step 2.1) comprises:
2.1.1) dividing the speech stream into frames and computing the energy of each frame, giving the energy feature vector of the stream;
2.1.2) computing the energy threshold;
2.1.3) comparing each frame's energy against the threshold: a frame below the threshold is a silent frame, otherwise it is a speech frame; adjacent silent frames are spliced in order into a silent segment, and adjacent speech frames into a speech segment.
The speech-rate estimation of step 4) comprises:
4.1) computing the energy of a speaker's speech;
4.2) filtering the extracted energy with a low-pass filter to obtain the energy envelope;
4.3) computing the energy-envelope threshold;
4.4) locating the local maximum points of the envelope and counting them;
4.5) taking the number of local maximum points in the speaker's energy envelope as the number of syllables and dividing it by the duration of the speaker's speech, giving that speaker's speech rate;
4.6) repeating steps 4.1)–4.5) until the rate of every speaker's speech has been estimated.
A local maximum point satisfies the following conditions:
A) its value is greater than the energy-envelope threshold;
B) its value is greater than that of every element within 0.07 seconds before and after it.
The position of each local maximum point is the energy peak of the final (the vowel nucleus) of a syllable.
The beneficial effects of the invention are as follows: speaker segmentation cuts a speech stream containing multiple speakers into segments each containing only one speaker, and speaker clustering then recombines the segments of the same speaker, so the invention can estimate the speech rates of multi-speaker speech. In addition, the number of syllables is determined by detecting the local maxima of each speaker's energy envelope, so the rate is estimated without the complex numerical computation (for example, evaluating acoustic-model and language-model output probabilities) required by recognition-based methods; this saves computation time and makes the method better suited to real-time speech-rate estimation.
Description of drawings
Fig. 1 is a flow chart of the present invention.
Fig. 2 illustrates the speech-rate estimation of an embodiment of the invention: Fig. 2(a) is the waveform of one speaker's speech, and Fig. 2(b) is the extracted speech energy, in which the solid line is the energy envelope, the circled dash-dot marks are the local maxima of the envelope, and the dotted line is the envelope threshold.
Embodiment
The invention is described in detail below with reference to a specific embodiment and the accompanying drawings.
Fig. 1 is the flow chart of a method for estimating the speech rates of multiple speakers according to an embodiment of the invention. As shown in Fig. 1, a speech stream is first read in at step 101. The speech stream is speech data containing multiple speakers and may be a file of various formats, for example WAV, RAM, MP3, or VOX.
Then, at step 102, the silent segments and speech segments of the stream are located by a threshold-based silence detection method; the speech segments are spliced in order into one long segment and audio features are extracted from it; the similarity between adjacent data windows of the long segment is judged under the Bayesian information criterion to detect the speaker change points; finally, the stream is divided at those change points into a plurality of segments, each containing only one speaker.
The silence detection method of step 102 comprises the following steps:
1) Divide the speech stream into T frames with a frame length of 32 milliseconds (the number of samples per frame is N = 0.032 × f_s, where f_s is the sampling frequency of the speech signal) and a frame shift of 16 milliseconds; if the last frame contains fewer than N samples, it is discarded.
2) Compute the energy E_t of the t-th frame x_t(n) (1 ≤ t ≤ T):

$$E_t = \sum_{n=1}^{N} x_t^2(n), \quad 1 \le t \le T$$

giving the energy vector of the stream E = [E_1, E_2, …, E_T], where T is the total number of frames.
3) Because speech energy differs greatly across environments, judging silence against a fixed energy threshold has significant limitations; the relative energy relationship between speech and silence, however, is stable, so an adaptive energy threshold T_E is defined:

$$T_E = \min(E) + 0.3 \times [\operatorname{mean}(E) - \min(E)]$$

where min(E) is the minimum and mean(E) the average of the frame energies.
4) Compare each frame's energy against T_E: frames below the threshold are silent frames, the rest are speech frames; adjacent silent frames are spliced in order into a silent segment, and adjacent speech frames into a speech segment. This procedure is sketched in code below.
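As an illustration only, here is a minimal Python sketch of the above silence detection procedure; the function name, the NumPy dependency, and the boolean per-frame output are assumptions of this sketch rather than part of the patent.

```python
import numpy as np

def silence_detection(signal, fs, frame_ms=32, shift_ms=16):
    """Threshold-based silence detection: label each frame as
    speech (True) or silence (False) from its energy."""
    n = int(frame_ms / 1000 * fs)        # samples per frame, N = 0.032 * fs
    shift = int(shift_ms / 1000 * fs)    # frame shift, 16 ms
    # A trailing frame with fewer than n samples is discarded
    starts = range(0, len(signal) - n + 1, shift)
    energies = np.array([np.sum(signal[s:s + n].astype(float) ** 2)
                         for s in starts])
    # Adaptive threshold: T_E = min(E) + 0.3 * (mean(E) - min(E))
    t_e = energies.min() + 0.3 * (energies.mean() - energies.min())
    return energies >= t_e, energies
```

Runs of frames carrying the same label would then be spliced in order into silent segments and speech segments, as in step 4) above.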
The method of step 102 for locating the speaker change points with the Bayesian information criterion comprises the following steps:
1) Splice the speech segments obtained by silence detection in order into one long segment and cut it into data windows with a window length of 2 seconds and a window shift of 0.1 second. Divide each window into frames (frame length 32 milliseconds, frame shift 16 milliseconds) and extract the MFCCs and Delta-MFCCs of each frame, with dimension M = 12 each; the features of each window form a feature matrix F whose dimension is d = 2M = 24.
2) Compute the BIC distance between two adjacent data windows x and y:

$$\Delta\mathrm{BIC} = (n_x+n_y)\,\ln\!\bigl(\lvert\det(\operatorname{cov}(F_z))\rvert\bigr) - n_x\,\ln\!\bigl(\lvert\det(\operatorname{cov}(F_x))\rvert\bigr) - n_y\,\ln\!\bigl(\lvert\det(\operatorname{cov}(F_y))\rvert\bigr) - \alpha\Bigl(d+\frac{d(d+1)}{2}\Bigr)\ln(n_x+n_y)$$

where z is the window obtained by merging windows x and y; n_x and n_y are the numbers of frames in x and y; F_x, F_y and F_z are the feature matrices of x, y and z; cov(·) denotes the covariance matrix; det(·) denotes the matrix determinant; and α is a penalty coefficient whose experimental value is 2.0.
3) If the BIC distance ΔBIC is greater than zero, the two windows are regarded as belonging to two different speakers (i.e. a speaker change point lies between them); otherwise they are regarded as belonging to the same speaker and are merged.
4) Keep sliding the data windows, judging whether the BIC distance between each pair of adjacent windows exceeds zero and saving the change points, until the BIC distances between all adjacent windows of the long segment have been judged. A sketch of the distance computation follows below.
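A sketch of the ΔBIC computation under the above definitions; the NumPy dependency and the helper name log_det_cov belong to this sketch, not to the patent.

```python
import numpy as np

def delta_bic(fx, fy, alpha=2.0):
    """BIC distance between adjacent windows x and y.
    fx, fy: (frames x d) feature matrices of the two windows."""
    fz = np.vstack([fx, fy])            # merged window z
    nx, ny = len(fx), len(fy)
    d = fx.shape[1]

    def log_det_cov(f):
        # ln|det(cov(F))|; slogdet is numerically safer than det + log
        _, logdet = np.linalg.slogdet(np.cov(f, rowvar=False))
        return logdet

    penalty = alpha * (d + d * (d + 1) / 2) * np.log(nx + ny)
    return ((nx + ny) * log_det_cov(fz)
            - nx * log_det_cov(fx)
            - ny * log_det_cov(fy)
            - penalty)

# A change point is hypothesized between x and y when delta_bic(fx, fy) > 0;
# otherwise the two windows are merged and the scan continues.
```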
The extraction of the MFCC and Delta-MFCC features comprises the following steps:
1) Divide the speech signal into T frames with a frame length of 32 milliseconds (N = 0.032 × f_s samples per frame, f_s being the sampling frequency) and a frame shift of 16 milliseconds; if the last frame contains fewer than N samples, it is discarded.
2) Apply the discrete Fourier transform (DFT) to the t-th frame x_t(n) (1 ≤ t ≤ T) to obtain the linear spectrum X_t(k):

$$X_t(k) = \sum_{n=0}^{N-1} x_t(n)\, e^{-j 2\pi n k / N}, \quad 0 \le n, k \le N-1$$

3) Pass the linear spectrum X_t(k) through a Mel-frequency filter bank to obtain the Mel spectrum, then take the logarithm to obtain the log spectrum S_t(m). The Mel filter bank consists of M band-pass filters H_m(k), 0 ≤ m < M, each with a triangular response centred at frequency f(m); the spacing between adjacent centre frequencies f(m) is small for small m and grows gradually as m increases. The transfer function of each band-pass filter is:

$$H_m(k) = \begin{cases} 0, & k < f(m-1) \\[4pt] \dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\[4pt] \dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) < k \le f(m+1) \\[4pt] 0, & k > f(m+1) \end{cases} \qquad (0 \le m < M)$$

where f(m) is defined as:

$$f(m) = \left(\frac{N}{f_s}\right) B^{-1}\!\left( B(f_l) + m\, \frac{B(f_h) - B(f_l)}{M+1} \right)$$

in which f_l and f_h are the lowest and highest frequencies of the filter bank's operating range, B maps linear frequency to the Mel scale, and B^{-1} is its inverse: B^{-1}(b) = 700(e^{b/1125} − 1). The mapping from the linear spectrum X_t(k) to the log spectrum S_t(m) is therefore:

$$S_t(m) = \ln\!\left( \sum_{k=0}^{N-1} \lvert X_t(k) \rvert^2 H_m(k) \right), \quad 0 \le m < M$$

4) Transform the log spectrum S_t(m) to the cepstral domain with the discrete cosine transform (DCT) to obtain the MFCCs of the t-th frame, C_t(p):

$$C_t(p) = \sum_{m=0}^{M-1} S_t(m)\, \cos\!\left( \frac{(m+0.5)\, p\, \pi}{M} \right), \quad 0 \le p < M$$

5) Compute the first-order difference (Delta-MFCCs) C'_t(p) of the t-th frame's MFCCs:

$$C'_t(p) = \frac{\displaystyle\sum_{q=-Q}^{Q} q\, C_{t+q}(p)}{\displaystyle\sum_{q=-Q}^{Q} q^2}, \quad 0 \le p < M$$

where Q is a constant whose value is 3 in the experiments.
6) Repeat steps 2)–5) for every frame to obtain the MFCCs and Delta-MFCCs of all T frames; stack them frame by frame into an MFCC matrix and a Delta-MFCC matrix, and merge the two matrices to form the feature matrix F.
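For illustration, a compact feature-extraction sketch built on the librosa library (the library choice is an assumption of this sketch, and librosa's internal mel-filter and delta conventions differ in detail from the formulas above):

```python
import numpy as np
import librosa

def extract_features(y, sr, n_mfcc=12):
    """(frames x 24) feature matrix F: 12 MFCCs plus 12 Delta-MFCCs
    per frame, with 32 ms frames and a 16 ms shift."""
    n_fft = int(0.032 * sr)      # 32 ms frame length
    hop = int(0.016 * sr)        # 16 ms frame shift
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop)
    delta = librosa.feature.delta(mfcc, width=7)   # width = 2Q + 1, Q = 3
    return np.vstack([mfcc, delta]).T              # dimension d = 2M = 24
```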
At step 103, audio features comprising MFCCs and Delta-MFCCs are extracted from each of the segments produced by segmentation, and the single-speaker segments are clustered by speaker with a spectral clustering algorithm, yielding the number of speakers and each speaker's speech. The concrete steps are as follows (a code sketch follows the list):
1) Divide each speech segment into frames (frame length 32 milliseconds, frame shift 16 milliseconds) and extract the MFCCs and Delta-MFCCs of each frame, with dimension M = 12; the features of each segment form a feature matrix F_j whose dimension is d = 2M = 24.
2) Collect the feature matrices of all segments to be clustered into the set F = {F_1, …, F_J}, where J is the total number of segments, and construct from F the affinity matrix A ∈ R^{J×J}, whose (i, j)-th element A_ij is defined as:

$$A_{ij} = \begin{cases} \exp\!\left( -\dfrac{d^2(F_i, F_j)}{2\,\sigma_i \sigma_j} \right), & i \ne j \\[4pt] 0, & i = j \end{cases}$$

where d(F_i, F_j) is the Euclidean distance between feature matrices F_i and F_j, and σ_i (respectively σ_j) is a scale parameter defined as the variance of the vector of Euclidean distances between the i-th (respectively j-th) feature matrix and the other J − 1 feature matrices.
3) Construct the diagonal matrix D whose (i, i)-th element equals the sum of the i-th row of A, and from D and A construct the normalized affinity matrix L = D^{−1/2} A D^{−1/2}.
4) Compute the K_max largest eigenvalues λ_1 ≥ λ_2 ≥ … ≥ λ_{K_max} of the matrix L and the corresponding eigenvectors v_1, v_2, …, v_{K_max}, each v_k (1 ≤ k ≤ K_max) being a column vector, and estimate the optimal number of classes (i.e. the number of speakers) K from the differences between adjacent eigenvalues:

$$K = \arg\max_{i \in [1,\, K_{\max}-1]} (\lambda_i - \lambda_{i+1})$$

With the estimated number of speakers K, construct the matrix V = [v_1, v_2, …, v_K] ∈ R^{J×K}.
5) Normalize each row of V to obtain the matrix Y ∈ R^{J×K}, whose (j, k)-th element Y_jk is:

$$Y_{jk} = \frac{V_{jk}}{\sqrt{\sum_{k=1}^{K} V_{jk}^2}}, \quad 1 \le j \le J$$

6) Treat each row of Y as a point in the space R^K and cluster the J rows (i.e. J points) into K classes with the K-means algorithm.
7) The speech segment corresponding to feature matrix F_j is assigned to the k-th class (i.e. the k-th speaker) if and only if the j-th row of Y is clustered into the k-th class.
8) From the above clustering result, obtain the number of speakers and each speaker's speech and its duration.
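A condensed sketch of steps 2)–6) above in Python; taking a precomputed (J x J) distance matrix as input, and the function name and K-means initialization, are choices of this sketch rather than the patent's.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_cluster(dist, k_max=10):
    """Cluster J segments by speaker from their pairwise distances."""
    J = len(dist)
    k_max = min(k_max, J)
    # Scale parameters: variance of each segment's distances to the others
    sigma = np.array([np.var(np.delete(dist[i], i)) for i in range(J)])
    A = np.exp(-dist ** 2 / (2 * np.outer(sigma, sigma)))
    np.fill_diagonal(A, 0.0)
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = A * np.outer(d_inv_sqrt, d_inv_sqrt)       # D^{-1/2} A D^{-1/2}
    eigvals, eigvecs = np.linalg.eigh(L)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # descending order
    # Number of speakers K: largest gap between adjacent eigenvalues
    K = int(np.argmax(eigvals[:k_max - 1] - eigvals[1:k_max])) + 1
    V = eigvecs[:, :K]
    Y = V / np.linalg.norm(V, axis=1, keepdims=True)     # row-normalize
    _, labels = kmeans2(Y, K, minit='++')                # K-means step
    return K, labels
```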
Finally, at step 104, the energy envelope is extracted from each speaker's speech and the number of syllables is determined by detecting the local maximum points of the envelope, from which each speaker's speech rate is estimated. In standard Chinese, essentially every syllable contains a final (vowel nucleus); the number of finals equals the number of syllables, the number of syllables equals the number of words (characters), and within a syllable the final carries the maximum energy. The number of words can therefore be obtained by detecting the energy-maximal finals, and the speech rate estimated from it. The concrete steps of the speech-rate estimation based on this observation are as follows (a code sketch follows the list):
1) Compute the energy E(n) of each speaker's speech signal s(n):

$$E(n) = s^2(n), \quad 1 \le n \le Len$$

where Len is the total number of samples of the speech signal.
2) Filter the energy E(n) with a low-pass filter to obtain the energy envelope Ē(n). The filter's technical specification is: an FIR filter designed by the equiripple method, sampling frequency f_s = 16000 Hz, passband cutoff frequency f_pass = 50 Hz, stopband cutoff frequency f_stop = 100 Hz, maximum passband attenuation A_pass = 1 dB, minimum stopband attenuation A_stop = 80 dB.
3) Compute the energy-envelope threshold T_E:

$$T_E = 0.4 \times \operatorname{mean}(\bar{E}(n))$$

where mean(Ē(n)) is the average of the energy envelope.
4) Take as local maximum points the elements of the envelope that satisfy both of the following conditions:
Condition 1: the element's value is greater than the envelope threshold T_E;
Condition 2: the element's value is greater than that of every element within 0.07 seconds before and after it, i.e. greater than the 0.07 × f_s element values on each side.
The positions (samples) of these local maxima are the energy peaks of the finals of the syllables. The value 0.07 second is used because the minimum average duration of a syllable is about 0.14 second, so an element of Ē(n) that exceeds T_E and every element value within 0.07 second on either side marks the energy peak of one syllable's final.
5) Take the number of local maximum points in a speaker's energy envelope as the number of syllables (words), and divide it by the duration of that speaker's speech in seconds, giving the speaker's speech rate in words per second.
6) Repeat steps 1)–5) until the rate of every speaker's speech has been estimated.
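A minimal sketch of the above rate estimation in Python; the SciPy helpers, the window-method filter standing in for the patent's equiripple design, and find_peaks' distance rule (which approximates Condition 2) are assumptions of this sketch.

```python
import numpy as np
from scipy.signal import firwin, filtfilt, find_peaks

def speech_rate(s, fs=16000):
    """Estimate one speaker's speech rate (words/second) from the
    local maxima of the low-pass-filtered energy envelope."""
    energy = s.astype(float) ** 2                   # E(n) = s^2(n)
    # The patent specifies an 80 dB equiripple FIR low-pass
    # (50 Hz passband, 100 Hz stopband); a window-method design
    # with a 75 Hz cutoff stands in here for robustness.
    taps = firwin(1001, 75, fs=fs)
    envelope = filtfilt(taps, [1.0], energy)        # zero-phase filtering
    t_e = 0.4 * envelope.mean()                     # threshold T_E
    # Peaks above T_E with no higher sample within 0.07 s on either side
    peaks, _ = find_peaks(envelope, height=t_e, distance=int(0.07 * fs))
    return len(peaks) / (len(s) / fs)
```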
Fig. 2(a) shows the waveform of a 5-second speech signal from one speaker, and Fig. 2(b) shows the corresponding energy envelope (solid line), the envelope threshold (dotted line), and the local maxima of the envelope (circled dash-dot marks) obtained with the above speech-rate estimation steps. As can be seen from Fig. 2, the duration of this speaker's speech is 5 seconds and the number of local maxima is 22, i.e. the number of words is 22; this speaker's speech rate is therefore 4.4 words per second (264 words per minute).
Although the multi-speaker speech-rate estimation method of the present invention has been described in detail through the above embodiment, this description should not be construed as limiting the scope of the invention. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, and all of these fall within the protection scope of the invention. The protection scope of the invention is therefore defined by the appended claims.

Claims (5)

1. A method for estimating the speech rate of multiple speakers based on speaker segmentation and clustering, characterized in that it comprises the steps of:
1) reading in a speech stream: reading in a speech stream containing the speech of multiple speakers;
2) speaker segmentation: detecting the speaker change points in the speech stream and dividing the stream into a plurality of speech segments at those change points;
3) speaker clustering: gathering the speech segments of the same speaker into one class and splicing them together in order, obtaining the number of speakers and the speech of each speaker;
4) speech-rate estimation: extracting the energy envelope from each speaker's speech and determining the number of syllables from the local maximum points of the envelope, thereby estimating each speaker's speech rate; this step specifically comprising:
4.1) computing the energy of a speaker's speech;
4.2) filtering the extracted energy with a low-pass filter to obtain the energy envelope;
4.3) computing the energy-envelope threshold;
4.4) locating the local maximum points of the envelope and counting them, an element of the envelope being taken as a local maximum point when it satisfies both of the following conditions:
A) its value is greater than the energy-envelope threshold;
B) its value is greater than that of every element within 0.07 seconds before and after it;
the position of each local maximum point being the energy peak of the final (vowel nucleus) of a syllable;
4.5) taking the number of local maximum points in the speaker's energy envelope as the number of syllables and dividing it by the duration of the speaker's speech, giving that speaker's speech rate;
4.6) repeating steps 4.1)–4.5) until the rate of every speaker's speech has been estimated.
2. The multi-speaker speech-rate estimation method according to claim 1, characterized in that the speaker segmentation of step 2) comprises:
2.1) finding the silent segments and speech segments in the speech stream using a threshold-based silence detection algorithm;
2.2) splicing the speech segments in order into one long speech segment and extracting audio features from it;
2.3) using the extracted audio features to judge, under the Bayesian information criterion, the similarity between adjacent data windows in the long segment, thereby detecting the speaker change points;
2.4) dividing the speech stream into a plurality of speech segments at those change points, each segment containing only one speaker.
3. The multi-speaker speech-rate estimation method according to claim 2, characterized in that the threshold-based silence detection algorithm of step 2.1) comprises:
2.1.1) dividing the speech stream into frames and computing the energy of each frame, giving the energy feature vector of the stream;
2.1.2) computing the energy threshold;
2.1.3) comparing each frame's energy against the threshold: a frame below the threshold is a silent frame, otherwise it is a speech frame; adjacent silent frames are spliced in order into a silent segment, and adjacent speech frames into a speech segment.
4. The multi-speaker speech-rate estimation method according to claim 2, characterized in that the audio features of step 2.2) comprise the Mel-frequency cepstral coefficients and their first-order differences.
5. The multi-speaker speech-rate estimation method according to claim 1, characterized in that the speaker clustering of step 3) adopts a spectral clustering algorithm.
CN2011104035773A 2011-12-07 2011-12-07 Method for estimating speech speed of multiple speakers based on segmentation and clustering of speakers Expired - Fee Related CN102543063B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011104035773A CN102543063B (en) 2011-12-07 2011-12-07 Method for estimating speech speed of multiple speakers based on segmentation and clustering of speakers

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011104035773A CN102543063B (en) 2011-12-07 2011-12-07 Method for estimating speech speed of multiple speakers based on segmentation and clustering of speakers

Publications (2)

Publication Number Publication Date
CN102543063A CN102543063A (en) 2012-07-04
CN102543063B true CN102543063B (en) 2013-07-24

Family

ID=46349803

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011104035773A Expired - Fee Related CN102543063B (en) 2011-12-07 2011-12-07 Method for estimating speech speed of multiple speakers based on segmentation and clustering of speakers

Country Status (1)

Country Link
CN (1) CN102543063B (en)

Families Citing this family (39)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102760434A (en) * 2012-07-09 2012-10-31 华为终端有限公司 Method for updating voiceprint feature model and terminal
CN103137137B (en) * 2013-02-27 2015-07-01 华南理工大学 Eloquent speaker finding method in conference audio
JP6171544B2 (en) * 2013-05-08 2017-08-02 カシオ計算機株式会社 Audio processing apparatus, audio processing method, and program
CN104282303B (en) * 2013-07-09 2019-03-29 威盛电子股份有限公司 The method and its electronic device of speech recognition are carried out using Application on Voiceprint Recognition
CN103400580A (en) * 2013-07-23 2013-11-20 华南理工大学 Method for estimating importance degree of speaker in multiuser session voice
CN104347068B (en) * 2013-08-08 2020-05-22 索尼公司 Audio signal processing device and method and monitoring system
CN103530432A (en) * 2013-09-24 2014-01-22 华南理工大学 Conference recorder with speech extracting function and speech extracting method
CN104851423B (en) * 2014-02-19 2021-04-13 联想(北京)有限公司 Sound information processing method and device
CN104021785A (en) * 2014-05-28 2014-09-03 华南理工大学 Method of extracting speech of most important guest in meeting
CN104183239B (en) * 2014-07-25 2017-04-19 南京邮电大学 Method for identifying speaker unrelated to text based on weighted Bayes mixture model
CN105161093B (en) * 2015-10-14 2019-07-09 科大讯飞股份有限公司 A kind of method and system judging speaker's number
CN106971734B (en) * 2016-01-14 2020-10-23 芋头科技(杭州)有限公司 Method and system for training and identifying model according to extraction frequency of model
CN106205610B (en) * 2016-06-29 2019-11-26 联想(北京)有限公司 A kind of voice information identification method and equipment
CN107886955B (en) * 2016-09-29 2021-10-26 百度在线网络技术(北京)有限公司 Identity recognition method, device and equipment of voice conversation sample
CN106649513B (en) * 2016-10-14 2020-03-31 盐城工学院 Audio data clustering method based on spectral clustering
CN106531195B (en) * 2016-11-08 2019-09-27 北京理工大学 A kind of dialogue collision detection method and device
CN106782496B (en) * 2016-11-15 2019-08-20 北京科技大学 A kind of crowd's Monitoring of Quantity method based on voice and intelligent perception
CN106782507B (en) * 2016-12-19 2018-03-06 平安科技(深圳)有限公司 The method and device of voice segmentation
CN106782508A (en) * 2016-12-20 2017-05-31 美的集团股份有限公司 The cutting method of speech audio and the cutting device of speech audio
CN107342077A (en) * 2017-05-27 2017-11-10 国家计算机网络与信息安全管理中心 A kind of speaker segmentation clustering method and system based on factorial analysis
CN107967912B (en) * 2017-11-28 2022-02-25 广州势必可赢网络科技有限公司 Human voice segmentation method and device
CN109949813A (en) * 2017-12-20 2019-06-28 北京君林科技股份有限公司 A kind of method, apparatus and system converting speech into text
CN108962283B (en) * 2018-01-29 2020-11-06 北京猎户星空科技有限公司 Method and device for determining question end mute time and electronic equipment
CN108683790B (en) * 2018-04-23 2020-09-22 Oppo广东移动通信有限公司 Voice processing method and related product
CN108597521A (en) * 2018-05-04 2018-09-28 徐涌 Audio role divides interactive system, method, terminal and the medium with identification word
CN109461447B (en) * 2018-09-30 2023-08-18 厦门快商通信息技术有限公司 End-to-end speaker segmentation method and system based on deep learning
CN109859742B (en) * 2019-01-08 2021-04-09 国家计算机网络与信息安全管理中心 Speaker segmentation clustering method and device
CN109960743A (en) * 2019-01-16 2019-07-02 平安科技(深圳)有限公司 Conference content differentiating method, device, computer equipment and storage medium
CN110060665A (en) * 2019-03-15 2019-07-26 上海拍拍贷金融信息服务有限公司 Word speed detection method and device, readable storage medium storing program for executing
CN110364183A (en) * 2019-07-09 2019-10-22 深圳壹账通智能科技有限公司 Method, apparatus, computer equipment and the storage medium of voice quality inspection
CN111312256B (en) * 2019-10-31 2024-05-10 平安科技(深圳)有限公司 Voice identification method and device and computer equipment
CN112017685B (en) * 2020-08-27 2023-12-22 抖音视界有限公司 Speech generation method, device, equipment and computer readable medium
CN112423094A (en) * 2020-10-30 2021-02-26 广州佰锐网络科技有限公司 Double-recording service broadcasting method and device and storage medium
CN112669855A (en) * 2020-12-17 2021-04-16 北京沃东天骏信息技术有限公司 Voice processing method and device
CN112565880B (en) * 2020-12-28 2023-03-24 北京五街科技有限公司 Method and system for playing explanation videos
CN112565881B (en) * 2020-12-28 2023-03-24 北京五街科技有限公司 Self-adaptive video playing method and system
CN112802498B (en) * 2020-12-29 2023-11-24 深圳追一科技有限公司 Voice detection method, device, computer equipment and storage medium
CN112289323B (en) * 2020-12-29 2021-05-28 深圳追一科技有限公司 Voice data processing method and device, computer equipment and storage medium
CN114067787B (en) * 2021-12-17 2022-07-05 广东讯飞启明科技发展有限公司 Voice speech speed self-adaptive recognition system

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2990693B2 (en) * 1988-02-29 1999-12-13 株式会社明電舎 Speech synthesizer
US6873953B1 (en) * 2000-05-22 2005-03-29 Nuance Communications Prosody based endpoint detection
CN100505040C (en) * 2005-07-26 2009-06-24 浙江大学 Audio frequency splitting method for changing detection based on decision tree and speaking person
CN100485780C (en) * 2005-10-31 2009-05-06 浙江大学 Quick audio-frequency separating method based on tonic frequency

Also Published As

Publication number Publication date
CN102543063A (en) 2012-07-04

Similar Documents

Publication Publication Date Title
CN102543063B (en) Method for estimating speech speed of multiple speakers based on segmentation and clustering of speakers
Liu et al. Fast speaker change detection for broadcast news transcription and indexing
Zhu et al. Combining speaker identification and BIC for speaker diarization
CN103137137B (en) Eloquent speaker finding method in conference audio
CN103400580A (en) Method for estimating importance degree of speaker in multiuser session voice
CN102968986B (en) Overlapped voice and single voice distinguishing method based on long time characteristics and short time characteristics
Zhou et al. Efficient audio stream segmentation via the combined T/sup 2/statistic and Bayesian information criterion
CN100485780C (en) Quick audio-frequency separating method based on tonic frequency
Lokhande et al. Voice activity detection algorithm for speech recognition applications
Evangelopoulos et al. Multiband modulation energy tracking for noisy speech detection
Ananthapadmanabha et al. Detection of the closure-burst transitions of stops and affricates in continuous speech using the plosion index
CN104021785A (en) Method of extracting speech of most important guest in meeting
CN103559882A (en) Meeting presenter voice extracting method based on speaker division
Vyas A Gaussian mixture model based speech recognition system using Matlab
Jaafar et al. Automatic syllables segmentation for frog identification system
Chaudhary et al. Gender identification based on voice signal characteristics
Moattar et al. A new approach for robust realtime voice activity detection using spectral pattern
KR100717401B1 (en) Method and apparatus for normalizing voice feature vector by backward cumulative histogram
WO2018095167A1 (en) Voiceprint identification method and voiceprint identification system
KR101250668B1 (en) Method for recogning emergency speech using gmm
Chee et al. Automatic detection of prolongations and repetitions using LPCC
Hassan et al. Pattern classification in recognizing Qalqalah Kubra pronuncation using multilayer perceptrons
Couvreur et al. Automatic noise recognition in urban environments based on artificial neural networks and hidden markov models
Maganti et al. Unsupervised speech/non-speech detection for automatic speech recognition in meeting rooms
Kitaoka et al. Development of VAD evaluation framework CENSREC-1-C and investigation of relationship between VAD and speech recognition performance

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130724

Termination date: 20181207
