CN102543063B - Method for estimating speech speed of multiple speakers based on segmentation and clustering of speakers - Google Patents

Method for estimating speech speed of multiple speakers based on segmentation and clustering of speakers Download PDF

Info

Publication number
CN102543063B
Authority
CN
China
Prior art keywords
speaker
voice
energy
speech
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2011104035773A
Other languages
Chinese (zh)
Other versions
CN102543063A (en)
Inventor
李艳雄
徐鑫
贺前华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN2011104035773A
Publication of CN102543063A
Application granted
Publication of CN102543063B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention discloses a method for estimating the speech rate of multiple speakers based on speaker segmentation and clustering. The method comprises the following steps: first, a speech stream is read in; the speaker change points in the stream are detected, and the stream is divided into a plurality of speech segments at those change points; the segments are then clustered by speaker, and the segments of the same speaker are spliced together in order, yielding the number of speakers and the speech of each speaker; finally, the duration of each speaker's speech and the number of words it contains are estimated, from which each speaker's speech rate is obtained. Compared with speech-recognition-based methods that estimate a single speaker's rate, the method can not only estimate the speech rates of multiple speakers but also does so faster.

Description

Method for estimating the speech rate of multiple speakers based on speaker segmentation and clustering
Technical field
The present invention relates to speech signal processing and pattern recognition technology, and in particular to a method for estimating the speech rate of multiple speakers based on speaker segmentation and clustering.
Background technology
With the development of speech processing technology, the object of speech processing is gradually shifting from single-speaker speech to multi-speaker speech (for example, conference speech and conversational speech), and estimating the speech rates of multiple speakers, so that the parameters of a speech processing system (for example, a speech recognition system) can be adapted to each speaker's rate, is becoming increasingly important. In addition, during recording in studios or laboratories, speakers (for example, announcers, hosts, and customer-service staff) usually gauge their speech rate subjectively from experience, which is often not accurate enough. Although the speech rate can be estimated by manual annotation after the recording ends, this is very time-consuming and is hardly feasible when the amount of data is large. A method that automatically estimates the speech rates of multiple speakers is therefore of great importance.
Existing speech-rate estimation methods all target single-speaker speech: they can estimate only a single speaker's rate and cannot estimate the rates of multiple speakers. Moreover, existing methods mainly estimate the rate from speech recognition results: a speech recognizer first identifies the phoneme sequence and the time point of each phoneme in the input speech, then the word sequence and the time point of each word, from which the speaker's rate is estimated.
The shortcomings of the above speech-rate estimation methods are:
(1) Only the rate of a single speaker's speech can be estimated. When the input contains several speakers, it is processed as if it came from one speaker, and no per-speaker rate estimates are obtained.
(2) The estimation is slow. These methods first run speech recognition on the input and then estimate the rate from the recognized phoneme and word sequences. This requires training a large number of phoneme models (typically hidden Markov models) and performing heavy computation at recognition time (feature extraction, evaluation of acoustic-model and language-model output probabilities, and so on), so the methods are slow and ill-suited to real-time processing.
Summary of the invention
The objective of the invention is to overcome the above defects of the prior art by providing a method for estimating the speech rate of multiple speakers based on speaker segmentation and clustering: the speech stream is first divided into speech segments by speaker segmentation and clustering, and the segments of the same speaker are spliced together in order; the number of words in each speaker's speech and its duration are then estimated separately, realizing speech-rate estimation for multiple speakers.
The technical solution adopted by the present invention comprises the following steps:
1) Reading in the speech stream: read in a speech stream containing the speech of multiple speakers.
2) Speaker segmentation: detect the speaker change points in the speech stream and divide the stream into a plurality of speech segments at those change points.
3) Speaker clustering: cluster the speech segments by speaker with a spectral clustering algorithm and splice the segments of the same speaker together in order, obtaining the number of speakers and the speech of each speaker.
4) Speech-rate estimation: extract the energy envelope from each speaker's speech and determine the number of syllables from the local maximum points of the envelope, thereby estimating each speaker's speech rate.
The speaker segmentation of step 2) comprises:
2.1) finding the silent segments and speech segments in the speech stream using a threshold-based silence detection algorithm;
2.2) splicing the speech segments in order into one long speech segment, and extracting from it audio features comprising the Mel-frequency cepstral coefficients (MFCCs) and their first-order differences (Delta-MFCCs);
2.3) using the extracted audio features to judge, under the Bayesian information criterion, the similarity between adjacent data windows in the long segment, thereby detecting the speaker change points;
2.4) dividing the speech stream into a plurality of speech segments at those change points, each segment containing only one speaker.
The threshold-based silence detection algorithm of step 2.1) comprises:
2.1.1) dividing the speech stream into frames and computing the energy of each frame, giving the energy feature vector of the stream;
2.1.2) computing the energy threshold;
2.1.3) comparing each frame's energy against the threshold: a frame below the threshold is a silent frame, otherwise it is a speech frame; adjacent silent frames are spliced in order into a silent segment, and adjacent speech frames into a speech segment.
The speech-rate estimation of step 4) comprises:
4.1) computing the energy of a speaker's speech;
4.2) filtering the extracted energy with a low-pass filter to obtain the energy envelope;
4.3) computing the energy-envelope threshold;
4.4) locating the local maximum points of the envelope and counting them;
4.5) taking the number of local maximum points in the speaker's energy envelope as the number of syllables and dividing it by the duration of the speaker's speech, giving that speaker's speech rate;
4.6) repeating steps 4.1)–4.5) until the rate of every speaker's speech has been estimated.
A local maximum point satisfies the following conditions:
A) its value is greater than the energy-envelope threshold;
B) its value is greater than that of every element within 0.07 seconds before and after it.
The position of each local maximum point is the energy peak of the final (the vowel nucleus) of a syllable.
The beneficial effects of the invention are as follows: speaker segmentation cuts a speech stream containing multiple speakers into segments each containing only one speaker, and speaker clustering then recombines the segments of the same speaker, so the invention can estimate the speech rates of multi-speaker speech. In addition, the number of syllables is determined by detecting the local maxima of each speaker's energy envelope, so the rate is estimated without the complex numerical computation (for example, evaluating acoustic-model and language-model output probabilities) required by recognition-based methods; this saves computation time and makes the method better suited to real-time speech-rate estimation.
Description of drawings
Fig. 1 is a flow chart of the present invention.
Fig. 2 illustrates the speech-rate estimation of an embodiment of the invention: Fig. 2(a) is the waveform of one speaker's speech, and Fig. 2(b) is the extracted speech energy, in which the solid line is the energy envelope, the circled dash-dot marks are the local maxima of the envelope, and the dotted line is the envelope threshold.
Embodiment
The invention is described in detail below with reference to a specific embodiment and the accompanying drawings.
Fig. 1 is the flow chart of a method for estimating the speech rates of multiple speakers according to an embodiment of the invention. As shown in Fig. 1, a speech stream is first read in at step 101. The speech stream is speech data containing multiple speakers and may be a file of various formats, for example WAV, RAM, MP3, or VOX.
Then, at step 102, the silent segments and speech segments of the stream are located by a threshold-based silence detection method; the speech segments are spliced in order into one long segment and audio features are extracted from it; the similarity between adjacent data windows of the long segment is judged under the Bayesian information criterion to detect the speaker change points; finally, the stream is divided at those change points into a plurality of segments, each containing only one speaker.
The silence detection method of step 102 comprises the following steps:
1) Divide the speech stream into T frames with a frame length of 32 milliseconds (the number of samples per frame is N = 0.032 × f_s, where f_s is the sampling frequency of the speech signal) and a frame shift of 16 milliseconds; if the last frame contains fewer than N samples, it is discarded.
2) Compute the energy E_t of the t-th frame x_t(n) (1 ≤ t ≤ T):

$$E_t = \sum_{n=1}^{N} x_t^2(n), \quad 1 \le t \le T$$

giving the energy vector of the stream E = [E_1, E_2, …, E_T], where T is the total number of frames.
3) Because speech energy differs greatly across environments, judging silence against a fixed energy threshold has significant limitations; the relative energy relationship between speech and silence, however, is stable, so an adaptive energy threshold T_E is defined:

$$T_E = \min(E) + 0.3 \times [\operatorname{mean}(E) - \min(E)]$$

where min(E) is the minimum and mean(E) the average of the frame energies.
4) Compare each frame's energy against T_E: frames below the threshold are silent frames, the rest are speech frames; adjacent silent frames are spliced in order into a silent segment, and adjacent speech frames into a speech segment. This procedure is sketched in code below.
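As an illustration only, here is a minimal Python sketch of the above silence detection procedure; the function name, the NumPy dependency, and the boolean per-frame output are assumptions of this sketch rather than part of the patent.

```python
import numpy as np

def silence_detection(signal, fs, frame_ms=32, shift_ms=16):
    """Threshold-based silence detection: label each frame as
    speech (True) or silence (False) from its energy."""
    n = int(frame_ms / 1000 * fs)        # samples per frame, N = 0.032 * fs
    shift = int(shift_ms / 1000 * fs)    # frame shift, 16 ms
    # A trailing frame with fewer than n samples is discarded
    starts = range(0, len(signal) - n + 1, shift)
    energies = np.array([np.sum(signal[s:s + n].astype(float) ** 2)
                         for s in starts])
    # Adaptive threshold: T_E = min(E) + 0.3 * (mean(E) - min(E))
    t_e = energies.min() + 0.3 * (energies.mean() - energies.min())
    return energies >= t_e, energies
```

Runs of frames carrying the same label would then be spliced in order into silent segments and speech segments, as in step 4) above.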
The method of step 102 for locating the speaker change points with the Bayesian information criterion comprises the following steps:
1) Splice the speech segments obtained by silence detection in order into one long segment and cut it into data windows with a window length of 2 seconds and a window shift of 0.1 second. Divide each window into frames (frame length 32 milliseconds, frame shift 16 milliseconds) and extract the MFCCs and Delta-MFCCs of each frame, with dimension M = 12 each; the features of each window form a feature matrix F whose dimension is d = 2M = 24.
2) Compute the BIC distance between two adjacent data windows x and y:

$$\Delta\mathrm{BIC} = (n_x+n_y)\,\ln\!\bigl(\lvert\det(\operatorname{cov}(F_z))\rvert\bigr) - n_x\,\ln\!\bigl(\lvert\det(\operatorname{cov}(F_x))\rvert\bigr) - n_y\,\ln\!\bigl(\lvert\det(\operatorname{cov}(F_y))\rvert\bigr) - \alpha\Bigl(d+\frac{d(d+1)}{2}\Bigr)\ln(n_x+n_y)$$

where z is the window obtained by merging windows x and y; n_x and n_y are the numbers of frames in x and y; F_x, F_y and F_z are the feature matrices of x, y and z; cov(·) denotes the covariance matrix; det(·) denotes the matrix determinant; and α is a penalty coefficient whose experimental value is 2.0.
3) If the BIC distance ΔBIC is greater than zero, the two windows are regarded as belonging to two different speakers (i.e. a speaker change point lies between them); otherwise they are regarded as belonging to the same speaker and are merged.
4) Keep sliding the data windows, judging whether the BIC distance between each pair of adjacent windows exceeds zero and saving the change points, until the BIC distances between all adjacent windows of the long segment have been judged. A sketch of the distance computation follows below.
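A sketch of the ΔBIC computation under the above definitions; the NumPy dependency and the helper name log_det_cov belong to this sketch, not to the patent.

```python
import numpy as np

def delta_bic(fx, fy, alpha=2.0):
    """BIC distance between adjacent windows x and y.
    fx, fy: (frames x d) feature matrices of the two windows."""
    fz = np.vstack([fx, fy])            # merged window z
    nx, ny = len(fx), len(fy)
    d = fx.shape[1]

    def log_det_cov(f):
        # ln|det(cov(F))|; slogdet is numerically safer than det + log
        _, logdet = np.linalg.slogdet(np.cov(f, rowvar=False))
        return logdet

    penalty = alpha * (d + d * (d + 1) / 2) * np.log(nx + ny)
    return ((nx + ny) * log_det_cov(fz)
            - nx * log_det_cov(fx)
            - ny * log_det_cov(fy)
            - penalty)

# A change point is hypothesized between x and y when delta_bic(fx, fy) > 0;
# otherwise the two windows are merged and the scan continues.
```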
The extraction of the MFCC and Delta-MFCC features comprises the following steps:
1) Divide the speech signal into T frames with a frame length of 32 milliseconds (N = 0.032 × f_s samples per frame, f_s being the sampling frequency) and a frame shift of 16 milliseconds; if the last frame contains fewer than N samples, it is discarded.
2) Apply the discrete Fourier transform (DFT) to the t-th frame x_t(n) (1 ≤ t ≤ T) to obtain the linear spectrum X_t(k):

$$X_t(k) = \sum_{n=0}^{N-1} x_t(n)\, e^{-j 2\pi n k / N}, \quad 0 \le n, k \le N-1$$

3) Pass the linear spectrum X_t(k) through a Mel-frequency filter bank to obtain the Mel spectrum, then take the logarithm to obtain the log spectrum S_t(m). The Mel filter bank consists of M band-pass filters H_m(k), 0 ≤ m < M, each with a triangular response centred at frequency f(m); the spacing between adjacent centre frequencies f(m) is small for small m and grows gradually as m increases. The transfer function of each band-pass filter is:

$$H_m(k) = \begin{cases} 0, & k < f(m-1) \\[4pt] \dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\[4pt] \dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) < k \le f(m+1) \\[4pt] 0, & k > f(m+1) \end{cases} \qquad (0 \le m < M)$$

where f(m) is defined as:

$$f(m) = \left(\frac{N}{f_s}\right) B^{-1}\!\left( B(f_l) + m\, \frac{B(f_h) - B(f_l)}{M+1} \right)$$

in which f_l and f_h are the lowest and highest frequencies of the filter bank's operating range, B maps linear frequency to the Mel scale, and B^{-1} is its inverse: B^{-1}(b) = 700(e^{b/1125} − 1). The mapping from the linear spectrum X_t(k) to the log spectrum S_t(m) is therefore:

$$S_t(m) = \ln\!\left( \sum_{k=0}^{N-1} \lvert X_t(k) \rvert^2 H_m(k) \right), \quad 0 \le m < M$$

4) Transform the log spectrum S_t(m) to the cepstral domain with the discrete cosine transform (DCT) to obtain the MFCCs of the t-th frame, C_t(p):

$$C_t(p) = \sum_{m=0}^{M-1} S_t(m)\, \cos\!\left( \frac{(m+0.5)\, p\, \pi}{M} \right), \quad 0 \le p < M$$

5) Compute the first-order difference (Delta-MFCCs) C'_t(p) of the t-th frame's MFCCs:

$$C'_t(p) = \frac{\displaystyle\sum_{q=-Q}^{Q} q\, C_{t+q}(p)}{\displaystyle\sum_{q=-Q}^{Q} q^2}, \quad 0 \le p < M$$

where Q is a constant whose value is 3 in the experiments.
6) Repeat steps 2)–5) for every frame to obtain the MFCCs and Delta-MFCCs of all T frames; stack them frame by frame into an MFCC matrix and a Delta-MFCC matrix, and merge the two matrices to form the feature matrix F.
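For illustration, a compact feature-extraction sketch built on the librosa library (the library choice is an assumption of this sketch, and librosa's internal mel-filter and delta conventions differ in detail from the formulas above):

```python
import numpy as np
import librosa

def extract_features(y, sr, n_mfcc=12):
    """(frames x 24) feature matrix F: 12 MFCCs plus 12 Delta-MFCCs
    per frame, with 32 ms frames and a 16 ms shift."""
    n_fft = int(0.032 * sr)      # 32 ms frame length
    hop = int(0.016 * sr)        # 16 ms frame shift
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop)
    delta = librosa.feature.delta(mfcc, width=7)   # width = 2Q + 1, Q = 3
    return np.vstack([mfcc, delta]).T              # dimension d = 2M = 24
```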
At step 103, audio features comprising MFCCs and Delta-MFCCs are extracted from each of the segments produced by segmentation, and the single-speaker segments are clustered by speaker with a spectral clustering algorithm, yielding the number of speakers and each speaker's speech. The concrete steps are as follows (a code sketch follows the list):
1) Divide each speech segment into frames (frame length 32 milliseconds, frame shift 16 milliseconds) and extract the MFCCs and Delta-MFCCs of each frame, with dimension M = 12; the features of each segment form a feature matrix F_j whose dimension is d = 2M = 24.
2) Collect the feature matrices of all segments to be clustered into the set F = {F_1, …, F_J}, where J is the total number of segments, and construct from F the affinity matrix A ∈ R^{J×J}, whose (i, j)-th element A_ij is defined as:

$$A_{ij} = \begin{cases} \exp\!\left( -\dfrac{d^2(F_i, F_j)}{2\,\sigma_i \sigma_j} \right), & i \ne j \\[4pt] 0, & i = j \end{cases}$$

where d(F_i, F_j) is the Euclidean distance between feature matrices F_i and F_j, and σ_i (respectively σ_j) is a scale parameter defined as the variance of the vector of Euclidean distances between the i-th (respectively j-th) feature matrix and the other J − 1 feature matrices.
3) Construct the diagonal matrix D whose (i, i)-th element equals the sum of the i-th row of A, and from D and A construct the normalized affinity matrix L = D^{−1/2} A D^{−1/2}.
4) Compute the K_max largest eigenvalues λ_1 ≥ λ_2 ≥ … ≥ λ_{K_max} of the matrix L and the corresponding eigenvectors v_1, v_2, …, v_{K_max}, each v_k (1 ≤ k ≤ K_max) being a column vector, and estimate the optimal number of classes (i.e. the number of speakers) K from the differences between adjacent eigenvalues:

$$K = \arg\max_{i \in [1,\, K_{\max}-1]} (\lambda_i - \lambda_{i+1})$$

With the estimated number of speakers K, construct the matrix V = [v_1, v_2, …, v_K] ∈ R^{J×K}.
5) Normalize each row of V to obtain the matrix Y ∈ R^{J×K}, whose (j, k)-th element Y_jk is:

$$Y_{jk} = \frac{V_{jk}}{\sqrt{\sum_{k=1}^{K} V_{jk}^2}}, \quad 1 \le j \le J$$

6) Treat each row of Y as a point in the space R^K and cluster the J rows (i.e. J points) into K classes with the K-means algorithm.
7) The speech segment corresponding to feature matrix F_j is assigned to the k-th class (i.e. the k-th speaker) if and only if the j-th row of Y is clustered into the k-th class.
8) From the above clustering result, obtain the number of speakers and each speaker's speech and its duration.
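A condensed sketch of steps 2)–6) above in Python; taking a precomputed (J x J) distance matrix as input, and the function name and K-means initialization, are choices of this sketch rather than the patent's.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_cluster(dist, k_max=10):
    """Cluster J segments by speaker from their pairwise distances."""
    J = len(dist)
    k_max = min(k_max, J)
    # Scale parameters: variance of each segment's distances to the others
    sigma = np.array([np.var(np.delete(dist[i], i)) for i in range(J)])
    A = np.exp(-dist ** 2 / (2 * np.outer(sigma, sigma)))
    np.fill_diagonal(A, 0.0)
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = A * np.outer(d_inv_sqrt, d_inv_sqrt)       # D^{-1/2} A D^{-1/2}
    eigvals, eigvecs = np.linalg.eigh(L)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # descending order
    # Number of speakers K: largest gap between adjacent eigenvalues
    K = int(np.argmax(eigvals[:k_max - 1] - eigvals[1:k_max])) + 1
    V = eigvecs[:, :K]
    Y = V / np.linalg.norm(V, axis=1, keepdims=True)     # row-normalize
    _, labels = kmeans2(Y, K, minit='++')                # K-means step
    return K, labels
```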
Finally, at step 104, the energy envelope is extracted from each speaker's speech and the number of syllables is determined by detecting the local maximum points of the envelope, from which each speaker's speech rate is estimated. In standard Chinese, essentially every syllable contains a final (vowel nucleus); the number of finals equals the number of syllables, the number of syllables equals the number of words (characters), and within a syllable the final carries the maximum energy. The number of words can therefore be obtained by detecting the energy-maximal finals, and the speech rate estimated from it. The concrete steps of the speech-rate estimation based on this observation are as follows (a code sketch follows the list):
1) Compute the energy E(n) of each speaker's speech signal s(n):

$$E(n) = s^2(n), \quad 1 \le n \le Len$$

where Len is the total number of samples of the speech signal.
2) Filter the energy E(n) with a low-pass filter to obtain the energy envelope Ē(n). The filter's technical specification is: an FIR filter designed by the equiripple method, sampling frequency f_s = 16000 Hz, passband cutoff frequency f_pass = 50 Hz, stopband cutoff frequency f_stop = 100 Hz, maximum passband attenuation A_pass = 1 dB, minimum stopband attenuation A_stop = 80 dB.
3) Compute the energy-envelope threshold T_E:

$$T_E = 0.4 \times \operatorname{mean}(\bar{E}(n))$$

where mean(Ē(n)) is the average of the energy envelope.
4) Take as local maximum points the elements of the envelope that satisfy both of the following conditions:
Condition 1: the element's value is greater than the envelope threshold T_E;
Condition 2: the element's value is greater than that of every element within 0.07 seconds before and after it, i.e. greater than the 0.07 × f_s element values on each side.
The positions (samples) of these local maxima are the energy peaks of the finals of the syllables. The value 0.07 second is used because the minimum average duration of a syllable is about 0.14 second, so an element of Ē(n) that exceeds T_E and every element value within 0.07 second on either side marks the energy peak of one syllable's final.
5) Take the number of local maximum points in a speaker's energy envelope as the number of syllables (words), and divide it by the duration of that speaker's speech in seconds, giving the speaker's speech rate in words per second.
6) Repeat steps 1)–5) until the rate of every speaker's speech has been estimated.
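A minimal sketch of the above rate estimation in Python; the SciPy helpers, the window-method filter standing in for the patent's equiripple design, and find_peaks' distance rule (which approximates Condition 2) are assumptions of this sketch.

```python
import numpy as np
from scipy.signal import firwin, filtfilt, find_peaks

def speech_rate(s, fs=16000):
    """Estimate one speaker's speech rate (words/second) from the
    local maxima of the low-pass-filtered energy envelope."""
    energy = s.astype(float) ** 2                   # E(n) = s^2(n)
    # The patent specifies an 80 dB equiripple FIR low-pass
    # (50 Hz passband, 100 Hz stopband); a window-method design
    # with a 75 Hz cutoff stands in here for robustness.
    taps = firwin(1001, 75, fs=fs)
    envelope = filtfilt(taps, [1.0], energy)        # zero-phase filtering
    t_e = 0.4 * envelope.mean()                     # threshold T_E
    # Peaks above T_E with no higher sample within 0.07 s on either side
    peaks, _ = find_peaks(envelope, height=t_e, distance=int(0.07 * fs))
    return len(peaks) / (len(s) / fs)
```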
Fig. 2(a) shows the waveform of a 5-second speech signal from one speaker, and Fig. 2(b) shows the corresponding energy envelope (solid line), the envelope threshold (dotted line), and the local maxima of the envelope (circled dash-dot marks) obtained with the above speech-rate estimation steps. As can be seen from Fig. 2, the duration of this speaker's speech is 5 seconds and the number of local maxima is 22, i.e. the number of words is 22; this speaker's speech rate is therefore 4.4 words per second (264 words per minute).
Although the multi-speaker speech-rate estimation method of the present invention has been described in detail through the above embodiment, this description should not be construed as limiting the scope of the invention. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, and all of these fall within the protection scope of the invention. The protection scope of the invention is therefore defined by the appended claims.

Claims (5)

1. A method for estimating the speech rate of multiple speakers based on speaker segmentation and clustering, characterized in that it comprises the steps of:
1) reading in a speech stream: reading in a speech stream containing the speech of multiple speakers;
2) speaker segmentation: detecting the speaker change points in the speech stream and dividing the stream into a plurality of speech segments at those change points;
3) speaker clustering: gathering the speech segments of the same speaker into one class and splicing them together in order, obtaining the number of speakers and the speech of each speaker;
4) speech-rate estimation: extracting the energy envelope from each speaker's speech and determining the number of syllables from the local maximum points of the envelope, thereby estimating each speaker's speech rate; this step specifically comprising:
4.1) computing the energy of a speaker's speech;
4.2) filtering the extracted energy with a low-pass filter to obtain the energy envelope;
4.3) computing the energy-envelope threshold;
4.4) locating the local maximum points of the envelope and counting them, an element of the envelope being taken as a local maximum point when it satisfies both of the following conditions:
A) its value is greater than the energy-envelope threshold;
B) its value is greater than that of every element within 0.07 seconds before and after it;
the position of each local maximum point being the energy peak of the final (vowel nucleus) of a syllable;
4.5) taking the number of local maximum points in the speaker's energy envelope as the number of syllables and dividing it by the duration of the speaker's speech, giving that speaker's speech rate;
4.6) repeating steps 4.1)–4.5) until the rate of every speaker's speech has been estimated.
2. The multi-speaker speech-rate estimation method according to claim 1, characterized in that the speaker segmentation of step 2) comprises:
2.1) finding the silent segments and speech segments in the speech stream using a threshold-based silence detection algorithm;
2.2) splicing the speech segments in order into one long speech segment and extracting audio features from it;
2.3) using the extracted audio features to judge, under the Bayesian information criterion, the similarity between adjacent data windows in the long segment, thereby detecting the speaker change points;
2.4) dividing the speech stream into a plurality of speech segments at those change points, each segment containing only one speaker.
3. The multi-speaker speech-rate estimation method according to claim 2, characterized in that the threshold-based silence detection algorithm of step 2.1) comprises:
2.1.1) dividing the speech stream into frames and computing the energy of each frame, giving the energy feature vector of the stream;
2.1.2) computing the energy threshold;
2.1.3) comparing each frame's energy against the threshold: a frame below the threshold is a silent frame, otherwise it is a speech frame; adjacent silent frames are spliced in order into a silent segment, and adjacent speech frames into a speech segment.
4. The multi-speaker speech-rate estimation method according to claim 2, characterized in that the audio features of step 2.2) comprise the Mel-frequency cepstral coefficients and their first-order differences.
5. The multi-speaker speech-rate estimation method according to claim 1, characterized in that the speaker clustering of step 3) adopts a spectral clustering algorithm.
CN2011104035773A 2011-12-07 2011-12-07 Method for estimating speech speed of multiple speakers based on segmentation and clustering of speakers Expired - Fee Related CN102543063B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011104035773A CN102543063B (en) 2011-12-07 2011-12-07 Method for estimating speech speed of multiple speakers based on segmentation and clustering of speakers

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011104035773A CN102543063B (en) 2011-12-07 2011-12-07 Method for estimating speech speed of multiple speakers based on segmentation and clustering of speakers

Publications (2)

Publication Number Publication Date
CN102543063A CN102543063A (en) 2012-07-04
CN102543063B true CN102543063B (en) 2013-07-24

Family

ID=46349803

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011104035773A Expired - Fee Related CN102543063B (en) 2011-12-07 2011-12-07 Method for estimating speech speed of multiple speakers based on segmentation and clustering of speakers

Country Status (1)

Country Link
CN (1) CN102543063B (en)

Families Citing this family (39)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102760434A (en) * 2012-07-09 2012-10-31 华为终端有限公司 Method for updating voiceprint feature model and terminal
CN103137137B (en) * 2013-02-27 2015-07-01 华南理工大学 Eloquent speaker finding method in conference audio
JP6171544B2 (en) * 2013-05-08 2017-08-02 カシオ計算機株式会社 Audio processing apparatus, audio processing method, and program
CN104282303B (en) * 2013-07-09 2019-03-29 威盛电子股份有限公司 The method and its electronic device of speech recognition are carried out using Application on Voiceprint Recognition
CN103400580A (en) * 2013-07-23 2013-11-20 华南理工大学 Method for estimating importance degree of speaker in multiuser session voice
CN104347068B (en) * 2013-08-08 2020-05-22 索尼公司 Audio signal processing device and method and monitoring system
CN103530432A (en) * 2013-09-24 2014-01-22 华南理工大学 Conference recorder with speech extracting function and speech extracting method
CN104851423B (en) * 2014-02-19 2021-04-13 联想(北京)有限公司 Sound information processing method and device
CN104021785A (en) * 2014-05-28 2014-09-03 华南理工大学 Method of extracting speech of most important guest in meeting
CN104183239B (en) * 2014-07-25 2017-04-19 南京邮电大学 Method for identifying speaker unrelated to text based on weighted Bayes mixture model
CN105161093B (en) * 2015-10-14 2019-07-09 科大讯飞股份有限公司 A kind of method and system judging speaker's number
CN106971734B (en) * 2016-01-14 2020-10-23 芋头科技(杭州)有限公司 Method and system for training and identifying model according to extraction frequency of model
CN106205610B (en) * 2016-06-29 2019-11-26 联想(北京)有限公司 A kind of voice information identification method and equipment
CN107886955B (en) * 2016-09-29 2021-10-26 百度在线网络技术(北京)有限公司 Identity recognition method, device and equipment of voice conversation sample
CN106649513B (en) * 2016-10-14 2020-03-31 盐城工学院 Audio data clustering method based on spectral clustering
CN106531195B (en) * 2016-11-08 2019-09-27 北京理工大学 A kind of dialogue collision detection method and device
CN106782496B (en) * 2016-11-15 2019-08-20 北京科技大学 A kind of crowd's Monitoring of Quantity method based on voice and intelligent perception
CN106782507B (en) * 2016-12-19 2018-03-06 平安科技(深圳)有限公司 The method and device of voice segmentation
CN106782508A (en) * 2016-12-20 2017-05-31 美的集团股份有限公司 The cutting method of speech audio and the cutting device of speech audio
CN107342077A (en) * 2017-05-27 2017-11-10 国家计算机网络与信息安全管理中心 A kind of speaker segmentation clustering method and system based on factorial analysis
CN107967912B (en) * 2017-11-28 2022-02-25 广州势必可赢网络科技有限公司 Human voice segmentation method and device
CN109949813A (en) * 2017-12-20 2019-06-28 北京君林科技股份有限公司 A kind of method, apparatus and system converting speech into text
CN108962283B (en) * 2018-01-29 2020-11-06 北京猎户星空科技有限公司 Method and device for determining question end mute time and electronic equipment
CN108683790B (en) * 2018-04-23 2020-09-22 Oppo广东移动通信有限公司 Voice processing method and related product
CN108597521A (en) * 2018-05-04 2018-09-28 徐涌 Audio role divides interactive system, method, terminal and the medium with identification word
CN109461447B (en) * 2018-09-30 2023-08-18 厦门快商通信息技术有限公司 End-to-end speaker segmentation method and system based on deep learning
CN109859742B (en) * 2019-01-08 2021-04-09 国家计算机网络与信息安全管理中心 Speaker segmentation clustering method and device
CN109960743A (en) * 2019-01-16 2019-07-02 平安科技(深圳)有限公司 Conference content differentiating method, device, computer equipment and storage medium
CN110060665A (en) * 2019-03-15 2019-07-26 上海拍拍贷金融信息服务有限公司 Word speed detection method and device, readable storage medium storing program for executing
CN110364183A (en) * 2019-07-09 2019-10-22 深圳壹账通智能科技有限公司 Method, apparatus, computer equipment and the storage medium of voice quality inspection
CN111312256B (en) * 2019-10-31 2024-05-10 平安科技(深圳)有限公司 Voice identification method and device and computer equipment
CN112017685B (en) * 2020-08-27 2023-12-22 抖音视界有限公司 Speech generation method, device, equipment and computer readable medium
CN112423094A (en) * 2020-10-30 2021-02-26 广州佰锐网络科技有限公司 Double-recording service broadcasting method and device and storage medium
CN112669855A (en) * 2020-12-17 2021-04-16 北京沃东天骏信息技术有限公司 Voice processing method and device
CN112565880B (en) * 2020-12-28 2023-03-24 北京五街科技有限公司 Method and system for playing explanation videos
CN112565881B (en) * 2020-12-28 2023-03-24 北京五街科技有限公司 Self-adaptive video playing method and system
CN112802498B (en) * 2020-12-29 2023-11-24 深圳追一科技有限公司 Voice detection method, device, computer equipment and storage medium
CN112289323B (en) * 2020-12-29 2021-05-28 深圳追一科技有限公司 Voice data processing method and device, computer equipment and storage medium
CN114067787B (en) * 2021-12-17 2022-07-05 广东讯飞启明科技发展有限公司 Voice speech speed self-adaptive recognition system

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2990693B2 (en) * 1988-02-29 1999-12-13 株式会社明電舎 Speech synthesizer
US6873953B1 (en) * 2000-05-22 2005-03-29 Nuance Communications Prosody based endpoint detection
CN100505040C (en) * 2005-07-26 2009-06-24 浙江大学 Audio frequency splitting method for changing detection based on decision tree and speaking person
CN100485780C (en) * 2005-10-31 2009-05-06 浙江大学 Quick audio-frequency separating method based on tonic frequency

Also Published As

Publication number Publication date
CN102543063A (en) 2012-07-04

Similar Documents

Publication Publication Date Title
CN102543063B (en) Method for estimating speech speed of multiple speakers based on segmentation and clustering of speakers
Liu et al. Fast speaker change detection for broadcast news transcription and indexing
Zhu et al. Combining speaker identification and BIC for speaker diarization
CN103137137B (en) Eloquent speaker finding method in conference audio
CN103400580A (en) Method for estimating importance degree of speaker in multiuser session voice
CN102968986B (en) Overlapped voice and single voice distinguishing method based on long time characteristics and short time characteristics
Zhou et al. Efficient audio stream segmentation via the combined T/sup 2/statistic and Bayesian information criterion
CN100485780C (en) Quick audio-frequency separating method based on tonic frequency
Lokhande et al. Voice activity detection algorithm for speech recognition applications
Evangelopoulos et al. Multiband modulation energy tracking for noisy speech detection
Ananthapadmanabha et al. Detection of the closure-burst transitions of stops and affricates in continuous speech using the plosion index
CN104021785A (en) Method of extracting speech of most important guest in meeting
CN103559882A (en) Meeting presenter voice extracting method based on speaker division
Vyas A Gaussian mixture model based speech recognition system using Matlab
Jaafar et al. Automatic syllables segmentation for frog identification system
Chaudhary et al. Gender identification based on voice signal characteristics
Moattar et al. A new approach for robust realtime voice activity detection using spectral pattern
KR100717401B1 (en) Method and apparatus for normalizing voice feature vector by backward cumulative histogram
WO2018095167A1 (en) Voiceprint identification method and voiceprint identification system
KR101250668B1 (en) Method for recogning emergency speech using gmm
Chee et al. Automatic detection of prolongations and repetitions using LPCC
Hassan et al. Pattern classification in recognizing Qalqalah Kubra pronuncation using multilayer perceptrons
Couvreur et al. Automatic noise recognition in urban environments based on artificial neural networks and hidden markov models
Maganti et al. Unsupervised speech/non-speech detection for automatic speech recognition in meeting rooms
Kitaoka et al. Development of VAD evaluation framework CENSREC-1-C and investigation of relationship between VAD and speech recognition performance

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130724

Termination date: 20181207
