CN102074236A - Speaker clustering method for distributed microphone - Google Patents


Info

Publication number
CN102074236A
CN102074236A (application CN2010105683868A; granted as CN102074236B)
Authority
CN
China
Prior art keywords
tau, frame, point, sigma, time delay
Prior art date
Legal status
Granted
Application number
CN2010105683868A
Other languages
Chinese (zh)
Other versions
CN102074236B (en)
Inventor
杨毅 (Yang Yi)
刘加 (Liu Jia)
Current Assignee
Beijing Huacong Zhijia Technology Co Ltd
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN2010105683868A
Publication of CN102074236A
Application granted; publication of CN102074236B
Legal status: Active

Landscapes

  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention relates to a speaker clustering method for distributed microphones, comprising the following steps: first, preprocessing the signals collected by the distributed microphones; then applying a time-delay estimation method to the sound-source signal segments to obtain the corresponding time-delay estimate vectors; next, eliminating erroneous data and performing speaker segmentation; and finally performing speaker clustering according to the segmentation result. The distributed microphones serve as the signal acquisition and output devices for computing the time-delay vectors of the speech signal segments; time-delay estimation accuracy is improved by eliminating erroneous data, and a clustering algorithm applied to the time-delay vectors classifies the speech segments by speaker identity. The equipment is inexpensive and convenient to use, and the method can be applied to multi-person, multi-party dialogue scenes in complex acoustic environments.

Description

A speaker clustering method for distributed microphones
Technical field
The invention belongs to the field of speech technology, and in particular relates to a speaker clustering method for distributed microphones.
Background technology
With the continuous development of network and communication technology, existing multimedia, networking, and distributed-processing technologies make it possible to hold multi-person, multi-party dialogues in complex acoustic environments. Traditional sound-acquisition and recording devices include head-mounted microphones, omnidirectional and directional single microphones, and microphone arrays. A single microphone has the advantages of small size and low cost, but it can neither suppress ambient noise nor localize sound sources. A microphone array consists of multiple microphones placed at specific geometric positions and performs joint time-space processing of the spatial signal; its capabilities include sound-source identification and separation, sound-source localization under reverberant conditions, and speech-signal enhancement.
A distributed microphone system is a sound-acquisition system composed of multiple single microphones, each controlled by a different device, with no restriction on microphone placement or spacing; the signals collected by the microphones are not fully synchronized in the time domain. Distributed microphones are simple in structure, easy to use, and cost-saving; they meet the requirements of multi-source, multi-direction complex dialogue scenes and can effectively accomplish applications such as speaker clustering, identification, and localization. Unlike a microphone-array system, a distributed microphone system places no constraint or restriction on the positions or arrangement of the microphones; moreover, the positions of both the sound sources and the microphones are unknown.
Automatic classification of acoustic information is one of the research topics of speech signal processing, of which speaker segmentation and speaker clustering are important components. The usual approach is: speaker segmentation divides the whole test utterance into a series of speech segments, each belonging to only one particular speaker; speaker clustering then groups the scattered segments belonging to the same speaker into one class.
Traditional speaker segmentation methods are mostly based on sliding-window statistics over Gaussian models, select among different distance measures, and obtain change points by merging based on the Bayesian information criterion. Speaker clustering can adopt an evolutive hidden Markov model (EHMM) method, updating the segmentation result by weighing path scores. When the number of speakers is not limited, hierarchical clustering can be used for speaker clustering.
Speaker clustering with microphone arrays mainly exploits differences in speaker spatial position. The basic principle is: use the time-delay estimate vector as the speaker's spatial feature, and integrate and classify these features in a GMM/HMM (Gaussian mixture model / hidden Markov model) framework. Time-delay estimation algorithms for microphone arrays mainly include the GCC (generalized cross-correlation) method and the LMS (least mean square) method. GCC is strongly affected by reverberation, which led to improved GCC variants such as the CEP (cepstral pre-filtering) method and pitch-weighted GCC; EVD (eigenvalue decomposition) and ATF (acoustic transfer function)-based delay estimation methods solve the problem using subspace and transfer-function techniques, respectively. However, microphone-array computation is sensitive to sampling errors between devices and therefore imposes strict requirements on audio-data synchronization. Moreover, in a typical multi-person, multi-party conference scene the number of sound sources, the microphone positions, and the room acoustics are all unknown; that is, the audio data must be processed in a scene lacking both temporal and spatial prior information.
As a traditional sound-acquisition and recording device, the single microphone is cheap and simple in structure, but it is susceptible to environmental interference and cannot localize sound sources; the conventional microphone-array system has been widely studied, but has not been commercialized mainly because the dedicated hardware is expensive and the algorithms are complex.
Summary of the invention
To overcome the shortcomings of the above prior art, the object of the present invention is to propose a speaker clustering method for distributed microphones, which uses distributed microphones as the signal acquisition and output devices, computes the time-delay vectors of speech signal segments, improves time-delay estimation accuracy by eliminating erroneous data, and applies a clustering algorithm to the time-delay vectors to classify the speech segments by speaker identity. The equipment is inexpensive and convenient to use, and the method can be applied to multi-person, multi-party dialogue scenes in complex acoustic environments.
A speaker clustering method for distributed microphones comprises the following steps:
Step 1: preprocess the signals collected by the distributed microphones
First, the multichannel sound-source signals obtained by the distributed microphones are preprocessed: the signals are divided into frames and transformed by the fast Fourier transform (FFT), and then endpoint detection is performed to divide the signals into sound-source and non-sound-source classes. The purpose of endpoint detection is to distinguish speech from non-speech in the digital audio signal. The endpoint detection can adopt a sub-band spectral entropy algorithm: first divide the spectrum of each speech frame into n sub-bands (n is an integer greater than zero) and compute the spectral entropy of each sub-band; then pass the sub-band spectral entropies of n successive frames through a group of order-statistics filters to obtain the spectral entropy of each frame, and classify the input speech according to the value of the spectral entropy. The concrete steps are: apply the FFT to each speech frame to obtain the N_FFT points Y_i (0 ≤ i ≤ N_FFT) of its power spectrum; the probability density of each point in the spectral domain is given by formula (1):

p_i = Y_i / \sum_{k=0}^{N_{FFT}-1} Y_k    (1)

where Y_k is the k-th point of the FFT power spectrum of the speech signal, Y_i is the i-th point, N_FFT is the number of spectral points, and p_i is the probability density of the i-th point in the spectral domain.
The entropy function of the corresponding signal in the spectral domain is defined by formula (2):

H = -\sum_{k=0}^{N_{FFT}-1} p_k \log(p_k)    (2)

where p_k is the probability density of the k-th point in the spectral domain, N_FFT is the number of spectral points, and H is the entropy function in the spectral domain.
Divide the N_FFT points of the frequency domain into K non-overlapping frequency ranges, called sub-bands, and compute the probability of each point in the spectral domain of frame l as in formula (3):

p_l[k, i] = (Y_i + Q) / \sum_{j=m_k}^{m_{k+1}-1} (Y_j + Q)    (3)

where Y_j is the j-th point of the FFT power spectrum of the speech signal, Y_i is a point in the k-th sub-band, m_k (0 ≤ k ≤ K-1, m_k ≤ i ≤ m_{k+1}-1) is the lower limit of the sub-band, Q is a constant, and p_l[k, i] is the probability of each point in the spectral domain of frame l.
According to the definition of information entropy, the spectral entropy of the k-th sub-band of frame l is given by formula (4):

E_s[l, k] = \sum_{i=m_k}^{m_{k+1}-1} p_l[k, i] \log(p_l[k, i])    (0 ≤ k ≤ K-1)    (4)

where p_l[k, i] is the probability of each point in the spectral domain of frame l, and E_s[l, k] is the spectral entropy of the k-th sub-band of frame l.
The spectral information entropy of frame l is computed according to formula (5):

H_l = -\frac{1}{K} \sum_{k=0}^{K-1} E_h[l, k]    (5)

where E_h[l, k] is the smoothed spectral entropy of the k-th sub-band of frame l, K is the number of sub-bands, and H_l is the spectral information entropy of frame l. The information entropy of the k-th sub-band of frame l after filtering and smoothing is defined as in formula (6):

E_h[l, k] = (1 - λ) E_{s(h)}[l, k] + λ E_{s(h+1)}[l, k]    (0 ≤ k ≤ K-1)    (6)

where E_{s(h)}[l, k] is obtained as follows: the order-statistics filter of each sub-band acts on a group of sub-band information entropies of length L, E_s[l-N, k], …, E_s[l, k], …, E_s[l+N, k]; this group is sorted in ascending order, and E_{s(h)}[l, k] is the h-th largest value among E_s[l-N, k], …, E_s[l+N, k]; λ is a constant, and E_h[l, k] is the information entropy of the k-th sub-band of frame l after the filtering and smoothing.
By formula (5), each frame has a spectral entropy H_l; when the value of H_l is greater than a preset threshold T, frame l is judged to be a speech frame, otherwise a non-speech frame. The threshold is defined as T = β·Avg + θ, where β = 0.01, θ = 0.1, E_m[k] is the median of E_s[0, k], …, E_s[N-1, k], and Avg is the noise estimate over the first N frames of the input signal.
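As an illustration, a minimal Python sketch of this sub-band spectral-entropy endpoint detector follows. The FFT size, sub-band count K, regularizer Q, smoothing constant λ, threshold parameters, and the simplified two-frame smoother standing in for the order-statistics filter are all assumptions of the sketch, not values fixed by the patent.

import numpy as np

def subband_spectral_entropy(frame, n_fft=512, K=8, Q=1e-3):
    # Sub-band spectral entropies E_s[l, k] of one frame (formulas (3)-(4)).
    Y = np.abs(np.fft.rfft(frame, n_fft)) ** 2          # power spectrum points Y_i
    edges = np.linspace(0, len(Y), K + 1, dtype=int)    # sub-band limits m_k
    ent = np.empty(K)
    for k in range(K):
        band = Y[edges[k]:edges[k + 1]] + Q             # constant Q regularizes silence
        p = band / band.sum()                           # formula (3)
        ent[k] = np.sum(p * np.log(p))                  # formula (4)
    return ent

def detect_speech(frames, lam=0.3, beta=0.01, theta=0.1, n_noise=10):
    # Speech/non-speech decision per frame via smoothed entropy H_l (formulas (5)-(6)).
    E_s = np.array([subband_spectral_entropy(f) for f in frames])
    E_h = (1 - lam) * E_s[:-1] + lam * E_s[1:]          # two-term smoother standing in
                                                        # for the order-statistics filter
    H = -E_h.mean(axis=1)                               # formula (5)
    avg = H[:n_noise].mean()                            # noise estimate from first frames
    T = beta * avg + theta                              # threshold T = beta*Avg + theta
    return H > T                                        # True = speech frame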
Step 2: apply the time-delay estimation method to the sound-source signal segments to obtain the corresponding time-delay estimate vectors
First determine the spatial coordinates. The concrete method is: number the microphones in order as M1, M2, …, Mn (n is an integer greater than 1); select the two microphones initially numbered 1 and 2, M1 and M2; take the position of M1 as the coordinate origin and the direction from M1 to M2 as the coordinate axis starting at the origin. Then treat every 50 frames of speech signal as one speech segment, and use the time-delay estimation method to estimate the delay difference between every pair of microphones for each segment, obtaining n(n-1) time-delay estimates, as in formula (7):

\tau_k = [\hat{\tau}_{12}, \hat{\tau}_{13}, \ldots, \hat{\tau}_{ij}]^T    (7)

where \hat{\tau}_{ij} is the estimated delay difference between the i-th and j-th microphones, and \tau_k is the delay-difference estimate vector.
The time-delay estimation adopts the PHAT (phase transform) weighting algorithm; its weighting coefficient is given by formula (8), and the delay estimation by formulas (9)-(10):

W(\omega) = \frac{1}{|X_1(\omega) X_2^*(\omega)|}    (8)

where X_1(ω) and X_2(ω) are the FFT outputs of the two time-domain signals, and * denotes complex conjugation,

R_{x_1 x_2}(n) = \mathrm{IFFT}\big(W(\omega) \cdot X_1(\omega) \cdot X_2^*(\omega)\big)    (9)

\hat{\tau} = \arg\max_n R_{x_1 x_2}(n)    (10)

where R_{x_1 x_2}(n) is the generalized cross-correlation function of the two signals, and \hat{\tau} is the estimated time delay between x_1 and x_2.
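A minimal GCC-PHAT sketch in Python is given below; the sampling rate fs, the small regularizer added to the denominator, and the optional delay bound max_tau are assumptions of the sketch.

import numpy as np

def gcc_phat(x1, x2, fs, max_tau=None):
    # Delay of x2 relative to x1 with PHAT weighting (formulas (8)-(10)).
    n = len(x1) + len(x2)                      # zero-pad to avoid circular wrap-around
    X1 = np.fft.rfft(x1, n)
    X2 = np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)                   # X1(w) * X2*(w)
    cross /= np.abs(cross) + 1e-12             # PHAT weight W(w), formula (8)
    r = np.fft.irfft(cross, n)                 # GCC function, formula (9)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    r = np.concatenate((r[-max_shift:], r[:max_shift + 1]))
    shift = np.argmax(np.abs(r)) - max_shift   # argmax over lags, formula (10)
    return shift / fs                          # delay in seconds

For one segment of 50 frames, such a routine would be applied to every microphone pair (i, j) to fill one delay vector \tau_k as in formula (7).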
Step 3: eliminate erroneous data and perform speaker segmentation
First, invalid data must be removed; the delay is computed by formula (11):

\tau[n] = \begin{cases} \hat{\tau}[n-1], & \mathrm{SNR} < Thr_{SNR} \\ \hat{\tau}[n], & \mathrm{SNR} \ge Thr_{SNR} \end{cases}    (11)

where n is the index of a frame, τ[n] is the delay assigned to that frame, and \hat{\tau}[n] is the delay estimated for that frame. When the signal-to-noise ratio at some moment is below the threshold Thr_SNR, the previous moment's delay estimate is used as the current estimate. The delay is then further computed by formula (12):

\tau[n] = \begin{cases} \hat{\tau}[n-1], & \hat{\tau}[n] < Thr \\ \hat{\tau}[n], & \hat{\tau}[n] \ge Thr \end{cases}    (12)

where n is the index of a frame, τ[n] is the delay assigned to that frame, and \hat{\tau}[n] is the delay estimated for that frame; when the delay estimate at some moment is below the threshold Thr, the previous moment's estimate is used as the current estimate.
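A small sketch of this rejection rule follows, applied per microphone pair and assuming per-frame SNR values are available; thr_snr and thr_tau are assumed threshold values.

import numpy as np

def clean_delays(tau_hat, snr, thr_snr=5.0, thr_tau=0.0):
    # Formulas (11)-(12): fall back to the previous frame's delay whenever the
    # current frame's SNR or delay estimate falls below its threshold.
    tau = np.array(tau_hat, dtype=float)
    for n in range(1, len(tau)):
        if snr[n] < thr_snr or tau[n] < thr_tau:
            tau[n] = tau[n - 1]                # reuse the last reliable estimate
    return tau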
Then the speakers at different spatial positions are segmented. First compute the posterior probability β_i(τ_k) as in formula (13):

\beta_i(\tau_k) = \frac{\alpha_i\, g(\tau_k; \mu_i, \sigma_i^2)}{\alpha_1\, g(\tau_k; \mu_1, \sigma_1^2) + \alpha_2\, g(\tau_k; \mu_2, \sigma_2^2) + \cdots + \alpha_i\, g(\tau_k; \mu_i, \sigma_i^2)}    (13)

where μ_i and σ_i² are the model parameters, α_i = 1/i with i the number of GMM components, and the initial values of μ_i and σ_i² are computed with the K-means algorithm; τ_k is the time-delay estimate vector computed by formula (7), and β_i(τ_k) is the posterior probability.
Formula (14) is the parameter update algorithm:

\hat{\mu}_i = \frac{\sum_{k=1}^{n} \beta_i(\tau_k)\, \tau_k}{\sum_{k=1}^{n} \beta_i(\tau_k)}, \quad \hat{\sigma}_i^2 = \frac{1}{d} \cdot \frac{\sum_{k=1}^{n} \beta_i(\tau_k)\, (\tau_k - \mu_i)^T (\tau_k - \mu_i)}{\sum_{k=1}^{n} \beta_i(\tau_k)}, \quad \hat{\alpha}_i = \frac{1}{n} \sum_{k=1}^{n} \beta_i(\tau_k)    (14)

where \hat{\mu}_i, \hat{\sigma}_i^2, and \hat{\alpha}_i are the estimates of the GMM model parameters, and β_i(τ_k) is the posterior probability computed by formula (13). The parameter updates stop when the change in the parameters is smaller than min, where min is a constant representing the minimum tolerance.
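For illustration, a compact EM sketch of formulas (13)-(14) over the delay vectors follows; the random initialization (in place of the K-means initialization the patent specifies), the component count, and the tolerance tol standing in for min are assumptions of the sketch.

import numpy as np

def gmm_em(taus, n_components, tol=1e-4, max_iter=100, seed=0):
    # taus: (n, d) array of delay vectors tau_k; spherical Gaussian components.
    n, d = taus.shape
    rng = np.random.default_rng(seed)
    mu = taus[rng.choice(n, n_components, replace=False)].copy()
    var = np.full(n_components, taus.var() + 1e-6)
    alpha = np.full(n_components, 1.0 / n_components)        # alpha_i = 1/i
    for _ in range(max_iter):
        # E-step: posteriors beta_i(tau_k), formula (13)
        d2 = ((taus[:, None, :] - mu[None]) ** 2).sum(axis=-1)   # squared distances
        g = np.exp(-0.5 * d2 / var) / (2 * np.pi * var) ** (d / 2)
        beta = alpha * g
        beta /= beta.sum(axis=1, keepdims=True)
        # M-step: formula (14)
        w = beta.sum(axis=0)
        mu_new = (beta.T @ taus) / w[:, None]
        var = (beta * d2).sum(axis=0) / (d * w)              # sigma_i^2 update
        alpha = w / n
        done = np.abs(mu_new - mu).max() < tol               # stop when change < "min"
        mu = mu_new
        if done:
            break
    return mu, var, alpha, beta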
Step 4: perform speaker clustering according to the speaker segmentation result
A K-means-based algorithm is used to cluster the segmented speech segments. First compute the territory density of each set; the point of maximum density is taken as the first initial point, the next initial point is the point at maximum distance from the first initial point, and so on, until the required number of initial points is reached.
Next compute the distance of each sample point to the set centers and update the center values, selecting the sample point satisfying formula (15) as the new set center:

Func = \sum_{j=1}^{J} \sum_{n=1}^{M} \| \hat{\tau}[n] - \tau_j \|^2    (15)

where \| \hat{\tau}[n] - \tau_j \| is the distance between the time-delay estimate vector \hat{\tau}[n] of each speech segment and the cluster center τ_j, τ_j is the center vector, J is the number of speakers, and M is the number of microphones.
Finally, the speech segments of the speakers in different spatial positions are classified and labeled according to the distance between the set center vectors and the segment vectors.
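A minimal sketch of this density-seeded K-means in Python is shown below; the density radius r, the farthest-point rule used for the later seeds, and the plain Lloyd updates that decrease Func of formula (15) are assumptions of the sketch.

import numpy as np

def density_kmeans(taus, n_speakers, r=1.0, max_iter=50):
    # taus: (n, d) delay vectors of the segmented speech pieces.
    dist = np.linalg.norm(taus[:, None] - taus[None], axis=-1)
    density = (dist < r).sum(axis=1)              # "territory density" of each point
    centers = [taus[np.argmax(density)]]          # densest point seeds the first center
    while len(centers) < n_speakers:              # remaining seeds: farthest points
        far = np.min([np.linalg.norm(taus - c, axis=1) for c in centers], axis=0)
        centers.append(taus[np.argmax(far)])
    centers = np.array(centers)
    for _ in range(max_iter):                     # Lloyd updates decrease Func (15)
        labels = np.argmin(
            np.linalg.norm(taus[:, None] - centers[None], axis=-1), axis=1)
        for j in range(n_speakers):
            if np.any(labels == j):
                centers[j] = taus[labels == j].mean(axis=0)
    return labels, centers

Each segment finally receives the label of its nearest center, which classifies the segments by the speaker's spatial position.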
The present invention has the following advantages:
(1) The distributed asynchronous acoustic sensors proposed by the invention place no strict restriction on sensor positions and impose low requirements on signal synchronization, so the method is more flexible and widely applicable than microphone arrays.
(2) The invention makes full use of the multiple delay differences between microphones, and between sound sources and microphones, for information fusion, performing speaker segmentation from the time-delay estimate vectors; this reduces the complexity of traditional speaker segmentation algorithms while increasing robustness.
(3) The invention makes full use of the spatial-domain advantages of distributed microphones and clusters the time-delay estimate vectors of single-speaker speech segments, reducing the complexity of traditional speaker clustering algorithms.
(4) The speaker clustering method for distributed microphones can be applied to many multi-person, multi-party dialogue scenes, is robust, and adapts to a variety of acoustic environments.
The present invention can be implemented on current palmtop computers, personal digital assistants (PDAs), or mobile phones, so its range of application is very wide.
Description of drawings
Fig. 1 is a flow diagram of the present invention.
Fig. 2 is a flow diagram of the endpoint detection of the present invention.
Fig. 3 is a schematic diagram of the sound-source time-delay estimation of the present invention.
Fig. 4 is a flow diagram of the speaker segmentation and clustering of the present invention.
Embodiment
The present invention is described in detail below in conjunction with the accompanying drawings.
With reference to Fig. 1, a speaker clustering method for distributed microphones comprises the following steps:
Step 1: preprocess the signals collected by the distributed microphones
With reference to Fig. 2, the multichannel sound-source signals obtained by the distributed microphones are first preprocessed: the signals are divided into frames and transformed by the fast Fourier transform (FFT), and then endpoint detection is performed to divide the signals into sound-source and non-sound-source classes. The purpose of endpoint detection is to distinguish speech from non-speech in the digital audio signal. Early methods based on energy and zero-crossing rate can distinguish speech from noise accurately, but real speech is usually polluted by considerable ambient noise, so the endpoint detection here can adopt a sub-band spectral entropy algorithm: first divide the spectrum of each speech frame into n sub-bands (n is an integer greater than zero) and compute the spectral entropy of each sub-band; then pass the sub-band spectral entropies of n successive frames through a group of order-statistics filters to obtain the spectral entropy of each frame, and classify the input speech according to the value of the spectral entropy. The concrete steps are: apply the FFT to each speech frame to obtain the N_FFT points Y_i (0 ≤ i ≤ N_FFT) of its power spectrum; the probability density of each point in the spectral domain is given by formula (1):

p_i = Y_i / \sum_{k=0}^{N_{FFT}-1} Y_k    (1)

where Y_k is the k-th point of the FFT power spectrum of the speech signal, Y_i is the i-th point, N_FFT is the number of spectral points, and p_i is the probability density of the i-th point in the spectral domain.
The entropy function of the corresponding signal in the spectral domain is defined by formula (2):

H = -\sum_{k=0}^{N_{FFT}-1} p_k \log(p_k)    (2)

where p_k is the probability density of the k-th point in the spectral domain, N_FFT is the number of spectral points, and H is the entropy function in the spectral domain.
Divide the N_FFT points of the frequency domain into K non-overlapping frequency ranges, called sub-bands, and compute the probability of each point in the spectral domain of frame l as in formula (3):

p_l[k, i] = (Y_i + Q) / \sum_{j=m_k}^{m_{k+1}-1} (Y_j + Q)    (3)

where Y_j is the j-th point of the FFT power spectrum of the speech signal, Y_i is a point in the k-th sub-band, m_k (0 ≤ k ≤ K-1, m_k ≤ i ≤ m_{k+1}-1) is the lower limit of the sub-band, Q is a constant, and p_l[k, i] is the probability of each point in the spectral domain of frame l.
According to the definition of information entropy, the spectral entropy of the k-th sub-band of frame l is given by formula (4):

E_s[l, k] = \sum_{i=m_k}^{m_{k+1}-1} p_l[k, i] \log(p_l[k, i])    (0 ≤ k ≤ K-1)    (4)

where p_l[k, i] is the probability of each point in the spectral domain of frame l, and E_s[l, k] is the spectral entropy of the k-th sub-band of frame l.
The spectral information entropy of frame l is computed according to formula (5):

H_l = -\frac{1}{K} \sum_{k=0}^{K-1} E_h[l, k]    (5)

where E_h[l, k] is the smoothed spectral entropy of the k-th sub-band of frame l, K is the number of sub-bands, and H_l is the spectral information entropy of frame l. The information entropy of the k-th sub-band of frame l after filtering and smoothing is defined as in formula (6):

E_h[l, k] = (1 - λ) E_{s(h)}[l, k] + λ E_{s(h+1)}[l, k]    (0 ≤ k ≤ K-1)    (6)

where E_{s(h)}[l, k] is obtained as follows: the order-statistics filter of each sub-band acts on a group of sub-band information entropies of length L, E_s[l-N, k], …, E_s[l, k], …, E_s[l+N, k]; this group is sorted in ascending order, and E_{s(h)}[l, k] is the h-th largest value among E_s[l-N, k], …, E_s[l+N, k]; λ is a constant, and E_h[l, k] is the information entropy of the k-th sub-band of frame l after the filtering and smoothing.
By formula (5), each frame has a spectral entropy H_l; when the value of H_l is greater than a preset threshold T, frame l is judged to be a speech frame, otherwise a non-speech frame. The threshold is defined as T = β·Avg + θ, where β = 0.01, θ = 0.1, E_m[k] is the median of E_s[0, k], …, E_s[N-1, k], and Avg is the noise estimate over the first N frames of the input signal.
Step 2: apply the time-delay estimation method to the sound-source signal segments to obtain the corresponding time-delay estimate vectors
With reference to Fig. 3, first determine the spatial coordinates. The concrete method is: number the microphones in order as M1, M2, …, Mn (n is an integer greater than 1); select the two microphones initially numbered 1 and 2, M1 and M2; take the position of M1 as the coordinate origin and the direction from M1 to M2 as the coordinate axis starting at the origin. Then treat every 50 frames of speech signal as one speech segment, and use the time-delay estimation method to estimate the delay difference between every pair of microphones for each segment, obtaining n(n-1) time-delay estimates, as in formula (7):

\tau_k = [\hat{\tau}_{12}, \hat{\tau}_{13}, \ldots, \hat{\tau}_{ij}]^T    (7)

where \hat{\tau}_{ij} is the estimated delay difference between the i-th and j-th microphones, and \tau_k is the delay-difference estimate vector.
The time-delay estimation adopts the PHAT (phase transform) weighting algorithm; its weighting coefficient is given by formula (8), and the delay estimation by formulas (9)-(10):

W(\omega) = \frac{1}{|X_1(\omega) X_2^*(\omega)|}    (8)

where X_1(ω) and X_2(ω) are the FFT outputs of the two time-domain signals, and * denotes complex conjugation,

R_{x_1 x_2}(n) = \mathrm{IFFT}\big(W(\omega) \cdot X_1(\omega) \cdot X_2^*(\omega)\big)    (9)

where IFFT is the inverse FFT and R_{x_1 x_2}(n) is the generalized cross-correlation function of the two signals,

\hat{\tau} = \arg\max_n R_{x_1 x_2}(n)    (10)

where \hat{\tau} is the estimated time delay between x_1 and x_2.
Step 3: eliminate erroneous data and perform speaker segmentation
With reference to Fig. 4, invalid data must first be removed; the delay is computed by formula (11):

\tau[n] = \begin{cases} \hat{\tau}[n-1], & \mathrm{SNR} < Thr_{SNR} \\ \hat{\tau}[n], & \mathrm{SNR} \ge Thr_{SNR} \end{cases}    (11)

where n is the index of a frame, τ[n] is the delay assigned to that frame, and \hat{\tau}[n] is the delay estimated for that frame. When the signal-to-noise ratio at some moment is below the threshold Thr_SNR, the previous moment's delay estimate is used as the current estimate. The delay is then further computed by formula (12):

\tau[n] = \begin{cases} \hat{\tau}[n-1], & \hat{\tau}[n] < Thr \\ \hat{\tau}[n], & \hat{\tau}[n] \ge Thr \end{cases}    (12)

where n is the index of a frame, τ[n] is the delay assigned to that frame, and \hat{\tau}[n] is the delay estimated for that frame; when the delay estimate at some moment is below the threshold Thr, the previous moment's estimate is used as the current estimate.
Then the speakers at different spatial positions are segmented. First compute the posterior probability β_i(τ_k) as in formula (13):

\beta_i(\tau_k) = \frac{\alpha_i\, g(\tau_k; \mu_i, \sigma_i^2)}{\alpha_1\, g(\tau_k; \mu_1, \sigma_1^2) + \alpha_2\, g(\tau_k; \mu_2, \sigma_2^2) + \cdots + \alpha_i\, g(\tau_k; \mu_i, \sigma_i^2)}    (13)

where μ_i and σ_i² are the model parameters, α_i = 1/i with i the number of GMM components, and the initial values of μ_i and σ_i² are computed with the K-means algorithm; τ_k is the time-delay estimate vector computed by formula (7), and β_i(τ_k) is the posterior probability.
Formula (14) is the parameter update algorithm:

\hat{\mu}_i = \frac{\sum_{k=1}^{n} \beta_i(\tau_k)\, \tau_k}{\sum_{k=1}^{n} \beta_i(\tau_k)}, \quad \hat{\sigma}_i^2 = \frac{1}{d} \cdot \frac{\sum_{k=1}^{n} \beta_i(\tau_k)\, (\tau_k - \mu_i)^T (\tau_k - \mu_i)}{\sum_{k=1}^{n} \beta_i(\tau_k)}, \quad \hat{\alpha}_i = \frac{1}{n} \sum_{k=1}^{n} \beta_i(\tau_k)    (14)

where \hat{\mu}_i, \hat{\sigma}_i^2, and \hat{\alpha}_i are the estimates of the GMM model parameters, and β_i(τ_k) is the posterior probability computed by formula (13). The parameter updates stop when the change in the parameters is smaller than min, where min is a constant representing the minimum tolerance.
Step 4: perform speaker clustering according to the speaker segmentation result
A K-means-based algorithm is used to cluster the segmented speech segments; this algorithm overcomes the defect that the performance of standard K-means is strongly affected by initial values and outliers.
First compute the territory density of each set; the point of maximum density is taken as the first initial point, the next initial point is the point at maximum distance from the first initial point, and so on, until the required number of initial points is reached.
Next compute the distance of each sample point to the set centers and update the center values, selecting the sample point satisfying formula (15) as the new set center:

Func = \sum_{j=1}^{J} \sum_{n=1}^{M} \| \hat{\tau}[n] - \tau_j \|^2    (15)

where \| \hat{\tau}[n] - \tau_j \| is the distance between the time-delay estimate vector \hat{\tau}[n] of each speech segment and the cluster center τ_j, τ_j is the center vector, J is the number of speakers, and M is the number of microphones.
Finally, the speech segments of the speakers in different spatial positions are classified and labeled according to the distance between the set center vectors and the segment vectors.
In the accompanying drawings, the position vectors of one sound source and of another sound source are shown, together with the position vectors of the single microphones M_i, M_k, and M_j.

Claims (1)

1. A speaker clustering method for distributed microphones, characterized in that it comprises the following steps:
Step 1: preprocess the signals collected by the distributed microphones
First, the multichannel sound-source signals obtained by the distributed microphones are preprocessed: the signals are divided into frames and transformed by the fast Fourier transform (FFT), and then endpoint detection is performed to divide the signals into sound-source and non-sound-source classes. The purpose of endpoint detection is to distinguish speech from non-speech in the digital audio signal. The endpoint detection can adopt a sub-band spectral entropy algorithm: first divide the spectrum of each speech frame into n sub-bands (n is an integer greater than zero) and compute the spectral entropy of each sub-band; then pass the sub-band spectral entropies of n successive frames through a group of order-statistics filters to obtain the spectral entropy of each frame, and classify the input speech according to the value of the spectral entropy. The concrete steps are: apply the FFT to each speech frame to obtain the N_FFT points Y_i (0 ≤ i ≤ N_FFT) of its power spectrum; the probability density of each point in the spectral domain is given by formula (1):

p_i = Y_i / \sum_{k=0}^{N_{FFT}-1} Y_k    (1)

where Y_k is the k-th point of the FFT power spectrum of the speech signal, Y_i is the i-th point, N_FFT is the number of spectral points, and p_i is the probability density of the i-th point in the spectral domain.
The entropy function of the corresponding signal in the spectral domain is defined by formula (2):

H = -\sum_{k=0}^{N_{FFT}-1} p_k \log(p_k)    (2)

where p_k is the probability density of the k-th point in the spectral domain, N_FFT is the number of spectral points, and H is the entropy function in the spectral domain.
Divide the N_FFT points of the frequency domain into K non-overlapping frequency ranges, called sub-bands, and compute the probability of each point in the spectral domain of frame l as in formula (3):

p_l[k, i] = (Y_i + Q) / \sum_{j=m_k}^{m_{k+1}-1} (Y_j + Q)    (3)

where Y_j is the j-th point of the FFT power spectrum of the speech signal, Y_i is a point in the k-th sub-band, m_k (0 ≤ k ≤ K-1, m_k ≤ i ≤ m_{k+1}-1) is the lower limit of the sub-band, Q is a constant, and p_l[k, i] is the probability of each point in the spectral domain of frame l.
According to the definition of information entropy, the spectral entropy of the k-th sub-band of frame l is given by formula (4):

E_s[l, k] = \sum_{i=m_k}^{m_{k+1}-1} p_l[k, i] \log(p_l[k, i])    (0 ≤ k ≤ K-1)    (4)

where p_l[k, i] is the probability of each point in the spectral domain of frame l, and E_s[l, k] is the spectral entropy of the k-th sub-band of frame l.
The spectral information entropy of frame l is computed according to formula (5):

H_l = -\frac{1}{K} \sum_{k=0}^{K-1} E_h[l, k]    (5)

where E_h[l, k] is the smoothed spectral entropy of the k-th sub-band of frame l, K is the number of sub-bands, and H_l is the spectral information entropy of frame l. The information entropy of the k-th sub-band of frame l after filtering and smoothing is defined as in formula (6):

E_h[l, k] = (1 - λ) E_{s(h)}[l, k] + λ E_{s(h+1)}[l, k]    (0 ≤ k ≤ K-1)    (6)

where E_{s(h)}[l, k] is obtained as follows: the order-statistics filter of each sub-band acts on a group of sub-band information entropies of length L, E_s[l-N, k], …, E_s[l, k], …, E_s[l+N, k]; this group is sorted in ascending order, and E_{s(h)}[l, k] is the h-th largest value among E_s[l-N, k], …, E_s[l+N, k]; λ is a constant, and E_h[l, k] is the information entropy of the k-th sub-band of frame l after the filtering and smoothing.
By formula (5), each frame has a spectral entropy H_l; when the value of H_l is greater than a preset threshold T, frame l is judged to be a speech frame, otherwise a non-speech frame. The threshold is defined as T = β·Avg + θ, where β = 0.01, θ = 0.1, E_m[k] is the median of E_s[0, k], …, E_s[N-1, k], and Avg is the noise estimate over the first N frames of the input signal.
Step 2: apply the time-delay estimation method to the sound-source signal segments to obtain the corresponding time-delay estimate vectors
First determine the spatial coordinates. The concrete method is: number the microphones in order as M1, M2, …, Mn (n is an integer greater than 1); select the two microphones initially numbered 1 and 2, M1 and M2; take the position of M1 as the coordinate origin and the direction from M1 to M2 as the coordinate axis starting at the origin. Then treat every 50 frames of speech signal as one speech segment, and use the time-delay estimation method to estimate the delay difference between every pair of microphones for each segment, obtaining n(n-1) time-delay estimates, as in formula (7):

\tau_k = [\hat{\tau}_{12}, \hat{\tau}_{13}, \ldots, \hat{\tau}_{ij}]^T    (7)

where \hat{\tau}_{ij} is the estimated delay difference between the i-th and j-th microphones, and \tau_k is the delay-difference estimate vector.
The time-delay estimation adopts the PHAT (phase transform) weighting algorithm; its weighting coefficient is given by formula (8), and the delay estimation by formulas (9)-(10):

W(\omega) = \frac{1}{|X_1(\omega) X_2^*(\omega)|}    (8)

where X_1(ω) and X_2(ω) are the FFT outputs of the two time-domain signals, and * denotes complex conjugation,

R_{x_1 x_2}(n) = \mathrm{IFFT}\big(W(\omega) \cdot X_1(\omega) \cdot X_2^*(\omega)\big)    (9)

\hat{\tau} = \arg\max_n R_{x_1 x_2}(n)    (10)

where R_{x_1 x_2}(n) is the generalized cross-correlation function of the two signals, and \hat{\tau} is the estimated time delay between x_1 and x_2.
Step 3: eliminate erroneous data and perform speaker segmentation
First, invalid data must be removed; the delay is computed by formula (11):

\tau[n] = \begin{cases} \hat{\tau}[n-1], & \mathrm{SNR} < Thr_{SNR} \\ \hat{\tau}[n], & \mathrm{SNR} \ge Thr_{SNR} \end{cases}    (11)

where n is the index of a frame, τ[n] is the delay assigned to that frame, and \hat{\tau}[n] is the delay estimated for that frame. When the signal-to-noise ratio at some moment is below the threshold Thr_SNR, the previous moment's delay estimate is used as the current estimate. The delay is then further computed by formula (12):

\tau[n] = \begin{cases} \hat{\tau}[n-1], & \hat{\tau}[n] < Thr \\ \hat{\tau}[n], & \hat{\tau}[n] \ge Thr \end{cases}    (12)

where n is the index of a frame, τ[n] is the delay assigned to that frame, and \hat{\tau}[n] is the delay estimated for that frame; when the delay estimate at some moment is below the threshold Thr, the previous moment's estimate is used as the current estimate.
Then the speakers at different spatial positions are segmented. First compute the posterior probability β_i(τ_k) as in formula (13):

\beta_i(\tau_k) = \frac{\alpha_i\, g(\tau_k; \mu_i, \sigma_i^2)}{\alpha_1\, g(\tau_k; \mu_1, \sigma_1^2) + \alpha_2\, g(\tau_k; \mu_2, \sigma_2^2) + \cdots + \alpha_i\, g(\tau_k; \mu_i, \sigma_i^2)}    (13)

where μ_i and σ_i² are the model parameters, α_i = 1/i with i the number of GMM components, and the initial values of μ_i and σ_i² are computed with the K-means algorithm; τ_k is the time-delay estimate vector computed by formula (7), and β_i(τ_k) is the posterior probability.
Formula (14) is the parameter update algorithm:

\hat{\mu}_i = \frac{\sum_{k=1}^{n} \beta_i(\tau_k)\, \tau_k}{\sum_{k=1}^{n} \beta_i(\tau_k)}, \quad \hat{\sigma}_i^2 = \frac{1}{d} \cdot \frac{\sum_{k=1}^{n} \beta_i(\tau_k)\, (\tau_k - \mu_i)^T (\tau_k - \mu_i)}{\sum_{k=1}^{n} \beta_i(\tau_k)}, \quad \hat{\alpha}_i = \frac{1}{n} \sum_{k=1}^{n} \beta_i(\tau_k)    (14)

where \hat{\mu}_i, \hat{\sigma}_i^2, and \hat{\alpha}_i are the estimates of the GMM model parameters, and β_i(τ_k) is the posterior probability computed by formula (13). The parameter updates stop when the change in the parameters is smaller than min, where min is a constant representing the minimum tolerance.
Step 4: perform speaker clustering according to the speaker segmentation result
A K-means-based algorithm is used to cluster the segmented speech segments. First compute the territory density of each set; the point of maximum density is taken as the first initial point, the next initial point is the point at maximum distance from the first initial point, and so on, until the required number of initial points is reached.
Next compute the distance of each sample point to the set centers and update the center values, selecting the sample point satisfying formula (15) as the new set center:

Func = \sum_{j=1}^{J} \sum_{n=1}^{M} \| \hat{\tau}[n] - \tau_j \|^2    (15)

where \| \hat{\tau}[n] - \tau_j \| is the distance between the time-delay estimate vector \hat{\tau}[n] of each speech segment and the cluster center τ_j, τ_j is the center vector, J is the number of speakers, and M is the number of microphones.
Finally, the speech segments of the speakers in different spatial positions are classified and labeled according to the distance between the set center vectors and the segment vectors.
CN2010105683868A 2010-11-29 2010-11-29 Speaker clustering method for distributed microphone Active CN102074236B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010105683868A CN102074236B (en) 2010-11-29 2010-11-29 Speaker clustering method for distributed microphone


Publications (2)

Publication Number Publication Date
CN102074236A 2011-05-25
CN102074236B CN102074236B (en) 2012-06-06

Family

ID=44032754

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010105683868A Active CN102074236B (en) 2010-11-29 2010-11-29 Speaker clustering method for distributed microphone

Country Status (1)

Country Link
CN (1) CN102074236B (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH02209027A (en) * 1989-02-09 1990-08-20 Fujitsu Ltd Acoustic echo canceller
JPH1097276A (en) * 1996-09-20 1998-04-14 Canon Inc Method and device for speech recognition, and storage medium
CN101452704A (en) * 2007-11-29 2009-06-10 中国科学院声学研究所 Speaker clustering method based on information transfer

Cited By (69)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013020380A1 (en) * 2011-08-10 2013-02-14 歌尔声学股份有限公司 Communication headset speech enhancement method and device, and noise reduction communication headset
US9484042B2 (en) 2011-08-10 2016-11-01 Goertek Inc. Speech enhancing method, device for communication earphone and noise reducing communication earphone
KR101353686B1 (en) 2011-08-10 2014-01-20 고어텍 인크 Communication headset speech enhancement method and device, and noise reduction communication headset
CN102509548B (en) * 2011-10-09 2013-06-12 清华大学 Audio indexing method based on multi-distance sound sensor
CN102509548A (en) * 2011-10-09 2012-06-20 清华大学 Audio indexing method based on multi-distance sound sensor
US9685161B2 (en) 2012-07-09 2017-06-20 Huawei Device Co., Ltd. Method for updating voiceprint feature model and terminal
CN102760434A (en) * 2012-07-09 2012-10-31 华为终端有限公司 Method for updating voiceprint feature model and terminal
CN105580076A (en) * 2013-03-12 2016-05-11 谷歌技术控股有限责任公司 Delivery of medical devices
CN103175897A (en) * 2013-03-13 2013-06-26 西南交通大学 High-speed turnout damage recognition method based on vibration signal endpoint detection
CN103400580A (en) * 2013-07-23 2013-11-20 华南理工大学 Method for estimating importance degree of speaker in multiuser session voice
CN104347068A (en) * 2013-08-08 2015-02-11 索尼公司 Audio signal processing device, audio signal processing method and monitoring system
CN103439688A (en) * 2013-08-27 2013-12-11 大连理工大学 Sound source positioning system and method used for distributed microphone arrays
CN104575498A (en) * 2015-01-30 2015-04-29 深圳市云之讯网络技术有限公司 Recognition method and system of effective speeches
CN104575498B (en) * 2015-01-30 2018-08-17 深圳市云之讯网络技术有限公司 Efficient voice recognition methods and system
CN104767739B (en) * 2015-03-23 2018-01-30 电子科技大学 The method that unknown multi-protocols blended data frame is separated into single protocol data frame
CN104767739A (en) * 2015-03-23 2015-07-08 电子科技大学 Method for separating unknown multi-protocol mixed data frames into single protocol data frames
CN104766093A (en) * 2015-04-01 2015-07-08 中国科学院上海微系统与信息技术研究所 Sound target sorting method based on microphone array
CN104766093B (en) * 2015-04-01 2018-02-16 中国科学院上海微系统与信息技术研究所 A kind of acoustic target sorting technique based on microphone array
CN105161093B (en) * 2015-10-14 2019-07-09 科大讯飞股份有限公司 A kind of method and system judging speaker's number
CN105161093A (en) * 2015-10-14 2015-12-16 科大讯飞股份有限公司 Method and system for determining the number of speakers
CN105388459A (en) * 2015-11-20 2016-03-09 清华大学 Robustness sound source space positioning method of distributed microphone array network
CN105388459B (en) * 2015-11-20 2017-08-11 清华大学 The robust sound source space-location method of distributed microphone array network
CN106887231A (en) * 2015-12-16 2017-06-23 芋头科技(杭州)有限公司 A kind of identification model update method and system and intelligent terminal
CN106981289A (en) * 2016-01-14 2017-07-25 芋头科技(杭州)有限公司 A kind of identification model training method and system and intelligent terminal
CN105869645A (en) * 2016-03-25 2016-08-17 腾讯科技(深圳)有限公司 Voice data processing method and device
CN105869645B (en) * 2016-03-25 2019-04-12 腾讯科技(深圳)有限公司 Voice data processing method and device
CN109155130A (en) * 2016-05-13 2019-01-04 伯斯有限公司 Handle the voice from distributed microphone
CN109313910B (en) * 2016-05-19 2023-08-29 微软技术许可有限责任公司 Permutation invariant training for speaker independent multi-speaker speech separation
CN109313910A (en) * 2016-05-19 2019-02-05 微软技术许可有限责任公司 The constant training of displacement of the more speaker speech separation unrelated for talker
CN106405499A (en) * 2016-09-08 2017-02-15 南京阿凡达机器人科技有限公司 Method for robot to position sound source
CN107886951A (en) * 2016-09-29 2018-04-06 百度在线网络技术(北京)有限公司 A kind of speech detection method, device and equipment
CN106504773A (en) * 2016-11-08 2017-03-15 上海贝生医疗设备有限公司 A kind of wearable device and voice and activities monitoring system
CN106940997B (en) * 2017-03-20 2020-04-28 海信集团有限公司 Method and device for sending voice signal to voice recognition system
CN106940997A (en) * 2017-03-20 2017-07-11 海信集团有限公司 A kind of method and apparatus that voice signal is sent to speech recognition system
CN107202976A (en) * 2017-05-15 2017-09-26 大连理工大学 The distributed microphone array sound source localization system of low complex degree
US11950079B2 (en) 2017-06-29 2024-04-02 Huawei Technologies Co., Ltd. Delay estimation method and apparatus
US11304019B2 (en) 2017-06-29 2022-04-12 Huawei Technologies Co., Ltd. Delay estimation method and apparatus
CN107393549A (en) * 2017-07-21 2017-11-24 北京华捷艾米科技有限公司 Delay time estimation method and device
CN107885323A (en) * 2017-09-21 2018-04-06 南京邮电大学 A kind of VR scenes based on machine learning immerse control method
CN108364637A (en) * 2018-02-01 2018-08-03 福州大学 A kind of audio sentence boundary detection method
CN108364637B (en) * 2018-02-01 2021-07-13 福州大学 Audio sentence boundary detection method
CN108665894A (en) * 2018-04-06 2018-10-16 东莞市华睿电子科技有限公司 A kind of voice interactive method of household appliance
CN108872939A (en) * 2018-04-29 2018-11-23 桂林电子科技大学 Interior space geometric profile reconstructing method based on acoustics mirror image model
CN108872939B (en) * 2018-04-29 2020-09-29 桂林电子科技大学 Indoor space geometric outline reconstruction method based on acoustic mirror image model
CN109087648A (en) * 2018-08-21 2018-12-25 平安科技(深圳)有限公司 Sales counter voice monitoring method, device, computer equipment and storage medium
CN109087648B (en) * 2018-08-21 2023-10-20 平安科技(深圳)有限公司 Counter voice monitoring method and device, computer equipment and storage medium
CN109658948A (en) * 2018-12-21 2019-04-19 南京理工大学 One kind is towards the movable acoustic monitoring method of migratory bird moving
CN109618273A (en) * 2018-12-29 2019-04-12 北京声智科技有限公司 The device and method of microphone quality inspection
CN110021302A (en) * 2019-03-06 2019-07-16 厦门快商通信息咨询有限公司 A kind of Intelligent office conference system and minutes method
CN110290468A (en) * 2019-07-04 2019-09-27 英华达(上海)科技有限公司 Virtual sound insulation communication means, device, system, electronic equipment, storage medium
CN110290468B (en) * 2019-07-04 2020-09-22 英华达(上海)科技有限公司 Virtual sound insulation communication method, device, system, electronic device and storage medium
CN110428842A (en) * 2019-08-13 2019-11-08 广州国音智能科技有限公司 Speech model training method, device, equipment and computer readable storage medium
CN110501674A (en) * 2019-08-20 2019-11-26 长安大学 A kind of acoustical signal non line of sight recognition methods based on semi-supervised learning
CN111063341A (en) * 2019-12-31 2020-04-24 苏州思必驰信息科技有限公司 Method and system for segmenting and clustering multi-person voice in complex environment
CN112581941A (en) * 2020-11-17 2021-03-30 北京百度网讯科技有限公司 Audio recognition method and device, electronic equipment and storage medium
CN112735385A (en) * 2020-12-30 2021-04-30 科大讯飞股份有限公司 Voice endpoint detection method and device, computer equipment and storage medium
CN112735385B (en) * 2020-12-30 2024-05-31 中国科学技术大学 Voice endpoint detection method, device, computer equipment and storage medium
CN112684412B (en) * 2021-01-12 2022-09-13 中北大学 Sound source positioning method and system based on pattern clustering
CN112684437B (en) * 2021-01-12 2023-08-11 浙江大学 Passive ranging method based on time domain warping transformation
CN112684412A (en) * 2021-01-12 2021-04-20 中北大学 Sound source positioning method and system based on pattern clustering
CN112684437A (en) * 2021-01-12 2021-04-20 浙江大学 Passive distance measurement method based on time domain warping transformation
CN113096669B (en) * 2021-03-31 2022-05-27 重庆风云际会智慧科技有限公司 Speech recognition system based on role recognition
CN113096669A (en) * 2021-03-31 2021-07-09 重庆风云际会智慧科技有限公司 Voice recognition system based on role recognition
CN113178196B (en) * 2021-04-20 2023-02-07 平安国际融资租赁有限公司 Audio data extraction method and device, computer equipment and storage medium
CN113178196A (en) * 2021-04-20 2021-07-27 平安国际融资租赁有限公司 Audio data extraction method and device, computer equipment and storage medium
CN113573212A (en) * 2021-06-04 2021-10-29 成都千立智能科技有限公司 Sound amplification system and microphone channel data selection method
CN113380234A (en) * 2021-08-12 2021-09-10 明品云(北京)数据科技有限公司 Method, device, equipment and medium for generating form based on voice recognition
CN113808612A (en) * 2021-11-18 2021-12-17 阿里巴巴达摩院(杭州)科技有限公司 Voice processing method, device and storage medium
CN116030815A (en) * 2023-03-30 2023-04-28 北京建筑大学 Voice segmentation clustering method and device based on sound source position

Also Published As

Publication number Publication date
CN102074236B (en) 2012-06-06


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20181115

Address after: 100085 Beijing Haidian District Shangdi Information Industry Base Pioneer Road 1 B Block 2 Floor 2030

Patentee after: Beijing Huacong Zhijia Technology Co., Ltd.

Address before: 100084 Beijing 100084 box 82 box, Tsinghua University Patent Office

Patentee before: Tsinghua University