CN102509548A - Audio indexing method based on multi-distance sound sensor - Google Patents

Audio indexing method based on multi-distance sound sensor

Info

Publication number
CN102509548A
CN102509548A CN2011103035808A CN201110303580A
Authority
CN
China
Prior art keywords
sound sensor
multi-distance
feature
speaker
TDOA
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011103035808A
Other languages
Chinese (zh)
Other versions
CN102509548B (en)
Inventor
杨毅
陈国顺
王胜开
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN 201110303580 priority Critical patent/CN102509548B/en
Publication of CN102509548A publication Critical patent/CN102509548A/en
Application granted granted Critical
Publication of CN102509548B publication Critical patent/CN102509548B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses an audio indexing method based on multi-distance sound sensors. In the method, multi-distance sound sensors are used as the audio recording devices that record the audio information in a multimedia conference; a spatial multi-delay feature extracted from the multi-distance sound sensors is used as the feature for distinguishing different speakers; and a new manifold-based algorithm reduces the dimension of the multi-delay feature and classifies the speakers according to their identities. The method reduces system complexity and computational cost; the system finally outputs each speaker's audio segments and identity as the audio index information; the set of optimal discriminant vectors obtained by the method achieves optimal discrimination in theory; and the method can be applied to multi-person, multi-party conversation scenes in a complicated acoustic environment.

Description

Audio indexing method based on multi-distance sound sensors
Technical field
The invention belongs to the field of audio technology and relates to audio indexing, in particular to an audio indexing method based on multi-distance sound sensors.
Background technology
Teleconferencing and video conferencing penetrate ever more deeply into business activity and daily life, and the corresponding recorded data grow geometrically; in such scenes a segment of audio data usually contains several sound sources. Audio indexing technology can process this kind of data and lighten the burden of post-processing such as speech recognition.
Audio indexing technology automatically extracts information from audio data so that the target content can be searched for and found. Speaker classification is the key technique of audio indexing, and a speaker classification method comprises three parts: feature extraction, speech segmentation and classification decision. The main algorithms are the Gaussian-mixture log-likelihood ratio and the support vector machine (SVM). The former uses generative training (e.g. maximum-likelihood or MAP estimation) to produce the speaker model, while the latter uses discriminative training (e.g. GLDS-SVM or bag-of-N-grams) to produce the speaker model. GMM-SVM (Gaussian mixture model - support vector machine) is a mainstream modelling and classification method that builds a probability density model with a GMM and measures the distance between density distributions through an upper bound on the Kullback-Leibler divergence. The GMM-SVM method performs well but still has the following problems: a GMM has many parameters and only limited training data when estimating the probability density, and GMM-SVM is aimed mainly at speaker recognition and has not developed into a general-purpose technique.
Speaker diarization (Speech Diarization) first entered the Rich Transcription evaluation of the National Institute of Standards and Technology (NIST) in 2005. The purpose of automatic speaker segmentation and labelling is to divide audio data into fragments and classify them by speaker. The conditions of the 2009 Rich Transcription evaluation were: the number of speakers is unknown, the microphone positions are unknown, and the room acoustic environment is unknown; that is, the identities of multiple speakers must be determined and the audio data classified by speaker identity in a scene lacking both temporal and spatial prior information. The SPKR task is an important subtask of the speaker diarization evaluation; it mainly studies the problem of "who spoke when", with the objective of dividing audio data into fragments and classifying them by speaker. Speaker classification can be applied to fields such as speech recognition, audio information management and retrieval; it helps realize speaker tracking in the audio streams of meetings, voice mail, lectures and news broadcasts, and thus makes possible structured analysis, understanding and management of audio data.
A multi-distance sound-sensor system is a system composed of several sensors; its structure is unrestricted, and each sound sensor is controlled by a different device, so the collected signals are asynchronous. The advantages of a multi-distance sound-sensor system are its simple structure, ease of use and low cost, and it can be widely applied to sound source localization, audio indexing and recognition. Based on the special structure of the multi-distance sound sensors, multi-time-delay features can be used to classify sound sources that do not overlap in space. However, the dimension of the multi-delay feature vector grows rapidly as the number of sound sensors increases.
Recent publications point out that speech signals have an internal low-dimensional manifold structure. Riemann first proposed the concept of a manifold in 1854; Locality Preserving Projections (LPP) was introduced into pattern recognition in 2005 and has received extensive attention. LPP is an unsupervised learning method and does not consider the class information of the samples during learning. On the basis of LPP, Yu et al. combined the Fisher criterion to propose the Discriminant Locality Preserving Projections (DLPP) algorithm, which has been successfully applied to face recognition. The shortcomings of LPP-based algorithms are that the dimensionality reduction can affect the manifold distribution of the data and cause loss of discriminative information, and the small-sample-size problem. For the small-sample-size problem, Yang et al. proposed the Null-space Discriminant Locality Preserving Projections (NDLPP) algorithm, but this method only uses the discriminative information of the null space and ignores the discriminative information in the principal-component space.
Summary of the invention
In order to overcome the above deficiencies of the prior art, the object of the present invention is to provide an audio indexing method based on multi-distance sound sensors, which uses multi-delay features to classify sound sources that do not overlap in space and applies manifold-based dimensionality reduction to the high-dimensional multi-delay feature vectors. The set of optimal discriminant vectors obtained by this algorithm achieves optimal discrimination in theory, and the method can be applied to multi-person, multi-party conversation scenes in a complicated acoustic environment.
To achieve these goals, the technical scheme adopted by the present invention is:
An audio indexing method based on multi-distance sound sensors, comprising an information acquisition step, a feature extraction step and a classification decision step:
The information acquisition step is realized by the multi-distance sound sensors;
In the feature extraction step, the multiple time delays between each individual source and the multi-distance sound sensors form a multi-delay acoustic feature based on the spatial domain, and this spatial-domain feature is extracted as the discriminative information of the speaker. The time difference of arrival (TDOA) is defined as the element of the spatial feature:

    TDOA = (||m_i - s|| - ||m_j - s||) / c

where m_i and m_j denote the spatial positions of the i-th and j-th sound sensors respectively, s is the spatial position of the sound source, and c is the speed of sound. The TDOA values are estimated with the GCC-PHAT method, and the spatial acoustic feature obtained from the multi-distance sound sensors is

    T_k = [T̂_12, T̂_13, …, T̂_ij]^T

where k denotes the k-th speaker, i denotes the i-th sensor and j the j-th sensor of the multi-distance sound-sensor system, and T̂ denotes a TDOA estimate (an illustrative estimation sketch is given after the step descriptions below);
The discriminative structure of the spatial-domain feature is consistent on a statistical manifold, and this manifold is not a globally linear manifold;
The classification decision step is realized by applying a vector classification method to the results of the information acquisition step and the feature extraction step.
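The following is a minimal, illustrative Python sketch of the feature extraction described above: a GCC-PHAT estimate of the TDOA for one sensor pair, stacked over all pairs into the multi-delay vector T_k. It is not the patent's reference implementation; the function names, frame handling and the small regularisation constant are assumptions made for the example.

```python
import numpy as np

def gcc_phat_tdoa(x_i, x_j, fs, max_tau=None):
    """Estimate the TDOA (seconds) between two sensor signals with GCC-PHAT."""
    n = len(x_i) + len(x_j)
    X_i = np.fft.rfft(x_i, n=n)
    X_j = np.fft.rfft(x_j, n=n)
    cross = X_i * np.conj(X_j)
    cross /= np.abs(cross) + 1e-12          # PHAT weighting: keep only the phase
    cc = np.fft.irfft(cross, n=n)           # generalized cross-correlation
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(fs)

def frame_tdoa_vector(frames, fs):
    """Multi-delay feature T_k for one analysis frame: TDOAs of all sensor pairs.
    frames: list of 1-D arrays, one per sound sensor."""
    taus = []
    for i in range(len(frames)):
        for j in range(i + 1, len(frames)):
            taus.append(gcc_phat_tdoa(frames[i], frames[j], fs))
    return np.array(taus)
```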
Further, a spatio-temporal weighted fusion feature is extracted in the feature extraction step; that is, the spatial-domain feature is combined with conventional acoustic features as the discriminative information of the speaker; for example, the TDOA vector and the MFCC feature vector are fused as the discriminative information of the speaker.
After the feature extraction step is completed and before the classification decision step is performed, dimensionality reduction is applied to the multi-delay acoustic feature, and speaker classification is carried out through the spatial distinctiveness of a single sound source. The dimensionality reduction is performed by the following manifold dimensionality-reduction method:
In the first step, the TDOA estimates are preprocessed according to

    T[n] = { T̂[n-1],  if T̂[n] < Thr
             T̂[n],    if T̂[n] ≥ Thr }

where n is the index of a frame, T[n] is the delay value assigned to that frame and T̂[n] is the delay estimated for that frame; when the delay estimated at a given instant is below the threshold Thr, the estimate of the previous instant is used as the delay estimate of the current instant;
In the second step, the nearest-neighbour graph G is determined from the inter-node Euclidean distances;
In the third step, the weights are calculated: if there is an edge between nodes i and j on the nearest-neighbour graph G, the weight is defined as

    S_ij = exp(-||T_i - T_j||^2 / α)

where T denotes the TDOA estimate vector of each frame and α is a constant, with S_ij = S_ji; if there is no edge between nodes i and j on G, then S_ij = 0;
In the fourth step, the feature mapping is determined; the objective function is

    J(a) = [ Σ_{c=1}^{C} Σ_{k=1}^{n_c} (y_k^c - e_c) W_k^c (y_k^c - e_c)^T ] / [ Σ_{i,j=1}^{C} (e_i - e_j) V_ij (e_i - e_j)^T ]

where i ≠ j, C is the number of classes, n_c is the number of samples of class c, e_c is the expectation of the class-c samples, e_i and e_j are the expectations of the class-i and class-j samples respectively, and W_k^c and V_ij are respectively the within-class and between-class distance weights, computed as in the third step. Minimizing this objective is equivalent to solving for the eigenvectors corresponding to the smallest eigenvalues of

    a^T X (F - W) X^T a = Λ a^T E (D - V) E^T a

where Λ is the eigenvalue matrix, E = [e_1, …, e_C] is the matrix of class expectations,

    F_ii^c = Σ_{j=1}^{n_c} W_ij,   D_ii^c = Σ_{j=1}^{n_c} V_ij,
    F = diag(F^1, …, F^C),  W = diag(W^1, …, W^C),  D = diag(D^1, …, D^C),  V = diag(V^1, …, V^C).

Arranging the vector set a_1, …, a_M in ascending order of the corresponding eigenvalues then gives

    x_i → y_i = A^T x_i

where i = 1, …, N and A = [a_1, …, a_M].
Optionally, in the second step, the inter-node distance is defined by the Mahalanobis distance as follows:

    d_ij = (T_i - T_j) C^{-1} (T_i - T_j)^T

where d_ij is the Mahalanobis distance, i and j are nodes with i ≠ j, T is the TDOA estimate vector of each frame, C is the covariance matrix of T_i and T_j, and the graph G searches for neighbouring points by the distance defined above.
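A compact Python sketch of the four dimensionality-reduction steps above, under stated assumptions: labelled training frames are available for building the class-wise graph, the data matrix is stored with one frame per row (so X(F-W)X^T becomes X.T @ (F-W) @ X), and the parameters k, alpha, Thr and the small ridge added to keep the generalized eigenproblem well conditioned are illustrative values, not values prescribed by the method.

```python
import numpy as np
from scipy.linalg import eigh

def preprocess_tdoa(T_hat, thr):
    """Step 1: if an estimate falls below Thr, reuse the previous frame's estimate.
    T_hat: (n_frames, n_pairs) raw TDOA estimates; the per-component test is an assumption."""
    T = T_hat.copy()
    for n in range(1, T.shape[0]):
        mask = T[n] < thr
        T[n, mask] = T[n - 1, mask]
    return T

def heat_kernel_weights(X, k, alpha):
    """Steps 2-3: k-nearest-neighbour graph G and heat-kernel weights S_ij."""
    N = X.shape[0]
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # squared Euclidean distances
    S = np.zeros((N, N))
    for i in range(N):
        nbrs = np.argsort(d2[i])[1:k + 1]                        # skip the node itself
        S[i, nbrs] = np.exp(-d2[i, nbrs] / alpha)
    return np.maximum(S, S.T)                                    # enforce S_ij = S_ji

def fit_projection(X, labels, k=5, alpha=1.0, dim=3, ridge=1e-6):
    """Step 4: solve X(F-W)X^T a = lambda * E(D-V)E^T a and keep the eigenvectors
    with the smallest eigenvalues. X: (n_frames, n_dims), labels: speaker id per frame."""
    classes = np.unique(labels)
    N, d = X.shape
    W = np.zeros((N, N))
    E = np.zeros((d, len(classes)))
    for ci, c in enumerate(classes):                    # block-diagonal within-class weights
        idx = np.where(labels == c)[0]
        W[np.ix_(idx, idx)] = heat_kernel_weights(X[idx], min(k, max(len(idx) - 1, 0)), alpha)
        E[:, ci] = X[idx].mean(axis=0)                  # class expectation e_c
    F = np.diag(W.sum(axis=1))
    d2_means = np.sum((E.T[:, None, :] - E.T[None, :, :]) ** 2, axis=-1)
    V = np.exp(-d2_means / alpha)                       # between-class weights, same kernel
    np.fill_diagonal(V, 0.0)
    D = np.diag(V.sum(axis=1))
    lhs = X.T @ (F - W) @ X                             # X(F-W)X^T with row-wise data
    rhs = E @ (D - V) @ E.T + ridge * np.eye(d)         # E(D-V)E^T, lightly regularised
    vals, vecs = eigh(lhs, rhs)                         # generalized symmetric eigenproblem
    A = vecs[:, np.argsort(vals)[:dim]]                 # smallest eigenvalues first
    return A                                            # project new frames with y = A.T @ x
```

The Mahalanobis variant of the second step can be obtained by replacing the squared Euclidean distance inside heat_kernel_weights with (x_i - x_j) C^{-1} (x_i - x_j)^T, where C is the sample covariance matrix.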
The classification algorithm follows the above manifold dimensionality-reduction method. After the classification algorithm is completed, several different classifiers each provide a score for the classification decision, and decision-level fusion produces a decision output with optimized robustness and the best classification performance; the classification decision after decision-level fusion is the classification result, and the output of the system comprises all speech segments and their corresponding classification information. After the information acquisition step and before the feature extraction step, the speech signals are preprocessed; the preprocessing includes pre-emphasis and endpoint detection.
In the present invention, the multi-distance sound sensors include stand-alone sound sensors and sound sensors on portable devices.
Compared with the prior art, the advantages of the present invention are as follows:
The purpose of automatic speaker segmentation and labelling is to divide audio data into fragments classified by speaker, usually determining the identities of multiple speakers and classifying the audio data by speaker identity under conditions that lack both temporal and spatial prior information. A multi-distance microphone system can satisfy the requirements of automatic speaker segmentation and labelling in complex multi-source, multi-direction conversation scenes.
A multi-distance microphone system is a speech-signal input system composed of several single microphones, in which each microphone can be controlled by a different device and the topology formed by the microphones is unrestricted. Compared with a microphone array, multi-distance microphones have advantages such as low cost and flexible placement. In a speaker diarization conference scene with several participants, based on the special topology of the multi-distance microphones, the multiple time delays between each individual source and the microphones can form a multi-delay acoustic feature based on the spatial domain, which distinguishes the identities of speakers coming from non-overlapping spatial directions. Fusing this feature with conventional acoustic features further improves the performance of the multi-distance microphone speaker classification system. However, as the number of microphones increases, the dimension of the multi-delay feature vector grows rapidly.
Speech signals have an internal low-dimensional manifold structure. Locality Preserving Projections (LPP) and Null-space Discriminant Locality Preserving Projections (NDLPP) are two recently proposed manifold dimensionality-reduction algorithms: the former is an unsupervised machine-learning method that does not take between-class information of the samples into account, while the latter combines LPP with the Fisher criterion but only uses the discriminative information of the null space and ignores the discriminative information in the principal-component space.
The present invention proposes a nonlinear discriminant locality-preserving projection algorithm based on multi-delay features; it processes the spatial multi-delay acoustic feature and carries out speaker classification through the spatial distinctiveness of a single sound source, and the set of optimal discriminant vectors obtained by this algorithm achieves optimal discrimination in theory. The method adopted by the present invention avoids the prior-art problem of being unable to discriminate between classes, makes use of the discriminative information of both the null space and the principal space, and improves the efficiency of between-class discrimination.
Description of drawings
Fig. 1 is an implementation diagram of the multi-distance sound-sensor system of the present invention, comprising several target sound sources and multi-distance sound-sensor devices.
Fig. 2 is a flow chart of the manifold dimensionality-reduction method of the present invention.
Fig. 3 is a flow chart of multi-speaker classification based on multi-distance sound sensors according to the present invention.
Embodiment
The present invention is explained in further detail below in conjunction with the accompanying drawings and embodiments.
The input devices for speaker diarization include head-mounted microphones, single microphones, microphone arrays and multiple distance microphones (Multiple Distance Microphones). Multi-distance sound sensors satisfy the requirements of complex multi-source, multi-direction conversation scenes and can be applied to sound source localization, speaker clustering and recognition, etc.; based on the special topology of the multi-distance sound sensors, multi-delay features can be used to classify sound sources that do not overlap in space.
As shown in Fig. 1, a multi-distance sound-sensor system comprises several sound sensors, represented in Fig. 1 by four sound sensors 111-114, which are placed at random on the same platform. Likewise, three acoustic targets 101-103 represent sound sources that may be located at any position in the same room. The positions of all sound sensors and acoustic targets remain fixed during the audio recording.
The sound sensors 111-114 include stand-alone sound sensors such as microphones and sound sensors on portable devices such as notebook computers or PDA devices; here microphones are chosen. A multi-distance microphone system is a speech-signal input system composed of several single microphones, in which each microphone can be controlled by a different device and the topology formed by the microphones is unrestricted. Compared with a microphone array, multi-distance microphones have advantages such as low cost and flexible placement. In a speaker diarization conference scene with several participants, based on the special topology of the multi-distance microphones, the multiple time delays between each individual source and the microphones can form a multi-delay acoustic feature based on the spatial domain, which distinguishes the identities of speakers coming from non-overlapping spatial directions. The spatial multi-delay acoustic feature is defined as follows:
    [X_1; X_2; …; X_N] = [ τ̂_12^1  τ̂_12^2  …  τ̂_12^N
                           τ̂_13^1  τ̂_13^2  …  τ̂_13^N
                           ⋮        ⋮            ⋮
                           τ̂_ij^1  τ̂_ij^2  …  τ̂_ij^N ]^T

where X_k denotes the delay vector corresponding to the k-th sound source and τ̂_ij^k denotes the estimated delay difference of the k-th sound source between the i-th and j-th microphones.
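As a purely numerical illustration of the matrix just defined (not part of the recorded embodiment), the following self-contained sketch computes ideal, noise-free delay vectors X_k for a Fig. 1 style layout of four sensors and three sources directly from the geometric TDOA formula; all positions and the speed of sound are example values.

```python
import numpy as np
from itertools import combinations

C_SOUND = 343.0                                    # speed of sound in m/s (example value)
mics = np.array([[0.0, 0.0], [1.2, 0.0], [0.0, 1.5], [2.0, 1.0]])   # sensors 111-114 (example)
sources = np.array([[0.5, 2.0], [2.5, 0.5], [1.0, 3.0]])            # targets 101-103 (example)

rows = []
for s in sources:                                  # one delay vector X_k per sound source
    taus = [(np.linalg.norm(mics[i] - s) - np.linalg.norm(mics[j] - s)) / C_SOUND
            for i, j in combinations(range(len(mics)), 2)]
    rows.append(taus)

X = np.array(rows)                                 # shape (3 sources, 6 sensor pairs)
print(X)
```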
After the speech content of the meeting participants in a multimedia conference scene has been recorded with the multi-distance microphone devices, all the recorded information is preprocessed; this includes speech-signal preprocessing such as pre-emphasis and endpoint detection.
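A brief sketch of the two preprocessing operations mentioned here. The pre-emphasis coefficient, frame sizes and energy threshold are common illustrative values rather than values prescribed by this description, and the simple energy-based detector stands in for whatever endpoint-detection method is actually used.

```python
import numpy as np

def pre_emphasis(x, coeff=0.97):
    """First-order high-pass filter: y[n] = x[n] - coeff * x[n-1]."""
    return np.append(x[0], x[1:] - coeff * x[:-1])

def endpoint_flags(x, frame_len=400, hop=160, thresh_db=-40.0):
    """Energy-based endpoint detection: True for frames whose energy exceeds the threshold."""
    flags = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len]
        energy_db = 10.0 * np.log10(np.mean(frame ** 2) + 1e-12)
        flags.append(energy_db > thresh_db)
    return np.array(flags)
```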
Then, this spatial-domain feature is extracted as the discriminative information of the speaker. The time difference of arrival (TDOA) is defined as the element of the spatial feature:

    TDOA = (||m_i - s|| - ||m_j - s||) / c

where m_i and m_j denote the spatial positions of the i-th and j-th sound sensors respectively, s is the spatial position of the sound source, and c is the speed of sound. The TDOA values are estimated with the GCC-PHAT method, and the spatial acoustic feature obtained from the multi-distance sound sensors is

    T_k = [T̂_12, T̂_13, …, T̂_ij]^T

where k denotes the k-th speaker, i denotes the i-th microphone and j the j-th microphone of the multi-distance microphone system, and T̂ denotes a TDOA estimate.
There is also another kind of feature, namely the spatio-temporal weighted fusion feature, which combines the spatial feature of the sound source with conventional acoustic features; for example, the TDOA vector and the MFCC feature vector are fused as the principal feature for speaker classification. This fusion feature provides more discriminative information and can further improve the classification accuracy.
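The fusion feature can be sketched as a per-frame concatenation of the TDOA vector with an MFCC vector; the snippet below uses librosa purely as one possible MFCC extractor, and the weights, MFCC order and crude frame alignment are assumptions made for the example.

```python
import numpy as np
import librosa

def fused_features(ref_signal, sr, tdoa_frames, w_space=1.0, w_acoustic=1.0):
    """Spatio-temporal weighted fusion feature.
    ref_signal: audio from one sensor; tdoa_frames: (n_frames, n_pairs) TDOA matrix."""
    mfcc = librosa.feature.mfcc(y=ref_signal, sr=sr, n_mfcc=13).T   # (n_mfcc_frames, 13)
    n = min(len(tdoa_frames), len(mfcc))                            # align frame counts
    return np.hstack([w_space * tdoa_frames[:n], w_acoustic * mfcc[:n]])
```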
Here, the discriminative structure of the spatial-domain feature is consistent on a statistical manifold, and this manifold is not a globally linear manifold.
The advantage of the spatial multi-delay acoustic feature is that, compared with conventional acoustic features, it has clear discriminability. However, as the number of microphones grows, the dimension of the feature increases quadratically: with M microphones there are M(M-1)/2 microphone pairs and hence M(M-1)/2 delay components. A reasonable scheme is therefore needed that reduces the amount of computation while preserving the internal structural relations of the data, so as to further improve the effect of speaker classification.
Therefore, after the extraction step is completed, a manifold dimensionality-reduction algorithm based on the multi-delay feature is applied, mainly to reduce the dimension of the spatial multi-delay acoustic feature, and speaker classification is carried out through the spatial distinctiveness of a single sound source.
As shown in Fig. 2, the system input 201 is the speech content of the meeting participants in a multimedia conference scene; the multi-distance sound sensors 202 serve as the basic recording devices of the data; the multi-distance sound sensors yield the high-dimensional vector 203 whose elements are the spatial features; and the assumption 205 is that the discriminative structure of the spatial acoustic feature is consistent on a statistical manifold and that this manifold is not a globally linear manifold.
The dimensionality reduction is carried out by the following method:
In the first step, the TDOA estimates are preprocessed according to

    T[n] = { T̂[n-1],  if T̂[n] < Thr
             T̂[n],    if T̂[n] ≥ Thr }

where n is the index of a frame, T[n] is the delay value assigned to that frame and T̂[n] is the delay estimated for that frame; when the delay estimated at a given instant is below the threshold Thr, the estimate of the previous instant is used as the delay estimate of the current instant;
In the second step, the nearest-neighbour graph 204 is determined using the Euclidean distance; alternatively, when the components of the feature vector are correlated, the nearest-neighbour graph is built with the Mahalanobis distance, defined as

    d_ij = (x_i - x_j) C^{-1} (x_i - x_j)^T

where d_ij is the Mahalanobis distance, C is the covariance matrix of the samples, and the graph G searches for neighbouring points by the distance defined above.
In the third step, the weights 206 are calculated: if there is an edge between nodes i and j on the nearest-neighbour graph G, the weight is defined as

    S_ij = exp(-||T_i - T_j||^2 / α)

where T denotes the TDOA estimate vector of each frame and α is a constant, with S_ij = S_ji; if there is no edge between nodes i and j on G, then S_ij = 0;
In the fourth step, the feature mapping 207 is determined; the objective function is

    J(a) = [ Σ_{c=1}^{C} Σ_{k=1}^{n_c} (y_k^c - e_c) W_k^c (y_k^c - e_c)^T ] / [ Σ_{i,j=1}^{C} (e_i - e_j) V_ij (e_i - e_j)^T ]

where i ≠ j, C is the number of classes, n_c is the number of samples of class c, e_c is the expectation of the class-c samples, e_i and e_j are the expectations of the class-i and class-j samples respectively, and W_k^c and V_ij are respectively the within-class and between-class distance weights, computed as in the third step. Minimizing this objective is equivalent to solving for the eigenvectors corresponding to the smallest eigenvalues of

    a^T X (F - W) X^T a = Λ a^T E (D - V) E^T a

where Λ is the eigenvalue matrix, E = [e_1, …, e_C] is the matrix of class expectations,

    F_ii^c = Σ_{j=1}^{n_c} W_ij,   D_ii^c = Σ_{j=1}^{n_c} V_ij,
    F = diag(F^1, …, F^C),  W = diag(W^1, …, W^C),  D = diag(D^1, …, D^C),  V = diag(V^1, …, V^C).

Arranging the vector set a_1, …, a_M in ascending order of the corresponding eigenvalues then gives

    x_i → y_i = A^T x_i

where i = 1, …, N and A = [a_1, …, a_M].
Finally, the low-dimensional vector 208 is obtained, followed by the system output 209 after dimensionality reduction.
As shown in Fig. 3, the multi-speaker classification flow based on multi-distance sound sensors comprises the following:
The system input 301 is the speech content of the meeting participants in a multimedia conference scene; the multi-distance sound sensors 302 serve as the basic recording devices of the data; the feature-extraction initialization 303 comprises all the required speech-signal preprocessing techniques, such as pre-emphasis and endpoint detection. Subsequently, the spatial-domain feature 304 is extracted as the discriminative information of the speaker. In fact, the spatial-domain feature 304 collected by several sound sensors has a very high dimension, so the dimensionality-reduction classification algorithm 305 is needed to reduce the system complexity and carry out speaker classification; the dimensionality-reduction classification algorithm 305 adopts the manifold dimensionality-reduction method shown in Fig. 2. Of course, other manifold algorithms such as the LPP algorithm or the DLPP algorithm may also be used for dimensionality reduction; however, before the locality-preserving projection the LPP algorithm tends to apply PCA to the samples in order to improve computational efficiency, which may change the manifold distribution of the samples and cause loss of discriminative information, while the DLPP algorithm only uses the discriminative information in the principal-component space of the within-class scatter matrix and loses the large amount of discriminative information in its null space.
After the dimensionality-reduction classification algorithm 305 is completed, the classification decision 306 gives the classification result. The classification decision is usually made by several different classifiers that each provide a score; decision-level fusion produces the decision output with optimized robustness and the best classification performance, which is displayed as the classification result 307, while the global system output 308 comprises all speech segments and their corresponding classification information.
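The decision-level fusion of the classification decision 306 can be sketched as a weighted combination of normalised score matrices, one per classifier, followed by an arg-max per segment. The number of classifiers, the weights and the min-max normalisation are illustrative assumptions; the description does not prescribe a specific fusion rule.

```python
import numpy as np

def fuse_decisions(score_matrices, weights=None):
    """score_matrices: list of (n_segments, n_speakers) score arrays, one per classifier.
    Returns the fused speaker label for every audio segment."""
    if weights is None:
        weights = np.ones(len(score_matrices)) / len(score_matrices)
    fused = np.zeros_like(score_matrices[0], dtype=float)
    for w, scores in zip(weights, score_matrices):
        s_min, s_max = scores.min(), scores.max()
        fused += w * (scores - s_min) / (s_max - s_min + 1e-12)   # min-max normalisation
    return fused.argmax(axis=1)                                   # speaker id per segment
```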

Claims (10)

1. An audio indexing method based on multi-distance sound sensors, comprising an information acquisition step, a feature extraction step and a classification decision step, characterized in that:
    the information acquisition step is realized by the multi-distance sound sensors;
    in the feature extraction step, the multiple time delays between each individual source and the multi-distance sound sensors form a multi-delay acoustic feature based on the spatial domain, and this spatial-domain feature is extracted as the discriminative information of the speaker; the time difference of arrival (TDOA) is defined as the element of the spatial feature:

        TDOA = (||m_i - s|| - ||m_j - s||) / c

    where m_i and m_j denote the spatial positions of the i-th and j-th sound sensors respectively, s is the spatial position of the sound source, and c is the speed of sound; the TDOA values are estimated with the GCC-PHAT method, and the spatial acoustic feature obtained from the multi-distance sound sensors is

        T_k = [T̂_12, T̂_13, …, T̂_ij]^T

    where k denotes the k-th speaker, i denotes the i-th sensor and j the j-th sensor of the multi-distance sound-sensor system, and T̂ denotes a TDOA estimate;
    the discriminative structure of the spatial-domain feature is consistent on a statistical manifold, and this manifold is not a globally linear manifold;
    the classification decision step is realized by applying a vector classification method to the results of the information acquisition step and the feature extraction step.
2. The audio indexing method based on multi-distance sound sensors according to claim 1, characterized in that a spatio-temporal weighted fusion feature is extracted in the feature extraction step, that is, the spatial-domain feature is combined with conventional acoustic features as the discriminative information of the speaker.
3. The audio indexing method based on multi-distance sound sensors according to claim 2, characterized in that the TDOA vector and the MFCC feature vector are fused as the discriminative information of the speaker.
4. The audio indexing method based on multi-distance sound sensors according to claim 1, characterized in that, after the feature extraction step is completed and before the classification decision step is performed, dimensionality reduction is applied to the multi-delay acoustic feature, and speaker classification is carried out through the spatial distinctiveness of a single sound source.
5. The audio indexing method based on multi-distance sound sensors according to claim 4, characterized in that the dimensionality reduction is carried out by the following manifold dimensionality-reduction method:
    in the first step, the TDOA estimates are preprocessed according to

        T[n] = { T̂[n-1],  if T̂[n] < Thr
                 T̂[n],    if T̂[n] ≥ Thr }

    where n is the index of a frame, T[n] is the delay value assigned to that frame and T̂[n] is the delay estimated for that frame; when the delay estimated at a given instant is below the threshold Thr, the estimate of the previous instant is used as the delay estimate of the current instant;
    in the second step, the nearest-neighbour graph G is determined from the inter-node Euclidean distances;
    in the third step, the weights are calculated: if there is an edge between nodes i and j on the nearest-neighbour graph G, the weight is defined as

        S_ij = exp(-||T_i - T_j||^2 / α)

    where T denotes the TDOA estimate vector of each frame and α is a constant, with S_ij = S_ji; if there is no edge between nodes i and j on G, then S_ij = 0;
    in the fourth step, the feature mapping is determined; the objective function is

        J(a) = [ Σ_{c=1}^{C} Σ_{k=1}^{n_c} (y_k^c - e_c) W_k^c (y_k^c - e_c)^T ] / [ Σ_{i,j=1}^{C} (e_i - e_j) V_ij (e_i - e_j)^T ]

    where i ≠ j, C is the number of classes, n_c is the number of samples of class c, e_c is the expectation of the class-c samples, e_i and e_j are the expectations of the class-i and class-j samples respectively, and W_k^c and V_ij are respectively the within-class and between-class distance weights, computed as in the third step; minimizing this objective is equivalent to solving for the eigenvectors corresponding to the smallest eigenvalues of

        a^T X (F - W) X^T a = Λ a^T E (D - V) E^T a

    where Λ is the eigenvalue matrix, E = [e_1, …, e_C] is the matrix of class expectations,

        F_ii^c = Σ_{j=1}^{n_c} W_ij,   D_ii^c = Σ_{j=1}^{n_c} V_ij,
        F = diag(F^1, …, F^C),  W = diag(W^1, …, W^C),  D = diag(D^1, …, D^C),  V = diag(V^1, …, V^C);

    arranging the vector set a_1, …, a_M in ascending order of the corresponding eigenvalues then gives

        x_i → y_i = A^T x_i

    where i = 1, …, N and A = [a_1, …, a_M].
6. The audio indexing method based on multi-distance sound sensors according to claim 5, characterized in that, in the second step, the inter-node distance is defined by the Mahalanobis distance as follows:

        d_ij = (T_i - T_j) C^{-1} (T_i - T_j)^T

    where d_ij is the Mahalanobis distance, i and j are nodes with i ≠ j, T is the TDOA estimate vector of each frame, C is the covariance matrix of T_i and T_j, and the graph G searches for neighbouring points by the distance defined above.
7. The audio indexing method based on multi-distance sound sensors according to claim 5, characterized in that, after the classification algorithm is completed, several different classifiers each provide a score for the classification decision, and decision-level fusion produces a decision output with optimized robustness and the best classification performance.
8. The audio indexing method based on multi-distance sound sensors according to claim 7, characterized in that the classification decision after decision-level fusion is the classification result, and the output of the system comprises all speech segments and their corresponding classification information.
9. The audio indexing method based on multi-distance sound sensors according to claim 1, characterized in that, after the information acquisition step and before the feature extraction step, the speech signals are preprocessed, the preprocessing comprising pre-emphasis and endpoint detection.
10. The audio indexing method based on multi-distance sound sensors according to claim 1, characterized in that the multi-distance sound sensors comprise stand-alone sound sensors and sound sensors on portable devices.
CN 201110303580 2011-10-09 2011-10-09 Audio indexing method based on multi-distance sound sensor Active CN102509548B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201110303580 CN102509548B (en) 2011-10-09 2011-10-09 Audio indexing method based on multi-distance sound sensor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201110303580 CN102509548B (en) 2011-10-09 2011-10-09 Audio indexing method based on multi-distance sound sensor

Publications (2)

Publication Number Publication Date
CN102509548A true CN102509548A (en) 2012-06-20
CN102509548B CN102509548B (en) 2013-06-12

Family

ID=46221623

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201110303580 Active CN102509548B (en) 2011-10-09 2011-10-09 Audio indexing method based on multi-distance sound sensor

Country Status (1)

Country Link
CN (1) CN102509548B (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009205177A (en) * 2003-10-03 2009-09-10 Asahi Kasei Corp Data process unit, data processing unit control program and data processing method
CN102074236A (en) * 2010-11-29 2011-05-25 清华大学 Speaker clustering method for distributed microphone

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014082445A1 (en) * 2012-11-29 2014-06-05 华为技术有限公司 Method, device, and system for classifying audio conference minutes
US8838447B2 (en) 2012-11-29 2014-09-16 Huawei Technologies Co., Ltd. Method for classifying voice conference minutes, device, and system
CN103117815A (en) * 2012-12-28 2013-05-22 中国人民解放军信息工程大学 Time difference estimation method and device of multi-sensor signals
US10522151B2 (en) 2015-02-03 2019-12-31 Dolby Laboratories Licensing Corporation Conference segmentation based on conversational dynamics
CN106297794A (en) * 2015-05-22 2017-01-04 西安中兴新软件有限责任公司 Language and text conversion method and device
CN107851435A (en) * 2015-05-26 2018-03-27 纽昂斯通讯公司 Method and apparatus for reducing the delay in speech recognition application
CN106940997A (en) * 2017-03-20 2017-07-11 海信集团有限公司 Method and apparatus for sending a voice signal to a speech recognition system
CN106940997B (en) * 2017-03-20 2020-04-28 海信集团有限公司 Method and device for sending voice signal to voice recognition system

Also Published As

Publication number Publication date
CN102509548B (en) 2013-06-12

Similar Documents

Publication Publication Date Title
Cao et al. Polyphonic sound event detection and localization using a two-stage strategy
Torfi et al. 3d convolutional neural networks for cross audio-visual matching recognition
CN102509548B (en) Audio indexing method based on multi-distance sound sensor
Anguera et al. Speaker diarization: A review of recent research
Heittola et al. Audio context recognition using audio event histograms
Adavanne et al. Multichannel sound event detection using 3D convolutional neural networks for learning inter-channel features
Carletti et al. Audio surveillance using a bag of aural words classifier
CN106251874A Voice access control and quiet environment monitoring method and system
Chung et al. Who said that?: Audio-visual speaker diarisation of real-world meetings
Sun et al. Speaker diarization system for RT07 and RT09 meeting room audio
Ziaei et al. Prof-Life-Log: Personal interaction analysis for naturalistic audio streams
CN109935226A Far-field speech recognition enhancement system and method based on a deep neural network
Jati et al. Hierarchy-aware loss function on a tree structured label space for audio event detection
Waldekar et al. Classification of audio scenes with novel features in a fused system framework
Abrol et al. Learning hierarchy aware embedding from raw audio for acoustic scene classification
Freire-Obregón et al. Improving user verification in human-robot interaction from audio or image inputs through sample quality assessment
Mohmmad et al. Tree cutting sound detection using deep learning techniques based on Mel Spectrogram and MFCC features
Zhang et al. Few-shot bioacoustic event detection using prototypical network with background class
Jing et al. DCAR: A discriminative and compact audio representation for audio processing
Cabañas-Molero et al. Multimodal speaker diarization for meetings using volume-evaluated SRP-PHAT and video analysis
CN111179959A (en) Competitive speaker number estimation method and system based on speaker embedding space
CN116259313A (en) Sound event positioning and detecting method based on time domain convolution network
Friedland et al. Speaker recognition and diarization
Canton-Ferrer et al. Audiovisual event detection towards scene understanding
Vajaria et al. Exploring co-occurence between speech and body movement for audio-guided video localization

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant