CN102509548A - Audio indexing method based on multi-distance sound sensor - Google Patents

Audio indexing method based on multi-distance sound sensor

Info

Publication number
CN102509548A
CN102509548A CN2011103035808A CN201110303580A
Authority
CN
China
Prior art keywords
sound sensor
multi-distance
feature
speaker
TDOA
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011103035808A
Other languages
Chinese (zh)
Other versions
CN102509548B (en)
Inventor
杨毅
陈国顺
王胜开
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN 201110303580 priority Critical patent/CN102509548B/en
Publication of CN102509548A publication Critical patent/CN102509548A/en
Application granted granted Critical
Publication of CN102509548B publication Critical patent/CN102509548B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses an audio indexing method based on multi-distance sound sensors. In the method, multi-distance sound sensors are used as the audio recording devices that record the audio information in a multimedia conference; a spatial multi-delay feature extracted from the multi-distance sound sensors is used as the feature for distinguishing different speakers; and a new manifold-based algorithm reduces the dimension of the multi-delay feature and classifies the speakers according to their identities. The method reduces system complexity and computational cost; the system finally outputs each speaker's audio segments and identity as the audio index information; the set of optimal discriminant vectors obtained by the method achieves optimal discrimination in theory; and the method can be applied to multi-person, multi-party conversation scenes in a complicated acoustic environment.

Description

Audio indexing method based on multi-distance sound sensors
Technical field
The invention belongs to the field of audio technology and relates to audio indexing, in particular to an audio indexing method based on multi-distance sound sensors.
Background technology
Teleconferencing and video conferencing penetrate ever more deeply into business activity and daily life, and the corresponding recorded data grow geometrically; in such scenes a segment of audio data usually contains several sound sources. Audio indexing technology can process this kind of data and lighten the burden of post-processing such as speech recognition.
Audio indexing technology automatically extracts information from audio data so that the target content can be searched for and found. Speaker classification is the key technique of audio indexing, and a speaker classification method comprises three parts: feature extraction, speech segmentation and classification decision. The main algorithms are the Gaussian-mixture log-likelihood ratio and the support vector machine (SVM). The former uses generative training (e.g. maximum-likelihood or MAP estimation) to produce the speaker model, while the latter uses discriminative training (e.g. GLDS-SVM or bag-of-N-grams) to produce the speaker model. GMM-SVM (Gaussian mixture model - support vector machine) is a mainstream modelling and classification method that builds a probability density model with a GMM and measures the distance between density distributions through an upper bound on the Kullback-Leibler divergence. The GMM-SVM method performs well but still has the following problems: a GMM has many parameters and only limited training data when estimating the probability density, and GMM-SVM is aimed mainly at speaker recognition and has not developed into a general-purpose technique.
Speaker diarization (Speech Diarization) first entered the Rich Transcription evaluation of the National Institute of Standards and Technology (NIST) in 2005. The purpose of automatic speaker segmentation and labelling is to divide audio data into fragments and classify them by speaker. The conditions of the 2009 Rich Transcription evaluation were: the number of speakers is unknown, the microphone positions are unknown, and the room acoustic environment is unknown; that is, the identities of multiple speakers must be determined and the audio data classified by speaker identity in a scene lacking both temporal and spatial prior information. The SPKR task is an important subtask of the speaker diarization evaluation; it mainly studies the problem of "who spoke when", with the objective of dividing audio data into fragments and classifying them by speaker. Speaker classification can be applied to fields such as speech recognition, audio information management and retrieval; it helps realize speaker tracking in the audio streams of meetings, voice mail, lectures and news broadcasts, and thus makes possible structured analysis, understanding and management of audio data.
A multi-distance sound-sensor system is a system composed of several sensors; its structure is unrestricted, and each sound sensor is controlled by a different device, so the collected signals are asynchronous. The advantages of a multi-distance sound-sensor system are its simple structure, ease of use and low cost, and it can be widely applied to sound source localization, audio indexing and recognition. Based on the special structure of the multi-distance sound sensors, multi-time-delay features can be used to classify sound sources that do not overlap in space. However, the dimension of the multi-delay feature vector grows rapidly as the number of sound sensors increases.
Recent publications point out that speech signals have an internal low-dimensional manifold structure. Riemann first proposed the concept of a manifold in 1854; Locality Preserving Projections (LPP) was introduced into pattern recognition in 2005 and has received extensive attention. LPP is an unsupervised learning method and does not consider the class information of the samples during learning. On the basis of LPP, Yu et al. combined the Fisher criterion to propose the Discriminant Locality Preserving Projections (DLPP) algorithm, which has been successfully applied to face recognition. The shortcomings of LPP-based algorithms are that the dimensionality reduction can affect the manifold distribution of the data and cause loss of discriminative information, and the small-sample-size problem. For the small-sample-size problem, Yang et al. proposed the Null-space Discriminant Locality Preserving Projections (NDLPP) algorithm, but this method only uses the discriminative information of the null space and ignores the discriminative information in the principal-component space.
Summary of the invention
In order to overcome the above deficiencies of the prior art, the object of the present invention is to provide an audio indexing method based on multi-distance sound sensors, which uses multi-delay features to classify sound sources that do not overlap in space and applies manifold-based dimensionality reduction to the high-dimensional multi-delay feature vectors. The set of optimal discriminant vectors obtained by this algorithm achieves optimal discrimination in theory, and the method can be applied to multi-person, multi-party conversation scenes in a complicated acoustic environment.
To achieve these goals, the technical scheme adopted by the present invention is:
An audio indexing method based on multi-distance sound sensors, comprising an information acquisition step, a feature extraction step and a classification decision step:
The information acquisition step is realized by the multi-distance sound sensors;
In the feature extraction step, the multiple time delays between each individual source and the multi-distance sound sensors form a multi-delay acoustic feature based on the spatial domain, and this spatial-domain feature is extracted as the discriminative information of the speaker. The time difference of arrival (TDOA) is defined as the element of the spatial feature:

    TDOA = (||m_i - s|| - ||m_j - s||) / c

where m_i and m_j denote the spatial positions of the i-th and j-th sound sensors respectively, s is the spatial position of the sound source, and c is the speed of sound. The TDOA values are estimated with the GCC-PHAT method, and the spatial acoustic feature obtained from the multi-distance sound sensors is

    T_k = [T̂_12, T̂_13, …, T̂_ij]^T

where k denotes the k-th speaker, i denotes the i-th sensor and j the j-th sensor of the multi-distance sound-sensor system, and T̂ denotes a TDOA estimate (an illustrative estimation sketch is given after the step descriptions below);
The discriminative structure of the spatial-domain feature is consistent on a statistical manifold, and this manifold is not a globally linear manifold;
The classification decision step is realized by applying a vector classification method to the results of the information acquisition step and the feature extraction step.
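The following is a minimal, illustrative Python sketch of the feature extraction described above: a GCC-PHAT estimate of the TDOA for one sensor pair, stacked over all pairs into the multi-delay vector T_k. It is not the patent's reference implementation; the function names, frame handling and the small regularisation constant are assumptions made for the example.

```python
import numpy as np

def gcc_phat_tdoa(x_i, x_j, fs, max_tau=None):
    """Estimate the TDOA (seconds) between two sensor signals with GCC-PHAT."""
    n = len(x_i) + len(x_j)
    X_i = np.fft.rfft(x_i, n=n)
    X_j = np.fft.rfft(x_j, n=n)
    cross = X_i * np.conj(X_j)
    cross /= np.abs(cross) + 1e-12          # PHAT weighting: keep only the phase
    cc = np.fft.irfft(cross, n=n)           # generalized cross-correlation
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(fs)

def frame_tdoa_vector(frames, fs):
    """Multi-delay feature T_k for one analysis frame: TDOAs of all sensor pairs.
    frames: list of 1-D arrays, one per sound sensor."""
    taus = []
    for i in range(len(frames)):
        for j in range(i + 1, len(frames)):
            taus.append(gcc_phat_tdoa(frames[i], frames[j], fs))
    return np.array(taus)
```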
Further, a spatio-temporal weighted fusion feature is extracted in the feature extraction step; that is, the spatial-domain feature is combined with conventional acoustic features as the discriminative information of the speaker; for example, the TDOA vector and the MFCC feature vector are fused as the discriminative information of the speaker.
After the feature extraction step is completed and before the classification decision step is performed, dimensionality reduction is applied to the multi-delay acoustic feature, and speaker classification is carried out through the spatial distinctiveness of a single sound source. The dimensionality reduction is performed by the following manifold dimensionality-reduction method:
In the first step, the TDOA estimates are preprocessed according to

    T[n] = { T̂[n-1],  if T̂[n] < Thr
             T̂[n],    if T̂[n] ≥ Thr }

where n is the index of a frame, T[n] is the delay value assigned to that frame and T̂[n] is the delay estimated for that frame; when the delay estimated at a given instant is below the threshold Thr, the estimate of the previous instant is used as the delay estimate of the current instant;
In the second step, the nearest-neighbour graph G is determined from the inter-node Euclidean distances;
In the third step, the weights are calculated: if there is an edge between nodes i and j on the nearest-neighbour graph G, the weight is defined as

    S_ij = exp(-||T_i - T_j||^2 / α)

where T denotes the TDOA estimate vector of each frame and α is a constant, with S_ij = S_ji; if there is no edge between nodes i and j on G, then S_ij = 0;
In the fourth step, the feature mapping is determined; the objective function is

    J(a) = [ Σ_{c=1}^{C} Σ_{k=1}^{n_c} (y_k^c - e_c) W_k^c (y_k^c - e_c)^T ] / [ Σ_{i,j=1}^{C} (e_i - e_j) V_ij (e_i - e_j)^T ]

where i ≠ j, C is the number of classes, n_c is the number of samples of class c, e_c is the expectation of the class-c samples, e_i and e_j are the expectations of the class-i and class-j samples respectively, and W_k^c and V_ij are respectively the within-class and between-class distance weights, computed as in the third step. Minimizing this objective is equivalent to solving for the eigenvectors corresponding to the smallest eigenvalues of

    a^T X (F - W) X^T a = Λ a^T E (D - V) E^T a

where Λ is the eigenvalue matrix, E = [e_1, …, e_C] is the matrix of class expectations,

    F_ii^c = Σ_{j=1}^{n_c} W_ij,   D_ii^c = Σ_{j=1}^{n_c} V_ij,
    F = diag(F^1, …, F^C),  W = diag(W^1, …, W^C),  D = diag(D^1, …, D^C),  V = diag(V^1, …, V^C).

Arranging the vector set a_1, …, a_M in ascending order of the corresponding eigenvalues then gives

    x_i → y_i = A^T x_i

where i = 1, …, N and A = [a_1, …, a_M].
Optionally, in the second step, the inter-node distance is defined by the Mahalanobis distance as follows:

    d_ij = (T_i - T_j) C^{-1} (T_i - T_j)^T

where d_ij is the Mahalanobis distance, i and j are nodes with i ≠ j, T is the TDOA estimate vector of each frame, C is the covariance matrix of T_i and T_j, and the graph G searches for neighbouring points by the distance defined above.
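A compact Python sketch of the four dimensionality-reduction steps above, under stated assumptions: labelled training frames are available for building the class-wise graph, the data matrix is stored with one frame per row (so X(F-W)X^T becomes X.T @ (F-W) @ X), and the parameters k, alpha, Thr and the small ridge added to keep the generalized eigenproblem well conditioned are illustrative values, not values prescribed by the method.

```python
import numpy as np
from scipy.linalg import eigh

def preprocess_tdoa(T_hat, thr):
    """Step 1: if an estimate falls below Thr, reuse the previous frame's estimate.
    T_hat: (n_frames, n_pairs) raw TDOA estimates; the per-component test is an assumption."""
    T = T_hat.copy()
    for n in range(1, T.shape[0]):
        mask = T[n] < thr
        T[n, mask] = T[n - 1, mask]
    return T

def heat_kernel_weights(X, k, alpha):
    """Steps 2-3: k-nearest-neighbour graph G and heat-kernel weights S_ij."""
    N = X.shape[0]
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # squared Euclidean distances
    S = np.zeros((N, N))
    for i in range(N):
        nbrs = np.argsort(d2[i])[1:k + 1]                        # skip the node itself
        S[i, nbrs] = np.exp(-d2[i, nbrs] / alpha)
    return np.maximum(S, S.T)                                    # enforce S_ij = S_ji

def fit_projection(X, labels, k=5, alpha=1.0, dim=3, ridge=1e-6):
    """Step 4: solve X(F-W)X^T a = lambda * E(D-V)E^T a and keep the eigenvectors
    with the smallest eigenvalues. X: (n_frames, n_dims), labels: speaker id per frame."""
    classes = np.unique(labels)
    N, d = X.shape
    W = np.zeros((N, N))
    E = np.zeros((d, len(classes)))
    for ci, c in enumerate(classes):                    # block-diagonal within-class weights
        idx = np.where(labels == c)[0]
        W[np.ix_(idx, idx)] = heat_kernel_weights(X[idx], min(k, max(len(idx) - 1, 0)), alpha)
        E[:, ci] = X[idx].mean(axis=0)                  # class expectation e_c
    F = np.diag(W.sum(axis=1))
    d2_means = np.sum((E.T[:, None, :] - E.T[None, :, :]) ** 2, axis=-1)
    V = np.exp(-d2_means / alpha)                       # between-class weights, same kernel
    np.fill_diagonal(V, 0.0)
    D = np.diag(V.sum(axis=1))
    lhs = X.T @ (F - W) @ X                             # X(F-W)X^T with row-wise data
    rhs = E @ (D - V) @ E.T + ridge * np.eye(d)         # E(D-V)E^T, lightly regularised
    vals, vecs = eigh(lhs, rhs)                         # generalized symmetric eigenproblem
    A = vecs[:, np.argsort(vals)[:dim]]                 # smallest eigenvalues first
    return A                                            # project new frames with y = A.T @ x
```

The Mahalanobis variant of the second step can be obtained by replacing the squared Euclidean distance inside heat_kernel_weights with (x_i - x_j) C^{-1} (x_i - x_j)^T, where C is the sample covariance matrix.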
The classification algorithm follows the above manifold dimensionality-reduction method. After the classification algorithm is completed, several different classifiers each provide a score for the classification decision, and decision-level fusion produces a decision output with optimized robustness and the best classification performance; the classification decision after decision-level fusion is the classification result, and the output of the system comprises all speech segments and their corresponding classification information. After the information acquisition step and before the feature extraction step, the speech signals are preprocessed; the preprocessing includes pre-emphasis and endpoint detection.
In the present invention, the multi-distance sound sensors include stand-alone sound sensors and sound sensors on portable devices.
Compared with the prior art, the advantages of the present invention are as follows:
The purpose of automatic speaker segmentation and labelling is to divide audio data into fragments classified by speaker, usually determining the identities of multiple speakers and classifying the audio data by speaker identity under conditions that lack both temporal and spatial prior information. A multi-distance microphone system can satisfy the requirements of automatic speaker segmentation and labelling in complex multi-source, multi-direction conversation scenes.
A multi-distance microphone system is a speech-signal input system composed of several single microphones, in which each microphone can be controlled by a different device and the topology formed by the microphones is unrestricted. Compared with a microphone array, multi-distance microphones have advantages such as low cost and flexible placement. In a speaker diarization conference scene with several participants, based on the special topology of the multi-distance microphones, the multiple time delays between each individual source and the microphones can form a multi-delay acoustic feature based on the spatial domain, which distinguishes the identities of speakers coming from non-overlapping spatial directions. Fusing this feature with conventional acoustic features further improves the performance of the multi-distance microphone speaker classification system. However, as the number of microphones increases, the dimension of the multi-delay feature vector grows rapidly.
Speech signals have an internal low-dimensional manifold structure. Locality Preserving Projections (LPP) and Null-space Discriminant Locality Preserving Projections (NDLPP) are two recently proposed manifold dimensionality-reduction algorithms: the former is an unsupervised machine-learning method that does not take between-class information of the samples into account, while the latter combines LPP with the Fisher criterion but only uses the discriminative information of the null space and ignores the discriminative information in the principal-component space.
The present invention proposes a nonlinear discriminant locality-preserving projection algorithm based on multi-delay features; it processes the spatial multi-delay acoustic feature and carries out speaker classification through the spatial distinctiveness of a single sound source, and the set of optimal discriminant vectors obtained by this algorithm achieves optimal discrimination in theory. The method adopted by the present invention avoids the prior-art problem of being unable to discriminate between classes, makes use of the discriminative information of both the null space and the principal space, and improves the efficiency of between-class discrimination.
Description of drawings
Fig. 1 is an implementation diagram of the multi-distance sound-sensor system of the present invention, comprising several target sound sources and multi-distance sound-sensor devices.
Fig. 2 is a flow chart of the manifold dimensionality-reduction method of the present invention.
Fig. 3 is a flow chart of multi-speaker classification based on multi-distance sound sensors according to the present invention.
Embodiment
The present invention is explained in further detail below in conjunction with the accompanying drawings and embodiments.
The input devices for speaker diarization include head-mounted microphones, single microphones, microphone arrays and multiple distance microphones (Multiple Distance Microphones). Multi-distance sound sensors satisfy the requirements of complex multi-source, multi-direction conversation scenes and can be applied to sound source localization, speaker clustering and recognition, etc.; based on the special topology of the multi-distance sound sensors, multi-delay features can be used to classify sound sources that do not overlap in space.
As shown in Fig. 1, a multi-distance sound-sensor system comprises several sound sensors, represented in Fig. 1 by four sound sensors 111-114, which are placed at random on the same platform. Likewise, three acoustic targets 101-103 represent sound sources that may be located at any position in the same room. The positions of all sound sensors and acoustic targets remain fixed during the audio recording.
The sound sensors 111-114 include stand-alone sound sensors such as microphones and sound sensors on portable devices such as notebook computers or PDA devices; here microphones are chosen. A multi-distance microphone system is a speech-signal input system composed of several single microphones, in which each microphone can be controlled by a different device and the topology formed by the microphones is unrestricted. Compared with a microphone array, multi-distance microphones have advantages such as low cost and flexible placement. In a speaker diarization conference scene with several participants, based on the special topology of the multi-distance microphones, the multiple time delays between each individual source and the microphones can form a multi-delay acoustic feature based on the spatial domain, which distinguishes the identities of speakers coming from non-overlapping spatial directions. The spatial multi-delay acoustic feature is defined as follows:
    [X_1; X_2; …; X_N] = [ τ̂_12^1  τ̂_12^2  …  τ̂_12^N
                           τ̂_13^1  τ̂_13^2  …  τ̂_13^N
                           ⋮        ⋮            ⋮
                           τ̂_ij^1  τ̂_ij^2  …  τ̂_ij^N ]^T

where X_k denotes the delay vector corresponding to the k-th sound source and τ̂_ij^k denotes the estimated delay difference of the k-th sound source between the i-th and j-th microphones.
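As a purely numerical illustration of the matrix just defined (not part of the recorded embodiment), the following self-contained sketch computes ideal, noise-free delay vectors X_k for a Fig. 1 style layout of four sensors and three sources directly from the geometric TDOA formula; all positions and the speed of sound are example values.

```python
import numpy as np
from itertools import combinations

C_SOUND = 343.0                                    # speed of sound in m/s (example value)
mics = np.array([[0.0, 0.0], [1.2, 0.0], [0.0, 1.5], [2.0, 1.0]])   # sensors 111-114 (example)
sources = np.array([[0.5, 2.0], [2.5, 0.5], [1.0, 3.0]])            # targets 101-103 (example)

rows = []
for s in sources:                                  # one delay vector X_k per sound source
    taus = [(np.linalg.norm(mics[i] - s) - np.linalg.norm(mics[j] - s)) / C_SOUND
            for i, j in combinations(range(len(mics)), 2)]
    rows.append(taus)

X = np.array(rows)                                 # shape (3 sources, 6 sensor pairs)
print(X)
```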
After the speech content of the meeting participants in a multimedia conference scene has been recorded with the multi-distance microphone devices, all the recorded information is preprocessed; this includes speech-signal preprocessing such as pre-emphasis and endpoint detection.
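A brief sketch of the two preprocessing operations mentioned here. The pre-emphasis coefficient, frame sizes and energy threshold are common illustrative values rather than values prescribed by this description, and the simple energy-based detector stands in for whatever endpoint-detection method is actually used.

```python
import numpy as np

def pre_emphasis(x, coeff=0.97):
    """First-order high-pass filter: y[n] = x[n] - coeff * x[n-1]."""
    return np.append(x[0], x[1:] - coeff * x[:-1])

def endpoint_flags(x, frame_len=400, hop=160, thresh_db=-40.0):
    """Energy-based endpoint detection: True for frames whose energy exceeds the threshold."""
    flags = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len]
        energy_db = 10.0 * np.log10(np.mean(frame ** 2) + 1e-12)
        flags.append(energy_db > thresh_db)
    return np.array(flags)
```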
Then, this spatial-domain feature is extracted as the discriminative information of the speaker. The time difference of arrival (TDOA) is defined as the element of the spatial feature:

    TDOA = (||m_i - s|| - ||m_j - s||) / c

where m_i and m_j denote the spatial positions of the i-th and j-th sound sensors respectively, s is the spatial position of the sound source, and c is the speed of sound. The TDOA values are estimated with the GCC-PHAT method, and the spatial acoustic feature obtained from the multi-distance sound sensors is

    T_k = [T̂_12, T̂_13, …, T̂_ij]^T

where k denotes the k-th speaker, i denotes the i-th microphone and j the j-th microphone of the multi-distance microphone system, and T̂ denotes a TDOA estimate.
There is also another kind of feature, namely the spatio-temporal weighted fusion feature, which combines the spatial feature of the sound source with conventional acoustic features; for example, the TDOA vector and the MFCC feature vector are fused as the principal feature for speaker classification. This fusion feature provides more discriminative information and can further improve the classification accuracy.
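The fusion feature can be sketched as a per-frame concatenation of the TDOA vector with an MFCC vector; the snippet below uses librosa purely as one possible MFCC extractor, and the weights, MFCC order and crude frame alignment are assumptions made for the example.

```python
import numpy as np
import librosa

def fused_features(ref_signal, sr, tdoa_frames, w_space=1.0, w_acoustic=1.0):
    """Spatio-temporal weighted fusion feature.
    ref_signal: audio from one sensor; tdoa_frames: (n_frames, n_pairs) TDOA matrix."""
    mfcc = librosa.feature.mfcc(y=ref_signal, sr=sr, n_mfcc=13).T   # (n_mfcc_frames, 13)
    n = min(len(tdoa_frames), len(mfcc))                            # align frame counts
    return np.hstack([w_space * tdoa_frames[:n], w_acoustic * mfcc[:n]])
```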
Here, the discriminative structure of the spatial-domain feature is consistent on a statistical manifold, and this manifold is not a globally linear manifold.
The advantage of the spatial multi-delay acoustic feature is that, compared with conventional acoustic features, it has clear discriminability. However, as the number of microphones grows, the dimension of the feature increases quadratically: with M microphones there are M(M-1)/2 microphone pairs and hence M(M-1)/2 delay components. A reasonable scheme is therefore needed that reduces the amount of computation while preserving the internal structural relations of the data, so as to further improve the effect of speaker classification.
Therefore, after the extraction step is completed, a manifold dimensionality-reduction algorithm based on the multi-delay feature is applied, mainly to reduce the dimension of the spatial multi-delay acoustic feature, and speaker classification is carried out through the spatial distinctiveness of a single sound source.
As shown in Fig. 2, the system input 201 is the speech content of the meeting participants in a multimedia conference scene; the multi-distance sound sensors 202 serve as the basic recording devices of the data; the multi-distance sound sensors yield the high-dimensional vector 203 whose elements are the spatial features; and the assumption 205 is that the discriminative structure of the spatial acoustic feature is consistent on a statistical manifold and that this manifold is not a globally linear manifold.
The dimensionality reduction is carried out by the following method:
In the first step, the TDOA estimates are preprocessed according to

    T[n] = { T̂[n-1],  if T̂[n] < Thr
             T̂[n],    if T̂[n] ≥ Thr }

where n is the index of a frame, T[n] is the delay value assigned to that frame and T̂[n] is the delay estimated for that frame; when the delay estimated at a given instant is below the threshold Thr, the estimate of the previous instant is used as the delay estimate of the current instant;
In the second step, the nearest-neighbour graph 204 is determined using the Euclidean distance; alternatively, when the components of the feature vector are correlated, the nearest-neighbour graph is built with the Mahalanobis distance, defined as

    d_ij = (x_i - x_j) C^{-1} (x_i - x_j)^T

where d_ij is the Mahalanobis distance, C is the covariance matrix of the samples, and the graph G searches for neighbouring points by the distance defined above.
In the third step, the weights 206 are calculated: if there is an edge between nodes i and j on the nearest-neighbour graph G, the weight is defined as

    S_ij = exp(-||T_i - T_j||^2 / α)

where T denotes the TDOA estimate vector of each frame and α is a constant, with S_ij = S_ji; if there is no edge between nodes i and j on G, then S_ij = 0;
In the fourth step, the feature mapping 207 is determined; the objective function is

    J(a) = [ Σ_{c=1}^{C} Σ_{k=1}^{n_c} (y_k^c - e_c) W_k^c (y_k^c - e_c)^T ] / [ Σ_{i,j=1}^{C} (e_i - e_j) V_ij (e_i - e_j)^T ]

where i ≠ j, C is the number of classes, n_c is the number of samples of class c, e_c is the expectation of the class-c samples, e_i and e_j are the expectations of the class-i and class-j samples respectively, and W_k^c and V_ij are respectively the within-class and between-class distance weights, computed as in the third step. Minimizing this objective is equivalent to solving for the eigenvectors corresponding to the smallest eigenvalues of

    a^T X (F - W) X^T a = Λ a^T E (D - V) E^T a

where Λ is the eigenvalue matrix, E = [e_1, …, e_C] is the matrix of class expectations,

    F_ii^c = Σ_{j=1}^{n_c} W_ij,   D_ii^c = Σ_{j=1}^{n_c} V_ij,
    F = diag(F^1, …, F^C),  W = diag(W^1, …, W^C),  D = diag(D^1, …, D^C),  V = diag(V^1, …, V^C).

Arranging the vector set a_1, …, a_M in ascending order of the corresponding eigenvalues then gives

    x_i → y_i = A^T x_i

where i = 1, …, N and A = [a_1, …, a_M].
Finally, the low-dimensional vector 208 is obtained, followed by the system output 209 after dimensionality reduction.
As shown in Fig. 3, the multi-speaker classification flow based on multi-distance sound sensors comprises the following:
The system input 301 is the speech content of the meeting participants in a multimedia conference scene; the multi-distance sound sensors 302 serve as the basic recording devices of the data; the feature-extraction initialization 303 comprises all the required speech-signal preprocessing techniques, such as pre-emphasis and endpoint detection. Subsequently, the spatial-domain feature 304 is extracted as the discriminative information of the speaker. In fact, the spatial-domain feature 304 collected by several sound sensors has a very high dimension, so the dimensionality-reduction classification algorithm 305 is needed to reduce the system complexity and carry out speaker classification; the dimensionality-reduction classification algorithm 305 adopts the manifold dimensionality-reduction method shown in Fig. 2. Of course, other manifold algorithms such as the LPP algorithm or the DLPP algorithm may also be used for dimensionality reduction; however, before the locality-preserving projection the LPP algorithm tends to apply PCA to the samples in order to improve computational efficiency, which may change the manifold distribution of the samples and cause loss of discriminative information, while the DLPP algorithm only uses the discriminative information in the principal-component space of the within-class scatter matrix and loses the large amount of discriminative information in its null space.
After the dimensionality-reduction classification algorithm 305 is completed, the classification decision 306 gives the classification result. The classification decision is usually made by several different classifiers that each provide a score; decision-level fusion produces the decision output with optimized robustness and the best classification performance, which is displayed as the classification result 307, while the global system output 308 comprises all speech segments and their corresponding classification information.
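The decision-level fusion of the classification decision 306 can be sketched as a weighted combination of normalised score matrices, one per classifier, followed by an arg-max per segment. The number of classifiers, the weights and the min-max normalisation are illustrative assumptions; the description does not prescribe a specific fusion rule.

```python
import numpy as np

def fuse_decisions(score_matrices, weights=None):
    """score_matrices: list of (n_segments, n_speakers) score arrays, one per classifier.
    Returns the fused speaker label for every audio segment."""
    if weights is None:
        weights = np.ones(len(score_matrices)) / len(score_matrices)
    fused = np.zeros_like(score_matrices[0], dtype=float)
    for w, scores in zip(weights, score_matrices):
        s_min, s_max = scores.min(), scores.max()
        fused += w * (scores - s_min) / (s_max - s_min + 1e-12)   # min-max normalisation
    return fused.argmax(axis=1)                                   # speaker id per segment
```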

Claims (10)

1. An audio indexing method based on multi-distance sound sensors, comprising an information acquisition step, a feature extraction step and a classification decision step, characterized in that:
    the information acquisition step is realized by the multi-distance sound sensors;
    in the feature extraction step, the multiple time delays between each individual source and the multi-distance sound sensors form a multi-delay acoustic feature based on the spatial domain, and this spatial-domain feature is extracted as the discriminative information of the speaker; the time difference of arrival (TDOA) is defined as the element of the spatial feature:

        TDOA = (||m_i - s|| - ||m_j - s||) / c

    where m_i and m_j denote the spatial positions of the i-th and j-th sound sensors respectively, s is the spatial position of the sound source, and c is the speed of sound; the TDOA values are estimated with the GCC-PHAT method, and the spatial acoustic feature obtained from the multi-distance sound sensors is

        T_k = [T̂_12, T̂_13, …, T̂_ij]^T

    where k denotes the k-th speaker, i denotes the i-th sensor and j the j-th sensor of the multi-distance sound-sensor system, and T̂ denotes a TDOA estimate;
    the discriminative structure of the spatial-domain feature is consistent on a statistical manifold, and this manifold is not a globally linear manifold;
    the classification decision step is realized by applying a vector classification method to the results of the information acquisition step and the feature extraction step.
2. The audio indexing method based on multi-distance sound sensors according to claim 1, characterized in that a spatio-temporal weighted fusion feature is extracted in the feature extraction step, that is, the spatial-domain feature is combined with conventional acoustic features as the discriminative information of the speaker.
3. The audio indexing method based on multi-distance sound sensors according to claim 2, characterized in that the TDOA vector and the MFCC feature vector are fused as the discriminative information of the speaker.
4. The audio indexing method based on multi-distance sound sensors according to claim 1, characterized in that, after the feature extraction step is completed and before the classification decision step is performed, dimensionality reduction is applied to the multi-delay acoustic feature, and speaker classification is carried out through the spatial distinctiveness of a single sound source.
5. The audio indexing method based on multi-distance sound sensors according to claim 4, characterized in that the dimensionality reduction is carried out by the following manifold dimensionality-reduction method:
    in the first step, the TDOA estimates are preprocessed according to

        T[n] = { T̂[n-1],  if T̂[n] < Thr
                 T̂[n],    if T̂[n] ≥ Thr }

    where n is the index of a frame, T[n] is the delay value assigned to that frame and T̂[n] is the delay estimated for that frame; when the delay estimated at a given instant is below the threshold Thr, the estimate of the previous instant is used as the delay estimate of the current instant;
    in the second step, the nearest-neighbour graph G is determined from the inter-node Euclidean distances;
    in the third step, the weights are calculated: if there is an edge between nodes i and j on the nearest-neighbour graph G, the weight is defined as

        S_ij = exp(-||T_i - T_j||^2 / α)

    where T denotes the TDOA estimate vector of each frame and α is a constant, with S_ij = S_ji; if there is no edge between nodes i and j on G, then S_ij = 0;
    in the fourth step, the feature mapping is determined; the objective function is

        J(a) = [ Σ_{c=1}^{C} Σ_{k=1}^{n_c} (y_k^c - e_c) W_k^c (y_k^c - e_c)^T ] / [ Σ_{i,j=1}^{C} (e_i - e_j) V_ij (e_i - e_j)^T ]

    where i ≠ j, C is the number of classes, n_c is the number of samples of class c, e_c is the expectation of the class-c samples, e_i and e_j are the expectations of the class-i and class-j samples respectively, and W_k^c and V_ij are respectively the within-class and between-class distance weights, computed as in the third step; minimizing this objective is equivalent to solving for the eigenvectors corresponding to the smallest eigenvalues of

        a^T X (F - W) X^T a = Λ a^T E (D - V) E^T a

    where Λ is the eigenvalue matrix, E = [e_1, …, e_C] is the matrix of class expectations,

        F_ii^c = Σ_{j=1}^{n_c} W_ij,   D_ii^c = Σ_{j=1}^{n_c} V_ij,
        F = diag(F^1, …, F^C),  W = diag(W^1, …, W^C),  D = diag(D^1, …, D^C),  V = diag(V^1, …, V^C);

    arranging the vector set a_1, …, a_M in ascending order of the corresponding eigenvalues then gives

        x_i → y_i = A^T x_i

    where i = 1, …, N and A = [a_1, …, a_M].
6. The audio indexing method based on multi-distance sound sensors according to claim 5, characterized in that, in the second step, the inter-node distance is defined by the Mahalanobis distance as follows:

        d_ij = (T_i - T_j) C^{-1} (T_i - T_j)^T

    where d_ij is the Mahalanobis distance, i and j are nodes with i ≠ j, T is the TDOA estimate vector of each frame, C is the covariance matrix of T_i and T_j, and the graph G searches for neighbouring points by the distance defined above.
7. The audio indexing method based on multi-distance sound sensors according to claim 5, characterized in that, after the classification algorithm is completed, several different classifiers each provide a score for the classification decision, and decision-level fusion produces a decision output with optimized robustness and the best classification performance.
8. The audio indexing method based on multi-distance sound sensors according to claim 7, characterized in that the classification decision after decision-level fusion is the classification result, and the output of the system comprises all speech segments and their corresponding classification information.
9. The audio indexing method based on multi-distance sound sensors according to claim 1, characterized in that, after the information acquisition step and before the feature extraction step, the speech signals are preprocessed, the preprocessing comprising pre-emphasis and endpoint detection.
10. The audio indexing method based on multi-distance sound sensors according to claim 1, characterized in that the multi-distance sound sensors comprise stand-alone sound sensors and sound sensors on portable devices.
CN 201110303580 2011-10-09 2011-10-09 Audio indexing method based on multi-distance sound sensor Active CN102509548B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201110303580 CN102509548B (en) 2011-10-09 2011-10-09 Audio indexing method based on multi-distance sound sensor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201110303580 CN102509548B (en) 2011-10-09 2011-10-09 Audio indexing method based on multi-distance sound sensor

Publications (2)

Publication Number Publication Date
CN102509548A true CN102509548A (en) 2012-06-20
CN102509548B CN102509548B (en) 2013-06-12

Family

ID=46221623

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201110303580 Active CN102509548B (en) 2011-10-09 2011-10-09 Audio indexing method based on multi-distance sound sensor

Country Status (1)

Country Link
CN (1) CN102509548B (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009205177A (en) * 2003-10-03 2009-09-10 Asahi Kasei Corp Data process unit, data processing unit control program and data processing method
CN102074236A (en) * 2010-11-29 2011-05-25 清华大学 Speaker clustering method for distributed microphone

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014082445A1 (en) * 2012-11-29 2014-06-05 华为技术有限公司 Method, device, and system for classifying audio conference minutes
US8838447B2 (en) 2012-11-29 2014-09-16 Huawei Technologies Co., Ltd. Method for classifying voice conference minutes, device, and system
CN103117815A (en) * 2012-12-28 2013-05-22 中国人民解放军信息工程大学 Time difference estimation method and device of multi-sensor signals
US10522151B2 (en) 2015-02-03 2019-12-31 Dolby Laboratories Licensing Corporation Conference segmentation based on conversational dynamics
CN106297794A (en) * 2015-05-22 2017-01-04 西安中兴新软件有限责任公司 Language and text conversion method and device
CN107851435A (en) * 2015-05-26 2018-03-27 纽昂斯通讯公司 Method and apparatus for reducing the delay in speech recognition application
CN106940997A (en) * 2017-03-20 2017-07-11 海信集团有限公司 Method and apparatus for sending a voice signal to a speech recognition system
CN106940997B (en) * 2017-03-20 2020-04-28 海信集团有限公司 Method and device for sending voice signal to voice recognition system

Also Published As

Publication number Publication date
CN102509548B (en) 2013-06-12

Similar Documents

Publication Publication Date Title
Cao et al. Polyphonic sound event detection and localization using a two-stage strategy
Torfi et al. 3d convolutional neural networks for cross audio-visual matching recognition
CN102509548B (en) Audio indexing method based on multi-distance sound sensor
Anguera et al. Speaker diarization: A review of recent research
Heittola et al. Audio context recognition using audio event histograms
Adavanne et al. Multichannel sound event detection using 3D convolutional neural networks for learning inter-channel features
Carletti et al. Audio surveillance using a bag of aural words classifier
CN106251874A Voice access control and quiet environment monitoring method and system
Chung et al. Who said that?: Audio-visual speaker diarisation of real-world meetings
Sun et al. Speaker diarization system for RT07 and RT09 meeting room audio
Ziaei et al. Prof-Life-Log: Personal interaction analysis for naturalistic audio streams
CN109935226A Far-field speech recognition enhancement system and method based on a deep neural network
Jati et al. Hierarchy-aware loss function on a tree structured label space for audio event detection
Waldekar et al. Classification of audio scenes with novel features in a fused system framework
Abrol et al. Learning hierarchy aware embedding from raw audio for acoustic scene classification
Freire-Obregón et al. Improving user verification in human-robot interaction from audio or image inputs through sample quality assessment
Mohmmad et al. Tree cutting sound detection using deep learning techniques based on Mel Spectrogram and MFCC features
Zhang et al. Few-shot bioacoustic event detection using prototypical network with background class
Jing et al. DCAR: A discriminative and compact audio representation for audio processing
Cabañas-Molero et al. Multimodal speaker diarization for meetings using volume-evaluated SRP-PHAT and video analysis
CN111179959A (en) Competitive speaker number estimation method and system based on speaker embedding space
CN116259313A (en) Sound event positioning and detecting method based on time domain convolution network
Friedland et al. Speaker recognition and diarization
Canton-Ferrer et al. Audiovisual event detection towards scene understanding
Vajaria et al. Exploring co-occurence between speech and body movement for audio-guided video localization

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant