CN102509548B - Audio indexing method based on multi-distance sound sensor - Google Patents

Audio indexing method based on multi-distance sound sensor

Info

Publication number
CN102509548B
CN102509548B (application CN201110303580A; publication CN102509548A)
Authority
CN
China
Prior art keywords
sound sensor
multi-distance
speaker
audio
space
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN 201110303580
Other languages
Chinese (zh)
Other versions
CN102509548A (en)
Inventor
杨毅
陈国顺
王胜开
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN201110303580A
Publication of CN102509548A
Application granted
Publication of CN102509548B
Legal status: Active (current)
Anticipated expiration

Landscapes

  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses an audio indexing method based on multi-distance sound sensors. In the method, a multi-distance sound sensor system serves as the recording device for the audio of a multimedia conference; a spatial multi-delay feature is extracted from the multi-distance sound sensors as the feature for distinguishing different speakers; and a new manifold-based algorithm reduces the dimension of the multi-delay feature and classifies the speakers by identity. The method reduces system complexity and computation cost; the system finally outputs each speaker's audio segments and identity as the audio index information; the optimal discriminant vector set obtained by the method achieves, in theory, optimal discrimination; and the method can be applied to multi-person, multi-party conversation scenes in complex acoustic environments.

Description

An audio indexing method based on multi-distance sound sensors
Technical field
The invention belongs to the field of audio technology and relates to audio indexing, specifically to an audio indexing method based on multi-distance sound sensors.
Background technology
Teleconferencing and video conferencing are increasingly part of business activity and daily life, and the corresponding recorded data grow geometrically; a segment of audio data in such scenes usually contains multiple sound sources. Audio indexing can process this kind of data and lighten the burden on post-processing steps such as speech recognition.
Audio indexing automatically extracts information from audio data in order to search for and discover target content. Speaker clustering is the key technique of audio indexing and comprises three parts: feature extraction, speech segmentation and classification decision. The mainstream algorithms are the Gaussian-mixture log-likelihood ratio and the support vector machine; the former uses generative training (such as maximum-likelihood or MAP estimation) to build the speaker model, while the latter uses discriminative training (such as GLDS-SVM and bag-of-N-grams). GMM-SVM (Gaussian mixture model with support vector machine) is a mainstream modelling and classification approach in which a probability density model is built with a GMM and densities are compared through the upper bound of the Kullback-Leibler divergence. GMM-SVM performs well, but problems remain: when estimating the probability density the GMM has many parameters while training data are limited, and GMM-SVM, developed mainly for speaker recognition, has not become a general-purpose technique.
Speaker diarization first entered the Rich Transcription Evaluation of NIST (the National Institute of Standards and Technology) in 2005. The purpose of automatic speaker segmentation and labelling is to divide audio data into segments and classify them by speaker. The conditions of the 2009 Rich Transcription evaluation were: the number of speakers, the microphone positions and the room acoustic environment are all unknown; that is, the identities of multiple speakers must be judged and the audio data classified by speaker identity in a scene lacking both temporal and spatial prior information. The SPKR task is an important subtask of the diarization evaluation; it mainly studies the problem of "who spoke when", aiming to divide the audio data into segments and classify them by speaker. Speaker clustering can be applied to speech recognition, audio information management and retrieval, and helps realize speaker tracking in the audio streams of meetings, voice mail, lectures and news broadcasts, thereby enabling structured analysis, understanding and management of audio data.
A multi-distance sound sensor system consists of multiple sensors; the structure of the sensor system is unrestricted and each sensor may be controlled by a different device, so the collected signals are asynchronous. Multi-distance sound sensor systems have the advantages of simple structure, ease of use and low cost, and can be widely applied to sound source localization, audio indexing and recognition. Owing to the particular structure of multi-distance sound sensors, the multi-delay feature can be used to classify sound sources that do not overlap in space. However, as the number of sound sensors grows, the dimension of the multi-delay feature vector increases rapidly.
Recent literature points out that speech signals have an internal low-dimensional manifold structure. Riemann first proposed the notion of a manifold in 1854; Locality Preserving Projections (LPP), introduced into pattern recognition in 2005, has attracted wide attention. LPP is an unsupervised learning method that does not consider the class information of the samples during learning. On the basis of LPP, Yu et al. combined the Fisher criterion and proposed the Discriminant Locality Preserving Projections (DLPP) algorithm, which has been successfully applied to face recognition. The shortcomings of LPP-based algorithms are that dimension reduction can distort the manifold of the data and cause loss of discriminant information, and the small-sample-size problem. For the small-sample-size problem, Yang et al. proposed the Null-space Discriminant Locality Preserving Projections (NDLPP) algorithm, but this method uses only the discriminant information of the null space and ignores the discriminant information in the principal component space.
Summary of the invention
To overcome the above deficiencies of the prior art, the object of the present invention is to provide an audio indexing method based on multi-distance sound sensors, which uses the multi-delay feature to classify sound sources that do not overlap in space and applies manifold-based dimension reduction to the high-dimensional multi-delay feature vector; the optimal discriminant vector set obtained by this algorithm achieves optimal discrimination in theory, and the method can be applied to multi-person, multi-party conversation scenes in complex acoustic environments.
To achieve this object, the technical solution adopted by the present invention is:
An audio indexing method based on multi-distance sound sensors, comprising an information acquisition step, a feature extraction step and a classification decision step:
The information acquisition step is realized by the multi-distance sound sensors;
The feature extraction step forms a spatial-domain multi-delay acoustic feature from the multiple time delays between each individual sound source and the multi-distance sound sensors, and extracts this spatial-domain feature as the speaker's discriminant information; the time difference of arrival (TDOA), defined below, is the element of the spatial feature:

$$\mathrm{TDOA} = \frac{\lVert m_i - s \rVert - \lVert m_j - s \rVert}{c}$$

where $m_i$ and $m_j$ denote the spatial positions of the i-th and j-th sound sensors, s is the spatial position of the sound source, and c is the speed of sound; the TDOA value is estimated with the GCC-PHAT method, and the spatial acoustic feature obtained from the multi-distance sound sensors is

$$T_k = \begin{bmatrix} \hat{T}_{12} & \hat{T}_{13} & \cdots & \hat{T}_{ij} \end{bmatrix}^T$$

where k denotes the k-th speaker, i the i-th sensor and j the j-th sensor of the multi-distance sound sensor system, and $\hat{T}$ denotes a TDOA estimate.
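As a concrete illustration of the GCC-PHAT estimation named here, a minimal sketch follows (not part of the claimed method); the function name, zero-padding length and the small regularization constant are assumptions.

```python
import numpy as np

def gcc_phat_tdoa(x_i, x_j, fs, max_tau=None):
    """Estimate the TDOA (seconds) of channel x_j relative to channel x_i via GCC-PHAT."""
    n = len(x_i) + len(x_j)
    X_i = np.fft.rfft(x_i, n=n)
    X_j = np.fft.rfft(x_j, n=n)
    # Cross-power spectrum with PHAT (phase transform) weighting.
    R = X_i * np.conj(X_j)
    R /= np.abs(R) + 1e-12
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # Re-centre the cross-correlation around zero lag and pick the peak.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(fs)
```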
The discriminative structure of the spatial-domain feature is consistent with a statistical manifold, and this manifold is not a globally linear manifold;
The classification decision step applies a vector classification method to the results of the information acquisition step and the feature extraction step.
Further, a space-time weighted fusion feature is extracted in the feature extraction step; that is, the spatial-domain feature is combined with conventional human acoustic features as the speaker's discriminant information, for example by fusing the TDOA vector with the MFCC feature vector.
After the feature extraction step is completed and before the classification decision step is carried out, dimension reduction is applied to the multi-delay acoustic feature, and speaker clustering is performed using the spatial distinctiveness of the individual sound sources; the dimension reduction uses the following manifold method:
In the first step, the TDOA estimates are pre-processed according to

$$T[n] = \begin{cases} \hat{T}[n-1], & \hat{T}[n] < Thr \\ \hat{T}[n], & \hat{T}[n] \ge Thr \end{cases}$$

where n is the frame index, T[n] is the delay value assigned to that frame and $\hat{T}[n]$ its raw estimate; when the delay estimate of a frame falls below the threshold Thr, the estimate of the previous frame is adopted as the estimate for the current frame.
In the second step, the nearest-neighbour graph G is determined from the distances between nodes.
In the third step, the weights are determined: if there is an edge between nodes i and j of the nearest-neighbour graph G, the weight is defined as

$$S_{ij} = e^{-\lVert T_i - T_j \rVert^2 / \alpha}$$

where T denotes the TDOA estimate vector of each frame, α is a constant and $S_{ij} = S_{ji}$; if there is no edge between nodes i and j of G, then $S_{ij} = 0$.
In the fourth step, the feature mapping is determined with the following objective function:

$$J(a) = \frac{\sum_{c=1}^{C}\sum_{k=1}^{n_c} (y_k^c - e_c)\, W_k^c\, (y_k^c - e_c)^T}{\sum_{i,j=1}^{C} (e_i - e_j)\, V_{ij}\, (e_i - e_j)^T}$$

where C is the number of classes, $n_c$ is the number of samples of class c, $e_c$ is the mean of the class-c samples, $e_i$ and $e_j$ are the means of the class-i and class-j samples, and $W_k^c$ and $V_{ij}$ are the within-class and between-class distance weights respectively, computed as in the third step. Minimizing this expression is equivalent to solving for the eigenvectors corresponding to the smallest eigenvalues of

$$a^T X (F - W) X^T a = \Lambda\, a^T E (D - V) E^T a$$

where Λ is the eigenvalue matrix, $E = [e_1, \ldots, e_C]$ is the matrix of class means, and

$$D_{ii}^{c} = \sum_{j=1}^{n_c} V_{ij}, \quad F = \mathrm{diag}(F_1, \ldots, F_C), \quad W = \mathrm{diag}(W_1, \ldots, W_C), \quad D = \mathrm{diag}(D_1, \ldots, D_C), \quad V = \mathrm{diag}(V_1, \ldots, V_C).$$

Arranging the vectors $a_1, \ldots, a_M$ in ascending order of their corresponding eigenvalues gives

$$x_i \rightarrow y_i = A^T x_i$$

where $i = 1, \ldots, N$ and $A = [a_1, \ldots, a_M]$.
In the second step, the distance between nodes may also be defined by the Mahalanobis distance:

$$d_{ij} = (T_i - T_j)\, C^{-1} (T_i - T_j)^T$$

where $d_{ij}$ is the Mahalanobis distance, i and j are nodes, T is the TDOA estimate vector of each frame and C is the covariance matrix of $T_i$ and $T_j$; the graph G then searches for neighbouring points using the distance defined above.
The classification algorithm operates on the output of the above manifold dimension-reduction method. After the classification algorithm is completed, the classification decision is made by several different classifiers that each provide their own score, and decision-level fusion produces the decision output with optimized robustness and the best classification performance; the classification decision after decision-level fusion is the classification result, and the output of the system comprises the complete audio segments and their corresponding classification information. After the information acquisition step and before the feature extraction step, the speech signals are pre-processed; the pre-processing includes pre-emphasis and endpoint detection.
In the present invention, the multi-distance sound sensors include stand-alone sound sensors and sound sensors on portable devices.
Compared with the prior art, the advantages of the present invention are:
The purpose of automatic speaker segmentation and labelling is to divide audio data into segments and classify them by speaker, judging the identities of multiple speakers and classifying the audio data by speaker identity under conditions that usually lack both temporal and spatial prior information. A multi-distance microphone system satisfies the requirements of automatic speaker segmentation and labelling in complex dialogue scenes with multiple sound sources in multiple directions.
A multi-distance microphone system is a speech signal input system composed of multiple single microphones, where each microphone may be controlled by a different device and the topology formed by the microphones is unrestricted. Compared with a microphone array, multi-distance microphones have advantages such as low cost and flexible placement. In a speaker diarization conference scene with multiple participants, owing to the particular topology of the multi-distance microphones, the multiple time delays between each individual sound source and the microphones can form a spatial-domain multi-delay acoustic feature used to distinguish the identities of speakers in non-overlapping spatial directions. Fusion with conventional acoustic features further improves the performance of the multi-distance microphone speaker clustering system. As the number of microphones increases, however, the dimension of the multi-delay feature vector grows rapidly.
Speech signals have an internal low-dimensional manifold structure. Locality Preserving Projections (LPP) and Null-space Discriminant Locality Preserving Projections (NDLPP) are two recently proposed manifold dimension-reduction algorithms; the former is an unsupervised machine learning method that does not take between-class information into account, while the latter combines LPP with the Fisher criterion but uses only the discriminant information of the null space and ignores the discriminant information in the principal component space.
The present invention proposes a nonlinear discriminant locality preserving projection algorithm based on the multi-delay feature, which processes the spatial multi-delay acoustic feature and performs speaker clustering using the spatial distinctiveness of individual sound sources; the optimal discriminant vector set obtained by this algorithm achieves optimal discrimination in theory. The adopted method avoids the prior-art problem of being unable to discriminate between classes, uses the discriminant information of both the null space and the principal space, and improves the efficiency of between-class discrimination.
Description of drawings
Fig. 1 is an implementation diagram of the multi-distance sound sensor system of the present invention, comprising several target sound sources and multi-distance sound sensor devices.
Fig. 2 is a flowchart of the manifold dimension-reduction method of the present invention.
Fig. 3 is a flowchart of multi-speaker classification based on multi-distance sound sensors according to the present invention.
Embodiment
The present invention is described in further detail below with reference to the drawings and embodiments.
The input devices for SPKR include head-mounted microphones, single microphones, microphone arrays and multi-distance microphones (Multiple Distance Microphones). Multi-distance sound sensors satisfy the requirements of complex dialogue scenes with multiple sound sources in multiple directions and can be applied to sound source localization, speaker clustering, recognition and the like. Owing to the particular topology of multi-distance sound sensors, the multi-delay feature can be used to classify sound sources that do not overlap in space.
As shown in Fig. 1, a multi-distance sound sensor system comprises several sound sensors, represented in Fig. 1 by four sound sensors 111-114 placed at arbitrary positions on the same platform. Similarly, only three acoustic targets 101-103 are shown; these acoustic targets may be located anywhere in the same room. The positions of all sound sensors and acoustic targets remain fixed during the audio recording.
The sound sensors 111-114 include stand-alone sound sensors such as microphones and sound sensors on portable devices such as notebook computers or PDAs; microphones are chosen here. A multi-distance microphone system is a speech signal input system composed of multiple single microphones, where each microphone may be controlled by a different device and the topology formed by the microphones is unrestricted. Compared with a microphone array, multi-distance microphones have advantages such as low cost and flexible placement. In a speaker diarization conference scene with multiple participants, owing to the particular topology of the multi-distance microphones, the multiple time delays between each individual sound source and the microphones can form a spatial-domain multi-delay acoustic feature used to distinguish the identities of speakers in non-overlapping spatial directions. The spatial multi-delay acoustic feature is defined as

$$\begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_N \end{bmatrix} = \begin{bmatrix} \hat{\tau}_{12}^{1} & \hat{\tau}_{12}^{2} & \cdots & \hat{\tau}_{12}^{N} \\ \hat{\tau}_{13}^{1} & \hat{\tau}_{13}^{2} & \cdots & \hat{\tau}_{13}^{N} \\ \vdots & \vdots & & \vdots \\ \hat{\tau}_{ij}^{1} & \hat{\tau}_{ij}^{2} & \cdots & \hat{\tau}_{ij}^{N} \end{bmatrix}^T$$

where $X_k$ is the delay vector corresponding to the k-th sound source and $\hat{\tau}_{ij}^{k}$ is the delay estimate of the k-th sound source between the i-th and j-th microphones.
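By way of illustration (not part of the claimed method), a minimal sketch of assembling this per-source or per-frame delay vector from all microphone pairs might look as follows; the helper tdoa_fn (for example the gcc_phat_tdoa sketch above) and the framing are assumptions.

```python
from itertools import combinations
import numpy as np

def multi_delay_feature(channels, fs, tdoa_fn):
    """channels: list of per-microphone signal frames; tdoa_fn(x_i, x_j, fs) returns a delay in seconds."""
    pairs = combinations(range(len(channels)), 2)
    tdoas = [tdoa_fn(channels[i], channels[j], fs) for i, j in pairs]
    # With M microphones the vector has M*(M-1)/2 entries, hence the rapid
    # growth of the feature dimension noted in the description.
    return np.asarray(tdoas)
```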
After the speech content of the conference participants in a multimedia conference scene has been recorded with the multi-distance microphone devices, all the information is pre-processed, including the various speech signal pre-processing steps such as pre-emphasis and endpoint detection.
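A minimal sketch of the pre-processing mentioned here, under the assumption of a first-order pre-emphasis filter and a simple energy-threshold endpoint check (the coefficient and threshold are illustrative, not specified by the patent):

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """First-order pre-emphasis: y[n] = x[n] - alpha * x[n-1]."""
    x = np.asarray(x, dtype=float)
    return np.append(x[0], x[1:] - alpha * x[:-1])

def is_speech_frame(frame, energy_thr=1e-4):
    """Crude endpoint detection: a frame counts as speech if its mean energy exceeds a threshold."""
    frame = np.asarray(frame, dtype=float)
    return float(np.mean(frame ** 2)) > energy_thr
```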
Then this spatial-domain feature is extracted as the speaker's discriminant information; the time difference of arrival (TDOA), defined below, is the element of the spatial feature:

$$\mathrm{TDOA} = \frac{\lVert m_i - s \rVert - \lVert m_j - s \rVert}{c}$$

where $m_i$ and $m_j$ denote the spatial positions of the i-th and j-th sound sensors, s is the spatial position of the sound source, and c is the speed of sound; the TDOA value is estimated with the GCC-PHAT method, and the spatial acoustic feature obtained from the multi-distance sound sensors is

$$T_k = \begin{bmatrix} \hat{T}_{12} & \hat{T}_{13} & \cdots & \hat{T}_{ij} \end{bmatrix}^T$$

where k denotes the k-th speaker, i the i-th microphone and j the j-th microphone of the multi-distance microphone system, and $\hat{T}$ denotes a TDOA estimate.
There is in addition another kind of feature, the space-time weighted fusion feature, which combines the spatial sound-source feature with conventional human acoustic features; for example, the TDOA vector is fused with the MFCC feature vector as the main feature for speaker clustering. This fusion feature provides more discriminant information and can further improve classification accuracy.
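As an illustration of this fusion, a minimal sketch that concatenates the two vectors is shown below; the patent only states that the TDOA and MFCC vectors are fused, so the weighting parameters here are assumptions.

```python
import numpy as np

def fused_feature(tdoa_vec, mfcc_vec, w_space=1.0, w_time=1.0):
    """Concatenate the spatial (TDOA) and temporal (MFCC) features for one frame."""
    return np.concatenate((w_space * np.asarray(tdoa_vec, dtype=float),
                           w_time * np.asarray(mfcc_vec, dtype=float)))
```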
The discriminative structure of the spatial-domain feature is consistent with a statistical manifold, and this manifold is not a globally linear manifold.
The advantage of the spatial multi-delay acoustic feature is that, compared with conventional acoustic features, it is clearly discriminative. However, as the number of microphones grows, the dimension of the acoustic feature increases quadratically, so a reasonable scheme is needed that reduces the amount of computation while preserving the internal structural relations of the data and further improves the speaker clustering result.
Therefore, after the above extraction step is completed, the manifold dimension-reduction algorithm based on the multi-delay feature is applied, mainly to reduce the dimension of the spatial multi-delay acoustic feature, and speaker clustering is performed using the spatial distinctiveness of the individual sound sources.
As shown in Fig. 2, the system input 201 is the speech content of the conference participants in a multimedia conference scene; the multi-distance sound sensors 202 serve as the essential recording devices for the data; from the multi-distance sound sensors a high-dimensional vector 203, whose elements are the spatial features, is obtained; the assumption 205 is that the discriminative structure of the spatial acoustic feature is consistent with a statistical manifold and that this manifold is not a globally linear manifold.
The dimension reduction is carried out as follows:
In the first step, the TDOA estimates are pre-processed according to

$$T[n] = \begin{cases} \hat{T}[n-1], & \hat{T}[n] < Thr \\ \hat{T}[n], & \hat{T}[n] \ge Thr \end{cases}$$

where n is the frame index, T[n] is the delay value assigned to that frame and $\hat{T}[n]$ its raw estimate; when the delay estimate of a frame falls below the threshold Thr, the estimate of the previous frame is adopted as the estimate for the current frame.
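A minimal sketch of this first step, following the formula above; the function and variable names are illustrative.

```python
import numpy as np

def smooth_tdoa(raw, thr):
    """raw: 1-D array of per-frame TDOA estimates; returns a smoothed copy."""
    raw = np.asarray(raw, dtype=float)
    out = raw.copy()
    for n in range(1, len(raw)):
        if raw[n] < thr:
            out[n] = raw[n - 1]   # T[n] = T_hat[n-1] when T_hat[n] < Thr
    return out
```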
In the second step, the nearest-neighbour graph 204 is determined, using the Euclidean distance; alternatively, when the components of the feature vector are partially correlated, the nearest-neighbour graph is built with the Mahalanobis distance, defined as

$$d_{ij} = (x_i - x_j)\, C^{-1} (x_i - x_j)^T$$

where $d_{ij}$ is the Mahalanobis distance and C is the covariance matrix of the samples; the graph G searches for neighbouring points using the distance defined above.
In the third step, the weights 206 are determined: if there is an edge between nodes i and j of the nearest-neighbour graph G, the weight is defined as

$$S_{ij} = e^{-\lVert T_i - T_j \rVert^2 / \alpha}$$

where T denotes the TDOA estimate vector of each frame, α is a constant and $S_{ij} = S_{ji}$; if there is no edge between nodes i and j of G, then $S_{ij} = 0$.
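A minimal sketch of steps two and three under the Euclidean-distance variant (the Mahalanobis variant is analogous); the neighbour count k and the constant alpha are illustrative assumptions.

```python
import numpy as np

def neighbour_weights(T, k=5, alpha=1.0):
    """T: array of shape (n_frames, n_pairs) of per-frame TDOA feature vectors."""
    n = T.shape[0]
    d2 = np.sum((T[:, None, :] - T[None, :, :]) ** 2, axis=-1)  # squared Euclidean distances
    S = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(d2[i])[1:k + 1]          # k nearest neighbours of frame i (excluding itself)
        S[i, nn] = np.exp(-d2[i, nn] / alpha)    # heat-kernel weight S_ij
    return np.maximum(S, S.T)                    # symmetrise so that S_ij = S_ji
```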
In the fourth step, the feature mapping 207 is determined with the following objective function:

$$J(a) = \frac{\sum_{c=1}^{C}\sum_{k=1}^{n_c} (y_k^c - e_c)\, W_k^c\, (y_k^c - e_c)^T}{\sum_{i,j=1}^{C} (e_i - e_j)\, V_{ij}\, (e_i - e_j)^T}$$

where C is the number of classes, $n_c$ is the number of samples of class c, $e_c$ is the mean of the class-c samples, $e_i$ and $e_j$ are the means of the class-i and class-j samples, and $W_k^c$ and $V_{ij}$ are the within-class and between-class distance weights respectively, computed as in the third step. Minimizing this expression is equivalent to solving for the eigenvectors corresponding to the smallest eigenvalues of

$$a^T X (F - W) X^T a = \Lambda\, a^T E (D - V) E^T a$$

where Λ is the eigenvalue matrix, $E = [e_1, \ldots, e_C]$ is the matrix of class means, and

$$D_{ii}^{c} = \sum_{j=1}^{n_c} V_{ij}, \quad F = \mathrm{diag}(F_1, \ldots, F_C), \quad W = \mathrm{diag}(W_1, \ldots, W_C), \quad D = \mathrm{diag}(D_1, \ldots, D_C), \quad V = \mathrm{diag}(V_1, \ldots, V_C).$$

Arranging the vectors $a_1, \ldots, a_M$ in ascending order of their corresponding eigenvalues gives

$$x_i \rightarrow y_i = A^T x_i$$

where $i = 1, \ldots, N$ and $A = [a_1, \ldots, a_M]$.
Finally, the low-dimensional vector 208 is obtained, and then the system output 209 after dimension reduction.
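As an illustration of the fourth step and the final projection, a minimal sketch that solves the generalized eigenvalue problem above with SciPy and keeps the eigenvectors of the smallest eigenvalues; the matrix shapes and the small ridge term added for numerical stability are assumptions.

```python
import numpy as np
from scipy.linalg import eigh

def manifold_projection(X, F, W, E, D, V, out_dim):
    """X: (d, N) frame features; E: (d, C) class means; F, W: (N, N); D, V: (C, C)."""
    left = X @ (F - W) @ X.T                                  # within-class term, d x d
    right = E @ (D - V) @ E.T + 1e-8 * np.eye(E.shape[0])     # between-class term, ridge for stability
    eigvals, eigvecs = eigh(left, right)                      # generalized problem, ascending eigenvalues
    A = eigvecs[:, :out_dim]                                  # eigenvectors of the smallest eigenvalues
    return A.T @ X                                            # y_i = A^T x_i for every frame
```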
As shown in Fig. 3, the multi-speaker classification flow based on multi-distance sound sensors comprises the following:
The system input 301 is the speech content of the conference participants in a multimedia conference scene, and the multi-distance sound sensors 302 serve as the essential recording devices for the data. The feature-extraction initialization 303 includes all the required speech-signal pre-processing, such as pre-emphasis and endpoint detection. The spatial-domain feature 304 is then extracted as the speaker's discriminant information. In practice, the spatial-domain feature 304 collected by multiple sound sensors has a very high dimension, so the dimension-reduction and classification algorithm 305 is needed to lower the system complexity and perform speaker clustering. The dimension-reduction and classification algorithm 305 adopts the manifold dimension-reduction method shown in Fig. 2; other manifold algorithms such as LPP or DLPP may of course also be used, but LPP, in order to improve computational efficiency, usually applies PCA to the samples before the locality preserving projection, which may alter the manifold structure of the samples and cause loss of discriminant information, while DLPP uses only the discriminant information of the principal component space of the within-class scatter matrix and loses the large amount of discriminant information in its null space.
After the dimension-reduction and classification algorithm 305 is completed, the classification decision 306 gives the classification result. The classification decision is usually made by several different classifiers that each provide their own score; decision-level fusion produces the decision output with optimized robustness and the best classification performance, which is shown in the classification result 307, while the global system output 308 comprises the complete audio segments and their corresponding classification information.
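A minimal sketch of such decision-level fusion, assuming each classifier outputs a per-class score vector and that a weighted average is used (the weighting scheme is not specified by the patent):

```python
import numpy as np

def fuse_decisions(score_list, weights=None):
    """score_list: one (n_classes,) score array per classifier."""
    scores = np.asarray(score_list, dtype=float)
    if weights is None:
        weights = np.ones(len(scores)) / len(scores)   # equal weights by default
    fused = np.average(scores, axis=0, weights=weights)
    return int(np.argmax(fused)), fused                # predicted class and fused scores
```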

Claims (6)

1. An audio indexing method based on multi-distance sound sensors, comprising an information acquisition step, a feature extraction step and a classification decision step, characterized in that:
the information acquisition step is realized by the multi-distance sound sensors;
the feature extraction step forms a spatial-domain multi-delay acoustic feature from the multiple time delays between each individual sound source and the multi-distance sound sensors, and extracts this spatial-domain feature as the speaker's discriminant information; the time difference of arrival (TDOA), defined below, is the element of the spatial feature:

$$\mathrm{TDOA} = \frac{\lVert m_i - s \rVert - \lVert m_j - s \rVert}{c}$$

where $m_i$ and $m_j$ denote the spatial positions of the i-th and j-th sound sensors, s is the spatial position of the sound source, and c is the speed of sound; the TDOA value is estimated with the GCC-PHAT method, and the spatial acoustic feature obtained from the multi-distance sound sensors is

$$T_k = \begin{bmatrix} \hat{T}_{12} & \hat{T}_{13} & \cdots & \hat{T}_{ij} \end{bmatrix}^T$$

where k denotes the k-th speaker, i the i-th sensor and j the j-th sensor of the multi-distance sound sensor system, the ellipsis denotes the entries from $\hat{T}_{12}$ up to $\hat{T}_{ij}$, and $\hat{T}$ denotes a TDOA estimate;
the discriminative structure of the spatial-domain feature is consistent with a statistical manifold, and this manifold is not a globally linear manifold;
the classification decision step applies a vector classification method to the results of the information acquisition step and the feature extraction step.
2. The audio indexing method based on multi-distance sound sensors according to claim 1, characterized in that a space-time weighted fusion feature is extracted in the feature extraction step, that is, the spatial-domain feature is combined with conventional human acoustic features as the speaker's discriminant information.
3. The audio indexing method based on multi-distance sound sensors according to claim 2, characterized in that the TDOA vector is fused with the MFCC feature vector as the speaker's discriminant information.
4. The audio indexing method based on multi-distance sound sensors according to claim 1, characterized in that, after the feature extraction step is completed and before the classification decision step is carried out, dimension reduction is applied to the multi-delay acoustic feature and speaker clustering is performed using the spatial distinctiveness of the individual sound sources.
5. The audio indexing method based on multi-distance sound sensors according to claim 1, characterized in that, after the information acquisition step and before the feature extraction step, the speech signals are pre-processed, the pre-processing including pre-emphasis and endpoint detection.
6. The audio indexing method based on multi-distance sound sensors according to claim 1, characterized in that the multi-distance sound sensors include stand-alone sound sensors and sound sensors on portable devices.
CN 201110303580 2011-10-09 2011-10-09 Audio indexing method based on multi-distance sound sensor Active CN102509548B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201110303580 CN102509548B (en) 2011-10-09 2011-10-09 Audio indexing method based on multi-distance sound sensor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201110303580 CN102509548B (en) 2011-10-09 2011-10-09 Audio indexing method based on multi-distance sound sensor

Publications (2)

Publication Number Publication Date
CN102509548A CN102509548A (en) 2012-06-20
CN102509548B true CN102509548B (en) 2013-06-12

Family

ID=46221623

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201110303580 Active CN102509548B (en) 2011-10-09 2011-10-09 Audio indexing method based on multi-distance sound sensor

Country Status (1)

Country Link
CN (1) CN102509548B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102968991B (en) 2012-11-29 2015-01-21 华为技术有限公司 Method, device and system for sorting voice conference minutes
CN103117815B (en) * 2012-12-28 2014-11-19 中国人民解放军信息工程大学 Time difference estimation method and device of multi-sensor signals
EP3254453B1 (en) 2015-02-03 2019-05-08 Dolby Laboratories Licensing Corporation Conference segmentation based on conversational dynamics
CN106297794A (en) * 2015-05-22 2017-01-04 西安中兴新软件有限责任公司 The conversion method of a kind of language and characters and equipment
US9666192B2 (en) * 2015-05-26 2017-05-30 Nuance Communications, Inc. Methods and apparatus for reducing latency in speech recognition applications
CN106940997B (en) * 2017-03-20 2020-04-28 海信集团有限公司 Method and device for sending voice signal to voice recognition system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009205177A (en) * 2003-10-03 2009-09-10 Asahi Kasei Corp Data process unit, data processing unit control program and data processing method
CN102074236A (en) * 2010-11-29 2011-05-25 清华大学 Speaker clustering method for distributed microphone

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009205177A (en) * 2003-10-03 2009-09-10 Asahi Kasei Corp Data process unit, data processing unit control program and data processing method
CN102074236A (en) * 2010-11-29 2011-05-25 清华大学 Speaker clustering method for distributed microphone

Also Published As

Publication number Publication date
CN102509548A (en) 2012-06-20

Similar Documents

Publication Publication Date Title
Cao et al. Polyphonic sound event detection and localization using a two-stage strategy
CN106251874B (en) A kind of voice gate inhibition and quiet environment monitoring method and system
CN102509548B (en) Audio indexing method based on multi-distance sound sensor
Heittola et al. Context-dependent sound event detection
Carletti et al. Audio surveillance using a bag of aural words classifier
Imoto Introduction to acoustic event and scene analysis
Adavanne et al. Multichannel sound event detection using 3D convolutional neural networks for learning inter-channel features
CN104019885A (en) Sound field analysis system
CN102890930A (en) Speech emotion recognizing method based on hidden Markov model (HMM) / self-organizing feature map neural network (SOFMNN) hybrid model
CN103985381A (en) Voice frequency indexing method based on parameter fusion optimized decision
CN109935226A (en) A kind of far field speech recognition enhancing system and method based on deep neural network
CN108877809A (en) A kind of speaker&#39;s audio recognition method and device
CN108182418A (en) A kind of thump recognition methods based on multidimensional acoustic characteristic
Wang et al. Exploring audio semantic concepts for event-based video retrieval
Waldekar et al. Classification of audio scenes with novel features in a fused system framework
Wang et al. Speaker counting model based on transfer learning from SincNet bottleneck layer
Xia et al. Frame-wise dynamic threshold based polyphonic acoustic event detection
Mohmmad et al. Tree cutting sound detection using deep learning techniques based on Mel Spectrogram and MFCC features
CN111179959B (en) Competitive speaker number estimation method and system based on speaker embedding space
CN116259313A (en) Sound event positioning and detecting method based on time domain convolution network
Espi et al. Spectrogram patch based acoustic event detection and classification in speech overlapping conditions
Lu et al. Context-based environmental audio event recognition for scene understanding
Espi et al. Acoustic event detection in speech overlapping scenarios based on high-resolution spectral input and deep learning
Canton-Ferrer et al. Audiovisual event detection towards scene understanding
Khan et al. Wearable sensor-based location-specific occupancy detection in smart environments

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant