CN109545229B - Speaker recognition method based on voice sample characteristic space track - Google Patents

Speaker recognition method based on voice sample characteristic space track Download PDF

Info

Publication number
CN109545229B
CN109545229B CN201910027145.3A CN201910027145A CN109545229B CN 109545229 B CN109545229 B CN 109545229B CN 201910027145 A CN201910027145 A CN 201910027145A CN 109545229 B CN109545229 B CN 109545229B
Authority
CN
China
Prior art keywords
voice
speaker
sample
feature space
space
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910027145.3A
Other languages
Chinese (zh)
Other versions
CN109545229A (en
Inventor
贺前华
吴克乾
谢伟
庞文丰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201910027145.3A priority Critical patent/CN109545229B/en
Publication of CN109545229A publication Critical patent/CN109545229A/en
Priority to PCT/CN2019/111530 priority patent/WO2020143263A1/en
Priority to SG11202103091XA priority patent/SG11202103091XA/en
Application granted granted Critical
Publication of CN109545229B publication Critical patent/CN109545229B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/04Training, enrolment or model building
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/02Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/06Decision making techniques; Pattern matching strategies
    • G10L17/08Use of distortion metrics or a particular distance between probe pattern and reference templates

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a speaker recognition method based on a voice sample feature space track, which comprises the steps of clustering unlabeled voice data features to obtain voice feature space representation: identifying a subset; registering a speaker by using the labeled voice sample to obtain distribution information and motion trail information of the speaker in a voice feature space; and recognizing the voice sample to be recognized by utilizing the space distribution information of the voice characteristics of the speaker and the motion trail information of the voice sample. The invention adopts the thought of speaker voice characteristic space positioning, the speaker recognition calculation complexity is low, and the problem of high GMM-UBM calculation complexity is solved; and the voice characteristic space of the speaker in one language can be used as the voice characteristic space of the speaker in the other language for recognition, thereby realizing the sharing of data.

Description

Speaker recognition method based on voice sample characteristic space track
Technical Field
The invention relates to the field of biological feature recognition, in particular to a speaker recognition method based on a voice sample feature space track.
Background
With the development of artificial intelligence technology, audio perception has become a hotspot in audio processing technology research, where audio classification or audio recognition is a core problem of audio perception, and in engineering applications, audio classification is represented by speaker recognition, audio event detection, and the like. The speaker recognition technology is an identity verification technology, namely a biological characteristic recognition technology. The biological characteristic recognition technology is a technology for automatically recognizing the identity of an individual by utilizing biological characteristics, and comprises fingerprint recognition, iris recognition, gene recognition, face recognition and the like. Compared with other identity verification technologies, speaker recognition is more convenient and natural, and has lower user invasiveness. The speaker identification utilizes the voice signal to carry out the identification, and has the advantages of natural man-machine interaction, easy extraction of the voice signal, realization of remote identification and the like.
Existing speaker recognition systems include two phases: a training phase and an identification phase. In the training phase, the system uses the collected speaker voices to build a model for the speaker; in the recognition phase, the system matches the input speech to the speaker model to make decisions. The speaker recognition system needs to extract features reflecting the speaker's personality from the speech signal and build an accurate model to distinguish the speaker from other speakers. The current commonly used audio classification technology mainly comprises two main types, one is to generate a statistical model, such as a Gaussian mixture model GMM and a hidden Markov model HMM, and the other is a method based on a deep neural network, such as DNN, RNN or LSTM. In either technique, a large number of labeled training samples are required, and in order to achieve better recognition performance, the deep neural network method has higher requirements on the sample size. The GMM or HMM based method does not take into account in particular the distinguishing information between different audio classes, nor the sharing of sample data of different classes, such as: the method mentioned in paper Speaker Verification Using Adapted Gaussian Mixture Models (Digital Signal Processing (2000), 19-41) by MIT Reynold et al has a high computational complexity; the deep neural network method shows good performance under the support of large samples, such as the paper End-to-End Text-Dependent Speaker Verification (2016IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP), 2016, pages: 5115-5119) of Google corporation, which uses a neural network to extract features from voice and train, but the training of the neural network requires a large amount of labeled voice, and the acquisition cost of a large amount of samples is very high, and the deep neural network method lacks interpretation and is quite a black box.
The existing speaker recognition technology is high in computational complexity, a large amount of marked speaker voice data is needed to train the model, and a large amount of marked voice data is collected, so that huge workload is needed. It is therefore desirable to find a speaker recognition method and system that is more convenient and efficient.
Disclosure of Invention
The invention aims at overcoming the defects of the prior art, and provides a speaker recognition method based on a voice sample feature space track, wherein a voice feature space is independent of speakers, texts and languages, so that the voice feature space can be constructed by adopting any qualified voice data, and the sharing of the voice data is realized; the voice track of the speaker can be constructed even if one sample is used, so that a large amount of marked voice data is not needed, and the defect that a large amount of marked voice data needs to be acquired in the prior art is overcome.
The aim of the invention can be achieved by the following technical scheme:
a speaker recognition method based on a speech sample feature space trajectory, wherein a speech sample can be regarded as a motion of a speech feature space, having an active space and trajectory characteristics in the space, the method comprising the steps of:
step 1), constructing a voice feature space omega: clustering unlabeled voice samples in a feature space by using a clustering method, and generating a certain expression of the subclass data as expression omega= { g of the voice feature space by using the subclass data obtained by clustering k ,k=1,2,…,K};
Step 2), constructing speaker knowledge: the pure voice sample with speaker attribute labels is utilized to obtain the distribution information and the motion trail information of the pure voice sample on the voice characteristic space omega;
step 3), speaker identification: for a voice sample to be recognized, firstly, the voice characteristic space distribution expression and the track of the sample are obtained, then, the difference between the sample distribution and the prior distribution and the accumulated local distribution difference along the track are calculated by utilizing the voice characteristic space distribution information of a speaker, and the difference is used as the basis of speaker recognition and is judged.
Further, in the process of constructing the voice feature space Ω in step 1), any clean voice sample can be used, and there is no constraint on the speaker and language factors.
Further, in step 1), the K-means or other clustering method is adopted to cluster the voice samples in the feature space, and the voiceFeature space expression Ω= { g k K=1, 2, …, K } can be a localization-capable identifier such as a distribution function (e.g., gaussian distribution function), a cluster center vector (centroid), or a generation model (e.g., hidden markov model or neural network) of class data, called a feature space identifier, and the class identifier sub-scale K used in the speech feature space determines the speech feature space expression granularity, and the larger K, the finer the speech feature space expression. On the other hand, the accuracy of the spatial expression is related to the data scale, and the more abundant the data, the more complete the spatial expression; meanwhile, the more targeted the data that construct the speech feature space, the more accurate the spatial representation will be for a particular problem.
Further, in step 2), the pure speech sample with speaker attribute label is used to label the speech feature space, and Gaussian distribution g is adopted k (m k ,U k ) As a spatial identifier, speaker characteristic spatial distribution information is obtained as follows:
1. calculating each feature f of a speech sample t And space identifier g k (m k ,U k ) Is defined as:
Figure GDA0004077923440000031
wherein, the space identifier is represented by multi-dimensional Gaussian distribution, m k Mean vector representing kth gaussian distribution, U k A variance matrix representing a kth multidimensional gaussian distribution;
2. calculating speaker sample set and space identifier g k (m k ,U k ) Expected value of position association degree:
Figure GDA0004077923440000032
in the method, in the process of the invention,
Figure GDA0004077923440000033
representing the nth sampleT frame feature and spatial identifier g k (m k ,U k ) Is a degree of association of (1);
3. the speaker characteristic space distribution is calculated as follows:
Figure GDA0004077923440000034
further, in step 2), the motion trail timing information of the speaker voice sample in the voice feature space Ω is represented as a neighborhood sequence ψ of voice sample features 1 Ψ 2 …Ψ T Whereas speech sample feature f t Delta neighborhood of be ψ t ={g k |d tk < delta }, where d tk Mahalanobis distance (Mahalanobis distance) for speech sample characteristics and speech sample distribution, i.e
Figure GDA0004077923440000035
Further, the speech sample feature f t Delta neighborhood ψ of (1) t ={g k |d tk The decision threshold of < delta > refers to the characteristic of normal distribution, and 2 < delta < 3 is selected.
Further, in step 3), the speech sample f= { f 1 ,f 2 ,…,f T The speaker recognition process of comprises the following steps:
1. calculate the speech sample f= { f 1 ,f 2 ,…,f T Distribution p= (P) in speech feature space Ω 1 ,p 2 ,…,p K ) Wherein
Figure GDA0004077923440000036
2. Determining a speech sample f= { f 1 ,f 2 ,…,f T Motion trajectory ψ in speech feature space Ω 1 Ψ 2 …Ψ T ,Ψ t ={g k |d tk <δ};
3. Calculate the sample distribution p= (P 1 ,p 2 ,…,p K ) First with speaker sSpatial distribution of empirical features
Figure GDA0004077923440000041
Distance of->
Figure GDA0004077923440000042
Then the set of possible solutions S containing the true solution is screened p :/>
Figure GDA0004077923440000043
4. Calculating a motion trajectory ψ 1 Ψ 2 …Ψ T Distance measurement of (2)
Figure GDA0004077923440000044
From S p Is selected to be possible->
Figure GDA0004077923440000045
Speaker recognition is completed.
Specifically, in step 3), only the spatial distribution information p= (P) of the voice samples is used 1 ,p 2 ,…,p K ) Or the movement track distance
Figure GDA0004077923440000046
Good speaker recognition performance can be obtained.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. the speaker identification method based on the voice sample feature space track provided by the invention has the advantages that a large number of voice features are clustered, marked data are not needed, the data samples for establishing the voice feature space are derived from different speakers, no exact requirements are met for speaking content, speaker age and languages, the problem that a large number of marked voices are needed in the neural network method is solved, and the data acquisition for establishing the voice space is convenient to realize.
2. The invention provides a speaker recognition method based on a voice sample characteristic space track, which is based on speaker languageThe positioning and track information of the sound features in the voice feature space are opposite, and the generated model is absolute, unlike signal source generation model methods such as Hidden Markov Models (HMM) and the like; compared with the method of the deep neural network, the method has the interpretability, and each knowledge data has certain physical semantics, such as correlation degree distribution information P= (P) of sample characteristics on space omega 1 ,p 2 ,…,p K ) Namely, the active space range of the sample (the space represented by the identifier subset corresponding to the non-zero element) is expressed, and the distribution in the space is expressed.
3. The speaker recognition method based on the voice sample feature space track provided by the invention is essentially that voice features are positioned in space, the voice features of different speakers are positioned on the established voice feature space, the voice feature positioning information of different speakers is represented by using the association degree, the distinguishing property among different speakers is expressed by less calculation amount, and compared with the method of GMM or HMM which needs to model each speaker by using a generation model, the method has lower calculation complexity.
4. The invention provides a speaker recognition method based on a voice sample feature space track, wherein a voice feature space identifier subset is a reference system for positioning speaker voice features, is a relative relation, and has no strict relation requirement with a sample to be recognized, so that feature spaces have shareability, and the established voice feature spaces can be transferred to other speaker data sets for recognition, such as: the speech feature space of the speaker in one language can be used as the speech feature space of the speaker in the other language for recognition, so that the sharing of data is realized.
Drawings
Fig. 1 is a schematic flow chart of a speaker recognition method in embodiment 1 of the present invention.
Fig. 2 is a flowchart illustrating steps for establishing a speech feature space in embodiment 1 of the present invention.
Fig. 3 is a flowchart illustrating steps for generating spatial distribution information and trajectory information of speaker voice characteristics in embodiment 1 of the present invention.
Fig. 4 is a flowchart of steps for recognizing a voice sample to be recognized in embodiment 1 of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but embodiments of the present invention are not limited thereto.
Example 1:
the embodiment provides a speaker recognition method based on a voice sample feature space track, and a schematic flow chart is shown in fig. 1, and comprises the following three steps:
1) The voice characteristic space omega is established, any pure voice sample can be used without any constraint on factors such as a speaker, languages and the like, then the voice sample is clustered in the characteristic space by using a clustering method, and subclass data obtained by clustering is expressed as expression { g) of the voice characteristic space k ,k=1,2,…,K};
2) Constructing speaker knowledge, including two parts of distribution information and motion trail information of the speaker in a voice feature space;
3) And for the voice sample to be identified, identifying by utilizing the space distribution information of the voice characteristics of the speaker and the motion trail information of the voice sample.
Referring to fig. 2, a flowchart of the steps for establishing a speech feature space in this embodiment is shown. Using speaker voice data in an aishell Chinese corpus as an unlabeled voice sample set, wherein the aishell contains 400 speakers in total, 60 wav files of each person are selected for training a voice feature space, and the unlabeled voice sample set X= { X is extracted 1 ,x 2 ,.....,x N 12-dimensional MFCC features of }, thereby obtaining feature set F x ={f i x ,i=1,2,…,t N Of f, where f i x Is a short-time frame feature, t N Sum of frames for all samples;
then use F x ={f i x ,i=1,2,…,t N Training a GMM with a mixing degree of K by the feature sequence, discarding the weight information of the GMM, and reserving each Gaussian component as an identification subset of a voice feature space. Wherein K isFor the number of audio feature space identifiers, selecting 4096 for the number of identifiers K so as to give a description with higher precision to the audio feature space;
the speech feature space identifier is denoted as Ω= { g k K=1, 2, …, K }, where g=n (m, U) is a multidimensional gaussian distribution function;
referring to fig. 3, a flowchart of the steps for generating speaker characteristic spatial distribution information in this embodiment is shown. For each person in the aishell, 20 wav files are used to annotate the speech feature space. Target speaker speech sample set y= { (Y) 1 ,s 1 ),(y 2 ,s 2 ),.....,(y M ,s M )},s i ∈S={S l L=1, 2, …, L } (speaker set), speaker S l Is Y as a sample of (C) l ={y m |s m =S l M=1, 2, …, M }, extracting its audio feature sequence as
Figure GDA0004077923440000061
Calculating all features f of a speech sample t And space identifier g k (m k ,U k ) Position association degree of (3):
Figure GDA0004077923440000062
calculating speaker sample set and space identifier g k (m k ,U k ) Expected value of position association degree:
Figure GDA0004077923440000063
wherein the method comprises the steps of
Figure GDA0004077923440000064
The t frame feature and identifier g for the nth sample k (m k ,U k ) Is a position association degree of (a);
the speaker characteristic space distribution is calculated as follows:
Figure GDA0004077923440000065
and processing the registered voice of each speaker in the target speaker set to obtain voice characteristic distribution information of each speaker.
Motion trail timing information of voice sample is expressed as neighborhood sequence psi of voice sample characteristics 1 Ψ 2 …Ψ T Whereas sample feature f t Delta neighborhood of be ψ t ={g k |d tk < delta }, where d tk Mahalanobis distance for characteristics and distribution, i.e
Figure GDA0004077923440000066
Referring to fig. 4, a flowchart of the steps for recognizing a voice sample in the present embodiment is shown. The speech sample to be recognized is characterized by f= { f 1 ,f 2 ,…,f T The identification process is as follows:
calculate the speech f= { f 1 ,f 2 ,…,f T Distribution p= (P) in feature space Ω 1 ,p 2 ,…,p K );
Wherein feature f t And space identifier g k (m k ,U k ) The degree of positional association is:
Figure GDA0004077923440000067
speaker sample to be identified and space identifier g k (m k ,U k ) The degree of positional association is:
Figure GDA0004077923440000071
determining a speech feature f= { f 1 ,f 2 ,…,f T Trajectory ψ in feature space Ω 1 Ψ 2 …Ψ T Wherein ψ is t ={g k |d tk <δ};
Calculate the sample distribution p= (P 1 ,p 2 ,…,p K ) Prior feature spatial distribution with speaker s
Figure GDA0004077923440000072
Distance of->
Figure GDA0004077923440000073
Wherein beta takes 2 and then filters the set of possible solutions S containing true solutions p
Figure GDA0004077923440000074
Selecting 10 speakers with the smallest distance as candidate recognition results; />
Calculating the locus ψ 1 Ψ 2 …Ψ T Distance measurement of (2)
Figure GDA0004077923440000075
Wherein beta is taken as 2, and a speaker with the smallest track distance is selected from the 10 candidate speakers as a recognition result, namely +.>
Figure GDA0004077923440000076
Example 2:
the embodiment provides a speaker recognition method based on a voice sample characteristic space track, which comprises the following steps:
step 1, establishing a voice feature space identifier subset by using voice data of an English corpus time;
step 2, registering the target speaker set by using the voice data in the aishell corpus, as in the embodiment 1;
step 3, the speech sample to be recognized is recognized, as in embodiment 1.
Compared with the embodiment 1, the obtained recognition effect has a small difference, and can prove that the voice characteristic space of the speaker in the other language can be used as the voice characteristic space of the speaker in the other language, thereby realizing the sharing of data.
The above description is only of the preferred embodiments of the present invention, but the protection scope of the present invention is not limited thereto, and any person skilled in the art can substitute or change the technical solution and the inventive conception of the present invention equally within the scope of the disclosure of the present invention.

Claims (7)

1. A speaker recognition method based on a speech sample feature space trajectory, wherein a speech sample can be regarded as a motion of a speech feature space, having an active space and trajectory characteristics in the space, the method comprising the steps of:
step 1), constructing a voice feature space omega: clustering the unlabeled voice samples in the feature space by using a clustering method, and expressing subclass data obtained by clustering as expression omega= { g of the voice feature space k K=1, 2, …, K }, K being the feature space identifier g k Is the number of (3);
step 2), constructing speaker knowledge: the pure voice sample with speaker attribute labels is utilized to obtain the distribution information and the motion trail information of the pure voice sample on the voice characteristic space omega;
step 3), speaker identification: for a voice sample to be recognized, firstly, the voice characteristic space distribution expression and the track of the sample are obtained, then, the difference between the sample distribution and the prior distribution and the accumulated local distribution difference along the track are calculated by utilizing the voice characteristic space distribution information of a speaker, and the difference is used as the basis of speaker recognition and is judged.
2. The method for speaker recognition based on the spatial trajectory of the characteristics of the speech samples according to claim 1, wherein: in the process of constructing the voice characteristic space omega in the step 1), any pure voice sample can be used, and no constraint is imposed on a speaker and language factors.
3. The method for speaker recognition based on spatial trajectories of features of speech samples of claim 1, characterized byThe method is characterized in that: the speech feature space expresses omega= { g k K=1, 2, …, K is a distribution function of class data, a cluster center vector or a model-generating identifier g with positioning capability k ,g k The scale K of the feature space identifier used by the voice feature space is called as a feature space identifier, and the larger K is, the finer the voice feature space expression granularity is determined.
4. The method for speaker recognition based on the spatial trajectory of the characteristics of the speech samples according to claim 1, wherein: in the step 2), the pure voice sample with speaker attribute labeling is used for labeling the voice feature space, and Gaussian distribution g is adopted k (m k ,U k ) As a feature space identifier, speaker feature space distribution information is obtained as follows:
1. calculating each feature f of a speech sample t And feature space identifier g k (m k ,U k ) Is defined as:
Figure FDA0004077923430000011
wherein, the characteristic space identifier is represented by multidimensional Gaussian distribution, m k Mean vector representing kth gaussian distribution, U k A variance matrix representing a kth multidimensional gaussian distribution;
2. calculating speaker sample set and feature space identifier g k (m k ,U k ) Expected value of position association degree:
Figure FDA0004077923430000021
in the method, in the process of the invention,
Figure FDA0004077923430000022
the t frame feature representing the nth sample and feature nullInterval tag g k (m k ,U k ) Is a degree of association of (1);
3. the speaker characteristic space distribution is calculated as follows:
Figure FDA0004077923430000023
5. the method for speaker recognition based on the spatial trajectory of the characteristics of the voice sample of claim 4, wherein: in step 2), the motion trail timing information of the speaker voice sample in the voice feature space Ω is expressed as a neighborhood sequence ψ of the voice sample features 1 Ψ 2 …Ψ T Whereas speech sample feature f t Delta neighborhood of be ψ t ={g k |d tk < delta }, where d tk Mahalanobis distance (Mahalanobis distance) for speech sample characteristics and speech sample distribution, i.e
Figure FDA0004077923430000024
6. The method for speaker recognition based on the spatial trajectory of the characteristics of the voice sample according to claim 5, wherein: the speech sample feature f t Delta neighborhood ψ of (1) t ={g k |d tk The decision threshold of < delta > refers to the characteristic of normal distribution, and 2 < delta < 3 is selected.
7. The method for speaker recognition based on the spatial trajectory of speech samples as claimed in claim 5, wherein in step 3), the speech samples f= { f 1 ,f 2 ,…,f T The speaker recognition process of comprises the following steps:
1. calculate the speech sample f= { f 1 ,f 2 ,…,f T Distribution p= (P) in speech feature space Ω 1 ,p 2 ,…,p K ) Wherein
Figure FDA0004077923430000025
2. Determining a speech sample f= { f 1 ,f 2 ,…,f T Motion trajectory ψ in speech feature space Ω 1 Ψ 2 …Ψ T ,Ψ t ={g k |d tk <δ};
3. Calculate the sample distribution p= (P 1 ,p 2 ,…,p K ) Prior feature spatial distribution with speaker s
Figure FDA0004077923430000031
Distance of->
Figure FDA0004077923430000032
Then the set of possible solutions S containing the true solution is screened p :/>
Figure FDA0004077923430000033
4. Calculating a motion trajectory ψ 1 Ψ 2 …Ψ T Distance measurement of (2)
Figure FDA0004077923430000034
From S p To select possible solutions
Figure FDA0004077923430000035
Speaker recognition is completed. />
CN201910027145.3A 2019-01-11 2019-01-11 Speaker recognition method based on voice sample characteristic space track Active CN109545229B (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN201910027145.3A CN109545229B (en) 2019-01-11 2019-01-11 Speaker recognition method based on voice sample characteristic space track
PCT/CN2019/111530 WO2020143263A1 (en) 2019-01-11 2019-10-16 Speaker identification method based on speech sample feature space trajectory
SG11202103091XA SG11202103091XA (en) 2019-01-11 2019-10-16 A Speaker Recognition Method Based on Trajectories in Feature Spaces of Voice Samples

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910027145.3A CN109545229B (en) 2019-01-11 2019-01-11 Speaker recognition method based on voice sample characteristic space track

Publications (2)

Publication Number Publication Date
CN109545229A CN109545229A (en) 2019-03-29
CN109545229B true CN109545229B (en) 2023-04-21

Family

ID=65835222

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910027145.3A Active CN109545229B (en) 2019-01-11 2019-01-11 Speaker recognition method based on voice sample characteristic space track

Country Status (3)

Country Link
CN (1) CN109545229B (en)
SG (1) SG11202103091XA (en)
WO (1) WO2020143263A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109545229B (en) * 2019-01-11 2023-04-21 华南理工大学 Speaker recognition method based on voice sample characteristic space track
CN111081261B (en) * 2019-12-25 2023-04-21 华南理工大学 Text-independent voiceprint recognition method based on LDA
CN111128128B (en) * 2019-12-26 2023-05-23 华南理工大学 Voice keyword detection method based on complementary model scoring fusion
CN111933156B (en) * 2020-09-25 2021-01-19 广州佰锐网络科技有限公司 High-fidelity audio processing method and device based on multiple feature recognition
CN112487978B (en) * 2020-11-30 2024-04-16 清华珠三角研究院 Method and device for positioning speaker in video and computer storage medium
CN113611285B (en) * 2021-09-03 2023-11-24 哈尔滨理工大学 Language identification method based on stacked bidirectional time sequence pooling
CN117235435B (en) * 2023-11-15 2024-02-20 世优(北京)科技有限公司 Method and device for determining audio signal loss function

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6067517A (en) * 1996-02-02 2000-05-23 International Business Machines Corporation Transcription of speech data with segments from acoustically dissimilar environments
CN1652206A (en) * 2005-04-01 2005-08-10 郑方 Sound veins identifying method
JP2009063773A (en) * 2007-09-05 2009-03-26 Nippon Telegr & Teleph Corp <Ntt> Speech feature learning device and speech recognition device, and method, program and recording medium thereof
CN102024455A (en) * 2009-09-10 2011-04-20 索尼株式会社 Speaker recognition system and method

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5598507A (en) * 1994-04-12 1997-01-28 Xerox Corporation Method of speaker clustering for unknown speakers in conversational audio data
CN102479511A (en) * 2010-11-23 2012-05-30 盛乐信息技术(上海)有限公司 Large-scale voiceprint authentication method and system
CN105845141A (en) * 2016-03-23 2016-08-10 广州势必可赢网络科技有限公司 Speaker confirmation model, speaker confirmation method and speaker confirmation device based on channel robustness
US10637898B2 (en) * 2017-05-24 2020-04-28 AffectLayer, Inc. Automatic speaker identification in calls
CN109065028B (en) * 2018-06-11 2022-12-30 平安科技(深圳)有限公司 Speaker clustering method, speaker clustering device, computer equipment and storage medium
CN109065059A (en) * 2018-09-26 2018-12-21 新巴特(安徽)智能科技有限公司 The method for identifying speaker with the voice cluster that audio frequency characteristics principal component is established
CN109545229B (en) * 2019-01-11 2023-04-21 华南理工大学 Speaker recognition method based on voice sample characteristic space track

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6067517A (en) * 1996-02-02 2000-05-23 International Business Machines Corporation Transcription of speech data with segments from acoustically dissimilar environments
CN1652206A (en) * 2005-04-01 2005-08-10 郑方 Sound veins identifying method
JP2009063773A (en) * 2007-09-05 2009-03-26 Nippon Telegr & Teleph Corp <Ntt> Speech feature learning device and speech recognition device, and method, program and recording medium thereof
CN102024455A (en) * 2009-09-10 2011-04-20 索尼株式会社 Speaker recognition system and method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
一种新型的与文本相关的说话人识别方法研究;周雷等;《上海师范大学学报(自然科学版)》;20170415;第46卷(第02期);第224-230页 *
基于聚类统计与文本无关的说话人识别研究;邓浩江等;《电路与系统学报》;20010930;第6卷(第03期);第77-80页 *

Also Published As

Publication number Publication date
WO2020143263A1 (en) 2020-07-16
SG11202103091XA (en) 2021-04-29
CN109545229A (en) 2019-03-29

Similar Documents

Publication Publication Date Title
CN109545229B (en) Speaker recognition method based on voice sample characteristic space track
Zhuang et al. Real-world acoustic event detection
Gao et al. Transition movement models for large vocabulary continuous sign language recognition
CN111128128B (en) Voice keyword detection method based on complementary model scoring fusion
CN111696522B (en) Tibetan language voice recognition method based on HMM and DNN
Li et al. Towards zero-shot learning for automatic phonemic transcription
CN116110405B (en) Land-air conversation speaker identification method and equipment based on semi-supervised learning
Srivastava et al. Significance of neural phonotactic models for large-scale spoken language identification
CN111597328A (en) New event theme extraction method
WO2023048746A1 (en) Speaker-turn-based online speaker diarization with constrained spectral clustering
Bluche et al. Predicting detection filters for small footprint open-vocabulary keyword spotting
Bhati et al. Self-expressing autoencoders for unsupervised spoken term discovery
Bhati et al. Unsupervised Speech Signal to Symbol Transformation for Zero Resource Speech Applications.
Frihia et al. HMM/SVM segmentation and labelling of Arabic speech for speech recognition applications
Han et al. Boosted subunits: a framework for recognising sign language from videos
Nwe et al. Speaker clustering and cluster purification methods for RT07 and RT09 evaluation meeting data
Yao et al. Real time large vocabulary continuous sign language recognition based on OP/Viterbi algorithm
Ananthakrishnan et al. Combining acoustic, lexical, and syntactic evidence for automatic unsupervised prosody labeling
Kesiraju et al. Topic identification of spoken documents using unsupervised acoustic unit discovery
Nyodu et al. Automatic identification of Arunachal language using K-nearest neighbor algorithm
Sawakare et al. Speech recognition techniques: a review
Cornaggia-Urrigshardt et al. Speech recognition lab
Chandrakala et al. Combination of generative models and SVM based classifier for speech emotion recognition
Feng et al. Exploiting language-mismatched phoneme recognizers for unsupervised acoustic modeling
Mouaz et al. A new framework based on KNN and DT for speech identification through emphatic letters in Moroccan dialect

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant