CN109545229B - Speaker recognition method based on voice sample characteristic space track
- Publication number
- CN109545229B (application CN201910027145.3A / CN201910027145A)
- Authority
- CN
- China
- Prior art keywords
- voice
- speaker
- sample
- feature space
- space
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/06—Decision making techniques; Pattern matching strategies
- G10L17/08—Use of distortion metrics or a particular distance between probe pattern and reference templates
Abstract
The invention discloses a speaker recognition method based on a voice sample feature space trajectory. The method comprises: clustering the features of unlabeled voice data to obtain a representation of the voice feature space, namely an identifier subset; registering speakers with labeled voice samples to obtain each speaker's distribution information and motion trajectory information in the voice feature space; and recognizing a voice sample to be recognized using the speaker's voice feature space distribution information and the voice sample's motion trajectory information. By adopting the idea of locating speaker voice features in a feature space, the method achieves low computational complexity for speaker recognition, overcoming the high computational complexity of GMM-UBM; moreover, a voice feature space built from speakers of one language can be used as the voice feature space for recognizing speakers of another language, thereby realizing the sharing of data.
Description
Technical Field
The invention relates to the field of biological feature recognition, in particular to a speaker recognition method based on a voice sample feature space track.
Background
With the development of artificial intelligence technology, audio perception has become a hotspot of audio processing research, and audio classification or audio recognition is its core problem; in engineering applications, audio classification appears as speaker recognition, audio event detection, and the like. Speaker recognition is an identity verification technology, i.e., a biometric recognition technology. Biometric recognition automatically identifies an individual's identity from biological characteristics, and includes fingerprint recognition, iris recognition, gene recognition, face recognition, and so on. Compared with other identity verification technologies, speaker recognition is more convenient and natural, and less invasive to the user. Because it performs identification from the voice signal, it offers natural human-computer interaction, easy acquisition of the voice signal, and the possibility of remote identification.
Existing speaker recognition systems comprise two phases: a training phase and a recognition phase. In the training phase, the system builds a model for each speaker from collected speaker voices; in the recognition phase, the system matches the input speech against the speaker models to make a decision. The system must extract features reflecting the speaker's individuality from the speech signal and build a model accurate enough to distinguish the speaker from other speakers. Commonly used audio classification technologies fall into two main types: generative statistical models, such as the Gaussian mixture model (GMM) and the hidden Markov model (HMM), and deep neural network methods, such as DNN, RNN, or LSTM. Either type requires a large number of labeled training samples, and to achieve good recognition performance the deep neural network methods demand even larger sample sizes. GMM- or HMM-based methods do not specifically account for the discriminative information between different audio classes, nor for sharing sample data across classes; for example, the method in the paper Speaker Verification Using Adapted Gaussian Mixture Models (Digital Signal Processing (2000), 19-41) by Reynolds et al. of MIT has high computational complexity. Deep neural network methods perform well when supported by large samples, as in Google's paper End-to-End Text-Dependent Speaker Verification (2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pages 5115-5119), which uses a neural network to extract features from voice and to train; but training the neural network requires a large amount of labeled speech, acquiring that many samples is very costly, and the method lacks interpretability, remaining essentially a black box.
Existing speaker recognition technologies thus have high computational complexity and need large amounts of labeled speaker voice data to train their models, while collecting such labeled data requires enormous effort. A more convenient and efficient speaker recognition method and system is therefore desirable.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a speaker recognition method based on the voice sample feature space trajectory. The voice feature space is independent of speaker, text, and language, so it can be constructed from any qualified voice data, realizing the sharing of voice data; and a speaker's voice trajectory can be constructed even from a single sample, so a large amount of labeled voice data is not needed, overcoming the prior art's requirement to acquire large labeled datasets.
The aim of the invention can be achieved by the following technical scheme:
A speaker recognition method based on a voice sample feature space trajectory, wherein a voice sample can be regarded as a motion in the voice feature space, having an active region and trajectory characteristics in that space, the method comprising the following steps:
step 1), constructing the voice feature space Ω: unlabeled voice samples are clustered in the feature space by a clustering method, and a representation generated from the clustered sub-class data serves as the expression of the voice feature space, Ω = {g_k, k = 1, 2, …, K};
Step 2), constructing speaker knowledge: the pure voice sample with speaker attribute labels is utilized to obtain the distribution information and the motion trail information of the pure voice sample on the voice characteristic space omega;
step 3), speaker recognition: for a voice sample to be recognized, first obtain the sample's voice feature space distribution expression and trajectory; then, using the speaker's voice feature space distribution information, calculate the difference between the sample distribution and the prior distribution, and the accumulated local distribution difference along the trajectory, and use these as the basis for the speaker recognition decision.
Further, in the process of constructing the voice feature space Ω in step 1), any clean voice sample can be used, and there is no constraint on the speaker and language factors.
Further, in step 1), K-means or another clustering method is used to cluster the voice samples in the feature space. Each element of the voice feature space expression Ω = {g_k, k = 1, 2, …, K} can be any identifier with positioning capability, such as a distribution function of the sub-class data (e.g., a Gaussian distribution function), a cluster-center vector (centroid), or a generative model (e.g., a hidden Markov model or a neural network); these are called feature space identifiers. The number of identifiers K determines the granularity of the voice feature space expression: the larger K, the finer the expression. The accuracy of the spatial expression is also related to the data scale: the richer the data, the more complete the expression; and the more targeted the data used to construct the voice feature space, the more accurate the spatial expression for a particular problem.
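The clustering step above can be sketched as follows. This is a minimal illustration assuming plain K-means with centroid-plus-diagonal-variance identifiers (the patent also allows distribution functions or generative models as identifiers); all function and variable names are invented for the example:

```python
import numpy as np

def build_feature_space(features, K, iters=20, seed=0):
    """Cluster unlabeled frame features into K sub-classes and return one
    identifier per sub-class: the centroid plus a diagonal variance,
    standing in for the Gaussian identifier g_k(m_k, U_k).
    A sketch with random K-means initialization; names are illustrative."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), K, replace=False)]
    for _ in range(iters):
        # assign each frame feature to its nearest centroid
        d = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in range(K):
            pts = features[labels == k]
            if len(pts):
                centers[k] = pts.mean(axis=0)
    # final assignment, then one (mean m_k, diagonal variance U_k) per sub-class
    d = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    space = []
    for k in range(K):
        pts = features[labels == k]
        var = pts.var(axis=0) + 1e-6 if len(pts) else np.ones(features.shape[1])
        space.append((centers[k], var))
    return space
```

Any clean speech, regardless of speaker or language, can feed `features`, matching the claim that the space construction is unconstrained.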
Further, in step 2), the pure voice samples with speaker attribute labels are used to annotate the voice feature space, and the Gaussian distribution g_k(m_k, U_k) is adopted as the spatial identifier; the speaker feature space distribution information is obtained as follows:
1. calculate the position association degree between each feature f_t of a voice sample and the spatial identifier g_k(m_k, U_k), defined as
c_tk = N(f_t; m_k, U_k) / Σ_{j=1..K} N(f_t; m_j, U_j),
wherein each spatial identifier is a multidimensional Gaussian distribution, m_k denotes the mean vector of the k-th Gaussian distribution, and U_k denotes the variance matrix of the k-th multidimensional Gaussian distribution;
2. calculate the expected value of the position association degree between the speaker's sample set and the spatial identifier g_k(m_k, U_k):
p_k = (1 / Σ_n T_n) · Σ_n Σ_t c_tk^(n),
wherein c_tk^(n) denotes the association degree between the t-th frame feature of the n-th sample and the spatial identifier g_k(m_k, U_k), and T_n is the frame count of the n-th sample;
3. the speaker feature space distribution is then obtained as P^s = (p_1, p_2, …, p_K).
further, in step 2), the motion trail timing information of the speaker voice sample in the voice feature space Ω is represented as a neighborhood sequence ψ of voice sample features 1 Ψ 2 …Ψ T Whereas speech sample feature f t Delta neighborhood of be ψ t ={g k |d tk < delta }, where d tk Mahalanobis distance (Mahalanobis distance) for speech sample characteristics and speech sample distribution, i.e
Further, for the δ-neighborhood Ψ_t = {g_k | d_tk < δ} of the voice sample feature f_t, the decision threshold δ is chosen with reference to the properties of the normal distribution, taking 2 < δ < 3.
Further, in step 3), the speaker recognition process for a voice sample F = {f_1, f_2, …, f_T} comprises the following steps:
1. calculate the distribution P = (p_1, p_2, …, p_K) of the voice sample F = {f_1, f_2, …, f_T} in the voice feature space Ω, where p_k is the average position association degree of the sample's frames with identifier g_k;
2. determine the motion trajectory Ψ_1 Ψ_2 … Ψ_T of the voice sample in the voice feature space Ω, with Ψ_t = {g_k | d_tk < δ};
3. calculate the distance between the sample distribution P = (p_1, p_2, …, p_K) and the prior feature space distribution P^s of each speaker s, and screen a candidate solution set S_p containing the true solution (the speakers with the smallest distances);
4. calculate the distance measure of the motion trajectory Ψ_1 Ψ_2 … Ψ_T against each candidate in S_p, and select the candidate with the smallest trajectory distance, completing the speaker recognition.
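The two-stage decision of steps 3 and 4 might look like the following sketch, assuming a Minkowski distance with β = 2 for the distribution comparison (the embodiment below takes β = 2) and per-candidate trajectory distances computed elsewhere; all names are illustrative:

```python
import numpy as np

def screen_candidates(P, speaker_dists, n_candidates=10, beta=2):
    """Stage 1: distance between the sample distribution P and each
    speaker's prior distribution P^s, (sum_k |p_k - p_k^s|^beta)^(1/beta);
    keep the closest candidates as the set S_p. A sketch of step 3.3."""
    d = {s: float(np.sum(np.abs(P - Ps) ** beta) ** (1.0 / beta))
         for s, Ps in speaker_dists.items()}
    return sorted(d, key=d.get)[:n_candidates]

def pick_by_trajectory(traj_dists, candidates):
    """Stage 2 (step 3.4): among the screened candidates, choose the
    speaker whose accumulated trajectory distance is smallest."""
    return min(candidates, key=lambda s: traj_dists[s])
```

Note that either stage alone is usable, matching the remark below that the distribution information or the trajectory distance on its own already gives good performance.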
Specifically, in step 3), using only the spatial distribution information P = (p_1, p_2, …, p_K) of the voice sample, or only the motion trajectory distance, can already achieve good speaker recognition performance.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The speaker recognition method based on the voice sample feature space trajectory clusters a large number of voice features without any labeled data; the data samples used to build the voice feature space may come from different speakers, with no specific requirements on speaking content, speaker age, or language. This solves the neural network methods' need for large amounts of labeled speech and makes data acquisition for building the voice space easy to realize.
2. The method is based on the positioning and trajectory information of the speaker's voice features in the voice feature space, which is relative, whereas a generative model is absolute; this distinguishes it from signal-source generative model methods such as hidden Markov models (HMM). Compared with deep neural network methods, it is interpretable, and each piece of knowledge has a physical meaning: for example, the association degree distribution P = (p_1, p_2, …, p_K) of the sample features over the space Ω expresses both the sample's active region (the subspace represented by the identifier subset with non-zero elements) and the distribution within that region.
3. The method is in essence the positioning of voice features in a space: the voice features of different speakers are located in the established voice feature space, and each speaker's positioning information is represented by association degrees. This expresses the discrimination between speakers with little computation; compared with GMM or HMM methods, which must fit a generative model for every speaker, the computational complexity is lower.
4. The voice feature space identifier subset is a reference frame for positioning speaker voice features; it is a relative relation with no strict relational requirement on the sample to be recognized, so feature spaces are shareable, and an established voice feature space can be transferred to other speaker data sets for recognition. For example, a voice feature space built from speakers of one language can serve as the voice feature space for recognizing speakers of another language, realizing the sharing of data.
Drawings
Fig. 1 is a schematic flow chart of a speaker recognition method in embodiment 1 of the present invention.
Fig. 2 is a flowchart illustrating steps for establishing a speech feature space in embodiment 1 of the present invention.
Fig. 3 is a flowchart illustrating steps for generating spatial distribution information and trajectory information of speaker voice characteristics in embodiment 1 of the present invention.
Fig. 4 is a flowchart of steps for recognizing a voice sample to be recognized in embodiment 1 of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but embodiments of the present invention are not limited thereto.
Example 1:
the embodiment provides a speaker recognition method based on a voice sample feature space track, and a schematic flow chart is shown in fig. 1, and comprises the following three steps:
1) Establish the voice feature space Ω. Any pure voice sample can be used, without any constraint on speaker, language, or other factors; the voice samples are then clustered in the feature space by a clustering method, and the sub-class data obtained by clustering are expressed as the voice feature space expression {g_k, k = 1, 2, …, K};
2) Constructing speaker knowledge, including two parts of distribution information and motion trail information of the speaker in a voice feature space;
3) And for the voice sample to be identified, identifying by utilizing the space distribution information of the voice characteristics of the speaker and the motion trail information of the voice sample.
Referring to fig. 2, a flowchart of the steps for establishing the voice feature space in this embodiment is shown. Speaker voice data from the AISHELL Chinese corpus is used as the unlabeled voice sample set; AISHELL contains 400 speakers in total, and 60 wav files per speaker are selected for training the voice feature space. From the unlabeled voice sample set X = {x_1, x_2, …, x_N}, 12-dimensional MFCC features are extracted, yielding the feature set F_x = {f_i^x, i = 1, 2, …, t_N}, where f_i^x is a short-time frame feature and t_N is the total number of frames over all samples;
then the feature sequence F_x = {f_i^x, i = 1, 2, …, t_N} is used to train a GMM with K mixture components; the GMM's weight information is discarded, and each Gaussian component is retained as one element of the identifier subset of the voice feature space. K is the number of audio feature space identifiers, and K = 4096 is chosen to give the audio feature space a higher-precision description;
the voice feature space identifiers are denoted Ω = {g_k, k = 1, 2, …, K}, where g = N(m, U) is a multidimensional Gaussian distribution function;
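Such a Gaussian identifier g = N(m, U) can be evaluated directly. The sketch below computes the multidimensional Gaussian density that a retained GMM component represents (the function name is invented for illustration):

```python
import numpy as np

def gaussian_identifier(f, m, U):
    """Density of the multidimensional Gaussian N(m, U) at frame feature f.
    Each Gaussian component kept from the trained GMM serves as one
    identifier g_k of the feature space (the mixture weights are discarded)."""
    d = len(m)
    diff = f - m
    expo = -0.5 * diff @ np.linalg.inv(U) @ diff
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(U))
    return float(norm * np.exp(expo))
```

In one dimension with m = 0 and U = 1 this reduces to the standard normal density, 1/√(2π) ≈ 0.3989 at f = 0.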
referring to fig. 3, a flowchart of the steps for generating speaker characteristic spatial distribution information in this embodiment is shown. For each person in the aishell, 20 wav files are used to annotate the speech feature space. Target speaker speech sample set y= { (Y) 1 ,s 1 ),(y 2 ,s 2 ),.....,(y M ,s M )},s i ∈S={S l L=1, 2, …, L } (speaker set), speaker S l Is Y as a sample of (C) l ={y m |s m =S l M=1, 2, …, M }, extracting its audio feature sequence asCalculating all features f of a speech sample t And space identifier g k (m k ,U k ) Position association degree of (3):
calculating speaker sample set and space identifier g k (m k ,U k ) Expected value of position association degree:
wherein the method comprises the steps ofThe t frame feature and identifier g for the nth sample k (m k ,U k ) Is a position association degree of (a);
the speaker characteristic space distribution is calculated as follows:
and processing the registered voice of each speaker in the target speaker set to obtain voice characteristic distribution information of each speaker.
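Computing the association degrees and averaging them into a speaker's feature space distribution can be sketched as follows. The normalized-likelihood form of c_tk is an assumed reading of the patent's position association degree (consistent with the GMM weights having been discarded); function names are invented:

```python
import numpy as np

def _density(f, m, U):
    # multidimensional Gaussian density N(f; m, U)
    d = len(m)
    diff = f - m
    expo = -0.5 * diff @ np.linalg.inv(U) @ diff
    return np.exp(expo) / np.sqrt((2 * np.pi) ** d * np.linalg.det(U))

def association(frames, identifiers):
    """Position association degree c_tk of every frame with every
    identifier g_k(m_k, U_k), normalized over the identifier set
    (an assumption; the source's formula image is not available)."""
    dens = np.array([[_density(f, m, U) for (m, U) in identifiers]
                     for f in frames])
    return dens / dens.sum(axis=1, keepdims=True)

def speaker_distribution(samples, identifiers):
    """Average c_tk over all frames of all of a speaker's registered
    samples: the expected association degree, P^s = (p_1, ..., p_K)."""
    C = np.vstack([association(fr, identifiers) for fr in samples])
    return C.mean(axis=0)
```

Running `speaker_distribution` once per registered speaker yields the per-speaker distribution information used later for recognition.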
The motion trajectory timing information of a voice sample is represented as the neighborhood sequence Ψ_1 Ψ_2 … Ψ_T of the voice sample features, where the δ-neighborhood of the sample feature f_t is Ψ_t = {g_k | d_tk < δ}, and d_tk is the Mahalanobis distance between the feature and the identifier distribution, i.e. d_tk = ((f_t − m_k)^T U_k^{−1} (f_t − m_k))^{1/2}.
Referring to fig. 4, a flowchart of the steps for recognizing a voice sample in this embodiment is shown. The features of the voice sample to be recognized are F = {f_1, f_2, …, f_T}, and the recognition process is as follows:
calculate the distribution P = (p_1, p_2, …, p_K) of the voice F = {f_1, f_2, …, f_T} in the feature space Ω, where the position association degree between feature f_t and spatial identifier g_k(m_k, U_k) is c_tk = N(f_t; m_k, U_k) / Σ_{j=1..K} N(f_t; m_j, U_j), and the position association degree between the whole sample to be recognized and g_k(m_k, U_k) is the average over frames, p_k = (1/T) Σ_t c_tk;
determine the trajectory Ψ_1 Ψ_2 … Ψ_T of the voice features F = {f_1, f_2, …, f_T} in the feature space Ω, where Ψ_t = {g_k | d_tk < δ};
Calculate the sample distribution p= (P 1 ,p 2 ,…,p K ) Prior feature spatial distribution with speaker sDistance of->Wherein beta takes 2 and then filters the set of possible solutions S containing true solutions p :Selecting 10 speakers with the smallest distance as candidate recognition results; />
Calculating the locus ψ 1 Ψ 2 …Ψ T Distance measurement of (2)Wherein beta is taken as 2, and a speaker with the smallest track distance is selected from the 10 candidate speakers as a recognition result, namely +.>
Example 2:
the embodiment provides a speaker recognition method based on a voice sample characteristic space track, which comprises the following steps:
step 1, establish the voice feature space identifier subset using voice data from the TIMIT English corpus;
step 2, register the target speaker set using voice data from the AISHELL corpus, as in embodiment 1;
step 3, recognize the voice sample to be recognized, as in embodiment 1.
Compared with the embodiment 1, the obtained recognition effect has a small difference, and can prove that the voice characteristic space of the speaker in the other language can be used as the voice characteristic space of the speaker in the other language, thereby realizing the sharing of data.
The above describes only preferred embodiments of the present invention; the protection scope of the invention is not limited thereto, and any equivalent substitution or change of the technical solution and inventive concept of the invention made by a person skilled in the art within the scope of this disclosure falls within the protection scope of the invention.
Claims (7)
1. A speaker recognition method based on a voice sample feature space trajectory, wherein a voice sample can be regarded as a motion in the voice feature space, having an active region and trajectory characteristics in that space, the method comprising the following steps:
step 1), constructing the voice feature space Ω: clustering the unlabeled voice samples in the feature space by a clustering method, and expressing the sub-class data obtained by clustering as the voice feature space expression Ω = {g_k, k = 1, 2, …, K}, K being the number of feature space identifiers g_k;
step 2), constructing speaker knowledge: the pure voice sample with speaker attribute labels is utilized to obtain the distribution information and the motion trail information of the pure voice sample on the voice characteristic space omega;
step 3), speaker recognition: for a voice sample to be recognized, first obtain the sample's voice feature space distribution expression and trajectory; then, using the speaker's voice feature space distribution information, calculate the difference between the sample distribution and the prior distribution, and the accumulated local distribution difference along the trajectory, and use these as the basis for the speaker recognition decision.
2. The method for speaker recognition based on the spatial trajectory of the characteristics of the speech samples according to claim 1, wherein: in the process of constructing the voice characteristic space omega in the step 1), any pure voice sample can be used, and no constraint is imposed on a speaker and language factors.
3. The speaker recognition method based on a voice sample feature space trajectory according to claim 1, characterized in that: each element g_k of the voice feature space expression Ω = {g_k, k = 1, 2, …, K} is an identifier with positioning capability, such as a distribution function of the sub-class data, a cluster-center vector, or a generative model; g_k is called a feature space identifier, and the number K of identifiers used determines the granularity of the voice feature space expression: the larger K, the finer the expression.
4. The speaker recognition method based on a voice sample feature space trajectory according to claim 1, characterized in that: in step 2), the pure voice samples with speaker attribute labels are used to annotate the voice feature space, and the Gaussian distribution g_k(m_k, U_k) is adopted as the feature space identifier; the speaker feature space distribution information is obtained as follows:
1. calculate the position association degree between each feature f_t of a voice sample and the feature space identifier g_k(m_k, U_k), defined as
c_tk = N(f_t; m_k, U_k) / Σ_{j=1..K} N(f_t; m_j, U_j),
wherein each feature space identifier is a multidimensional Gaussian distribution, m_k denotes the mean vector of the k-th Gaussian distribution, and U_k denotes the variance matrix of the k-th multidimensional Gaussian distribution;
2. calculate the expected value of the position association degree between the speaker's sample set and the feature space identifier g_k(m_k, U_k):
p_k = (1 / Σ_n T_n) · Σ_n Σ_t c_tk^(n),
wherein c_tk^(n) denotes the association degree between the t-th frame feature of the n-th sample and the feature space identifier g_k(m_k, U_k), and T_n is the frame count of the n-th sample;
3. the speaker feature space distribution is then obtained as P^s = (p_1, p_2, …, p_K).
5. The speaker recognition method based on a voice sample feature space trajectory according to claim 4, characterized in that: in step 2), the motion trajectory timing information of the speaker's voice sample in the voice feature space Ω is represented as the neighborhood sequence Ψ_1 Ψ_2 … Ψ_T of the voice sample features, where the δ-neighborhood of the voice sample feature f_t is Ψ_t = {g_k | d_tk < δ}, and d_tk is the Mahalanobis distance between the voice sample feature and the identifier distribution, i.e. d_tk = ((f_t − m_k)^T U_k^{−1} (f_t − m_k))^{1/2}.
6. The speaker recognition method based on a voice sample feature space trajectory according to claim 5, characterized in that: for the δ-neighborhood Ψ_t = {g_k | d_tk < δ} of the voice sample feature f_t, the decision threshold δ is chosen with reference to the properties of the normal distribution, taking 2 < δ < 3.
7. The speaker recognition method based on a voice sample feature space trajectory according to claim 5, wherein in step 3) the speaker recognition process for a voice sample F = {f_1, f_2, …, f_T} comprises the following steps:
1. calculate the distribution P = (p_1, p_2, …, p_K) of the voice sample F = {f_1, f_2, …, f_T} in the voice feature space Ω, where p_k is the average position association degree of the sample's frames with identifier g_k;
2. determine the motion trajectory Ψ_1 Ψ_2 … Ψ_T of the voice sample in the voice feature space Ω, with Ψ_t = {g_k | d_tk < δ};
3. calculate the distance between the sample distribution P = (p_1, p_2, …, p_K) and the prior feature space distribution P^s of each speaker s, and then screen the candidate solution set S_p containing the true solution.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910027145.3A CN109545229B (en) | 2019-01-11 | 2019-01-11 | Speaker recognition method based on voice sample characteristic space track |
PCT/CN2019/111530 WO2020143263A1 (en) | 2019-01-11 | 2019-10-16 | Speaker identification method based on speech sample feature space trajectory |
SG11202103091XA SG11202103091XA (en) | 2019-01-11 | 2019-10-16 | A Speaker Recognition Method Based on Trajectories in Feature Spaces of Voice Samples |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910027145.3A CN109545229B (en) | 2019-01-11 | 2019-01-11 | Speaker recognition method based on voice sample characteristic space track |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109545229A CN109545229A (en) | 2019-03-29 |
CN109545229B true CN109545229B (en) | 2023-04-21 |
Family
ID=65835222
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910027145.3A Active CN109545229B (en) | 2019-01-11 | 2019-01-11 | Speaker recognition method based on voice sample characteristic space track |
Country Status (3)
Country | Link |
---|---|
CN (1) | CN109545229B (en) |
SG (1) | SG11202103091XA (en) |
WO (1) | WO2020143263A1 (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109545229B (en) * | 2019-01-11 | 2023-04-21 | 华南理工大学 | Speaker recognition method based on voice sample characteristic space track |
CN111081261B (en) * | 2019-12-25 | 2023-04-21 | 华南理工大学 | Text-independent voiceprint recognition method based on LDA |
CN111128128B (en) * | 2019-12-26 | 2023-05-23 | 华南理工大学 | Voice keyword detection method based on complementary model scoring fusion |
CN111933156B (en) * | 2020-09-25 | 2021-01-19 | 广州佰锐网络科技有限公司 | High-fidelity audio processing method and device based on multiple feature recognition |
CN112487978B (en) * | 2020-11-30 | 2024-04-16 | 清华珠三角研究院 | Method and device for positioning speaker in video and computer storage medium |
CN113611285B (en) * | 2021-09-03 | 2023-11-24 | 哈尔滨理工大学 | Language identification method based on stacked bidirectional time sequence pooling |
CN117235435B (en) * | 2023-11-15 | 2024-02-20 | 世优(北京)科技有限公司 | Method and device for determining audio signal loss function |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6067517A (en) * | 1996-02-02 | 2000-05-23 | International Business Machines Corporation | Transcription of speech data with segments from acoustically dissimilar environments |
CN1652206A (en) * | 2005-04-01 | 2005-08-10 | 郑方 | Sound veins identifying method |
JP2009063773A (en) * | 2007-09-05 | 2009-03-26 | Nippon Telegr & Teleph Corp <Ntt> | Speech feature learning device and speech recognition device, and method, program and recording medium thereof |
CN102024455A (en) * | 2009-09-10 | 2011-04-20 | 索尼株式会社 | Speaker recognition system and method |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5598507A (en) * | 1994-04-12 | 1997-01-28 | Xerox Corporation | Method of speaker clustering for unknown speakers in conversational audio data |
CN102479511A (en) * | 2010-11-23 | 2012-05-30 | Shengle Information Technology (Shanghai) Co., Ltd. | Large-scale voiceprint authentication method and system |
CN105845141A (en) * | 2016-03-23 | 2016-08-10 | Guangzhou SpeakIn Network Technology Co., Ltd. | Channel-robust speaker verification model, method and device |
US10637898B2 (en) * | 2017-05-24 | 2020-04-28 | AffectLayer, Inc. | Automatic speaker identification in calls |
CN109065028B (en) * | 2018-06-11 | 2022-12-30 | Ping An Technology (Shenzhen) Co., Ltd. | Speaker clustering method, speaker clustering device, computer equipment and storage medium |
CN109065059A (en) * | 2018-09-26 | 2018-12-21 | Xinbate (Anhui) Intelligent Technology Co., Ltd. | Speaker identification method using voice clusters built from principal components of audio features |
CN109545229B (en) * | 2019-01-11 | 2023-04-21 | South China University of Technology | Speaker recognition method based on voice sample characteristic space track |
2019
- 2019-01-11 CN CN201910027145.3A patent/CN109545229B/en active Active
- 2019-10-16 SG SG11202103091XA patent/SG11202103091XA/en unknown
- 2019-10-16 WO PCT/CN2019/111530 patent/WO2020143263A1/en active Application Filing
Non-Patent Citations (2)
Title |
---|
Research on a novel text-dependent speaker recognition method; Zhou Lei et al.; Journal of Shanghai Normal University (Natural Science Edition); 2017-04-15; Vol. 46, No. 02; pp. 224-230 * |
Research on text-independent speaker recognition based on cluster statistics; Deng Haojiang et al.; Journal of Circuits and Systems; 2001-09-30; Vol. 6, No. 03; pp. 77-80 * |
Also Published As
Publication number | Publication date |
---|---|
WO2020143263A1 (en) | 2020-07-16 |
SG11202103091XA (en) | 2021-04-29 |
CN109545229A (en) | 2019-03-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109545229B (en) | Speaker recognition method based on voice sample characteristic space track | |
Zhuang et al. | Real-world acoustic event detection | |
Gao et al. | Transition movement models for large vocabulary continuous sign language recognition | |
CN111128128B (en) | Voice keyword detection method based on complementary model scoring fusion | |
CN111696522B (en) | Tibetan language voice recognition method based on HMM and DNN | |
Li et al. | Towards zero-shot learning for automatic phonemic transcription | |
CN116110405B (en) | Land-air conversation speaker identification method and equipment based on semi-supervised learning | |
Srivastava et al. | Significance of neural phonotactic models for large-scale spoken language identification | |
CN111597328A (en) | New event theme extraction method | |
WO2023048746A1 (en) | Speaker-turn-based online speaker diarization with constrained spectral clustering | |
Bluche et al. | Predicting detection filters for small footprint open-vocabulary keyword spotting | |
Bhati et al. | Self-expressing autoencoders for unsupervised spoken term discovery | |
Bhati et al. | Unsupervised Speech Signal to Symbol Transformation for Zero Resource Speech Applications. | |
Frihia et al. | HMM/SVM segmentation and labelling of Arabic speech for speech recognition applications | |
Han et al. | Boosted subunits: a framework for recognising sign language from videos | |
Nwe et al. | Speaker clustering and cluster purification methods for RT07 and RT09 evaluation meeting data | |
Yao et al. | Real time large vocabulary continuous sign language recognition based on OP/Viterbi algorithm | |
Ananthakrishnan et al. | Combining acoustic, lexical, and syntactic evidence for automatic unsupervised prosody labeling | |
Kesiraju et al. | Topic identification of spoken documents using unsupervised acoustic unit discovery | |
Nyodu et al. | Automatic identification of Arunachal language using K-nearest neighbor algorithm | |
Sawakare et al. | Speech recognition techniques: a review | |
Cornaggia-Urrigshardt et al. | Speech recognition lab | |
Chandrakala et al. | Combination of generative models and SVM based classifier for speech emotion recognition | |
Feng et al. | Exploiting language-mismatched phoneme recognizers for unsupervised acoustic modeling | |
Mouaz et al. | A new framework based on KNN and DT for speech identification through emphatic letters in Moroccan dialect |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||