CN109545229B - Speaker recognition method based on voice sample characteristic space track - Google Patents

Speaker recognition method based on voice sample characteristic space track

Info

Publication number
CN109545229B
Authority
CN
China
Prior art keywords
voice
speaker
sample
feature space
space
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910027145.3A
Other languages
Chinese (zh)
Other versions
CN109545229A (en
Inventor
贺前华
吴克乾
谢伟
庞文丰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201910027145.3A priority Critical patent/CN109545229B/en
Publication of CN109545229A publication Critical patent/CN109545229A/en
Priority to PCT/CN2019/111530 priority patent/WO2020143263A1/en
Priority to SG11202103091XA priority patent/SG11202103091XA/en
Application granted granted Critical
Publication of CN109545229B publication Critical patent/CN109545229B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00: Speaker identification or verification techniques
    • G10L 17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L 17/04: Training, enrolment or model building
    • G10L 17/06: Decision making techniques; Pattern matching strategies
    • G10L 17/08: Use of distortion metrics or a particular distance between probe pattern and reference templates

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a speaker recognition method based on the feature space trajectory of voice samples. The method comprises: clustering features of unlabeled voice data to obtain a representation of the voice feature space as an identifier subset; registering speakers with labeled voice samples to obtain each speaker's distribution information and motion trajectory information in the voice feature space; and recognizing a voice sample to be recognized using the speakers' spatial distribution information of voice features together with the sample's motion trajectory information. Because the invention locates speaker voice features in a shared feature space, the computational complexity of speaker recognition is low, addressing the high complexity of GMM-UBM; moreover, a voice feature space built for speakers of one language can be used to recognize speakers of another language, thereby realizing the sharing of data.

Description

Speaker recognition method based on voice sample characteristic space track
Technical Field
The invention relates to the field of biological feature recognition, in particular to a speaker recognition method based on a voice sample feature space track.
Background
With the development of artificial intelligence technology, audio perception has become a hotspot in audio processing research, and audio classification or audio recognition is its core problem; in engineering applications, audio classification appears as speaker recognition, audio event detection, and the like. Speaker recognition is an identity verification technology, i.e., a biometric recognition technology. Biometric recognition automatically identifies an individual by biological characteristics, and includes fingerprint recognition, iris recognition, gene recognition, face recognition, and so on. Compared with other identity verification technologies, speaker recognition is more convenient and natural, and less intrusive to the user. Because it works on the voice signal, speaker recognition offers natural human-machine interaction, easily captured input, and the possibility of remote verification.
Existing speaker recognition systems include two phases: training and recognition. In the training phase, the system builds a model for each speaker from collected speaker voices; in the recognition phase, it matches input speech against the speaker models to make a decision. Such a system must extract features that reflect the speaker's individuality from the speech signal and build a model accurate enough to distinguish that speaker from others. Commonly used audio classification techniques fall into two main families: generative statistical models, such as the Gaussian mixture model (GMM) and the hidden Markov model (HMM), and deep neural network methods, such as DNN, RNN, or LSTM. Either family requires a large number of labeled training samples, and the deep neural network methods need even more data to reach good recognition performance. GMM- or HMM-based methods do not specifically exploit the discriminative information between audio classes, nor the sharing of sample data across classes; for example, the method of Reynolds et al. at MIT in the paper "Speaker Verification Using Adapted Gaussian Mixture Models" (Digital Signal Processing (2000), 19-41) has high computational complexity. Deep neural network methods perform well when large samples are available, as in Google's paper "End-to-End Text-Dependent Speaker Verification" (2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pages 5115-5119), which uses a neural network to extract features from voice and train on them; but training such a network requires a large amount of labeled speech, acquiring that many samples is very expensive, and the method lacks interpretability, behaving largely as a black box.
Existing speaker recognition technology is computationally expensive and requires a large amount of labeled speaker voice data to train its models, and collecting that much labeled data entails a huge workload. A more convenient and efficient speaker recognition method and system is therefore desirable.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a speaker recognition method based on the feature space trajectory of voice samples. The voice feature space is independent of speaker, text, and language, so it can be constructed from any qualified voice data, realizing the sharing of voice data; moreover, a speaker's voice trajectory can be constructed from even a single sample, so a large amount of labeled voice data is not needed, overcoming the prior-art requirement of collecting large labeled corpora.
The aim of the invention can be achieved by the following technical scheme:
a speaker recognition method based on the feature space trajectory of voice samples, wherein a voice sample is regarded as a motion through the voice feature space, having an active region and trajectory characteristics in that space, the method comprising the following steps:
step 1), constructing a voice feature space Ω: cluster unlabeled voice samples in the feature space with a clustering method, and generate from the resulting subclass data an expression of the voice feature space, Ω = {g_k, k = 1, 2, …, K};
Step 2), constructing speaker knowledge: the pure voice sample with speaker attribute labels is utilized to obtain the distribution information and the motion trail information of the pure voice sample on the voice characteristic space omega;
step 3), speaker recognition: for a voice sample to be recognized, first obtain the sample's distribution expression and trajectory in the voice feature space; then, using the speakers' spatial distribution information of voice features, compute both the difference between the sample distribution and each prior distribution and the accumulated local distribution difference along the trajectory, and use these differences as the basis for the speaker recognition decision.
Further, in constructing the voice feature space Ω in step 1), any clean voice sample can be used; there is no constraint on speaker or language.
Further, in step 1), K-means or another clustering method is used to cluster the voice samples in the feature space. The voice feature space expression Ω = {g_k, k = 1, 2, …, K} can be any localization-capable identifier of the subclass data, such as a distribution function (e.g., a Gaussian distribution function), a cluster center vector (centroid), or a generative model (e.g., a hidden Markov model or a neural network); these are called feature space identifiers. The number of identifiers K determines the expression granularity of the voice feature space: the larger K is, the finer the expression. On the other hand, the accuracy of the spatial expression is related to the data scale: the richer the data, the more complete the expression; likewise, the more targeted the data used to construct the voice feature space, the more accurate the expression will be for a particular problem.
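The clustering step above can be sketched in code. This is a minimal illustrative sketch, not the patented implementation: it uses a plain k-means with a deterministic farthest-point initialization (the patent allows K-means "or other clustering") and keeps one Gaussian-style identifier (mean vector, covariance matrix) per subclass; all function and variable names are the author's hypothetical choices.

```python
import numpy as np

def build_feature_space(features, K, n_iter=50):
    """Cluster unlabeled frame features into K subclasses and return one
    identifier g_k = (mean vector m_k, covariance matrix U_k) per subclass.
    `features` is an (N, D) array of frame-level features (e.g. MFCCs)."""
    # Deterministic farthest-point initialization: start from the first
    # frame, then repeatedly add the frame farthest from existing centers.
    centers = [features[0]]
    for _ in range(1, K):
        d = np.min([np.linalg.norm(features - c, axis=1) for c in centers], axis=0)
        centers.append(features[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        # Assign each frame to its nearest center (Euclidean distance).
        d = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Re-estimate centers; keep the old center if a cluster empties.
        for k in range(K):
            if np.any(labels == k):
                centers[k] = features[labels == k].mean(axis=0)
    # One Gaussian identifier per subclass; regularize the covariance so
    # it stays invertible for later Mahalanobis computations.
    identifiers = []
    for k in range(K):
        pts = features[labels == k]
        m_k = pts.mean(axis=0)
        U_k = np.cov(pts, rowvar=False) + 1e-6 * np.eye(features.shape[1])
        identifiers.append((m_k, U_k))
    return identifiers

# Toy demo: two well-separated blobs stand in for speech frames.
rng = np.random.default_rng(1)
F = np.vstack([rng.normal(0, 1, (200, 3)), rng.normal(8, 1, (200, 3))])
space = build_feature_space(F, K=2)
```

In a real system the features would be MFCC frames from many unlabeled speakers (as in the embodiment), and K would be large (the embodiment uses 4096) rather than 2.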
Further, in step 2), clean voice samples with speaker attribute labels are used to annotate the voice feature space, with Gaussian distributions g_k(m_k, U_k) as spatial identifiers; the speaker's feature space distribution information is obtained as follows:
1. compute the position association degree of each feature f_t of a voice sample with each spatial identifier g_k(m_k, U_k), defined as

p_tk = N(f_t; m_k, U_k) / Σ_{j=1}^{K} N(f_t; m_j, U_j)

where the spatial identifier is represented by a multidimensional Gaussian distribution, m_k is the mean vector of the k-th Gaussian distribution, and U_k is the covariance matrix of the k-th multidimensional Gaussian distribution;
2. compute the expected position association degree between the speaker's sample set and spatial identifier g_k(m_k, U_k):

p_k^s = (1/N) Σ_{n=1}^{N} (1/T_n) Σ_{t=1}^{T_n} p_tk^(n)

where p_tk^(n) denotes the association degree of the t-th frame feature of the n-th sample with spatial identifier g_k(m_k, U_k);
3. the speaker's feature space distribution is then

P^s = (p_1^s, p_2^s, …, p_K^s).
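The registration step above can be sketched as follows. This is an illustrative sketch: the association degree is taken here as the normalized Gaussian likelihood (a posterior-style "responsibility"), which is an assumption consistent with the text since the original formula is an image; all names are hypothetical.

```python
import numpy as np

def gaussian_pdf(f, m, U):
    """Density of a multivariate Gaussian N(m, U) at point f."""
    D = len(m)
    diff = f - m
    expo = -0.5 * diff @ np.linalg.inv(U) @ diff
    norm = 1.0 / np.sqrt((2 * np.pi) ** D * np.linalg.det(U))
    return norm * np.exp(expo)

def association(f_t, identifiers):
    """Position association degree p_tk of frame f_t with every identifier
    g_k(m_k, U_k), normalized over all K identifiers."""
    lik = np.array([gaussian_pdf(f_t, m, U) for m, U in identifiers])
    return lik / lik.sum()

def speaker_distribution(samples, identifiers):
    """P^s = (p_1^s, ..., p_K^s): average p_tk over each sample's frames,
    then average over the speaker's registration samples."""
    per_sample = [
        np.mean([association(f, identifiers) for f in frames], axis=0)
        for frames in samples  # each sample is a (T_n, D) array of frames
    ]
    return np.mean(per_sample, axis=0)

# Toy demo with two 2-D identifiers; the frames sit near the first one.
ids = [(np.zeros(2), np.eye(2)), (np.full(2, 5.0), np.eye(2))]
sample = np.zeros((4, 2))
P = speaker_distribution([sample], ids)
```

Each speaker in the target set would be registered this way from their labeled samples, yielding one K-dimensional distribution vector P^s per speaker.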
further, in step 2), the motion trail timing information of the speaker voice sample in the voice feature space Ω is represented as a neighborhood sequence ψ of voice sample features 1 Ψ 2 …Ψ T Whereas speech sample feature f t Delta neighborhood of be ψ t ={g k |d tk < delta }, where d tk Mahalanobis distance (Mahalanobis distance) for speech sample characteristics and speech sample distribution, i.e
Figure GDA0004077923440000035
Further, the decision threshold δ of the δ-neighborhood Ψ_t = {g_k | d_tk < δ} of voice sample feature f_t follows the characteristics of the normal distribution, and 2 < δ < 3 is chosen.
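The trajectory construction above can be sketched as follows: for each frame, collect the indices of all identifiers within Mahalanobis distance δ. A minimal sketch with hypothetical names:

```python
import numpy as np

def mahalanobis(f_t, m_k, U_k):
    """d_tk: Mahalanobis distance from frame f_t to identifier g_k(m_k, U_k)."""
    diff = f_t - m_k
    return float(np.sqrt(diff @ np.linalg.inv(U_k) @ diff))

def trajectory(frames, identifiers, delta=2.5):
    """Neighborhood sequence Psi_1 ... Psi_T: for each frame, the index set
    of identifiers within Mahalanobis distance delta (2 < delta < 3 per
    the text above)."""
    return [
        {k for k, (m, U) in enumerate(identifiers)
         if mahalanobis(f, m, U) < delta}
        for f in frames
    ]

# Toy demo: one frame near each of two unit-covariance identifiers.
ids = [(np.zeros(2), np.eye(2)), (np.full(2, 10.0), np.eye(2))]
frames = np.array([[0.1, 0.0], [10.0, 9.9]])
psi = trajectory(frames, ids)
```

With unit covariance the Mahalanobis distance reduces to the Euclidean distance, so each frame's neighborhood contains exactly the identifier it sits next to.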
Further, in step 3), speaker recognition for a voice sample f = {f_1, f_2, …, f_T} comprises the following steps:
1. compute the distribution P = (p_1, p_2, …, p_K) of the voice sample in the voice feature space Ω, where

p_k = (1/T) Σ_{t=1}^{T} p_tk;

2. determine the motion trajectory Ψ_1 Ψ_2 … Ψ_T of the voice sample in the voice feature space Ω, with Ψ_t = {g_k | d_tk < δ};
3. compute the distance between the sample distribution P = (p_1, p_2, …, p_K) and the prior feature space distribution P^s = (p_1^s, …, p_K^s) of each speaker s,

D(P, P^s) = Σ_{k=1}^{K} |p_k − p_k^s|^β,

then screen a candidate solution set S_p containing the true solution, consisting of the speakers with the smallest distances;
4. compute the trajectory distance measure

D_Ψ(s) = Σ_{t=1}^{T} Σ_{g_k ∈ Ψ_t} |p_k − p_k^s|^β

and select from S_p the speaker s* = argmin_{s ∈ S_p} D_Ψ(s); speaker recognition is then complete.
Specifically, in step 3), using only the spatial distribution information P = (p_1, p_2, …, p_K) of the voice sample, or only the trajectory distance measure D_Ψ, can already yield good speaker recognition performance.
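The two-stage decision above can be sketched as follows. Since the original distance formulas are images, the forms used here (a β-power elementwise distance for screening, and the same local difference accumulated over the trajectory's identifier sets) are assumptions consistent with the surrounding text; all names are hypothetical.

```python
import numpy as np

def recognize(P, traj, speaker_dists, beta=2, n_candidates=3):
    """Two-stage recognition sketch:
    1) screen candidates S_p by the global distribution distance
       D(P, P^s) = sum_k |p_k - p_k^s|^beta,
    2) among candidates, pick the speaker minimizing the accumulated
       local difference along the trajectory,
       D_psi(s) = sum_t sum_{k in Psi_t} |p_k - p_k^s|^beta."""
    names = list(speaker_dists)
    d_global = {s: float(np.sum(np.abs(P - speaker_dists[s]) ** beta))
                for s in names}
    # Candidate set S_p: the n_candidates speakers with smallest distance.
    cands = sorted(names, key=d_global.get)[:n_candidates]
    d_traj = {
        s: sum(abs(P[k] - speaker_dists[s][k]) ** beta
               for psi_t in traj for k in psi_t)
        for s in cands
    }
    return min(cands, key=d_traj.get)

# Toy demo: the test sample's distribution matches registered speaker "A".
P = np.array([0.7, 0.2, 0.1])
dists = {"A": np.array([0.7, 0.2, 0.1]), "B": np.array([0.1, 0.2, 0.7])}
best = recognize(P, traj=[{0}, {0, 1}], speaker_dists=dists, n_candidates=2)
```

In the embodiment below, β = 2 and the candidate set keeps the 10 nearest speakers before the trajectory distance breaks the tie.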
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The speaker recognition method based on the voice sample feature space trajectory clusters a large number of voice features without needing labeled data; the data samples used to establish the voice feature space may come from different speakers, with no specific requirements on speech content, speaker age, or language. This removes the neural network methods' need for large labeled corpora and makes data acquisition for building the voice space easy to realize.
2. The method is based on the positioning and trajectory information of speaker voice features in the voice feature space, which is relative, unlike signal-source generative model methods such as hidden Markov models (HMM), whose models are absolute. Compared with deep neural network methods it is interpretable, and each piece of knowledge carries physical semantics: for example, the association degree distribution P = (p_1, p_2, …, p_K) of sample features over the space Ω expresses both the sample's active region (the subspace represented by the identifier subset corresponding to the non-zero elements) and the distribution within that region.
3. The method essentially locates voice features in space: the voice features of different speakers are positioned in the established voice feature space, and their positioning information is represented by association degrees, so the distinctions among speakers are expressed with little computation. Compared with GMM or HMM methods, which must fit a generative model for each speaker, the computational complexity is lower.
4. The identifier subset of the voice feature space is a reference frame for positioning speaker voice features; it encodes a relative relation and imposes no strict relation on the sample to be recognized, so feature spaces are shareable, and an established voice feature space can be transferred to other speaker data sets for recognition. For example, a voice feature space built in one language can be used as the voice feature space for recognizing speakers of another language, realizing the sharing of data.
Drawings
Fig. 1 is a schematic flow chart of a speaker recognition method in embodiment 1 of the present invention.
Fig. 2 is a flowchart illustrating steps for establishing a speech feature space in embodiment 1 of the present invention.
Fig. 3 is a flowchart illustrating steps for generating spatial distribution information and trajectory information of speaker voice characteristics in embodiment 1 of the present invention.
Fig. 4 is a flowchart of steps for recognizing a voice sample to be recognized in embodiment 1 of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but embodiments of the present invention are not limited thereto.
Example 1:
the embodiment provides a speaker recognition method based on a voice sample feature space track, and a schematic flow chart is shown in fig. 1, and comprises the following three steps:
1) Establish the voice feature space Ω. Any clean voice sample can be used, with no constraint on speaker, language, or other factors; the voice samples are then clustered in the feature space with a clustering method, and the resulting subclass data are expressed as the voice feature space expression {g_k, k = 1, 2, …, K};
2) Constructing speaker knowledge, including two parts of distribution information and motion trail information of the speaker in a voice feature space;
3) And for the voice sample to be identified, identifying by utilizing the space distribution information of the voice characteristics of the speaker and the motion trail information of the voice sample.
Referring to fig. 2, a flowchart of the steps for establishing the voice feature space in this embodiment is shown. Speaker voice data from the AISHELL Chinese corpus is used as the unlabeled voice sample set; AISHELL contains 400 speakers in total, and 60 wav files per person are selected for training the voice feature space. From the unlabeled voice sample set X = {x_1, x_2, …, x_N}, 12-dimensional MFCC features are extracted, giving the feature set F_x = {f_i^x, i = 1, 2, …, t_N}, where f_i^x is a short-time frame feature and t_N is the total number of frames over all samples;
then the feature sequence F_x = {f_i^x, i = 1, 2, …, t_N} is used to train a GMM with K mixture components; the GMM's weight information is discarded, and each Gaussian component is kept as one identifier of the voice feature space. Here K is the number of audio feature space identifiers, and K = 4096 is chosen so as to describe the audio feature space with relatively high precision;
the voice feature space identifiers are denoted Ω = {g_k, k = 1, 2, …, K}, where g = N(m, U) is a multidimensional Gaussian distribution function;
referring to fig. 3, a flowchart of the steps for generating speaker characteristic spatial distribution information in this embodiment is shown. For each person in the aishell, 20 wav files are used to annotate the speech feature space. Target speaker speech sample set y= { (Y) 1 ,s 1 ),(y 2 ,s 2 ),.....,(y M ,s M )},s i ∈S={S l L=1, 2, …, L } (speaker set), speaker S l Is Y as a sample of (C) l ={y m |s m =S l M=1, 2, …, M }, extracting its audio feature sequence as
Figure GDA0004077923440000061
Compute the position association degree of every feature f_t of a voice sample with each spatial identifier g_k(m_k, U_k):

p_tk = N(f_t; m_k, U_k) / Σ_{j=1}^{K} N(f_t; m_j, U_j)

then compute the expected position association degree between the speaker's sample set and spatial identifier g_k(m_k, U_k):

p_k^s = (1/N) Σ_{n=1}^{N} (1/T_n) Σ_{t=1}^{T_n} p_tk^(n)

where p_tk^(n) is the position association degree of the t-th frame feature of the n-th sample with identifier g_k(m_k, U_k);
the speaker feature space distribution is then

P^s = (p_1^s, p_2^s, …, p_K^s).
and processing the registered voice of each speaker in the target speaker set to obtain voice characteristic distribution information of each speaker.
The motion trajectory timing information of a voice sample is represented as the neighborhood sequence Ψ_1 Ψ_2 … Ψ_T of the sample's features, where the δ-neighborhood of sample feature f_t is Ψ_t = {g_k | d_tk < δ} and d_tk is the Mahalanobis distance between feature and distribution, i.e.

d_tk = sqrt( (f_t − m_k)^T U_k^{−1} (f_t − m_k) )
Referring to fig. 4, a flowchart of the steps for recognizing a voice sample in this embodiment is shown. The voice sample to be recognized has features f = {f_1, f_2, …, f_T}, and the recognition process is as follows:
compute the distribution P = (p_1, p_2, …, p_K) of f = {f_1, f_2, …, f_T} in the feature space Ω, where the position association degree of feature f_t with spatial identifier g_k(m_k, U_k) is

p_tk = N(f_t; m_k, U_k) / Σ_{j=1}^{K} N(f_t; m_j, U_j)

and the association degree of the sample to be recognized with spatial identifier g_k(m_k, U_k) is

p_k = (1/T) Σ_{t=1}^{T} p_tk;

determine the trajectory Ψ_1 Ψ_2 … Ψ_T of the features f = {f_1, f_2, …, f_T} in the feature space Ω, where Ψ_t = {g_k | d_tk < δ};
compute the distance between the sample distribution P = (p_1, …, p_K) and the prior feature space distribution P^s of each speaker s,

D(P, P^s) = Σ_{k=1}^{K} |p_k − p_k^s|^β

with β = 2, and screen the candidate solution set S_p containing the true solution by selecting the 10 speakers with the smallest distance as candidate recognition results;
compute the trajectory distance measure

D_Ψ(s) = Σ_{t=1}^{T} Σ_{g_k ∈ Ψ_t} |p_k − p_k^s|^β

with β = 2, and select from the 10 candidate speakers the one with the smallest trajectory distance as the recognition result, i.e., s* = argmin_{s ∈ S_p} D_Ψ(s).
Example 2:
the embodiment provides a speaker recognition method based on a voice sample characteristic space track, which comprises the following steps:
step 1, establish the voice feature space identifier subset using voice data from the TIMIT English corpus;
step 2, register the target speaker set using voice data from the AISHELL corpus, as in Embodiment 1;
step 3, recognize the voice sample to be recognized, as in Embodiment 1.
Compared with Embodiment 1, the recognition performance differs only slightly, demonstrating that a voice feature space built from speakers of one language can serve as the voice feature space for recognizing speakers of another language, thereby realizing the sharing of data.
The above description covers only preferred embodiments of the present invention, but the protection scope of the invention is not limited thereto; any person skilled in the art may make equivalent substitutions or changes to the technical solution and the inventive concept within the scope of the disclosure of the present invention.

Claims (7)

1. A speaker recognition method based on the feature space trajectory of voice samples, wherein a voice sample is regarded as a motion through the voice feature space, having an active region and trajectory characteristics in that space, the method comprising the following steps:
step 1), constructing a voice feature space Ω: cluster unlabeled voice samples in the feature space with a clustering method, and express the resulting subclass data as the voice feature space expression Ω = {g_k, k = 1, 2, …, K}, K being the number of feature space identifiers g_k;
step 2), constructing speaker knowledge: the pure voice sample with speaker attribute labels is utilized to obtain the distribution information and the motion trail information of the pure voice sample on the voice characteristic space omega;
step 3), speaker recognition: for a voice sample to be recognized, first obtain the sample's distribution expression and trajectory in the voice feature space; then, using the speakers' spatial distribution information of voice features, compute both the difference between the sample distribution and each prior distribution and the accumulated local distribution difference along the trajectory, and use these differences as the basis for the speaker recognition decision.
2. The method for speaker recognition based on the spatial trajectory of the characteristics of the speech samples according to claim 1, wherein: in the process of constructing the voice characteristic space omega in the step 1), any pure voice sample can be used, and no constraint is imposed on a speaker and language factors.
3. The speaker recognition method based on the voice sample feature space trajectory according to claim 1, characterized in that: the voice feature space expression Ω = {g_k, k = 1, 2, …, K} consists of localization-capable identifiers g_k of the subclass data, such as distribution functions, cluster center vectors, or generative models, called feature space identifiers; the number K of feature space identifiers used by the voice feature space determines its expression granularity, and the larger K is, the finer the voice feature space expression.
4. The speaker recognition method based on the voice sample feature space trajectory according to claim 1, characterized in that: in step 2), clean voice samples with speaker attribute labels are used to annotate the voice feature space, with Gaussian distributions g_k(m_k, U_k) as feature space identifiers; the speaker feature space distribution information is obtained as follows:
1. compute the position association degree of each feature f_t of a voice sample with each feature space identifier g_k(m_k, U_k), defined as

p_tk = N(f_t; m_k, U_k) / Σ_{j=1}^{K} N(f_t; m_j, U_j)

where the feature space identifier is represented by a multidimensional Gaussian distribution, m_k is the mean vector of the k-th Gaussian distribution, and U_k is the covariance matrix of the k-th multidimensional Gaussian distribution;
2. compute the expected position association degree between the speaker's sample set and feature space identifier g_k(m_k, U_k):

p_k^s = (1/N) Σ_{n=1}^{N} (1/T_n) Σ_{t=1}^{T_n} p_tk^(n)

where p_tk^(n) denotes the association degree of the t-th frame feature of the n-th sample with feature space identifier g_k(m_k, U_k);
3. the speaker feature space distribution is then

P^s = (p_1^s, p_2^s, …, p_K^s).
5. The speaker recognition method based on the voice sample feature space trajectory according to claim 4, characterized in that: in step 2), the motion trajectory timing information of the speaker's voice sample in the voice feature space Ω is represented as the neighborhood sequence Ψ_1 Ψ_2 … Ψ_T of the voice sample's features, where the δ-neighborhood of voice sample feature f_t is Ψ_t = {g_k | d_tk < δ} and d_tk is the Mahalanobis distance between the sample feature and the identifier distribution, i.e.

d_tk = sqrt( (f_t − m_k)^T U_k^{−1} (f_t − m_k) )
6. The speaker recognition method based on the voice sample feature space trajectory according to claim 5, characterized in that: the decision threshold δ of the δ-neighborhood Ψ_t = {g_k | d_tk < δ} of voice sample feature f_t follows the characteristics of the normal distribution, and 2 < δ < 3 is chosen.
7. The speaker recognition method based on the voice sample feature space trajectory according to claim 5, characterized in that in step 3), speaker recognition for a voice sample f = {f_1, f_2, …, f_T} comprises the following steps:
1. compute the distribution P = (p_1, p_2, …, p_K) of the voice sample in the voice feature space Ω, where

p_k = (1/T) Σ_{t=1}^{T} p_tk;

2. determine the motion trajectory Ψ_1 Ψ_2 … Ψ_T of the voice sample in the voice feature space Ω, with Ψ_t = {g_k | d_tk < δ};
3. compute the distance between the sample distribution P = (p_1, p_2, …, p_K) and the prior feature space distribution P^s of each speaker s,

D(P, P^s) = Σ_{k=1}^{K} |p_k − p_k^s|^β,

then screen the candidate solution set S_p containing the true solution, consisting of the speakers with the smallest distances;
4. compute the trajectory distance measure

D_Ψ(s) = Σ_{t=1}^{T} Σ_{g_k ∈ Ψ_t} |p_k − p_k^s|^β

and select from S_p the speaker s* = argmin_{s ∈ S_p} D_Ψ(s); speaker recognition is then complete.
CN201910027145.3A 2019-01-11 2019-01-11 Speaker recognition method based on voice sample characteristic space track Active CN109545229B (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN201910027145.3A CN109545229B (en) 2019-01-11 2019-01-11 Speaker recognition method based on voice sample characteristic space track
PCT/CN2019/111530 WO2020143263A1 (en) 2019-01-11 2019-10-16 Speaker identification method based on speech sample feature space trajectory
SG11202103091XA SG11202103091XA (en) 2019-01-11 2019-10-16 A Speaker Recognition Method Based on Trajectories in Feature Spaces of Voice Samples

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910027145.3A CN109545229B (en) 2019-01-11 2019-01-11 Speaker recognition method based on voice sample characteristic space track

Publications (2)

Publication Number Publication Date
CN109545229A CN109545229A (en) 2019-03-29
CN109545229B true CN109545229B (en) 2023-04-21

Family

ID=65835222

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910027145.3A Active CN109545229B (en) 2019-01-11 2019-01-11 Speaker recognition method based on voice sample characteristic space track

Country Status (3)

Country Link
CN (1) CN109545229B (en)
SG (1) SG11202103091XA (en)
WO (1) WO2020143263A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109545229B (en) * 2019-01-11 2023-04-21 华南理工大学 Speaker recognition method based on voice sample characteristic space track
CN111081261B (en) * 2019-12-25 2023-04-21 华南理工大学 Text-independent voiceprint recognition method based on LDA
CN111128128B (en) * 2019-12-26 2023-05-23 华南理工大学 Voice keyword detection method based on complementary model scoring fusion
CN111933156B (en) * 2020-09-25 2021-01-19 广州佰锐网络科技有限公司 High-fidelity audio processing method and device based on multiple feature recognition
CN112487978B (en) * 2020-11-30 2024-04-16 清华珠三角研究院 Method and device for positioning speaker in video and computer storage medium
CN113611285B (en) * 2021-09-03 2023-11-24 哈尔滨理工大学 Language identification method based on stacked bidirectional time sequence pooling
CN117235435B (en) * 2023-11-15 2024-02-20 世优(北京)科技有限公司 Method and device for determining audio signal loss function

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6067517A (en) * 1996-02-02 2000-05-23 International Business Machines Corporation Transcription of speech data with segments from acoustically dissimilar environments
CN1652206A (en) * 2005-04-01 2005-08-10 郑方 Sound veins identifying method
JP2009063773A (en) * 2007-09-05 2009-03-26 Nippon Telegr & Teleph Corp <Ntt> Speech feature learning device and speech recognition device, and method, program and recording medium thereof
CN102024455A (en) * 2009-09-10 2011-04-20 索尼株式会社 Speaker recognition system and method

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5598507A (en) * 1994-04-12 1997-01-28 Xerox Corporation Method of speaker clustering for unknown speakers in conversational audio data
CN102479511A (en) * 2010-11-23 2012-05-30 盛乐信息技术(上海)有限公司 Large-scale voiceprint authentication method and system
CN105845141A (en) * 2016-03-23 2016-08-10 广州势必可赢网络科技有限公司 Speaker confirmation model, speaker confirmation method and speaker confirmation device based on channel robustness
US10637898B2 (en) * 2017-05-24 2020-04-28 AffectLayer, Inc. Automatic speaker identification in calls
CN109065028B (en) * 2018-06-11 2022-12-30 平安科技(深圳)有限公司 Speaker clustering method, speaker clustering device, computer equipment and storage medium
CN109065059A (en) * 2018-09-26 2018-12-21 新巴特(安徽)智能科技有限公司 The method for identifying speaker with the voice cluster that audio frequency characteristics principal component is established
CN109545229B (en) * 2019-01-11 2023-04-21 华南理工大学 Speaker recognition method based on voice sample characteristic space track

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research on a novel text-dependent speaker recognition method; Zhou Lei et al.; Journal of Shanghai Normal University (Natural Science Edition); 2017-04-15; Vol. 46, No. 02; pp. 224-230 *
Research on text-independent speaker recognition based on clustering statistics; Deng Haojiang et al.; Journal of Circuits and Systems; 2001-09-30; Vol. 6, No. 03; pp. 77-80 *

Also Published As

Publication number Publication date
WO2020143263A1 (en) 2020-07-16
SG11202103091XA (en) 2021-04-29
CN109545229A (en) 2019-03-29

Similar Documents

Publication Publication Date Title
CN109545229B (en) Speaker recognition method based on voice sample characteristic space track
Zhuang et al. Real-world acoustic event detection
Gao et al. Transition movement models for large vocabulary continuous sign language recognition
CN111128128B (en) Voice keyword detection method based on complementary model scoring fusion
CN111696522B (en) Tibetan language voice recognition method based on HMM and DNN
Li et al. Towards zero-shot learning for automatic phonemic transcription
CN116110405B (en) Land-air conversation speaker identification method and equipment based on semi-supervised learning
Srivastava et al. Significance of neural phonotactic models for large-scale spoken language identification
CN111597328A (en) New event theme extraction method
WO2023048746A1 (en) Speaker-turn-based online speaker diarization with constrained spectral clustering
Bluche et al. Predicting detection filters for small footprint open-vocabulary keyword spotting
Bhati et al. Self-expressing autoencoders for unsupervised spoken term discovery
Bhati et al. Unsupervised Speech Signal to Symbol Transformation for Zero Resource Speech Applications.
Frihia et al. HMM/SVM segmentation and labelling of Arabic speech for speech recognition applications
Han et al. Boosted subunits: a framework for recognising sign language from videos
Nwe et al. Speaker clustering and cluster purification methods for RT07 and RT09 evaluation meeting data
Yao et al. Real time large vocabulary continuous sign language recognition based on OP/Viterbi algorithm
Ananthakrishnan et al. Combining acoustic, lexical, and syntactic evidence for automatic unsupervised prosody labeling
Kesiraju et al. Topic identification of spoken documents using unsupervised acoustic unit discovery
Nyodu et al. Automatic identification of Arunachal language using K-nearest neighbor algorithm
Sawakare et al. Speech recognition techniques: a review
Cornaggia-Urrigshardt et al. Speech recognition lab
Chandrakala et al. Combination of generative models and SVM based classifier for speech emotion recognition
Feng et al. Exploiting language-mismatched phoneme recognizers for unsupervised acoustic modeling
Mouaz et al. A new framework based on KNN and DT for speech identification through emphatic letters in Moroccan dialect

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant