WO2020143263A1 - Speaker identification method based on speech sample feature space trajectory - Google Patents

Speaker identification method based on speech sample feature space trajectory

Info

Publication number
WO2020143263A1
WO2020143263A1 (PCT/CN2019/111530)
Authority
WO
WIPO (PCT)
Prior art keywords
speech
speaker
feature space
sample
voice
Prior art date
Application number
PCT/CN2019/111530
Other languages
French (fr)
Chinese (zh)
Inventor
贺前华
吴克乾
谢伟
庞文丰
Original Assignee
South China University of Technology (华南理工大学)
Priority date
Filing date
Publication date
Application filed by South China University of Technology (华南理工大学)
Priority to SG11202103091XA
Publication of WO2020143263A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04: Training, enrolment or model building
    • G10L17/06: Decision making techniques; Pattern matching strategies
    • G10L17/08: Use of distortion metrics or a particular distance between probe pattern and reference templates


Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

A speaker identification method based on trajectories of speech samples in a feature space. The method comprises: clustering the features of unlabeled speech data to obtain a set of identifiers that represents the speech feature space; registering speakers with labeled speech samples to obtain each speaker's distribution information and motion trajectory information in the speech feature space; and identifying a speech sample to be recognized using the speakers' distribution information in the feature space together with the sample's motion trajectory information. Because the method relies on positioning speech features within the feature space, the complexity of identifying a speaker is low, addressing the high computational complexity of GMM-UBM; moreover, a speech feature space built from speakers of one language can serve as the feature space for speaker identification in another language, realizing data sharing.

Description

A speaker recognition method based on trajectories in the feature space of speech samples

Technical Field
The invention relates to the field of biometric recognition, and in particular to a speaker recognition method based on trajectories of speech samples in a feature space.
Background Art
With the development of artificial intelligence, audio perception has become a hotspot of audio processing research, and audio classification (audio recognition) is its core problem. In engineering applications, audio classification takes the form of speaker recognition, audio event recognition, audio event detection, and so on. Speaker recognition is a kind of identity verification technology, namely a biometric recognition technology. Biometric recognition automatically identifies individuals from their biological characteristics and includes fingerprint recognition, iris recognition, gene recognition, and face recognition. Compared with other identity verification technologies, speaker recognition is more convenient and natural and is less intrusive to the user. It performs identity recognition from the speech signal, offering natural human-computer interaction, easy signal acquisition, and the possibility of remote recognition.
An existing speaker recognition system comprises two stages: a training stage and a recognition stage. In the training stage, the system builds a model for each speaker from collected speech; in the recognition stage, the system matches input speech against the speaker models to make a decision. The system must extract features that reflect the speaker's individuality from the speech signal and build an accurate model that distinguishes the speaker from all others. Currently there are two main families of audio classification technology: generative statistical models, such as the Gaussian mixture model (GMM) and the hidden Markov model (HMM), and deep neural network methods, such as DNN, RNN, or LSTM. Either family requires a large number of labeled training samples, and the deep neural network methods demand even larger sample sizes to reach good recognition performance. GMM- or HMM-based methods give no special consideration to the discriminative information between different audio classes, nor to sharing sample data across classes; for example, the method in "Speaker Verification Using Adapted Gaussian Mixture Models" by Reynolds et al. (Digital Signal Processing 10 (2000), 19–41) has high computational complexity. With large-sample support, deep neural network methods have shown very good performance, for example Google's paper "End-to-End Text-Dependent Speaker Verification" (2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 5115–5119), which uses a neural network to extract features and train on speech. But training a neural network requires a large amount of labeled speech, acquiring such samples is very costly, and deep neural network methods lack interpretability; they are effectively black boxes.
Existing speaker recognition technologies tend to have high computational complexity and require large amounts of annotated speaker speech data to train their models, and collecting such data takes enormous effort. A more convenient and effective speaker recognition method and system is therefore needed.
Summary of the Invention
The purpose of the present invention is to address the deficiencies of the prior art by providing a speaker recognition method based on trajectories of speech samples in a feature space. The speech feature space is independent of speaker, text, and language, so it can be constructed from any qualified speech data, enabling sharing of speech data; and a speaker's speech trajectory can be constructed from even a single sample, so no large amount of annotated speech data is needed, overcoming the prior-art requirement of collecting large amounts of annotated speech data.
The object of the present invention can be achieved by the following technical solution:
A speaker recognition method based on trajectories in the feature space of speech samples, in which a speech sample can be regarded as one motion through the speech feature space, characterized by an activity region and a trajectory within that space. The method comprises the following steps:
Step 1) Construct the speech feature space Ω: cluster unlabeled speech samples in the feature space using a clustering method, and from the sub-class data obtained by clustering generate some expression of that sub-class data as the expression of the speech feature space, Ω = {g_k, k = 1, 2, …, K};
Step 2) Construct speaker knowledge: use clean speech samples annotated with speaker attributes to obtain their distribution information and motion trajectory information on the speech feature space Ω;
Step 3) Speaker recognition: for a speech sample to be recognized, first obtain the sample's spatial distribution expression and trajectory in the speech feature space, then use the speakers' feature space distribution information to compute the difference between the sample distribution and each prior distribution, together with the accumulated local distribution differences along the trajectory, as the basis for the speaker recognition decision.
Further, in the process of constructing the speech feature space Ω in step 1), any clean speech samples can be used, with no constraints on speaker or language.
Further, in step 1), K-means or another clustering method is used to cluster the speech samples in the feature space. The speech feature space expression Ω = {g_k, k = 1, 2, …, K} can consist of sub-class data distribution functions (such as Gaussian distribution functions), cluster center vectors (centroids), or generative models (such as hidden Markov models or neural networks); such identifiers with positioning capability are called feature space identifiers. The number K of class identifiers determines the granularity of the speech feature space expression: the larger K is, the finer the expression. On the other hand, the accuracy of the spatial expression is related to the size of the data: the richer the data, the more complete the expression; likewise, the more targeted the data used to build the space, the more precise the expression will be for a particular problem.
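As a concrete illustration of step 1), the following is a minimal Python sketch, assuming the frame-level features have already been pooled into one matrix and using scikit-learn's KMeans so that the identifiers are centroid vectors; the function name and the library choice are assumptions, not part of the patent, which equally allows Gaussian or generative-model identifiers.

```python
# Minimal sketch of step 1): clustering unlabeled frame features into K
# feature-space identifiers. Centroid-vector identifiers via K-means are
# one of the identifier types the text allows.
import numpy as np
from sklearn.cluster import KMeans

def build_feature_space(frame_features: np.ndarray, K: int = 1024) -> np.ndarray:
    """frame_features: (t_N, d) matrix of pooled short-time frame features.

    Returns a (K, d) matrix whose k-th row is the identifier g_k (a centroid).
    """
    km = KMeans(n_clusters=K, n_init=4, random_state=0).fit(frame_features)
    return km.cluster_centers_  # Omega = {g_k, k = 1..K}
```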
Further, in step 2), the speech feature space is annotated using clean speech samples carrying speaker attribute labels. When a Gaussian distribution g_k(m_k, U_k) is used as the spatial identifier, the speaker feature space distribution information is obtained as follows:
1. Compute the positional correlation degree between each feature f_t of the speech sample and the spatial identifier g_k(m_k, U_k), defined as:

[Equation (1); reproduced only as an image in the original publication]
where the spatial identifier is represented by a multidimensional Gaussian distribution, m_k denotes the mean vector of the k-th Gaussian distribution, and U_k denotes the covariance matrix of the k-th multidimensional Gaussian distribution;
2. Compute the expected value of the positional correlation degree between the speaker's sample set and the spatial identifier g_k(m_k, U_k):

[Equation (2); reproduced only as an image in the original publication]

where the summand of Equation (2) denotes the correlation degree between the t-th frame feature of the n-th sample and the spatial identifier g_k(m_k, U_k);
3. Compute the speaker feature space distribution:

[Equation (3); reproduced only as an image in the original publication]
Further, in step 2), the time-ordered trajectory information of a speaker's speech sample on the speech feature space Ω is represented as the neighborhood sequence Ψ_1 Ψ_2 … Ψ_T of the sample's features, where the δ-neighborhood of a speech feature f_t is Ψ_t = {g_k | d_tk < δ} and d_tk is the Mahalanobis distance between the speech feature and the identifier distribution:

$$d_{tk} = \sqrt{(f_t - m_k)^{\top} U_k^{-1} (f_t - m_k)}$$
Further, the decision threshold δ of the neighborhood Ψ_t = {g_k | d_tk < δ} of a speech feature f_t is chosen with reference to the properties of the normal distribution as 2 < δ < 3.
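A sketch of the trajectory representation follows; the Mahalanobis distance is as defined above, and δ = 2.5 is one admissible value inside the recommended interval (2, 3). The function name and argument layout are assumptions.

```python
# Sketch of the neighborhood sequence Psi_1 ... Psi_T. Each Psi_t holds the
# indices of the identifiers whose Mahalanobis distance to frame f_t is
# below delta; delta = 2.5 sits inside the patent's 2 < delta < 3 range.
import numpy as np

def trajectory(frames, means, inv_covs, delta=2.5):
    """frames: (T, d); means: (K, d); inv_covs: (K, d, d) inverse covariances."""
    psi = []
    for f in frames:
        diff = f - means                                   # (K, d)
        d2 = np.einsum('kd,kde,ke->k', diff, inv_covs, diff)
        psi.append(set(np.flatnonzero(np.sqrt(d2) < delta).tolist()))
    return psi  # list of T identifier index sets
```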
Further, in step 3), the speaker recognition process for a speech sample f = {f_1, f_2, …, f_T} comprises the following steps:

1. Compute the distribution P = (p_1, p_2, …, p_K) of the sample f = {f_1, f_2, …, f_T} in the speech feature space Ω, where each component p_k is given by:

[Equation (4); reproduced only as an image in the original publication]

2. Determine the motion trajectory Ψ_1 Ψ_2 … Ψ_T of the sample in Ω, with Ψ_t = {g_k | d_tk < δ};

3. Compute the distance between the sample distribution P = (p_1, p_2, …, p_K) and the prior feature space distribution P^(s) of each speaker s:

[Equation (5); reproduced only as an image in the original publication]

then screen the possible solution set S_p expected to contain the true solution:

[Equation (6); reproduced only as an image in the original publication]

4. Compute the distance metric along the motion trajectory Ψ_1 Ψ_2 … Ψ_T:

[Equation (7); reproduced only as an image in the original publication]

and select from S_p the solution with the smallest trajectory distance, completing speaker recognition.
Specifically, in step 3), very good speaker recognition performance can be obtained using either the spatial distribution information P = (p_1, p_2, …, p_K) of the speech sample alone or the motion trajectory distance alone.
Compared with the prior art, the present invention has the following advantages and beneficial effects:

1. In the speaker recognition method provided by the present invention, the speech feature space is built by clustering a large number of speech features, which requires no annotated data; the data samples used to build the space merely need to come from different speakers, with no particular requirements on spoken content, speaker age, or language. This overcomes the neural network methods' need for large amounts of labeled speech, and the data collection for building the speech space is easy to carry out.

2. The method is based on the positions and trajectories of a speaker's speech features within the speech feature space. Unlike signal-source generative model methods such as the hidden Markov model (HMM), positioning is relative, whereas a generative model is absolute. Compared with deep neural network methods, the method is interpretable: every piece of knowledge data has definite physical semantics. For example, the correlation degree distribution P = (p_1, p_2, …, p_K) of a sample's features over the space Ω expresses both the sample's activity region (the space represented by the identifier subset corresponding to the non-zero elements) and the sample's distribution within that region.

3. The essence of the method is locating speech features within a space. The speech features of different speakers are located in the established speech feature space, and correlation degrees represent each speaker's feature positioning information, so the distinction between speakers is expressed with little computation. Compared with GMM or HMM methods, which must fit a generative model for every speaker, the method has lower computational complexity.

4. The speech feature space identifier set is a reference system for locating a speaker's speech features; the relationship is relative, with no strict requirement tying it to the samples to be recognized. The feature space is therefore shareable, and an established space can be transferred to other speaker data sets for recognition. For example, a speech feature space built from speakers of one language can serve as the speech feature space for speaker recognition in another language, realizing data sharing.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of the speaker recognition method in Embodiment 1 of the present invention.

FIG. 2 is a flowchart of the steps for establishing the speech feature space in Embodiment 1 of the present invention.

FIG. 3 is a flowchart of the steps for generating the spatial distribution information and trajectory information of a speaker's speech features in Embodiment 1 of the present invention.

FIG. 4 is a flowchart of the steps for recognizing a speech sample to be recognized in Embodiment 1 of the present invention.
Detailed Description
The present invention is described in further detail below with reference to the embodiments and the drawings, but the embodiments of the present invention are not limited thereto.
Embodiment 1:
This embodiment provides a speaker recognition method based on trajectories in the feature space of speech samples. The schematic flowchart is shown in FIG. 1 and comprises the following three steps:

1) Establish the speech feature space Ω. Any clean speech samples can be used, with no constraints on speaker, language, or other factors; a clustering method is then applied to cluster the speech samples in the feature space, and the resulting sub-class data are expressed as the speech feature space {g_k, k = 1, 2, …, K};

2) Construct speaker knowledge, consisting of the speaker's distribution information in the speech feature space and motion trajectory information;

3) For a speech sample to be recognized, perform recognition using the speakers' feature space distribution information and the sample's motion trajectory information.
Referring to FIG. 2, a flowchart of the steps for establishing the speech feature space in this embodiment: the speaker speech data of the aishell Chinese corpus are used as the unlabeled speech sample set. aishell contains 400 speakers in total, and 60 wav files per speaker are selected to train the speech feature space. The 12-dimensional MFCC features of the unlabeled speech sample set X = {x_1, x_2, …, x_N} are extracted, giving the feature set F_x = {f_i^x, i = 1, 2, …, t_N}, where f_i^x is a short-time frame feature and t_N is the total number of frames over all samples;
The feature sequence F_x = {f_i^x, i = 1, 2, …, t_N} is then used to train a GMM with K mixture components; the GMM's weights are discarded, and each Gaussian component is retained as an identifier of the speech feature space. Here K, the number of identifiers of the audio feature space, is set to 4096 so as to describe the audio feature space with high precision;
The speech feature space identifiers are expressed as Ω = {g_k, k = 1, 2, …, K}, where g = N(m, U) is a multidimensional Gaussian distribution function;
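A sketch of this construction follows, assuming librosa for the 12-dimensional MFCC extraction and scikit-learn's GaussianMixture (with diagonal covariances for tractability at large K) in place of whatever GMM trainer the inventors used; all of these library and parameter choices are assumptions.

```python
# Sketch of Embodiment 1's space construction: pool 12-dim MFCC frames over
# the unlabeled corpus, fit a K-component GMM, discard the mixture weights,
# and keep each Gaussian component (mean, covariance) as an identifier.
# librosa + scikit-learn and diagonal covariances are assumptions.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(wav_path, n_mfcc=12):
    y, sr = librosa.load(wav_path, sr=None)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # (T, 12)

def train_identifiers(wav_paths, K=4096):
    F = np.vstack([mfcc_frames(p) for p in wav_paths])  # feature set F_x
    gmm = GaussianMixture(n_components=K, covariance_type='diag',
                          max_iter=50).fit(F)
    return gmm.means_, gmm.covariances_  # weights deliberately discarded
```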
Referring to FIG. 3, a flowchart of the steps for generating the speaker feature space distribution information in this embodiment: for each speaker in aishell, 20 wav files are used to annotate the speech feature space. The target speaker speech sample set is Y = {(y_1, s_1), (y_2, s_2), …, (y_M, s_M)}, where s_i ∈ S = {S_l, l = 1, 2, …, L} (the speaker set); the samples of speaker S_l are Y_l = {y_m | s_m = S_l, m = 1, 2, …, M}, and their audio feature sequences are extracted.

The positional correlation degree between every feature f_t of a speech sample and the spatial identifier g_k(m_k, U_k) is computed as in Equation (1).
The expected value of the positional correlation degree between the speaker's sample set and the spatial identifier g_k(m_k, U_k) is computed as in Equation (2), whose summand is the positional correlation degree between the t-th frame feature of the n-th sample and the identifier g_k(m_k, U_k); the speaker feature space distribution is then computed as in Equation (3).
The registered speech of every speaker in the target speaker set is processed in this way, yielding each speaker's speech feature distribution information.
The time-ordered trajectory information of a speech sample is represented as the neighborhood sequence Ψ_1 Ψ_2 … Ψ_T of the sample's features, where the δ-neighborhood of a sample feature f_t is Ψ_t = {g_k | d_tk < δ} and d_tk is the Mahalanobis distance between the feature and the identifier distribution, as defined above.
Referring to FIG. 4, a flowchart of the steps for recognizing a speech sample in this embodiment: the features of the speech sample to be recognized are f = {f_1, f_2, …, f_T}, and the recognition process is as follows:
Compute the distribution P = (p_1, p_2, …, p_K) of the speech f = {f_1, f_2, …, f_T} in the feature space Ω.
Here the positional correlation degree between a feature f_t and the spatial identifier g_k(m_k, U_k) is computed as in Equation (1), and the positional correlation degree between the sample of the speaker to be recognized and g_k(m_k, U_k) as in Equation (2).
Determine the trajectory Ψ_1 Ψ_2 … Ψ_T of the speech features f = {f_1, f_2, …, f_T} in the feature space Ω, where Ψ_t = {g_k | d_tk < δ}.
Compute the distance between the sample distribution P = (p_1, p_2, …, p_K) and the prior feature space distribution P^(s) of each speaker s as in Equation (5), with α taken as 2, and screen the possible solution set S_p containing the true solution as in Equation (6): the 10 speakers with the smallest distances are selected as the candidate recognition results.
Compute the distance metric along the trajectory Ψ_1 Ψ_2 … Ψ_T as in Equation (7), with α taken as 2, and select from the 10 candidates the speaker with the smallest trajectory distance as the recognition result.
Embodiment 2:
This embodiment provides a speaker recognition method based on trajectories in the feature space of speech samples, comprising the following steps:
Step 1: use the speech data of the English corpus TIMIT to establish the speech feature space identifier set;
Step 2: use the speech data in the aishell corpus to register the target speaker set, as in Embodiment 1;
Step 3: recognize the speech samples to be recognized, as in Embodiment 1.
The recognition performance obtained differs only slightly from that of Embodiment 1, which demonstrates that a speech feature space built from speakers of one language can serve as the speech feature space for speaker recognition in another language, realizing data sharing.
The above is only a preferred embodiment of the present invention, but the protection scope of the invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art within the scope disclosed by this invention, in accordance with its technical solution and inventive concept, falls within the protection scope of this invention.

Claims (7)

  1. A speaker recognition method based on trajectories in the feature space of speech samples, in which a speech sample can be regarded as one motion through the speech feature space, having an activity region and a trajectory within that space, characterized in that the method comprises the following steps:
    Step 1) Construct the speech feature space Ω: cluster unlabeled speech samples in the feature space using a clustering method, and from the sub-class data obtained by clustering generate some expression of that sub-class data as the expression of the speech feature space, Ω = {g_k, k = 1, 2, …, K};
    Step 2) Construct speaker knowledge: use clean speech samples annotated with speaker attributes to obtain their distribution information and motion trajectory information on the speech feature space Ω;
    Step 3) Speaker recognition: for a speech sample to be recognized, first obtain the sample's spatial distribution expression and trajectory in the speech feature space, then use the speakers' feature space distribution information to compute the difference between the sample distribution and each prior distribution, together with the accumulated local distribution differences along the trajectory, as the basis for the speaker recognition decision.
  2. The speaker recognition method based on trajectories in the feature space of speech samples according to claim 1, characterized in that in the process of constructing the speech feature space Ω in step 1), any clean speech samples can be used, with no constraints on speaker or language.
  3. The speaker recognition method based on trajectories in the feature space of speech samples according to claim 1, characterized in that the speech feature space expression Ω = {g_k, k = 1, 2, …, K} can consist of sub-class data distribution functions, cluster center vectors, or generative models; such identifiers with positioning capability are called feature space identifiers, and the number K of class identifiers used by the speech feature space determines the granularity of the speech feature space expression: the larger K is, the finer the expression.
  4. The speaker recognition method based on trajectories in the feature space of speech samples according to claim 1, characterized in that in step 2) the speech feature space is annotated using clean speech samples carrying speaker attribute labels, and when a Gaussian distribution g_k(m_k, U_k) is used as the spatial identifier, the speaker feature space distribution information is obtained as follows:

    1. compute the positional correlation degree between each feature f_t of the speech sample and the spatial identifier g_k(m_k, U_k), as defined by Equation (1) of the description, where the spatial identifier is represented by a multidimensional Gaussian distribution, m_k denotes the mean vector of the k-th Gaussian distribution, and U_k denotes the covariance matrix of the k-th multidimensional Gaussian distribution;

    2. compute the expected value of the positional correlation degree between the speaker's sample set and the spatial identifier g_k(m_k, U_k), as in Equation (2) of the description, whose summand denotes the correlation degree between the t-th frame feature of the n-th sample and g_k(m_k, U_k);

    3. compute the speaker feature space distribution, as in Equation (3) of the description.
  5. The speaker recognition method based on trajectories in the feature space of speech samples according to claim 4, characterized in that in step 2) the time-ordered trajectory information of a speaker's speech sample on the speech feature space Ω is represented as the neighborhood sequence Ψ_1 Ψ_2 … Ψ_T of the sample's features, where the δ-neighborhood of a speech feature f_t is Ψ_t = {g_k | d_tk < δ} and d_tk is the Mahalanobis distance between the speech feature and the identifier distribution:

    $$d_{tk} = \sqrt{(f_t - m_k)^{\top} U_k^{-1} (f_t - m_k)}$$
  6. The speaker recognition method based on trajectories in the feature space of speech samples according to claim 5, characterized in that the decision threshold δ of the neighborhood Ψ_t = {g_k | d_tk < δ} of a speech feature f_t is chosen with reference to the properties of the normal distribution as 2 < δ < 3.
  7. The speaker recognition method based on trajectories in the feature space of speech samples according to claim 5, characterized in that in step 3) the speaker recognition process for a speech sample f = {f_1, f_2, …, f_T} comprises the following steps:

    1. compute the distribution P = (p_1, p_2, …, p_K) of the sample f = {f_1, f_2, …, f_T} in the speech feature space Ω, each component being given by Equation (4) of the description;

    2. determine the motion trajectory Ψ_1 Ψ_2 … Ψ_T of the sample in Ω, with Ψ_t = {g_k | d_tk < δ};

    3. compute the distance between the sample distribution P = (p_1, p_2, …, p_K) and the prior feature space distribution P^(s) of each speaker s, as in Equation (5) of the description, then screen the possible solution set S_p containing the true solution, as in Equation (6);

    4. compute the distance metric along the motion trajectory Ψ_1 Ψ_2 … Ψ_T, as in Equation (7) of the description, and select from S_p the solution with the smallest trajectory distance, completing speaker recognition.
PCT/CN2019/111530 2019-01-11 2019-10-16 Speaker identification method based on speech sample feature space trajectory WO2020143263A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
SG11202103091XA (en) 2019-01-11 2019-10-16 A Speaker Recognition Method Based on Trajectories in Feature Spaces of Voice Samples

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910027145.3A CN109545229B (en) 2019-01-11 2019-01-11 Speaker recognition method based on voice sample characteristic space track
CN201910027145.3 2019-01-11

Publications (1)

Publication Number Publication Date
WO2020143263A1

Family

ID=65835222

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/111530 WO2020143263A1 (en) 2019-01-11 2019-10-16 Speaker identification method based on speech sample feature space trajectory

Country Status (3)

Country Link
CN (1) CN109545229B (en)
SG (1) SG11202103091XA (en)
WO (1) WO2020143263A1 (en)


Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109545229B (en) * 2019-01-11 2023-04-21 华南理工大学 Speaker recognition method based on voice sample characteristic space track
CN111081261B (en) * 2019-12-25 2023-04-21 华南理工大学 Text-independent voiceprint recognition method based on LDA
CN111128128B (en) * 2019-12-26 2023-05-23 华南理工大学 Voice keyword detection method based on complementary model scoring fusion
CN111933156B (en) * 2020-09-25 2021-01-19 广州佰锐网络科技有限公司 High-fidelity audio processing method and device based on multiple feature recognition


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6067517A (en) * 1996-02-02 2000-05-23 International Business Machines Corporation Transcription of speech data with segments from acoustically dissimilar environments
CN1302456C (en) * 2005-04-01 2007-02-28 郑方 Sound veins identifying method
JP4901657B2 (en) * 2007-09-05 2012-03-21 日本電信電話株式会社 Voice recognition apparatus, method thereof, program thereof, and recording medium

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5598507A (en) * 1994-04-12 1997-01-28 Xerox Corporation Method of speaker clustering for unknown speakers in conversational audio data
CN102024455A (en) * 2009-09-10 2011-04-20 索尼株式会社 Speaker recognition system and method
CN102479511A (en) * 2010-11-23 2012-05-30 盛乐信息技术(上海)有限公司 Large-scale voiceprint authentication method and system
CN105845141A (en) * 2016-03-23 2016-08-10 广州势必可赢网络科技有限公司 Speaker confirmation model, speaker confirmation method and speaker confirmation device based on channel robustness
US20180342250A1 (en) * 2017-05-24 2018-11-29 AffectLayer, Inc. Automatic speaker identification in calls
CN109065028A (en) * 2018-06-11 2018-12-21 平安科技(深圳)有限公司 Speaker clustering method, device, computer equipment and storage medium
CN109065059A (en) * 2018-09-26 2018-12-21 新巴特(安徽)智能科技有限公司 The method for identifying speaker with the voice cluster that audio frequency characteristics principal component is established
CN109545229A (en) * 2019-01-11 2019-03-29 华南理工大学 A kind of method for distinguishing speek person based on speech samples Feature space trace

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112487978A (en) * 2020-11-30 2021-03-12 清华珠三角研究院 Method and device for positioning speaker in video and computer storage medium
CN112487978B (en) * 2020-11-30 2024-04-16 清华珠三角研究院 Method and device for positioning speaker in video and computer storage medium
CN113611285A (en) * 2021-09-03 2021-11-05 哈尔滨理工大学 Language identification method based on stacked bidirectional time sequence pooling
CN113611285B (en) * 2021-09-03 2023-11-24 哈尔滨理工大学 Language identification method based on stacked bidirectional time sequence pooling
CN117235435A (en) * 2023-11-15 2023-12-15 世优(北京)科技有限公司 Method and device for determining audio signal loss function
CN117235435B (en) * 2023-11-15 2024-02-20 世优(北京)科技有限公司 Method and device for determining audio signal loss function

Also Published As

Publication number Publication date
CN109545229A (en) 2019-03-29
CN109545229B (en) 2023-04-21
SG11202103091XA (en) 2021-04-29

Similar Documents

Publication Publication Date Title
WO2020143263A1 (en) Speaker identification method based on speech sample feature space trajectory
Gao et al. Transition movement models for large vocabulary continuous sign language recognition
Zhuang et al. Real-world acoustic event detection
Ryu et al. Out-of-domain detection based on generative adversarial network
Wang et al. Recognizing human emotional state from audiovisual signals
Tang et al. Partially supervised speaker clustering
CN107301858B (en) Audio classification method based on audio characteristic space hierarchical description
CN111128128B (en) Voice keyword detection method based on complementary model scoring fusion
CN109800308B (en) Short text classification method based on part-of-speech and fuzzy pattern recognition combination
CN108520752A (en) A kind of method for recognizing sound-groove and device
US20230089308A1 (en) Speaker-Turn-Based Online Speaker Diarization with Constrained Spectral Clustering
Fang et al. A novel approach to automatically extracting basic units from chinese sign language
Van Leeuwen Speaker linking in large data sets
Bhati et al. Unsupervised Speech Signal to Symbol Transformation for Zero Resource Speech Applications.
Han et al. Boosted subunits: a framework for recognising sign language from videos
Debnath et al. RETRACTED ARTICLE: Audio-Visual Automatic Speech Recognition Towards Education for Disabilities
Chen et al. Self-lifting: A novel framework for unsupervised voice-face association learning
Nguyen et al. Joint deep cross-domain transfer learning for emotion recognition
Yao et al. Real time large vocabulary continuous sign language recognition based on OP/Viterbi algorithm
Lang et al. Study of face detection algorithm for real-time face detection system
CN116050419A (en) Unsupervised identification method and system oriented to scientific literature knowledge entity
CN113658582B (en) Lip language identification method and system for audio-visual collaboration
Polat et al. Unsupervised term discovery for continuous sign language
CN110807370B (en) Conference speaker identity noninductive confirmation method based on multiple modes
Qi-Rong et al. A novel hierarchical speech emotion recognition method based on improved DDAGSVM

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19908700

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19908700

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 03.11.2021)
