WO2019232826A1 - I-vector extraction method, speaker recognition method and apparatus, device, and medium - Google Patents


Info

Publication number
WO2019232826A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
training
speaker
registered
test
Prior art date
Application number
PCT/CN2018/092589
Other languages
French (fr)
Chinese (zh)
Inventor
涂宏 (Tu Hong)
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2019232826A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0638 Interactive procedures
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques specially adapted for comparison or discrimination

Definitions

  • The present application relates to the field of speech recognition, and in particular to an i-vector extraction method, a speaker recognition method and apparatus, a computer device, and a storage medium.
  • Speaker recognition, also called voiceprint recognition, is a biometric authentication technology that identifies a speaker's identity from speaker-specific information contained in a voice signal.
  • The introduction of i-vector (identity-vector) modeling methods based on vector analysis has significantly improved the performance of speaker recognition systems.
  • In vector analysis of a speaker's speech, the channel subspace usually contains speaker information as well.
  • The i-vector approach therefore uses a single low-dimensional total variability space to represent both the speaker subspace and the channel subspace; the speaker's speech is projected into this space by dimensionality reduction to obtain a fixed-length vector representation (the i-vector).
  • An i-vector extraction method includes:
  • acquiring training voice data of a speaker, and extracting the training speech features corresponding to the training voice data;
  • training, based on a preset UBM (universal background model), the total variability subspace corresponding to the preset UBM model;
  • projecting the training speech features onto the total variability subspace to obtain a first i-vector;
  • projecting the first i-vector onto the total variability subspace to obtain a registered i-vector corresponding to the speaker.
  • An i-vector extraction device includes:
  • a voice data acquisition module, configured to acquire training voice data of a speaker and extract the training speech features corresponding to the training voice data;
  • a training variation space module, configured to train, based on a preset UBM model, the total variability subspace corresponding to the preset UBM model;
  • a projection variation space module, configured to project the training speech features onto the total variability subspace to obtain a first i-vector;
  • an i-vector acquisition module, configured to project the first i-vector onto the total variability subspace to obtain a registered i-vector corresponding to the speaker.
  • A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor; the processor implements the steps of the i-vector extraction method when executing the computer-readable instructions.
  • A computer-readable storage medium stores computer-readable instructions; when the computer-readable instructions are executed by a processor, the steps of the i-vector extraction method are implemented.
  • This embodiment also provides a speaker recognition method, including:
  • obtaining test voice data, where the test voice data carries a speaker identifier;
  • processing the test voice data with the i-vector extraction method to obtain a corresponding test i-vector;
  • querying a database based on the speaker identifier to obtain a registered i-vector corresponding to the speaker identifier;
  • using a cosine similarity algorithm to obtain the similarity between the test i-vector and the registered i-vector, and detecting, according to the similarity, whether the test i-vector and the registered i-vector correspond to the same speaker.
  • A speaker recognition device includes:
  • a test data acquisition module, configured to acquire test voice data, where the test voice data carries a speaker identifier;
  • a test vector acquisition module, configured to process the test voice data with the i-vector extraction method to obtain a corresponding test i-vector;
  • a registration vector acquisition module, configured to query a database based on the speaker identifier to obtain a registered i-vector corresponding to the speaker identifier;
  • a speaker determination module, configured to obtain the similarity between the test i-vector and the registered i-vector using a cosine similarity algorithm, and to detect, according to the similarity, whether the test i-vector and the registered i-vector correspond to the same speaker.
  • A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor; when the processor executes the computer-readable instructions, the steps of the speaker recognition method are implemented: obtaining test voice data, where the test voice data carries a speaker identifier; obtaining a corresponding test i-vector based on the test voice data; querying a database based on the speaker identifier to obtain a registered i-vector corresponding to the speaker identifier; and using a cosine similarity algorithm to obtain the similarity between the test i-vector and the registered i-vector, and detecting, according to the similarity, whether the test i-vector and the registered i-vector correspond to the same speaker.
  • One or more non-volatile readable storage media store computer-readable instructions; when the computer-readable instructions are executed by one or more processors, the one or more processors perform the steps of the i-vector extraction method, including projecting the first i-vector onto the total variability subspace to obtain a registered i-vector corresponding to the speaker.
  • One or more non-volatile readable storage media store computer-readable instructions; when the computer-readable instructions are executed by one or more processors, the one or more processors perform the steps of the speaker recognition method: obtaining test voice data carrying a speaker identifier; obtaining the corresponding test i-vector; querying a database based on the speaker identifier to obtain the registered i-vector; and using a cosine similarity algorithm to obtain the similarity between the test i-vector and the registered i-vector, and detecting, according to the similarity, whether they correspond to the same speaker.
  • FIG. 1 is a schematic diagram of an application environment of an i-vector extraction method according to an embodiment of the present application;
  • FIG. 2 is a flowchart of an i-vector extraction method according to an embodiment of the present application;
  • FIG. 3 is a specific flowchart of an i-vector extraction method according to an embodiment of the present application;
  • FIG. 5 is another specific flowchart of an i-vector extraction method according to an embodiment of the present application;
  • FIG. 6 is a specific flowchart of a speaker recognition method according to an embodiment of the present application;
  • FIG. 7 is a schematic block diagram of an i-vector extraction device according to an embodiment of the present application;
  • FIG. 8 is a schematic block diagram of a speaker recognition device according to an embodiment of the present application;
  • FIG. 9 is a schematic diagram of a computer device according to an embodiment of the present application.
  • The i-vector extraction method provided in the embodiments of the present application can be applied in the application environment shown in FIG. 1, where a computer device communicates with a recognition server through a network.
  • The computer device includes, but is not limited to, personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices.
  • The recognition server can be implemented as an independent server or as a server cluster composed of multiple servers.
  • In one embodiment, an i-vector extraction method is provided. The method is described as applied to the recognition server in FIG. 1 and includes the following steps:
  • The speaker's training voice data is the original speech data provided by the speaker.
  • A training speech feature is a speech feature that distinguishes one speaker from others; in this embodiment, Mel-Frequency Cepstral Coefficients (hereinafter, MFCC features) are used as the training speech features.
  • The filters of the Mel scale filter bank are not uniformly distributed on the frequency axis: there are many densely distributed filters in the low frequency region, while in the high frequency region the filters become fewer and sparsely distributed.
  • The resolution of the Mel scale filter bank is thus high in the low frequency part, which is consistent with the auditory characteristics of the human ear; this is the physical meaning of the Mel scale.
  • The preset UBM is a Gaussian mixture model (GMM) that represents the speech feature distribution of a large number of non-specific speakers.
  • UBM training usually uses a large amount of speech data that is independent of specific speakers and channels; the UBM can therefore be considered a speaker-independent model that only fits the distribution of human speech features and does not represent any specific speaker.
  • The UBM model is preset in the recognition server because, during the voiceprint registration phase of the voiceprint recognition process, the voice data available for training a specific speaker is usually very limited.
  • If a GMM were trained directly on the speaker's voice characteristics, the specific speaker's voice data usually could not cover the feature space of the GMM.
  • Instead, the parameters of the UBM model can be adapted according to the training speech features to characterize the personality information of a specific speaker.
  • Features not covered by the training speech can be approximated by similar feature distributions in the UBM model; this approach alleviates the system performance problems caused by insufficient training speech.
  • The total variability subspace, also called the T space (total variability space), is a single, globally shared projection matrix set up to contain all possible speaker information in the voice data.
  • The speaker space and channel space are not separated in the T space.
  • The T space projects high-dimensional sufficient statistics (supervectors) onto low-dimensional i-vectors that can serve as speaker representations, thereby reducing dimensionality.
  • The training process of the T space is: based on the preset UBM model, use vector analysis and the EM (Expectation Maximization) algorithm to iterate until the T space converges.
  • The total variability subspace obtained from the preset UBM model does not distinguish between the speaker space and the channel space; it merges the information of both into one space, which reduces computational complexity and facilitates deriving the i-vector from the total variability subspace.
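In the notation used by the formulas later in this document (a standard i-vector formulation), the T space defines the generative model

$$s = m + Tw, \qquad w \sim \mathcal{N}(0, I)$$

where $s$ is the Gaussian mean supervector adapted to an utterance, $m$ is the speaker- and channel-independent UBM mean supervector, $T$ is the low-rank total variability matrix, and the latent vector $w$ is the i-vector.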
  • The first i-vector is the fixed-length vector representation obtained by projecting the training speech features onto the low-dimensional total variability subspace, that is, an i-vector.
  • The total variability subspace is obtained through step S20; it does not separate the speaker space and the channel space, and directly sets a globally shared T space (total variability space) to contain all possible information in the voice data.
  • The registered i-vector is the fixed-length vector representation obtained by projecting the first i-vector into the low-dimensional total variability subspace a second time; it is recorded in the database of the recognition server and associated with the speaker identifier as an identity reference.
  • Step S40, projecting the first i-vector onto the total variability subspace to obtain the registered i-vector corresponding to the speaker, specifically includes the following steps:
  • Use the formula $s_2 = m + Tw_2$ to project the first i-vector onto the total variability subspace, where $s_2$ is the Gaussian mean supervector obtained from the first i-vector in step S30; $m$ is a speaker- and channel-independent D*G-dimensional supervector formed by concatenating the mean vectors of the UBM model; and $w_2$ is a random vector obeying the standard normal distribution, namely the registered i-vector, whose dimension is M.
  • T (the total variability subspace) in the formula is obtained by computing the high-dimensional sufficient statistics of the UBM model and iteratively updating them with the EM algorithm until the T space converges.
  • The i-vector extraction method provided in this embodiment obtains a first i-vector by projecting the training speech features onto the total variability subspace, and then projects the first i-vector onto the total variability subspace a second time to obtain the registered i-vector. After two projections, that is, two rounds of dimensionality reduction, more noise features are removed from the training speech feature data, which improves the purity of the extracted speaker speech features, while the reduced dimensionality shrinks the computation space and improves the recognition efficiency of speech recognition.
  • The speaker recognition method provided by this embodiment uses this i-vector extraction method for recognition and thereby reduces the complexity of recognition.
  • Step S10, extracting the training speech features corresponding to the training voice data, specifically includes the following steps:
  • S11: Preprocess the training voice data to obtain preprocessed voice data.
  • Step S11, preprocessing the training voice data to obtain the preprocessed voice data, specifically includes the following steps:
  • S111: Perform pre-emphasis processing on the training voice data. Pre-emphasis is a signal processing method that compensates the high-frequency components of the input signal at the transmitting end.
  • The idea of pre-emphasis is to enhance the high-frequency components of the signal at the transmitting end of the transmission line, compensating for their excessive attenuation during transmission so that the receiving end obtains a better signal waveform. Pre-emphasis has no effect on noise, so it effectively improves the output signal-to-noise ratio.
  • Pre-emphasis is applied as $s'_n = s_n - a\,s_{n-1}$, where the coefficient $a$ ranges over $0.9 < a < 1.0$; in practice $a = 0.97$ works well and is used in this embodiment.
  • Pre-emphasis can eliminate interference caused by the vocal cords and lips during vocalization, effectively compensate the suppressed high-frequency part of the training voice data, highlight the high-frequency formants, and strengthen the signal amplitude of the training voice data, all of which help extract the training speech features.
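As a minimal sketch of the pre-emphasis step (the function name and NumPy implementation are ours, not the patent's; a = 0.97 as suggested above):

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, a: float = 0.97) -> np.ndarray:
    """Apply s'(n) = s(n) - a * s(n-1) to boost high-frequency components."""
    return np.append(signal[0], signal[1:] - a * signal[:-1])
```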
  • S112: Perform framing on the pre-emphasized training voice data.
  • Framing is the speech processing technique of cutting the whole voice signal into several segments.
  • The size of each frame is in the range of 10-30 ms, and the frame shift is about 1/2 of the frame length.
  • Frame shift refers to the overlap between two adjacent frames, which avoids excessive change between them.
  • Framing divides the training voice data into several segments of voice data, subdividing the training voice data to facilitate the extraction of training speech features.
  • S113: Perform windowing on the framed training speech data to obtain the preprocessed voice data.
  • Windowing refers to processing the training speech data with a window function; the Hamming window can be selected. With a Hamming window, the calculation formula for windowing is $s'_n = s_n \times \left(0.54 - 0.46\cos\frac{2\pi n}{N-1}\right)$, $0 \le n \le N-1$, where $N$ is the window length, $n$ is the time index, $s_n$ is the signal amplitude in the time domain, and $s'_n$ is the windowed signal amplitude in the time domain.
  • Windowing the training voice data to obtain the preprocessed voice data makes the framed time-domain signal continuous at the frame edges, which helps extract the training speech features of the training voice data.
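A sketch of framing plus Hamming windowing under the parameters suggested above (25 ms frames, half-frame shift; all names are illustrative, and the input is assumed to be at least one frame long):

```python
import numpy as np

def frame_and_window(signal: np.ndarray, sample_rate: int,
                     frame_ms: float = 25.0, shift_ms: float = 12.5) -> np.ndarray:
    """Cut the signal into overlapping frames and apply a Hamming window."""
    frame_len = int(sample_rate * frame_ms / 1000)
    frame_shift = int(sample_rate * shift_ms / 1000)
    n_frames = 1 + (len(signal) - frame_len) // frame_shift  # assumes len(signal) >= frame_len
    frames = np.stack([signal[i * frame_shift:i * frame_shift + frame_len]
                       for i in range(n_frames)])
    # np.hamming(N) implements 0.54 - 0.46 * cos(2*pi*n / (N-1))
    return frames * np.hamming(frame_len)
```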
  • The preprocessing operations on the training voice data in steps S111-S113 provide the basis for extracting the training speech features, making the extracted training speech features more representative of the training voice data for the models subsequently trained on them.
  • S12: Perform a fast Fourier transform on the preprocessed voice data to obtain the frequency spectrum of the training voice data, and obtain the power spectrum of the training voice data from the frequency spectrum.
  • The fast Fourier transform (FFT) converts the preprocessed voice data from signal amplitudes in the time domain to signal amplitudes in the frequency domain (the spectrum).
  • The formula for calculating the spectrum is $s(k) = \sum_{n=0}^{N-1} s(n)\,e^{-2\pi i k n / N}$, $0 \le k \le N-1$, where $N$ is the frame size, $s(k)$ is the signal amplitude in the frequency domain, $s(n)$ is the signal amplitude in the time domain, $n$ is the time index, and $i$ is the imaginary unit.
  • The power spectrum of the preprocessed voice data, hereinafter referred to as the power spectrum of the training voice data, is obtained directly from the spectrum; the formula is $P(k) = \frac{|s(k)|^2}{N}$, where $N$ is the frame size and $s(k)$ is the signal amplitude in the frequency domain.
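Continuing the sketch (the normalization follows the text's $|s(k)|^2/N$; the FFT size of 512 is illustrative):

```python
import numpy as np

def power_spectrum(frames: np.ndarray, n_fft: int = 512) -> np.ndarray:
    """Per-frame FFT magnitudes squared, scaled by the frame size N."""
    spectrum = np.fft.rfft(frames, n=n_fft)   # frequency-domain amplitudes s(k)
    return (np.abs(spectrum) ** 2) / n_fft    # power spectrum |s(k)|^2 / N
```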
  • S13: Process the power spectrum of the training speech data with the Mel scale filter bank to obtain the Mel power spectrum of the training speech data.
  • Processing the power spectrum of the training speech data with the Mel scale filter bank amounts to a Mel frequency analysis of the power spectrum, which is an analysis based on human auditory perception: human hearing is non-linear in frequency.
  • Accordingly, the filters are not uniformly distributed on the frequency axis: there are many densely distributed filters in the low frequency region, while in the high frequency region the filters are fewer and sparsely distributed. The resolution of the Mel scale filter bank is therefore high in the low frequency part, consistent with the hearing characteristics of the human ear; this is the physical meaning of the Mel scale.
  • Applying the Mel scale filter bank segments the frequency-domain signal so that each frequency band corresponds to one energy value; if the number of filters is 22, 22 energy values corresponding to the Mel power spectrum of the training speech data are obtained.
  • The Mel power spectrum obtained after this analysis retains the frequency portions closely related to the characteristics of the human ear, and these portions reflect the characteristics of the training speech data well.
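A sketch of a triangular Mel filter bank consistent with the description above (filters uniformly spaced on the Mel axis; the 22-filter count and 16 kHz rate in the usage comment are illustrative, not mandated by the patent):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters: int, n_fft: int, sample_rate: int) -> np.ndarray:
    """Triangular filters, dense at low and sparse at high frequencies."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

# e.g., 22 energy values per frame: mel_power = power @ mel_filterbank(22, 512, 16000).T
```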
  • The cepstrum is the inverse Fourier transform of the logarithm of a signal's Fourier spectrum; since the general Fourier spectrum is complex-valued, the cepstrum is also called the complex cepstrum.
  • S14: Perform cepstrum analysis on the Mel power spectrum and, based on the cepstrum result, obtain the MFCC features of the training speech data.
  • Through cepstrum analysis of the Mel power spectrum, the features it contains, which are originally too high-dimensional to use directly, are converted into easy-to-use features (MFCC feature vectors used for training or recognition).
  • The MFCC features serve as coefficients for distinguishing different voices, i.e., the training speech features: they reflect the differences between voices and can be used to identify and distinguish the training voice data.
  • Step S14, performing cepstrum analysis on the Mel power spectrum to obtain the MFCC features of the training speech data, includes the following steps:
  • S141: Take the logarithm (log) of the Mel power spectrum to obtain the Mel power spectrum m to be transformed.
  • S142: Perform a discrete cosine transform on the Mel power spectrum m to be transformed to obtain the MFCC features of the training speech data; in general, the 2nd to 13th coefficients are taken as the training speech features, since they reflect the differences between speech data.
  • The formula for the discrete cosine transform of the Mel power spectrum m to be transformed is $c_j = \sum_{k=1}^{N} m_k \cos\left(\frac{\pi j (k - 0.5)}{N}\right)$, where $N$ is the transform length (the number of filter-bank energy values), $m_k$ is the k-th component of the Mel power spectrum to be transformed, and $j$ indexes the resulting coefficients. Because the Mel filters overlap, the energy values obtained with the Mel scale filters are correlated; the discrete cosine transform decorrelates and compresses the Mel power spectrum m, and, compared with the Fourier transform, its result has no imaginary part, which gives the resulting training speech features an obvious computational advantage.
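A sketch of steps S141-S142 (log followed by a DCT-II over the filter-bank energies, keeping the 2nd-13th coefficients as the text suggests; names and the small flooring constant are ours):

```python
import numpy as np

def mfcc_from_mel_power(mel_power: np.ndarray, n_coeffs: int = 12) -> np.ndarray:
    """Cepstrum analysis: log of the Mel power spectrum, then a DCT."""
    log_mel = np.log(mel_power + 1e-10)        # Mel power spectrum m to be transformed
    n = log_mel.shape[-1]
    k = np.arange(1, n + 1)
    # DCT-II basis: cos(pi * j * (k - 0.5) / N)
    basis = np.cos(np.pi * np.outer(np.arange(n), k - 0.5) / n)
    cepstra = log_mel @ basis.T
    return cepstra[..., 1:1 + n_coeffs]        # 2nd..13th coefficients
```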
  • Steps S11-S14 perform feature extraction on the training voice data.
  • The training speech features finally obtained represent the training voice data well; they are used to train the corresponding GMM-UBM model and then to obtain the registered i-vector, so that the registered i-vector obtained in training is more accurate when performing speech recognition.
  • The features extracted above are MFCC features, but the training speech features should not be limited to MFCC features: any speech features that effectively reflect the characteristics of the voice data can be used as training speech features for recognition and model training.
  • In this embodiment, the training voice data is preprocessed to obtain the corresponding preprocessed voice data. Preprocessing the training voice data allows its training speech features to be extracted more effectively, so that the extracted training speech features better represent the training voice data when used for speech recognition.
  • Step S20, training the total variability subspace corresponding to the preset UBM model based on the preset UBM model, specifically includes the following steps:
  • The UBM model is a high-order GMM trained on a sufficient amount of speech from many speakers, balanced across channels and across male and female voices, which describes a speaker-independent feature distribution.
  • The UBM model can adapt its parameters according to the training speech features to characterize the personality information of a specific speaker; features not covered by the training speech features are approximated by similar feature distributions in the UBM model, which solves the performance problem caused by insufficient training speech.
  • T(x) is a sufficient statistic of the parameter θ of an unknown distribution P if and only if T(x) provides all the information about θ, that is, no other statistic can provide additional information about θ.
  • A statistic is in effect a compression of the data distribution: in processing samples into a statistic, the information contained in the samples may be lost. If no information is lost, the statistic is called a sufficient statistic. For example, for a Gaussian distribution, the expectation and the covariance matrix are its two sufficient statistics, because once these two parameters are known, the Gaussian distribution is uniquely determined.
  • The recognition server obtains the zero-order and first-order sufficient statistics of the preset UBM model, which serve as the basis for training the total variability subspace.
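A minimal NumPy sketch of these zero- and first-order (Baum-Welch) statistics, assuming a diagonal-covariance UBM (all names are ours, not the patent's):

```python
import numpy as np

def baum_welch_stats(x, means, covs, weights):
    """x: (T, D) frames; means/covs: (C, D); weights: (C,).

    Returns N (C,) zero-order and F (C, D) centered first-order statistics.
    """
    diff = x[:, None, :] - means[None, :, :]                  # (T, C, D)
    log_gauss = -0.5 * (np.sum(diff ** 2 / covs, axis=2)
                        + np.sum(np.log(2.0 * np.pi * covs), axis=1))
    log_post = np.log(weights) + log_gauss                    # (T, C)
    log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)
    gamma = np.exp(log_post)                                  # frame posteriors
    N = gamma.sum(axis=0)                                     # zero-order
    F = gamma.T @ x - N[:, None] * means                      # centered first-order
    return N, F
```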
  • The expectation-maximization (EM) algorithm is an iterative algorithm used in statistics to find maximum likelihood estimates of parameters in probability models that depend on unobservable latent variables. For example, suppose two parameters A and B are both initially unknown, but knowing A allows B to be estimated, and knowing B allows A to be estimated: one first gives A some initial value to obtain an estimate of B, then re-estimates A from the current value of B, and repeats until the estimates converge.
  • The EM algorithm flow is as follows: 1. Initialize the distribution parameters; 2. Repeat the E step and the M step until convergence.
  • E step: Estimate the expected values of the unknown (latent) variables, given the current parameter estimates.
  • M step: Re-estimate the distribution parameters to maximize the data likelihood, given the expected estimates of the unknown variables.
  • Step 1: Using the high-dimensional (first-order) sufficient statistics, concatenate the mean vectors of the M Gaussian components (each of dimension D) to form a Gaussian mean supervector, i.e., the MD-dimensional vector F(x); at the same time, use the zero-order sufficient statistics to construct N, an MD-dimensional diagonal matrix whose main diagonal consists of the concatenated posterior probabilities.
  • The posterior probability is the probability re-estimated after the outcome information is obtained; for example, when something has happened and one asks how likely it is that a particular factor caused it, that likelihood is a posterior probability.
  • Step 2: Initialize the T space as an [MD, V]-dimensional matrix, where V is much smaller than MD; V is the dimension of the first i-vector.
  • Step 3: With the T space fixed, iterate the following formulas with the expectation-maximization algorithm to estimate the zero-order and first-order sufficient statistics of the latent variable w; when the estimates stabilize, the T space can be considered to have converged and is then fixed:
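The iterated formulas are not reproduced in this text; in the standard total variability training derivation they take the following form (our reconstruction, per utterance s with statistics $N_s$, $F_s$ and UBM covariance $\Sigma$). E step:

$$L_s = I + T^{\top}\Sigma^{-1}N_s T,\qquad E[w_s] = L_s^{-1}T^{\top}\Sigma^{-1}F_s,\qquad E[w_s w_s^{\top}] = L_s^{-1} + E[w_s]\,E[w_s]^{\top}$$

M step, solved per Gaussian component c for the D x V block $T_c$:

$$T_c = \Big(\sum_s F_{s,c}\,E[w_s]^{\top}\Big)\Big(\sum_s N_{s,c}\,E[w_s w_s^{\top}]\Big)^{-1}$$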
  • In this embodiment, the EM iteration provides a simple and stable algorithm for computing the posterior density function and obtaining the total variability subspace. With the total variability subspace, the high-dimensional sufficient statistics (supervectors) of the preset UBM model can be projected to a low-dimensional representation, and the dimension-reduced vectors are then used for speech recognition.
  • Step S30, projecting the training speech features onto the total variability subspace to obtain the first i-vector, specifically includes the following steps:
  • S31: Based on the training speech features and the preset UBM model, obtain a GMM-UBM model using the mean-MAP adaptive method.
  • As described above, the training speech features are the MFCC features that distinguish the speaker from others in this embodiment.
  • Specifically, maximum a posteriori (MAP) adaptation is used to update the mean vector of each Gaussian component of the UBM toward the speaker's speech features, generating a GMM with M components, that is, the GMM-UBM model; the update is sketched below.
  • The mean vectors of the GMM-UBM model's Gaussian components (each of dimension D) are then concatenated to form an M*D-dimensional Gaussian mean supervector.
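The mean-MAP update itself is not written out here; the classical GMM-UBM recipe (our reconstruction, with relevance factor r and un-centered first-order statistic $F_c = \sum_t \gamma_t(c)\,x_t$) adapts each component mean as

$$\hat{\mu}_c = \alpha_c\,\frac{F_c}{N_c} + (1 - \alpha_c)\,\mu_c,\qquad \alpha_c = \frac{N_c}{N_c + r}$$

so components well covered by the training speech move toward the speaker's data, while the rest stay close to the UBM means.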
  • S32: Use the formula $s_1 = m + Tw_1$ to project the training speech features onto the total variability subspace and obtain the first i-vector, where $s_1$ is the M*D-dimensional Gaussian mean supervector corresponding to the GMM-UBM model and the training speech features (the supervector obtained in S31); $m$ is the speaker- and channel-independent M*D-dimensional supervector formed by concatenating the mean vectors of the UBM model; $T$ is the total variability subspace of dimension MD*N; and $w_1$ is a random vector obeying the standard normal distribution, namely the first i-vector, of dimension N.
  • T in the formula is obtained by computing the high-dimensional sufficient statistics of the UBM model and iteratively updating them with the EM algorithm until the T space converges.
  • Projecting the training speech features onto the total variability subspace to obtain the first i-vector reduces the dimensionality of the training speech features for the first time, which simplifies their complexity and makes the low-dimensional first i-vector convenient for further processing or for speech recognition, as sketched below.
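A sketch of the projection itself, i.e., the posterior mean of w in $s = m + Tw$ given the statistics above (the standard closed form, not code from the patent; all names are ours):

```python
import numpy as np

def extract_ivector(N, F, T, covs):
    """N: (C,) zero-order stats; F: (C, D) centered first-order stats;
    T: (C*D, V) total variability matrix; covs: (C, D) UBM diagonals."""
    C, D = F.shape
    V = T.shape[1]
    sigma_inv = 1.0 / covs.reshape(-1)          # diagonal of Sigma^{-1}, (C*D,)
    TtSi = T.T * sigma_inv                      # T^T Sigma^{-1}, shape (V, C*D)
    n_expanded = np.repeat(N, D)                # N broadcast over feature dims
    L = np.eye(V) + (TtSi * n_expanded) @ T     # I + T^T Sigma^{-1} N T
    return np.linalg.solve(L, TtSi @ F.reshape(-1))  # E[w], the i-vector
```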
  • In one embodiment, a speaker recognition method is provided. The method is described as applied to the recognition server in FIG. 1 and includes the following steps:
  • Obtain test voice data; the test voice data carries a speaker identifier.
  • The test voice data is the voice data of the speaker who claims the identity indicated by the carried speaker identifier.
  • The speaker identifier is a unique identifier of the speaker's claimed identity, including, but not limited to, a user name, an ID number, or a mobile phone number.
  • Here the speech is the test voice data and the identity is the speaker identifier, so that the recognition server can further determine whether the identity claimed by the test voice data is the true corresponding identity.
  • The test i-vector is the fixed-length vector representation (i.e., i-vector) obtained by projecting the test speech features onto the low-dimensional total variability subspace; it is used to verify the claimed identity.
  • The test i-vector corresponding to the test voice data is obtained by the same process as the registered i-vector obtained from the training speech features, which is not repeated here.
  • The database records, for each speaker, the registered i-vector and the corresponding speaker identifier.
  • The registered i-vector is the fixed-length vector representation (i.e., i-vector) recorded in the database of the recognition server and associated with the speaker identifier as an identity reference.
  • The recognition server can look up the corresponding registered i-vector in the database using the speaker identifier carried by the test voice data, so as to further compare the registered i-vector with the test i-vector.
  • The similarity between the test i-vector and the registered i-vector can be determined by the cosine similarity formula $\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$, where $A_i$ and $B_i$ are the components of vector A (the test i-vector) and vector B (the registered i-vector), respectively.
  • The similarity ranges from -1 to 1: -1 indicates that the two vectors point in opposite directions, 1 that they point in the same direction, and 0 that they are independent (orthogonal); values in between indicate intermediate degrees of similarity or dissimilarity. The closer the similarity is to 1, the closer the two vectors are.
  • The threshold on $\cos\theta$ can be set in advance according to practical experience. If the similarity exceeds the threshold, the test i-vector and the registered i-vector are considered similar, that is, the test voice data can be determined to correspond to the speaker identifier in the database.
  • Using the cosine similarity algorithm to determine the similarity between the test i-vector and the registered i-vector is simple and fast, which helps confirm the recognition result quickly.
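A sketch of the verification decision (the 0.6 threshold is purely illustrative; the patent only says the threshold is set from practical experience):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) = (A . B) / (|A| |B|), in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_speaker(test_ivec: np.ndarray, registered_ivec: np.ndarray,
                 threshold: float = 0.6) -> bool:
    """Accept the claimed identity when the similarity exceeds the threshold."""
    return cosine_similarity(test_ivec, registered_ivec) >= threshold
```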
  • The i-vector extraction method obtains the first i-vector by projecting the training speech features onto the total variability subspace, and then projects the first i-vector onto the total variability subspace a second time to obtain the registered i-vector. After two projections, i.e., two rounds of dimensionality reduction, more noise features are removed from the speech feature data, which improves the purity of the extracted speaker features, improves the recognition efficiency of speech recognition, and reduces the recognition complexity.
  • The feature extraction performed on the training voice data to obtain the registered i-vector reflects the training voice data well, making the registered i-vector obtained by training more accurate for speech recognition. The EM iteration provides a simple and stable algorithm for computing the posterior density function and obtaining the total variability subspace, which projects the high-dimensional sufficient statistics of the preset UBM model to a low-dimensional representation, so that the dimension-reduced vectors can be used for speech recognition.
  • The speaker recognition method provided in the embodiments of the present application processes the test voice data with the i-vector extraction method to obtain the corresponding test i-vector, which reduces the complexity of obtaining the test i-vector; at the same time, determining the similarity between the test i-vector and the registered i-vector with the cosine similarity algorithm is simple and fast, which helps confirm the recognition result quickly.
  • In one embodiment, an i-vector extraction device is provided, corresponding one-to-one to the i-vector extraction method in the above embodiment.
  • As shown in FIG. 7, the i-vector extraction device includes a voice data acquisition module 10, a training variation space module 20, a projection variation space module 30, and an i-vector acquisition module 40.
  • The detailed description of each functional module is as follows:
  • The voice data acquisition module 10 is configured to acquire training voice data of a speaker and extract the training speech features corresponding to the training voice data.
  • The training variation space module 20 is configured to train, based on a preset UBM model, the total variability subspace corresponding to the preset UBM model.
  • The projection variation space module 30 is configured to project the training speech features onto the total variability subspace to obtain a first i-vector.
  • The i-vector acquisition module 40 is configured to project the first i-vector onto the total variability subspace to obtain a registered i-vector corresponding to the speaker.
  • The voice data acquisition module 10 includes a preprocessing unit 11, a data power spectrum unit 12, a Mel power spectrum unit 13, and an MFCC feature unit 14.
  • The preprocessing unit 11 is configured to preprocess the training voice data and obtain preprocessed voice data.
  • The data power spectrum unit 12 is configured to perform a fast Fourier transform on the preprocessed voice data to obtain the frequency spectrum of the training voice data, and to obtain the power spectrum of the training voice data from the frequency spectrum.
  • The Mel power spectrum unit 13 is configured to process the power spectrum of the training voice data with the Mel scale filter bank and obtain the Mel power spectrum of the training voice data.
  • The MFCC feature unit 14 is configured to perform cepstrum analysis on the Mel power spectrum to obtain the MFCC features of the training voice data.
  • The training variation space module 20 includes a high-dimensional statistics unit 21 and a variability subspace unit 22.
  • The high-dimensional statistics unit 21 is configured to obtain the high-dimensional sufficient statistics of the preset UBM model.
  • The variability subspace unit 22 is configured to iterate the high-dimensional sufficient statistics with the expectation-maximization algorithm to obtain the corresponding total variability subspace.
  • The projection variation space module 30 includes a GMM-UBM model unit 31 and a first vector unit 32.
  • The GMM-UBM model unit 31 is configured to obtain a GMM-UBM model from the training speech features and the preset UBM model using the mean-MAP adaptive method.
  • The i-vector acquisition module 40 includes a registration vector unit 41.
  • Each module in the above i-vector extraction device can be implemented in whole or in part by software, hardware, or a combination thereof.
  • The above modules may be embedded in, or independent of, the processor of the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to the above modules.
  • In one embodiment, a speaker recognition device is provided, corresponding one-to-one to the speaker recognition method in the above embodiment.
  • As shown in FIG. 8, the speaker recognition device includes a test data acquisition module 50, a test vector acquisition module 60, a registration vector acquisition module 70, and a speaker determination module 80.
  • The detailed description of each functional module is as follows:
  • The test data acquisition module 50 is configured to acquire test voice data, where the test voice data carries a speaker identifier.
  • The test vector acquisition module 60 is configured to process the test voice data with the i-vector extraction method to obtain the corresponding test i-vector.
  • The registration vector acquisition module 70 is configured to query the database based on the speaker identifier to obtain the registered i-vector corresponding to the speaker identifier.
  • The speaker determination module 80 is configured to obtain the similarity between the test i-vector and the registered i-vector using the cosine similarity algorithm, and to detect, according to the similarity, whether the test i-vector and the registered i-vector correspond to the same speaker.
  • Each module in the above speaker recognition device can be implemented in whole or in part by software, hardware, or a combination thereof.
  • The above modules may be embedded in, or independent of, the processor of the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to the above modules.
  • In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 9.
  • the computer device includes a processor, a memory, a network interface, and a database connected through a system bus.
  • the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer-readable instructions, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer-readable instructions in a non-volatile storage medium.
  • The database of the computer device is used to store data related to the i-vector extraction method or the speaker recognition method.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • The computer-readable instructions are executed by the processor to implement the i-vector extraction method or the speaker recognition method.
  • In one embodiment, a computer device is provided, which includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When the processor executes the computer-readable instructions, the following steps are implemented: acquiring training voice data of a speaker and extracting the training speech features corresponding to the training voice data; training, based on a preset UBM model, the total variability subspace corresponding to the preset UBM model; projecting the training speech features onto the total variability subspace to obtain a first i-vector; and projecting the first i-vector onto the total variability subspace to obtain a registered i-vector corresponding to the speaker.
  • When extracting the training speech features corresponding to the training voice data, the processor implements the following steps when executing the computer-readable instructions: preprocessing the training voice data to obtain preprocessed voice data; performing a fast Fourier transform on the preprocessed voice data to obtain the frequency spectrum of the training voice data, and obtaining the power spectrum of the training voice data from the frequency spectrum; processing the power spectrum of the training voice data with the Mel scale filter bank to obtain the Mel power spectrum of the training voice data; and performing cepstrum analysis on the Mel power spectrum to obtain the MFCC features of the training voice data.
  • When training, based on the preset UBM model, the total variability subspace corresponding to the preset UBM model, the processor implements the following steps when executing the computer-readable instructions: obtaining the high-dimensional sufficient statistics of the preset UBM model; and iterating the high-dimensional sufficient statistics with the expectation-maximization algorithm to obtain the corresponding total variability subspace.
  • When projecting the training speech features onto the total variability subspace to obtain the first i-vector, the processor implements the following steps when executing the computer-readable instructions: obtaining a GMM-UBM model from the training speech features and the preset UBM model using the mean-MAP adaptive method; and projecting the training speech features onto the total variability subspace to obtain the first i-vector.
  • When projecting the first i-vector onto the total variability subspace to obtain the registered i-vector corresponding to the speaker, the formula $s_2 = m + Tw_2$ is used, where $s_2$ is the D*G-dimensional mean supervector corresponding to the registered i-vector; $m$ is the speaker- and channel-independent D*G-dimensional supervector; $T$ is the total variability subspace of dimension DG*M; and $w_2$ is the registered i-vector of dimension M.
  • In one embodiment, a computer device is provided, which includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When the processor executes the computer-readable instructions, the following steps are implemented: obtaining test voice data, where the test voice data carries a speaker identifier; obtaining the corresponding test i-vector based on the test voice data; querying the database based on the speaker identifier to obtain the registered i-vector corresponding to the speaker identifier; and using the cosine similarity algorithm to obtain the similarity between the test i-vector and the registered i-vector, and detecting, according to the similarity, whether the test i-vector and the registered i-vector correspond to the same speaker.
  • In one embodiment, a computer-readable storage medium is provided, on which computer-readable instructions are stored. When the computer-readable instructions are executed by a processor, the following steps are performed: obtaining training voice data of a speaker and extracting the training speech features corresponding to the training voice data; training, based on a preset UBM model, the total variability subspace corresponding to the preset UBM model; projecting the training speech features onto the total variability subspace to obtain a first i-vector; and projecting the first i-vector onto the total variability subspace to obtain a registered i-vector corresponding to the speaker.
  • When extracting the training speech features corresponding to the training voice data, the computer-readable instructions, when executed by the processor, implement the following steps: preprocessing the training voice data to obtain preprocessed voice data; performing a fast Fourier transform on the preprocessed voice data to obtain the frequency spectrum of the training voice data, and obtaining the power spectrum of the training voice data from the frequency spectrum; processing the power spectrum of the training voice data with the Mel scale filter bank to obtain the Mel power spectrum of the training voice data; and performing cepstrum analysis on the Mel power spectrum to obtain the MFCC features of the training voice data.
  • When training, based on the preset UBM model, the total variability subspace corresponding to the preset UBM model, the computer-readable instructions, when executed by the processor, implement the following steps: obtaining the high-dimensional sufficient statistics of the preset UBM model; and iterating the high-dimensional sufficient statistics with the expectation-maximization algorithm to obtain the corresponding total variability subspace.
  • When projecting the training speech features onto the total variability subspace to obtain the first i-vector, the computer-readable instructions, when executed by the processor, implement the corresponding steps described above.
  • When projecting the first i-vector onto the total variability subspace to obtain the registered i-vector corresponding to the speaker, the formula $s_2 = m + Tw_2$ is used, where $s_2$ is the D*G-dimensional mean supervector corresponding to the registered i-vector; $m$ is the speaker- and channel-independent D*G-dimensional supervector; $T$ is the total variability subspace of dimension DG*M; and $w_2$ is the registered i-vector of dimension M.
  • In one embodiment, a computer-readable storage medium is provided, on which computer-readable instructions are stored. When the computer-readable instructions are executed by a processor, the following steps are performed: obtaining test voice data, where the test voice data carries a speaker identifier; obtaining the corresponding test i-vector based on the test voice data; querying the database based on the speaker identifier to obtain the registered i-vector corresponding to the speaker identifier; and using the cosine similarity algorithm to obtain the similarity between the test i-vector and the registered i-vector, and detecting, according to the similarity, whether the test i-vector and the registered i-vector correspond to the same speaker.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

Disclosed are an i-vector extraction method, a speaker recognition method and apparatus, a device, and a medium. The i-vector extraction method comprises: obtaining training voice data of a speaker, and extracting a training voice feature corresponding to the training voice data; training, on the basis of a preset UBM, a total variability subspace corresponding to the preset UBM; projecting the training voice feature on the total variability subspace, and obtaining a first i-vector; and projecting the first i-vector on the total variability subspace, and obtaining a registration i-vector corresponding to the speaker. According to the method, training voice feature data is projected twice, i.e., the dimension is reduced, so that more noise features can be removed, thereby improving the purity of the extracted voice feature of a speaker; moreover, after dimension reduction, the computation space is reduced and the recognition efficiency of voice recognition is also improved.

Description

I-vector extraction method, speaker recognition method, device, equipment, and medium

This application is based on the Chinese invention application No. 201810574010.4, filed on June 6, 2018 and entitled "i-vector extraction method, speaker recognition method, device, equipment, and medium", and claims its priority.

Technical field

The present application relates to the field of speech recognition, and in particular to an i-vector extraction method, a speaker recognition method, a device, equipment, and a medium.

Background

Speaker recognition, also called voiceprint recognition, is a biometric authentication technology that uses speaker-specific information contained in a voice signal to identify the speaker. In recent years, the introduction of i-vector (identity-vector) modeling methods based on vector analysis has significantly improved the performance of speaker recognition systems. In vector analysis of a speaker's speech, the channel subspace usually contains speaker information. The i-vector approach represents the speaker subspace and the channel subspace with a single low-dimensional total variability space; projecting the speaker's speech into this space by dimensionality reduction yields a fixed-length vector representation (the i-vector). However, the i-vectors obtained by existing i-vector modeling still contain many interference factors, which increases the complexity of using them for speaker recognition.
Summary of the Invention

In view of the above technical problems, it is necessary to provide an i-vector extraction method, device, computer equipment, and storage medium that can remove more interference factors.

An i-vector extraction method includes:

obtaining training voice data of a speaker, and extracting the training speech features corresponding to the training voice data;

training, based on a preset UBM model, the total variability subspace corresponding to the preset UBM model;

projecting the training speech features onto the total variability subspace to obtain a first i-vector;

projecting the first i-vector onto the total variability subspace to obtain a registered i-vector corresponding to the speaker.
An i-vector extraction device includes:

a voice data acquisition module, configured to acquire training voice data of a speaker and extract the training speech features corresponding to the training voice data;

a training variation space module, configured to train, based on a preset UBM model, the total variability subspace corresponding to the preset UBM model;

a projection variation space module, configured to project the training speech features onto the total variability subspace to obtain a first i-vector;

an i-vector acquisition module, configured to project the first i-vector onto the total variability subspace to obtain a registered i-vector corresponding to the speaker.

A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor; the processor implements the steps of the i-vector extraction method when executing the computer-readable instructions.

A computer-readable storage medium stores computer-readable instructions; when the computer-readable instructions are executed by a processor, the steps of the i-vector extraction method are implemented.
This embodiment further provides a speaker recognition method, including:
obtaining test speech data, where the test speech data carries a speaker identifier;
obtaining, based on the test speech data, a corresponding test i-vector;
querying a database based on the speaker identifier to obtain a registered i-vector corresponding to the speaker identifier; and
computing the similarity between the test i-vector and the registered i-vector using a cosine similarity algorithm, and detecting, according to the similarity, whether the test i-vector and the registered i-vector correspond to the same speaker.
A speaker recognition apparatus includes:
a test data acquisition module, configured to obtain test speech data, where the test speech data carries a speaker identifier;
a test vector acquisition module, configured to process the test speech data using the i-vector extraction method to obtain a corresponding test i-vector;
a registered vector acquisition module, configured to query a database based on the speaker identifier to obtain a registered i-vector corresponding to the speaker identifier; and
a speaker determination module, configured to compute the similarity between the test i-vector and the registered i-vector using a cosine similarity algorithm, and to detect, according to the similarity, whether the test i-vector and the registered i-vector correspond to the same speaker.
A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer-readable instructions:
obtaining training speech data of a speaker, and extracting training speech features corresponding to the training speech data;
training, based on a preset UBM model, a total variability subspace corresponding to the preset UBM model;
projecting the training speech features onto the total variability subspace to obtain a first i-vector; and
projecting the first i-vector onto the total variability subspace to obtain a registered i-vector corresponding to the speaker.
A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer-readable instructions:
obtaining test speech data, where the test speech data carries a speaker identifier;
processing the test speech data using the i-vector extraction method to obtain a corresponding test i-vector;
querying a database based on the speaker identifier to obtain a registered i-vector corresponding to the speaker identifier; and
computing the similarity between the test i-vector and the registered i-vector using a cosine similarity algorithm, and detecting, according to the similarity, whether the test i-vector and the registered i-vector correspond to the same speaker.
One or more non-volatile readable storage media store computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
obtaining training speech data of a speaker, and extracting training speech features corresponding to the training speech data;
training, based on a preset UBM model, a total variability subspace corresponding to the preset UBM model;
projecting the training speech features onto the total variability subspace to obtain a first i-vector; and
projecting the first i-vector onto the total variability subspace to obtain a registered i-vector corresponding to the speaker.
One or more non-volatile readable storage media store computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
obtaining test speech data, where the test speech data carries a speaker identifier;
processing the test speech data using the i-vector extraction method to obtain a corresponding test i-vector;
querying a database based on the speaker identifier to obtain a registered i-vector corresponding to the speaker identifier; and
computing the similarity between the test i-vector and the registered i-vector using a cosine similarity algorithm, and detecting, according to the similarity, whether the test i-vector and the registered i-vector correspond to the same speaker.
The details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below; other features and advantages of the present application will become apparent from the specification, the drawings, and the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
To describe the technical solutions in the embodiments of the present application more clearly, the accompanying drawings needed for describing the embodiments are briefly introduced below. Apparently, the drawings in the following description show only some embodiments of the present application, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an application environment of an i-vector extraction method according to an embodiment of the present application;
FIG. 2 is a flowchart of an i-vector extraction method according to an embodiment of the present application;
FIG. 3 is another specific flowchart of an i-vector extraction method according to an embodiment of the present application;
FIG. 4 is another specific flowchart of an i-vector extraction method according to an embodiment of the present application;
FIG. 5 is another specific flowchart of an i-vector extraction method according to an embodiment of the present application;
FIG. 6 is a specific flowchart of a speaker recognition method according to an embodiment of the present application;
FIG. 7 is a schematic block diagram of an i-vector extraction apparatus according to an embodiment of the present application;
FIG. 8 is a schematic block diagram of a speaker recognition apparatus according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a computer device according to an embodiment of the present application.
DETAILED DESCRIPTION
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings in the embodiments. Apparently, the described embodiments are only some rather than all of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
The i-vector extraction method provided in the embodiments of the present application can be applied in the application environment shown in FIG. 1, in which a computer device communicates with a recognition server over a network. The computer device includes, but is not limited to, personal computers, notebook computers, smartphones, tablet computers, and portable wearable devices. The recognition server can be implemented as an independent server or as a server cluster composed of multiple servers.
In an embodiment, as shown in FIG. 2, an i-vector extraction method is provided. The method is described using its application to the recognition server in FIG. 1 as an example, and includes the following steps:
S10. Obtain training speech data of a speaker, and extract training speech features corresponding to the training speech data.
The speaker's training speech data is the original speech data provided by the speaker. The training speech features are speech features that distinguish the speaker from others; in this embodiment, Mel-Frequency Cepstral Coefficients (MFCC features) can be used as the training speech features.
Studies have found that the human ear behaves like a filter bank that focuses only on certain frequency components (human hearing is nonlinear in frequency); that is, the ear receives sound only in a limited set of frequency bands. These filters, however, are not uniformly distributed along the frequency axis: in the low-frequency region there are many filters, densely spaced, while in the high-frequency region the filters become fewer and sparsely spaced. A mel-scale filter bank has high resolution in the low-frequency region, which matches the auditory characteristics of the human ear; this is the physical meaning of the mel scale.
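For reference, a commonly used mapping between linear frequency and the mel scale (a standard formula, not one recited in this application) is:

$$m = 2595\,\log_{10}\!\left(1 + \frac{f}{700}\right), \qquad f = 700\left(10^{m/2595} - 1\right),$$

where f is the frequency in Hz and m is the corresponding mel frequency.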
S20. Train, based on a preset UBM model, a total variability subspace corresponding to the preset UBM model.
The preset UBM (Universal Background Model) is a Gaussian mixture model (GMM) that characterizes the speech feature distribution of a large number of non-specific speakers. A UBM is usually trained on a large amount of speech data unrelated to any specific speaker or channel, so it can generally be regarded as a speaker-independent model: it only fits the overall distribution of human speech features and does not represent any specific speaker. The UBM model is preset in the recognition server because, in the voiceprint registration stage of the voiceprint recognition process, the speech data available for training a specific speaker is usually very limited; if a GMM were used to model the speaker's speech features directly, that limited training data usually could not cover the feature space of the GMM. Therefore, the parameters of the UBM model can be adjusted according to the training speech features to characterize the specific speaker's individual information, and features not covered by the training speech can be approximated by similar feature distributions in the UBM model; this approach largely solves the system performance problems caused by insufficient training speech.
The total variability subspace, also called the T space (Total Variability Space), directly defines a single globally varying projection matrix containing all possible speaker information in the speech data; within the T space, the speaker space and the channel space are not separated. The T space projects high-dimensional sufficient statistics (supervectors) onto an i-vector that serves as a low-dimensional speaker representation, thereby achieving dimensionality reduction. The T space is trained as follows: based on the preset UBM model, factor analysis and the EM (Expectation Maximization) algorithm are used to compute a converged T space.
In this step, the total variability subspace obtained from the preset UBM model does not distinguish between the speaker space and the channel space; the speaker-related information and the channel-related information are converged into a single space, which reduces the computational complexity and facilitates the subsequent extraction of i-vectors based on the total variability subspace.
S30. Project the training speech features onto the total variability subspace to obtain a first i-vector.
The first i-vector is the fixed-length vector representation obtained by projecting the training speech features into the low-dimensional total variability subspace, i.e., an i-vector.
Specifically, this step uses the formula $s_1 = m + Tw_1$: projecting the high-dimensional training speech features onto the total variability subspace yields the low-dimensional first i-vector, which reduces the dimensionality of the projected training speech features and removes more noise, facilitating speaker recognition based on the first i-vector.
S40. Project the first i-vector onto the total variability subspace to obtain a registered i-vector corresponding to the speaker.
The total variability subspace here is the one obtained in step S20; it does not separate the speaker space and the channel space, and directly defines a single globally varying T space (Total Variability Space) containing all possible information in the speech data.
The registered i-vector is the fixed-length vector representation obtained by projecting the first i-vector into the low-dimensional total variability subspace; it is recorded in the database of the recognition server and associated with the speaker ID as an identity credential.
In a specific implementation, step S40, i.e., projecting the first i-vector onto the total variability subspace to obtain the registered i-vector, specifically includes the following step:
S41. Project the first i-vector onto the total variability subspace using the formula $s_2 = m + Tw_2$ to obtain the registered i-vector, where $s_2$ is the D*G-dimensional mean supervector corresponding to the registered i-vector; m is the speaker-independent, channel-independent D*G-dimensional supervector; T is the total variability subspace, with dimensions DG*M; and $w_2$ is the registered i-vector, with dimension M.
In this embodiment, $s_2$ can be the Gaussian mean supervector of the first i-vector obtained in step S30; m is the speaker-independent, channel-independent D*G-dimensional supervector concatenated from the mean supervectors of the UBM model; and $w_2$ is a random vector following the standard normal distribution, namely the registered i-vector, whose dimension is M.
Further, T (the total variability subspace) in the formula is obtained as follows: the high-dimensional sufficient statistics of the UBM model are trained, and the converged T space is generated by iteratively updating those statistics with the EM algorithm. Substituting the T space into $s_2 = m + Tw_2$, since $s_2$, m, and T are all known, $w_2$, i.e., the registered i-vector, can be obtained as $w_2 = (s_2 - m)/T$.
In the i-vector extraction method provided in this embodiment, the training speech features are projected onto the total variability subspace to obtain the first i-vector, and the first i-vector is then projected onto the total variability subspace a second time to obtain the registered i-vector. Because the training speech feature data undergoes two projections, i.e., two dimensionality reductions, more noise features can be removed, improving the purity of the extracted speaker speech features; at the same time, the reduced dimensionality shrinks the computation space and improves the efficiency of speech recognition. The speaker recognition method provided in this embodiment uses this i-vector extraction method for recognition, reducing recognition complexity.
In an embodiment, as shown in FIG. 3, step S10 of extracting the training speech features corresponding to the training speech data specifically includes the following steps:
S11: Preprocess the training speech data to obtain preprocessed speech data.
In a specific implementation, step S11 of preprocessing the training speech data to obtain the preprocessed speech data specifically includes the following steps:
S111: Perform pre-emphasis on the training speech data, where the pre-emphasis is computed as $s'_n = s_n - a \cdot s_{n-1}$, in which $s_n$ is the signal amplitude in the time domain, $s_{n-1}$ is the signal amplitude at the previous moment corresponding to $s_n$, $s'_n$ is the pre-emphasized signal amplitude in the time domain, and a is the pre-emphasis coefficient, with 0.9 < a < 1.0.
Pre-emphasis is a signal processing method that compensates the high-frequency components of the input signal at the transmitting end. As the signal rate increases, the signal is heavily degraded during transmission, and for the receiving end to obtain a good signal waveform, the degraded signal needs to be compensated. The idea of pre-emphasis is to boost the high-frequency components of the signal at the transmitting end of the transmission line to compensate for their excessive attenuation during transmission, so that the receiving end obtains a better signal waveform. Pre-emphasis has no effect on the noise, so it effectively improves the output signal-to-noise ratio.
In this embodiment, pre-emphasis is applied to the training speech data with the formula $s'_n = s_n - a \cdot s_{n-1}$, where $s_n$ is the amplitude of the speech expressed by the speech data in the time domain, $s_{n-1}$ is the signal amplitude at the previous moment relative to $s_n$, $s'_n$ is the pre-emphasized signal amplitude in the time domain, and a is the pre-emphasis coefficient with 0.9 < a < 1.0; taking a = 0.97 gives good pre-emphasis results here. This pre-emphasis eliminates the interference caused by the vocal cords and lips during phonation, effectively compensates the suppressed high-frequency part of the training speech data, highlights the high-frequency formants of the training speech data, and strengthens its signal amplitude, which helps extract the training speech features.
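As a minimal sketch of the pre-emphasis formula above (a NumPy illustration; the function name is hypothetical):

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, a: float = 0.97) -> np.ndarray:
    """Apply s'_n = s_n - a * s_{n-1}; the first sample is kept unchanged."""
    return np.append(signal[0], signal[1:] - a * signal[:-1])
```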
S112: Frame the pre-emphasized training speech data.
Specifically, after pre-emphasizing the training speech data, framing should also be performed. Framing is a speech processing technique that cuts the whole speech signal into several segments; the size of each frame is in the range of 10-30 ms, with a frame shift of about half the frame length. The frame shift is the overlapping region between two adjacent frames, which avoids excessive variation between them. Framing divides the training speech data into several segments of speech data, subdividing the training speech data and facilitating the extraction of the training speech features.
S113: Window the framed training speech data to obtain the preprocessed speech data, where the windowing is computed as

$$s'_n = \left(0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right)\right) \cdot s_n, \qquad 0 \le n \le N-1,$$

where N is the window length, n is time, $s_n$ is the signal amplitude in the time domain, and $s'_n$ is the windowed signal amplitude in the time domain.
Specifically, after the training speech data is framed, discontinuities appear at the beginning and end of each frame, so the more frames there are, the larger the error relative to the original training speech data. Windowing solves this problem: it makes the framed training speech data continuous and lets each frame exhibit the characteristics of a periodic function. Windowing specifically means processing the training speech data with a window function; the window function can be a Hamming window, in which case the windowing formula is

$$s'_n = \left(0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right)\right) \cdot s_n, \qquad 0 \le n \le N-1,$$

where N is the Hamming window length, n is time, $s_n$ is the signal amplitude in the time domain, and $s'_n$ is the windowed signal amplitude in the time domain. Windowing the training speech data to obtain the preprocessed speech data makes the time-domain signal of the framed training speech data continuous, which helps extract the training speech features of the training speech data.
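A brief sketch combining S112 and S113 (framing plus Hamming windowing); the 20 ms frame length and 10 ms shift are illustrative values within the 10-30 ms range and half-frame shift described above:

```python
import numpy as np

def frame_and_window(signal, sample_rate=16000, frame_ms=20, shift_ms=10):
    """Split the signal into overlapping frames and apply a Hamming window."""
    frame_len = int(sample_rate * frame_ms / 1000)
    frame_shift = int(sample_rate * shift_ms / 1000)
    assert len(signal) >= frame_len, "signal shorter than one frame"
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    window = np.hamming(frame_len)   # 0.54 - 0.46*cos(2*pi*n/(N-1))
    return np.stack([
        signal[i * frame_shift : i * frame_shift + frame_len] * window
        for i in range(n_frames)
    ])                               # shape: (n_frames, frame_len)
```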
The preprocessing operations on the training speech data in steps S111-S113 above provide the basis for extracting the training speech features of the training speech data, so that the extracted training speech features better represent the training speech data, and a corresponding GMM-UBM model can be trained from these training speech features.
S12: Perform a fast Fourier transform on the preprocessed speech data to obtain the frequency spectrum of the training speech data, and obtain the power spectrum of the training speech data from the frequency spectrum.
The fast Fourier transform (FFT) is the collective term for efficient, fast computational methods for computing the discrete Fourier transform with a computer. Such algorithms greatly reduce the number of multiplications a computer needs to compute the discrete Fourier transform; the more sampling points to be transformed, the more significant the savings in computation.
Specifically, a fast Fourier transform is performed on the preprocessed speech data to convert it from signal amplitudes in the time domain to signal amplitudes in the frequency domain (the spectrum). The spectrum is computed as

$$s(k) = \sum_{n=0}^{N-1} s(n)\, e^{-2\pi i k n / N}, \qquad 0 \le k \le N-1,$$

where N is the frame size, s(k) is the signal amplitude in the frequency domain, s(n) is the signal amplitude in the time domain, n is time, and i is the imaginary unit. After the spectrum of the preprocessed speech data is obtained, the power spectrum of the preprocessed speech data can be derived directly from it; hereinafter the power spectrum of the preprocessed speech data is referred to as the power spectrum of the training speech data. The power spectrum of the training speech data is computed as

$$P(k) = \frac{1}{N}\,\lvert s(k)\rvert^{2},$$

where N is the frame size and s(k) is the signal amplitude in the frequency domain. Converting the preprocessed speech data from time-domain signal amplitudes to frequency-domain signal amplitudes and then obtaining the power spectrum of the training speech data provides an important technical basis for extracting the training speech features from the power spectrum of the training speech data.
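A sketch of this step, assuming the framed, windowed frames from the previous sketch:

```python
import numpy as np

def power_spectrum(frames: np.ndarray, n_fft: int = 512) -> np.ndarray:
    """Per-frame FFT s(k), then the power spectrum |s(k)|^2 / N."""
    spectrum = np.fft.rfft(frames, n=n_fft)     # one-sided frequency-domain amplitudes
    return (np.abs(spectrum) ** 2) / n_fft      # shape: (n_frames, n_fft // 2 + 1)
```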
S13: Process the power spectrum of the training speech data with a mel-scale filter bank to obtain the mel power spectrum of the training speech data.
Processing the power spectrum of the training speech data with a mel-scale filter bank amounts to a mel-frequency analysis of the power spectrum, which is an analysis based on human auditory perception. Studies have found that the human ear behaves like a filter bank, focusing only on certain frequency components (human hearing is nonlinear in frequency); that is, the ear receives sound only in a limited set of frequency bands. These filters, however, are not uniformly distributed on the frequency axis: there are many densely distributed filters in the low-frequency region, while in the high-frequency region the filters become fewer and sparsely distributed. Understandably, a mel-scale filter bank has high resolution in the low-frequency part, matching the auditory characteristics of the human ear; this is the physical meaning of the mel scale.
In this embodiment, a mel-scale filter bank is used to process the power spectrum of the training speech data to obtain its mel power spectrum; the mel-scale filter bank partitions the frequency-domain signal so that each frequency band ultimately corresponds to one value. If the number of filters is 22, then 22 energy values corresponding to the mel power spectrum of the training speech data are obtained. Through mel-frequency analysis of the power spectrum of the training speech data, the resulting mel power spectrum retains the frequency portions closely related to the characteristics of the human ear, and these portions reflect the characteristics of the training speech data well.
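A compact sketch of building and applying such a filter bank (triangular filters spaced evenly on the mel scale; the 22-filter count follows the example above, and the helper names are illustrative):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=22, n_fft=512, sample_rate=16000):
    """Triangular filters, dense at low frequency and sparse at high frequency."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

# One energy value per filter and per frame:
# mel_power = power_spectrum(frames) @ mel_filterbank().T
```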
S14: Perform cepstral analysis on the mel power spectrum to obtain the MFCC features of the training speech data.
The cepstrum is the inverse Fourier transform of the logarithm of a signal's Fourier transform spectrum; since the Fourier spectrum is in general complex, the cepstrum is also called the complex cepstrum.
Specifically, cepstral analysis is performed on the mel power spectrum, and the MFCC features of the training speech data are analyzed and obtained from the cepstral result. Through this cepstral analysis, the features contained in the mel power spectrum of the training speech data, which are originally of too high a dimension to use directly, are converted into easy-to-use features (MFCC feature vectors used for training or recognition). The MFCC features can serve as training speech features, i.e., coefficients that distinguish different speech; they reflect the differences between speech signals and can be used to identify and distinguish the training speech data.
In a specific implementation, step S14 of performing cepstral analysis on the mel power spectrum to obtain the MFCC features of the training speech data includes the following steps:
S141: Take the logarithm of the mel power spectrum to obtain the mel power spectrum to be transformed.
Specifically, according to the definition of the cepstrum, the logarithm log of the mel power spectrum is taken to obtain the mel power spectrum m to be transformed.
S142: Perform a discrete cosine transform on the mel power spectrum to be transformed to obtain the MFCC features of the training speech data.
Specifically, a discrete cosine transform (DCT) is applied to the mel power spectrum m to be transformed to obtain the corresponding MFCC features of the training speech data; generally the 2nd to 13th coefficients are taken as the training speech features, which reflect the differences between speech data. The discrete cosine transform of the mel power spectrum m to be transformed is

$$C(i) = \sum_{j=1}^{N} m(j)\,\cos\!\left(\frac{\pi i\,(j - 0.5)}{N}\right),$$

where N is the frame length, m is the mel power spectrum to be transformed, and j is the independent variable of the mel power spectrum to be transformed. Because the mel filters overlap, the energy values obtained with the mel-scale filters are correlated; the discrete cosine transform can decorrelate, compress, and abstract the mel power spectrum m to be transformed and yield the indirect training speech features. Compared with the Fourier transform, the result of the discrete cosine transform has no imaginary part, which is a clear computational advantage.
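Putting S141 and S142 together (SciPy's DCT-II is used here for brevity; keeping the 2nd-13th coefficients follows the convention stated above):

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_from_mel_power(mel_power: np.ndarray) -> np.ndarray:
    """Cepstral analysis: log of the mel power spectrum, then a DCT."""
    log_mel = np.log(mel_power + 1e-10)                  # S141 (epsilon avoids log 0)
    cepstra = dct(log_mel, type=2, axis=-1, norm='ortho')
    return cepstra[:, 1:13]                              # S142: 2nd-13th coefficients
```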
Steps S11-S14 perform feature extraction on the training speech data; the finally obtained training speech features represent the training speech data well, and a corresponding GMM-UBM model can be trained from them to obtain the registered i-vector, so that the registered i-vector obtained from training yields more accurate results in speech recognition.
It should be noted that the features extracted above are MFCC features; the training speech features should not be limited here to MFCC features alone. Rather, any speech features obtained by the training technique that effectively reflect the characteristics of the speech data can serve as training speech features for recognition and model training. In this embodiment, the training speech data is preprocessed and the corresponding preprocessed speech data is obtained; preprocessing the training speech data allows the training speech features to be extracted better, so that the extracted features are more representative of the training speech data and can be used for speech recognition.
In an embodiment, as shown in FIG. 4, step S20 of training, based on the preset UBM model, the total variability subspace corresponding to the preset UBM model specifically includes the following steps:
S21. Obtain high-dimensional sufficient statistics of the preset UBM model.
The UBM model is a high-order GMM trained on sufficient speech from many speakers, balanced in channel and in male and female voices, to describe the speaker-independent feature distribution. The parameters of the UBM model can be adjusted according to the training speech features to characterize the individual information of a specific speaker, and features not covered by the training speech features are approximated by similar feature distributions in the UBM model, solving the performance problems caused by insufficient training speech.
A statistic is a function of the sample data. In statistics, T(x) is a sufficient statistic for the parameter θ of an unknown distribution P if and only if T(x) provides all the information about θ, that is, no other statistic can provide additional information about θ. A statistic is in effect a compression of the data distribution: in the process of reducing samples to a statistic, some information contained in the samples may be lost; if no information is lost when the samples are reduced to the statistic, the statistic is called a sufficient statistic. For example, for a Gaussian distribution, the mean and the covariance matrix are its two sufficient statistics, because if these two parameters are known, the Gaussian distribution is uniquely determined.
Specifically, the high-dimensional sufficient statistics of the preset UBM model are obtained as follows: take speaker samples X = {x1, x2, ..., xn} that follow the distribution F(x) corresponding to the preset UBM model, with parameter theta. The statistic of this set of samples is T, with T = r(x1, x2, ..., xn). If T follows a distribution F(T), and the parameter theta of the sample distribution F(x) can be derived from F(T), i.e., all information about theta contained in F(x) is contained in F(T), then T is a high-dimensional sufficient statistic of the preset UBM model.
In this step, the recognition server obtains the zero-order and first-order sufficient statistics of the preset UBM model as the technical basis for training the total variability subspace.
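For concreteness, the zero-order and first-order statistics used in standard i-vector training can be written in the usual Baum-Welch form (stated here as the conventional definitions, not as recited equations): for mixture component c, with $\gamma_t(c)$ the posterior probability of frame $x_t$ under component c and $m_c$ the UBM mean of component c,

$$N_c = \sum_t \gamma_t(c), \qquad F_c = \sum_t \gamma_t(c)\,(x_t - m_c).$$

The $N_c$ fill the main diagonal blocks of the matrix N, and the $F_c$ are concatenated into the supervector F used in the iteration below.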
S22. Iterate on the high-dimensional sufficient statistics using the expectation-maximization algorithm to obtain the corresponding total variability subspace.
The expectation-maximization (EM) algorithm is an iterative algorithm used in statistics to find maximum likelihood estimates of parameters in probabilistic models that depend on unobservable latent variables. For example, initialize two parameters A and B, both of whose values are unknown in the initial state; knowing A yields information about B, and likewise knowing B yields information about A. First assign A some initial value to obtain an estimate of B, then starting from the current value of B, re-estimate the value of A, and continue until convergence.
The EM algorithm flow is as follows: 1. Initialize the distribution parameters; 2. Repeat the E step and the M step until convergence. E step: estimate the expected values of the unknown parameters, given the current parameter estimates. M step: re-estimate the distribution parameters so as to maximize the likelihood of the data, given the expected estimates of the unknown variables. By alternating the E and M steps, the model parameters are gradually improved, so that the likelihood of the parameters and the training samples increases steadily, finally terminating at a maximum point.
Specifically, the total variability subspace is obtained iteratively through the following steps:
Step 1: According to the high-dimensional sufficient statistics, the mean vectors of the M Gaussian components (each of dimension D) are concatenated to form a Gaussian mean supervector, i.e., an M*D-dimensional vector, which constitutes F(x); F(x) is an MD-dimensional vector. At the same time, the zero-order sufficient statistics are used to construct N, an MD x MD diagonal matrix whose main diagonal elements are assembled from the posterior probabilities. The posterior probability is the probability revised after the result information is obtained; for example, when an event has already occurred, the probability that it was caused by a particular factor is a posterior probability.
Step 2: Initialize the T space by constructing an [MD, V]-dimensional matrix, where the dimension V is much smaller than MD; V is the dimension of the first i-vector.
Step 3: With the T space fixed, the following formula is iterated repeatedly with the expectation-maximization algorithm to estimate the zero-order and first-order sufficient statistics of the latent variable w. When the iteration reaches a specified number of times (5-6), the T space can be considered converged, fixing the T space:

$$E[w] = \left(I + T^{\top}\Sigma^{-1} N\, T\right)^{-1} T^{\top}\Sigma^{-1} F,$$

where w is the latent variable and I is the identity matrix; Σ is the MD x MD covariance matrix of the UBM model, whose diagonal elements are Σ_1, ..., Σ_m; F is the first-order statistic among the high-dimensional sufficient statistics; and N is the MD x MD diagonal matrix.
In this embodiment, the EM iteration provides a simple and stable iterative algorithm that computes the posterior density function to obtain the total variability subspace; obtaining the total variability subspace allows the high-dimensional sufficient statistics (supervectors) of the preset UBM model to be projected into a low dimension, and the dimension-reduced vectors facilitate further speech recognition.
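A sketch of the E-step computation above (a simplified illustration of the standard i-vector posterior under the stated shapes; the M-step update of T is omitted for brevity):

```python
import numpy as np

def e_step_posterior(T, Sigma_diag, N_diag, F):
    """E[w] = (I + T' Sigma^-1 N T)^-1 T' Sigma^-1 F.

    T: (MD, V); Sigma_diag, N_diag: (MD,) diagonals of the UBM covariance
    and the zero-order statistics matrix; F: (MD,) first-order statistics.
    """
    TtSinv = T.T / Sigma_diag                      # T' Sigma^-1, shape (V, MD)
    precision = np.eye(T.shape[1]) + (TtSinv * N_diag) @ T
    cov = np.linalg.inv(precision)                 # posterior covariance of w
    mean = cov @ (TtSinv @ F)                      # posterior mean E[w]
    return mean, cov
```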
In an embodiment, as shown in FIG. 5, step S30 of projecting the training speech features onto the total variability subspace to obtain the first i-vector specifically includes the following steps:
S31. Based on the training speech features and the preset UBM model, obtain a GMM-UBM model using mean MAP adaptation.
The training speech features are speech features that distinguish the speaker from others; in this embodiment, Mel-Frequency Cepstral Coefficients (MFCC features) can be used as the training speech features.
Specifically, based on the preset UBM model, maximum a posteriori (MAP) adaptation is applied to the GMM model of the training speech features to update the mean vector of each Gaussian component. A GMM model with M components is then generated, i.e., the GMM-UBM model. The mean vectors of the Gaussian components of the GMM-UBM model (each of dimension D) are used as concatenation units to form an M*D-dimensional Gaussian mean supervector.
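A minimal sketch of mean-only MAP adaptation (the relevance factor r = 16 and the update rule below are common conventions, stated as assumptions rather than recited values):

```python
import numpy as np

def map_adapt_means(ubm_means, N_c, F_c, r=16.0):
    """Mean-only MAP adaptation of UBM component means.

    ubm_means: (M, D); N_c: (M,) per-component occupancy counts;
    F_c: (M, D) per-component first-order sums of the enrollment frames.
    """
    alpha = (N_c / (N_c + r))[:, None]                  # adaptation weight per component
    ml_means = F_c / np.maximum(N_c, 1e-8)[:, None]     # data-driven mean estimate
    return alpha * ml_means + (1.0 - alpha) * ubm_means

# Gaussian mean supervector: concatenate the M adapted D-dimensional means
# supervector = map_adapt_means(ubm_means, N_c, F_c).reshape(-1)
```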
S32. Project the training speech features onto the total variability subspace using the formula $s_1 = m + Tw_1$ to obtain the first i-vector, where $s_1$ is the mean supervector in the C*F-dimensional GMM-UBM model corresponding to the training speech features; m is the speaker-independent, channel-independent C*F-dimensional supervector; T is the total variability subspace, with dimensions CF*N; and $w_1$ is the first i-vector, with dimension N.
In this embodiment, $s_1$ can be the Gaussian mean supervector obtained in step S31; m is the speaker-independent, channel-independent M*D-dimensional supervector concatenated from the mean supervectors of the UBM model; and $w_1$ is a random vector following the standard normal distribution, namely the first i-vector, whose dimension is N.
Further, T (the total variability subspace) in the formula is obtained as follows: the high-dimensional sufficient statistics of the UBM model are trained, and the converged T space is generated by iteratively updating those statistics with the EM algorithm. Substituting the T space into $s_1 = m + Tw_1$, since $s_1$, m, and T are all known, $w_1$, i.e., the first i-vector, can be obtained as $w_1 = (s_1 - m)/T$.
In steps S31 to S32, using the formula $s_1 = m + Tw_1$, the training speech features are projected onto the total variability subspace to obtain the first i-vector; this initial dimensionality reduction simplifies the complexity of the training speech features and facilitates further processing of the low-dimensional first i-vector or its use for speech recognition.
In an embodiment, as shown in FIG. 6, a speaker recognition method is provided. The method is described using its application to the recognition server in FIG. 1 as an example, and includes the following steps:
S50. Obtain test speech data, where the test speech data carries a speaker identifier.
The test speech data is voice data to be verified, claimed to come from the speaker corresponding to the carried speaker identifier. The speaker identifier is a unique identifier of the speaker's identity, including but not limited to a user name, an ID card number, or a mobile phone number.
Completing the speech recognition process requires two basic elements: speech and identity. In this embodiment, the speech is the test speech data and the identity is the speaker identifier, so that the recognition server can further determine whether the identity claimed by the test speech data is the true corresponding identity.
S60. Process the test speech data using the i-vector extraction method to obtain a corresponding test i-vector.
The test i-vector is the fixed-length vector representation (i.e., i-vector) used for identity verification, obtained by projecting the test speech features into the low-dimensional total variability subspace.
In this step, the test i-vector corresponding to the test speech data is obtained; the acquisition process is the same as obtaining the registered i-vector from the training speech features, and is not repeated here.
S70. Query the database based on the speaker identifier to obtain the registered i-vector corresponding to the speaker identifier.
The database records each speaker's registered i-vector in association with the corresponding speaker identifier.
The registered i-vector is the fixed-length vector representation (i.e., i-vector) recorded in the database of the recognition server and associated with the speaker ID as an identity credential.
In this step, the recognition server can look up the corresponding registered i-vector in the database based on the speaker identifier carried in the test speech data, so as to further compare the registered i-vector with the test i-vector.
S80. Compute the similarity between the test i-vector and the registered i-vector using a cosine similarity algorithm, and detect, according to the similarity, whether the test i-vector and the registered i-vector correspond to the same speaker.
Specifically, the similarity between the test i-vector and the registered i-vector can be determined by the following formula:

$$\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\;\sqrt{\sum_{i=1}^{n} B_i^{2}}},$$

where $A_i$ and $B_i$ are the components of vector A and vector B, respectively. As the formula shows, the similarity ranges from -1 to 1: -1 means the two vectors point in opposite directions, 1 means they point in the same direction, and 0 means the two vectors are independent. Values between -1 and 1 indicate intermediate degrees of similarity or dissimilarity; understandably, the closer the similarity is to 1, the more similar the two vectors are. In this embodiment, a threshold for cos θ can be preset according to practical experience. If the similarity between the test i-vector and the registered i-vector is greater than the threshold, the test i-vector and the registered i-vector are considered similar, and it can be determined that the test speech data corresponds to the speaker identifier in the database.
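A sketch of the verification decision (the 0.7 threshold is purely illustrative; the application only states that the threshold is set from practical experience):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_speaker(test_iv, registered_iv, threshold=0.7):
    """Accept the claimed identity if cos(theta) exceeds the preset threshold."""
    return cosine_similarity(test_iv, registered_iv) > threshold
```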
In this embodiment, the cosine similarity algorithm determines the similarity between the test i-vector and the registered i-vector simply and quickly, which helps confirm the recognition result rapidly.
In the i-vector extraction method provided in the embodiments of the present application, the first i-vector is obtained by projecting the training speech features onto the total variability subspace, and the registered i-vector is then obtained by projecting the first i-vector onto the total variability subspace a second time. Because the training speech feature data undergoes two projections, i.e., two dimensionality reductions, more noise features can be removed, improving the purity of the extracted speaker speech features; at the same time, the reduced dimensionality shrinks the computation space, improves the efficiency of speech recognition, and reduces recognition complexity.
Further, obtaining the registered i-vector through feature extraction from the training speech data reflects the training speech data well, so that the registered i-vector obtained from training yields more accurate speech recognition results. The EM iteration provides a simple and stable iterative algorithm that computes the posterior density function to obtain the total variability subspace; obtaining the total variability subspace projects the high-dimensional sufficient statistics of the preset UBM model into a low dimension, and the dimension-reduced vectors facilitate further speech recognition.
The speaker recognition method provided in the embodiments of the present application processes the test speech data with the i-vector extraction method to obtain the corresponding test i-vector, reducing the complexity of obtaining the test i-vector; at the same time, the cosine similarity algorithm determines the similarity between the test i-vector and the registered i-vector simply and quickly, helping confirm the recognition result rapidly.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
In an embodiment, an i-vector extraction apparatus is provided, which corresponds one-to-one to the i-vector extraction method in the foregoing embodiments. As shown in FIG. 7, the i-vector extraction apparatus includes a speech data acquisition module 10, a variability space training module 20, a variability space projection module 30, and an i-vector acquisition module 40. The functional modules are described in detail as follows:
The speech data acquisition module 10 is configured to obtain training speech data of a speaker and extract training speech features corresponding to the training speech data.
The variability space training module 20 is configured to train, based on a preset UBM model, a total variability subspace corresponding to the preset UBM model.
The variability space projection module 30 is configured to project the training speech features onto the total variability subspace to obtain a first i-vector.
The i-vector acquisition module 40 is configured to project the first i-vector onto the total variability subspace to obtain a registered i-vector corresponding to the speaker.
Preferably, the speech data acquisition module 10 includes a speech data acquisition unit 11, a data power spectrum acquisition unit 12, a Mel power spectrum acquisition unit 13, and an MFCC feature acquisition unit 14.
The speech data acquisition unit 11 is configured to preprocess the training speech data to obtain preprocessed speech data.
The data power spectrum acquisition unit 12 is configured to perform a fast Fourier transform on the preprocessed speech data to obtain the spectrum of the training speech data, and to obtain the power spectrum of the training speech data from the spectrum.
The Mel power spectrum acquisition unit 13 is configured to process the power spectrum of the training speech data with a Mel-scale filter bank to obtain the Mel power spectrum of the training speech data.
The MFCC feature acquisition unit 14 is configured to perform cepstral analysis on the Mel power spectrum to obtain the MFCC features of the training speech data.
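To make the pipeline of units 11 to 14 concrete, the following is a minimal Python sketch of the FFT, power spectrum, Mel filter bank, and cepstral-analysis steps; the frame length, hop size, filter count, and coefficient count are illustrative assumptions rather than values fixed by this application.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512, n_mels=26, n_ceps=13):
    # Preprocessing: pre-emphasis, framing, and windowing.
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)

    # Fast Fourier transform -> power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # Mel-scale filter bank applied to the power spectrum.
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    bins = np.floor((n_fft + 1) * imel(np.linspace(mel(0), mel(sr / 2), n_mels + 2)) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        fbank[i - 1, bins[i - 1]:bins[i]] = (np.arange(bins[i - 1], bins[i]) - bins[i - 1]) / max(bins[i] - bins[i - 1], 1)
        fbank[i - 1, bins[i]:bins[i + 1]] = (bins[i + 1] - np.arange(bins[i], bins[i + 1])) / max(bins[i + 1] - bins[i], 1)
    mel_power = power @ fbank.T

    # Cepstral analysis: log compression followed by a DCT.
    return dct(np.log(mel_power + 1e-10), norm='ortho')[:, :n_ceps]
```

On a 16 kHz mono signal this returns one 13-dimensional MFCC row per frame.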
The variability-space training module 20 includes a high-dimensional statistics acquisition unit 21 and a variability subspace acquisition unit 22.
The high-dimensional statistics acquisition unit 21 is configured to acquire the high-dimensional sufficient statistics of the preset UBM model.
The variability subspace acquisition unit 22 is configured to iterate on the high-dimensional sufficient statistics with the expectation-maximization (EM) algorithm to obtain the corresponding total variability subspace.
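Unit 22 can be illustrated by the usual EM-style update of the total variability matrix T from the zeroth- and first-order Baum-Welch statistics collected against the UBM. The sketch below assumes a diagonal-covariance UBM and that those statistics have already been accumulated per utterance; the function and variable names are illustrative, not taken from this application.

```python
import numpy as np

def train_total_variability(stats, sigma, n_dim, n_iter=5, seed=0):
    # stats: list of (N, F) pairs, one per training utterance, where N is
    # the (C,) vector of zeroth-order statistics and F is the (C*Fd,)
    # centered first-order supervector; sigma is the (C*Fd,) diagonal
    # covariance supervector of the UBM.
    C, CFd = stats[0][0].size, stats[0][1].size
    Fd = CFd // C
    T = np.random.default_rng(seed).standard_normal((CFd, n_dim)) * 0.01
    I = np.eye(n_dim)
    for _ in range(n_iter):
        A = np.zeros((C, n_dim, n_dim))          # per-Gaussian accumulators
        B = np.zeros((CFd, n_dim))
        for N, F in stats:
            Nsup = np.repeat(N, Fd)              # expand N to supervector size
            TtSi = T.T / sigma                   # T^T Sigma^-1
            L = I + (TtSi * Nsup) @ T            # posterior precision of w
            cov = np.linalg.inv(L)
            w = cov @ (TtSi @ F)                 # E[w], the utterance i-vector
            ww = cov + np.outer(w, w)            # E[w w^T]
            A += N[:, None, None] * ww
            B += np.outer(F, w)
        for c in range(C):                       # M-step, one block per Gaussian
            rows = slice(c * Fd, (c + 1) * Fd)
            T[rows] = B[rows] @ np.linalg.inv(A[c])
    return T
```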
The variability-space projection module 30 includes a GMM-UBM model acquisition unit 31 and a first vector acquisition unit 32.
The GMM-UBM model acquisition unit 31 is configured to obtain a GMM-UBM model from the training speech features and the preset UBM model by mean MAP adaptation.
The first vector acquisition unit 32 is configured to obtain the first i-vector with the formula s1 = m + Tw1, where s1 is the mean supervector corresponding to the C*F-dimensional GMM-UBM model, m is a speaker-independent and channel-independent C*F-dimensional supervector, T is the total variability subspace of dimension CF*N, and w1 is the first i-vector of dimension N.
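Two hedged sketches of what units 31 and 32 describe. First, mean MAP adaptation of the UBM means, assuming a diagonal-covariance UBM whose weights, means, and covariances are already trained; the relevance factor r = 16 is a common choice in the literature, not a value fixed by this application.

```python
import numpy as np

def map_adapt_means(X, weights, means, covs, r=16.0):
    # X: (T, F) feature frames; weights: (C,); means, covs: (C, F) diagonal.
    # Frame posteriors under the UBM, computed from log-likelihoods.
    ll = (-0.5 * (((X[:, None, :] - means) ** 2) / covs).sum(-1)
          - 0.5 * np.log(covs).sum(-1) + np.log(weights))
    post = np.exp(ll - ll.max(1, keepdims=True))
    post /= post.sum(1, keepdims=True)                  # (T, C)
    n = post.sum(0)                                     # zeroth-order statistics
    Ex = post.T @ X / np.maximum(n[:, None], 1e-10)     # first-order statistics
    alpha = (n / (n + r))[:, None]                      # adaptation coefficient
    return alpha * Ex + (1 - alpha) * means             # MAP-adapted means
```

Second, the projection itself: read literally, s1 = m + Tw1 makes w1 the coordinates of the supervector offset s1 - m in the subspace spanned by the columns of T, so a simple least-squares reading recovers w1; production systems usually take the posterior mean E[w], as in the training sketch above. The helper below is a hypothetical illustration of that reading.

```python
def project_supervector(s, m, T):
    # Least-squares solution of s = m + T w, i.e. the coordinates of the
    # supervector offset (s - m) in the subspace T.
    w, *_ = np.linalg.lstsq(T, s - m, rcond=None)
    return w
```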
Preferably, the i-vector acquisition module 40 includes a registered vector acquisition unit 41.
The registered vector acquisition unit 41 is configured to project the first i-vector onto the total variability subspace with the formula s2 = m + Tw2 to obtain the registered i-vector, where s2 is the D*G-dimensional mean supervector corresponding to the registered i-vector, m is a speaker-independent and channel-independent D*G-dimensional supervector, T is the total variability subspace of dimension DG*M, and w2 is the registered i-vector of dimension M.
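The application does not spell out how the supervector s2 is formed from the first i-vector, so the fragment below is only one hypothetical reading: it re-synthesizes a supervector from w1 and projects it through a second, lower-dimensional total variability matrix. T1, T2, and m2 are assumed names, and the two supervector spaces are assumed to have the same size.

```python
def second_projection(w1, T1, m2, T2):
    # Hypothetical second pass: re-embed the first i-vector as a supervector,
    # then project it onto T2 (dimension DG*M) with the helper defined above.
    s2 = m2 + T1 @ w1
    return project_supervector(s2, m2, T2)   # registered i-vector, dimension M
```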
For specific limitations of the i-vector extraction apparatus, refer to the limitations of the i-vector extraction method above, which are not repeated here. The modules of the i-vector extraction apparatus may be implemented in whole or in part by software, hardware, or a combination of the two. The modules may be embedded in, or independent of, a processor of a computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to the modules.
In one embodiment, a speaker recognition apparatus is provided, which corresponds one-to-one to the speaker recognition method in the above embodiment. As shown in FIG. 8, the speaker recognition apparatus includes a test data acquisition module 50, a test vector acquisition module 60, a registered vector acquisition module 70, and a corresponding-speaker determination module 80. The functional modules are described in detail as follows:
The test data acquisition module 50 is configured to acquire test speech data, the test speech data carrying a speaker identifier.
The test vector acquisition module 60 is configured to process the test speech data with the i-vector extraction method to obtain the corresponding test i-vector.
The registered vector acquisition module 70 is configured to query a database based on the speaker identifier to obtain the registered i-vector corresponding to the speaker identifier.
The corresponding-speaker determination module 80 is configured to obtain the similarity between the test i-vector and the registered i-vector with a cosine similarity algorithm, and to detect, based on the similarity, whether the test i-vector and the registered i-vector correspond to the same speaker.
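A minimal sketch of the scoring step in module 80; the acceptance threshold of 0.7 is an illustrative assumption, since the application does not fix a value.

```python
import numpy as np

def cosine_similarity(w_test, w_enrolled):
    return float(np.dot(w_test, w_enrolled) /
                 (np.linalg.norm(w_test) * np.linalg.norm(w_enrolled)))

def is_same_speaker(w_test, w_enrolled, threshold=0.7):
    # Accept the identity claim when the similarity clears the threshold.
    return cosine_similarity(w_test, w_enrolled) >= threshold
```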
For specific limitations of the speaker recognition apparatus, refer to the limitations of the speaker recognition method above, which are not repeated here. The modules of the speaker recognition apparatus may be implemented in whole or in part by software, hardware, or a combination of the two. The modules may be embedded in, or independent of, a processor of a computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 9. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions in the non-volatile storage medium. The database of the computer device stores data related to the i-vector extraction method or the speaker recognition method. The network interface of the computer device communicates with external terminals through a network connection. The computer-readable instructions, when executed by the processor, implement the i-vector extraction method or the speaker recognition method.
In one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When executing the computer-readable instructions, the processor implements the following steps: acquiring the training speech data of a speaker and extracting the training speech features corresponding to the training speech data; training, based on a preset UBM model, the total variability subspace corresponding to the preset UBM model; projecting the training speech features onto the total variability subspace to obtain a first i-vector; and projecting the first i-vector onto the total variability subspace to obtain the registered i-vector corresponding to the speaker.
In one embodiment, when extracting the training speech features corresponding to the training speech data, the processor implements the following steps when executing the computer-readable instructions: preprocessing the training speech data to obtain preprocessed speech data; performing a fast Fourier transform on the preprocessed speech data to obtain the spectrum of the training speech data, and obtaining the power spectrum of the training speech data from the spectrum; processing the power spectrum of the training speech data with a Mel-scale filter bank to obtain the Mel power spectrum of the training speech data; and performing cepstral analysis on the Mel power spectrum to obtain the MFCC features of the training speech data.
In one embodiment, when training, based on the preset UBM model, the total variability subspace corresponding to the preset UBM model, the processor implements the following steps when executing the computer-readable instructions: acquiring the high-dimensional sufficient statistics of the preset UBM model; and iterating on the high-dimensional sufficient statistics with the expectation-maximization algorithm to obtain the corresponding total variability subspace.
In one embodiment, when projecting the training speech features onto the total variability subspace to obtain the first i-vector, the processor implements the following steps when executing the computer-readable instructions: obtaining a GMM-UBM model from the training speech features and the preset UBM model by mean MAP adaptation; and projecting the training speech features onto the total variability subspace with the formula s1 = m + Tw1 to obtain the first i-vector, where s1 is the mean supervector corresponding to the training speech features in the C*F-dimensional GMM-UBM model, m is a speaker-independent and channel-independent C*F-dimensional supervector, T is the total variability subspace of dimension CF*N, and w1 is the first i-vector of dimension N.
In one embodiment, when projecting the first i-vector onto the total variability subspace to obtain the registered i-vector corresponding to the speaker, the processor implements the following steps when executing the computer-readable instructions: projecting the first i-vector onto the total variability subspace with the formula s2 = m + Tw2 to obtain the registered i-vector, where s2 is the D*G-dimensional mean supervector corresponding to the registered i-vector, m is a speaker-independent and channel-independent D*G-dimensional supervector, T is the total variability subspace of dimension DG*M, and w2 is the registered i-vector of dimension M.
In one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When executing the computer-readable instructions, the processor implements the following steps: acquiring test speech data, the test speech data carrying a speaker identifier; obtaining the corresponding test i-vector based on the test speech data; querying a database based on the speaker identifier to obtain the registered i-vector corresponding to the speaker identifier; and obtaining the similarity between the test i-vector and the registered i-vector with a cosine similarity algorithm, and detecting, based on the similarity, whether the test i-vector and the registered i-vector correspond to the same speaker.
In one embodiment, a computer-readable storage medium is provided, storing computer-readable instructions. When executed by a processor, the computer-readable instructions implement the following steps: acquiring the training speech data of a speaker and extracting the training speech features corresponding to the training speech data; training, based on a preset UBM model, the total variability subspace corresponding to the preset UBM model; projecting the training speech features onto the total variability subspace to obtain a first i-vector; and projecting the first i-vector onto the total variability subspace to obtain the registered i-vector corresponding to the speaker.
In one embodiment, when extracting the training speech features corresponding to the training speech data, the computer-readable instructions, when executed by the processor, implement the following steps: preprocessing the training speech data to obtain preprocessed speech data; performing a fast Fourier transform on the preprocessed speech data to obtain the spectrum of the training speech data, and obtaining the power spectrum of the training speech data from the spectrum; processing the power spectrum of the training speech data with a Mel-scale filter bank to obtain the Mel power spectrum of the training speech data; and performing cepstral analysis on the Mel power spectrum to obtain the MFCC features of the training speech data.
In one embodiment, when training, based on the preset UBM model, the total variability subspace corresponding to the preset UBM model, the computer-readable instructions, when executed by the processor, implement the following steps: acquiring the high-dimensional sufficient statistics of the preset UBM model; and iterating on the high-dimensional sufficient statistics with the expectation-maximization algorithm to obtain the corresponding total variability subspace.
In one embodiment, when projecting the training speech features onto the total variability subspace to obtain the first i-vector, the computer-readable instructions, when executed by the processor, implement the following steps: obtaining a GMM-UBM model from the training speech features and the preset UBM model by mean MAP adaptation; and projecting the training speech features onto the total variability subspace with the formula s1 = m + Tw1 to obtain the first i-vector, where s1 is the mean supervector corresponding to the training speech features in the C*F-dimensional GMM-UBM model, m is a speaker-independent and channel-independent C*F-dimensional supervector, T is the total variability subspace of dimension CF*N, and w1 is the first i-vector of dimension N.
In one embodiment, when projecting the first i-vector onto the total variability subspace to obtain the registered i-vector corresponding to the speaker, the computer-readable instructions, when executed by the processor, implement the following steps: projecting the first i-vector onto the total variability subspace with the formula s2 = m + Tw2 to obtain the registered i-vector, where s2 is the D*G-dimensional mean supervector corresponding to the registered i-vector, m is a speaker-independent and channel-independent D*G-dimensional supervector, T is the total variability subspace of dimension DG*M, and w2 is the registered i-vector of dimension M.
In one embodiment, a computer-readable storage medium is provided, storing computer-readable instructions. When executed by a processor, the computer-readable instructions implement the following steps: acquiring test speech data, the test speech data carrying a speaker identifier; obtaining the corresponding test i-vector based on the test speech data; querying a database based on the speaker identifier to obtain the registered i-vector corresponding to the speaker identifier; and obtaining the similarity between the test i-vector and the registered i-vector with a cosine similarity algorithm, and detecting, based on the similarity, whether the test i-vector and the registered i-vector correspond to the same speaker.
A person of ordinary skill in the art can understand that all or part of the processes of the methods in the above embodiments may be implemented by instructing the relevant hardware through computer-readable instructions, which may be stored in a non-volatile computer-readable storage medium; when executed, the computer-readable instructions may include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Those skilled in the art can clearly understand that, for convenience and brevity of description, only the above division of functional units and modules is used as an example; in practice, the above functions may be assigned to different functional units or modules as needed, that is, the internal structure of the apparatus may be divided into different functional units or modules to complete all or part of the functions described above.
The above embodiments are only intended to describe the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application, and shall all fall within the protection scope of this application.

Claims (20)

  1. An i-vector extraction method, characterized by comprising:
    acquiring training speech data of a speaker, and extracting training speech features corresponding to the training speech data;
    training, based on a preset UBM model, a total variability subspace corresponding to the preset UBM model;
    projecting the training speech features onto the total variability subspace to obtain a first i-vector; and
    projecting the first i-vector onto the total variability subspace to obtain a registered i-vector corresponding to the speaker.
  2. The i-vector extraction method according to claim 1, characterized in that the extracting of the training speech features corresponding to the training speech data comprises:
    preprocessing the training speech data to obtain preprocessed speech data;
    performing a fast Fourier transform on the preprocessed speech data to obtain a spectrum of the training speech data, and obtaining a power spectrum of the training speech data from the spectrum;
    processing the power spectrum of the training speech data with a Mel-scale filter bank to obtain a Mel power spectrum of the training speech data; and
    performing cepstral analysis on the Mel power spectrum to obtain MFCC features of the training speech data.
  3. The i-vector extraction method according to claim 1, characterized in that the training, based on the preset UBM model, of the total variability subspace corresponding to the preset UBM model comprises:
    acquiring high-dimensional sufficient statistics of the preset UBM model; and
    iterating on the high-dimensional sufficient statistics with an expectation-maximization algorithm to obtain the corresponding total variability subspace.
  4. The i-vector extraction method according to claim 1, characterized in that the projecting of the training speech features onto the total variability subspace to obtain a first i-vector comprises:
    obtaining a GMM-UBM model from the training speech features and the preset UBM model by mean MAP adaptation; and
    projecting the training speech features onto the total variability subspace with the formula s1 = m + Tw1 to obtain the first i-vector, where s1 is the mean supervector corresponding to the training speech features in the C*F-dimensional GMM-UBM model, m is a speaker-independent and channel-independent C*F-dimensional supervector, T is the total variability subspace of dimension CF*N, and w1 is the first i-vector of dimension N.
  5. The i-vector extraction method according to claim 1, characterized in that the projecting of the first i-vector onto the total variability subspace to obtain a registered i-vector corresponding to the speaker comprises:
    projecting the first i-vector onto the total variability subspace with the formula s2 = m + Tw2 to obtain the registered i-vector, where s2 is the D*G-dimensional mean supervector corresponding to the registered i-vector, m is a speaker-independent and channel-independent D*G-dimensional supervector, T is the total variability subspace of dimension DG*M, and w2 is the registered i-vector of dimension M.
  6. A speaker recognition method, characterized by comprising:
    acquiring test speech data, the test speech data carrying a speaker identifier;
    processing the test speech data with the i-vector extraction method according to any one of claims 1 to 5 to obtain a corresponding test i-vector;
    querying a database based on the speaker identifier to obtain a registered i-vector corresponding to the speaker identifier; and
    obtaining a similarity between the test i-vector and the registered i-vector with a cosine similarity algorithm, and detecting, based on the similarity, whether the test i-vector and the registered i-vector correspond to the same speaker.
  7. An i-vector extraction apparatus, characterized by comprising:
    a speech data acquisition module, configured to acquire training speech data of a speaker and extract training speech features corresponding to the training speech data;
    a variability-space training module, configured to train, based on a preset UBM model, a total variability subspace corresponding to the preset UBM model;
    a variability-space projection module, configured to project the training speech features onto the total variability subspace to obtain a first i-vector; and
    an i-vector acquisition module, configured to project the first i-vector onto the total variability subspace to obtain a registered i-vector corresponding to the speaker.
  8. A speaker recognition apparatus, characterized by comprising:
    a test data acquisition module, configured to acquire test speech data, the test speech data carrying a speaker identifier;
    a test vector acquisition module, configured to process the test speech data with the i-vector extraction method according to any one of claims 1 to 5 to obtain a corresponding test i-vector;
    a registered vector acquisition module, configured to query a database based on the speaker identifier to obtain a registered i-vector corresponding to the speaker identifier; and
    a corresponding-speaker determination module, configured to obtain a similarity between the test i-vector and the registered i-vector with a cosine similarity algorithm, and to detect, based on the similarity, whether the test i-vector and the registered i-vector correspond to the same speaker.
  9. A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, characterized in that the processor implements the following steps when executing the computer-readable instructions:
    acquiring training speech data of a speaker, and extracting training speech features corresponding to the training speech data;
    training, based on a preset UBM model, a total variability subspace corresponding to the preset UBM model;
    projecting the training speech features onto the total variability subspace to obtain a first i-vector; and
    projecting the first i-vector onto the total variability subspace to obtain a registered i-vector corresponding to the speaker.
  10. The computer device according to claim 9, characterized in that the extracting of the training speech features corresponding to the training speech data comprises:
    preprocessing the training speech data to obtain preprocessed speech data;
    performing a fast Fourier transform on the preprocessed speech data to obtain a spectrum of the training speech data, and obtaining a power spectrum of the training speech data from the spectrum;
    processing the power spectrum of the training speech data with a Mel-scale filter bank to obtain a Mel power spectrum of the training speech data; and
    performing cepstral analysis on the Mel power spectrum to obtain MFCC features of the training speech data.
  11. The computer device according to claim 9, characterized in that the training, based on the preset UBM model, of the total variability subspace corresponding to the preset UBM model comprises:
    acquiring high-dimensional sufficient statistics of the preset UBM model; and
    iterating on the high-dimensional sufficient statistics with an expectation-maximization algorithm to obtain the corresponding total variability subspace.
  12. The computer device according to claim 9, characterized in that the projecting of the training speech features onto the total variability subspace to obtain a first i-vector comprises:
    obtaining a GMM-UBM model from the training speech features and the preset UBM model by mean MAP adaptation; and
    projecting the training speech features onto the total variability subspace with the formula s1 = m + Tw1 to obtain the first i-vector, where s1 is the mean supervector corresponding to the training speech features in the C*F-dimensional GMM-UBM model, m is a speaker-independent and channel-independent C*F-dimensional supervector, T is the total variability subspace of dimension CF*N, and w1 is the first i-vector of dimension N.
  13. The computer device according to claim 9, characterized in that the projecting of the first i-vector onto the total variability subspace to obtain a registered i-vector corresponding to the speaker comprises:
    projecting the first i-vector onto the total variability subspace with the formula s2 = m + Tw2 to obtain the registered i-vector, where s2 is the D*G-dimensional mean supervector corresponding to the registered i-vector, m is a speaker-independent and channel-independent D*G-dimensional supervector, T is the total variability subspace of dimension DG*M, and w2 is the registered i-vector of dimension M.
  14. A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, characterized in that the processor implements the following steps when executing the computer-readable instructions:
    acquiring test speech data, the test speech data carrying a speaker identifier;
    processing the test speech data with the i-vector extraction method according to any one of claims 1 to 5 to obtain a corresponding test i-vector;
    querying a database based on the speaker identifier to obtain a registered i-vector corresponding to the speaker identifier; and
    obtaining a similarity between the test i-vector and the registered i-vector with a cosine similarity algorithm, and detecting, based on the similarity, whether the test i-vector and the registered i-vector correspond to the same speaker.
  15. One or more non-volatile readable storage media storing computer-readable instructions, characterized in that the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
    acquiring training speech data of a speaker, and extracting training speech features corresponding to the training speech data;
    training, based on a preset UBM model, a total variability subspace corresponding to the preset UBM model;
    projecting the training speech features onto the total variability subspace to obtain a first i-vector; and
    projecting the first i-vector onto the total variability subspace to obtain a registered i-vector corresponding to the speaker.
  16. The non-volatile readable storage medium according to claim 15, characterized in that the extracting of the training speech features corresponding to the training speech data comprises:
    preprocessing the training speech data to obtain preprocessed speech data;
    performing a fast Fourier transform on the preprocessed speech data to obtain a spectrum of the training speech data, and obtaining a power spectrum of the training speech data from the spectrum;
    processing the power spectrum of the training speech data with a Mel-scale filter bank to obtain a Mel power spectrum of the training speech data; and
    performing cepstral analysis on the Mel power spectrum to obtain MFCC features of the training speech data.
  17. The non-volatile readable storage medium according to claim 15, characterized in that the training, based on the preset UBM model, of the total variability subspace corresponding to the preset UBM model comprises:
    acquiring high-dimensional sufficient statistics of the preset UBM model; and
    iterating on the high-dimensional sufficient statistics with an expectation-maximization algorithm to obtain the corresponding total variability subspace.
  18. The non-volatile readable storage medium according to claim 15, characterized in that the projecting of the training speech features onto the total variability subspace to obtain a first i-vector comprises:
    obtaining a GMM-UBM model from the training speech features and the preset UBM model by mean MAP adaptation; and
    projecting the training speech features onto the total variability subspace with the formula s1 = m + Tw1 to obtain the first i-vector, where s1 is the mean supervector corresponding to the training speech features in the C*F-dimensional GMM-UBM model, m is a speaker-independent and channel-independent C*F-dimensional supervector, T is the total variability subspace of dimension CF*N, and w1 is the first i-vector of dimension N.
  19. The non-volatile readable storage medium according to claim 15, characterized in that the projecting of the first i-vector onto the total variability subspace to obtain a registered i-vector corresponding to the speaker comprises:
    projecting the first i-vector onto the total variability subspace with the formula s2 = m + Tw2 to obtain the registered i-vector, where s2 is the D*G-dimensional mean supervector corresponding to the registered i-vector, m is a speaker-independent and channel-independent D*G-dimensional supervector, T is the total variability subspace of dimension DG*M, and w2 is the registered i-vector of dimension M.
  20. One or more non-volatile readable storage media storing computer-readable instructions, characterized in that the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
    acquiring test speech data, the test speech data carrying a speaker identifier;
    processing the test speech data with the i-vector extraction method according to any one of claims 1 to 5 to obtain a corresponding test i-vector;
    querying a database based on the speaker identifier to obtain a registered i-vector corresponding to the speaker identifier; and
    obtaining a similarity between the test i-vector and the registered i-vector with a cosine similarity algorithm, and detecting, based on the similarity, whether the test i-vector and the registered i-vector correspond to the same speaker.
PCT/CN2018/092589 2018-06-06 2018-06-25 I-vector extraction method, speaker recognition method and apparatus, device, and medium WO2019232826A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810574010.4A CN109065022B (en) 2018-06-06 2018-06-06 Method for extracting i-vector, method, device, equipment and medium for speaker recognition
CN201810574010.4 2018-06-06

Publications (1)

Publication Number Publication Date
WO2019232826A1 true WO2019232826A1 (en) 2019-12-12

Family

ID=64820489

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/092589 WO2019232826A1 (en) 2018-06-06 2018-06-25 I-vector extraction method, speaker recognition method and apparatus, device, and medium

Country Status (2)

Country Link
CN (1) CN109065022B (en)
WO (1) WO2019232826A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111700718A (en) * 2020-07-13 2020-09-25 北京海益同展信息科技有限公司 Holding posture identifying method, holding posture identifying device, artificial limb and readable storage medium

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020154883A1 (en) * 2019-01-29 2020-08-06 深圳市欢太科技有限公司 Speech information processing method and apparatus, and storage medium and electronic device
CN111712874B (en) 2019-10-31 2023-07-14 支付宝(杭州)信息技术有限公司 Method, system, device and storage medium for determining sound characteristics
CN110827834B (en) * 2019-11-11 2022-07-12 广州国音智能科技有限公司 Voiceprint registration method, system and computer readable storage medium
CN111161713A (en) * 2019-12-20 2020-05-15 北京皮尔布莱尼软件有限公司 Voice gender identification method and device and computing equipment
CN111508505B (en) * 2020-04-28 2023-11-03 讯飞智元信息科技有限公司 Speaker recognition method, device, equipment and storage medium
CN114420142A (en) * 2022-03-28 2022-04-29 北京沃丰时代数据科技有限公司 Voice conversion method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104240706A (en) * 2014-09-12 2014-12-24 浙江大学 Speaker recognition method based on GMM Token matching similarity correction scores
CN105933323A (en) * 2016-06-01 2016-09-07 百度在线网络技术(北京)有限公司 Voiceprint register and authentication method and device
CN107240397A (en) * 2017-08-14 2017-10-10 广东工业大学 A kind of smart lock and its audio recognition method and system based on Application on Voiceprint Recognition
CN107369440A (en) * 2017-08-02 2017-11-21 北京灵伴未来科技有限公司 The training method and device of a kind of Speaker Identification model for phrase sound
WO2018029071A1 (en) * 2016-08-12 2018-02-15 Imra Europe S.A.S Audio signature for speech command spotting

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102737633B (en) * 2012-06-21 2013-12-25 北京华信恒达软件技术有限公司 Method and device for recognizing speaker based on tensor subspace analysis
US9858919B2 (en) * 2013-11-27 2018-01-02 International Business Machines Corporation Speaker adaptation of neural network acoustic models using I-vectors
CN104167208B (en) * 2014-08-08 2017-09-15 中国科学院深圳先进技术研究院 A kind of method for distinguishing speek person and device
CN105810199A (en) * 2014-12-30 2016-07-27 中国科学院深圳先进技术研究院 Identity verification method and device for speakers
US10553218B2 (en) * 2016-09-19 2020-02-04 Pindrop Security, Inc. Dimensionality reduction of baum-welch statistics for speaker recognition
CN106971713B (en) * 2017-01-18 2020-01-07 北京华控智加科技有限公司 Speaker marking method and system based on density peak value clustering and variational Bayes
CN107146601B (en) * 2017-04-07 2020-07-24 南京邮电大学 Rear-end i-vector enhancement method for speaker recognition system
CN107633845A (en) * 2017-09-11 2018-01-26 清华大学 A kind of duscriminant local message distance keeps the method for identifying speaker of mapping

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104240706A (en) * 2014-09-12 2014-12-24 浙江大学 Speaker recognition method based on GMM Token matching similarity correction scores
CN105933323A (en) * 2016-06-01 2016-09-07 百度在线网络技术(北京)有限公司 Voiceprint register and authentication method and device
WO2018029071A1 (en) * 2016-08-12 2018-02-15 Imra Europe S.A.S Audio signature for speech command spotting
CN107369440A (en) * 2017-08-02 2017-11-21 北京灵伴未来科技有限公司 The training method and device of a kind of Speaker Identification model for phrase sound
CN107240397A (en) * 2017-08-14 2017-10-10 广东工业大学 A kind of smart lock and its audio recognition method and system based on Application on Voiceprint Recognition

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111700718A (en) * 2020-07-13 2020-09-25 北京海益同展信息科技有限公司 Holding posture identifying method, holding posture identifying device, artificial limb and readable storage medium

Also Published As

Publication number Publication date
CN109065022B (en) 2022-08-09
CN109065022A (en) 2018-12-21

Similar Documents

Publication Publication Date Title
WO2019232826A1 (en) I-vector extraction method, speaker recognition method and apparatus, device, and medium
WO2019237519A1 (en) General vector training method, voice clustering method, apparatus, device and medium
JP7008638B2 (en) voice recognition
CN109065028B (en) Speaker clustering method, speaker clustering device, computer equipment and storage medium
WO2019232829A1 (en) Voiceprint recognition method and apparatus, computer device and storage medium
US9940935B2 (en) Method and device for voiceprint recognition
WO2020177380A1 (en) Voiceprint detection method, apparatus and device based on short text, and storage medium
US8160877B1 (en) Hierarchical real-time speaker recognition for biometric VoIP verification and targeting
Li et al. An overview of noise-robust automatic speech recognition
CN108922543B (en) Model base establishing method, voice recognition method, device, equipment and medium
WO2018149077A1 (en) Voiceprint recognition method, device, storage medium, and background server
WO2020181824A1 (en) Voiceprint recognition method, apparatus and device, and computer-readable storage medium
CN109360572B (en) Call separation method and device, computer equipment and storage medium
TW201935464A (en) Method and device for voiceprint recognition based on memorability bottleneck features
WO2019227586A1 (en) Voice model training method, speaker recognition method, apparatus, device and medium
WO2018223727A1 (en) Voiceprint recognition method, apparatus and device, and medium
CN105096955B (en) A kind of speaker&#39;s method for quickly identifying and system based on model growth cluster
WO2019200744A1 (en) Self-updated anti-fraud method and apparatus, computer device and storage medium
WO2020034628A1 (en) Accent identification method and device, computer device, and storage medium
WO2019227574A1 (en) Voice model training method, voice recognition method, device and equipment, and medium
CN113223536B (en) Voiceprint recognition method and device and terminal equipment
WO2022141868A1 (en) Method and apparatus for extracting speech features, terminal, and storage medium
WO2023001128A1 (en) Audio data processing method, apparatus and device
CN116490920A (en) Method for detecting an audio challenge, corresponding device, computer program product and computer readable carrier medium for a speech input processed by an automatic speech recognition system
Savchenko Method for reduction of speech signal autoregression model for speech transmission systems on low-speed communication channels

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18921590

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 12.03.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18921590

Country of ref document: EP

Kind code of ref document: A1