CN109065022B - Method for extracting i-vector, method, device, equipment and medium for speaker recognition

Publication number
CN109065022B
Authority
CN
China
Prior art keywords
vector
training
speaker
voice data
test
Prior art date
Legal status
Active
Application number
CN201810574010.4A
Other languages
Chinese (zh)
Other versions
CN109065022A
Inventor
涂宏
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201810574010.4A priority Critical patent/CN109065022B/en
Priority to PCT/CN2018/092589 priority patent/WO2019232826A1/en
Publication of CN109065022A publication Critical patent/CN109065022A/en
Application granted granted Critical
Publication of CN109065022B publication Critical patent/CN109065022B/en

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G10L 2015/0638: Interactive procedures
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use
    • G10L 25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use for comparison or discrimination

Abstract

The invention discloses an i-vector extraction method and a speaker recognition method, device, equipment and medium. The i-vector extraction method comprises the following steps: acquiring training voice data of a speaker and extracting training speech features corresponding to the training voice data; training a total variation subspace corresponding to a preset UBM model based on that model; projecting the training speech features onto the total variation subspace to obtain a first i-vector; and projecting the first i-vector onto the total variation subspace again to obtain a registration i-vector corresponding to the speaker. Because the training speech features are projected twice, more noise is removed and the dimensionality is reduced, which improves the purity of the extracted speaker speech features; at the same time, the smaller computation space after dimensionality reduction improves the efficiency of speech recognition.

Description

Method for extracting i-vector, method, device, equipment and medium for speaker recognition
Technical Field
The invention relates to the field of speech recognition, and in particular to an i-vector extraction method and a speaker recognition method, device, equipment and medium.
Background
Speaker recognition, also known as voiceprint recognition, is a biometric authentication technique that uses speaker-specific information contained in a speech signal to identify the speaker's identity. In recent years, the performance of speaker recognition systems has improved markedly thanks to the introduction of the identity-vector (i-vector) modeling method based on factor analysis. When a speaker's voice is analyzed in this way, the channel subspace usually also contains speaker information. The i-vector approach therefore uses a single low-dimensional total variability space to represent the speaker subspace and the channel subspace together; the speaker's voice is projected into this space through dimensionality reduction, yielding a fixed-length vector representation (i.e. an i-vector). However, the i-vectors obtained with existing i-vector modeling still contain many interference factors, which increases the complexity of speaker recognition based on these i-vectors.
Disclosure of Invention
In view of the foregoing, it is desirable to provide an i-vector extraction method, apparatus, computer device, and storage medium capable of removing a large number of interference factors.
An i-vector extraction method, comprising:
acquiring training voice data of a speaker, and extracting training voice characteristics corresponding to the training voice data;
training a total variation subspace corresponding to the preset UBM model based on the preset UBM model;
projecting the training voice features on a total change subspace to obtain a first i-vector;
and projecting the first i-vector on the overall change subspace to obtain a registration i-vector corresponding to the speaker.
An i-vector extraction apparatus comprising:
the voice data acquisition module is used for acquiring training voice data of a speaker and extracting training voice characteristics corresponding to the training voice data;
the training change space module is used for training out a total change subspace corresponding to a preset UBM model based on the preset UBM model;
the projection change space module is used for projecting the training voice features on the total change subspace to obtain a first i-vector;
and the i-vector obtaining module is used for projecting the first i-vector on the total change subspace and obtaining a registration i-vector corresponding to the speaker.
A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the i-vector extraction method when executing the computer program.
A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the i-vector extraction method.
The present embodiment also provides a speaker recognition method, including:
acquiring test voice data, wherein the test voice data carries a speaker identifier;
acquiring a corresponding test i-vector based on the test voice data;
inquiring a database based on the speaker identification to obtain a registered i-vector corresponding to the speaker identification;
and acquiring the similarity of the test i-vector and the registered i-vector by adopting a cosine similarity algorithm, and detecting whether the test i-vector and the registered i-vector correspond to the same speaker or not according to the similarity.
A speaker recognition device, comprising:
the test data acquisition module is used for acquiring test voice data, and the test voice data carries a speaker identifier;
the test vector acquisition module is used for processing the test voice data by adopting an i-vector extraction method to acquire a corresponding test i-vector;
the acquisition registration vector module is used for inquiring the database based on the speaker identifier and acquiring a registration i-vector corresponding to the speaker identifier;
and the speaker corresponding module is used for acquiring the similarity of the test i-vector and the registered i-vector by adopting a cosine similarity algorithm, and detecting whether the test i-vector and the registered i-vector correspond to the same speaker or not according to the similarity.
A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the speaker recognition method when executing the computer program.
A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the speaker recognition method.
According to the i-vector extraction method, speaker recognition method, device, equipment and medium provided by the embodiments of the invention, the training speech features are projected onto the total variation subspace to obtain a first i-vector, and the first i-vector is then projected onto the total variation subspace a second time to obtain the registration i-vector. Because the training speech feature data are projected twice, more noise features are removed and the dimensionality is reduced, which improves the purity of the extracted speaker speech features; at the same time, the smaller computation space after dimensionality reduction improves the efficiency of speech recognition and lowers its complexity.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a schematic diagram of an application environment of an i-vector extraction method according to an embodiment of the present invention;
FIG. 2 is a flowchart of an i-vector extraction method according to an embodiment of the present invention;
FIG. 3 is another detailed flowchart of an i-vector extraction method according to an embodiment of the present invention;
FIG. 4 is another detailed flowchart of an i-vector extraction method according to an embodiment of the present invention;
FIG. 5 is another detailed flowchart of an i-vector extraction method according to an embodiment of the present invention;
FIG. 6 is a flowchart illustrating an exemplary method for speaker recognition according to an embodiment of the present invention;
FIG. 7 is a schematic block diagram of an i-vector extraction apparatus according to an embodiment of the present invention;
FIG. 8 is a functional block diagram of a speaker recognition device in accordance with an embodiment of the present invention;
FIG. 9 is a schematic diagram of a computer device according to an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The i-vector extraction method provided by the embodiment of the invention can be applied to the application environment shown in fig. 1, wherein the computer equipment is communicated with the identification server through a network. Computer devices include, but are not limited to, various personal computers, laptops, smartphones, tablets, and portable wearable devices, among others. The identification server may be implemented as a stand-alone server or as a server cluster consisting of a plurality of servers.
In an embodiment, as shown in fig. 2, an i-vector extraction method is provided, which is described by taking the example that the method is applied to the recognition server in fig. 1, and includes the following steps:
and S10, acquiring training voice data of the speaker, and extracting training voice characteristics corresponding to the training voice data.
The training voice data of the speaker is the original voice data provided by the speaker. The training speech features are speech features that distinguish the speaker from other people; Mel-Frequency Cepstral Coefficients (hereinafter MFCC features) may be used as the training speech features.
Research has found that the human ear behaves like a filter bank and attends only to certain frequency components (human hearing is non-linear with respect to frequency); in other words, the ear receives sound signals only in limited frequency bands. These filters are not uniformly distributed along the frequency axis: in the low-frequency region there are many filters and they are densely spaced, while in the high-frequency region the filters become fewer and sparsely distributed. The Mel-scale filter bank therefore has high resolution in the low-frequency part, which matches the auditory characteristics of the human ear; this is the physical meaning of the Mel scale.
S20, training a total variation subspace corresponding to the preset UBM model based on the preset UBM model.
The preset UBM (Universal Background Model) is a Gaussian Mixture Model (GMM) that represents the feature distribution of a large number of unspecified speakers. Training a UBM model typically uses a large amount of speech data that is not specific to any speaker and is balanced across channels, so the UBM model can generally be regarded as a speaker-independent model that merely fits the speech feature distribution of people in general rather than representing any particular speaker. The UBM model is preset in the recognition server because, in the voiceprint registration stage of the voiceprint recognition process, the speech data available for training a specific speaker is usually very limited: if a GMM model were used to model the speaker's speech features directly, the training speech of that specific speaker could not cover the feature space of the GMM. Therefore, the parameters of the UBM model are adjusted according to the training speech features to represent the personal information of a specific speaker, and the features that the training speech cannot cover are approximated by the similar feature distributions in the UBM model.
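For illustration, a minimal sketch follows of how such a speaker-independent background model can be fitted on MFCC frames pooled from many speakers. The use of scikit-learn's GaussianMixture and the component count of 512 are assumptions made here for the example, not part of the patent.

```python
# Sketch: fit a diagonal-covariance GMM on pooled, speaker-independent MFCC frames
# as a stand-in UBM (library choice and hyper-parameters are assumptions).
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(pooled_mfcc_frames, n_gaussians=512, seed=0):
    """pooled_mfcc_frames: (num_frames, feat_dim) array gathered from many speakers."""
    ubm = GaussianMixture(
        n_components=n_gaussians,
        covariance_type="diag",   # diagonal covariances, as is usual for UBMs
        max_iter=100,
        random_state=seed,
    )
    ubm.fit(pooled_mfcc_frames)
    return ubm  # ubm.means_, ubm.covariances_, ubm.weights_ hold the model parameters

# usage (hypothetical data):
# frames = np.vstack([mfcc_of_utterance_1, mfcc_of_utterance_2])
# ubm = train_ubm(frames, n_gaussians=512)
```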
The total variation subspace, also called the T space (Total Variability Space), is a projection matrix set up directly to model global variation so that it contains all possible information about the speaker in the speech data; the speaker space and the channel space are not separated within the T space. The T space can project a high-dimensional sufficient statistic (supervector) onto a low-dimensional i-vector that serves as a speaker characterization, thereby achieving dimensionality reduction. The training process of the T space is as follows: starting from the preset UBM model, the T space is computed until convergence using factor analysis and the EM (Expectation Maximization) algorithm.
In this step, the total variation subspace obtained from the preset UBM model does not distinguish the speaker space from the channel space; the speaker information and the channel information are merged into a single space, which reduces computational complexity and makes it convenient to acquire the i-vector based on the total variation subspace.
And S30, projecting the training voice features on the overall change subspace to obtain a first i-vector.
The first i-vector is the fixed-length vector representation (i.e. an i-vector) obtained by projecting the training speech features onto the low-dimensional total variation subspace.
Specifically, this step uses the formula s_1 = m + T w_1 to project the high-dimensional training speech features onto the total variation subspace and obtain the low-dimensional first i-vector. The projection reduces the dimensionality of the training speech features and removes much of the noise, which makes it easier to recognize the speaker based on the first i-vector.
And S40, projecting the first i-vector on the overall change subspace, and acquiring a registration i-vector corresponding to the speaker.
The total variation subspace is the one obtained in step S20; it does not separate the speaker space from the channel space but directly sets up a globally varying T space (Total Variability Space) that contains all possible information in the speech data.
The registration i-vector is the fixed-length vector representation (i.e. an i-vector) obtained by projecting the first i-vector onto the low-dimensional total variation subspace; it is recorded in the database of the recognition server and associated with the speaker identifier as an identity.
In one embodiment, step S40, projecting the first i-vector onto the total variation subspace and acquiring the registration i-vector corresponding to the speaker, specifically includes the following steps:
S41, using the formula s_2 = m + T w_2 to project the first i-vector onto the total variation subspace and obtain the registration i-vector, where s_2 is the mean supervector of dimension D x G corresponding to the registration i-vector; m is a speaker-independent and channel-independent D x G-dimensional supervector; T is the total variation subspace, with dimension DG x M; and w_2 is the registration i-vector, with dimension M.
In this formula, s_2 may be the Gaussian mean supervector of the first i-vector obtained in step S30; m is the speaker-independent and channel-independent D x G-dimensional supervector formed by concatenating the mean supervectors of the UBM model; w_2 is a random vector that follows the standard normal distribution, namely the registration i-vector, whose dimension is M.
Further, T (the total variation subspace) in this formula is obtained as follows: the high-dimensional sufficient statistics of the UBM model are computed and then iteratively updated with the EM (Expectation Maximization) algorithm until a converged T space is generated. Substituting the T space into the formula s_2 = m + T w_2, and since s_2, m and T are known, w_2, i.e. the registration i-vector, can be obtained as w_2 = (s_2 - m)/T.
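As an illustration, a minimal numerical sketch of this projection follows. Because T is a tall (DG x M) matrix, the division w_2 = (s_2 - m)/T is read here as a least-squares solve, which is an assumption about the intended computation; the helper name project_to_t_space is hypothetical.

```python
# Sketch: solve s = m + T*w for w in the least-squares sense (assumed reading of
# the patent's "(s - m)/T" notation).
import numpy as np

def project_to_t_space(s, m, T):
    """s: (DG,) mean supervector, m: (DG,) UBM supervector, T: (DG, M) total
    variability matrix. Returns the (M,) i-vector w minimising ||s - m - T w||."""
    w, *_ = np.linalg.lstsq(T, s - m, rcond=None)
    return w

# One possible reading of the two-stage registration:
# w1 = project_to_t_space(s1, m, T)                  # first i-vector
# w2 = project_to_t_space(m + T @ w1, m, T)          # registration i-vector
```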
In the i-vector extraction method provided by this embodiment, the training speech features are projected onto the total variation subspace to obtain the first i-vector, and the first i-vector is then projected onto the total variation subspace a second time to obtain the registration i-vector. Because the training speech feature data are projected twice, more noise features are removed and the dimensionality is reduced, which improves the purity of the extracted speaker speech features; at the same time, the smaller computation space after dimensionality reduction improves the efficiency of speech recognition.
In an embodiment, as shown in fig. 3, the step S10, namely extracting the training speech feature corresponding to the training speech data, specifically includes the following steps:
s11: and preprocessing the training voice data to obtain preprocessed voice data.
In a specific embodiment, step S11, preprocessing the training speech data to obtain the preprocessed speech data, includes the following steps:
S111: pre-emphasis is performed on the training voice data. The pre-emphasis formula is s'_n = s_n - a * s_(n-1), where s_n is the signal amplitude in the time domain, s_(n-1) is the signal amplitude at the moment immediately before s_n, s'_n is the signal amplitude in the time domain after pre-emphasis, and a is the pre-emphasis coefficient, with 0.9 < a < 1.0.
The pre-emphasis is a signal processing method for compensating the high-frequency component of the input signal at the transmitting end. As the signal rate increases, the signal is greatly damaged during transmission, and the damaged signal needs to be compensated for in order to obtain a better signal waveform at the receiving end. The idea of the pre-emphasis technology is to enhance the high-frequency component of the signal at the transmitting end of the transmission line to compensate the excessive attenuation of the high-frequency component in the transmission process, so that the receiving end can obtain a better signal waveform. The pre-emphasis has no influence on noise, so that the output signal-to-noise ratio can be effectively improved.
In this embodiment, the training speech data is pre-emphasized with the formula s'_n = s_n - a * s_(n-1), where s_n is the signal amplitude in the time domain, i.e. the amplitude of the speech represented by the speech data in the time domain, s_(n-1) is the signal amplitude at the moment immediately before s_n, s'_n is the signal amplitude in the time domain after pre-emphasis, and a is the pre-emphasis coefficient with 0.9 < a < 1.0; a value of a = 0.97 works well here. Pre-emphasis eliminates the interference caused by the vocal cords, lips and so on during vocalization, effectively compensates the suppressed high-frequency part of the training voice data, highlights the high-frequency formants, and enhances the signal amplitude of the training voice data, all of which helps with extracting the training speech features.
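A minimal numpy sketch of this pre-emphasis step, assuming a = 0.97 as suggested above (the function name is illustrative only):

```python
# Sketch of pre-emphasis: s'_n = s_n - a * s_(n-1)
import numpy as np

def pre_emphasis(signal, a=0.97):
    """signal: 1-D array of time-domain samples. Returns the pre-emphasised signal."""
    emphasized = np.empty_like(signal, dtype=float)
    emphasized[0] = signal[0]                      # first sample has no predecessor
    emphasized[1:] = signal[1:] - a * signal[:-1]  # s'_n = s_n - a * s_(n-1)
    return emphasized
```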
S112: and performing frame division processing on the pre-emphasized training voice data.
Specifically, after the training speech data is pre-emphasized, framing should be performed. Framing is a speech processing technique that divides the whole speech signal into several segments, where each frame is in the range of 10-30 ms and roughly half the frame length is used as the frame shift. The frame shift is the overlapping region between two adjacent frames, which avoids excessive change between them. Framing the training voice data divides it into several segments of speech data, subdividing the training voice data and facilitating the extraction of the training speech features.
S113: windowing is performed on the framed training voice data to obtain the preprocessed voice data. With a Hamming window, the windowing formula is
s'_n = s_n * (0.54 - 0.46 * cos(2πn / (N - 1))), 0 ≤ n ≤ N - 1,
where N is the window length, n is the time index, s_n is the signal amplitude in the time domain, and s'_n is the signal amplitude in the time domain after windowing.
Specifically, after the training speech data is framed, discontinuities appear at the beginning and the end of each frame, so the more the data is framed, the larger its deviation from the original training speech data becomes. Windowing solves this problem: it makes the framed training speech data continuous and lets each frame exhibit the characteristics of a periodic function. Windowing means processing the framed training voice data with a window function, for which a Hamming window can be selected, giving the formula above, with N the Hamming window length. Windowing the framed training voice data yields the preprocessed voice data, makes the time-domain signal of each frame continuous, and facilitates extraction of the training speech features of the training voice data.
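A short sketch of the framing and windowing of steps S112-S113. The 25 ms frame length and half-frame shift are assumed values within the 10-30 ms range mentioned above, and the function name is illustrative.

```python
# Sketch: split the signal into overlapping frames and apply a Hamming window.
import numpy as np

def frame_and_window(signal, sample_rate, frame_ms=25, shift_ms=12.5):
    frame_len = int(sample_rate * frame_ms / 1000)
    frame_shift = int(sample_rate * shift_ms / 1000)
    if len(signal) < frame_len:                      # pad very short signals
        signal = np.pad(signal, (0, frame_len - len(signal)))
    num_frames = 1 + (len(signal) - frame_len) // frame_shift
    window = np.hamming(frame_len)                   # 0.54 - 0.46*cos(2*pi*n/(N-1))
    frames = np.empty((num_frames, frame_len))
    for i in range(num_frames):
        start = i * frame_shift
        frames[i] = signal[start:start + frame_len] * window
    return frames
```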
The preprocessing operations on the training speech data in steps S111 to S113 provide the basis for extracting the training speech features, so that the extracted training speech features can better represent the training speech data and a corresponding GMM-UBM model can be trained from them.
S12: and performing fast Fourier transform on the preprocessed voice data to obtain the frequency spectrum of the training voice data, and obtaining the power spectrum of the training voice data according to the frequency spectrum.
The Fast Fourier Transform (FFT) is the general name for efficient algorithms that compute the discrete Fourier transform on a computer. Using the FFT greatly reduces the number of multiplications a computer needs in order to compute the discrete Fourier transform; the more sampling points are transformed, the more significant the savings in computation become.
Specifically, a fast Fourier transform is applied to the preprocessed voice data to convert it from signal amplitudes in the time domain to signal amplitudes in the frequency domain (the frequency spectrum). The spectrum is calculated as
s(k) = Σ_{n=1}^{N} s(n) * e^(-2πikn/N), 1 ≤ k ≤ N,
where N is the frame size, s(k) is the signal amplitude in the frequency domain, s(n) is the signal amplitude in the time domain, n is the time index and i is the imaginary unit. Once the spectrum of the preprocessed voice data is obtained, its power spectrum can be obtained directly from the spectrum; hereinafter this is referred to as the power spectrum of the training voice data. The power spectrum of the training voice data is calculated as
P(k) = |s(k)|^2 / N,
where N is the frame size and s(k) is the signal amplitude in the frequency domain. Converting the preprocessed voice data from time-domain signal amplitudes to frequency-domain signal amplitudes and then deriving the power spectrum of the training voice data from them provides an important technical basis for extracting the training speech features from the power spectrum.
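A minimal sketch of this step, transforming each windowed frame to the frequency domain and computing the power spectrum; the FFT size of 512 is an assumed value.

```python
# Sketch: per-frame FFT followed by the power spectrum |S(k)|^2 / N.
import numpy as np

def power_spectrum(frames, n_fft=512):
    """frames: (num_frames, frame_len). Returns (num_frames, n_fft//2 + 1) power spectra."""
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)    # complex spectrum S(k)
    return (np.abs(spectrum) ** 2) / n_fft             # |S(k)|^2 / N
```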
S13: and processing the power spectrum of the training voice data by adopting a Mel scale filter bank to obtain the Mel power spectrum of the training voice data.
Processing the power spectrum of the training voice data with the Mel-scale filter bank amounts to performing Mel-frequency analysis on it, and Mel-frequency analysis is based on human auditory perception. Research has found that the human ear, much like a filter bank, focuses only on certain specific frequency components (human hearing is non-linear with respect to frequency), i.e. the ear receives sound signals only in limited frequency bands. These filters are not uniformly distributed along the frequency axis: in the low-frequency region there are many densely spaced filters, while in the high-frequency region the filters become fewer and sparsely distributed. In other words, the Mel-scale filter bank has high resolution in the low-frequency part, which matches the auditory characteristics of the human ear; this is the physical meaning of the Mel scale.
In this embodiment, the Mel-scale filter bank is used to process the power spectrum of the training voice data to obtain its Mel power spectrum. The Mel-scale filter bank divides the frequency-domain signal into bands so that each frequency band finally corresponds to a single value; if the number of filters is 22, 22 energy values of the Mel power spectrum of the training voice data are obtained. Performing Mel-frequency analysis on the power spectrum of the training voice data means that the resulting Mel power spectrum retains the frequency content closely related to the characteristics of the human ear, and this content reflects the characteristics of the training voice data well.
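A sketch of this step follows, building a bank of triangular Mel filters (22 here, matching the example above) and applying it to the power spectra. The 2595/700 Mel conversion constants and the 16 kHz sampling rate are assumptions not stated in the patent.

```python
# Sketch: triangular Mel filter bank applied to per-frame power spectra.
import numpy as np

def mel_filterbank(n_filters=22, n_fft=512, sample_rate=16000):
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):               # one triangular filter per band
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):
            fbank[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fbank[i - 1, k] = (right - k) / max(right - centre, 1)
    return fbank

def mel_power_spectrum(power_spec, fbank):
    """power_spec: (num_frames, n_fft//2+1). Returns (num_frames, n_filters) Mel energies."""
    return power_spec @ fbank.T
```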
S14: and performing cepstrum analysis on the Mel power spectrum to obtain MFCC characteristics of the training voice data.
A cepstrum is obtained by taking the logarithm of the Fourier-transform spectrum of a signal and then applying an inverse Fourier transform; because the ordinary Fourier spectrum is a complex spectrum, the result is also called the complex cepstrum.
Specifically, cepstrum analysis is performed on the Mel power spectrum, and the MFCC features of the training speech data are obtained from the result of the cepstrum. Through this cepstrum analysis, the features contained in the Mel power spectrum of the training speech data, whose dimensionality is too high to use directly, are converted into features that are easy to use (MFCC feature vectors for training or recognition). The MFCC features can serve as training speech feature coefficients that reflect the differences between voices and can be used to recognize and distinguish the training speech data.
In one embodiment, in step S14, performing cepstrum analysis on the mel-power spectrum to obtain MFCC features of the training speech data, includes the following steps:
s141: and taking the logarithm value of the Mel power spectrum to obtain the Mel power spectrum to be transformed.
Specifically, according to the definition of the cepstrum, the logarithm (log) of the Mel power spectrum is taken, which yields the Mel power spectrum m to be transformed.
S142: and performing discrete cosine transform on the Mel power spectrum to be transformed to obtain the MFCC characteristics of the training voice data.
Specifically, a Discrete Cosine Transform (DCT) is performed on the Mel power spectrum m to be transformed to obtain the MFCC features of the corresponding training speech data; generally, the 2nd to 13th coefficients are taken as the training speech features, since they reflect the differences between speech data. The discrete cosine transform of the Mel power spectrum m to be transformed is
c_j = Σ_{n=1}^{N} m_n * cos(π * j * (n - 0.5) / N),
where N is the frame length, m is the Mel power spectrum to be transformed, and j is the index of the transformed coefficient. Because the Mel filters overlap, the energy values obtained with the Mel-scale filters are correlated; the discrete cosine transform compresses and abstracts the Mel power spectrum m to be transformed into a lower dimension, yielding the indirect training speech features. Compared with the Fourier transform, the result of the discrete cosine transform has no imaginary part, which is a clear computational advantage.
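A minimal sketch of steps S141-S142, taking the logarithm of the Mel power spectrum and applying a DCT while keeping the 2nd-13th coefficients mentioned above; the small offset before the logarithm is an implementation assumption.

```python
# Sketch: log of the Mel power spectrum followed by a DCT, keeping 12 coefficients.
import numpy as np
from scipy.fftpack import dct

def mfcc_from_mel(mel_power, keep=slice(1, 13)):
    """mel_power: (num_frames, n_filters) Mel power spectrum."""
    log_mel = np.log(mel_power + 1e-10)         # small offset avoids log(0)
    cepstra = dct(log_mel, type=2, axis=1, norm="ortho")
    return cepstra[:, keep]                     # 2nd-13th coefficients per frame
```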
Steps S11 to S14 perform feature extraction on the training voice data; the resulting training speech features reflect the training voice data well, and a corresponding GMM-UBM model can be trained from them to obtain the registration i-vector, so that the registration i-vector obtained by training gives more accurate results during speech recognition.
It should be noted that the features extracted above are MFCC features, but the training speech features should not be limited to MFCC features alone; any speech features that effectively reflect the characteristics of the speech data may be taken as training speech features and used for model training. In this embodiment, the training voice data is preprocessed to obtain the corresponding preprocessed voice data. Preprocessing the training voice data allows its training speech features to be extracted better, so that the extracted training speech features represent the training voice data more faithfully when they are used for speech recognition.
In an embodiment, as shown in fig. 4, in step S20, training a total variation subspace corresponding to the preset UBM model based on the preset UBM model specifically includes the following steps:
and S21, acquiring high-dimensional sufficient statistics of a preset UBM model.
The UBM model is a high-order GMM model trained with a sufficient amount of speech from many speakers, balanced across channels and between male and female voices, so as to describe the speaker-independent feature distribution. The parameters of the UBM model can be adjusted according to the training speech features to represent the personal information of a specific speaker, while the features that the training speech features cannot cover are approximated by the similar feature distributions in the UBM model, which alleviates the performance problem caused by insufficient training speech.
A statistic is a function of the sample data. T(X) is a sufficient statistic for the parameter θ of an unknown distribution P if and only if T(X) provides all the information about θ, that is, no other statistic can provide additional information about θ. A statistic is in effect a compression of the data distribution; if no information is lost when the sample is processed into the statistic, the statistic is called a sufficient statistic. For example, for a Gaussian distribution the mean and the covariance matrix are two sufficient statistics, because once these two parameters are known the Gaussian distribution is uniquely determined.
Specifically, the process of obtaining the high-dimensional sufficient statistics of the preset UBM model is as follows. Determine a speaker sample X = {x1, x2, ..., xn}, where the sample follows the distribution F(X) corresponding to the preset UBM model, with parameter θ. The statistic of this set of samples is T, T = r(x1, x2, ..., xn). If T follows a distribution F(T) and the parameter θ of the distribution F(X) of the sample X can be recovered from F(T), i.e. all the information about θ contained in F(X) is contained in F(T), then T is a high-dimensional sufficient statistic of the preset UBM model.
In this step, the recognition server acquires the zero-order and first-order sufficient statistics of the preset UBM model as the technical basis for training the total variation subspace.
And S22, iterating the high-dimensional sufficient statistics by adopting a maximum expectation algorithm to obtain a corresponding overall change subspace.
The Expectation Maximization (EM) algorithm is an iterative algorithm used in statistics to find maximum-likelihood estimates of parameters in probabilistic models that depend on unobservable hidden variables. For example, suppose two parameters A and B are initialized and both values are unknown at the start, but knowing A yields information about B and, likewise, knowing B yields information about A. If A is given some initial value, an estimate of B can be obtained, and the value of A can then be re-estimated from the current value of B, and so on until convergence.
The EM algorithm proceeds as follows: 1. initialize the distribution parameters; 2. repeat the E step and the M step until convergence. E step: estimate the expected values of the unknown variables, given the current parameter estimate. M step: re-estimate the distribution parameters so as to maximize the likelihood of the data, given the expected estimates of the unknown variables. By alternating the E step and the M step, the model parameters are gradually improved, so that the likelihood of the parameters and the training samples increases steadily and finally terminates at a maximum point.
Specifically, the total variation subspace is obtained iteratively through the following steps:
Step one: according to the high-dimensional sufficient statistics, the mean vectors of the M Gaussian components (each of dimension D) are concatenated into a Gaussian mean supervector, i.e. an M x D-dimensional vector, which is used to form F(x), where F(x) is an MD-dimensional vector. At the same time, N is constructed from the zero-order sufficient statistics; N is an MD x MD diagonal matrix whose main diagonal elements are the posterior probabilities. Here, a posterior probability is a probability that is revised after information about the outcome has been obtained; for example, when an event has already happened, the probability that it was caused by a particular factor is a posterior probability.
Step two: initialize the T space by constructing an [MD, V]-dimensional matrix, where V is far smaller than MD and is the dimension of the first i-vector.
Step three: fix the T space and repeatedly iterate the following formula with the expectation maximization algorithm, using the zero-order and first-order sufficient statistics to estimate the hidden variable w. When the iteration reaches the specified number of passes (5-6 times), the T space is considered to have converged and is fixed:
w = (I + T' Σ^(-1) N T)^(-1) * T' Σ^(-1) F
In this formula, w is the hidden variable and I is the identity matrix; Σ is the MD x MD covariance matrix of the UBM model, whose diagonal elements are Σ1, ..., Σm; F is the first-order sufficient statistic among the high-dimensional sufficient statistics; N is the MD x MD diagonal matrix; and T' denotes the transpose of T.
In this embodiment, the EM algorithm iteration provides a simple and stable iterative way to compute the posterior density function and obtain the total variation subspace; with the total variation subspace, the high-dimensional sufficient statistics (supervectors) of the preset UBM model can be projected into a low dimension, which facilitates speech recognition on the dimension-reduced vectors.
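For illustration, a sketch of the per-utterance posterior computation that the E step above relies on; treating Σ and N as diagonal vectors, and the function name itself, are implementation assumptions.

```python
# Sketch: posterior of the hidden variable w given statistics N and F,
# w_mean = (I + T' Σ^-1 N T)^-1 T' Σ^-1 F.
import numpy as np

def posterior_w(T, Sigma_diag, N_diag, F):
    """T: (MD, V) matrix; Sigma_diag, N_diag: (MD,) diagonals; F: (MD,) statistics."""
    TtSinv = T.T / Sigma_diag                              # T' Σ^-1, shape (V, MD)
    precision = np.eye(T.shape[1]) + (TtSinv * N_diag) @ T # I + T' Σ^-1 N T
    cov = np.linalg.inv(precision)                         # posterior covariance
    mean = cov @ (TtSinv @ F)                              # posterior mean of w
    return mean, cov
```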
In an embodiment, as shown in fig. 5, in step S30, projecting the training speech features onto the total variation subspace to obtain the first i-vector, the method specifically includes the following steps:
and S31, acquiring the GMM-UBM model by adopting a mean MAP self-adaption method based on the training voice characteristics and the preset UBM model.
The training speech features are speech features that distinguish the speaker from other people; Mel-Frequency Cepstral Coefficients (MFCC) features may be used as the training speech features.
Specifically, based on the preset UBM model, a GMM model of the speech features is adaptively trained with the maximum a posteriori (MAP) criterion, which updates the mean vector of each Gaussian component. This produces a GMM model with M components, i.e. the GMM-UBM model. The mean vectors of the Gaussian components of the GMM-UBM model (each of dimension D) are concatenated to form an M x D-dimensional Gaussian mean supervector.
S32, using the formula s_1 = m + T w_1 to project the training speech features onto the total variation subspace and obtain the first i-vector, where s_1 is the mean supervector of dimension C x F corresponding to the training speech features in the GMM-UBM model; m is a speaker-independent and channel-independent C x F-dimensional supervector; T is the total variation subspace, with dimension CF x N; and w_1 is the first i-vector, with dimension N.
In this formula, s_1 may be the Gaussian mean supervector obtained in step S31; m is the speaker-independent and channel-independent supervector formed by concatenating the mean supervectors of the UBM model; w_1 is a random vector that follows the standard normal distribution, namely the first i-vector, whose dimension is N.
Further, T (the total variation subspace) in this formula is obtained as follows: the high-dimensional sufficient statistics of the UBM model are computed and then iteratively updated with the EM (Expectation Maximization) algorithm until a converged T space is generated. Substituting the T space into the formula s_1 = m + T w_1, and since s_1, m and T are known, w_1, i.e. the first i-vector, can be obtained as w_1 = (s_1 - m)/T.
In steps S31 to S32, the formula s_1 = m + T w_1 is used to project the training speech features onto the total variation subspace and obtain the first i-vector. This gives the training speech features an initial dimensionality reduction that simplifies their complexity, and makes it convenient to further process the low-dimensional first i-vector or use it for speech recognition.
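A minimal sketch of the mean-only MAP adaptation described in step S31, producing the Gaussian mean supervector that plays the role of s_1 above; the relevance factor of 16 is a common choice and an assumption here, and the sketch assumes the UBM object from the earlier scikit-learn example.

```python
# Sketch: MAP adaptation of the UBM means to one speaker's MFCC frames,
# followed by concatenation into an M*D-dimensional mean supervector.
import numpy as np

def map_adapt_supervector(ubm, frames, relevance=16.0):
    """ubm: fitted sklearn GaussianMixture; frames: (num_frames, feat_dim)."""
    gamma = ubm.predict_proba(frames)                 # (num_frames, M) posteriors
    n_c = gamma.sum(axis=0)                           # zero-order statistics per Gaussian
    ex_c = gamma.T @ frames / np.maximum(n_c, 1e-10)[:, None]   # first-order means
    alpha = n_c / (n_c + relevance)                   # adaptation coefficients
    adapted_means = alpha[:, None] * ex_c + (1.0 - alpha)[:, None] * ubm.means_
    return adapted_means.reshape(-1)                  # M*D-dimensional mean supervector
```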
In an embodiment, as shown in fig. 6, a speaker recognition method is provided, which is described by taking the recognition server in fig. 1 as an example, and includes the following steps:
s50, test voice data are obtained, and the test voice data carry speaker identification.
The test voice data is voice data provided by the speaker whose claimed identity, given by the carried speaker identifier, is to be confirmed. The speaker identifier is a unique identifier representing the speaker's identity, including but not limited to a user name, an identification number, a mobile phone number, and the like.
Completing speech recognition requires two basic elements: the voice and the identity. Applied to this embodiment, the voice is the test voice data and the identity is the speaker identifier, so that the recognition server can further judge whether the claimed identity of the test voice data matches the real identity.
And S60, processing the test voice data by adopting an i-vector extraction method to obtain a corresponding test i-vector.
The test i-vector is a fixed-length vector representation (i.e., i-vector) for identity verification obtained by projecting the test voice features to a low-dimensional overall change subspace.
In this step, a test i-vector corresponding to the test voice data may be obtained, and the obtaining process is the same as obtaining a corresponding registration i-vector based on the training voice feature, which is not described herein again.
S70, inquiring a database based on the speaker identification, and acquiring a registered i-vector corresponding to the speaker identification.
The database is used for carrying out association recording on the registered i-vector corresponding to the speaker and the speaker identification.
The registration i-vector is a fixed-length vector representation (i.e., i-vector) recorded in the database of the recognition server to associate with the speaker ID as an identity.
In this step, the recognition server can search the corresponding registered i-vector in the database based on the speaker identification carried by the test voice data, so as to further compare the registered i-vector with the test i-vector.
S80, obtaining the similarity of the test i-vector and the registered i-vector by adopting a cosine similarity algorithm, and detecting whether the test i-vector and the registered i-vector correspond to the same speaker or not according to the similarity.
Specifically, the similarity between the acquired test i-vector and the registration i-vector can be determined by the following formula:
cos θ = (Σ_i A_i * B_i) / ( sqrt(Σ_i A_i^2) * sqrt(Σ_i B_i^2) )
where A_i and B_i are the components of vector A and vector B respectively. As the formula shows, the similarity ranges from -1 to 1: -1 indicates that the two vectors point in opposite directions, 1 indicates that they point in the same direction, and 0 indicates that the two vectors are independent. Values between -1 and 1 indicate the degree of similarity or dissimilarity between the two vectors, and the closer the similarity is to 1, the closer the two vectors are. The threshold for cos θ can be preset according to practical experience. If the similarity between the test i-vector and the registration i-vector is greater than the threshold, the two are considered similar, i.e. it can be concluded that the test voice data corresponds to the speaker identifier recorded in the database.
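A minimal sketch of this cosine-similarity decision; the threshold of 0.7 is purely illustrative, since the patent leaves it to practical experience.

```python
# Sketch: cosine similarity between the test and registered i-vectors plus a
# threshold decision.
import numpy as np

def is_same_speaker(test_ivec, registered_ivec, threshold=0.7):
    cos_sim = np.dot(test_ivec, registered_ivec) / (
        np.linalg.norm(test_ivec) * np.linalg.norm(registered_ivec)
    )
    return cos_sim >= threshold, cos_sim
```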
In the embodiment, the similarity between the tested i-vector and the registered i-vector can be judged by a cosine similarity calculation method, so that the method is simple and rapid, and is beneficial to rapidly confirming the identification result.
According to the i-vector extraction method provided by the embodiments of the invention, the training speech features are projected onto the total variation subspace to obtain the first i-vector, and the first i-vector is then projected onto the total variation subspace a second time to obtain the registration i-vector. Because the training speech feature data are projected twice, more noise features are removed and the dimensionality is reduced, which improves the purity of the extracted speaker speech features; at the same time, the smaller computation space after dimensionality reduction improves the efficiency of speech recognition and lowers its complexity.
Furthermore, the registration i-vector is obtained by extracting features from the training voice data, so it reflects the training voice data well and gives more accurate results during speech recognition. The EM algorithm iteration provides a simple and stable iterative way to compute the posterior density function and obtain the total variation subspace; with the total variation subspace, the high-dimensional sufficient statistics of the preset UBM model can be projected into a low dimension, which facilitates speech recognition on the dimension-reduced vectors.
The speaker recognition method provided by the embodiments of the invention processes the test voice data with the i-vector extraction method to obtain the corresponding test i-vector, which reduces the complexity of obtaining the test i-vector; at the same time, the similarity between the test i-vector and the registration i-vector is judged with the cosine similarity algorithm, which is simple and fast and helps to confirm the recognition result quickly.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
In an embodiment, an i-vector extraction device is provided, and the i-vector extraction device corresponds to the i-vector extraction method in the embodiment one to one. As shown in fig. 7, the i-vector extraction apparatus includes an acquiring speech data module 10, a training change space module 20, a projection change space module 30, and an acquiring i-vector module 40. The functional modules are explained in detail as follows:
and the voice data acquiring module 10 is used for acquiring training voice data of a speaker and extracting training voice features corresponding to the training voice data.
And a training change space module 20, configured to train a total change subspace corresponding to the preset UBM model based on the preset UBM model.
And the projection change space module 30 is configured to project the training speech features on the total change subspace to obtain a first i-vector.
And an i-vector obtaining module 40, configured to project the first i-vector onto the total change subspace, and obtain a registered i-vector corresponding to the speaker.
Preferably, the module for acquiring voice data 10 includes an acquire voice data unit 11, an acquire data power spectrum unit 12, an acquire mel power spectrum unit 13, and an acquire MFCC feature unit 14.
And the voice data acquiring unit 11 is used for preprocessing the training voice data to acquire preprocessed voice data.
And a data power spectrum acquiring unit 12, configured to perform fast fourier transform on the preprocessed voice data, acquire a frequency spectrum of the training voice data, and acquire a power spectrum of the training voice data according to the frequency spectrum.
And a mel power spectrum obtaining unit 13, configured to process the power spectrum of the training speech data by using a mel scale filter bank, and obtain a mel power spectrum of the training speech data.
And an obtaining MFCC feature unit 14, configured to perform cepstrum analysis on the mel-power spectrum to obtain MFCC features of the training speech data.
Training the change space module 20 includes an acquire high-dimensional statistics unit 21 and an acquire change subspace unit 22.
And the high-dimensional statistic obtaining unit 21 is used for obtaining high-dimensional sufficient statistics of the preset UBM model.
And a change subspace obtaining unit 22, configured to iterate the high-dimensional sufficient statistics by using a maximum expectation algorithm, and obtain a corresponding total change subspace.
The projection variation space module 30 comprises an acquisition GMM-UBM model unit 31 and an acquisition first vector unit 32.
And the acquiring GMM-UBM model unit 31 is used for acquiring the GMM-UBM model by adopting a mean MAP adaptive method based on the training speech characteristics and the preset UBM model.
The first vector obtaining unit 32 is configured to use the formula s_1 = m + T w_1 to obtain the first i-vector, where s_1 is the mean supervector of dimension C x F corresponding to the GMM-UBM model; m is a speaker-independent and channel-independent C x F-dimensional supervector; T is the total variation subspace, with dimension CF x N; and w_1 is the first i-vector, with dimension N.
Preferably, the obtain i-vector module 40 includes an obtain registration vector unit 41.
The registration vector obtaining unit 41 is configured to use the formula s_2 = m + T w_2 to project the first i-vector onto the total variation subspace and obtain the registration i-vector, where s_2 is the mean supervector of dimension D x G corresponding to the registration i-vector; m is a speaker-independent and channel-independent D x G-dimensional supervector; T is the total variation subspace, with dimension DG x M; and w_2 is the registration i-vector, with dimension M.
For specific limitations of the i-vector extraction apparatus, reference may be made to the above limitations of the i-vector extraction method, which is not described herein again. All or part of each module in the i-vector extraction device can be realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent of a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a speaker recognition device is provided, which corresponds to the speaker recognition method in the above embodiments one to one. As shown in FIG. 8, the speaker ID device includes a get test data module 50, a get test vector module 60, a get registration vector module 70, and a determine corresponding speaker module 80. The functional modules are explained in detail as follows:
the test data acquisition module 50 is used for acquiring test voice data, and the test voice data carries speaker identification;
an obtaining test vector module 60, configured to process the test speech data by using an i-vector extraction method, and obtain a corresponding test i-vector;
a register vector acquiring module 70, configured to query a database based on the speaker identifier, and acquire a register i-vector corresponding to the speaker identifier;
and the speaker corresponding module 80 is used for acquiring the similarity of the test i-vector and the registered i-vector by adopting a cosine similarity algorithm, and detecting whether the test i-vector and the registered i-vector correspond to the same speaker according to the similarity.
For the specific definition of the speaker recognition device, reference may be made to the above definition of the speaker recognition method, which is not described herein again. The modules in the speaker recognition device can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 9. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing data related to an i-vector extraction method or a speaker recognition method. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement an i-vector extraction method or a speaker recognition method.
In one embodiment, a computer device is provided, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program: acquiring training voice data of a speaker, and extracting training voice characteristics corresponding to the training voice data; training a total variation subspace corresponding to the preset UBM model based on the preset UBM model; projecting the training voice features on a total change subspace to obtain a first i-vector; and projecting the first i-vector on the overall change subspace to obtain a registration i-vector corresponding to the speaker.
In one embodiment, the training speech features corresponding to the training speech data are extracted, and the processor, when executing the computer program, implements the following steps: preprocessing training voice data to obtain preprocessed voice data; performing fast Fourier transform on the preprocessed voice data to obtain a frequency spectrum of the training voice data, and obtaining a power spectrum of the training voice data according to the frequency spectrum; processing the power spectrum of the training voice data by adopting a Mel scale filter bank to obtain a Mel power spectrum of the training voice data; and performing cepstrum analysis on the Mel power spectrum to obtain MFCC characteristics of the training voice data.
In one embodiment, the total variation subspace corresponding to the preset UBM model is trained based on the preset UBM model, and the processor executes the computer program to implement the following steps: acquiring high-dimensional sufficient statistics of a preset UBM model; and iterating the high-dimensional sufficient statistics by adopting a maximum expectation algorithm to obtain a corresponding total change subspace.
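To make these two steps concrete, the sketch below (an assumption-laden illustration, not the patent's implementation) first collects the zeroth- and first-order Baum-Welch sufficient statistics of an utterance under a diagonal-covariance UBM, and then runs a simplified EM estimation of the total variability matrix T. The helper names, subspace rank, iteration count, and random initialisation are illustrative choices.

```python
import numpy as np

def baum_welch_stats(features, ubm_means, ubm_covs, ubm_weights):
    """Zeroth/first-order sufficient statistics of one utterance under a diagonal-covariance UBM.
    features: (n_frames, F); ubm_means, ubm_covs: (C, F); ubm_weights: (C,).
    Returns N (C,) occupation counts and centred first-order stats (C, F)."""
    diff = features[:, None, :] - ubm_means[None, :, :]                    # (T, C, F)
    log_gauss = -0.5 * (np.sum(diff ** 2 / ubm_covs, axis=2)
                        + np.sum(np.log(2 * np.pi * ubm_covs), axis=1))
    log_post = np.log(ubm_weights) + log_gauss                             # (T, C)
    log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)
    post = np.exp(log_post)                                                # frame posteriors
    N = post.sum(axis=0)
    F_centred = post.T @ features - N[:, None] * ubm_means
    return N, F_centred

def train_total_variability(stats, ubm_covs, rank, n_iter=10, seed=0):
    """Simplified EM estimation of the total variability matrix T (CF x rank).
    stats: list of (N, F_centred) tuples, one per training utterance."""
    C, F = ubm_covs.shape
    rng = np.random.default_rng(seed)
    T = rng.standard_normal((C * F, rank)) * 0.01
    sigma_inv = (1.0 / ubm_covs).reshape(-1)                               # (CF,)
    for _ in range(n_iter):
        A = np.zeros((C, rank, rank))                                      # per-mixture accumulators
        Cmat = np.zeros((C * F, rank))
        for N, F_centred in stats:
            f = F_centred.reshape(-1)                                      # (CF,)
            n_rep = np.repeat(N, F)                                        # (CF,)
            TtSiN = T.T * (sigma_inv * n_rep)                              # T' Sigma^-1 N
            L = np.eye(rank) + TtSiN @ T                                   # posterior precision
            cov = np.linalg.inv(L)
            w = cov @ (T.T @ (sigma_inv * f))                              # posterior mean i-vector
            A += N[:, None, None] * (cov + np.outer(w, w))[None, :, :]
            Cmat += np.outer(f, w)
        for c in range(C):                                                 # M-step, one block per mixture
            rows = slice(c * F, (c + 1) * F)
            T[rows] = np.linalg.solve(A[c], Cmat[rows].T).T
    return T
```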
In one embodiment, the training speech features are projected onto the total variation subspace to obtain a first i-vector, and the processor, when executing the computer program, implements the following steps: acquiring a GMM-UBM model by adopting a mean MAP adaptation method based on the training speech features and the preset UBM model; and projecting the training speech features onto the total variation subspace using the formula s₁ = m + Tw₁ to obtain the first i-vector, wherein s₁ is the C×F-dimensional mean supervector corresponding to the training speech features in the GMM-UBM model, m is a speaker-independent and channel-independent C×F-dimensional supervector, T is the total variation subspace with dimensions CF×N, and w₁ is the first i-vector with dimension N.
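Reading the formula literally, obtaining w₁ from a given supervector amounts to solving s₁ = m + Tw₁ for w₁. The sketch below (a simplified illustration with hypothetical helper names, not the patent's implementation) shows mean-only MAP adaptation of the UBM, with an assumed relevance factor of 16, followed by a least-squares solution for w₁; classical i-vector extractors instead take the posterior mean computed from the Baum-Welch statistics, as in the training sketch above.

```python
import numpy as np

def map_adapt_means(features, ubm_means, ubm_covs, ubm_weights, relevance=16.0):
    """Mean-only MAP adaptation of the UBM to one speaker's features (GMM-UBM).
    Reuses baum_welch_stats from the statistics sketch above."""
    N, F_centred = baum_welch_stats(features, ubm_means, ubm_covs, ubm_weights)
    first_order = F_centred + N[:, None] * ubm_means           # undo centring
    alpha = (N / (N + relevance))[:, None]                      # data-dependent weight per mixture
    posterior_mean = first_order / np.maximum(N, 1e-8)[:, None]
    return alpha * posterior_mean + (1 - alpha) * ubm_means     # adapted means, shape (C, F)

def project_supervector(s, m, T):
    """Least-squares w solving s ≈ m + T @ w, one literal reading of s1 = m + T*w1.
    s and m are CF-dimensional supervectors; T has shape (CF, N)."""
    w, *_ = np.linalg.lstsq(T, s - m, rcond=None)
    return w
```

The adapted means would then be stacked (e.g. adapted.reshape(-1)) to form the supervector s₁ before projection.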
In one embodiment, the first i-vector is projected onto the total variation subspace to obtain a registered i-vector corresponding to the speaker, and the processor implements the following steps when executing the computer program:
Projecting the first i-vector onto the total variation subspace using the formula s₂ = m + Tw₂ to obtain the registered i-vector, wherein s₂ is the D×G-dimensional mean supervector corresponding to the registered i-vector, m is a speaker-independent and channel-independent D×G-dimensional supervector, T is the total variation subspace with dimensions DG×M, and w₂ is the registered i-vector with dimension M.
In one embodiment, a computer device is provided, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program: acquiring test voice data, wherein the test voice data carries a speaker identifier; acquiring a corresponding test i-vector based on the test voice data; inquiring a database based on the speaker identification to obtain a registered i-vector corresponding to the speaker identification; and acquiring the similarity of the test i-vector and the registered i-vector by adopting a cosine similarity algorithm, and detecting whether the test i-vector and the registered i-vector correspond to the same speaker or not according to the similarity.
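A hypothetical scoring helper for this verification step might look as follows; the function names are ours, and the 0.6 decision threshold is a placeholder that would in practice be tuned on development data, not a value given in the patent.

```python
import numpy as np

def cosine_score(test_ivec, enrolled_ivec):
    """Cosine similarity between the test i-vector and the registered i-vector."""
    denom = np.linalg.norm(test_ivec) * np.linalg.norm(enrolled_ivec) + 1e-12
    return float(np.dot(test_ivec, enrolled_ivec) / denom)

def is_same_speaker(test_ivec, enrolled_ivec, threshold=0.6):
    """Accept the claimed speaker identity when the score clears the threshold."""
    return cosine_score(test_ivec, enrolled_ivec) >= threshold
```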
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of: acquiring training voice data of a speaker, and extracting training voice characteristics corresponding to the training voice data; training a total variation subspace corresponding to the preset UBM model based on the preset UBM model; projecting the training voice features on a total change subspace to obtain a first i-vector; and projecting the first i-vector on the overall change subspace to obtain a registration i-vector corresponding to the speaker.
In one embodiment, the training speech features corresponding to the training speech data are extracted, and the computer program when executed by the processor performs the steps of: preprocessing training voice data to obtain preprocessed voice data; performing fast Fourier transform on the preprocessed voice data to obtain a frequency spectrum of the training voice data, and obtaining a power spectrum of the training voice data according to the frequency spectrum; processing the power spectrum of the training voice data by adopting a Mel scale filter bank to obtain a Mel power spectrum of the training voice data; and performing cepstrum analysis on the Mel power spectrum to obtain MFCC characteristics of the training voice data.
In an embodiment, the total variation subspace corresponding to the preset UBM model is trained based on the preset UBM model, and the computer program when executed by the processor implements the following steps: acquiring high-dimensional sufficient statistics of a preset UBM model; and iterating the high-dimensional sufficient statistics by adopting a maximum expectation algorithm to obtain a corresponding total change subspace.
In an embodiment, the training speech features are projected onto the total variation subspace to obtain a first i-vector, and the computer program, when executed by the processor, performs the following steps: acquiring a GMM-UBM model by adopting a mean MAP adaptation method based on the training speech features and the preset UBM model; and projecting the training speech features onto the total variation subspace using the formula s₁ = m + Tw₁ to obtain the first i-vector, wherein s₁ is the C×F-dimensional mean supervector corresponding to the training speech features in the GMM-UBM model, m is a speaker-independent and channel-independent C×F-dimensional supervector, T is the total variation subspace with dimensions CF×N, and w₁ is the first i-vector with dimension N.
In one embodiment, projecting the first i-vector onto the overall variation subspace to obtain a registered i-vector corresponding to the speaker, the computer program when executed by the processor performs the steps of:
Projecting the first i-vector onto the total variation subspace using the formula s₂ = m + Tw₂ to obtain the registered i-vector, wherein s₂ is the D×G-dimensional mean supervector corresponding to the registered i-vector, m is a speaker-independent and channel-independent D×G-dimensional supervector, T is the total variation subspace with dimensions DG×M, and w₂ is the registered i-vector with dimension M.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of: acquiring test voice data, wherein the test voice data carries a speaker identifier; acquiring a corresponding test i-vector based on the test voice data; inquiring a database based on the speaker identification to obtain a registered i-vector corresponding to the speaker identification; and acquiring the similarity of the test i-vector and the registered i-vector by adopting a cosine similarity algorithm, and detecting whether the test i-vector and the registered i-vector correspond to the same speaker or not according to the similarity.
It will be understood by those skilled in the art that all or part of the processes of the methods in the above embodiments may be implemented by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the division into the functional units and modules described above is illustrated by way of example. In practical applications, the above functions may be allocated to different functional units and modules as required; that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above.
The above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention and are intended to be included within the protection scope of the present invention.

Claims (10)

1. An i-vector extraction method is characterized by comprising the following steps:
acquiring training voice data of a speaker, and extracting training voice characteristics corresponding to the training voice data;
training a total variation subspace corresponding to the preset UBM model based on the preset UBM model;
projecting the training voice features on the overall change subspace to obtain a first i-vector;
and projecting the first i-vector on the total change subspace to obtain a registration i-vector corresponding to the speaker.
2. The i-vector extraction method of claim 1, wherein the extracting training speech features corresponding to the training speech data comprises:
preprocessing the training voice data to obtain preprocessed voice data;
performing fast Fourier transform on the preprocessed voice data to obtain a frequency spectrum of training voice data, and obtaining a power spectrum of the training voice data according to the frequency spectrum;
processing the power spectrum of the training voice data by adopting a Mel scale filter bank to obtain a Mel power spectrum of the training voice data;
and performing cepstrum analysis on the Mel power spectrum to obtain MFCC characteristics of the training voice data.
3. The i-vector extraction method of claim 1, wherein training out an overall variation subspace corresponding to a preset UBM model based on the preset UBM model comprises:
acquiring high-dimensional sufficient statistics of the preset UBM model;
and iterating the high-dimensional sufficient statistics by adopting a maximum expectation algorithm to obtain a corresponding total change subspace.
4. The method of i-vector extraction according to claim 1, wherein the projecting the training speech features onto the overall variation subspace to obtain a first i-vector comprises:
based on the training speech features and the preset UBM model, acquiring a GMM-UBM model by adopting a mean MAP adaptation method;
projecting the training speech features onto the total variation subspace using the formula s₁ = m + Tw₁ to obtain a first i-vector, wherein s₁ is the C×F-dimensional mean supervector corresponding to the training speech features in the GMM-UBM model; m is a speaker-independent and channel-independent C×F-dimensional supervector; T is the total variation subspace, with dimensions CF×N; and w₁ is the first i-vector, with dimension N.
5. The method of i-vector extraction according to claim 1, wherein the projecting the first i-vector onto the total variation subspace to obtain a registered i-vector corresponding to the speaker comprises:
projecting the first i-vector onto the total variation subspace using the formula s₂ = m + Tw₂ to obtain the registered i-vector, wherein s₂ is the D×G-dimensional mean supervector corresponding to the registered i-vector; m is a speaker-independent and channel-independent D×G-dimensional supervector; T is the total variation subspace, with dimensions DG×M; and w₂ is the registered i-vector, with dimension M.
6. A speaker recognition method, comprising:
acquiring test voice data, wherein the test voice data carries a speaker identifier;
the method for extracting the i-vector further comprises the steps of processing the test voice data by adopting the method for extracting the i-vector according to any one of claims 1 to 5 to obtain a corresponding test i-vector;
querying a database based on the speaker identifier to obtain a registered i-vector corresponding to the speaker identifier;
and acquiring the similarity of the test i-vector and the registration i-vector by adopting a cosine similarity algorithm, and detecting whether the test i-vector and the registration i-vector correspond to the same speaker or not according to the similarity.
7. An i-vector extraction device, comprising:
the voice data acquisition module is used for acquiring training voice data of a speaker and extracting training voice characteristics corresponding to the training voice data;
the training change space module is used for training out a total change subspace corresponding to a preset UBM model based on the preset UBM model;
the projection change space module is used for projecting the training voice features on the total change subspace to obtain a first i-vector;
and the i-vector obtaining module is used for projecting the first i-vector on the total change subspace and obtaining a registration i-vector corresponding to the speaker.
8. A speaker recognition apparatus, comprising:
the test data acquisition module is used for acquiring test voice data, and the test voice data carries a speaker identifier;
a test vector acquisition module, configured to process the test speech data by using the i-vector extraction method according to any one of claims 1 to 5, and acquire a corresponding test i-vector;
a registration vector acquisition module, configured to query a database based on the speaker identifier, and acquire a registration i-vector corresponding to the speaker identifier;
and the module for determining the corresponding speaker is used for acquiring the similarity of the test i-vector and the registered i-vector by adopting a cosine similarity algorithm, and detecting whether the test i-vector and the registered i-vector correspond to the same speaker according to the similarity.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the i-vector extraction method according to any one of claims 1 to 5 or the speaker recognition method according to claim 6 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, carries out the steps of the i-vector extraction method according to any one of claims 1 to 5 or the speaker recognition method according to claim 6.
CN201810574010.4A 2018-06-06 2018-06-06 Method for extracting i-vector, method, device, equipment and medium for speaker recognition Active CN109065022B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810574010.4A CN109065022B (en) 2018-06-06 2018-06-06 Method for extracting i-vector, method, device, equipment and medium for speaker recognition
PCT/CN2018/092589 WO2019232826A1 (en) 2018-06-06 2018-06-25 I-vector extraction method, speaker recognition method and apparatus, device, and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810574010.4A CN109065022B (en) 2018-06-06 2018-06-06 Method for extracting i-vector, method, device, equipment and medium for speaker recognition

Publications (2)

Publication Number Publication Date
CN109065022A CN109065022A (en) 2018-12-21
CN109065022B (en) 2022-08-09

Family

ID=64820489

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810574010.4A Active CN109065022B (en) 2018-06-06 2018-06-06 Method for extracting i-vector, method, device, equipment and medium for speaker recognition

Country Status (2)

Country Link
CN (1) CN109065022B (en)
WO (1) WO2019232826A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
SG11202010803VA (en) * 2019-10-31 2020-11-27 Alipay Hangzhou Inf Tech Co Ltd System and method for determining voice characteristics
CN110827834B (en) * 2019-11-11 2022-07-12 广州国音智能科技有限公司 Voiceprint registration method, system and computer readable storage medium
CN111161713A (en) * 2019-12-20 2020-05-15 北京皮尔布莱尼软件有限公司 Voice gender identification method and device and computing equipment
CN111508505B (en) * 2020-04-28 2023-11-03 讯飞智元信息科技有限公司 Speaker recognition method, device, equipment and storage medium
CN111700718B (en) * 2020-07-13 2023-06-27 京东科技信息技术有限公司 Method and device for recognizing holding gesture, artificial limb and readable storage medium
CN114420142A (en) * 2022-03-28 2022-04-29 北京沃丰时代数据科技有限公司 Voice conversion method, device, equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9858919B2 (en) * 2013-11-27 2018-01-02 International Business Machines Corporation Speaker adaptation of neural network acoustic models using I-vectors
CN104240706B (en) * 2014-09-12 2017-08-15 浙江大学 It is a kind of that the method for distinguishing speek person that similarity corrects score is matched based on GMM Token
CN105933323B (en) * 2016-06-01 2019-05-31 百度在线网络技术(北京)有限公司 Voiceprint registration, authentication method and device
DE102016115018B4 (en) * 2016-08-12 2018-10-11 Imra Europe S.A.S. Audio signature for voice command observation
CN107240397A (en) * 2017-08-14 2017-10-10 广东工业大学 A kind of smart lock and its audio recognition method and system based on Application on Voiceprint Recognition

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102737633A (en) * 2012-06-21 2012-10-17 北京华信恒达软件技术有限公司 Method and device for recognizing speaker based on tensor subspace analysis
CN104167208A (en) * 2014-08-08 2014-11-26 中国科学院深圳先进技术研究院 Speaker recognition method and device
CN105810199A (en) * 2014-12-30 2016-07-27 中国科学院深圳先进技术研究院 Identity verification method and device for speakers
WO2018053531A1 (en) * 2016-09-19 2018-03-22 Pindrop Security, Inc. Dimensionality reduction of baum-welch statistics for speaker recognition
CN106971713A (en) * 2017-01-18 2017-07-21 清华大学 Speaker's labeling method and system based on density peaks cluster and variation Bayes
CN107146601A (en) * 2017-04-07 2017-09-08 南京邮电大学 A kind of rear end i vector Enhancement Methods for Speaker Recognition System
CN107369440A (en) * 2017-08-02 2017-11-21 北京灵伴未来科技有限公司 The training method and device of a kind of Speaker Identification model for phrase sound
CN107633845A (en) * 2017-09-11 2018-01-26 清华大学 A kind of duscriminant local message distance keeps the method for identifying speaker of mapping

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on improved i-vector speaker recognition algorithm; Xing Yujuan et al.; Science Technology and Engineering; 2014-12-31; Vol. 14, No. 34; pp. 224-228 *

Also Published As

Publication number Publication date
WO2019232826A1 (en) 2019-12-12
CN109065022A (en) 2018-12-21

Similar Documents

Publication Publication Date Title
CN109065022B (en) Method for extracting i-vector, method, device, equipment and medium for speaker recognition
CN108922544B (en) Universal vector training method, voice clustering method, device, equipment and medium
CN109065028B (en) Speaker clustering method, speaker clustering device, computer equipment and storage medium
Michelsanti et al. Conditional generative adversarial networks for speech enhancement and noise-robust speaker verification
CN108922543B (en) Model base establishing method, voice recognition method, device, equipment and medium
Li et al. An overview of noise-robust automatic speech recognition
Krueger et al. Model-based feature enhancement for reverberant speech recognition
JP7008638B2 (en) voice recognition
WO2019232829A1 (en) Voiceprint recognition method and apparatus, computer device and storage medium
US8346551B2 (en) Method for adapting a codebook for speech recognition
US8566093B2 (en) Intersession variability compensation for automatic extraction of information from voice
WO2019227586A1 (en) Voice model training method, speaker recognition method, apparatus, device and medium
CN109360572B (en) Call separation method and device, computer equipment and storage medium
JPH0850499A (en) Signal identification method
CN111785288B (en) Voice enhancement method, device, equipment and storage medium
Karbasi et al. Twin-HMM-based non-intrusive speech intelligibility prediction
Seo et al. A maximum a posterior-based reconstruction approach to speech bandwidth expansion in noise
JP6748304B2 (en) Signal processing device using neural network, signal processing method using neural network, and signal processing program
KR102026226B1 (en) Method for extracting signal unit features using variational inference model based deep learning and system thereof
Dash et al. Speech intelligibility based enhancement system using modified deep neural network and adaptive multi-band spectral subtraction
CN110176243B (en) Speech enhancement method, model training method, device and computer equipment
Poorjam et al. A parametric approach for classification of distortions in pathological voices
Krueger et al. A model-based approach to joint compensation of noise and reverberation for speech recognition
Herrera-Camacho et al. Design and testing of a corpus for forensic speaker recognition using MFCC, GMM and MLE
Razani et al. A reduced complexity MFCC-based deep neural network approach for speech enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant