WO2019232826A1 - I-vector extraction method, speaker recognition method and apparatus, device, and medium - Google Patents


Info

Publication number
WO2019232826A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
training
speaker
registered
test
Prior art date
Application number
PCT/CN2018/092589
Other languages
French (fr)
Chinese (zh)
Inventor
涂宏 (Tu Hong)
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2019232826A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0638 Interactive procedures
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques specially adapted for comparison or discrimination

Definitions

  • The present application relates to the field of speech recognition, and in particular to an i-vector extraction method, a speaker recognition method and apparatus, a computer device, and a storage medium.
  • Speaker recognition, also called voiceprint recognition, is a biometric authentication technology that identifies a speaker's identity from speaker-specific information contained in a voice signal.
  • The introduction of i-vector (identity-vector) modeling methods based on vector analysis has significantly improved the performance of speaker recognition systems.
  • In vector analysis of a speaker's speech, the channel subspace usually contains speaker information as well.
  • The i-vector approach therefore uses a single low-dimensional total variability space to represent both the speaker subspace and the channel subspace; the speaker's speech is projected into this space by dimensionality reduction to obtain a fixed-length vector representation (the i-vector).
  • An i-vector extraction method includes:
  • acquiring training voice data of a speaker, and extracting the training speech features corresponding to the training voice data;
  • training, based on a preset UBM (universal background model), the total variability subspace corresponding to the preset UBM model;
  • projecting the training speech features onto the total variability subspace to obtain a first i-vector;
  • projecting the first i-vector onto the total variability subspace to obtain a registered i-vector corresponding to the speaker.
  • An i-vector extraction device includes:
  • a voice data acquisition module, configured to acquire training voice data of a speaker and extract the training speech features corresponding to the training voice data;
  • a training variation space module, configured to train, based on a preset UBM model, the total variability subspace corresponding to the preset UBM model;
  • a projection variation space module, configured to project the training speech features onto the total variability subspace to obtain a first i-vector;
  • an i-vector acquisition module, configured to project the first i-vector onto the total variability subspace to obtain a registered i-vector corresponding to the speaker.
  • A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor; the processor implements the steps of the i-vector extraction method when executing the computer-readable instructions.
  • A computer-readable storage medium stores computer-readable instructions; when the computer-readable instructions are executed by a processor, the steps of the i-vector extraction method are implemented.
  • This embodiment also provides a speaker recognition method, including:
  • obtaining test voice data, where the test voice data carries a speaker identifier;
  • processing the test voice data with the i-vector extraction method to obtain a corresponding test i-vector;
  • querying a database based on the speaker identifier to obtain a registered i-vector corresponding to the speaker identifier;
  • using a cosine similarity algorithm to obtain the similarity between the test i-vector and the registered i-vector, and detecting, according to the similarity, whether the test i-vector and the registered i-vector correspond to the same speaker.
  • A speaker recognition device includes:
  • a test data acquisition module, configured to acquire test voice data, where the test voice data carries a speaker identifier;
  • a test vector acquisition module, configured to process the test voice data with the i-vector extraction method to obtain a corresponding test i-vector;
  • a registration vector acquisition module, configured to query a database based on the speaker identifier to obtain a registered i-vector corresponding to the speaker identifier;
  • a speaker determination module, configured to obtain the similarity between the test i-vector and the registered i-vector using a cosine similarity algorithm, and to detect, according to the similarity, whether the test i-vector and the registered i-vector correspond to the same speaker.
  • A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor; when the processor executes the computer-readable instructions, the steps of the speaker recognition method are implemented: obtaining test voice data, where the test voice data carries a speaker identifier; obtaining a corresponding test i-vector based on the test voice data; querying a database based on the speaker identifier to obtain a registered i-vector corresponding to the speaker identifier; and using a cosine similarity algorithm to obtain the similarity between the test i-vector and the registered i-vector, and detecting, according to the similarity, whether the test i-vector and the registered i-vector correspond to the same speaker.
  • One or more non-volatile readable storage media store computer-readable instructions; when the computer-readable instructions are executed by one or more processors, the one or more processors perform the steps of the i-vector extraction method, including projecting the first i-vector onto the total variability subspace to obtain a registered i-vector corresponding to the speaker.
  • One or more non-volatile readable storage media store computer-readable instructions; when the computer-readable instructions are executed by one or more processors, the one or more processors perform the steps of the speaker recognition method: obtaining test voice data carrying a speaker identifier; obtaining the corresponding test i-vector; querying a database based on the speaker identifier to obtain the registered i-vector; and using a cosine similarity algorithm to obtain the similarity between the test i-vector and the registered i-vector, and detecting, according to the similarity, whether they correspond to the same speaker.
  • FIG. 1 is a schematic diagram of an application environment of an i-vector extraction method according to an embodiment of the present application;
  • FIG. 2 is a flowchart of an i-vector extraction method according to an embodiment of the present application;
  • FIG. 3 is a specific flowchart of an i-vector extraction method according to an embodiment of the present application;
  • FIG. 5 is another specific flowchart of an i-vector extraction method according to an embodiment of the present application;
  • FIG. 6 is a specific flowchart of a speaker recognition method according to an embodiment of the present application;
  • FIG. 7 is a schematic block diagram of an i-vector extraction device according to an embodiment of the present application;
  • FIG. 8 is a schematic block diagram of a speaker recognition device according to an embodiment of the present application;
  • FIG. 9 is a schematic diagram of a computer device according to an embodiment of the present application.
  • The i-vector extraction method provided in the embodiments of the present application can be applied in the application environment shown in FIG. 1, where a computer device communicates with a recognition server through a network.
  • The computer device includes, but is not limited to, personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices.
  • The recognition server can be implemented as an independent server or as a server cluster composed of multiple servers.
  • In one embodiment, an i-vector extraction method is provided. The method is described as applied to the recognition server in FIG. 1 and includes the following steps:
  • The speaker's training voice data is the original speech data provided by the speaker.
  • A training speech feature is a speech feature that distinguishes one speaker from others; in this embodiment, Mel-Frequency Cepstral Coefficients (hereinafter, MFCC features) are used as the training speech features.
  • The filters of the Mel scale filter bank are not uniformly distributed on the frequency axis: there are many densely distributed filters in the low frequency region, while in the high frequency region the filters become fewer and sparsely distributed.
  • The resolution of the Mel scale filter bank is thus high in the low frequency part, which is consistent with the auditory characteristics of the human ear; this is the physical meaning of the Mel scale.
  • The preset UBM is a Gaussian mixture model (GMM) that represents the speech feature distribution of a large number of non-specific speakers.
  • UBM training usually uses a large amount of speech data that is independent of specific speakers and channels; the UBM can therefore be considered a speaker-independent model that only fits the distribution of human speech features and does not represent any specific speaker.
  • The UBM model is preset in the recognition server because, during the voiceprint registration phase of the voiceprint recognition process, the voice data available for training a specific speaker is usually very limited.
  • If a GMM were trained directly on the speaker's voice characteristics, the specific speaker's voice data usually could not cover the feature space of the GMM.
  • Instead, the parameters of the UBM model can be adapted according to the training speech features to characterize the personality information of a specific speaker.
  • Features not covered by the training speech can be approximated by similar feature distributions in the UBM model; this approach alleviates the system performance problems caused by insufficient training speech.
  • The total variability subspace, also called the T space (total variability space), is a single, globally shared projection matrix set up to contain all possible speaker information in the voice data.
  • The speaker space and channel space are not separated in the T space.
  • The T space projects high-dimensional sufficient statistics (supervectors) onto low-dimensional i-vectors that can serve as speaker representations, thereby reducing dimensionality.
  • The training process of the T space is: based on the preset UBM model, use vector analysis and the EM (Expectation Maximization) algorithm to iterate until the T space converges.
  • The total variability subspace obtained from the preset UBM model does not distinguish between the speaker space and the channel space; it merges the information of both into one space, which reduces computational complexity and facilitates deriving the i-vector from the total variability subspace.
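In the notation used by the formulas later in this document (a standard i-vector formulation), the T space defines the generative model

$$s = m + Tw, \qquad w \sim \mathcal{N}(0, I)$$

where $s$ is the Gaussian mean supervector adapted to an utterance, $m$ is the speaker- and channel-independent UBM mean supervector, $T$ is the low-rank total variability matrix, and the latent vector $w$ is the i-vector.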
  • The first i-vector is the fixed-length vector representation obtained by projecting the training speech features onto the low-dimensional total variability subspace, that is, an i-vector.
  • The total variability subspace is obtained through step S20; it does not separate the speaker space and the channel space, and directly sets a globally shared T space (total variability space) to contain all possible information in the voice data.
  • The registered i-vector is the fixed-length vector representation obtained by projecting the first i-vector into the low-dimensional total variability subspace a second time; it is recorded in the database of the recognition server and associated with the speaker identifier as an identity reference.
  • Step S40, projecting the first i-vector onto the total variability subspace to obtain the registered i-vector corresponding to the speaker, specifically includes the following steps:
  • Use the formula $s_2 = m + Tw_2$ to project the first i-vector onto the total variability subspace, where $s_2$ is the Gaussian mean supervector obtained from the first i-vector in step S30; $m$ is a speaker- and channel-independent D*G-dimensional supervector formed by concatenating the mean vectors of the UBM model; and $w_2$ is a random vector obeying the standard normal distribution, namely the registered i-vector, whose dimension is M.
  • T (the total variability subspace) in the formula is obtained by computing the high-dimensional sufficient statistics of the UBM model and iteratively updating them with the EM algorithm until the T space converges.
  • The i-vector extraction method provided in this embodiment obtains a first i-vector by projecting the training speech features onto the total variability subspace, and then projects the first i-vector onto the total variability subspace a second time to obtain the registered i-vector. After two projections, that is, two rounds of dimensionality reduction, more noise features are removed from the training speech feature data, which improves the purity of the extracted speaker speech features, while the reduced dimensionality shrinks the computation space and improves the recognition efficiency of speech recognition.
  • The speaker recognition method provided by this embodiment uses this i-vector extraction method for recognition and thereby reduces the complexity of recognition.
  • Step S10, extracting the training speech features corresponding to the training voice data, specifically includes the following steps:
  • S11: Preprocess the training voice data to obtain preprocessed voice data.
  • Step S11, preprocessing the training voice data to obtain the preprocessed voice data, specifically includes the following steps:
  • S111: Perform pre-emphasis processing on the training voice data. Pre-emphasis is a signal processing method that compensates the high-frequency components of the input signal at the transmitting end.
  • The idea of pre-emphasis is to enhance the high-frequency components of the signal at the transmitting end of the transmission line, compensating for their excessive attenuation during transmission so that the receiving end obtains a better signal waveform. Pre-emphasis has no effect on noise, so it effectively improves the output signal-to-noise ratio.
  • Pre-emphasis is applied as $s'_n = s_n - a\,s_{n-1}$, where the coefficient $a$ ranges over $0.9 < a < 1.0$; in practice $a = 0.97$ works well and is used in this embodiment.
  • Pre-emphasis can eliminate interference caused by the vocal cords and lips during vocalization, effectively compensate the suppressed high-frequency part of the training voice data, highlight the high-frequency formants, and strengthen the signal amplitude of the training voice data, all of which help extract the training speech features.
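As a minimal sketch of the pre-emphasis step (the function name and NumPy implementation are ours, not the patent's; a = 0.97 as suggested above):

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, a: float = 0.97) -> np.ndarray:
    """Apply s'(n) = s(n) - a * s(n-1) to boost high-frequency components."""
    return np.append(signal[0], signal[1:] - a * signal[:-1])
```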
  • S112: Perform framing on the pre-emphasized training voice data.
  • Framing is the speech processing technique of cutting the whole voice signal into several segments.
  • The size of each frame is in the range of 10-30 ms, and the frame shift is about 1/2 of the frame length.
  • Frame shift refers to the overlap between two adjacent frames, which avoids excessive change between them.
  • Framing divides the training voice data into several segments of voice data, subdividing the training voice data to facilitate the extraction of training speech features.
  • S113: Perform windowing on the framed training speech data to obtain the preprocessed voice data.
  • Windowing refers to processing the training speech data with a window function; the Hamming window can be selected. With a Hamming window, the calculation formula for windowing is $s'_n = s_n \times \left(0.54 - 0.46\cos\frac{2\pi n}{N-1}\right)$, $0 \le n \le N-1$, where $N$ is the window length, $n$ is the time index, $s_n$ is the signal amplitude in the time domain, and $s'_n$ is the windowed signal amplitude in the time domain.
  • Windowing the training voice data to obtain the preprocessed voice data makes the framed time-domain signal continuous at the frame edges, which helps extract the training speech features of the training voice data.
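A sketch of framing plus Hamming windowing under the parameters suggested above (25 ms frames, half-frame shift; all names are illustrative, and the input is assumed to be at least one frame long):

```python
import numpy as np

def frame_and_window(signal: np.ndarray, sample_rate: int,
                     frame_ms: float = 25.0, shift_ms: float = 12.5) -> np.ndarray:
    """Cut the signal into overlapping frames and apply a Hamming window."""
    frame_len = int(sample_rate * frame_ms / 1000)
    frame_shift = int(sample_rate * shift_ms / 1000)
    n_frames = 1 + (len(signal) - frame_len) // frame_shift  # assumes len(signal) >= frame_len
    frames = np.stack([signal[i * frame_shift:i * frame_shift + frame_len]
                       for i in range(n_frames)])
    # np.hamming(N) implements 0.54 - 0.46 * cos(2*pi*n / (N-1))
    return frames * np.hamming(frame_len)
```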
  • The preprocessing operations on the training voice data in steps S111-S113 provide the basis for extracting the training speech features, making the extracted training speech features more representative of the training voice data for the models subsequently trained on them.
  • S12: Perform a fast Fourier transform on the preprocessed voice data to obtain the frequency spectrum of the training voice data, and obtain the power spectrum of the training voice data from the frequency spectrum.
  • The fast Fourier transform (FFT) converts the preprocessed voice data from signal amplitudes in the time domain to signal amplitudes in the frequency domain (the spectrum).
  • The formula for calculating the spectrum is $s(k) = \sum_{n=0}^{N-1} s(n)\,e^{-2\pi i k n / N}$, $0 \le k \le N-1$, where $N$ is the frame size, $s(k)$ is the signal amplitude in the frequency domain, $s(n)$ is the signal amplitude in the time domain, $n$ is the time index, and $i$ is the imaginary unit.
  • The power spectrum of the preprocessed voice data, hereinafter referred to as the power spectrum of the training voice data, is obtained directly from the spectrum; the formula is $P(k) = \frac{|s(k)|^2}{N}$, where $N$ is the frame size and $s(k)$ is the signal amplitude in the frequency domain.
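Continuing the sketch (the normalization follows the text's $|s(k)|^2/N$; the FFT size of 512 is illustrative):

```python
import numpy as np

def power_spectrum(frames: np.ndarray, n_fft: int = 512) -> np.ndarray:
    """Per-frame FFT magnitudes squared, scaled by the frame size N."""
    spectrum = np.fft.rfft(frames, n=n_fft)   # frequency-domain amplitudes s(k)
    return (np.abs(spectrum) ** 2) / n_fft    # power spectrum |s(k)|^2 / N
```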
  • S13: Process the power spectrum of the training speech data with the Mel scale filter bank to obtain the Mel power spectrum of the training speech data.
  • Processing the power spectrum of the training speech data with the Mel scale filter bank amounts to a Mel frequency analysis of the power spectrum, which is an analysis based on human auditory perception: human hearing is non-linear in frequency.
  • Accordingly, the filters are not uniformly distributed on the frequency axis: there are many densely distributed filters in the low frequency region, while in the high frequency region the filters are fewer and sparsely distributed. The resolution of the Mel scale filter bank is therefore high in the low frequency part, consistent with the hearing characteristics of the human ear; this is the physical meaning of the Mel scale.
  • Applying the Mel scale filter bank segments the frequency-domain signal so that each frequency band corresponds to one energy value; if the number of filters is 22, 22 energy values corresponding to the Mel power spectrum of the training speech data are obtained.
  • The Mel power spectrum obtained after this analysis retains the frequency portions closely related to the characteristics of the human ear, and these portions reflect the characteristics of the training speech data well.
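A sketch of a triangular Mel filter bank consistent with the description above (filters uniformly spaced on the Mel axis; the 22-filter count and 16 kHz rate in the usage comment are illustrative, not mandated by the patent):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters: int, n_fft: int, sample_rate: int) -> np.ndarray:
    """Triangular filters, dense at low and sparse at high frequencies."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

# e.g., 22 energy values per frame: mel_power = power @ mel_filterbank(22, 512, 16000).T
```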
  • The cepstrum is the inverse Fourier transform of the logarithm of a signal's Fourier spectrum; since the general Fourier spectrum is complex-valued, the cepstrum is also called the complex cepstrum.
  • S14: Perform cepstrum analysis on the Mel power spectrum and, based on the cepstrum result, obtain the MFCC features of the training speech data.
  • Through cepstrum analysis of the Mel power spectrum, the features it contains, which are originally too high-dimensional to use directly, are converted into easy-to-use features (MFCC feature vectors used for training or recognition).
  • The MFCC features serve as coefficients for distinguishing different voices, i.e., the training speech features: they reflect the differences between voices and can be used to identify and distinguish the training voice data.
  • Step S14, performing cepstrum analysis on the Mel power spectrum to obtain the MFCC features of the training speech data, includes the following steps:
  • S141: Take the logarithm (log) of the Mel power spectrum to obtain the Mel power spectrum m to be transformed.
  • S142: Perform a discrete cosine transform on the Mel power spectrum m to be transformed to obtain the MFCC features of the training speech data; in general, the 2nd to 13th coefficients are taken as the training speech features, since they reflect the differences between speech data.
  • The formula for the discrete cosine transform of the Mel power spectrum m to be transformed is $c_j = \sum_{k=1}^{N} m_k \cos\left(\frac{\pi j (k - 0.5)}{N}\right)$, where $N$ is the transform length (the number of filter-bank energy values), $m_k$ is the k-th component of the Mel power spectrum to be transformed, and $j$ indexes the resulting coefficients. Because the Mel filters overlap, the energy values obtained with the Mel scale filters are correlated; the discrete cosine transform decorrelates and compresses the Mel power spectrum m, and, compared with the Fourier transform, its result has no imaginary part, which gives the resulting training speech features an obvious computational advantage.
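A sketch of steps S141-S142 (log followed by a DCT-II over the filter-bank energies, keeping the 2nd-13th coefficients as the text suggests; names and the small flooring constant are ours):

```python
import numpy as np

def mfcc_from_mel_power(mel_power: np.ndarray, n_coeffs: int = 12) -> np.ndarray:
    """Cepstrum analysis: log of the Mel power spectrum, then a DCT."""
    log_mel = np.log(mel_power + 1e-10)        # Mel power spectrum m to be transformed
    n = log_mel.shape[-1]
    k = np.arange(1, n + 1)
    # DCT-II basis: cos(pi * j * (k - 0.5) / N)
    basis = np.cos(np.pi * np.outer(np.arange(n), k - 0.5) / n)
    cepstra = log_mel @ basis.T
    return cepstra[..., 1:1 + n_coeffs]        # 2nd..13th coefficients
```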
  • Steps S11-S14 perform feature extraction on the training voice data.
  • The training speech features finally obtained represent the training voice data well; they are used to train the corresponding GMM-UBM model and then to obtain the registered i-vector, so that the registered i-vector obtained in training is more accurate when performing speech recognition.
  • The features extracted above are MFCC features, but the training speech features should not be limited to MFCC features: any speech features that effectively reflect the characteristics of the voice data can be used as training speech features for recognition and model training.
  • In this embodiment, the training voice data is preprocessed to obtain the corresponding preprocessed voice data. Preprocessing the training voice data allows its training speech features to be extracted more effectively, so that the extracted training speech features better represent the training voice data when used for speech recognition.
  • Step S20, training the total variability subspace corresponding to the preset UBM model based on the preset UBM model, specifically includes the following steps:
  • The UBM model is a high-order GMM trained on a sufficient amount of speech from many speakers, balanced across channels and across male and female voices, which describes a speaker-independent feature distribution.
  • The UBM model can adapt its parameters according to the training speech features to characterize the personality information of a specific speaker; features not covered by the training speech features are approximated by similar feature distributions in the UBM model, which solves the performance problem caused by insufficient training speech.
  • T(x) is a sufficient statistic of the parameter θ of an unknown distribution P if and only if T(x) provides all the information about θ, that is, no other statistic can provide additional information about θ.
  • A statistic is in effect a compression of the data distribution: in processing samples into a statistic, the information contained in the samples may be lost. If no information is lost, the statistic is called a sufficient statistic. For example, for a Gaussian distribution, the expectation and the covariance matrix are its two sufficient statistics, because once these two parameters are known, the Gaussian distribution is uniquely determined.
  • The recognition server obtains the zero-order and first-order sufficient statistics of the preset UBM model, which serve as the basis for training the total variability subspace.
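A minimal NumPy sketch of these zero- and first-order (Baum-Welch) statistics, assuming a diagonal-covariance UBM (all names are ours, not the patent's):

```python
import numpy as np

def baum_welch_stats(x, means, covs, weights):
    """x: (T, D) frames; means/covs: (C, D); weights: (C,).

    Returns N (C,) zero-order and F (C, D) centered first-order statistics.
    """
    diff = x[:, None, :] - means[None, :, :]                  # (T, C, D)
    log_gauss = -0.5 * (np.sum(diff ** 2 / covs, axis=2)
                        + np.sum(np.log(2.0 * np.pi * covs), axis=1))
    log_post = np.log(weights) + log_gauss                    # (T, C)
    log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)
    gamma = np.exp(log_post)                                  # frame posteriors
    N = gamma.sum(axis=0)                                     # zero-order
    F = gamma.T @ x - N[:, None] * means                      # centered first-order
    return N, F
```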
  • The expectation-maximization (EM) algorithm is an iterative algorithm used in statistics to find maximum likelihood estimates of parameters in probability models that depend on unobservable latent variables. For example, suppose two parameters A and B are both initially unknown, but knowing A allows B to be estimated, and knowing B allows A to be estimated: one first gives A some initial value to obtain an estimate of B, then re-estimates A from the current value of B, and repeats until the estimates converge.
  • The EM algorithm flow is as follows: 1. Initialize the distribution parameters; 2. Repeat the E step and the M step until convergence.
  • E step: Estimate the expected values of the unknown (latent) variables, given the current parameter estimates.
  • M step: Re-estimate the distribution parameters to maximize the data likelihood, given the expected estimates of the unknown variables.
  • Step 1: Using the high-dimensional (first-order) sufficient statistics, concatenate the mean vectors of the M Gaussian components (each of dimension D) to form a Gaussian mean supervector, i.e., the MD-dimensional vector F(x); at the same time, use the zero-order sufficient statistics to construct N, an MD-dimensional diagonal matrix whose main diagonal consists of the concatenated posterior probabilities.
  • The posterior probability is the probability re-estimated after the outcome information is obtained; for example, when something has happened and one asks how likely it is that a particular factor caused it, that likelihood is a posterior probability.
  • Step 2: Initialize the T space as an [MD, V]-dimensional matrix, where V is much smaller than MD; V is the dimension of the first i-vector.
  • Step 3: With the T space fixed, iterate the following formulas with the expectation-maximization algorithm to estimate the zero-order and first-order sufficient statistics of the latent variable w; when the estimates stabilize, the T space can be considered to have converged and is then fixed:
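The iterated formulas are not reproduced in this text; in the standard total variability training derivation they take the following form (our reconstruction, per utterance s with statistics $N_s$, $F_s$ and UBM covariance $\Sigma$). E step:

$$L_s = I + T^{\top}\Sigma^{-1}N_s T,\qquad E[w_s] = L_s^{-1}T^{\top}\Sigma^{-1}F_s,\qquad E[w_s w_s^{\top}] = L_s^{-1} + E[w_s]\,E[w_s]^{\top}$$

M step, solved per Gaussian component c for the D x V block $T_c$:

$$T_c = \Big(\sum_s F_{s,c}\,E[w_s]^{\top}\Big)\Big(\sum_s N_{s,c}\,E[w_s w_s^{\top}]\Big)^{-1}$$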
  • In this embodiment, the EM iteration provides a simple and stable algorithm for computing the posterior density function and obtaining the total variability subspace. With the total variability subspace, the high-dimensional sufficient statistics (supervectors) of the preset UBM model can be projected to a low-dimensional representation, and the dimension-reduced vectors are then used for speech recognition.
  • Step S30, projecting the training speech features onto the total variability subspace to obtain the first i-vector, specifically includes the following steps:
  • S31: Based on the training speech features and the preset UBM model, obtain a GMM-UBM model using the mean-MAP adaptive method.
  • As described above, the training speech features are the MFCC features that distinguish the speaker from others in this embodiment.
  • Specifically, maximum a posteriori (MAP) adaptation is used to update the mean vector of each Gaussian component of the UBM toward the speaker's speech features, generating a GMM with M components, that is, the GMM-UBM model; the update is sketched below.
  • The mean vectors of the GMM-UBM model's Gaussian components (each of dimension D) are then concatenated to form an M*D-dimensional Gaussian mean supervector.
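The mean-MAP update itself is not written out here; the classical GMM-UBM recipe (our reconstruction, with relevance factor r and un-centered first-order statistic $F_c = \sum_t \gamma_t(c)\,x_t$) adapts each component mean as

$$\hat{\mu}_c = \alpha_c\,\frac{F_c}{N_c} + (1 - \alpha_c)\,\mu_c,\qquad \alpha_c = \frac{N_c}{N_c + r}$$

so components well covered by the training speech move toward the speaker's data, while the rest stay close to the UBM means.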
  • S32: Use the formula $s_1 = m + Tw_1$ to project the training speech features onto the total variability subspace and obtain the first i-vector, where $s_1$ is the M*D-dimensional Gaussian mean supervector corresponding to the GMM-UBM model and the training speech features (the supervector obtained in S31); $m$ is the speaker- and channel-independent M*D-dimensional supervector formed by concatenating the mean vectors of the UBM model; $T$ is the total variability subspace of dimension MD*N; and $w_1$ is a random vector obeying the standard normal distribution, namely the first i-vector, of dimension N.
  • T in the formula is obtained by computing the high-dimensional sufficient statistics of the UBM model and iteratively updating them with the EM algorithm until the T space converges.
  • Projecting the training speech features onto the total variability subspace to obtain the first i-vector reduces the dimensionality of the training speech features for the first time, which simplifies their complexity and makes the low-dimensional first i-vector convenient for further processing or for speech recognition, as sketched below.
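A sketch of the projection itself, i.e., the posterior mean of w in $s = m + Tw$ given the statistics above (the standard closed form, not code from the patent; all names are ours):

```python
import numpy as np

def extract_ivector(N, F, T, covs):
    """N: (C,) zero-order stats; F: (C, D) centered first-order stats;
    T: (C*D, V) total variability matrix; covs: (C, D) UBM diagonals."""
    C, D = F.shape
    V = T.shape[1]
    sigma_inv = 1.0 / covs.reshape(-1)          # diagonal of Sigma^{-1}, (C*D,)
    TtSi = T.T * sigma_inv                      # T^T Sigma^{-1}, shape (V, C*D)
    n_expanded = np.repeat(N, D)                # N broadcast over feature dims
    L = np.eye(V) + (TtSi * n_expanded) @ T     # I + T^T Sigma^{-1} N T
    return np.linalg.solve(L, TtSi @ F.reshape(-1))  # E[w], the i-vector
```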
  • In one embodiment, a speaker recognition method is provided. The method is described as applied to the recognition server in FIG. 1 and includes the following steps:
  • Obtain test voice data; the test voice data carries a speaker identifier.
  • The test voice data is the voice data of the speaker who claims the identity indicated by the carried speaker identifier.
  • The speaker identifier is a unique identifier of the speaker's claimed identity, including, but not limited to, a user name, an ID number, or a mobile phone number.
  • Here the speech is the test voice data and the identity is the speaker identifier, so that the recognition server can further determine whether the identity claimed by the test voice data is the true corresponding identity.
  • The test i-vector is the fixed-length vector representation (i.e., i-vector) obtained by projecting the test speech features onto the low-dimensional total variability subspace; it is used to verify the claimed identity.
  • The test i-vector corresponding to the test voice data is obtained by the same process as the registered i-vector obtained from the training speech features, which is not repeated here.
  • The database records, for each speaker, the registered i-vector and the corresponding speaker identifier.
  • The registered i-vector is the fixed-length vector representation (i.e., i-vector) recorded in the database of the recognition server and associated with the speaker identifier as an identity reference.
  • The recognition server can look up the corresponding registered i-vector in the database using the speaker identifier carried by the test voice data, so as to further compare the registered i-vector with the test i-vector.
  • The similarity between the test i-vector and the registered i-vector can be determined by the cosine similarity formula $\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$, where $A_i$ and $B_i$ are the components of vector A (the test i-vector) and vector B (the registered i-vector), respectively.
  • The similarity ranges from -1 to 1: -1 indicates that the two vectors point in opposite directions, 1 that they point in the same direction, and 0 that they are independent (orthogonal); values in between indicate intermediate degrees of similarity or dissimilarity. The closer the similarity is to 1, the closer the two vectors are.
  • The threshold on $\cos\theta$ can be set in advance according to practical experience. If the similarity exceeds the threshold, the test i-vector and the registered i-vector are considered similar, that is, the test voice data can be determined to correspond to the speaker identifier in the database.
  • Using the cosine similarity algorithm to determine the similarity between the test i-vector and the registered i-vector is simple and fast, which helps confirm the recognition result quickly.
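A sketch of the verification decision (the 0.6 threshold is purely illustrative; the patent only says the threshold is set from practical experience):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) = (A . B) / (|A| |B|), in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_speaker(test_ivec: np.ndarray, registered_ivec: np.ndarray,
                 threshold: float = 0.6) -> bool:
    """Accept the claimed identity when the similarity exceeds the threshold."""
    return cosine_similarity(test_ivec, registered_ivec) >= threshold
```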
  • The i-vector extraction method obtains the first i-vector by projecting the training speech features onto the total variability subspace, and then projects the first i-vector onto the total variability subspace a second time to obtain the registered i-vector. After two projections, i.e., two rounds of dimensionality reduction, more noise features are removed from the speech feature data, which improves the purity of the extracted speaker features, improves the recognition efficiency of speech recognition, and reduces the recognition complexity.
  • The feature extraction performed on the training voice data to obtain the registered i-vector reflects the training voice data well, making the registered i-vector obtained by training more accurate for speech recognition. The EM iteration provides a simple and stable algorithm for computing the posterior density function and obtaining the total variability subspace, which projects the high-dimensional sufficient statistics of the preset UBM model to a low-dimensional representation, so that the dimension-reduced vectors can be used for speech recognition.
  • The speaker recognition method provided in the embodiments of the present application processes the test voice data with the i-vector extraction method to obtain the corresponding test i-vector, which reduces the complexity of obtaining the test i-vector; at the same time, determining the similarity between the test i-vector and the registered i-vector with the cosine similarity algorithm is simple and fast, which helps confirm the recognition result quickly.
  • In one embodiment, an i-vector extraction device is provided, corresponding one-to-one to the i-vector extraction method in the above embodiment.
  • As shown in FIG. 7, the i-vector extraction device includes a voice data acquisition module 10, a training variation space module 20, a projection variation space module 30, and an i-vector acquisition module 40.
  • The detailed description of each functional module is as follows:
  • The voice data acquisition module 10 is configured to acquire training voice data of a speaker and extract the training speech features corresponding to the training voice data.
  • The training variation space module 20 is configured to train, based on a preset UBM model, the total variability subspace corresponding to the preset UBM model.
  • The projection variation space module 30 is configured to project the training speech features onto the total variability subspace to obtain a first i-vector.
  • The i-vector acquisition module 40 is configured to project the first i-vector onto the total variability subspace to obtain a registered i-vector corresponding to the speaker.
  • The voice data acquisition module 10 includes a preprocessing unit 11, a data power spectrum unit 12, a Mel power spectrum unit 13, and an MFCC feature unit 14.
  • The preprocessing unit 11 is configured to preprocess the training voice data and obtain preprocessed voice data.
  • The data power spectrum unit 12 is configured to perform a fast Fourier transform on the preprocessed voice data to obtain the frequency spectrum of the training voice data, and to obtain the power spectrum of the training voice data from the frequency spectrum.
  • The Mel power spectrum unit 13 is configured to process the power spectrum of the training voice data with the Mel scale filter bank and obtain the Mel power spectrum of the training voice data.
  • The MFCC feature unit 14 is configured to perform cepstrum analysis on the Mel power spectrum to obtain the MFCC features of the training voice data.
  • The training variation space module 20 includes a high-dimensional statistics unit 21 and a variability subspace unit 22.
  • The high-dimensional statistics unit 21 is configured to obtain the high-dimensional sufficient statistics of the preset UBM model.
  • The variability subspace unit 22 is configured to iterate the high-dimensional sufficient statistics with the expectation-maximization algorithm to obtain the corresponding total variability subspace.
  • The projection variation space module 30 includes a GMM-UBM model unit 31 and a first vector unit 32.
  • The GMM-UBM model unit 31 is configured to obtain a GMM-UBM model from the training speech features and the preset UBM model using the mean-MAP adaptive method.
  • The i-vector acquisition module 40 includes a registration vector unit 41.
  • Each module in the above i-vector extraction device can be implemented in whole or in part by software, hardware, or a combination thereof.
  • The above modules may be embedded in, or independent of, the processor of the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to the above modules.
  • In one embodiment, a speaker recognition device is provided, corresponding one-to-one to the speaker recognition method in the above embodiment.
  • As shown in FIG. 8, the speaker recognition device includes a test data acquisition module 50, a test vector acquisition module 60, a registration vector acquisition module 70, and a speaker determination module 80.
  • The detailed description of each functional module is as follows:
  • The test data acquisition module 50 is configured to acquire test voice data, where the test voice data carries a speaker identifier.
  • The test vector acquisition module 60 is configured to process the test voice data with the i-vector extraction method to obtain the corresponding test i-vector.
  • The registration vector acquisition module 70 is configured to query the database based on the speaker identifier to obtain the registered i-vector corresponding to the speaker identifier.
  • The speaker determination module 80 is configured to obtain the similarity between the test i-vector and the registered i-vector using the cosine similarity algorithm, and to detect, according to the similarity, whether the test i-vector and the registered i-vector correspond to the same speaker.
  • Each module in the above speaker recognition device can be implemented in whole or in part by software, hardware, or a combination thereof.
  • The above modules may be embedded in, or independent of, the processor of the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to the above modules.
  • In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 9.
  • the computer device includes a processor, a memory, a network interface, and a database connected through a system bus.
  • the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer-readable instructions, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer-readable instructions in a non-volatile storage medium.
  • The database of the computer device is used to store data related to the i-vector extraction method or the speaker recognition method.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • The computer-readable instructions are executed by the processor to implement the i-vector extraction method or the speaker recognition method.
  • In one embodiment, a computer device is provided, which includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When the processor executes the computer-readable instructions, the following steps are implemented: acquiring training voice data of a speaker and extracting the training speech features corresponding to the training voice data; training, based on a preset UBM model, the total variability subspace corresponding to the preset UBM model; projecting the training speech features onto the total variability subspace to obtain a first i-vector; and projecting the first i-vector onto the total variability subspace to obtain a registered i-vector corresponding to the speaker.
  • When extracting the training speech features corresponding to the training voice data, the processor implements the following steps when executing the computer-readable instructions: preprocessing the training voice data to obtain preprocessed voice data; performing a fast Fourier transform on the preprocessed voice data to obtain the frequency spectrum of the training voice data, and obtaining the power spectrum of the training voice data from the frequency spectrum; processing the power spectrum of the training voice data with the Mel scale filter bank to obtain the Mel power spectrum of the training voice data; and performing cepstrum analysis on the Mel power spectrum to obtain the MFCC features of the training voice data.
  • When training, based on the preset UBM model, the total variability subspace corresponding to the preset UBM model, the processor implements the following steps when executing the computer-readable instructions: obtaining the high-dimensional sufficient statistics of the preset UBM model; and iterating the high-dimensional sufficient statistics with the expectation-maximization algorithm to obtain the corresponding total variability subspace.
  • When projecting the training speech features onto the total variability subspace to obtain the first i-vector, the processor implements the following steps when executing the computer-readable instructions: obtaining a GMM-UBM model from the training speech features and the preset UBM model using the mean-MAP adaptive method; and projecting the training speech features onto the total variability subspace to obtain the first i-vector.
  • When projecting the first i-vector onto the total variability subspace to obtain the registered i-vector corresponding to the speaker, the formula $s_2 = m + Tw_2$ is used, where $s_2$ is the D*G-dimensional mean supervector corresponding to the registered i-vector; $m$ is the speaker- and channel-independent D*G-dimensional supervector; $T$ is the total variability subspace of dimension DG*M; and $w_2$ is the registered i-vector of dimension M.
  • In one embodiment, a computer device is provided, which includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When the processor executes the computer-readable instructions, the following steps are implemented: obtaining test voice data, where the test voice data carries a speaker identifier; obtaining the corresponding test i-vector based on the test voice data; querying the database based on the speaker identifier to obtain the registered i-vector corresponding to the speaker identifier; and using the cosine similarity algorithm to obtain the similarity between the test i-vector and the registered i-vector, and detecting, according to the similarity, whether the test i-vector and the registered i-vector correspond to the same speaker.
  • In one embodiment, a computer-readable storage medium is provided, on which computer-readable instructions are stored. When the computer-readable instructions are executed by a processor, the following steps are performed: obtaining training voice data of a speaker and extracting the training speech features corresponding to the training voice data; training, based on a preset UBM model, the total variability subspace corresponding to the preset UBM model; projecting the training speech features onto the total variability subspace to obtain a first i-vector; and projecting the first i-vector onto the total variability subspace to obtain a registered i-vector corresponding to the speaker.
  • When extracting the training speech features corresponding to the training voice data, the computer-readable instructions, when executed by the processor, implement the following steps: preprocessing the training voice data to obtain preprocessed voice data; performing a fast Fourier transform on the preprocessed voice data to obtain the frequency spectrum of the training voice data, and obtaining the power spectrum of the training voice data from the frequency spectrum; processing the power spectrum of the training voice data with the Mel scale filter bank to obtain the Mel power spectrum of the training voice data; and performing cepstrum analysis on the Mel power spectrum to obtain the MFCC features of the training voice data.
  • When training, based on the preset UBM model, the total variability subspace corresponding to the preset UBM model, the computer-readable instructions, when executed by the processor, implement the following steps: obtaining the high-dimensional sufficient statistics of the preset UBM model; and iterating the high-dimensional sufficient statistics with the expectation-maximization algorithm to obtain the corresponding total variability subspace.
  • When projecting the training speech features onto the total variability subspace to obtain the first i-vector, the computer-readable instructions, when executed by the processor, implement the corresponding steps described above.
  • When projecting the first i-vector onto the total variability subspace to obtain the registered i-vector corresponding to the speaker, the formula $s_2 = m + Tw_2$ is used, where $s_2$ is the D*G-dimensional mean supervector corresponding to the registered i-vector; $m$ is the speaker- and channel-independent D*G-dimensional supervector; $T$ is the total variability subspace of dimension DG*M; and $w_2$ is the registered i-vector of dimension M.
  • In one embodiment, a computer-readable storage medium is provided, on which computer-readable instructions are stored. When the computer-readable instructions are executed by a processor, the following steps are performed: obtaining test voice data, where the test voice data carries a speaker identifier; obtaining the corresponding test i-vector based on the test voice data; querying the database based on the speaker identifier to obtain the registered i-vector corresponding to the speaker identifier; and using the cosine similarity algorithm to obtain the similarity between the test i-vector and the registered i-vector, and detecting, according to the similarity, whether the test i-vector and the registered i-vector correspond to the same speaker.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

Disclosed are an i-vector extraction method, a speaker recognition method and apparatus, a device, and a medium. The i-vector extraction method comprises: obtaining training voice data of a speaker, and extracting a training voice feature corresponding to the training voice data; training, on the basis of a preset UBM, a total variability subspace corresponding to the preset UBM; projecting the training voice feature on the total variability subspace, and obtaining a first i-vector; and projecting the first i-vector on the total variability subspace, and obtaining a registration i-vector corresponding to the speaker. According to the method, training voice feature data is projected twice, i.e., the dimension is reduced, so that more noise features can be removed, thereby improving the purity of the extracted voice feature of a speaker; moreover, after dimension reduction, the computation space is reduced and the recognition efficiency of voice recognition is also improved.

Description

I-vector extraction method, speaker recognition method, device, equipment, and medium

This application is based on the Chinese invention application No. 201810574010.4, filed on June 6, 2018 and entitled "i-vector extraction method, speaker recognition method, device, equipment, and medium", and claims its priority.

Technical field

The present application relates to the field of speech recognition, and in particular to an i-vector extraction method, a speaker recognition method, a device, equipment, and a medium.

Background

Speaker recognition, also called voiceprint recognition, is a biometric authentication technology that uses speaker-specific information contained in a voice signal to identify the speaker. In recent years, the introduction of i-vector (identity-vector) modeling methods based on vector analysis has significantly improved the performance of speaker recognition systems. In vector analysis of a speaker's speech, the channel subspace usually contains speaker information. The i-vector approach represents the speaker subspace and the channel subspace with a single low-dimensional total variability space; projecting the speaker's speech into this space by dimensionality reduction yields a fixed-length vector representation (the i-vector). However, the i-vectors obtained by existing i-vector modeling still contain many interference factors, which increases the complexity of using them for speaker recognition.
Summary of the Invention

In view of the above technical problems, it is necessary to provide an i-vector extraction method, device, computer equipment, and storage medium that can remove more interference factors.

An i-vector extraction method includes:

obtaining training voice data of a speaker, and extracting the training speech features corresponding to the training voice data;

training, based on a preset UBM model, the total variability subspace corresponding to the preset UBM model;

projecting the training speech features onto the total variability subspace to obtain a first i-vector;

projecting the first i-vector onto the total variability subspace to obtain a registered i-vector corresponding to the speaker.
An i-vector extraction device includes:

a voice data acquisition module, configured to acquire training voice data of a speaker and extract the training speech features corresponding to the training voice data;

a training variation space module, configured to train, based on a preset UBM model, the total variability subspace corresponding to the preset UBM model;

a projection variation space module, configured to project the training speech features onto the total variability subspace to obtain a first i-vector;

an i-vector acquisition module, configured to project the first i-vector onto the total variability subspace to obtain a registered i-vector corresponding to the speaker.

A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor; the processor implements the steps of the i-vector extraction method when executing the computer-readable instructions.

A computer-readable storage medium stores computer-readable instructions; when the computer-readable instructions are executed by a processor, the steps of the i-vector extraction method are implemented.
This embodiment further provides a speaker recognition method, including:
obtaining test speech data, where the test speech data carries a speaker identifier;
obtaining, based on the test speech data, a corresponding test i-vector;
querying a database based on the speaker identifier to obtain a registered i-vector corresponding to the speaker identifier; and
computing the similarity between the test i-vector and the registered i-vector using a cosine similarity algorithm, and detecting, according to the similarity, whether the test i-vector and the registered i-vector correspond to the same speaker.
A speaker recognition apparatus includes:
a test data acquisition module, configured to obtain test speech data, where the test speech data carries a speaker identifier;
a test vector acquisition module, configured to process the test speech data using the i-vector extraction method to obtain a corresponding test i-vector;
a registered vector acquisition module, configured to query a database based on the speaker identifier to obtain a registered i-vector corresponding to the speaker identifier; and
a speaker determination module, configured to compute the similarity between the test i-vector and the registered i-vector using a cosine similarity algorithm, and to detect, according to the similarity, whether the test i-vector and the registered i-vector correspond to the same speaker.
A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer-readable instructions:
obtaining training speech data of a speaker, and extracting training speech features corresponding to the training speech data;
training, based on a preset UBM model, a total variability subspace corresponding to the preset UBM model;
projecting the training speech features onto the total variability subspace to obtain a first i-vector; and
projecting the first i-vector onto the total variability subspace to obtain a registered i-vector corresponding to the speaker.
A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer-readable instructions:
obtaining test speech data, where the test speech data carries a speaker identifier;
processing the test speech data using the i-vector extraction method to obtain a corresponding test i-vector;
querying a database based on the speaker identifier to obtain a registered i-vector corresponding to the speaker identifier; and
computing the similarity between the test i-vector and the registered i-vector using a cosine similarity algorithm, and detecting, according to the similarity, whether the test i-vector and the registered i-vector correspond to the same speaker.
One or more non-volatile readable storage media store computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
obtaining training speech data of a speaker, and extracting training speech features corresponding to the training speech data;
training, based on a preset UBM model, a total variability subspace corresponding to the preset UBM model;
projecting the training speech features onto the total variability subspace to obtain a first i-vector; and
projecting the first i-vector onto the total variability subspace to obtain a registered i-vector corresponding to the speaker.
One or more non-volatile readable storage media store computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
obtaining test speech data, where the test speech data carries a speaker identifier;
processing the test speech data using the i-vector extraction method to obtain a corresponding test i-vector;
querying a database based on the speaker identifier to obtain a registered i-vector corresponding to the speaker identifier; and
computing the similarity between the test i-vector and the registered i-vector using a cosine similarity algorithm, and detecting, according to the similarity, whether the test i-vector and the registered i-vector correspond to the same speaker.
The details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below; other features and advantages of the present application will become apparent from the specification, the drawings, and the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
To describe the technical solutions in the embodiments of the present application more clearly, the accompanying drawings needed for describing the embodiments are briefly introduced below. Apparently, the drawings in the following description show only some embodiments of the present application, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an application environment of an i-vector extraction method according to an embodiment of the present application;
FIG. 2 is a flowchart of an i-vector extraction method according to an embodiment of the present application;
FIG. 3 is another specific flowchart of an i-vector extraction method according to an embodiment of the present application;
FIG. 4 is another specific flowchart of an i-vector extraction method according to an embodiment of the present application;
FIG. 5 is another specific flowchart of an i-vector extraction method according to an embodiment of the present application;
FIG. 6 is a specific flowchart of a speaker recognition method according to an embodiment of the present application;
FIG. 7 is a schematic block diagram of an i-vector extraction apparatus according to an embodiment of the present application;
FIG. 8 is a schematic block diagram of a speaker recognition apparatus according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a computer device according to an embodiment of the present application.
DETAILED DESCRIPTION
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings in the embodiments. Apparently, the described embodiments are only some rather than all of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
The i-vector extraction method provided in the embodiments of the present application can be applied in the application environment shown in FIG. 1, in which a computer device communicates with a recognition server over a network. The computer device includes, but is not limited to, personal computers, notebook computers, smartphones, tablet computers, and portable wearable devices. The recognition server can be implemented as an independent server or as a server cluster composed of multiple servers.
In an embodiment, as shown in FIG. 2, an i-vector extraction method is provided. The method is described using its application to the recognition server in FIG. 1 as an example, and includes the following steps:
S10. Obtain training speech data of a speaker, and extract training speech features corresponding to the training speech data.
The speaker's training speech data is the original speech data provided by the speaker. The training speech features are speech features that distinguish the speaker from others; in this embodiment, Mel-Frequency Cepstral Coefficients (MFCC features) can be used as the training speech features.
Studies have found that the human ear behaves like a filter bank that focuses only on certain frequency components (human hearing is nonlinear in frequency); that is, the ear receives sound only in a limited set of frequency bands. These filters, however, are not uniformly distributed along the frequency axis: in the low-frequency region there are many filters, densely spaced, while in the high-frequency region the filters become fewer and sparsely spaced. A mel-scale filter bank has high resolution in the low-frequency region, which matches the auditory characteristics of the human ear; this is the physical meaning of the mel scale.
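For reference, a commonly used mapping between linear frequency and the mel scale (a standard formula, not one recited in this application) is:

$$m = 2595\,\log_{10}\!\left(1 + \frac{f}{700}\right), \qquad f = 700\left(10^{m/2595} - 1\right),$$

where f is the frequency in Hz and m is the corresponding mel frequency.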
S20. Train, based on a preset UBM model, a total variability subspace corresponding to the preset UBM model.
The preset UBM (Universal Background Model) is a Gaussian mixture model (GMM) that characterizes the speech feature distribution of a large number of non-specific speakers. A UBM is usually trained on a large amount of speech data unrelated to any specific speaker or channel, so it can generally be regarded as a speaker-independent model: it only fits the overall distribution of human speech features and does not represent any specific speaker. The UBM model is preset in the recognition server because, in the voiceprint registration stage of the voiceprint recognition process, the speech data available for training a specific speaker is usually very limited; if a GMM were used to model the speaker's speech features directly, that limited training data usually could not cover the feature space of the GMM. Therefore, the parameters of the UBM model can be adjusted according to the training speech features to characterize the specific speaker's individual information, and features not covered by the training speech can be approximated by similar feature distributions in the UBM model; this approach largely solves the system performance problems caused by insufficient training speech.
The total variability subspace, also called the T space (Total Variability Space), directly defines a single globally varying projection matrix containing all possible speaker information in the speech data; within the T space, the speaker space and the channel space are not separated. The T space projects high-dimensional sufficient statistics (supervectors) onto an i-vector that serves as a low-dimensional speaker representation, thereby achieving dimensionality reduction. The T space is trained as follows: based on the preset UBM model, factor analysis and the EM (Expectation Maximization) algorithm are used to compute a converged T space.
In this step, the total variability subspace obtained from the preset UBM model does not distinguish between the speaker space and the channel space; the speaker-related information and the channel-related information are converged into a single space, which reduces the computational complexity and facilitates the subsequent extraction of i-vectors based on the total variability subspace.
S30. Project the training speech features onto the total variability subspace to obtain a first i-vector.
The first i-vector is the fixed-length vector representation obtained by projecting the training speech features into the low-dimensional total variability subspace, i.e., an i-vector.
Specifically, this step uses the formula $s_1 = m + Tw_1$: projecting the high-dimensional training speech features onto the total variability subspace yields the low-dimensional first i-vector, which reduces the dimensionality of the projected training speech features and removes more noise, facilitating speaker recognition based on the first i-vector.
S40. Project the first i-vector onto the total variability subspace to obtain a registered i-vector corresponding to the speaker.
The total variability subspace here is the one obtained in step S20; it does not separate the speaker space and the channel space, and directly defines a single globally varying T space (Total Variability Space) containing all possible information in the speech data.
The registered i-vector is the fixed-length vector representation obtained by projecting the first i-vector into the low-dimensional total variability subspace; it is recorded in the database of the recognition server and associated with the speaker ID as an identity credential.
In a specific implementation, step S40, i.e., projecting the first i-vector onto the total variability subspace to obtain the registered i-vector, specifically includes the following step:
S41. Project the first i-vector onto the total variability subspace using the formula $s_2 = m + Tw_2$ to obtain the registered i-vector, where $s_2$ is the D*G-dimensional mean supervector corresponding to the registered i-vector; m is the speaker-independent, channel-independent D*G-dimensional supervector; T is the total variability subspace, with dimensions DG*M; and $w_2$ is the registered i-vector, with dimension M.
In this embodiment, $s_2$ can be the Gaussian mean supervector of the first i-vector obtained in step S30; m is the speaker-independent, channel-independent D*G-dimensional supervector concatenated from the mean supervectors of the UBM model; and $w_2$ is a random vector following the standard normal distribution, namely the registered i-vector, whose dimension is M.
Further, T (the total variability subspace) in the formula is obtained as follows: the high-dimensional sufficient statistics of the UBM model are trained, and the converged T space is generated by iteratively updating those statistics with the EM algorithm. Substituting the T space into $s_2 = m + Tw_2$, since $s_2$, m, and T are all known, $w_2$, i.e., the registered i-vector, can be obtained as $w_2 = (s_2 - m)/T$.
In the i-vector extraction method provided in this embodiment, the training speech features are projected onto the total variability subspace to obtain the first i-vector, and the first i-vector is then projected onto the total variability subspace a second time to obtain the registered i-vector. Because the training speech feature data undergoes two projections, i.e., two dimensionality reductions, more noise features can be removed, improving the purity of the extracted speaker speech features; at the same time, the reduced dimensionality shrinks the computation space and improves the efficiency of speech recognition. The speaker recognition method provided in this embodiment uses this i-vector extraction method for recognition, reducing recognition complexity.
In an embodiment, as shown in FIG. 3, step S10 of extracting the training speech features corresponding to the training speech data specifically includes the following steps:
S11: Preprocess the training speech data to obtain preprocessed speech data.
In a specific implementation, step S11 of preprocessing the training speech data to obtain the preprocessed speech data specifically includes the following steps:
S111: Perform pre-emphasis on the training speech data, where the pre-emphasis is computed as $s'_n = s_n - a \cdot s_{n-1}$, in which $s_n$ is the signal amplitude in the time domain, $s_{n-1}$ is the signal amplitude at the previous moment corresponding to $s_n$, $s'_n$ is the pre-emphasized signal amplitude in the time domain, and a is the pre-emphasis coefficient, with 0.9 < a < 1.0.
Pre-emphasis is a signal processing method that compensates the high-frequency components of the input signal at the transmitting end. As the signal rate increases, the signal is heavily degraded during transmission, and for the receiving end to obtain a good signal waveform, the degraded signal needs to be compensated. The idea of pre-emphasis is to boost the high-frequency components of the signal at the transmitting end of the transmission line to compensate for their excessive attenuation during transmission, so that the receiving end obtains a better signal waveform. Pre-emphasis has no effect on the noise, so it effectively improves the output signal-to-noise ratio.
In this embodiment, pre-emphasis is applied to the training speech data with the formula $s'_n = s_n - a \cdot s_{n-1}$, where $s_n$ is the amplitude of the speech expressed by the speech data in the time domain, $s_{n-1}$ is the signal amplitude at the previous moment relative to $s_n$, $s'_n$ is the pre-emphasized signal amplitude in the time domain, and a is the pre-emphasis coefficient with 0.9 < a < 1.0; taking a = 0.97 gives good pre-emphasis results here. This pre-emphasis eliminates the interference caused by the vocal cords and lips during phonation, effectively compensates the suppressed high-frequency part of the training speech data, highlights the high-frequency formants of the training speech data, and strengthens its signal amplitude, which helps extract the training speech features.
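As a minimal sketch of the pre-emphasis formula above (a NumPy illustration; the function name is hypothetical):

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, a: float = 0.97) -> np.ndarray:
    """Apply s'_n = s_n - a * s_{n-1}; the first sample is kept unchanged."""
    return np.append(signal[0], signal[1:] - a * signal[:-1])
```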
S112: Frame the pre-emphasized training speech data.
Specifically, after pre-emphasizing the training speech data, framing should also be performed. Framing is a speech processing technique that cuts the whole speech signal into several segments; the size of each frame is in the range of 10-30 ms, with a frame shift of about half the frame length. The frame shift is the overlapping region between two adjacent frames, which avoids excessive variation between them. Framing divides the training speech data into several segments of speech data, subdividing the training speech data and facilitating the extraction of the training speech features.
S113: Window the framed training speech data to obtain the preprocessed speech data, where the windowing is computed as

$$s'_n = \left(0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right)\right) \cdot s_n, \qquad 0 \le n \le N-1,$$

where N is the window length, n is time, $s_n$ is the signal amplitude in the time domain, and $s'_n$ is the windowed signal amplitude in the time domain.
Specifically, after the training speech data is framed, discontinuities appear at the beginning and end of each frame, so the more frames there are, the larger the error relative to the original training speech data. Windowing solves this problem: it makes the framed training speech data continuous and lets each frame exhibit the characteristics of a periodic function. Windowing specifically means processing the training speech data with a window function; the window function can be a Hamming window, in which case the windowing formula is

$$s'_n = \left(0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right)\right) \cdot s_n, \qquad 0 \le n \le N-1,$$

where N is the Hamming window length, n is time, $s_n$ is the signal amplitude in the time domain, and $s'_n$ is the windowed signal amplitude in the time domain. Windowing the training speech data to obtain the preprocessed speech data makes the time-domain signal of the framed training speech data continuous, which helps extract the training speech features of the training speech data.
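A brief sketch combining S112 and S113 (framing plus Hamming windowing); the 20 ms frame length and 10 ms shift are illustrative values within the 10-30 ms range and half-frame shift described above:

```python
import numpy as np

def frame_and_window(signal, sample_rate=16000, frame_ms=20, shift_ms=10):
    """Split the signal into overlapping frames and apply a Hamming window."""
    frame_len = int(sample_rate * frame_ms / 1000)
    frame_shift = int(sample_rate * shift_ms / 1000)
    assert len(signal) >= frame_len, "signal shorter than one frame"
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    window = np.hamming(frame_len)   # 0.54 - 0.46*cos(2*pi*n/(N-1))
    return np.stack([
        signal[i * frame_shift : i * frame_shift + frame_len] * window
        for i in range(n_frames)
    ])                               # shape: (n_frames, frame_len)
```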
The preprocessing operations on the training speech data in steps S111-S113 above provide the basis for extracting the training speech features of the training speech data, so that the extracted training speech features better represent the training speech data, and a corresponding GMM-UBM model can be trained from these training speech features.
S12: Perform a fast Fourier transform on the preprocessed speech data to obtain the frequency spectrum of the training speech data, and obtain the power spectrum of the training speech data from the frequency spectrum.
The fast Fourier transform (FFT) is the collective term for efficient, fast computational methods for computing the discrete Fourier transform with a computer. Such algorithms greatly reduce the number of multiplications a computer needs to compute the discrete Fourier transform; the more sampling points to be transformed, the more significant the savings in computation.
Specifically, a fast Fourier transform is performed on the preprocessed speech data to convert it from signal amplitudes in the time domain to signal amplitudes in the frequency domain (the spectrum). The spectrum is computed as

$$s(k) = \sum_{n=0}^{N-1} s(n)\, e^{-2\pi i k n / N}, \qquad 0 \le k \le N-1,$$

where N is the frame size, s(k) is the signal amplitude in the frequency domain, s(n) is the signal amplitude in the time domain, n is time, and i is the imaginary unit. After the spectrum of the preprocessed speech data is obtained, the power spectrum of the preprocessed speech data can be derived directly from it; hereinafter the power spectrum of the preprocessed speech data is referred to as the power spectrum of the training speech data. The power spectrum of the training speech data is computed as

$$P(k) = \frac{1}{N}\,\lvert s(k)\rvert^{2},$$

where N is the frame size and s(k) is the signal amplitude in the frequency domain. Converting the preprocessed speech data from time-domain signal amplitudes to frequency-domain signal amplitudes and then obtaining the power spectrum of the training speech data provides an important technical basis for extracting the training speech features from the power spectrum of the training speech data.
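A sketch of this step, assuming the framed, windowed frames from the previous sketch:

```python
import numpy as np

def power_spectrum(frames: np.ndarray, n_fft: int = 512) -> np.ndarray:
    """Per-frame FFT s(k), then the power spectrum |s(k)|^2 / N."""
    spectrum = np.fft.rfft(frames, n=n_fft)     # one-sided frequency-domain amplitudes
    return (np.abs(spectrum) ** 2) / n_fft      # shape: (n_frames, n_fft // 2 + 1)
```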
S13: Process the power spectrum of the training speech data with a mel-scale filter bank to obtain the mel power spectrum of the training speech data.
Processing the power spectrum of the training speech data with a mel-scale filter bank amounts to a mel-frequency analysis of the power spectrum, which is an analysis based on human auditory perception. Studies have found that the human ear behaves like a filter bank, focusing only on certain frequency components (human hearing is nonlinear in frequency); that is, the ear receives sound only in a limited set of frequency bands. These filters, however, are not uniformly distributed on the frequency axis: there are many densely distributed filters in the low-frequency region, while in the high-frequency region the filters become fewer and sparsely distributed. Understandably, a mel-scale filter bank has high resolution in the low-frequency part, matching the auditory characteristics of the human ear; this is the physical meaning of the mel scale.
In this embodiment, a mel-scale filter bank is used to process the power spectrum of the training speech data to obtain its mel power spectrum; the mel-scale filter bank partitions the frequency-domain signal so that each frequency band ultimately corresponds to one value. If the number of filters is 22, then 22 energy values corresponding to the mel power spectrum of the training speech data are obtained. Through mel-frequency analysis of the power spectrum of the training speech data, the resulting mel power spectrum retains the frequency portions closely related to the characteristics of the human ear, and these portions reflect the characteristics of the training speech data well.
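A compact sketch of building and applying such a filter bank (triangular filters spaced evenly on the mel scale; the 22-filter count follows the example above, and the helper names are illustrative):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=22, n_fft=512, sample_rate=16000):
    """Triangular filters, dense at low frequency and sparse at high frequency."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

# One energy value per filter and per frame:
# mel_power = power_spectrum(frames) @ mel_filterbank().T
```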
S14: Perform cepstral analysis on the mel power spectrum to obtain the MFCC features of the training speech data.
The cepstrum is the inverse Fourier transform of the logarithm of a signal's Fourier transform spectrum; since the Fourier spectrum is in general complex, the cepstrum is also called the complex cepstrum.
Specifically, cepstral analysis is performed on the mel power spectrum, and the MFCC features of the training speech data are analyzed and obtained from the cepstral result. Through this cepstral analysis, the features contained in the mel power spectrum of the training speech data, which are originally of too high a dimension to use directly, are converted into easy-to-use features (MFCC feature vectors used for training or recognition). The MFCC features can serve as training speech features, i.e., coefficients that distinguish different speech; they reflect the differences between speech signals and can be used to identify and distinguish the training speech data.
In a specific implementation, step S14 of performing cepstral analysis on the mel power spectrum to obtain the MFCC features of the training speech data includes the following steps:
S141: Take the logarithm of the mel power spectrum to obtain the mel power spectrum to be transformed.
Specifically, according to the definition of the cepstrum, the logarithm log of the mel power spectrum is taken to obtain the mel power spectrum m to be transformed.
S142: Perform a discrete cosine transform on the mel power spectrum to be transformed to obtain the MFCC features of the training speech data.
Specifically, a discrete cosine transform (DCT) is applied to the mel power spectrum m to be transformed to obtain the corresponding MFCC features of the training speech data; generally the 2nd to 13th coefficients are taken as the training speech features, which reflect the differences between speech data. The discrete cosine transform of the mel power spectrum m to be transformed is

$$C(i) = \sum_{j=1}^{N} m(j)\,\cos\!\left(\frac{\pi i\,(j - 0.5)}{N}\right),$$

where N is the frame length, m is the mel power spectrum to be transformed, and j is the independent variable of the mel power spectrum to be transformed. Because the mel filters overlap, the energy values obtained with the mel-scale filters are correlated; the discrete cosine transform can decorrelate, compress, and abstract the mel power spectrum m to be transformed and yield the indirect training speech features. Compared with the Fourier transform, the result of the discrete cosine transform has no imaginary part, which is a clear computational advantage.
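Putting S141 and S142 together (SciPy's DCT-II is used here for brevity; keeping the 2nd-13th coefficients follows the convention stated above):

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_from_mel_power(mel_power: np.ndarray) -> np.ndarray:
    """Cepstral analysis: log of the mel power spectrum, then a DCT."""
    log_mel = np.log(mel_power + 1e-10)                  # S141 (epsilon avoids log 0)
    cepstra = dct(log_mel, type=2, axis=-1, norm='ortho')
    return cepstra[:, 1:13]                              # S142: 2nd-13th coefficients
```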
Steps S11-S14 perform feature extraction on the training speech data; the finally obtained training speech features represent the training speech data well, and a corresponding GMM-UBM model can be trained from them to obtain the registered i-vector, so that the registered i-vector obtained from training yields more accurate results in speech recognition.
It should be noted that the features extracted above are MFCC features; the training speech features should not be limited here to MFCC features alone. Rather, any speech features obtained by the training technique that effectively reflect the characteristics of the speech data can serve as training speech features for recognition and model training. In this embodiment, the training speech data is preprocessed and the corresponding preprocessed speech data is obtained; preprocessing the training speech data allows the training speech features to be extracted better, so that the extracted features are more representative of the training speech data and can be used for speech recognition.
In an embodiment, as shown in FIG. 4, step S20 of training, based on the preset UBM model, the total variability subspace corresponding to the preset UBM model specifically includes the following steps:
S21. Obtain high-dimensional sufficient statistics of the preset UBM model.
The UBM model is a high-order GMM trained on sufficient speech from many speakers, balanced in channel and in male and female voices, to describe the speaker-independent feature distribution. The parameters of the UBM model can be adjusted according to the training speech features to characterize the individual information of a specific speaker, and features not covered by the training speech features are approximated by similar feature distributions in the UBM model, solving the performance problems caused by insufficient training speech.
A statistic is a function of the sample data. In statistics, T(x) is a sufficient statistic for the parameter θ of an unknown distribution P if and only if T(x) provides all the information about θ, that is, no other statistic can provide additional information about θ. A statistic is in effect a compression of the data distribution: in the process of reducing samples to a statistic, some information contained in the samples may be lost; if no information is lost when the samples are reduced to the statistic, the statistic is called a sufficient statistic. For example, for a Gaussian distribution, the mean and the covariance matrix are its two sufficient statistics, because if these two parameters are known, the Gaussian distribution is uniquely determined.
Specifically, the high-dimensional sufficient statistics of the preset UBM model are obtained as follows: take speaker samples X = {x1, x2, ..., xn} that follow the distribution F(x) corresponding to the preset UBM model, with parameter theta. The statistic of this set of samples is T, with T = r(x1, x2, ..., xn). If T follows a distribution F(T), and the parameter theta of the sample distribution F(x) can be derived from F(T), i.e., all information about theta contained in F(x) is contained in F(T), then T is a high-dimensional sufficient statistic of the preset UBM model.
In this step, the recognition server obtains the zero-order and first-order sufficient statistics of the preset UBM model as the technical basis for training the total variability subspace.
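For concreteness, the zero-order and first-order statistics used in standard i-vector training can be written in the usual Baum-Welch form (stated here as the conventional definitions, not as recited equations): for mixture component c, with $\gamma_t(c)$ the posterior probability of frame $x_t$ under component c and $m_c$ the UBM mean of component c,

$$N_c = \sum_t \gamma_t(c), \qquad F_c = \sum_t \gamma_t(c)\,(x_t - m_c).$$

The $N_c$ fill the main diagonal blocks of the matrix N, and the $F_c$ are concatenated into the supervector F used in the iteration below.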
S22. Iterate on the high-dimensional sufficient statistics using the expectation-maximization algorithm to obtain the corresponding total variability subspace.
The expectation-maximization (EM) algorithm is an iterative algorithm used in statistics to find maximum likelihood estimates of parameters in probabilistic models that depend on unobservable latent variables. For example, initialize two parameters A and B, both of whose values are unknown in the initial state; knowing A yields information about B, and likewise knowing B yields information about A. First assign A some initial value to obtain an estimate of B, then starting from the current value of B, re-estimate the value of A, and continue until convergence.
The EM algorithm flow is as follows: 1. Initialize the distribution parameters; 2. Repeat the E step and the M step until convergence. E step: estimate the expected values of the unknown parameters, given the current parameter estimates. M step: re-estimate the distribution parameters so as to maximize the likelihood of the data, given the expected estimates of the unknown variables. By alternating the E and M steps, the model parameters are gradually improved, so that the likelihood of the parameters and the training samples increases steadily, finally terminating at a maximum point.
Specifically, the total variability subspace is obtained iteratively through the following steps:
Step 1: According to the high-dimensional sufficient statistics, the mean vectors of the M Gaussian components (each of dimension D) are concatenated to form a Gaussian mean supervector, i.e., an M*D-dimensional vector, which constitutes F(x); F(x) is an MD-dimensional vector. At the same time, the zero-order sufficient statistics are used to construct N, an MD x MD diagonal matrix whose main diagonal elements are assembled from the posterior probabilities. The posterior probability is the probability revised after the result information is obtained; for example, when an event has already occurred, the probability that it was caused by a particular factor is a posterior probability.
Step 2: Initialize the T space by constructing an [MD, V]-dimensional matrix, where the dimension V is much smaller than MD; V is the dimension of the first i-vector.
Step 3: With the T space fixed, the following formula is iterated repeatedly with the expectation-maximization algorithm to estimate the zero-order and first-order sufficient statistics of the latent variable w. When the iteration reaches a specified number of times (5-6), the T space can be considered converged, fixing the T space:

$$E[w] = \left(I + T^{\top}\Sigma^{-1} N\, T\right)^{-1} T^{\top}\Sigma^{-1} F,$$

where w is the latent variable and I is the identity matrix; Σ is the MD x MD covariance matrix of the UBM model, whose diagonal elements are Σ_1, ..., Σ_m; F is the first-order statistic among the high-dimensional sufficient statistics; and N is the MD x MD diagonal matrix.
In this embodiment, the EM iteration provides a simple and stable iterative algorithm that computes the posterior density function to obtain the total variability subspace; obtaining the total variability subspace allows the high-dimensional sufficient statistics (supervectors) of the preset UBM model to be projected into a low dimension, and the dimension-reduced vectors facilitate further speech recognition.
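A sketch of the E-step computation above (a simplified illustration of the standard i-vector posterior under the stated shapes; the M-step update of T is omitted for brevity):

```python
import numpy as np

def e_step_posterior(T, Sigma_diag, N_diag, F):
    """E[w] = (I + T' Sigma^-1 N T)^-1 T' Sigma^-1 F.

    T: (MD, V); Sigma_diag, N_diag: (MD,) diagonals of the UBM covariance
    and the zero-order statistics matrix; F: (MD,) first-order statistics.
    """
    TtSinv = T.T / Sigma_diag                      # T' Sigma^-1, shape (V, MD)
    precision = np.eye(T.shape[1]) + (TtSinv * N_diag) @ T
    cov = np.linalg.inv(precision)                 # posterior covariance of w
    mean = cov @ (TtSinv @ F)                      # posterior mean E[w]
    return mean, cov
```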
In an embodiment, as shown in FIG. 5, step S30 of projecting the training speech features onto the total variability subspace to obtain the first i-vector specifically includes the following steps:
S31. Based on the training speech features and the preset UBM model, obtain a GMM-UBM model using mean MAP adaptation.
The training speech features are speech features that distinguish the speaker from others; in this embodiment, Mel-Frequency Cepstral Coefficients (MFCC features) can be used as the training speech features.
Specifically, based on the preset UBM model, maximum a posteriori (MAP) adaptation is applied to the GMM model of the training speech features to update the mean vector of each Gaussian component. A GMM model with M components is then generated, i.e., the GMM-UBM model. The mean vectors of the Gaussian components of the GMM-UBM model (each of dimension D) are used as concatenation units to form an M*D-dimensional Gaussian mean supervector.
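A minimal sketch of mean-only MAP adaptation (the relevance factor r = 16 and the update rule below are common conventions, stated as assumptions rather than recited values):

```python
import numpy as np

def map_adapt_means(ubm_means, N_c, F_c, r=16.0):
    """Mean-only MAP adaptation of UBM component means.

    ubm_means: (M, D); N_c: (M,) per-component occupancy counts;
    F_c: (M, D) per-component first-order sums of the enrollment frames.
    """
    alpha = (N_c / (N_c + r))[:, None]                  # adaptation weight per component
    ml_means = F_c / np.maximum(N_c, 1e-8)[:, None]     # data-driven mean estimate
    return alpha * ml_means + (1.0 - alpha) * ubm_means

# Gaussian mean supervector: concatenate the M adapted D-dimensional means
# supervector = map_adapt_means(ubm_means, N_c, F_c).reshape(-1)
```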
S32. Project the training speech features onto the total variability subspace using the formula $s_1 = m + Tw_1$ to obtain the first i-vector, where $s_1$ is the mean supervector in the C*F-dimensional GMM-UBM model corresponding to the training speech features; m is the speaker-independent, channel-independent C*F-dimensional supervector; T is the total variability subspace, with dimensions CF*N; and $w_1$ is the first i-vector, with dimension N.
In this embodiment, $s_1$ can be the Gaussian mean supervector obtained in step S31; m is the speaker-independent, channel-independent M*D-dimensional supervector concatenated from the mean supervectors of the UBM model; and $w_1$ is a random vector following the standard normal distribution, namely the first i-vector, whose dimension is N.
Further, T (the total variability subspace) in the formula is obtained as follows: the high-dimensional sufficient statistics of the UBM model are trained, and the converged T space is generated by iteratively updating those statistics with the EM algorithm. Substituting the T space into $s_1 = m + Tw_1$, since $s_1$, m, and T are all known, $w_1$, i.e., the first i-vector, can be obtained as $w_1 = (s_1 - m)/T$.
In steps S31 to S32, using the formula $s_1 = m + Tw_1$, the training speech features are projected onto the total variability subspace to obtain the first i-vector; this initial dimensionality reduction simplifies the complexity of the training speech features and facilitates further processing of the low-dimensional first i-vector or its use for speech recognition.
In an embodiment, as shown in FIG. 6, a speaker recognition method is provided. The method is described using its application to the recognition server in FIG. 1 as an example, and includes the following steps:
S50. Obtain test speech data, where the test speech data carries a speaker identifier.
The test speech data is voice data to be verified, claimed to come from the speaker corresponding to the carried speaker identifier. The speaker identifier is a unique identifier of the speaker's identity, including but not limited to a user name, an ID card number, or a mobile phone number.
Completing the speech recognition process requires two basic elements: speech and identity. In this embodiment, the speech is the test speech data and the identity is the speaker identifier, so that the recognition server can further determine whether the identity claimed by the test speech data is the true corresponding identity.
S60. Process the test speech data using the i-vector extraction method to obtain a corresponding test i-vector.
The test i-vector is the fixed-length vector representation (i.e., i-vector) used for identity verification, obtained by projecting the test speech features into the low-dimensional total variability subspace.
In this step, the test i-vector corresponding to the test speech data is obtained; the acquisition process is the same as obtaining the registered i-vector from the training speech features, and is not repeated here.
S70. Query the database based on the speaker identifier to obtain the registered i-vector corresponding to the speaker identifier.
The database records each speaker's registered i-vector in association with the corresponding speaker identifier.
The registered i-vector is the fixed-length vector representation (i.e., i-vector) recorded in the database of the recognition server and associated with the speaker ID as an identity credential.
In this step, the recognition server can look up the corresponding registered i-vector in the database based on the speaker identifier carried in the test speech data, so as to further compare the registered i-vector with the test i-vector.
S80. Compute the similarity between the test i-vector and the registered i-vector using a cosine similarity algorithm, and detect, according to the similarity, whether the test i-vector and the registered i-vector correspond to the same speaker.
Specifically, the similarity between the test i-vector and the registered i-vector can be determined by the following formula:

$$\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\;\sqrt{\sum_{i=1}^{n} B_i^{2}}},$$

where $A_i$ and $B_i$ are the components of vector A and vector B, respectively. As the formula shows, the similarity ranges from -1 to 1: -1 means the two vectors point in opposite directions, 1 means they point in the same direction, and 0 means the two vectors are independent. Values between -1 and 1 indicate intermediate degrees of similarity or dissimilarity; understandably, the closer the similarity is to 1, the more similar the two vectors are. In this embodiment, a threshold for cos θ can be preset according to practical experience. If the similarity between the test i-vector and the registered i-vector is greater than the threshold, the test i-vector and the registered i-vector are considered similar, and it can be determined that the test speech data corresponds to the speaker identifier in the database.
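A sketch of the verification decision (the 0.7 threshold is purely illustrative; the application only states that the threshold is set from practical experience):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_speaker(test_iv, registered_iv, threshold=0.7):
    """Accept the claimed identity if cos(theta) exceeds the preset threshold."""
    return cosine_similarity(test_iv, registered_iv) > threshold
```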
In this embodiment, the cosine similarity algorithm determines the similarity between the test i-vector and the registered i-vector simply and quickly, which helps confirm the recognition result rapidly.
In the i-vector extraction method provided in the embodiments of the present application, the first i-vector is obtained by projecting the training speech features onto the total variability subspace, and the registered i-vector is then obtained by projecting the first i-vector onto the total variability subspace a second time. Because the training speech feature data undergoes two projections, i.e., two dimensionality reductions, more noise features can be removed, improving the purity of the extracted speaker speech features; at the same time, the reduced dimensionality shrinks the computation space, improves the efficiency of speech recognition, and reduces recognition complexity.
Further, obtaining the registered i-vector through feature extraction from the training speech data reflects the training speech data well, so that the registered i-vector obtained from training yields more accurate speech recognition results. The EM iteration provides a simple and stable iterative algorithm that computes the posterior density function to obtain the total variability subspace; obtaining the total variability subspace projects the high-dimensional sufficient statistics of the preset UBM model into a low dimension, and the dimension-reduced vectors facilitate further speech recognition.
The speaker recognition method provided in the embodiments of the present application processes the test speech data with the i-vector extraction method to obtain the corresponding test i-vector, reducing the complexity of obtaining the test i-vector; at the same time, the cosine similarity algorithm determines the similarity between the test i-vector and the registered i-vector simply and quickly, helping confirm the recognition result rapidly.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
In an embodiment, an i-vector extraction apparatus is provided, which corresponds one-to-one to the i-vector extraction method in the foregoing embodiments. As shown in FIG. 7, the i-vector extraction apparatus includes a speech data acquisition module 10, a variability space training module 20, a variability space projection module 30, and an i-vector acquisition module 40. The functional modules are described in detail as follows:
The speech data acquisition module 10 is configured to obtain training speech data of a speaker and extract training speech features corresponding to the training speech data.
The variability space training module 20 is configured to train, based on a preset UBM model, a total variability subspace corresponding to the preset UBM model.
The variability space projection module 30 is configured to project the training speech features onto the total variability subspace to obtain a first i-vector.
The i-vector acquisition module 40 is configured to project the first i-vector onto the total variability subspace to obtain a registered i-vector corresponding to the speaker.
Preferably, the speech data acquisition module 10 includes a speech data acquisition unit 11, a data power spectrum acquisition unit 12, a Mel power spectrum acquisition unit 13, and an MFCC feature acquisition unit 14.
The speech data acquisition unit 11 is configured to preprocess the training speech data to obtain preprocessed speech data.
The data power spectrum acquisition unit 12 is configured to perform a fast Fourier transform on the preprocessed speech data to obtain the spectrum of the training speech data, and to obtain the power spectrum of the training speech data from the spectrum.
The Mel power spectrum acquisition unit 13 is configured to process the power spectrum of the training speech data with a Mel-scale filter bank to obtain the Mel power spectrum of the training speech data.
The MFCC feature acquisition unit 14 is configured to perform cepstral analysis on the Mel power spectrum to obtain the MFCC features of the training speech data.
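To make the pipeline of units 11 to 14 concrete, the following is a minimal Python sketch of the FFT, power spectrum, Mel filter bank, and cepstral-analysis steps; the frame length, hop size, filter count, and coefficient count are illustrative assumptions rather than values fixed by this application.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512, n_mels=26, n_ceps=13):
    # Preprocessing: pre-emphasis, framing, and windowing.
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)

    # Fast Fourier transform -> power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # Mel-scale filter bank applied to the power spectrum.
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    bins = np.floor((n_fft + 1) * imel(np.linspace(mel(0), mel(sr / 2), n_mels + 2)) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        fbank[i - 1, bins[i - 1]:bins[i]] = (np.arange(bins[i - 1], bins[i]) - bins[i - 1]) / max(bins[i] - bins[i - 1], 1)
        fbank[i - 1, bins[i]:bins[i + 1]] = (bins[i + 1] - np.arange(bins[i], bins[i + 1])) / max(bins[i + 1] - bins[i], 1)
    mel_power = power @ fbank.T

    # Cepstral analysis: log compression followed by a DCT.
    return dct(np.log(mel_power + 1e-10), norm='ortho')[:, :n_ceps]
```

On a 16 kHz mono signal this returns one 13-dimensional MFCC row per frame.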
The variability-space training module 20 includes a high-dimensional statistics acquisition unit 21 and a variability subspace acquisition unit 22.
The high-dimensional statistics acquisition unit 21 is configured to acquire the high-dimensional sufficient statistics of the preset UBM model.
The variability subspace acquisition unit 22 is configured to iterate on the high-dimensional sufficient statistics with the expectation-maximization (EM) algorithm to obtain the corresponding total variability subspace.
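Unit 22 can be illustrated by the usual EM-style update of the total variability matrix T from the zeroth- and first-order Baum-Welch statistics collected against the UBM. The sketch below assumes a diagonal-covariance UBM and that those statistics have already been accumulated per utterance; the function and variable names are illustrative, not taken from this application.

```python
import numpy as np

def train_total_variability(stats, sigma, n_dim, n_iter=5, seed=0):
    # stats: list of (N, F) pairs, one per training utterance, where N is
    # the (C,) vector of zeroth-order statistics and F is the (C*Fd,)
    # centered first-order supervector; sigma is the (C*Fd,) diagonal
    # covariance supervector of the UBM.
    C, CFd = stats[0][0].size, stats[0][1].size
    Fd = CFd // C
    T = np.random.default_rng(seed).standard_normal((CFd, n_dim)) * 0.01
    I = np.eye(n_dim)
    for _ in range(n_iter):
        A = np.zeros((C, n_dim, n_dim))          # per-Gaussian accumulators
        B = np.zeros((CFd, n_dim))
        for N, F in stats:
            Nsup = np.repeat(N, Fd)              # expand N to supervector size
            TtSi = T.T / sigma                   # T^T Sigma^-1
            L = I + (TtSi * Nsup) @ T            # posterior precision of w
            cov = np.linalg.inv(L)
            w = cov @ (TtSi @ F)                 # E[w], the utterance i-vector
            ww = cov + np.outer(w, w)            # E[w w^T]
            A += N[:, None, None] * ww
            B += np.outer(F, w)
        for c in range(C):                       # M-step, one block per Gaussian
            rows = slice(c * Fd, (c + 1) * Fd)
            T[rows] = B[rows] @ np.linalg.inv(A[c])
    return T
```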
The variability-space projection module 30 includes a GMM-UBM model acquisition unit 31 and a first vector acquisition unit 32.
The GMM-UBM model acquisition unit 31 is configured to obtain a GMM-UBM model from the training speech features and the preset UBM model by mean MAP adaptation.
The first vector acquisition unit 32 is configured to obtain the first i-vector with the formula s1 = m + Tw1, where s1 is the mean supervector corresponding to the C*F-dimensional GMM-UBM model, m is a speaker-independent and channel-independent C*F-dimensional supervector, T is the total variability subspace of dimension CF*N, and w1 is the first i-vector of dimension N.
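Two hedged sketches of what units 31 and 32 describe. First, mean MAP adaptation of the UBM means, assuming a diagonal-covariance UBM whose weights, means, and covariances are already trained; the relevance factor r = 16 is a common choice in the literature, not a value fixed by this application.

```python
import numpy as np

def map_adapt_means(X, weights, means, covs, r=16.0):
    # X: (T, F) feature frames; weights: (C,); means, covs: (C, F) diagonal.
    # Frame posteriors under the UBM, computed from log-likelihoods.
    ll = (-0.5 * (((X[:, None, :] - means) ** 2) / covs).sum(-1)
          - 0.5 * np.log(covs).sum(-1) + np.log(weights))
    post = np.exp(ll - ll.max(1, keepdims=True))
    post /= post.sum(1, keepdims=True)                  # (T, C)
    n = post.sum(0)                                     # zeroth-order statistics
    Ex = post.T @ X / np.maximum(n[:, None], 1e-10)     # first-order statistics
    alpha = (n / (n + r))[:, None]                      # adaptation coefficient
    return alpha * Ex + (1 - alpha) * means             # MAP-adapted means
```

Second, the projection itself: read literally, s1 = m + Tw1 makes w1 the coordinates of the supervector offset s1 - m in the subspace spanned by the columns of T, so a simple least-squares reading recovers w1; production systems usually take the posterior mean E[w], as in the training sketch above. The helper below is a hypothetical illustration of that reading.

```python
def project_supervector(s, m, T):
    # Least-squares solution of s = m + T w, i.e. the coordinates of the
    # supervector offset (s - m) in the subspace T.
    w, *_ = np.linalg.lstsq(T, s - m, rcond=None)
    return w
```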
Preferably, the i-vector acquisition module 40 includes a registered vector acquisition unit 41.
The registered vector acquisition unit 41 is configured to project the first i-vector onto the total variability subspace with the formula s2 = m + Tw2 to obtain the registered i-vector, where s2 is the D*G-dimensional mean supervector corresponding to the registered i-vector, m is a speaker-independent and channel-independent D*G-dimensional supervector, T is the total variability subspace of dimension DG*M, and w2 is the registered i-vector of dimension M.
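The application does not spell out how the supervector s2 is formed from the first i-vector, so the fragment below is only one hypothetical reading: it re-synthesizes a supervector from w1 and projects it through a second, lower-dimensional total variability matrix. T1, T2, and m2 are assumed names, and the two supervector spaces are assumed to have the same size.

```python
def second_projection(w1, T1, m2, T2):
    # Hypothetical second pass: re-embed the first i-vector as a supervector,
    # then project it onto T2 (dimension DG*M) with the helper defined above.
    s2 = m2 + T1 @ w1
    return project_supervector(s2, m2, T2)   # registered i-vector, dimension M
```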
For specific limitations of the i-vector extraction apparatus, refer to the limitations of the i-vector extraction method above, which are not repeated here. The modules of the i-vector extraction apparatus may be implemented in whole or in part by software, hardware, or a combination of the two. The modules may be embedded in, or independent of, a processor of a computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to the modules.
In one embodiment, a speaker recognition apparatus is provided, which corresponds one-to-one to the speaker recognition method in the above embodiment. As shown in FIG. 8, the speaker recognition apparatus includes a test data acquisition module 50, a test vector acquisition module 60, a registered vector acquisition module 70, and a corresponding-speaker determination module 80. The functional modules are described in detail as follows:
The test data acquisition module 50 is configured to acquire test speech data, the test speech data carrying a speaker identifier.
The test vector acquisition module 60 is configured to process the test speech data with the i-vector extraction method to obtain the corresponding test i-vector.
The registered vector acquisition module 70 is configured to query a database based on the speaker identifier to obtain the registered i-vector corresponding to the speaker identifier.
The corresponding-speaker determination module 80 is configured to obtain the similarity between the test i-vector and the registered i-vector with a cosine similarity algorithm, and to detect, based on the similarity, whether the test i-vector and the registered i-vector correspond to the same speaker.
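A minimal sketch of the scoring step in module 80; the acceptance threshold of 0.7 is an illustrative assumption, since the application does not fix a value.

```python
import numpy as np

def cosine_similarity(w_test, w_enrolled):
    return float(np.dot(w_test, w_enrolled) /
                 (np.linalg.norm(w_test) * np.linalg.norm(w_enrolled)))

def is_same_speaker(w_test, w_enrolled, threshold=0.7):
    # Accept the identity claim when the similarity clears the threshold.
    return cosine_similarity(w_test, w_enrolled) >= threshold
```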
For specific limitations of the speaker recognition apparatus, refer to the limitations of the speaker recognition method above, which are not repeated here. The modules of the speaker recognition apparatus may be implemented in whole or in part by software, hardware, or a combination of the two. The modules may be embedded in, or independent of, a processor of a computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 9. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions in the non-volatile storage medium. The database of the computer device stores data related to the i-vector extraction method or the speaker recognition method. The network interface of the computer device communicates with external terminals through a network connection. The computer-readable instructions, when executed by the processor, implement the i-vector extraction method or the speaker recognition method.
In one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When executing the computer-readable instructions, the processor implements the following steps: acquiring the training speech data of a speaker and extracting the training speech features corresponding to the training speech data; training, based on a preset UBM model, the total variability subspace corresponding to the preset UBM model; projecting the training speech features onto the total variability subspace to obtain a first i-vector; and projecting the first i-vector onto the total variability subspace to obtain the registered i-vector corresponding to the speaker.
In one embodiment, when extracting the training speech features corresponding to the training speech data, the processor implements the following steps when executing the computer-readable instructions: preprocessing the training speech data to obtain preprocessed speech data; performing a fast Fourier transform on the preprocessed speech data to obtain the spectrum of the training speech data, and obtaining the power spectrum of the training speech data from the spectrum; processing the power spectrum of the training speech data with a Mel-scale filter bank to obtain the Mel power spectrum of the training speech data; and performing cepstral analysis on the Mel power spectrum to obtain the MFCC features of the training speech data.
In one embodiment, when training, based on the preset UBM model, the total variability subspace corresponding to the preset UBM model, the processor implements the following steps when executing the computer-readable instructions: acquiring the high-dimensional sufficient statistics of the preset UBM model; and iterating on the high-dimensional sufficient statistics with the expectation-maximization algorithm to obtain the corresponding total variability subspace.
In one embodiment, when projecting the training speech features onto the total variability subspace to obtain the first i-vector, the processor implements the following steps when executing the computer-readable instructions: obtaining a GMM-UBM model from the training speech features and the preset UBM model by mean MAP adaptation; and projecting the training speech features onto the total variability subspace with the formula s1 = m + Tw1 to obtain the first i-vector, where s1 is the mean supervector corresponding to the training speech features in the C*F-dimensional GMM-UBM model, m is a speaker-independent and channel-independent C*F-dimensional supervector, T is the total variability subspace of dimension CF*N, and w1 is the first i-vector of dimension N.
In one embodiment, when projecting the first i-vector onto the total variability subspace to obtain the registered i-vector corresponding to the speaker, the processor implements the following steps when executing the computer-readable instructions: projecting the first i-vector onto the total variability subspace with the formula s2 = m + Tw2 to obtain the registered i-vector, where s2 is the D*G-dimensional mean supervector corresponding to the registered i-vector, m is a speaker-independent and channel-independent D*G-dimensional supervector, T is the total variability subspace of dimension DG*M, and w2 is the registered i-vector of dimension M.
In one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When executing the computer-readable instructions, the processor implements the following steps: acquiring test speech data, the test speech data carrying a speaker identifier; obtaining the corresponding test i-vector based on the test speech data; querying a database based on the speaker identifier to obtain the registered i-vector corresponding to the speaker identifier; and obtaining the similarity between the test i-vector and the registered i-vector with a cosine similarity algorithm, and detecting, based on the similarity, whether the test i-vector and the registered i-vector correspond to the same speaker.
In one embodiment, a computer-readable storage medium is provided, storing computer-readable instructions. When executed by a processor, the computer-readable instructions implement the following steps: acquiring the training speech data of a speaker and extracting the training speech features corresponding to the training speech data; training, based on a preset UBM model, the total variability subspace corresponding to the preset UBM model; projecting the training speech features onto the total variability subspace to obtain a first i-vector; and projecting the first i-vector onto the total variability subspace to obtain the registered i-vector corresponding to the speaker.
In one embodiment, when extracting the training speech features corresponding to the training speech data, the computer-readable instructions, when executed by the processor, implement the following steps: preprocessing the training speech data to obtain preprocessed speech data; performing a fast Fourier transform on the preprocessed speech data to obtain the spectrum of the training speech data, and obtaining the power spectrum of the training speech data from the spectrum; processing the power spectrum of the training speech data with a Mel-scale filter bank to obtain the Mel power spectrum of the training speech data; and performing cepstral analysis on the Mel power spectrum to obtain the MFCC features of the training speech data.
In one embodiment, when training, based on the preset UBM model, the total variability subspace corresponding to the preset UBM model, the computer-readable instructions, when executed by the processor, implement the following steps: acquiring the high-dimensional sufficient statistics of the preset UBM model; and iterating on the high-dimensional sufficient statistics with the expectation-maximization algorithm to obtain the corresponding total variability subspace.
In one embodiment, when projecting the training speech features onto the total variability subspace to obtain the first i-vector, the computer-readable instructions, when executed by the processor, implement the following steps: obtaining a GMM-UBM model from the training speech features and the preset UBM model by mean MAP adaptation; and projecting the training speech features onto the total variability subspace with the formula s1 = m + Tw1 to obtain the first i-vector, where s1 is the mean supervector corresponding to the training speech features in the C*F-dimensional GMM-UBM model, m is a speaker-independent and channel-independent C*F-dimensional supervector, T is the total variability subspace of dimension CF*N, and w1 is the first i-vector of dimension N.
In one embodiment, when projecting the first i-vector onto the total variability subspace to obtain the registered i-vector corresponding to the speaker, the computer-readable instructions, when executed by the processor, implement the following steps: projecting the first i-vector onto the total variability subspace with the formula s2 = m + Tw2 to obtain the registered i-vector, where s2 is the D*G-dimensional mean supervector corresponding to the registered i-vector, m is a speaker-independent and channel-independent D*G-dimensional supervector, T is the total variability subspace of dimension DG*M, and w2 is the registered i-vector of dimension M.
In one embodiment, a computer-readable storage medium is provided, storing computer-readable instructions. When executed by a processor, the computer-readable instructions implement the following steps: acquiring test speech data, the test speech data carrying a speaker identifier; obtaining the corresponding test i-vector based on the test speech data; querying a database based on the speaker identifier to obtain the registered i-vector corresponding to the speaker identifier; and obtaining the similarity between the test i-vector and the registered i-vector with a cosine similarity algorithm, and detecting, based on the similarity, whether the test i-vector and the registered i-vector correspond to the same speaker.
A person of ordinary skill in the art can understand that all or part of the processes of the methods in the above embodiments may be implemented by instructing the relevant hardware through computer-readable instructions, which may be stored in a non-volatile computer-readable storage medium; when executed, the computer-readable instructions may include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Those skilled in the art can clearly understand that, for convenience and brevity of description, only the above division of functional units and modules is used as an example; in practice, the above functions may be assigned to different functional units or modules as needed, that is, the internal structure of the apparatus may be divided into different functional units or modules to complete all or part of the functions described above.
The above embodiments are only intended to describe the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application, and shall all fall within the protection scope of this application.

Claims (20)

  1. An i-vector extraction method, characterized by comprising:
    acquiring training speech data of a speaker, and extracting training speech features corresponding to the training speech data;
    training, based on a preset UBM model, a total variability subspace corresponding to the preset UBM model;
    projecting the training speech features onto the total variability subspace to obtain a first i-vector; and
    projecting the first i-vector onto the total variability subspace to obtain a registered i-vector corresponding to the speaker.
  2. The i-vector extraction method according to claim 1, characterized in that the extracting of the training speech features corresponding to the training speech data comprises:
    preprocessing the training speech data to obtain preprocessed speech data;
    performing a fast Fourier transform on the preprocessed speech data to obtain a spectrum of the training speech data, and obtaining a power spectrum of the training speech data from the spectrum;
    processing the power spectrum of the training speech data with a Mel-scale filter bank to obtain a Mel power spectrum of the training speech data; and
    performing cepstral analysis on the Mel power spectrum to obtain MFCC features of the training speech data.
  3. The i-vector extraction method according to claim 1, characterized in that the training, based on the preset UBM model, of the total variability subspace corresponding to the preset UBM model comprises:
    acquiring high-dimensional sufficient statistics of the preset UBM model; and
    iterating on the high-dimensional sufficient statistics with an expectation-maximization algorithm to obtain the corresponding total variability subspace.
  4. The i-vector extraction method according to claim 1, characterized in that the projecting of the training speech features onto the total variability subspace to obtain a first i-vector comprises:
    obtaining a GMM-UBM model from the training speech features and the preset UBM model by mean MAP adaptation; and
    projecting the training speech features onto the total variability subspace with the formula s1 = m + Tw1 to obtain the first i-vector, where s1 is the mean supervector corresponding to the training speech features in the C*F-dimensional GMM-UBM model, m is a speaker-independent and channel-independent C*F-dimensional supervector, T is the total variability subspace of dimension CF*N, and w1 is the first i-vector of dimension N.
  5. The i-vector extraction method according to claim 1, characterized in that the projecting of the first i-vector onto the total variability subspace to obtain a registered i-vector corresponding to the speaker comprises:
    projecting the first i-vector onto the total variability subspace with the formula s2 = m + Tw2 to obtain the registered i-vector, where s2 is the D*G-dimensional mean supervector corresponding to the registered i-vector, m is a speaker-independent and channel-independent D*G-dimensional supervector, T is the total variability subspace of dimension DG*M, and w2 is the registered i-vector of dimension M.
  6. A speaker recognition method, characterized by comprising:
    acquiring test speech data, the test speech data carrying a speaker identifier;
    processing the test speech data with the i-vector extraction method according to any one of claims 1 to 5 to obtain a corresponding test i-vector;
    querying a database based on the speaker identifier to obtain a registered i-vector corresponding to the speaker identifier; and
    obtaining a similarity between the test i-vector and the registered i-vector with a cosine similarity algorithm, and detecting, based on the similarity, whether the test i-vector and the registered i-vector correspond to the same speaker.
  7. An i-vector extraction apparatus, characterized by comprising:
    a speech data acquisition module, configured to acquire training speech data of a speaker and extract training speech features corresponding to the training speech data;
    a variability-space training module, configured to train, based on a preset UBM model, a total variability subspace corresponding to the preset UBM model;
    a variability-space projection module, configured to project the training speech features onto the total variability subspace to obtain a first i-vector; and
    an i-vector acquisition module, configured to project the first i-vector onto the total variability subspace to obtain a registered i-vector corresponding to the speaker.
  8. A speaker recognition apparatus, characterized by comprising:
    a test data acquisition module, configured to acquire test speech data, the test speech data carrying a speaker identifier;
    a test vector acquisition module, configured to process the test speech data with the i-vector extraction method according to any one of claims 1 to 5 to obtain a corresponding test i-vector;
    a registered vector acquisition module, configured to query a database based on the speaker identifier to obtain a registered i-vector corresponding to the speaker identifier; and
    a corresponding-speaker determination module, configured to obtain a similarity between the test i-vector and the registered i-vector with a cosine similarity algorithm, and to detect, based on the similarity, whether the test i-vector and the registered i-vector correspond to the same speaker.
  9. A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, characterized in that the processor implements the following steps when executing the computer-readable instructions:
    acquiring training speech data of a speaker, and extracting training speech features corresponding to the training speech data;
    training, based on a preset UBM model, a total variability subspace corresponding to the preset UBM model;
    projecting the training speech features onto the total variability subspace to obtain a first i-vector; and
    projecting the first i-vector onto the total variability subspace to obtain a registered i-vector corresponding to the speaker.
  10. The computer device according to claim 9, characterized in that the extracting of the training speech features corresponding to the training speech data comprises:
    preprocessing the training speech data to obtain preprocessed speech data;
    performing a fast Fourier transform on the preprocessed speech data to obtain a spectrum of the training speech data, and obtaining a power spectrum of the training speech data from the spectrum;
    processing the power spectrum of the training speech data with a Mel-scale filter bank to obtain a Mel power spectrum of the training speech data; and
    performing cepstral analysis on the Mel power spectrum to obtain MFCC features of the training speech data.
  11. The computer device according to claim 9, characterized in that the training, based on the preset UBM model, of the total variability subspace corresponding to the preset UBM model comprises:
    acquiring high-dimensional sufficient statistics of the preset UBM model; and
    iterating on the high-dimensional sufficient statistics with an expectation-maximization algorithm to obtain the corresponding total variability subspace.
  12. The computer device according to claim 9, characterized in that the projecting of the training speech features onto the total variability subspace to obtain a first i-vector comprises:
    obtaining a GMM-UBM model from the training speech features and the preset UBM model by mean MAP adaptation; and
    projecting the training speech features onto the total variability subspace with the formula s1 = m + Tw1 to obtain the first i-vector, where s1 is the mean supervector corresponding to the training speech features in the C*F-dimensional GMM-UBM model, m is a speaker-independent and channel-independent C*F-dimensional supervector, T is the total variability subspace of dimension CF*N, and w1 is the first i-vector of dimension N.
  13. The computer device according to claim 9, characterized in that the projecting of the first i-vector onto the total variability subspace to obtain a registered i-vector corresponding to the speaker comprises:
    projecting the first i-vector onto the total variability subspace with the formula s2 = m + Tw2 to obtain the registered i-vector, where s2 is the D*G-dimensional mean supervector corresponding to the registered i-vector, m is a speaker-independent and channel-independent D*G-dimensional supervector, T is the total variability subspace of dimension DG*M, and w2 is the registered i-vector of dimension M.
  14. A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, characterized in that the processor implements the following steps when executing the computer-readable instructions:
    acquiring test speech data, the test speech data carrying a speaker identifier;
    processing the test speech data with the i-vector extraction method according to any one of claims 1 to 5 to obtain a corresponding test i-vector;
    querying a database based on the speaker identifier to obtain a registered i-vector corresponding to the speaker identifier; and
    obtaining a similarity between the test i-vector and the registered i-vector with a cosine similarity algorithm, and detecting, based on the similarity, whether the test i-vector and the registered i-vector correspond to the same speaker.
  15. One or more non-volatile readable storage media storing computer-readable instructions, characterized in that the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
    acquiring training speech data of a speaker, and extracting training speech features corresponding to the training speech data;
    training, based on a preset UBM model, a total variability subspace corresponding to the preset UBM model;
    projecting the training speech features onto the total variability subspace to obtain a first i-vector; and
    projecting the first i-vector onto the total variability subspace to obtain a registered i-vector corresponding to the speaker.
  16. The non-volatile readable storage medium according to claim 15, characterized in that the extracting of the training speech features corresponding to the training speech data comprises:
    preprocessing the training speech data to obtain preprocessed speech data;
    performing a fast Fourier transform on the preprocessed speech data to obtain a spectrum of the training speech data, and obtaining a power spectrum of the training speech data from the spectrum;
    processing the power spectrum of the training speech data with a Mel-scale filter bank to obtain a Mel power spectrum of the training speech data; and
    performing cepstral analysis on the Mel power spectrum to obtain MFCC features of the training speech data.
  17. The non-volatile readable storage medium according to claim 15, characterized in that the training, based on the preset UBM model, of the total variability subspace corresponding to the preset UBM model comprises:
    acquiring high-dimensional sufficient statistics of the preset UBM model; and
    iterating on the high-dimensional sufficient statistics with an expectation-maximization algorithm to obtain the corresponding total variability subspace.
  18. The non-volatile readable storage medium according to claim 15, characterized in that the projecting of the training speech features onto the total variability subspace to obtain a first i-vector comprises:
    obtaining a GMM-UBM model from the training speech features and the preset UBM model by mean MAP adaptation; and
    projecting the training speech features onto the total variability subspace with the formula s1 = m + Tw1 to obtain the first i-vector, where s1 is the mean supervector corresponding to the training speech features in the C*F-dimensional GMM-UBM model, m is a speaker-independent and channel-independent C*F-dimensional supervector, T is the total variability subspace of dimension CF*N, and w1 is the first i-vector of dimension N.
  19. The non-volatile readable storage medium according to claim 15, characterized in that the projecting of the first i-vector onto the total variability subspace to obtain a registered i-vector corresponding to the speaker comprises:
    projecting the first i-vector onto the total variability subspace with the formula s2 = m + Tw2 to obtain the registered i-vector, where s2 is the D*G-dimensional mean supervector corresponding to the registered i-vector, m is a speaker-independent and channel-independent D*G-dimensional supervector, T is the total variability subspace of dimension DG*M, and w2 is the registered i-vector of dimension M.
  20. One or more non-volatile readable storage media storing computer-readable instructions, characterized in that the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
    acquiring test speech data, the test speech data carrying a speaker identifier;
    processing the test speech data with the i-vector extraction method according to any one of claims 1 to 5 to obtain a corresponding test i-vector;
    querying a database based on the speaker identifier to obtain a registered i-vector corresponding to the speaker identifier; and
    obtaining a similarity between the test i-vector and the registered i-vector with a cosine similarity algorithm, and detecting, based on the similarity, whether the test i-vector and the registered i-vector correspond to the same speaker.
PCT/CN2018/092589 2018-06-06 2018-06-25 I-vector extraction method, speaker recognition method and apparatus, device, and medium WO2019232826A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810574010.4A CN109065022B (en) 2018-06-06 2018-06-06 Method for extracting i-vector, method, device, equipment and medium for speaker recognition
CN201810574010.4 2018-06-06

Publications (1)

Publication Number Publication Date
WO2019232826A1 true WO2019232826A1 (en) 2019-12-12

Family

ID=64820489

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/092589 WO2019232826A1 (en) 2018-06-06 2018-06-25 I-vector extraction method, speaker recognition method and apparatus, device, and medium

Country Status (2)

Country Link
CN (1) CN109065022B (en)
WO (1) WO2019232826A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111700718A (en) * 2020-07-13 2020-09-25 北京海益同展信息科技有限公司 Holding posture identifying method, holding posture identifying device, artificial limb and readable storage medium

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020154883A1 (en) * 2019-01-29 2020-08-06 深圳市欢太科技有限公司 Speech information processing method and apparatus, and storage medium and electronic device
CN111712874B (en) 2019-10-31 2023-07-14 支付宝(杭州)信息技术有限公司 Method, system, device and storage medium for determining sound characteristics
CN110827834B (en) * 2019-11-11 2022-07-12 广州国音智能科技有限公司 Voiceprint registration method, system and computer readable storage medium
CN111161713A (en) * 2019-12-20 2020-05-15 北京皮尔布莱尼软件有限公司 Voice gender identification method and device and computing equipment
CN111508505B (en) * 2020-04-28 2023-11-03 讯飞智元信息科技有限公司 Speaker recognition method, device, equipment and storage medium
CN114420142A (en) * 2022-03-28 2022-04-29 北京沃丰时代数据科技有限公司 Voice conversion method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104240706A (en) * 2014-09-12 2014-12-24 浙江大学 Speaker recognition method based on GMM Token matching similarity correction scores
CN105933323A (en) * 2016-06-01 2016-09-07 百度在线网络技术(北京)有限公司 Voiceprint register and authentication method and device
CN107240397A (en) * 2017-08-14 2017-10-10 广东工业大学 A kind of smart lock and its audio recognition method and system based on Application on Voiceprint Recognition
CN107369440A (en) * 2017-08-02 2017-11-21 北京灵伴未来科技有限公司 The training method and device of a kind of Speaker Identification model for phrase sound
WO2018029071A1 (en) * 2016-08-12 2018-02-15 Imra Europe S.A.S Audio signature for speech command spotting

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102737633B (en) * 2012-06-21 2013-12-25 北京华信恒达软件技术有限公司 Method and device for recognizing speaker based on tensor subspace analysis
US9858919B2 (en) * 2013-11-27 2018-01-02 International Business Machines Corporation Speaker adaptation of neural network acoustic models using I-vectors
CN104167208B (en) * 2014-08-08 2017-09-15 中国科学院深圳先进技术研究院 A kind of method for distinguishing speek person and device
CN105810199A (en) * 2014-12-30 2016-07-27 中国科学院深圳先进技术研究院 Identity verification method and device for speakers
US10553218B2 (en) * 2016-09-19 2020-02-04 Pindrop Security, Inc. Dimensionality reduction of baum-welch statistics for speaker recognition
CN106971713B (en) * 2017-01-18 2020-01-07 北京华控智加科技有限公司 Speaker marking method and system based on density peak value clustering and variational Bayes
CN107146601B (en) * 2017-04-07 2020-07-24 南京邮电大学 Rear-end i-vector enhancement method for speaker recognition system
CN107633845A (en) * 2017-09-11 2018-01-26 清华大学 A kind of duscriminant local message distance keeps the method for identifying speaker of mapping

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104240706A (en) * 2014-09-12 2014-12-24 浙江大学 Speaker recognition method based on GMM Token matching similarity correction scores
CN105933323A (en) * 2016-06-01 2016-09-07 百度在线网络技术(北京)有限公司 Voiceprint register and authentication method and device
WO2018029071A1 (en) * 2016-08-12 2018-02-15 Imra Europe S.A.S Audio signature for speech command spotting
CN107369440A (en) * 2017-08-02 2017-11-21 北京灵伴未来科技有限公司 The training method and device of a kind of Speaker Identification model for phrase sound
CN107240397A (en) * 2017-08-14 2017-10-10 广东工业大学 A kind of smart lock and its audio recognition method and system based on Application on Voiceprint Recognition

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111700718A (en) * 2020-07-13 2020-09-25 北京海益同展信息科技有限公司 Holding posture identifying method, holding posture identifying device, artificial limb and readable storage medium

Also Published As

Publication number Publication date
CN109065022B (en) 2022-08-09
CN109065022A (en) 2018-12-21

Similar Documents

Publication Publication Date Title
WO2019232826A1 (en) I-vector extraction method, speaker recognition method and apparatus, device, and medium
WO2019237519A1 (en) General vector training method, voice clustering method, apparatus, device and medium
JP7008638B2 (en) voice recognition
CN109065028B (en) Speaker clustering method, speaker clustering device, computer equipment and storage medium
WO2019232829A1 (en) Voiceprint recognition method and apparatus, computer device and storage medium
US9940935B2 (en) Method and device for voiceprint recognition
WO2020177380A1 (en) Voiceprint detection method, apparatus and device based on short text, and storage medium
US8160877B1 (en) Hierarchical real-time speaker recognition for biometric VoIP verification and targeting
Li et al. An overview of noise-robust automatic speech recognition
CN108922543B (en) Model base establishing method, voice recognition method, device, equipment and medium
WO2018149077A1 (en) Voiceprint recognition method, device, storage medium, and background server
WO2020181824A1 (en) Voiceprint recognition method, apparatus and device, and computer-readable storage medium
CN109360572B (en) Call separation method and device, computer equipment and storage medium
TW201935464A (en) Method and device for voiceprint recognition based on memorability bottleneck features
WO2019227586A1 (en) Voice model training method, speaker recognition method, apparatus, device and medium
WO2018223727A1 (en) Voiceprint recognition method, apparatus and device, and medium
CN105096955B (en) A kind of speaker&#39;s method for quickly identifying and system based on model growth cluster
WO2019200744A1 (en) Self-updated anti-fraud method and apparatus, computer device and storage medium
WO2020034628A1 (en) Accent identification method and device, computer device, and storage medium
WO2019227574A1 (en) Voice model training method, voice recognition method, device and equipment, and medium
CN113223536B (en) Voiceprint recognition method and device and terminal equipment
WO2022141868A1 (en) Method and apparatus for extracting speech features, terminal, and storage medium
WO2023001128A1 (en) Audio data processing method, apparatus and device
CN116490920A (en) Method for detecting an audio challenge, corresponding device, computer program product and computer readable carrier medium for a speech input processed by an automatic speech recognition system
Savchenko Method for reduction of speech signal autoregression model for speech transmission systems on low-speed communication channels

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18921590

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 12.03.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18921590

Country of ref document: EP

Kind code of ref document: A1