CN109065022A - I-vector extraction method, speaker recognition method, device, equipment and medium - Google Patents

I-vector extraction method, speaker recognition method, device, equipment and medium

Info

Publication number
CN109065022A
CN109065022A (application CN201810574010.4A)
Authority
CN
China
Prior art keywords
vector
speaker
voice data
training
trained
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810574010.4A
Other languages
Chinese (zh)
Other versions
CN109065022B (en)
Inventor
涂宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201810574010.4A priority Critical patent/CN109065022B/en
Priority to PCT/CN2018/092589 priority patent/WO2019232826A1/en
Publication of CN109065022A publication Critical patent/CN109065022A/en
Application granted granted Critical
Publication of CN109065022B publication Critical patent/CN109065022B/en
Legal status: Active (granted)


Classifications

    • G — Physics
    • G10 — Musical instruments; Acoustics
    • G10L — Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding
    • G10L15/00 — Speech recognition
    • G10L15/02 — Feature extraction for speech recognition; selection of recognition unit
    • G10L15/06 — Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 — Training
    • G10L2015/0638 — Interactive procedures
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00–G10L21/00
    • G10L25/48 — Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 — Speech or voice analysis techniques specially adapted for comparison or discrimination

Landscapes

  • Engineering & Computer Science
  • Computational Linguistics
  • Health & Medical Sciences
  • Audiology, Speech & Language Pathology
  • Human Computer Interaction
  • Physics & Mathematics
  • Acoustics & Sound
  • Multimedia
  • Signal Processing
  • Artificial Intelligence
  • Computer Vision & Pattern Recognition
  • Electrically Operated Instructional Devices

Abstract

The invention discloses an i-vector extraction method, a speaker recognition method, and corresponding devices, equipment and media. The i-vector extraction method includes: obtaining training voice data of a speaker and extracting the corresponding training speech features; training a total variability subspace corresponding to a preset UBM model, based on that model; projecting the training speech features onto the total variability subspace to obtain a first i-vector; and projecting the first i-vector onto the total variability subspace again to obtain a registration i-vector corresponding to the speaker. By projecting twice, that is, reducing the dimensionality twice, the method removes more noise from the training speech features and improves the purity of the extracted speaker features, while the smaller computation space after dimensionality reduction also improves recognition efficiency.

Description

I-vector extraction method, speaker recognition method, device, equipment and medium
Technical field
The present invention relates to the field of speech recognition, and more particularly to an i-vector extraction method, a speaker recognition method, and corresponding devices, equipment and media.
Background technique
Speaker recognition, also known as voiceprint recognition, is a biometric technique that identifies a speaker's identity from the speaker-specific information contained in the voice signal. In recent years, the introduction of the i-vector (identity vector) modeling method, based on factor analysis, has markedly improved the performance of speaker recognition systems. In factor analysis of a speaker's voice, the channel subspace usually also contains speaker information. In the i-vector approach, a single low-dimensional total variability space represents both the speaker subspace and the channel subspace, and projecting the speaker's voice onto this space by dimensionality reduction yields a fixed-length characterization vector (the i-vector). However, the i-vector obtained by existing i-vector modeling still contains many interfering factors, which increases the complexity of using it for speaker recognition.
Summary of the invention
Accordingly, it is necessary to address the above technical problem by providing an i-vector extraction method, device, computer equipment and storage medium that can remove more of these interfering factors.
An i-vector extraction method, comprising:
obtaining training voice data of a speaker, and extracting the corresponding training speech features;
training a total variability subspace corresponding to a preset UBM model, based on the preset UBM model;
projecting the training speech features onto the total variability subspace to obtain a first i-vector;
projecting the first i-vector onto the total variability subspace to obtain a registration i-vector corresponding to the speaker.
An i-vector extraction device, comprising:
a voice data acquisition module, for obtaining training voice data of a speaker and extracting the corresponding training speech features;
a variability space training module, for training a total variability subspace corresponding to a preset UBM model, based on the preset UBM model;
a variability space projection module, for projecting the training speech features onto the total variability subspace to obtain a first i-vector;
an i-vector acquisition module, for projecting the first i-vector onto the total variability subspace to obtain a registration i-vector corresponding to the speaker.
A computer device, comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor, when executing the computer program, realizes the steps of the i-vector extraction method.
A computer-readable storage medium storing a computer program which, when executed by a processor, realizes the steps of the i-vector extraction method.
This embodiment also provides a speaker recognition method, comprising:
obtaining test voice data, the test voice data carrying a speaker identifier;
obtaining the corresponding test i-vector based on the test voice data;
querying a database based on the speaker identifier to obtain the registration i-vector corresponding to that identifier;
computing the similarity between the test i-vector and the registration i-vector using the cosine similarity algorithm, and detecting from the similarity whether the test i-vector and the registration i-vector correspond to the same speaker.
A speaker recognition device, comprising:
a test data acquisition module, for obtaining test voice data carrying a speaker identifier;
a test vector acquisition module, for processing the test voice data with the i-vector extraction method to obtain the corresponding test i-vector;
a registration vector acquisition module, for querying a database based on the speaker identifier to obtain the registration i-vector corresponding to that identifier;
a speaker determination module, for computing the similarity between the test i-vector and the registration i-vector using the cosine similarity algorithm and detecting from the similarity whether they correspond to the same speaker.
A computer device, comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor, when executing the computer program, realizes the steps of the speaker recognition method.
A computer-readable storage medium storing a computer program which, when executed by a processor, realizes the steps of the speaker recognition method.
In the i-vector extraction method, speaker recognition method, devices, equipment and media provided by the embodiments of the present invention, the training speech features are first projected onto the total variability subspace to obtain a first i-vector, and the first i-vector is then projected onto the total variability subspace a second time to obtain the registration i-vector. Projecting twice, that is, reducing the dimensionality twice, removes more noise from the training speech feature data and improves the purity of the extracted speaker features; at the same time, the smaller computation space after dimensionality reduction improves recognition efficiency and reduces recognition complexity.
Detailed description of the invention
In order to explain the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative labor.
Fig. 1 is a schematic diagram of the application environment of the i-vector extraction method in an embodiment of the invention;
Fig. 2 is a flowchart of the i-vector extraction method in an embodiment of the invention;
Fig. 3 is another flowchart of the i-vector extraction method in an embodiment of the invention;
Fig. 4 is another flowchart of the i-vector extraction method in an embodiment of the invention;
Fig. 5 is another flowchart of the i-vector extraction method in an embodiment of the invention;
Fig. 6 is a flowchart of the speaker recognition method in an embodiment of the invention;
Fig. 7 is a functional block diagram of the i-vector extraction device in an embodiment of the invention;
Fig. 8 is a functional block diagram of the speaker recognition device in an embodiment of the invention;
Fig. 9 is a schematic diagram of the computer device in an embodiment of the invention.
Specific embodiment
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
The i-vector extraction method provided by the embodiments of the present invention can be applied in the application environment of Fig. 1, in which a computer device communicates with a recognition server over a network. The computer device includes, but is not limited to, personal computers, laptops, smartphones, tablet computers and portable wearable devices. The recognition server can be implemented as an independent server or as a server cluster composed of multiple servers.
In one embodiment, as shown in Fig. 2, an i-vector extraction method is provided. Taking its application to the recognition server in Fig. 1 as an example, the method includes the following steps:
S10. Obtain training voice data of a speaker, and extract the corresponding training speech features.
The training voice data of the speaker is the original voice data provided by the speaker. The training speech features are the features that distinguish the speaker from other people; in this embodiment, Mel-Frequency Cepstral Coefficients (hereinafter MFCC features) can be used as the training speech features.
Research on human hearing has found that the ear behaves like a bank of filters, attending only to certain specific frequency components (human hearing is nonlinear in frequency); that is, the ear only picks up signals in a limited range of sound frequencies. These filters are not evenly distributed along the frequency axis: there are many densely spaced filters in the low-frequency region, while in the high-frequency region the filters become fewer and are sparsely distributed. A mel-scale filter bank has high resolution in the low-frequency part, matching the auditory properties of the human ear — this is the physical significance of the mel scale.
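To illustrate the mel scale described above, the following Python sketch (hypothetical helper names, not part of the patent) uses the common conversion m = 2595·log10(1 + f/700); filter centers spaced evenly in mel come out dense at low frequencies and sparse at high frequencies:

```python
import numpy as np

def hz_to_mel(f):
    """Map frequency in Hz to the mel scale (denser resolution at low frequencies)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping from mel back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Centers evenly spaced in mel are dense at low Hz and sparse at high Hz
centers_mel = np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 24)
print(np.round(mel_to_hz(centers_mel)))
```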
S20. Train a total variability subspace corresponding to a preset UBM model, based on the preset UBM model.
The preset UBM (Universal Background Model) is a Gaussian Mixture Model (GMM) characterizing the speech feature distribution of a large number of non-specific speakers. A UBM is generally trained on large amounts of voice data that are speaker-independent and channel-independent, so it can be regarded as a model unrelated to any specific speaker: it only fits the general distribution of human speech features and does not represent any particular speaker. A UBM is preset in the recognition server because, in the voiceprint registration stage of the voiceprint recognition process, the voice data available for a specific speaker is usually very scarce, and a GMM trained on it alone cannot cover the feature space. The parameters of the UBM can therefore be adjusted according to the features of the training voice to characterize the individual information of the specific speaker, while features not covered by the training voice are approximated by the similar feature distribution in the UBM. This approach alleviates the system performance problems caused by insufficient training voice data.
The total variability subspace, also called the T space (Total Variability Space), is a single, directly set global projection matrix containing all possible speaker information in the voice data; the T space does not separate the speaker space from the channel space. It projects high-dimensional sufficient statistics (supervectors) down to a low-dimensional i-vector that characterizes the speaker, i.e. it performs dimensionality reduction. Training the T space consists of computing it from the preset UBM model using factor analysis and the EM (Expectation-Maximization) algorithm until convergence.
In this step, the total variability subspace obtained from the preset UBM model does not distinguish the speaker space from the channel space: the information of both is merged into one space, which reduces computational complexity and facilitates the subsequent extraction of i-vectors from the total variability subspace.
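As a rough illustration of UBM training (not quoted from the patent), a diagonal-covariance GMM can be fitted to pooled, speaker-independent feature frames; here scikit-learn's GaussianMixture stands in for the UBM trainer, and the random frame matrix is a placeholder for real pooled MFCCs:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder for pooled, speaker- and channel-balanced MFCC frames, shape (frames, dim)
background_frames = np.random.randn(10000, 13)

ubm = GaussianMixture(n_components=64, covariance_type='diag', max_iter=100)
ubm.fit(background_frames)
# ubm.weights_, ubm.means_, ubm.covariances_ parameterize the background distribution
```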
S30. Project the training speech features onto the total variability subspace to obtain a first i-vector.
The first i-vector is the fixed-length characterization vector obtained by projecting the training speech features onto the low-dimensional total variability subspace.
Specifically, this step uses the formula s1 = m + Tw1: the high-dimensional training speech features are projected onto the total variability subspace to form the low-dimensional first i-vector, reducing the dimensionality of the projected features and removing more noise, which facilitates identifying the speaker from the first i-vector.
S40. Project the first i-vector onto the total variability subspace to obtain a registration i-vector corresponding to the speaker.
The total variability subspace here is the one obtained in step S20, which does not separate the speaker space from the channel space but directly sets a single global T space (Total Variability Space) containing all possible information in the voice data.
The registration i-vector is the fixed-length characterization vector obtained by projecting the first i-vector onto the low-dimensional total variability subspace; it is recorded in a database on the recognition server, associated with the speaker ID, to serve as the speaker's identity characterization.
In a specific embodiment, step S40 — projecting the first i-vector onto the total variability subspace to obtain the registration i-vector — specifically includes the following step:
S41. Use the formula s2 = m + Tw2 to project the first i-vector onto the total variability subspace and obtain the registration i-vector, where s2 is the D*G-dimensional mean supervector corresponding to the registration i-vector; m is the speaker-independent and channel-independent D*G-dimensional supervector; T is the total variability subspace, of dimension DG*M; and w2 is the registration i-vector, of dimension M.
In this embodiment, s2 is the Gaussian mean supervector associated with the first i-vector obtained in step S30; m is the speaker-independent and channel-independent D*G-dimensional supervector, spliced from the mean supervectors of the UBM model; and w2, a vector with a standard normal prior, is the registration i-vector, of dimension M.
Further, T (the total variability subspace) in the formula is obtained by training on the high-dimensional sufficient statistics of the UBM model and iteratively updating them with the EM algorithm until the T space converges. Substituting the T space into the formula s2 = m + Tw2, and since s2, m and T are all known, w2 — the registration i-vector — can be solved for: w2 = (s2 − m)/T, understood as solving the linear system for w2.
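Reading w = (s − m)/T as a linear solve, a minimal numpy sketch of both projections follows (assumed shapes; in practice the posterior-mean formula of step S22 is used rather than plain least squares):

```python
import numpy as np

def project_ivector(s, m, T):
    """Least-squares solution of s = m + T w for the low-dimensional vector w.

    s : (MD,)   mean supervector of the utterance
    m : (MD,)   speaker/channel-independent UBM mean supervector
    T : (MD, V) total variability matrix
    """
    w, *_ = np.linalg.lstsq(T, s - m, rcond=None)
    return w

MD, V = 64 * 13, 100
T = np.random.randn(MD, V)               # stand-in for a trained T matrix
m = np.random.randn(MD)
s1 = np.random.randn(MD)                 # supervector from the GMM-UBM of step S31
w1 = project_ivector(s1, m, T)           # first i-vector (step S30)
s2 = m + T @ w1                          # mean supervector of the first i-vector
w2 = project_ivector(s2, m, T)           # second projection -> registration i-vector
```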
In the i-vector extraction method provided by this embodiment, the training speech features are first projected onto the total variability subspace to obtain the first i-vector, and the first i-vector is then projected onto the total variability subspace a second time to obtain the registration i-vector. Projecting twice, that is, reducing the dimensionality twice, removes more noise from the training speech feature data and improves the purity of the extracted speaker features, while the smaller computation space after dimensionality reduction improves recognition efficiency; the speaker recognition method provided by this embodiment uses this i-vector extraction method for recognition, reducing recognition complexity.
In one embodiment, as shown in Fig. 3, step S10 — extracting the training speech features corresponding to the training voice data — specifically includes the following steps:
S11: Preprocess the training voice data to obtain preprocessed voice data.
In a specific embodiment, step S11 — preprocessing the training voice data to obtain preprocessed voice data — specifically includes the following steps:
S111: Apply pre-emphasis to the training voice data. The pre-emphasis formula is s'_n = s_n − a·s_{n−1}, where s_n is the signal amplitude in the time domain, s_{n−1} is the signal amplitude at the previous moment, s'_n is the signal amplitude in the time domain after pre-emphasis, and a is the pre-emphasis coefficient, with 0.9 < a < 1.0.
Pre-emphasis is a signal processing technique that compensates the high-frequency components of the input signal at the transmitting end. As the signal rate increases, the signal is strongly attenuated in transmission; to allow the receiving end to obtain a good signal waveform, the damaged signal must be compensated. The idea of pre-emphasis is to strengthen the high-frequency components of the signal at the transmitting end of the transmission line, compensating their excessive attenuation in transit. Pre-emphasis does not affect noise, so it effectively improves the output signal-to-noise ratio.
In this embodiment, pre-emphasis with the above formula is applied to the training voice data; a value of a = 0.97 works well. Pre-emphasis removes the interference caused by the vocal cords and lips during vocalization, effectively compensates the suppressed high-frequency part of the training voice data, highlights the high-frequency formants, and strengthens the signal amplitude, all of which help extract the training speech features.
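A one-line numpy sketch of the pre-emphasis formula above (illustrative, not quoted from the patent):

```python
import numpy as np

def preemphasize(signal, a=0.97):
    """Apply s'_n = s_n - a * s_{n-1}; boosts high frequencies before analysis."""
    return np.append(signal[0], signal[1:] - a * signal[:-1])
```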
S112: Divide the pre-emphasized training voice data into frames.
After pre-emphasis, the training voice data should also be divided into frames. Framing is a speech processing technique that cuts the whole voice signal into segments; each frame is 10-30 ms long, and the frame shift is generally 1/2 the frame length. The frame shift is the overlap between two adjacent frames, which prevents excessive change from one frame to the next. Framing divides the training voice data into segments, which facilitates the extraction of the training speech features.
S113: Window the framed training voice data to obtain the preprocessed voice data. The windowing formula is s'_n = s_n × (0.54 − 0.46·cos(2πn/(N−1))), where N is the window length, n is the time index, s_n is the signal amplitude in the time domain, and s'_n is the signal amplitude in the time domain after windowing.
After the training voice data is divided into frames, discontinuities appear at the beginning and end of each frame, so the more frames there are, the larger the error relative to the original training voice data. Windowing solves this problem: it makes the framed training voice data continuous, so that each frame exhibits the properties of a periodic function. Windowing specifically means multiplying the training voice data by a window function; here a Hamming window can be chosen, giving the formula above with N the Hamming window length. Windowing the training voice data yields the preprocessed voice data and makes the framed time-domain signal continuous, which helps extract the training speech features of the training voice data.
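The framing of S112 and the Hamming windowing of S113 can be sketched together (assumed 16 kHz signal, 25 ms frames, 12.5 ms shift; the signal is assumed to be at least one frame long):

```python
import numpy as np

def frame_and_window(signal, frame_len=400, hop=200):
    """Split into overlapping frames (25 ms frames, 12.5 ms shift at 16 kHz) and
    taper each frame with a Hamming window 0.54 - 0.46*cos(2*pi*n/(N-1))."""
    n_frames = 1 + (len(signal) - frame_len) // hop   # assumes len(signal) >= frame_len
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])
    return frames * window
```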
The preprocessing operations of steps S111-S113 lay the foundation for extracting the training speech features of the training voice data: they make the extracted features more representative of the training voice data, so that the corresponding GMM-UBM model can be trained from them.
S12: Apply the Fast Fourier Transform to the preprocessed voice data to obtain the spectrum of the training voice data, and obtain the power spectrum of the training voice data from the spectrum.
The Fast Fourier Transform (FFT) is the general term for efficient, fast computer algorithms for the discrete Fourier transform. Using it greatly reduces the number of multiplications a computer needs to compute the discrete Fourier transform; the more sampling points are transformed, the more significant the saving.
Specifically, the FFT converts the preprocessed voice data from signal amplitudes in the time domain to signal amplitudes in the frequency domain (the spectrum). The spectrum is computed as s(k) = Σ_{n=1}^{N} s(n)·e^{−2πikn/N}, 1 ≤ k ≤ N, where N is the frame size, s(k) is the signal amplitude in the frequency domain, s(n) is the signal amplitude in the time domain, n is the time index, and i is the imaginary unit. Once the spectrum of the preprocessed voice data is obtained, the power spectrum of the preprocessed voice data — hereinafter called the power spectrum of the training voice data — follows directly: P(k) = |s(k)|²/N, 1 ≤ k ≤ N, where N is the frame size and s(k) is the signal amplitude in the frequency domain. Converting the preprocessed voice data from time-domain amplitudes to frequency-domain amplitudes, and then obtaining the power spectrum of the training voice data from them, provides an important technical foundation for extracting the training speech features from the power spectrum.
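The per-frame power spectrum P(k) = |s(k)|²/N can be computed with numpy's real FFT (a sketch under the framing assumptions above):

```python
import numpy as np

def power_spectrum(frames, nfft=512):
    """Per-frame power spectrum |S(k)|^2 / N from the real FFT of each frame."""
    spectrum = np.fft.rfft(frames, n=nfft)       # frequency-domain amplitudes
    return (np.abs(spectrum) ** 2) / nfft
```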
S13: Process the power spectrum of the training voice data with a mel-scale filter bank to obtain the mel power spectrum of the training voice data.
Processing the power spectrum of the training voice data with a mel-scale filter bank performs mel-frequency analysis on it, which is an analysis based on human auditory perception. As noted above, the human ear behaves like a filter bank, attending only to certain specific frequency components (hearing is nonlinear in frequency): the filters are many and densely spaced in the low-frequency region, and fewer and sparsely distributed in the high-frequency region. It should be appreciated that the mel-scale filter bank's high resolution in the low-frequency part matches the auditory properties of the human ear, which is the physical significance of the mel scale.
In this embodiment, the mel-scale filter bank cuts the frequency-domain signal into bands, so that each band corresponds to one value; if there are 22 filters, the mel power spectrum of the training voice data consists of 22 energy values. Mel-frequency analysis of the power spectrum of the training voice data retains the frequency portions closely related to the characteristics of the human ear, so the resulting mel power spectrum reflects the features of the training voice data well.
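A sketch of one common construction of the triangular mel filter bank (assumed sample rate and FFT size; the triangular-filter layout is a standard recipe, not quoted from the patent):

```python
import numpy as np

def mel_filterbank(n_filters=22, nfft=512, sr=16000):
    """Triangular filters with centers evenly spaced on the mel scale:
    dense at low frequencies, sparse at high frequencies."""
    mel_pts = np.linspace(0, 2595 * np.log10(1 + (sr / 2) / 700), n_filters + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((nfft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising slope
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling slope
    return fbank

# mel_power = power_spec @ mel_filterbank().T  -> one energy value per filter per frame
```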
S14: Perform cepstral analysis on the mel power spectrum to obtain the MFCC features of the training voice data.
The cepstrum is the inverse Fourier transform of the logarithm of a signal's Fourier spectrum; since the general Fourier spectrum is complex-valued, the cepstrum is also called the complex cepstrum.
Specifically, cepstral analysis is performed on the mel power spectrum, and the MFCC features of the training voice data are obtained from the cepstral result. Cepstral analysis converts the features contained in the mel power spectrum of the training voice data — whose original dimensionality is too high to use directly — into easy-to-use features (MFCC feature vectors used for training or recognition). The MFCC features can serve as the training speech features that discriminate between different voices: they reflect the differences between voices and can be used to recognize and distinguish the training voice data.
In a specific embodiment, step S14 — performing cepstral analysis on the mel power spectrum to obtain the MFCC features of the training voice data — includes the following steps:
S141: Take the logarithm of the mel power spectrum to obtain the mel power spectrum to be transformed.
Specifically, following the definition of the cepstrum, the logarithm log of the mel power spectrum is taken, yielding the mel power spectrum to be transformed, m.
S142: Apply the discrete cosine transform to the mel power spectrum to be transformed, obtaining the MFCC features of the training voice data.
Specifically, the Discrete Cosine Transform (DCT) is applied to the mel power spectrum to be transformed, m, to obtain the MFCC features of the corresponding training voice data; generally the 2nd through 13th coefficients are taken as the training speech features, which reflect the differences between voice data. The DCT of m is C(i) = Σ_{j=0}^{N−1} m(j)·cos(πi(2j+1)/(2N)), i = 0, 1, 2, …, N−1, where N is the frame length, m is the mel power spectrum to be transformed, and j is its index. Because the mel filters overlap, the energy values obtained with the mel-scale filter bank are correlated; the DCT compresses and abstracts the mel power spectrum m, reducing its dimensionality, and thereby yields the training speech features indirectly. Compared with the Fourier transform, the DCT result has no imaginary part, a clear computational advantage.
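Putting S141 and S142 together, a sketch using scipy's DCT (keeping the 2nd-13th coefficients, as the text specifies; the small epsilon is an added numerical guard):

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_from_mel_power(mel_power, n_coeffs=12):
    """Log-compress the mel power spectrum, then decorrelate with a DCT;
    keep coefficients 2..13 as the training speech features (steps S141-S142)."""
    log_mel = np.log(mel_power + 1e-10)     # mel power spectrum to be transformed
    cepstra = dct(log_mel, type=2, axis=-1, norm='ortho')
    return cepstra[..., 1:1 + n_coeffs]
```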
In steps S11-S14, feature extraction is performed on the training voice data, and the resulting training speech features represent the training voice data well. These features can train the corresponding GMM-UBM model and thereby yield the registration i-vector, making the result more accurate when the registration i-vector obtained from training is used for speech recognition.
It should be noted that although the features extracted above are MFCC features, the training speech features are not limited to MFCC features alone: any speech feature that effectively reflects the characteristics of the voice data can serve as a training speech feature for recognition and model training. In this embodiment, the training voice data is preprocessed into the corresponding preprocessed voice data; preprocessing allows the training speech features of the training voice data to be extracted better, making them more representative of the training voice data for use in speech recognition.
In one embodiment, as shown in Fig. 4, step S20 — training the total variability subspace corresponding to the preset UBM model, based on the preset UBM model — specifically includes the following steps:
S21. Obtain the high-dimensional sufficient statistics of the preset UBM model.
The UBM is a high-order GMM trained on sufficient voice data from many speakers, balanced across channels and across male and female voices, to describe the speaker-independent feature distribution. Its parameters can be adjusted according to the training speech features to characterize the individual information of a specific speaker, with features not covered by the training speech approximated by the similar feature distribution in the UBM, thereby solving the performance problems caused by insufficient training voice.
A statistic is a function of the sample data. In statistics, T(x) is a sufficient statistic of the parameter θ of an unknown distribution P if and only if T(x) provides all the information about θ — that is, no other statistic provides additional information about θ. A statistic is effectively a compression of the data distribution: information contained in the sample may be lost when the sample is processed into a statistic, and if no information is lost, the statistic is called sufficient. For a Gaussian distribution, for example, the expectation and the covariance matrix are its two sufficient statistics, because knowing these two parameters uniquely determines the Gaussian.
Specifically, the high-dimensional sufficient statistics of the preset UBM model are obtained as follows: take a speaker sample X = {x1, x2, …, xn} that follows the distribution F(x) of the preset UBM model with parameter θ. A statistic of this sample is T = r(x1, x2, …, xn). If T follows a distribution F(T), and the parameter θ of the sample distribution F(x) can be derived from F(T) — that is, all the information about θ contained in F(x) is contained in F(T) — then T is a high-dimensional sufficient statistic of the preset UBM model.
In this step, the recognition server obtains the zeroth-order and first-order sufficient statistics of the preset UBM model as the technical foundation for training the total variability subspace.
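A sketch of the zeroth- and first-order statistics computed against the sklearn stand-in UBM from step S20 (centering the first-order statistics on the UBM means is one common convention, assumed here):

```python
import numpy as np

def sufficient_statistics(frames, ubm):
    """Zeroth-order (N) and centered first-order (F) statistics of one
    utterance's frames under a trained sklearn GaussianMixture as the UBM."""
    post = ubm.predict_proba(frames)                 # (T, M) frame posteriors
    N = post.sum(axis=0)                             # (M,)  occupation counts
    F = post.T @ frames - N[:, None] * ubm.means_    # (M, D) centered first order
    return N, F
```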
S22. Iterate on the high-dimensional sufficient statistics with the EM algorithm to obtain the corresponding total variability subspace.
The EM algorithm (Expectation-Maximization) is an iterative algorithm used in statistics to find maximum likelihood estimates in probabilistic models that depend on unobservable latent variables. For example, initialize two parameters A and B, both unknown at the start; knowing A yields information about B, and likewise knowing B yields information about A. Assign A some initial value, derive an estimate of B from it, then re-estimate A from the current value of B, and continue until convergence.
The EM procedure is as follows: 1. initialize the distribution parameters; 2. repeat the E step and the M step until convergence. E step: estimate the expected values of the unknown parameters, given the current parameter estimate. M step: re-estimate the distribution parameters to maximize the likelihood of the data, given the expected estimates of the unknown variables. By alternating the E and M steps, the model parameters are gradually improved, the likelihood of the training samples gradually increases, and the procedure ends at a local maximum.
Specifically, iterating to obtain the total variability subspace proceeds as follows:
Step 1: From the high-dimensional sufficient statistics, concatenate the mean vectors of the M Gaussian components (each of dimension D) into a Gaussian mean supervector, i.e. an M*D-dimensional vector, which constitutes F(x), an MD-dimensional vector. At the same time, construct N from the zeroth-order sufficient statistics: N is an MD x MD diagonal matrix, assembled with the posterior probabilities as the main diagonal elements. (A posterior probability is a probability revised after information about the outcome is obtained — for example, given that an event has occurred, the probability that it was caused by a particular factor.)
Step 2: Initialize the T space as an [MD, V]-dimensional matrix, where the dimension V is much smaller than MD; V is exactly the dimension of the first i-vector.
Step 3: With the T space fixed, iterate the following formula with the EM algorithm to estimate the hidden variable w from the zeroth-order and first-order sufficient statistics. After the iteration reaches a preset count (5-6 times), the T space is considered converged and is fixed:
w = (I + T^T Σ^{-1} N T)^{-1} T^T Σ^{-1} F
In this formula, w is the hidden variable and I is the identity matrix; Σ is the MD x MD covariance matrix of the UBM model, with diagonal elements Σ1 … Σm; F is the first-order part of the high-dimensional sufficient statistics; and N is the MD x MD diagonal matrix.
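A minimal EM sketch for T under assumed shapes (list of per-utterance (N, F) statistics, diagonal UBM covariances); the E-step is exactly the posterior formula of step S22, and the per-component M-step is the standard total-variability update, filled in here rather than quoted from the patent:

```python
import numpy as np

def train_T(stats, sigma, V=100, n_iter=5, seed=0):
    """EM training of the total variability matrix T.

    stats : list of (N, F) per utterance, N:(M,) zeroth-order counts,
            F:(M,D) centered first-order statistics
    sigma : (M,D) diagonal covariances of the UBM
    """
    M, D = stats[0][1].shape
    rng = np.random.default_rng(seed)
    T = rng.standard_normal((M * D, V)) * 0.01
    inv_sig = (1.0 / sigma).reshape(-1)                    # diagonal of Sigma^-1, (MD,)
    for _ in range(n_iter):                                # 5-6 passes to converge
        A = np.zeros((M, V, V))                            # per-component accumulators
        C = np.zeros((M * D, V))
        for N, F in stats:
            Nvec = np.repeat(N, D)                         # expand N onto the MD diagonal
            L = np.eye(V) + (T * (Nvec * inv_sig)[:, None]).T @ T
            cov = np.linalg.inv(L)                         # posterior covariance of w
            w = cov @ (T.T @ (inv_sig * F.reshape(-1)))    # posterior mean E[w], per S22
            A += N[:, None, None] * (cov + np.outer(w, w)) # occupancy-weighted E[w w^T]
            C += np.outer(F.reshape(-1), w)
        for c in range(M):                                 # M-step: T_c = C_c A_c^-1
            T[c * D:(c + 1) * D] = C[c * D:(c + 1) * D] @ np.linalg.inv(A[c])
    return T
```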
In one embodiment, as shown in Fig. 5, step S30 — projecting the training speech features onto the total variability subspace to obtain the first i-vector — specifically includes the following steps:
S31. Based on the training speech features and the preset UBM model, obtain the GMM-UBM model using mean-MAP adaptation.
The training speech features are the features that distinguish the speaker from other people; in this embodiment, Mel-Frequency Cepstral Coefficients (MFCC features) can be used as the training speech features.
Specifically, starting from the preset UBM model, maximum a posteriori estimation is used to adapt a GMM to the training speech features, updating the mean vector of each Gaussian component. This produces a GMM of M components, namely the GMM-UBM model. The mean vectors of the Gaussian components of the GMM-UBM model (each of dimension D) are concatenated as splicing units to form an M*D-dimensional Gaussian mean supervector.
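A sketch of mean-only MAP adaptation toward one speaker's frames, using the common relevance-factor recipe (a textbook formulation assumed here, with the sklearn stand-in UBM from above):

```python
import numpy as np

def map_adapt_means(frames, ubm, r=16.0):
    """Mean-only MAP adaptation of the UBM toward one speaker's frames;
    returns the adapted means concatenated into the GMM-UBM supervector."""
    post = ubm.predict_proba(frames)                        # (T, M) frame posteriors
    n = post.sum(axis=0)                                    # soft counts per component
    ex = (post.T @ frames) / np.maximum(n, 1e-10)[:, None]  # posterior mean per component
    alpha = (n / (n + r))[:, None]                          # data vs. prior mixing weight
    means = alpha * ex + (1 - alpha) * ubm.means_           # adapted component means
    return means.reshape(-1)                                # concatenate -> supervector s1
```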
S32. Use the formula s1 = m + Tw1 to project the training speech features onto the total variability subspace and obtain the first i-vector, where s1 is the mean supervector of the C*F-dimensional GMM-UBM model corresponding to the training speech features; m is the speaker-independent and channel-independent C*F-dimensional supervector; T is the total variability subspace, of dimension CF*N; and w1 is the first i-vector, of dimension N.
In this embodiment, s1 is the Gaussian mean supervector obtained in step S31; m is the speaker-independent and channel-independent supervector, spliced from the mean supervectors of the UBM model; and w1, a vector with a standard normal prior, is the first i-vector, of dimension N.
Further, T (the total variability subspace) in the formula is obtained by training on the high-dimensional sufficient statistics of the UBM model and iteratively updating them with the EM algorithm until the T space converges. Substituting the T space into the formula s1 = m + Tw1, and since s1, m and T are all known, w1 — the first i-vector — can be solved for: w1 = (s1 − m)/T.
In steps S31 and S32, the formula s1 = m + Tw1 projects the training speech features onto the total variability subspace to obtain the first i-vector. This first dimensionality reduction simplifies the complexity of the training speech features and makes the low-dimensional first i-vector convenient to process further or to use for speech recognition.
In one embodiment, as shown in Fig. 6, a speaker recognition method is provided. Taking its application to the recognition server in Fig. 1 as an example, the method includes the following steps:
S50. Obtain test voice data, the test voice data carrying a speaker identifier.
The test voice data is voice data, yet to be verified, that claims to come from the speaker corresponding to the carried identifier. The speaker identifier is a unique identifier of the speaker's identity, including but not limited to a user name, ID card number, or phone number.
Completing speech recognition requires two elements: the voice and the identity. In this embodiment, the voice is the test voice data and the identity is the speaker identifier; the recognition server then determines whether the identity claimed by the test voice data is the real corresponding identity.
S60. Process the test voice data with the i-vector extraction method to obtain the corresponding test i-vector.
The test i-vector is the fixed-length characterization vector obtained by projecting the test speech features onto the low-dimensional total variability subspace, used to verify identity.
In this step, the test i-vector corresponding to the test voice data is obtained by the same procedure as the registration i-vector obtained from the training speech features, which is not repeated here.
S70. Query the database based on the speaker identifier to obtain the registration i-vector corresponding to that identifier.
The database is the one in which each speaker's registration i-vector is recorded in association with the speaker identifier.
The registration i-vector is recorded in the database of the recognition server as the fixed-length characterization vector associated with the speaker ID to serve as the identity marker.
In this step, the recognition server can look up the corresponding registration i-vector in the database from the speaker identifier carried by the test voice data, so that the registration i-vector and the test i-vector can then be compared.
S80. Compute the similarity between the test i-vector and the registration i-vector using the cosine similarity algorithm, and detect from the similarity whether the test i-vector and the registration i-vector correspond to the same speaker.
Specifically, the similarity between the test i-vector and the registration i-vector can be determined by the following formula:
cos θ = (Σ_i A_i·B_i) / (√(Σ_i A_i²) · √(Σ_i B_i²))
where A_i and B_i are the components of vectors A and B respectively. From the formula, the similarity ranges from −1 to 1, where −1 indicates the two vectors point in opposite directions, 1 indicates they point in the same direction, and 0 indicates they are independent; values between −1 and 1 indicate intermediate similarity or dissimilarity. It can be understood that the closer the similarity is to 1, the closer the two vectors are. In this embodiment, a threshold on cos θ can be preset based on practical experience. If the similarity between the test i-vector and the registration i-vector exceeds the threshold, the two are considered similar — that is, the test voice data is determined to correspond to the speaker identifier in the database.
In this embodiment, the cosine similarity algorithm discriminates the similarity between the test i-vector and the registration i-vector simply and quickly, which facilitates fast confirmation of the recognition result.
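A direct numpy sketch of the scoring step; the threshold value here is a placeholder to be tuned from practical experience, per the text:

```python
import numpy as np

def cosine_score(w_test, w_enrolled, threshold=0.6):
    """Cosine similarity between test and registration i-vectors;
    accept as the same speaker when the score exceeds the preset threshold."""
    score = np.dot(w_test, w_enrolled) / (
        np.linalg.norm(w_test) * np.linalg.norm(w_enrolled))
    return score, score > threshold
```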
In the i-vector extraction method provided by the embodiments of the present invention, the training speech features are first projected onto the total variability subspace to obtain the first i-vector, and the first i-vector is then projected onto the total variability subspace a second time to obtain the registration i-vector. Projecting twice, that is, reducing the dimensionality twice, removes more noise from the training speech feature data and improves the purity of the extracted speaker features, while the smaller computation space after dimensionality reduction improves recognition efficiency and reduces recognition complexity.
Further, performing feature extraction on the training voice data yields a registration i-vector that represents the training voice data well, making the result more accurate when the trained registration i-vector is used for speech recognition. EM iteration provides a simple, stable iterative algorithm that computes the posterior density to obtain the total variability subspace; the total variability subspace projects the high-dimensional sufficient statistics of the preset UBM model down to a low dimension, and the dimensionality-reduced vectors are then used for speech recognition.
In the speaker recognition method provided by the embodiments of the present invention, processing the test voice data with the i-vector extraction method to obtain the corresponding test i-vector reduces the complexity of obtaining the test i-vector; at the same time, the cosine similarity algorithm discriminates the similarity between the test i-vector and the registration i-vector simply and quickly, facilitating fast confirmation of the recognition result.
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and does not constitute any limitation on the implementation of the embodiments of the present invention.
In one embodiment, an i-vector extraction device is provided, corresponding one-to-one to the i-vector extraction method in the above embodiments. As shown in Fig. 7, the i-vector extraction device includes a voice data acquisition module 10, a variability space training module 20, a variability space projection module 30 and an i-vector acquisition module 40. The functional modules are described in detail as follows:
Voice data acquisition module 10: obtains the training voice data of the speaker and extracts the corresponding training speech features.
Variability space training module 20: trains the total variability subspace corresponding to the preset UBM model, based on the preset UBM model.
Variability space projection module 30: projects the training speech features onto the total variability subspace to obtain the first i-vector.
I-vector acquisition module 40: projects the first i-vector onto the total variability subspace to obtain the registration i-vector corresponding to the speaker.
Preferably, the voice data acquisition module 10 includes a voice data unit 11, a data power spectrum unit 12, a mel power spectrum unit 13 and an MFCC feature unit 14.
Voice data unit 11: preprocesses the training voice data to obtain the preprocessed voice data.
Data power spectrum unit 12: applies the Fast Fourier Transform to the preprocessed voice data to obtain the spectrum of the training voice data, and obtains the power spectrum of the training voice data from the spectrum.
Mel power spectrum unit 13: processes the power spectrum of the training voice data with the mel-scale filter bank to obtain the mel power spectrum of the training voice data.
MFCC feature unit 14: performs cepstral analysis on the mel power spectrum to obtain the MFCC features of the training voice data.
The variability space training module 20 includes a high-dimensional statistics unit 21 and a variability subspace unit 22.
High-dimensional statistics unit 21: obtains the high-dimensional sufficient statistics of the preset UBM model.
Variability subspace unit 22: iterates on the high-dimensional sufficient statistics with the EM algorithm to obtain the corresponding total variability subspace.
The variability space projection module 30 includes a GMM-UBM model unit 31 and a first vector unit 32.
GMM-UBM model unit 31: obtains the GMM-UBM model using mean-MAP adaptation, based on the training speech features and the preset UBM model.
First vector unit 32: uses the formula s1 = m + Tw1 to obtain the first i-vector, where s1 is the mean supervector corresponding to the C*F-dimensional GMM-UBM model; m is the speaker-independent and channel-independent C*F-dimensional supervector; T is the total variability subspace, of dimension CF*N; and w1 is the first i-vector, of dimension N.
Preferably, the i-vector acquisition module 40 includes a registration vector unit 41.
Registration vector unit 41: uses the formula s2 = m + Tw2 to project the first i-vector onto the total variability subspace and obtain the registration i-vector, where s2 is the D*G-dimensional mean supervector corresponding to the registration i-vector; m is the speaker-independent and channel-independent D*G-dimensional supervector; T is the total variability subspace, of dimension DG*M; and w2 is the registration i-vector, of dimension M.
For the specific limitations of the i-vector extraction device, refer to the limitations of the i-vector extraction method above, which are not repeated here. Each module in the above i-vector extraction device can be realized in whole or in part by software, hardware or a combination thereof. The modules can be embedded in or independent of the processor of the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can call and execute the operations corresponding to each module.
In one embodiment, a speaker recognition device is provided, corresponding one-to-one to the speaker recognition method in the above embodiments. As shown in Fig. 8, the speaker recognition device includes a test data acquisition module 50, a test vector acquisition module 60, a registration vector acquisition module 70 and a speaker determination module 80. The functional modules are described in detail as follows:
Test data acquisition module 50: obtains the test voice data, the test voice data carrying a speaker identifier.
Test vector acquisition module 60: processes the test voice data with the i-vector extraction method to obtain the corresponding test i-vector.
Registration vector acquisition module 70: queries the database based on the speaker identifier to obtain the registration i-vector corresponding to that identifier.
Speaker determination module 80: computes the similarity between the test i-vector and the registration i-vector using the cosine similarity algorithm, and detects from the similarity whether they correspond to the same speaker.
For the specific limitations of the speaker recognition device, refer to the limitations of the speaker recognition method above, which are not repeated here. Each module in the above speaker recognition device can be realized in whole or in part by software, hardware or a combination thereof. The modules can be embedded in or independent of the processor of the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can call and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, which can be a server whose internal structure may be as shown in Fig. 9. The computer device includes a processor, a memory, a network interface and a database connected through a system bus. The processor of the computer device provides computing and control capability. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database; the internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device stores the data related to the i-vector extraction method or the speaker recognition method. The network interface of the computer device communicates with external terminals over a network connection. When executed by the processor, the computer program realizes the i-vector extraction method or the speaker recognition method.
In one embodiment, a kind of computer equipment is provided, including memory, processor and storage are on a memory and can The computer program run on a processor, processor perform the steps of the instruction for obtaining speaker when executing computer program Practice voice data, and extracts the corresponding trained phonetic feature of trained voice data;It trains and presets based on default UBM model The corresponding entire change subspace of UBM model;Training phonetic feature is projected on entire change subspace, the first i- is obtained Vector vector;By the first i-vector vector projection on entire change subspace, registration i- corresponding with speaker is obtained Vector vector.
In one embodiment, in extracting the training voice features corresponding to the training voice data, the processor performs the following steps when executing the computer program: preprocessing the training voice data to obtain preprocessed voice data; applying a fast Fourier transform (FFT) to the preprocessed voice data to obtain the spectrum of the training voice data, and obtaining the power spectrum of the training voice data from the spectrum; processing the power spectrum of the training voice data with a mel-scale filter bank to obtain the mel power spectrum of the training voice data; and performing cepstral analysis on the mel power spectrum to obtain the MFCC features of the training voice data.
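The preprocessing → FFT → power spectrum → mel filter bank → cepstral analysis chain described above can be sketched in plain NumPy/SciPy as follows. The frame length, frame step, filter count, pre-emphasis coefficient and number of retained coefficients are conventional defaults assumed for illustration, and the signal is assumed to be at least one frame long.

import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr=16000, frame_len=0.025, frame_step=0.01,
         n_fft=512, n_filters=26, n_ceps=13, pre_emph=0.97):
    # Preprocessing: pre-emphasis, framing and Hamming windowing.
    emphasized = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])
    flen, fstep = int(round(frame_len * sr)), int(round(frame_step * sr))
    n_frames = 1 + max(0, (len(emphasized) - flen) // fstep)
    idx = np.arange(flen)[None, :] + fstep * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(flen)
    # FFT of each frame, then the power spectrum.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel-scale filter bank applied to the power spectrum.
    high_mel = 2595 * np.log10(1 + (sr / 2) / 700)
    hz_pts = 700 * (10 ** (np.linspace(0, high_mel, n_filters + 2) / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    mel_power = power @ fbank.T
    # Cepstral analysis: log compression followed by a DCT.
    log_mel = np.log(np.maximum(mel_power, 1e-10))
    return dct(log_mel, type=2, axis=1, norm="ortho")[:, :n_ceps]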
In one embodiment, in training, based on the preset UBM, the total variability subspace corresponding to the preset UBM, the processor performs the following steps when executing the computer program: obtaining high-dimensional sufficient statistics of the preset UBM; and iterating over the high-dimensional sufficient statistics using the expectation-maximization (EM) algorithm to obtain the corresponding total variability subspace.
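One EM iteration of this training step, following the standard total-variability recipe (Dehak et al.), might look like the sketch below; the layout of the sufficient statistics and all variable names are assumptions for illustration. Here N holds the per-utterance zeroth-order (occupancy) statistics and F the first-order statistics centered on the UBM means.

import numpy as np

def em_update_T(N, F, T, Sigma):
    # N:     (S, C)    zeroth-order statistics, S utterances, C components
    # F:     (S, C, D) first-order statistics centered on the UBM means
    # T:     (C*D, R)  current total variability matrix
    # Sigma: (C, D)    diagonal UBM covariances
    S, C = N.shape
    D, R = F.shape[2], T.shape[1]
    T3 = T.reshape(C, D, R).copy()
    inv_sig = 1.0 / Sigma
    A = np.zeros((C, R, R))       # per-component M-step accumulators
    Cacc = np.zeros((C, D, R))
    for s in range(S):
        # E-step: posterior precision and mean of the latent factor w.
        L = np.eye(R)
        b = np.zeros(R)
        for c in range(C):
            TS = T3[c] * inv_sig[c][:, None]   # Sigma_c^{-1} T_c
            L += N[s, c] * (T3[c].T @ TS)
            b += TS.T @ F[s, c]
        Linv = np.linalg.inv(L)
        Ew = Linv @ b                          # E[w_s]
        EwwT = Linv + np.outer(Ew, Ew)         # E[w_s w_s^T]
        for c in range(C):
            A[c] += N[s, c] * EwwT
            Cacc[c] += np.outer(F[s, c], Ew)
    # M-step: solve T_c A_c = C_c for each component's block of T.
    for c in range(C):
        T3[c] = np.linalg.solve(A[c].T, Cacc[c].T).T
    return T3.reshape(C * D, R)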
In one embodiment, in projecting the training voice features onto the total variability subspace to obtain the first i-vector, the processor performs the following steps when executing the computer program: obtaining a GMM-UBM model from the training voice features and the preset UBM using mean-MAP adaptation; and projecting the training voice features onto the total variability subspace using the formula s1 = m + Tw1 to obtain the first i-vector, where s1 is the C*F-dimensional mean supervector of the GMM-UBM model corresponding to the training voice features; m is the C*F-dimensional supervector that is independent of both speaker and channel; T is the total variability subspace, of dimension CF*N; and w1 is the first i-vector, of dimension N.
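For context, in the standard total-variability formulation the i-vector w1 in s1 = m + Tw1 is obtained not by inverting T but as the posterior mean of the latent factor given the utterance's Baum-Welch statistics. A sketch of the usual estimator, with \Sigma the UBM covariance supermatrix and N(u) and \tilde{F}(u) the zeroth-order and centered first-order statistics of utterance u, is:

\hat{w}(u) = \left( I + T^{\top} \Sigma^{-1} N(u)\, T \right)^{-1} T^{\top} \Sigma^{-1} \tilde{F}(u)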
In one embodiment, in projecting the first i-vector onto the total variability subspace to obtain the registration i-vector corresponding to the speaker, the processor performs the following step when executing the computer program:
projecting the first i-vector onto the total variability subspace using the formula s2 = m + Tw2 to obtain the registration i-vector, where s2 is the D*G-dimensional mean supervector corresponding to the registration i-vector; m is the D*G-dimensional supervector that is independent of both speaker and channel; T is the total variability subspace, of dimension DG*M; and w2 is the registration i-vector, of dimension M.
In one embodiment, a computer device is provided, including a memory, a processor and a computer program stored in the memory and executable on the processor. When executing the computer program, the processor performs the following steps: acquiring test speech data, where the test speech data carries a speaker identifier; obtaining a corresponding test i-vector based on the test speech data; querying a database based on the speaker identifier to obtain a registration i-vector corresponding to the speaker identifier; and computing the similarity between the test i-vector and the registration i-vector using a cosine similarity algorithm, and determining, based on the similarity, whether the test i-vector and the registration i-vector correspond to the same speaker.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program performs the following steps: acquiring training voice data of a speaker and extracting training voice features corresponding to the training voice data; training, based on a preset UBM, a total variability subspace corresponding to the preset UBM; projecting the training voice features onto the total variability subspace to obtain a first i-vector; and projecting the first i-vector onto the total variability subspace to obtain a registration i-vector corresponding to the speaker.
In one embodiment, in extracting the training voice features corresponding to the training voice data, the computer program, when executed by a processor, performs the following steps: preprocessing the training voice data to obtain preprocessed voice data; applying a fast Fourier transform to the preprocessed voice data to obtain the spectrum of the training voice data, and obtaining the power spectrum of the training voice data from the spectrum; processing the power spectrum of the training voice data with a mel-scale filter bank to obtain the mel power spectrum of the training voice data; and performing cepstral analysis on the mel power spectrum to obtain the MFCC features of the training voice data.
In one embodiment, in training, based on the preset UBM, the total variability subspace corresponding to the preset UBM, the computer program, when executed by a processor, performs the following steps: obtaining high-dimensional sufficient statistics of the preset UBM; and iterating over the high-dimensional sufficient statistics using the expectation-maximization algorithm to obtain the corresponding total variability subspace.
In one embodiment, in projecting the training voice features onto the total variability subspace to obtain the first i-vector, the computer program, when executed by a processor, performs the following steps: obtaining a GMM-UBM model from the training voice features and the preset UBM using mean-MAP adaptation; and projecting the training voice features onto the total variability subspace using the formula s1 = m + Tw1 to obtain the first i-vector, where s1 is the C*F-dimensional mean supervector of the GMM-UBM model corresponding to the training voice features; m is the C*F-dimensional supervector that is independent of both speaker and channel; T is the total variability subspace, of dimension CF*N; and w1 is the first i-vector, of dimension N.
In one embodiment, in projecting the first i-vector onto the total variability subspace to obtain the registration i-vector corresponding to the speaker, the computer program, when executed by a processor, performs the following step:
projecting the first i-vector onto the total variability subspace using the formula s2 = m + Tw2 to obtain the registration i-vector, where s2 is the D*G-dimensional mean supervector corresponding to the registration i-vector; m is the D*G-dimensional supervector that is independent of both speaker and channel; T is the total variability subspace, of dimension DG*M; and w2 is the registration i-vector, of dimension M.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program performs the following steps: acquiring test speech data, where the test speech data carries a speaker identifier; obtaining a corresponding test i-vector based on the test speech data; querying a database based on the speaker identifier to obtain a registration i-vector corresponding to the speaker identifier; and computing the similarity between the test i-vector and the registration i-vector using a cosine similarity algorithm, and determining, based on the similarity, whether the test i-vector and the registration i-vector correspond to the same speaker.
Those of ordinary skill in the art will appreciate that all or part of the processes in the methods of the above embodiments may be implemented by instructing the relevant hardware through a computer program, which may be stored in a non-volatile computer-readable storage medium. When executed, the computer program may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database or another medium used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or an external cache. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the division of the above functional units and modules is illustrated by example. In practical applications, the above functions may be allocated to different functional units and modules as needed; that is, the internal structure of the device may be divided into different functional units or modules to complete all or part of the functions described above.
The above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent replacements of some of the technical features; such modifications or replacements do not depart the essence of the corresponding technical solutions from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all fall within the protection scope of the present invention.

Claims (10)

1. An i-vector extraction method, comprising:
acquiring training voice data of a speaker, and extracting training voice features corresponding to the training voice data;
training, based on a preset UBM, a total variability subspace corresponding to the preset UBM;
projecting the training voice features onto the total variability subspace to obtain a first i-vector;
projecting the first i-vector onto the total variability subspace to obtain a registration i-vector corresponding to the speaker.
2. The i-vector extraction method according to claim 1, wherein extracting the training voice features corresponding to the training voice data comprises:
preprocessing the training voice data to obtain preprocessed voice data;
applying a fast Fourier transform to the preprocessed voice data to obtain a spectrum of the training voice data, and obtaining a power spectrum of the training voice data from the spectrum;
processing the power spectrum of the training voice data with a mel-scale filter bank to obtain a mel power spectrum of the training voice data;
performing cepstral analysis on the mel power spectrum to obtain MFCC features of the training voice data.
3. The i-vector extraction method according to claim 1, wherein training, based on the preset UBM, the total variability subspace corresponding to the preset UBM comprises:
obtaining high-dimensional sufficient statistics of the preset UBM;
iterating over the high-dimensional sufficient statistics using an expectation-maximization algorithm to obtain the corresponding total variability subspace.
4. The i-vector extraction method according to claim 1, wherein projecting the training voice features onto the total variability subspace to obtain the first i-vector comprises:
obtaining a GMM-UBM model from the training voice features and the preset UBM using mean-MAP adaptation;
projecting the training voice features onto the total variability subspace using the formula s1 = m + Tw1 to obtain the first i-vector, wherein s1 is the C*F-dimensional mean supervector of the GMM-UBM model corresponding to the training voice features; m is a C*F-dimensional supervector independent of both speaker and channel; T is the total variability subspace, of dimension CF*N; and w1 is the first i-vector, of dimension N.
5. The i-vector extraction method according to claim 1, wherein projecting the first i-vector onto the total variability subspace to obtain the registration i-vector corresponding to the speaker comprises:
projecting the first i-vector onto the total variability subspace using the formula s2 = m + Tw2 to obtain the registration i-vector, wherein s2 is the D*G-dimensional mean supervector corresponding to the registration i-vector; m is a D*G-dimensional supervector independent of both speaker and channel; T is the total variability subspace, of dimension DG*M; and w2 is the registration i-vector, of dimension M.
6. A speaker recognition method, comprising:
acquiring test speech data, wherein the test speech data carries a speaker identifier;
processing the test speech data using the i-vector extraction method according to any one of claims 1 to 5 to obtain a corresponding test i-vector;
querying a database based on the speaker identifier to obtain a registration i-vector corresponding to the speaker identifier;
computing the similarity between the test i-vector and the registration i-vector using a cosine similarity algorithm, and determining, based on the similarity, whether the test i-vector and the registration i-vector correspond to the same speaker.
7. An i-vector extraction device, comprising:
a training data acquisition module, configured to acquire training voice data of a speaker and extract training voice features corresponding to the training voice data;
a variability space training module, configured to train, based on a preset UBM, a total variability subspace corresponding to the preset UBM;
a variability space projection module, configured to project the training voice features onto the total variability subspace to obtain a first i-vector;
an i-vector acquisition module, configured to project the first i-vector onto the total variability subspace to obtain a registration i-vector corresponding to the speaker.
8. A speaker recognition device, comprising:
a test data acquisition module, configured to acquire test speech data, wherein the test speech data carries a speaker identifier;
a test vector acquisition module, configured to process the test speech data using the i-vector extraction method according to any one of claims 1 to 5 to obtain a corresponding test i-vector;
a registration vector acquisition module, configured to query a database based on the speaker identifier to obtain a registration i-vector corresponding to the speaker identifier;
a speaker determination module, configured to compute the similarity between the test i-vector and the registration i-vector using a cosine similarity algorithm, and to determine, based on the similarity, whether the test i-vector and the registration i-vector correspond to the same speaker.
9. A computer device, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the i-vector extraction method according to any one of claims 1 to 5, or of the speaker recognition method according to claim 6.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the i-vector extraction method according to any one of claims 1 to 5, or of the speaker recognition method according to claim 6.
CN201810574010.4A 2018-06-06 2018-06-06 Method for extracting i-vector, method, device, equipment and medium for speaker recognition Active CN109065022B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810574010.4A CN109065022B (en) 2018-06-06 2018-06-06 Method for extracting i-vector, method, device, equipment and medium for speaker recognition
PCT/CN2018/092589 WO2019232826A1 (en) 2018-06-06 2018-06-25 I-vector extraction method, speaker recognition method and apparatus, device, and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810574010.4A CN109065022B (en) 2018-06-06 2018-06-06 Method for extracting i-vector, method, device, equipment and medium for speaker recognition

Publications (2)

Publication Number Publication Date
CN109065022A true CN109065022A (en) 2018-12-21
CN109065022B CN109065022B (en) 2022-08-09

Family

ID=64820489

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810574010.4A Active CN109065022B (en) 2018-06-06 2018-06-06 Method for extracting i-vector, method, device, equipment and medium for speaker recognition

Country Status (2)

Country Link
CN (1) CN109065022B (en)
WO (1) WO2019232826A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111700718B (en) * 2020-07-13 2023-06-27 京东科技信息技术有限公司 Method and device for recognizing holding gesture, artificial limb and readable storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104240706B (en) * 2014-09-12 2017-08-15 浙江大学 It is a kind of that the method for distinguishing speek person that similarity corrects score is matched based on GMM Token
CN105933323B (en) * 2016-06-01 2019-05-31 百度在线网络技术(北京)有限公司 Voiceprint registration, authentication method and device
DE102016115018B4 (en) * 2016-08-12 2018-10-11 Imra Europe S.A.S. Audio signature for voice command observation
CN107240397A (en) * 2017-08-14 2017-10-10 广东工业大学 A kind of smart lock and its audio recognition method and system based on Application on Voiceprint Recognition

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102737633A (en) * 2012-06-21 2012-10-17 北京华信恒达软件技术有限公司 Method and device for recognizing speaker based on tensor subspace analysis
US20150149165A1 (en) * 2013-11-27 2015-05-28 International Business Machines Corporation Speaker Adaptation of Neural Network Acoustic Models Using I-Vectors
CN104167208A (en) * 2014-08-08 2014-11-26 中国科学院深圳先进技术研究院 Speaker recognition method and device
CN105810199A (en) * 2014-12-30 2016-07-27 中国科学院深圳先进技术研究院 Identity verification method and device for speakers
WO2018053531A1 (en) * 2016-09-19 2018-03-22 Pindrop Security, Inc. Dimensionality reduction of baum-welch statistics for speaker recognition
CN106971713A (en) * 2017-01-18 2017-07-21 清华大学 Speaker's labeling method and system based on density peaks cluster and variation Bayes
CN107146601A (en) * 2017-04-07 2017-09-08 南京邮电大学 A kind of rear end i vector Enhancement Methods for Speaker Recognition System
CN107369440A (en) * 2017-08-02 2017-11-21 北京灵伴未来科技有限公司 The training method and device of a kind of Speaker Identification model for phrase sound
CN107633845A (en) * 2017-09-11 2018-01-26 清华大学 A kind of duscriminant local message distance keeps the method for identifying speaker of mapping

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XING Yujuan et al.: "Research on an Improved i-vector Speaker Recognition Algorithm", Science Technology and Engineering *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113056784A (en) * 2019-01-29 2021-06-29 深圳市欢太科技有限公司 Voice information processing method and device, storage medium and electronic equipment
WO2020098828A3 (en) * 2019-10-31 2020-09-03 Alipay (Hangzhou) Information Technology Co., Ltd. System and method for personalized speaker verification
US10997980B2 (en) 2019-10-31 2021-05-04 Alipay (Hangzhou) Information Technology Co., Ltd. System and method for determining voice characteristics
US11031018B2 (en) 2019-10-31 2021-06-08 Alipay (Hangzhou) Information Technology Co., Ltd. System and method for personalized speaker verification
US11244689B2 (en) 2019-10-31 2022-02-08 Alipay (Hangzhou) Information Technology Co., Ltd. System and method for determining voice characteristics
CN110827834A (en) * 2019-11-11 2020-02-21 广州国音智能科技有限公司 Voiceprint registration method, system and computer readable storage medium
CN110827834B (en) * 2019-11-11 2022-07-12 广州国音智能科技有限公司 Voiceprint registration method, system and computer readable storage medium
CN111161713A (en) * 2019-12-20 2020-05-15 北京皮尔布莱尼软件有限公司 Voice gender identification method and device and computing equipment
CN111508505A (en) * 2020-04-28 2020-08-07 讯飞智元信息科技有限公司 Speaker identification method, device, equipment and storage medium
CN111508505B (en) * 2020-04-28 2023-11-03 讯飞智元信息科技有限公司 Speaker recognition method, device, equipment and storage medium
CN114420142A (en) * 2022-03-28 2022-04-29 北京沃丰时代数据科技有限公司 Voice conversion method, device, equipment and storage medium

Also Published As

Publication number Publication date
WO2019232826A1 (en) 2019-12-12
CN109065022B (en) 2022-08-09

Similar Documents

Publication Publication Date Title
CN109065022A (en) I-vector vector extracting method, method for distinguishing speek person, device, equipment and medium
CN108922544A (en) General vector training method, voice clustering method, device, equipment and medium
US9940935B2 (en) Method and device for voiceprint recognition
CN107610707B (en) A kind of method for recognizing sound-groove and device
Li et al. An overview of noise-robust automatic speech recognition
Krueger et al. Model-based feature enhancement for reverberant speech recognition
CN110232932B (en) Speaker confirmation method, device, equipment and medium based on residual delay network
CN109065028A (en) Speaker clustering method, device, computer equipment and storage medium
CN107886943A (en) Voiceprint recognition method and device
CN105096955B (en) A kind of speaker&#39;s method for quickly identifying and system based on model growth cluster
WO2019200744A1 (en) Self-updated anti-fraud method and apparatus, computer device and storage medium
CN108922543A (en) Model library method for building up, audio recognition method, device, equipment and medium
WO2014114116A1 (en) Method and system for voiceprint recognition
CN110047504B (en) Speaker identification method under identity vector x-vector linear transformation
CN103794207A (en) Dual-mode voice identity recognition method
CN113223536B (en) Voiceprint recognition method and device and terminal equipment
CN108154371A (en) Electronic device, the method for authentication and storage medium
CN104538035A (en) Speaker recognition method and system based on Fisher supervectors
Abdelaziz et al. Twin-HMM-based audio-visual speech enhancement
Nidhyananthan et al. Language and text-independent speaker identification system using GMM
Kudashev et al. A Speaker Recognition System for the SITW Challenge.
CN112992155A (en) Far-field voice speaker recognition method and device based on residual error neural network
Herrera-Camacho et al. Design and testing of a corpus for forensic speaker recognition using MFCC, GMM and MLE
Zi et al. Joint filter combination-based central difference feature extraction and attention-enhanced Dense-Res2Block network for short-utterance speaker recognition
Sehr et al. A novel approach for matched reverberant training of HMMs using data pairs.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant