CN109065022A - I-vector extraction method, speaker recognition method, device, equipment and medium - Google Patents
I-vector extraction method, speaker recognition method, device, equipment and medium
- Publication number
- CN109065022A CN109065022A CN201810574010.4A CN201810574010A CN109065022A CN 109065022 A CN109065022 A CN 109065022A CN 201810574010 A CN201810574010 A CN 201810574010A CN 109065022 A CN109065022 A CN 109065022A
- Authority
- CN
- China
- Prior art keywords
- vector
- speaker
- voice data
- training
- trained
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0638—Interactive procedures
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
The invention discloses an i-vector extraction method, a speaker recognition method, and corresponding devices, equipment and media. The i-vector extraction method includes: obtaining training voice data of a speaker and extracting the corresponding training voice features; training, based on a preset UBM model, a total variability subspace corresponding to the preset UBM model; projecting the training voice features onto the total variability subspace to obtain a first i-vector; and projecting the first i-vector onto the total variability subspace to obtain a registration i-vector corresponding to the speaker. Because the training voice features are projected twice, i.e. their dimensionality is reduced twice, more noise features are removed, which improves the purity of the extracted speaker voice features; the smaller computation space after dimensionality reduction also improves the recognition efficiency of speech recognition.
Description
Technical field
The present invention relates to the field of speech recognition, and more particularly to an i-vector extraction method, a speaker recognition method, a device, equipment and a medium.
Background technique
Speaker recognition, also known as voiceprint recognition, is a biometric technique that identifies a speaker's identity from the speaker-specific information contained in the voice signal. In recent years, the introduction of i-vector (identity vector) modeling based on factor analysis has markedly improved the performance of speaker recognition systems. In factor analysis of speaker voice, the channel subspace usually also contains speaker information. In the i-vector framework, a single low-dimensional total variability space represents both the speaker subspace and the channel subspace; projecting the speaker's voice onto this space through dimensionality reduction yields a fixed-length characterization vector (the i-vector). However, the i-vector obtained by existing i-vector modeling still contains many interfering factors, which increases the complexity of using the i-vector for speaker recognition.
Summary of the invention
Based on this, in view of the above technical problems, it is necessary to provide an i-vector extraction method, device, computer equipment and storage medium that can remove more interfering factors.
An i-vector extraction method, comprising:
obtaining training voice data of a speaker, and extracting the training voice features corresponding to the training voice data;
training, based on a preset UBM model, a total variability subspace corresponding to the preset UBM model;
projecting the training voice features onto the total variability subspace to obtain a first i-vector;
projecting the first i-vector onto the total variability subspace to obtain a registration i-vector corresponding to the speaker.
An i-vector extraction device, comprising:
a voice data obtaining module, configured to obtain training voice data of a speaker and extract the training voice features corresponding to the training voice data;
a variability space training module, configured to train, based on a preset UBM model, a total variability subspace corresponding to the preset UBM model;
a variability space projection module, configured to project the training voice features onto the total variability subspace to obtain a first i-vector;
an i-vector obtaining module, configured to project the first i-vector onto the total variability subspace to obtain a registration i-vector corresponding to the speaker.
Computer equipment, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the i-vector extraction method.
A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the i-vector extraction method.
This embodiment also provides a speaker recognition method, comprising:
obtaining test voice data, the test voice data carrying a speaker identifier;
obtaining a corresponding test i-vector based on the test voice data;
querying a database based on the speaker identifier to obtain the registration i-vector corresponding to the speaker identifier;
obtaining the similarity between the test i-vector and the registration i-vector using a cosine similarity algorithm, and detecting, according to the similarity, whether the test i-vector and the registration i-vector correspond to the same speaker.
A speaker recognition device, comprising:
a test data obtaining module, configured to obtain test voice data, the test voice data carrying a speaker identifier;
a test vector obtaining module, configured to process the test voice data using the i-vector extraction method to obtain a corresponding test i-vector;
a registration vector obtaining module, configured to query a database based on the speaker identifier to obtain the registration i-vector corresponding to the speaker identifier;
a speaker determination module, configured to obtain the similarity between the test i-vector and the registration i-vector using a cosine similarity algorithm, and to detect, according to the similarity, whether the test i-vector and the registration i-vector correspond to the same speaker.
Computer equipment, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the speaker recognition method.
A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the speaker recognition method.
In the i-vector extraction method, speaker recognition method, device, equipment and medium provided by the embodiments of the present invention, the training voice features are first projected onto the total variability subspace to obtain a first i-vector, and the first i-vector is then projected onto the total variability subspace a second time to obtain the registration i-vector. After two projections, i.e. two rounds of dimensionality reduction, the training voice feature data sheds more noise features, which improves the purity of the extracted speaker voice features; at the same time, the smaller computation space after dimensionality reduction improves recognition efficiency and reduces recognition complexity.
Detailed description of the invention
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; those of ordinary skill in the art may obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of the application environment of the i-vector extraction method in an embodiment of the invention;
Fig. 2 is a flowchart of the i-vector extraction method in an embodiment of the invention;
Fig. 3 is another flowchart of the i-vector extraction method in an embodiment of the invention;
Fig. 4 is another flowchart of the i-vector extraction method in an embodiment of the invention;
Fig. 5 is another flowchart of the i-vector extraction method in an embodiment of the invention;
Fig. 6 is a flowchart of the speaker recognition method in an embodiment of the invention;
Fig. 7 is a functional block diagram of the i-vector extraction device in an embodiment of the invention;
Fig. 8 is a functional block diagram of the speaker recognition device in an embodiment of the invention;
Fig. 9 is a schematic diagram of the computer equipment in an embodiment of the invention.
Specific embodiment
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
The i-vector extraction method provided by the embodiments of the present invention can be applied in the application environment shown in Fig. 1, in which computer equipment communicates with a recognition server through a network. The computer equipment includes, but is not limited to, personal computers, laptops, smartphones, tablet computers and portable wearable devices. The recognition server can be implemented as an independent server or as a server cluster composed of multiple servers.
In one embodiment, as shown in Fig. 2, an i-vector extraction method is provided. Taking the method applied to the recognition server in Fig. 1 as an example, it includes the following steps:
S10. Obtain training voice data of a speaker, and extract the training voice features corresponding to the training voice data.
The training voice data of a speaker is the original voice data provided by that speaker. The training voice features are voice features that distinguish the speaker from others; in this embodiment, Mel-Frequency Cepstral Coefficients (hereinafter MFCC features) can be used as the training voice features.
Studies of human hearing have found that the ear behaves like a filter bank that only attends to certain specific frequency components (human hearing is nonlinear with respect to frequency); in other words, the ear only receives signals in a limited range of sound frequencies. These filters, however, are not uniformly distributed along the frequency axis: there are many densely spaced filters in the low-frequency region, while in the high-frequency region the filters become fewer and sparsely distributed. A mel-scale filter bank has high resolution in the low-frequency part, which is consistent with the auditory characteristics of the human ear; this is the physical meaning of the mel scale.
S20. Train, based on a preset UBM model, a total variability subspace corresponding to the preset UBM model.
The preset UBM (Universal Background Model) is a Gaussian Mixture Model (GMM) that characterizes the voice-feature distribution of a large number of non-specific speakers. The UBM is usually trained with a large amount of voice data that is speaker-independent and channel-independent, so the UBM is generally regarded as a speaker-independent model: it only fits the voice-feature distribution of people in general and does not represent any specific speaker. The UBM is preset in the recognition server because, in the voiceprint registration stage of the voiceprint recognition process, the voice data available for training a specific speaker is usually very limited; when a GMM is used to model the speaker's voice features, the training data of the specific speaker generally cannot cover the feature space of the GMM. Therefore, the parameters of the UBM can be adjusted according to the features of the training voice to characterize the individual information of the specific speaker, and the features not covered by the training voice can be approximated by similar feature distributions in the UBM. This approach alleviates the degradation of system performance caused by insufficient training voice.
The total variability subspace, also called the T space (Total Variability Space), is a single projection matrix of global variability that is set directly; the T space contains all possible information of the speaker in the voice data, without separating the speaker space and the channel space. The T space can project a high-dimensional sufficient statistic (supervector) to a low-dimensional i-vector that characterizes the speaker, thus achieving dimensionality reduction. The training process of the T space is: compute the T space from the preset UBM model using factor analysis and the EM (Expectation Maximization) algorithm, iterating until convergence.
In this step, the total variability subspace obtained based on the preset UBM model does not distinguish between the speaker space and the channel space: the information of both is merged into one space, which reduces computational complexity and facilitates the subsequent extraction of i-vectors based on the total variability subspace.
S30. Project the training voice features onto the total variability subspace to obtain a first i-vector.
The first i-vector is the fixed-length characterization vector obtained by projecting the training voice features onto the low-dimensional total variability subspace.
Specifically, in this step the formula s1 = m + T·w1 is used, so that the high-dimensional training voice features are projected onto the total variability subspace to form a low-dimensional first i-vector. This reduces the dimensionality of the projected training voice features and removes more noise, which facilitates identifying the speaker based on the first i-vector.
S40. Project the first i-vector onto the total variability subspace to obtain a registration i-vector corresponding to the speaker.
The total variability subspace here is the one obtained in step S20; it does not separate the speaker space and the channel space, but directly sets a single T space (Total Variability Space) of global variability containing all possible information in the voice data.
The registration i-vector is the fixed-length characterization vector obtained by projecting the first i-vector onto the low-dimensional total variability subspace; it is recorded in a database of the recognition server and associated with the speaker ID as an identity marker.
In a specific embodiment, step S40, i.e. projecting the first i-vector onto the total variability subspace to obtain the registration i-vector, specifically includes the following step:
S41. Use the formula s2 = m + T·w2 to project the first i-vector onto the total variability subspace and obtain the registration i-vector, where s2 is the D*G-dimensional mean supervector corresponding to the registration i-vector; m is the speaker-independent and channel-independent D*G-dimensional supervector; T is the total variability subspace, with dimension DG*M; and w2 is the registration i-vector, with dimension M.
In this embodiment, s2 can be the Gaussian mean supervector of the first i-vector obtained in step S30; m is the speaker-independent and channel-independent D*G-dimensional supervector, spliced from the mean supervector of the UBM model; w2 is a random vector obeying the standard normal distribution, namely the registration i-vector, whose dimension is M.
Further, T (the total variability subspace) in the formula is obtained as follows: the high-dimensional sufficient statistics of the UBM model are trained, and these statistics are then updated iteratively by the EM algorithm until the T space converges. Substituting the T space into the formula s2 = m + T·w2, since s2, m and T are all known, w2, i.e. the registration i-vector, can be obtained as w2 = (s2 − m)/T.
In the i-vector extraction method provided in this embodiment, the training voice features are first projected onto the total variability subspace to obtain a first i-vector, and the first i-vector is then projected onto the total variability subspace a second time to obtain the registration i-vector. After two projections, i.e. two rounds of dimensionality reduction, the training voice feature data sheds more noise features, which improves the purity of the extracted speaker voice features; at the same time, the smaller computation space after dimensionality reduction improves recognition efficiency. The speaker recognition method provided by this embodiment performs recognition with the i-vector extraction method, which reduces recognition complexity.
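The two projections described above can be sketched as follows. This is a minimal illustration, assuming the T space, the UBM mean supervector m and the speaker's mean supervectors are already available as NumPy arrays; it solves the relations s1 = m + T·w1 and s2 = m + T·w2 in the least-squares sense, as a stand-in for the (s − m)/T notation used here, and is not the training code of the patent itself.

```python
import numpy as np

def project_to_t_space(s, m, T):
    """Solve s = m + T @ w for w in the least-squares sense (one 'projection')."""
    w, *_ = np.linalg.lstsq(T, s - m, rcond=None)
    return w

# Hypothetical dimensions: supervectors of 2048 dims, T space of 400 dims.
rng = np.random.default_rng(0)
T  = rng.standard_normal((2048, 400))   # total variability subspace (already trained)
m  = rng.standard_normal(2048)          # UBM mean supervector
s1 = rng.standard_normal(2048)          # mean supervector of the training voice features
w1 = project_to_t_space(s1, m, T)       # first i-vector (step S30)

s2 = rng.standard_normal(2048)          # mean supervector corresponding to the first i-vector (placeholder)
w2 = project_to_t_space(s2, m, T)       # registration i-vector (step S40)
print(w1.shape, w2.shape)               # both are 400-dimensional
```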
In one embodiment, as shown in Fig. 3, step S10 of extracting the training voice features corresponding to the training voice data specifically includes the following steps:
S11: Pre-process the training voice data to obtain pre-processed voice data.
In a specific embodiment, step S11 of pre-processing the training voice data to obtain the pre-processed voice data specifically includes the following steps:
S111: Apply pre-emphasis to the training voice data. The pre-emphasis formula is s'_n = s_n − a·s_{n−1}, where s_n is the signal amplitude in the time domain, s_{n−1} is the signal amplitude at the previous moment corresponding to s_n, s'_n is the signal amplitude in the time domain after pre-emphasis, and a is the pre-emphasis coefficient with value range 0.9 < a < 1.0.
Pre-emphasis is a signal-processing technique that compensates the high-frequency components of the input signal at the transmitting end. As the signal rate increases, the signal is heavily attenuated during transmission; to allow the receiving end to obtain a good signal waveform, the attenuated signal needs to be compensated. The idea of pre-emphasis is to boost the high-frequency components of the signal at the transmitting end of the transmission line, so as to compensate the excessive attenuation of the high-frequency components during transmission. Pre-emphasis has no effect on noise, so it effectively improves the output signal-to-noise ratio.
In this embodiment, pre-emphasis is applied to the training voice data using s'_n = s_n − a·s_{n−1}, where s_n is the signal amplitude in the time domain, i.e. the amplitude of the voice data expressed in the time domain, s_{n−1} is the signal amplitude at the previous moment, s'_n is the signal amplitude in the time domain after pre-emphasis, and a is the pre-emphasis coefficient with 0.9 < a < 1.0; a = 0.97 gives a relatively good pre-emphasis effect. Pre-emphasis removes interference caused by the vocal cords and lips during speech production, effectively compensates the suppressed high-frequency part of the training voice data, highlights the high-frequency formants of the training voice data, and strengthens the signal amplitude of the training voice data, which helps to extract the training voice features.
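As an illustration of this step, the following is a minimal NumPy sketch of pre-emphasis on a waveform array; the synthetic signal and the choice a = 0.97 are assumptions for the example.

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, a: float = 0.97) -> np.ndarray:
    """s'_n = s_n - a * s_(n-1); the first sample is kept unchanged."""
    return np.append(signal[0], signal[1:] - a * signal[:-1])

# Example with a synthetic 1-second signal sampled at 16 kHz.
fs = 16000
t = np.arange(fs) / fs
signal = np.sin(2 * np.pi * 220 * t) + 0.1 * np.sin(2 * np.pi * 3000 * t)
emphasized = pre_emphasis(signal)
```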
S112: Frame the pre-emphasized training voice data.
Specifically, after pre-emphasizing the training voice data, framing should also be performed. Framing is the speech-processing technique of cutting the whole voice signal into several segments; each frame is in the range of 10–30 ms, and the frame shift is generally 1/2 of the frame length. The frame shift is the overlap between two adjacent frames, which avoids excessive change between them. Framing divides the training voice data into several segments, which facilitates the extraction of the training voice features.
S113: Apply windowing to the framed training voice data to obtain the pre-processed voice data. The windowing formula is s'_n = s_n × (0.54 − 0.46·cos(2πn/(N − 1))), where N is the window length, n is the time index, s_n is the signal amplitude in the time domain, and s'_n is the signal amplitude in the time domain after windowing.
Specifically, after framing the training voice data, discontinuities appear at the beginning and end of each frame, so framing alone increases the error with respect to the training voice data. Windowing solves this problem: it makes the framed training voice data continuous and lets each frame exhibit the characteristics of a periodic function. Windowing means processing the training voice data with a window function; a Hamming window can be chosen, giving the formula s'_n = s_n × (0.54 − 0.46·cos(2πn/(N − 1))), where N is the Hamming window length, n is the time index, s_n is the signal amplitude in the time domain, and s'_n is the signal amplitude in the time domain after windowing. Windowing the training voice data produces the pre-processed voice data and makes the time-domain signal of the framed training voice data continuous, which helps to extract the training voice features of the training voice data.
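A minimal sketch of framing and Hamming windowing is given below, assuming a 25 ms frame length and a 12.5 ms frame shift at 16 kHz; these values and the helper name are illustrative, not part of the patent.

```python
import numpy as np

def frame_and_window(signal: np.ndarray, fs: int = 16000,
                     frame_ms: float = 25.0, shift_ms: float = 12.5) -> np.ndarray:
    """Split a 1-D signal into overlapping frames and apply a Hamming window to each."""
    frame_len = int(fs * frame_ms / 1000)
    frame_shift = int(fs * shift_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    window = np.hamming(frame_len)          # 0.54 - 0.46*cos(2*pi*n/(N-1))
    frames = np.stack([
        signal[i * frame_shift: i * frame_shift + frame_len] * window
        for i in range(n_frames)
    ])
    return frames                            # shape: (n_frames, frame_len)

frames = frame_and_window(np.random.default_rng(0).standard_normal(16000))
print(frames.shape)
```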
The above steps S111–S113 perform the pre-processing of the training voice data, laying the foundation for extracting the training voice features of the training voice data; they make the extracted training voice features more representative of the training voice data, so that the corresponding GMM-UBM model can be trained from these training voice features.
S12: Apply a Fast Fourier Transform to the pre-processed voice data to obtain the spectrum of the training voice data, and obtain the power spectrum of the training voice data from the spectrum.
The Fast Fourier Transform (FFT) is the general term for efficient, fast algorithms for computing the discrete Fourier transform on a computer. Using such an algorithm greatly reduces the number of multiplications a computer needs to compute a discrete Fourier transform; the more sampling points being transformed, the more significant the savings in computation.
Specifically, a Fast Fourier Transform is applied to the pre-processed voice data to convert it from the signal amplitude in the time domain to the signal amplitude in the frequency domain (the spectrum). The spectrum is computed as s(k) = Σ_{n=1}^{N} s(n)·e^{−2πikn/N}, 1 ≤ k ≤ N, where N is the frame size, s(k) is the signal amplitude in the frequency domain, s(n) is the signal amplitude in the time domain, n is the time index, and i is the imaginary unit. After the spectrum of the pre-processed voice data is obtained, the power spectrum of the pre-processed voice data can be obtained directly from the spectrum; it is hereinafter referred to as the power spectrum of the training voice data. The power spectrum of the training voice data is computed as P(k) = |s(k)|²/N, 1 ≤ k ≤ N, where N is the frame size and s(k) is the signal amplitude in the frequency domain. Converting the pre-processed voice data from the time-domain signal amplitude to the frequency-domain signal amplitude, and then obtaining the power spectrum of the training voice data from the frequency-domain signal amplitude, provides an important technical basis for extracting the training voice features from the power spectrum of the training voice data.
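The following sketch computes the per-frame power spectrum with NumPy's real FFT, continuing from a framed-and-windowed array like the one in the previous sketch; the use of `rfft` and the 1/N scaling follow the description above and are an assumption about the intended implementation.

```python
import numpy as np

def power_spectrum(frames: np.ndarray, n_fft: int = 512) -> np.ndarray:
    """|FFT|^2 / N for each windowed frame; returns shape (n_frames, n_fft//2 + 1)."""
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)   # frequency-domain amplitudes s(k)
    return (np.abs(spectrum) ** 2) / n_fft

frames = np.random.default_rng(0).standard_normal((79, 400))  # stand-in for the windowed frames
pow_spec = power_spectrum(frames)
print(pow_spec.shape)   # (79, 257)
```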
S13: Process the power spectrum of the training voice data with a mel-scale filter bank to obtain the mel power spectrum of the training voice data.
Processing the power spectrum of the training voice data with a mel-scale filter bank is a mel-frequency analysis of the power spectrum, and mel-frequency analysis is based on human auditory perception. Studies have found that the human ear behaves like a filter bank that only attends to certain specific frequency components (human hearing is nonlinear with respect to frequency); in other words, the ear only receives signals in a limited range of sound frequencies. These filters are not uniformly distributed along the frequency axis: there are many densely spaced filters in the low-frequency region, while in the high-frequency region the filters become fewer and sparsely distributed. It should be understood that the mel-scale filter bank has high resolution in the low-frequency part, consistent with the auditory characteristics of the human ear; this is the physical meaning of the mel scale.
In this embodiment, the power spectrum of the training voice data is processed with a mel-scale filter bank to obtain the mel power spectrum of the training voice data. The mel-scale filter bank partitions the frequency-domain signal so that each frequency band corresponds to one value; if the number of filters is 22, 22 energy values corresponding to the mel power spectrum of the training voice data are obtained. Performing mel-frequency analysis on the power spectrum of the training voice data keeps, in the resulting mel power spectrum, the frequency portions closely related to the characteristics of the human ear, so that the mel power spectrum reflects the features of the training voice data well.
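Below is a compact sketch of a triangular mel-scale filter bank applied to the power spectrum; the 22-filter setting follows the example in the text, while the mel conversion constants (2595 and 700) are the commonly used convention and an assumption here.

```python
import numpy as np

def mel_filterbank(n_filters=22, n_fft=512, fs=16000):
    """Triangular filters spaced uniformly on the mel scale, returned as (n_filters, n_fft//2+1)."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

pow_spec = np.random.default_rng(0).random((79, 257))     # stand-in for the power spectrum
mel_spec = pow_spec @ mel_filterbank().T                  # 22 energy values per frame
print(mel_spec.shape)   # (79, 22)
```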
S14: Perform cepstral analysis on the mel power spectrum to obtain the MFCC features of the training voice data.
The cepstrum is the inverse Fourier transform of the logarithm of the Fourier-transform spectrum of a signal; since the general Fourier spectrum is a complex spectrum, the cepstrum is also called the complex cepstrum.
Specifically, cepstral analysis is performed on the mel power spectrum, and the MFCC features of the training voice data are obtained from the result of the cepstral analysis. Through this cepstral analysis, the features contained in the mel power spectrum of the training voice data, whose original dimensionality is too high for direct use, are converted into easy-to-use features (MFCC feature vectors used for training or recognition). The MFCC features can serve, as training voice features, as coefficients distinguishing different voices: they reflect the differences between voices and can be used to recognize and distinguish the training voice data.
In a specific embodiment, step S14 of performing cepstral analysis on the mel power spectrum to obtain the MFCC features of the training voice data includes the following steps:
S141: Take the logarithm of the mel power spectrum to obtain the mel power spectrum to be transformed.
Specifically, following the definition of the cepstrum, the logarithm log is taken of the mel power spectrum to obtain the mel power spectrum to be transformed, m.
S142: Apply a discrete cosine transform to the mel power spectrum to be transformed to obtain the MFCC features of the training voice data.
Specifically, a Discrete Cosine Transform (DCT) is applied to the mel power spectrum to be transformed, m, to obtain the MFCC features of the corresponding training voice data; generally the 2nd to 13th coefficients are taken as the training voice features, and these training voice features reflect the differences between voice data. The discrete cosine transform of the mel power spectrum to be transformed is C(i) = Σ_{j=0}^{N−1} m(j)·cos(π·i·(2j + 1)/(2N)), i = 0, 1, 2, …, N−1, where N is the frame length, m is the mel power spectrum to be transformed, and j is the index of the mel power spectrum to be transformed. Because the mel filters overlap, the energy values obtained with the mel-scale filters are correlated; the discrete cosine transform can reduce the dimensionality of and compress the mel power spectrum to be transformed, m, and thus obtain the training voice features indirectly. Compared with the Fourier transform, the result of the discrete cosine transform has no imaginary part, which is a clear advantage in computation.
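A sketch of steps S141–S142, taking the log of the mel power spectrum and applying the DCT formula above to keep the 2nd–13th coefficients, is given below; the small offset added before the log and the helper name are implementation assumptions.

```python
import numpy as np

def mfcc_from_mel(mel_spec: np.ndarray, n_keep: int = 12) -> np.ndarray:
    """log of the mel power spectrum followed by a DCT; keeps the 2nd..13th coefficients."""
    log_mel = np.log(mel_spec + 1e-10)                 # mel power spectrum to be transformed
    n = log_mel.shape[1]
    j = np.arange(n)
    basis = np.array([np.cos(np.pi * i * (2 * j + 1) / (2 * n)) for i in range(n)])
    cepstrum = log_mel @ basis.T                       # C(i) = sum_j m(j) cos(pi*i*(2j+1)/(2N))
    return cepstrum[:, 1:1 + n_keep]                   # 2nd to 13th coefficients as MFCC features

mel_spec = np.random.default_rng(0).random((79, 22))   # stand-in for the mel power spectrum
mfcc = mfcc_from_mel(mel_spec)
print(mfcc.shape)   # (79, 12)
```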
Steps S11–S14 perform feature extraction on the training voice data. The training voice features finally obtained represent the training voice data well; from these training voice features a corresponding GMM-UBM model can be trained and a registration i-vector obtained, so that the result is more accurate when the registration i-vector obtained from training is used for speech recognition.
It should be noted that, although the features extracted above are MFCC features, the training voice features are not limited to MFCC features alone; any voice feature obtained by training techniques can serve as a training voice feature for recognition and model training as long as it effectively reflects the characteristics of the voice data. In this embodiment, the training voice data is pre-processed to obtain the corresponding pre-processed voice data. Pre-processing the training voice data allows the training voice features of the training voice data to be extracted better, so that the extracted training voice features are more representative of the training voice data and speech recognition can be performed with them.
In one embodiment, as shown in Fig. 4, step S20 of training, based on the preset UBM model, the total variability subspace corresponding to the preset UBM model specifically includes the following steps:
S21. Obtain the high-dimensional sufficient statistics of the preset UBM model.
The UBM model is a high-order GMM trained with sufficient voice from many speakers, balanced in channel and in male and female voices, to describe a speaker-independent feature distribution. The parameters of the UBM model can be adjusted according to the training voice features to characterize the individual information of a specific speaker, and the features not covered by the training voice features are approximated by similar feature distributions in the UBM model, which alleviates the performance problem caused by insufficient training voice.
A statistic is a function of the sample data. In statistics, T(x) is a sufficient statistic of the parameter θ of an unknown distribution P if and only if T(x) provides all the information about θ; that is, no other statistic can provide additional information about θ. A statistic is in fact a compression of the data distribution: when a sample is condensed into a statistic, the information contained in the sample may be lost; if no information is lost when the sample is condensed into the statistic, the statistic is called a sufficient statistic. For example, for a Gaussian distribution, the expectation and the covariance matrix are its two sufficient statistics, because once these two parameters are known the Gaussian distribution is uniquely determined.
Specifically, the process of obtaining the high-dimensional sufficient statistics of the preset UBM model is: determine a speaker sample X = {x1, x2, …, xn}, which obeys the distribution F(x) corresponding to the preset UBM model with parameter θ. Let T = r(x1, x2, …, xn) be a statistic of this sample. If T obeys a distribution F(T), and the parameter θ of the distribution F(x) of the sample X can be derived from F(T), i.e. all the information about θ contained in F(x) is already contained in F(T), then T is a high-dimensional sufficient statistic of the preset UBM model.
In this step, the recognition server obtains the zeroth-order and first-order sufficient statistics of the preset UBM model as the technical basis for training the total variability subspace.
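The zeroth- and first-order statistics mentioned in this step are usually accumulated per UBM component from the frame posteriors; the sketch below shows one common way to do this for a diagonal-covariance UBM held in plain NumPy arrays. The array layout and names are assumptions, not the patent's notation.

```python
import numpy as np

def baum_welch_stats(features, weights, means, variances):
    """Zeroth-order (N_c) and first-order (F_c) statistics of features under a diagonal GMM/UBM."""
    # Log-likelihood of every frame under every Gaussian component.
    log_prob = (
        -0.5 * (((features[:, None, :] - means[None]) ** 2) / variances[None]).sum(-1)
        - 0.5 * np.log(2 * np.pi * variances).sum(-1)[None]
        + np.log(weights)[None]
    )
    log_prob -= log_prob.max(axis=1, keepdims=True)
    post = np.exp(log_prob)
    post /= post.sum(axis=1, keepdims=True)          # frame-level posteriors
    N = post.sum(axis=0)                             # zeroth-order statistics, shape (C,)
    F = post.T @ features                            # first-order statistics, shape (C, D)
    return N, F

rng = np.random.default_rng(0)
C, D = 8, 12                                         # hypothetical: 8 components, 12-dim MFCC
weights = np.full(C, 1.0 / C)
means = rng.standard_normal((C, D))
variances = np.ones((C, D))
N, F = baum_welch_stats(rng.standard_normal((79, D)), weights, means, variances)
```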
S22. Iterate the high-dimensional sufficient statistics with the EM algorithm to obtain the corresponding total variability subspace.
The EM algorithm (Expectation Maximization Algorithm) is an iterative algorithm used in statistics to find the maximum-likelihood estimates of parameters in probabilistic models that depend on unobservable latent variables. For example, initialize two parameters A and B whose values are both unknown in the initial state; knowing A yields information about B, and likewise knowing B yields information about A. First assign some initial value to A, obtain an estimate of B from it, then re-estimate the value of A from the current value of B, and continue until convergence.
The EM procedure is: 1. initialize the distribution parameters; 2. repeat the E step and M step until convergence. E step: estimate the expected values of the unknown parameters, given the current parameter estimates. M step: re-estimate the distribution parameters so as to maximize the likelihood of the data, given the expected estimates of the unknown variables. By alternating the E step and the M step, the parameters of the model are gradually improved, the likelihood of the parameters and of the training samples gradually increases, and the procedure finally terminates at a maximum point.
Specifically, obtaining the total variability subspace by iteration is realized through the following steps:
Step 1: according to the high-dimensional sufficient statistics, concatenate the mean vectors of the M Gaussian components (each vector has D dimensions) into a Gaussian mean supervector, i.e. an M*D-dimensional vector, which constitutes F(x); F(x) is an MD-dimensional vector. At the same time, construct N from the zeroth-order sufficient statistics; N is an MD x MD diagonal matrix, spliced with the posterior probabilities as the main diagonal elements. A posterior probability is a probability that is revised after information about the outcome is obtained; for example, when an event has already happened, the probability that it was caused by a certain factor is a posterior probability.
Step 2: initialize the T space by constructing an [MD, V]-dimensional matrix, where the dimension V is much smaller than MD; V is exactly the dimension of the first i-vector.
Step 3: with the T space fixed, iterate the following formula with the EM algorithm to estimate the zeroth-order and first-order sufficient statistics of the hidden variable w. After the iteration reaches a preset number of times (5–6 times), the T space is considered to have converged and is fixed:
E[w] = (I + T^T·Σ^(−1)·N·T)^(−1)·T^T·Σ^(−1)·F
In the formula, w is the hidden variable and I is the identity matrix; Σ is the covariance matrix of the UBM model, of dimension MD x MD, whose diagonal elements are Σ1 … Σm; F is the first-order sufficient statistic among the high-dimensional sufficient statistics; N is the MD x MD diagonal matrix.
In this embodiment, the EM iteration provides a simple and stable iterative algorithm that computes the posterior density function to obtain the total variability subspace. Obtaining the total variability subspace makes it possible to project the high-dimensional sufficient statistics (supervectors) of the preset UBM model to low dimension, which benefits further speech recognition with the dimension-reduced vectors.
In one embodiment, as shown in Fig. 5, step S30 of projecting the training voice features onto the total variability subspace to obtain the first i-vector specifically includes the following steps:
S31. Based on the training voice features and the preset UBM model, obtain a GMM-UBM model using mean MAP adaptation.
The training voice features are the voice features that distinguish the speaker from others; in this embodiment, Mel-Frequency Cepstral Coefficients (MFCC features) can be used as the training voice features.
Specifically, based on the preset UBM model, the GMM of the training voice features is adapted by maximum a posteriori estimation, updating the mean vector of each Gaussian component. A GMM with M components is thereby generated, namely the GMM-UBM model. Taking the mean vector of each Gaussian component of the GMM-UBM model (each vector has D dimensions) as a concatenation unit, an M*D-dimensional Gaussian mean supervector is formed.
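A minimal sketch of mean-only MAP adaptation is shown below: the UBM component means are shifted toward the statistics of the speaker's training features. The relevance factor r = 16 is a common choice and an assumption here; N and F are the zeroth- and first-order statistics described in step S21 (random stand-ins in the example).

```python
import numpy as np

def map_adapt_means(ubm_means, N, F, relevance=16.0):
    """Mean-only MAP adaptation: m_c_new = alpha_c * (F_c / N_c) + (1 - alpha_c) * m_c."""
    alpha = N / (N + relevance)                          # per-component adaptation weight
    safe_N = np.maximum(N, 1e-10)
    speaker_means = F / safe_N[:, None]                  # per-component mean of the speaker data
    adapted = alpha[:, None] * speaker_means + (1.0 - alpha)[:, None] * ubm_means
    return adapted                                       # concatenating rows gives the M*D supervector

rng = np.random.default_rng(0)
C, D = 8, 12
ubm_means = rng.standard_normal((C, D))
N, F = rng.random(C) * 10, rng.standard_normal((C, D))   # stand-ins for the sufficient statistics
supervector = map_adapt_means(ubm_means, N, F).reshape(-1)   # M*D-dimensional mean supervector
```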
S32. Use the formula s1 = m + T·w1 to project the training voice features onto the total variability subspace and obtain the first i-vector, where s1 is the mean supervector of the C*F-dimensional GMM-UBM model corresponding to the training voice features; m is the speaker-independent and channel-independent C*F-dimensional supervector; T is the total variability subspace, with dimension CF*N; and w1 is the first i-vector, with dimension N.
In this embodiment, s1 can be the Gaussian mean supervector obtained in step S31; m is the speaker-independent and channel-independent M*D-dimensional supervector, spliced from the mean supervector of the UBM model; w1 is a random vector obeying the standard normal distribution, namely the first i-vector, whose dimension is N.
Further, T (the total variability subspace) in the formula is obtained as follows: the high-dimensional sufficient statistics of the UBM model are trained, and these statistics are then updated iteratively by the EM algorithm until the T space converges. Substituting the T space into the formula s1 = m + T·w1, since s1, m and T are all known, w1, namely the first i-vector, can be obtained as w1 = (s1 − m)/T.
In steps S31 to S32, using the formula s1 = m + T·w1 the training voice features can be projected onto the total variability subspace to obtain the first i-vector; this first dimensionality reduction of the training voice features simplifies their complexity and makes it convenient to further process the low-dimensional first i-vector or use it for speech recognition.
In one embodiment, as shown in Fig. 6, a speaker recognition method is provided. Taking the method applied to the recognition server in Fig. 1 as an example, it includes the following steps:
S50. Obtain test voice data, the test voice data carrying a speaker identifier.
The test voice data is voice data, yet to be confirmed, that claims to come from the speaker corresponding to the carried speaker identifier. The speaker identifier is a unique identifier of the speaker's identity, including but not limited to a user name, ID card number, mobile phone number, etc.
Completing speech recognition requires two elements: voice and identity. Applied to this embodiment, the voice is the test voice data and the identity is the speaker identifier, so that the recognition server can further determine whether the identity claimed by the test voice data is the real corresponding identity.
S60. Process the test voice data using the i-vector extraction method to obtain the corresponding test i-vector.
The test i-vector is the fixed-length characterization vector (i-vector) used for identity verification, obtained by projecting the test voice features onto the low-dimensional total variability subspace.
In this step, the test i-vector corresponding to the test voice data can be obtained; the procedure is the same as obtaining the registration i-vector based on the training voice features, and is not repeated here.
S70. Query the database based on the speaker identifier to obtain the registration i-vector corresponding to the speaker identifier.
The database is the database in which each speaker's registration i-vector is recorded in association with the speaker identifier.
The registration i-vector is recorded in the database of the recognition server and associated with the speaker ID as a fixed-length characterization vector (i-vector) serving as an identity marker.
In this step, the recognition server can look up, in the database, the registration i-vector corresponding to the speaker identifier carried by the test voice data, so that the registration i-vector can be further compared with the test i-vector.
S80. Obtain the similarity between the test i-vector and the registration i-vector using a cosine similarity algorithm, and detect, according to the similarity, whether the test i-vector and the registration i-vector correspond to the same speaker.
Specifically, the similarity between the test i-vector and the registration i-vector can be determined by the following formula:
cos θ = (Σ_i A_i·B_i) / (√(Σ_i A_i²)·√(Σ_i B_i²))
where A_i and B_i are the components of vector A and vector B respectively. The formula shows that the similarity ranges from −1 to 1, where −1 means the two vectors point in opposite directions, 1 means they point in the same direction, and 0 means the two vectors are independent. Values between −1 and 1 indicate the degree of similarity or dissimilarity between the two vectors; understandably, the closer the similarity is to 1, the closer the two vectors are. Applied to this embodiment, a threshold on cos θ can be preset based on practical experience. If the similarity between the test i-vector and the registration i-vector is greater than the threshold, the test i-vector and the registration i-vector are considered similar, i.e. it can be determined that the test voice data corresponds to the speaker identifier in the database.
In this embodiment, the cosine similarity algorithm can discriminate the similarity between the test i-vector and the registration i-vector simply and quickly, which is conducive to quick confirmation of the recognition result.
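The scoring step can be implemented directly as below; the 0.7 decision threshold is purely an assumed example value, since the patent leaves the threshold to practical experience.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) = sum(a_i * b_i) / (||a|| * ||b||)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_speaker(test_ivec, registered_ivec, threshold=0.7):
    return cosine_similarity(test_ivec, registered_ivec) > threshold

rng = np.random.default_rng(0)
test_ivec, registered_ivec = rng.standard_normal(400), rng.standard_normal(400)
print(cosine_similarity(test_ivec, registered_ivec), same_speaker(test_ivec, registered_ivec))
```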
In the i-vector extraction method provided by the embodiments of the present invention, the training voice features are first projected onto the total variability subspace to obtain a first i-vector, and the first i-vector is then projected onto the total variability subspace a second time to obtain the registration i-vector, so that after two projections, i.e. two rounds of dimensionality reduction, the training voice feature data sheds more noise features; this improves the purity of the extracted speaker voice features, while the smaller computation space after dimensionality reduction also improves recognition efficiency and reduces recognition complexity.
Further, extracting features from the training voice data by training techniques yields a registration i-vector that represents the training voice data well, so that the result is more accurate when the registration i-vector obtained from training is used for speech recognition. The EM iteration provides a simple and stable iterative algorithm that computes the posterior density function to obtain the total variability subspace; obtaining the total variability subspace makes it possible to project the high-dimensional sufficient statistics of the preset UBM model to low dimension, which benefits further speech recognition with the dimension-reduced vectors.
In the speaker recognition method provided by the embodiments of the present invention, the test voice data is processed with the i-vector extraction method to obtain the corresponding test i-vector, which reduces the complexity of obtaining the test i-vector; at the same time, the cosine similarity algorithm can discriminate the similarity between the test i-vector and the registration i-vector simply and quickly, which is conducive to quick confirmation of the recognition result.
It should be understood that the size of the sequence numbers of the steps in the above embodiments does not imply the order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present invention.
In one embodiment, an i-vector extraction device is provided, which corresponds to the i-vector extraction method in the above embodiments. As shown in Fig. 7, the i-vector extraction device includes a voice data obtaining module 10, a variability space training module 20, a variability space projection module 30 and an i-vector obtaining module 40. The functional modules are described in detail as follows:
The voice data obtaining module 10 is configured to obtain training voice data of a speaker and extract the training voice features corresponding to the training voice data.
The variability space training module 20 is configured to train, based on a preset UBM model, a total variability subspace corresponding to the preset UBM model.
The variability space projection module 30 is configured to project the training voice features onto the total variability subspace to obtain a first i-vector.
The i-vector obtaining module 40 is configured to project the first i-vector onto the total variability subspace to obtain a registration i-vector corresponding to the speaker.
Preferably, the voice data obtaining module 10 includes a voice data obtaining unit 11, a data power spectrum obtaining unit 12, a mel power spectrum obtaining unit 13 and an MFCC feature obtaining unit 14.
The voice data obtaining unit 11 is configured to pre-process the training voice data to obtain pre-processed voice data.
The data power spectrum obtaining unit 12 is configured to apply a Fast Fourier Transform to the pre-processed voice data to obtain the spectrum of the training voice data, and to obtain the power spectrum of the training voice data from the spectrum.
The mel power spectrum obtaining unit 13 is configured to process the power spectrum of the training voice data with a mel-scale filter bank to obtain the mel power spectrum of the training voice data.
The MFCC feature obtaining unit 14 is configured to perform cepstral analysis on the mel power spectrum to obtain the MFCC features of the training voice data.
The variability space training module 20 includes a high-dimensional statistics obtaining unit 21 and a variability subspace obtaining unit 22.
The high-dimensional statistics obtaining unit 21 is configured to obtain the high-dimensional sufficient statistics of the preset UBM model.
The variability subspace obtaining unit 22 is configured to iterate the high-dimensional sufficient statistics with the EM algorithm to obtain the corresponding total variability subspace.
The variability space projection module 30 includes a GMM-UBM model obtaining unit 31 and a first vector obtaining unit 32.
The GMM-UBM model obtaining unit 31 is configured to obtain a GMM-UBM model using mean MAP adaptation based on the training voice features and the preset UBM model.
The first vector obtaining unit 32 is configured to use the formula s1 = m + T·w1 to obtain the first i-vector, where s1 is the mean supervector corresponding to the C*F-dimensional GMM-UBM model; m is the speaker-independent and channel-independent C*F-dimensional supervector; T is the total variability subspace, with dimension CF*N; and w1 is the first i-vector, with dimension N.
Preferably, the i-vector obtaining module 40 includes a registration vector obtaining unit 41.
The registration vector obtaining unit 41 is configured to use the formula s2 = m + T·w2 to project the first i-vector onto the total variability subspace and obtain the registration i-vector, where s2 is the D*G-dimensional mean supervector corresponding to the registration i-vector; m is the speaker-independent and channel-independent D*G-dimensional supervector; T is the total variability subspace, with dimension DG*M; and w2 is the registration i-vector, with dimension M.
For the specific limitations of the i-vector extraction device, reference may be made to the above limitations of the i-vector extraction method, which are not repeated here. Each module in the above i-vector extraction device can be implemented in whole or in part by software, hardware or a combination thereof. The above modules can be embedded in or independent of the processor in the computer equipment in the form of hardware, or stored in the memory of the computer equipment in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
In one embodiment, a speaker recognition apparatus is provided, and the speaker recognition apparatus corresponds to the speaker recognition method in the above embodiment. As shown in Figure 8, the speaker recognition apparatus includes a test data obtaining module 50, a test vector obtaining module 60, a registration vector obtaining module 70, and a corresponding speaker determining module 80. Each functional module is described in detail as follows:
The test data obtaining module 50 is configured to obtain test voice data, the test voice data carrying a speaker identifier.
The test vector obtaining module 60 is configured to process the test voice data using the i-vector extraction method to obtain a corresponding test i-vector.
The registration vector obtaining module 70 is configured to query a database based on the speaker identifier to obtain the registration i-vector corresponding to the speaker identifier.
The corresponding speaker determining module 80 is configured to obtain the similarity between the test i-vector and the registration i-vector using the cosine similarity algorithm, and to detect, according to the similarity, whether the test i-vector and the registration i-vector correspond to the same speaker.
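A brief sketch of the decision made by the corresponding speaker determining module 80 follows: the cosine similarity between the test i-vector and the registration i-vector retrieved for the claimed speaker identifier is compared against a threshold. The threshold value here is an assumption; the application does not fix one.

```python
# Cosine-similarity scoring sketch; the 0.6 threshold is an assumed value.
import numpy as np

def cosine_similarity(w_test, w_enroll):
    return float(np.dot(w_test, w_enroll)
                 / (np.linalg.norm(w_test) * np.linalg.norm(w_enroll) + 1e-12))

def is_same_speaker(w_test, w_enroll, threshold=0.6):
    return cosine_similarity(w_test, w_enroll) >= threshold
```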
For the specific limitations of the speaker recognition apparatus, reference may be made to the limitations of the speaker recognition method above, and details are not repeated here. Each module in the above speaker recognition apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor of a computer device in the form of hardware, or may be stored in a memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in Figure 9. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device is configured to provide computing and control capability. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database of the computer device is configured to store data related to the i-vector extraction method or the speaker recognition method. The network interface of the computer device is configured to communicate with an external terminal through a network connection. When the computer program is executed by the processor, the i-vector extraction method or the speaker recognition method is implemented.
In one embodiment, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor. The processor, when executing the computer program, implements the following steps: obtaining the training voice data of a speaker, and extracting the training voice features corresponding to the training voice data; training a total variability subspace corresponding to a preset UBM model based on the preset UBM model; projecting the training voice features onto the total variability subspace to obtain a first i-vector; and projecting the first i-vector onto the total variability subspace to obtain a registration i-vector corresponding to the speaker.
In one embodiment, when extracting the training voice features corresponding to the training voice data, the processor, when executing the computer program, implements the following steps: pre-processing the training voice data to obtain pre-processed voice data; performing a fast Fourier transform (FFT) on the pre-processed voice data to obtain the frequency spectrum of the training voice data, and obtaining the power spectrum of the training voice data according to the frequency spectrum; processing the power spectrum of the training voice data using a Mel-scale filter bank to obtain the Mel power spectrum of the training voice data; and performing cepstral analysis on the Mel power spectrum to obtain the MFCC features of the training voice data.
In one embodiment, when training the total variability subspace corresponding to the preset UBM model based on the preset UBM model, the processor, when executing the computer program, implements the following steps: obtaining the higher-dimensional sufficient statistics of the preset UBM model; and iterating over the higher-dimensional sufficient statistics using the expectation-maximization (EM) algorithm to obtain the corresponding total variability subspace.
In one embodiment, when projecting the training voice features onto the total variability subspace to obtain the first i-vector, the processor, when executing the computer program, implements the following steps: obtaining a GMM-UBM model using the mean MAP adaptation method based on the training voice features and the preset UBM model; and projecting the training voice features onto the total variability subspace using the formula s1 = m + T·w1 to obtain the first i-vector, where s1 is the mean supervector corresponding to the training voice features in the GMM-UBM model, of dimension C*F; m is the speaker-independent and channel-independent supervector of dimension C*F; T is the total variability subspace, of dimension CF*N; and w1 is the first i-vector, of dimension N.
In one embodiment, when projecting the first i-vector onto the total variability subspace to obtain the registration i-vector corresponding to the speaker, the processor, when executing the computer program, implements the following step: projecting the first i-vector onto the total variability subspace using the formula s2 = m + T·w2 to obtain the registration i-vector, where s2 is the mean supervector corresponding to the registration i-vector, of dimension D*G; m is the speaker-independent and channel-independent supervector of dimension D*G; T is the total variability subspace, of dimension DG*M; and w2 is the registration i-vector, of dimension M.
In one embodiment, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor. The processor, when executing the computer program, implements the following steps: obtaining test voice data, the test voice data carrying a speaker identifier; obtaining a corresponding test i-vector based on the test voice data; querying a database based on the speaker identifier to obtain a registration i-vector corresponding to the speaker identifier; and obtaining the similarity between the test i-vector and the registration i-vector using the cosine similarity algorithm, and detecting, according to the similarity, whether the test i-vector and the registration i-vector correspond to the same speaker.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. The computer program, when executed by a processor, implements the following steps: obtaining the training voice data of a speaker, and extracting the training voice features corresponding to the training voice data; training a total variability subspace corresponding to a preset UBM model based on the preset UBM model; projecting the training voice features onto the total variability subspace to obtain a first i-vector; and projecting the first i-vector onto the total variability subspace to obtain a registration i-vector corresponding to the speaker.
In one embodiment, when extracting the training voice features corresponding to the training voice data, the computer program, when executed by the processor, implements the following steps: pre-processing the training voice data to obtain pre-processed voice data; performing a fast Fourier transform (FFT) on the pre-processed voice data to obtain the frequency spectrum of the training voice data, and obtaining the power spectrum of the training voice data according to the frequency spectrum; processing the power spectrum of the training voice data using a Mel-scale filter bank to obtain the Mel power spectrum of the training voice data; and performing cepstral analysis on the Mel power spectrum to obtain the MFCC features of the training voice data.
In one embodiment, when training the total variability subspace corresponding to the preset UBM model based on the preset UBM model, the computer program, when executed by the processor, implements the following steps: obtaining the higher-dimensional sufficient statistics of the preset UBM model; and iterating over the higher-dimensional sufficient statistics using the expectation-maximization (EM) algorithm to obtain the corresponding total variability subspace.
In one embodiment, when projecting the training voice features onto the total variability subspace to obtain the first i-vector, the computer program, when executed by the processor, implements the following steps: obtaining a GMM-UBM model using the mean MAP adaptation method based on the training voice features and the preset UBM model; and projecting the training voice features onto the total variability subspace using the formula s1 = m + T·w1 to obtain the first i-vector, where s1 is the mean supervector corresponding to the training voice features in the GMM-UBM model, of dimension C*F; m is the speaker-independent and channel-independent supervector of dimension C*F; T is the total variability subspace, of dimension CF*N; and w1 is the first i-vector, of dimension N.
In one embodiment, when projecting the first i-vector onto the total variability subspace to obtain the registration i-vector corresponding to the speaker, the computer program, when executed by the processor, implements the following step: projecting the first i-vector onto the total variability subspace using the formula s2 = m + T·w2 to obtain the registration i-vector, where s2 is the mean supervector corresponding to the registration i-vector, of dimension D*G; m is the speaker-independent and channel-independent supervector of dimension D*G; T is the total variability subspace, of dimension DG*M; and w2 is the registration i-vector, of dimension M.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. The computer program, when executed by a processor, implements the following steps: obtaining test voice data, the test voice data carrying a speaker identifier; obtaining a corresponding test i-vector based on the test voice data; querying a database based on the speaker identifier to obtain a registration i-vector corresponding to the speaker identifier; and obtaining the similarity between the test i-vector and the registration i-vector using the cosine similarity algorithm, and detecting, according to the similarity, whether the test i-vector and the registration i-vector correspond to the same speaker.
A person of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments may be implemented by instructing relevant hardware through a computer program. The computer program may be stored in a non-volatile computer-readable storage medium, and when the computer program is executed, it may include the processes of the embodiments of the above methods. Any reference to memory, storage, database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It is clear to those skilled in the art that, for convenience and brevity of description, only the division of the above functional units and modules is used as an example. In practical applications, the above functions may be allocated to different functional units and modules as needed; that is, the internal structure of the apparatus may be divided into different functional units or modules to complete all or part of the functions described above.
The above embodiments are only used to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of the technical features therein can be equivalently replaced; these modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all be included within the protection scope of the present invention.
Claims (10)
1. An i-vector extraction method, characterized by comprising:
obtaining the training voice data of a speaker, and extracting the training voice features corresponding to the training voice data;
training a total variability subspace corresponding to a preset UBM model based on the preset UBM model;
projecting the training voice features onto the total variability subspace to obtain a first i-vector;
projecting the first i-vector onto the total variability subspace to obtain a registration i-vector corresponding to the speaker.
2. The i-vector extraction method according to claim 1, characterized in that extracting the training voice features corresponding to the training voice data comprises:
pre-processing the training voice data to obtain pre-processed voice data;
performing a fast Fourier transform (FFT) on the pre-processed voice data to obtain the frequency spectrum of the training voice data, and obtaining the power spectrum of the training voice data according to the frequency spectrum;
processing the power spectrum of the training voice data using a Mel-scale filter bank to obtain the Mel power spectrum of the training voice data;
performing cepstral analysis on the Mel power spectrum to obtain the MFCC features of the training voice data.
3. The i-vector extraction method according to claim 1, characterized in that training the total variability subspace corresponding to the preset UBM model based on the preset UBM model comprises:
obtaining the higher-dimensional sufficient statistics of the preset UBM model;
iterating over the higher-dimensional sufficient statistics using the expectation-maximization (EM) algorithm to obtain the corresponding total variability subspace.
4. The i-vector extraction method according to claim 1, characterized in that projecting the training voice features onto the total variability subspace to obtain the first i-vector comprises:
obtaining a GMM-UBM model using the mean MAP adaptation method based on the training voice features and the preset UBM model;
projecting the training voice features onto the total variability subspace using the formula s1 = m + T·w1 to obtain the first i-vector, where s1 is the mean supervector corresponding to the training voice features in the GMM-UBM model, of dimension C*F; m is the speaker-independent and channel-independent supervector of dimension C*F; T is the total variability subspace, of dimension CF*N; and w1 is the first i-vector, of dimension N.
5. The i-vector extraction method according to claim 1, characterized in that projecting the first i-vector onto the total variability subspace to obtain the registration i-vector corresponding to the speaker comprises:
projecting the first i-vector onto the total variability subspace using the formula s2 = m + T·w2 to obtain the registration i-vector, where s2 is the mean supervector corresponding to the registration i-vector, of dimension D*G; m is the speaker-independent and channel-independent supervector of dimension D*G; T is the total variability subspace, of dimension DG*M; and w2 is the registration i-vector, of dimension M.
6. A speaker recognition method, characterized by comprising:
obtaining test voice data, the test voice data carrying a speaker identifier;
processing the test voice data using the i-vector extraction method according to any one of claims 1 to 5 to obtain a corresponding test i-vector;
querying a database based on the speaker identifier to obtain a registration i-vector corresponding to the speaker identifier;
obtaining the similarity between the test i-vector and the registration i-vector using the cosine similarity algorithm, and detecting, according to the similarity, whether the test i-vector and the registration i-vector correspond to the same speaker.
7. An i-vector extraction apparatus, characterized by comprising:
a training data obtaining module, configured to obtain the training voice data of a speaker, and extract the training voice features corresponding to the training voice data;
a training variation space module, configured to train a total variability subspace corresponding to a preset UBM model based on the preset UBM model;
a projection variation space module, configured to project the training voice features onto the total variability subspace to obtain a first i-vector;
an i-vector obtaining module, configured to project the first i-vector onto the total variability subspace to obtain a registration i-vector corresponding to the speaker.
8. A speaker recognition apparatus, characterized by comprising:
a test data obtaining module, configured to obtain test voice data, the test voice data carrying a speaker identifier;
a test vector obtaining module, configured to process the test voice data using the i-vector extraction method according to any one of claims 1 to 5 to obtain a corresponding test i-vector;
a registration vector obtaining module, configured to query a database based on the speaker identifier to obtain a registration i-vector corresponding to the speaker identifier;
a corresponding speaker determining module, configured to obtain the similarity between the test i-vector and the registration i-vector using the cosine similarity algorithm, and to detect, according to the similarity, whether the test i-vector and the registration i-vector correspond to the same speaker.
9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the i-vector extraction method according to any one of claims 1 to 5 or of the speaker recognition method according to claim 6.
10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the i-vector extraction method according to any one of claims 1 to 5 or of the speaker recognition method according to claim 6.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810574010.4A CN109065022B (en) | 2018-06-06 | 2018-06-06 | Method for extracting i-vector, method, device, equipment and medium for speaker recognition |
PCT/CN2018/092589 WO2019232826A1 (en) | 2018-06-06 | 2018-06-25 | I-vector extraction method, speaker recognition method and apparatus, device, and medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810574010.4A CN109065022B (en) | 2018-06-06 | 2018-06-06 | Method for extracting i-vector, method, device, equipment and medium for speaker recognition |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109065022A true CN109065022A (en) | 2018-12-21 |
CN109065022B CN109065022B (en) | 2022-08-09 |
Family
ID=64820489
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810574010.4A Active CN109065022B (en) | 2018-06-06 | 2018-06-06 | Method for extracting i-vector, method, device, equipment and medium for speaker recognition |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN109065022B (en) |
WO (1) | WO2019232826A1 (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110827834A (en) * | 2019-11-11 | 2020-02-21 | 广州国音智能科技有限公司 | Voiceprint registration method, system and computer readable storage medium |
CN111161713A (en) * | 2019-12-20 | 2020-05-15 | 北京皮尔布莱尼软件有限公司 | Voice gender identification method and device and computing equipment |
CN111508505A (en) * | 2020-04-28 | 2020-08-07 | 讯飞智元信息科技有限公司 | Speaker identification method, device, equipment and storage medium |
WO2020098828A3 (en) * | 2019-10-31 | 2020-09-03 | Alipay (Hangzhou) Information Technology Co., Ltd. | System and method for personalized speaker verification |
CN113056784A (en) * | 2019-01-29 | 2021-06-29 | 深圳市欢太科技有限公司 | Voice information processing method and device, storage medium and electronic equipment |
CN114420142A (en) * | 2022-03-28 | 2022-04-29 | 北京沃丰时代数据科技有限公司 | Voice conversion method, device, equipment and storage medium |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111700718B (en) * | 2020-07-13 | 2023-06-27 | 京东科技信息技术有限公司 | Method and device for recognizing holding gesture, artificial limb and readable storage medium |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104240706B (en) * | 2014-09-12 | 2017-08-15 | 浙江大学 | It is a kind of that the method for distinguishing speek person that similarity corrects score is matched based on GMM Token |
CN105933323B (en) * | 2016-06-01 | 2019-05-31 | 百度在线网络技术(北京)有限公司 | Voiceprint registration, authentication method and device |
DE102016115018B4 (en) * | 2016-08-12 | 2018-10-11 | Imra Europe S.A.S. | Audio signature for voice command observation |
CN107240397A (en) * | 2017-08-14 | 2017-10-10 | 广东工业大学 | A kind of smart lock and its audio recognition method and system based on Application on Voiceprint Recognition |
-
2018
- 2018-06-06 CN CN201810574010.4A patent/CN109065022B/en active Active
- 2018-06-25 WO PCT/CN2018/092589 patent/WO2019232826A1/en active Application Filing
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102737633A (en) * | 2012-06-21 | 2012-10-17 | 北京华信恒达软件技术有限公司 | Method and device for recognizing speaker based on tensor subspace analysis |
US20150149165A1 (en) * | 2013-11-27 | 2015-05-28 | International Business Machines Corporation | Speaker Adaptation of Neural Network Acoustic Models Using I-Vectors |
CN104167208A (en) * | 2014-08-08 | 2014-11-26 | 中国科学院深圳先进技术研究院 | Speaker recognition method and device |
CN105810199A (en) * | 2014-12-30 | 2016-07-27 | 中国科学院深圳先进技术研究院 | Identity verification method and device for speakers |
WO2018053531A1 (en) * | 2016-09-19 | 2018-03-22 | Pindrop Security, Inc. | Dimensionality reduction of baum-welch statistics for speaker recognition |
CN106971713A (en) * | 2017-01-18 | 2017-07-21 | 清华大学 | Speaker's labeling method and system based on density peaks cluster and variation Bayes |
CN107146601A (en) * | 2017-04-07 | 2017-09-08 | 南京邮电大学 | A kind of rear end i vector Enhancement Methods for Speaker Recognition System |
CN107369440A (en) * | 2017-08-02 | 2017-11-21 | 北京灵伴未来科技有限公司 | The training method and device of a kind of Speaker Identification model for phrase sound |
CN107633845A (en) * | 2017-09-11 | 2018-01-26 | 清华大学 | A kind of duscriminant local message distance keeps the method for identifying speaker of mapping |
Non-Patent Citations (1)
Title |
---|
XING Yujuan et al.: "Research on an improved i-vector speaker recognition algorithm", Science Technology and Engineering *
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113056784A (en) * | 2019-01-29 | 2021-06-29 | 深圳市欢太科技有限公司 | Voice information processing method and device, storage medium and electronic equipment |
WO2020098828A3 (en) * | 2019-10-31 | 2020-09-03 | Alipay (Hangzhou) Information Technology Co., Ltd. | System and method for personalized speaker verification |
US10997980B2 (en) | 2019-10-31 | 2021-05-04 | Alipay (Hangzhou) Information Technology Co., Ltd. | System and method for determining voice characteristics |
US11031018B2 (en) | 2019-10-31 | 2021-06-08 | Alipay (Hangzhou) Information Technology Co., Ltd. | System and method for personalized speaker verification |
US11244689B2 (en) | 2019-10-31 | 2022-02-08 | Alipay (Hangzhou) Information Technology Co., Ltd. | System and method for determining voice characteristics |
CN110827834A (en) * | 2019-11-11 | 2020-02-21 | 广州国音智能科技有限公司 | Voiceprint registration method, system and computer readable storage medium |
CN110827834B (en) * | 2019-11-11 | 2022-07-12 | 广州国音智能科技有限公司 | Voiceprint registration method, system and computer readable storage medium |
CN111161713A (en) * | 2019-12-20 | 2020-05-15 | 北京皮尔布莱尼软件有限公司 | Voice gender identification method and device and computing equipment |
CN111508505A (en) * | 2020-04-28 | 2020-08-07 | 讯飞智元信息科技有限公司 | Speaker identification method, device, equipment and storage medium |
CN111508505B (en) * | 2020-04-28 | 2023-11-03 | 讯飞智元信息科技有限公司 | Speaker recognition method, device, equipment and storage medium |
CN114420142A (en) * | 2022-03-28 | 2022-04-29 | 北京沃丰时代数据科技有限公司 | Voice conversion method, device, equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
WO2019232826A1 (en) | 2019-12-12 |
CN109065022B (en) | 2022-08-09 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |