CN106782565A - Voiceprint feature recognition method and system - Google Patents
Voiceprint feature recognition method and system
- Publication number
- CN106782565A CN106782565A CN201611075677.7A CN201611075677A CN106782565A CN 106782565 A CN106782565 A CN 106782565A CN 201611075677 A CN201611075677 A CN 201611075677A CN 106782565 A CN106782565 A CN 106782565A
- Authority
- CN
- China
- Prior art keywords
- voiceprint
- frequency
- voiceprint feature
- signal
- speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/26—Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
Abstract
An embodiment of the present invention provides a voiceprint feature recognition method and system. The method proceeds as follows: after auditory-property-based speech separation is applied to the preprocessed noisy mixed signal, the frequency cepstral coefficients and perceptual linear prediction coefficients of the signal are extracted; using a noise-background discrimination ratio, the frequency cepstral coefficients and perceptual linear prediction coefficients are analyzed under different noise environments to complete feature fusion; finally, against a pre-built voiceprint feature template library, the fused features are pattern-matched with a Gaussian mixture model-universal background model to accomplish voiceprint feature recognition. By combining human auditory properties with traditional voiceprint recognition methods, this recognition method solves, from a bionics perspective, the problem of reduced voiceprint recognition rate under noise, effectively improving both the accuracy of voiceprint feature recognition and the robustness of the system in noisy environments.
Description
Technical field
The present invention relates to the field of voice recognition technology, and in particular to a voiceprint feature recognition method and system.
Background art
As early as the 1930s, researchers in information science had begun to study voiceprint recognition. Early research focused on aural discrimination experiments and on verifying the feasibility of identification by listening. With breakthroughs in computer hardware and algorithms, voiceprint recognition research was no longer limited to discrimination by the human ear alone. Bell Laboratories has long held a leading position in the field of speech recognition; its researcher L. G. Kersta performed identification through analysis of speech spectrograms and was the first to propose the concept of the "voiceprint". With continued exploration and innovation in the field, automatic analysis and recognition of human speech signals by machine became possible. However, existing voiceprint feature recognition methods generally suffer from low recognition accuracy in noisy environments; system robustness is poor and application results are unsatisfactory.
Summary of the invention
The object of the present invention is to provide a voiceprint feature recognition method and system that address the above problems.
A preferred embodiment of the present invention provides a voiceprint feature recognition method, the method comprising:
preprocessing an input raw speech signal, the preprocessing including pre-emphasis, framing with windowing, and endpoint detection;
applying auditory-property-based speech separation to the noisy mixed signal obtained after preprocessing;
extracting the frequency cepstral coefficients and perceptual linear prediction coefficients of the separated signal;
using a noise-background discrimination ratio to analyze the frequency cepstral coefficients and perceptual linear prediction coefficients under different noise environments so as to complete feature fusion; and
in a pre-built voiceprint feature template library, pattern-matching the fused features with a Gaussian mixture model-universal background model (GMM-UBM) to accomplish voiceprint feature recognition.
Another embodiment of the present invention provides a voiceprint feature recognition system, the system comprising:
a preprocessing module for preprocessing the input raw speech signal, the preprocessing including pre-emphasis, framing with windowing, and endpoint detection;
a speech separation module for applying auditory-property-based speech separation to the noisy mixed signal obtained after preprocessing;
a feature extraction module for extracting the frequency cepstral coefficients and perceptual linear prediction coefficients of the separated signal;
a feature fusion module for using a noise-background discrimination ratio to analyze the frequency cepstral coefficients and perceptual linear prediction coefficients under different noise environments so as to complete feature fusion; and
a feature recognition module for pattern-matching the fused features, in a pre-built voiceprint feature template library, with a GMM-UBM model to accomplish voiceprint feature recognition.
The voiceprint feature recognition method and system provided by embodiments of the present invention combine human auditory properties with traditional voiceprint recognition methods, solving from a bionics perspective the problem of reduced recognition rate under noise and effectively improving both the accuracy of voiceprint recognition and the robustness of the system in noisy environments.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only certain embodiments of the present invention and should not be regarded as limiting its scope; those of ordinary skill in the art may derive other related drawings from them without creative effort.
Fig. 1 is a block diagram of a speech recognition device provided by an embodiment of the present invention;
Fig. 2 is a flow chart of a voiceprint feature recognition method provided by an embodiment of the present invention;
Fig. 3 is a diagram of the geometry of the interaural time difference provided by an embodiment of the present invention;
Fig. 4 is a functional block diagram of a voiceprint feature recognition system provided by an embodiment of the present invention.
Reference numerals: 100 - speech recognition device; 110 - voiceprint feature recognition system; 120 - memory; 130 - processor; 1102 - preprocessing module; 1104 - speech separation module; 1106 - feature extraction module; 1108 - feature fusion module; 1110 - feature recognition module.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments are described below clearly and completely with reference to the accompanying drawings. Evidently, the described embodiments are only a part of the embodiments of the present invention rather than all of them. The components of the embodiments of the present invention, as generally described and illustrated in the drawings herein, may be arranged and designed in a variety of configurations. The following detailed description of the embodiments provided in the drawings is therefore not intended to limit the scope of the claimed invention but merely represents selected embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art without creative effort, based on the embodiments of the present invention, fall within the scope of protection of the present invention.
Fig. 1 is a block diagram of a speech recognition device 100 provided by an embodiment of the present invention. The speech recognition device 100 includes a voiceprint feature recognition system 110, a memory 120, and a processor 130. The memory 120 and the processor 130 are electrically connected, directly or indirectly, for data transmission and interaction. The voiceprint feature recognition system 110 includes at least one software function module that may be stored in the memory 120 in the form of software or firmware, or solidified in the operating system of the speech recognition device 100. Under the control of a storage controller, the processor 130 accesses the memory 120 to execute the executable modules stored in it, such as the software function modules and computer programs included in the voiceprint feature recognition system 110.
Fig. 2 is a flow chart of a voiceprint feature recognition method, provided by an embodiment of the present invention, applied to the speech recognition device 100 shown in Fig. 1. It should be noted that the method provided by the present invention is not limited to the particular order shown in Fig. 2 and described below. The steps shown in Fig. 2 are described in detail in the following.
Step S101: preprocess the input raw speech signal; the preprocessing includes pre-emphasis, framing with windowing, and endpoint detection.
In this embodiment, the raw speech signal input to the speech recognition device 100 is first passed through a first-order FIR high-pass digital filter to realize pre-emphasis. Its transfer function is
H(z) = 1 - μz^(-1)
where the coefficient μ takes a value between 0 and 1 and may be determined from prior experience; 0.94 is commonly used.
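By way of illustration (not part of the original disclosure), the pre-emphasis filter above can be sketched in Python; the function name and the list-based signal representation are assumptions of the example:

```python
def preemphasis(signal, mu=0.94):
    """First-order FIR high-pass pre-emphasis, H(z) = 1 - mu * z^-1."""
    # y[n] = x[n] - mu * x[n-1]; the first sample is passed through unchanged.
    return [signal[0]] + [signal[n] - mu * signal[n - 1]
                          for n in range(1, len(signal))]
```

Applied to a constant signal, the filter leaves only the onset sample, which is the expected high-pass behavior.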
The speech signal obtained after pre-emphasis is then divided into frames, each multiplied by a moving window w(n - m) whose amplitude k is given by the window function and weights each sample of the frame. After framing and windowing, the speech signal can be expressed as
Q(n) = Σ_m T[x(m)] · w(n - m)
where T[·] denotes a functional transformation, x(m) the speech signal sequence, and Q(n) the time sequence obtained for each frame after processing.
Finally, the endpoints of the speech signal are detected. In this embodiment, endpoint detection is realized mainly through the short-time energy and the short-time zero-crossing rate.
Specifically, the short-time energy of frame t is expressed as
E_t = Σ_{n=1}^{N} S_t(n)²
where N is the analysis window width and S_t(n) is the n-th sample of the t-th frame of the speech signal.
The short-time zero-crossing rate is expressed as
Z_t = (1/2) Σ_{n=2}^{N} |sgn[S_t(n)] - sgn[S_t(n-1)]|
where sgn[·] is the sign function.
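As a minimal sketch (again not from the patent text, with the combination rule of the two cues an assumption of the example), the two endpoint-detection quantities can be computed as:

```python
def short_time_energy(frame):
    """E_t: sum of squared samples of one frame."""
    return sum(s * s for s in frame)

def zero_crossing_rate(frame):
    """Z_t = (1/2) * sum |sgn(s[n]) - sgn(s[n-1])| over the frame."""
    sgn = lambda x: 1 if x >= 0 else -1
    return 0.5 * sum(abs(sgn(frame[n]) - sgn(frame[n - 1]))
                     for n in range(1, len(frame)))

def is_speech(frame, energy_thresh, zcr_thresh):
    # A frame is kept as speech when either cue exceeds its threshold;
    # the thresholds are tuning parameters, not values from the patent.
    return short_time_energy(frame) > energy_thresh or \
           zero_crossing_rate(frame) > zcr_thresh
```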
Step S103: apply auditory-property-based speech separation to the noisy mixed signal obtained after preprocessing.
In this embodiment, the bionic separation of the speech signal proceeds as follows: the noisy mixed signal is decomposed into time-frequency units by a peripheral auditory model; the time-frequency units are then clustered according to speech separation cues; and the separated speech is finally output by a speech reconstruction model. The speech reconstruction model completes the clustering of the time-frequency units and the synthesis of the speech stream, and consists mainly of two parts: binary-mask clustering and a recombination model.
The masking model for the i-th frequency channel and the j-th time frame may be defined as
M(i, j) = 1 if f_i ≤ f_c and τ(i, j) ≥ T_τ(i, j); 1 if f_i > f_c and L(i, j) ≥ T_L(i, j); 0 otherwise,
where f_c = 1500 Hz is the critical frequency separating the high band from the middle and low bands, f_i is the frequency of the i-th channel, τ(i, j) and L(i, j) are the two separation cues of the unit at channel i and frame j, and T_τ(i, j) and T_L(i, j) are the thresholds of the two cues, respectively.
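A sketch of this two-cue binary mask in Python; the decision rule is reconstructed from the symbols above and should be read as an assumption of the example, not the patent's exact formula:

```python
def binary_mask(f_i, tau, level, tau_thresh, level_thresh, f_c=1500.0):
    """Binary time-frequency mask: the time-difference cue tau decides for
    channels at or below f_c; the level-difference cue decides above it."""
    if f_i <= f_c:
        return 1 if tau >= tau_thresh else 0
    return 1 if level >= level_thresh else 0
```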
To improve the fidelity of the reconstructed speech, the signal to be synthesized first undergoes prosody adjustment, which covers the amplitude, the length, and the pitch information of the speech. The amplitude adjustment of the speech signal can be realized by weighting each frame with a gain g, where τ is the signal frame length and n the frame shift.
The reconstruction formula is
ŝ(n) = Σ_j g · h_j(t_j - n) · s_j(n)
where ŝ(n) is the recombined signal obtained, t_j the synchronization mark of the recombination, h_j(n) the window function in the peripheral auditory model, and s_j(n) the short-time speech signal; the amplitude adjustment is then realized through the weight g.
In addition, in this embodiment, the speech separation cue may be the interaural time difference (ITD) or the interaural level difference (ILD). Starting from the way the human ear localizes sound, and simulating the process by which it discriminates sources, the cues ITD and ILD, which reflect the spatial azimuth information of sound, are applied to speech separation and effectively improve separation efficiency. The realization principles of ITD and ILD are briefly described below.
In human auditory speech separation, ITD is used mainly for processing middle- and low-frequency speech signals. For simplicity, this section illustrates the origin of ITD with a single source. Suppose a source is closer to the left ear; the speech signal reaching the left ear can then be represented as α sin 2πft, and the signal reaching the more distant right ear as (α - Δα) sin 2πf(t + Δt), where f is the frequency, Δt is the time-difference information, representing the time difference with which the sound reaches the two ears, i.e. the ITD, and Δα is the intensity-difference information, representing the difference in sound pressure at the two ears, i.e. the ILD. Using these two kinds of information, sources at different positions can be separated.
Fig. 3 shows the geometry of the interaural time difference. In Fig. 3, S is the source position, A and B are the left and right ears, D is the distance between them, θ is the angle between the source and the center of the head, and d is the path-length difference with which the sound reaches the two ears, expressed as d = D sin θ.
To compute the ITD, the input speech signal is first windowed; the window function is generally regarded as the unit impulse response of a filter. A Hamming window is chosen in this embodiment to keep the speech signal smooth within each short-time analysis frame. The Hamming window is expressed as
w(n) = 0.54 - 0.46 cos(2πn / (N - 1)), 0 ≤ n ≤ N - 1,
where N is the window length. The windowed signals are transformed to the frequency domain by the Fourier transform:
X_l(ω) = Σ_n x_l(n) w(n) e^(-jωn), X_r(ω) = Σ_n x_r(n) w(n) e^(-jωn).
The cross-correlation of the speech signals reaching the left and right ears can be expressed as
R_lr(τ) = E[x_l(t) x_r(t + τ)].
Normally, each transfer function h_l(t) and h_r(t) can be approximated by an amplitude decay factor and a time delay, so the cross-correlation can be expressed as
R_lr(τ) ≈ α R_ss(τ - D),
where α is the decay factor and D is the value of the ITD. By the above analysis, the ITD contributes to the separation of low-frequency speech signals; since the autocorrelation function R_ss reaches its maximum at τ = 0, the ITD value D can be expressed as
D = argmax_τ R_lr(τ).
The cross-power spectrum is defined from the Fourier transforms of the two signals as
P_lr(ω) = X_l(ω) X_r*(ω),
where X_r*(ω) is the complex conjugate of X_r(ω); taking the inverse Fourier transform of this formula gives the cross-correlation of the received signals:
R_lr(τ) = (1/2π) ∫ P_lr(ω) e^(jωτ) dω.
From the above it can be seen that the D value of the ITD depends only on the phase of the cross-power spectrum, so the cross-correlation can be normalized as
R̂_lr(τ) = (1/2π) ∫ [P_lr(ω) / |P_lr(ω)|] e^(jωτ) dω.
Thus, the D value of the ITD can be computed accurately as
D = argmax_τ R̂_lr(τ).
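The lag search behind the formulas above can be sketched directly in the time domain (a simplified stand-in for the frequency-domain derivation; the sample-based lag representation is an assumption of the example):

```python
def estimate_itd(left, right, max_lag):
    """Estimate the ITD (in samples) as the lag tau maximizing the
    cross-correlation R_lr(tau) = sum_n left[n] * right[n + tau]."""
    best_lag, best_val = 0, float("-inf")
    for tau in range(-max_lag, max_lag + 1):
        val = sum(left[n] * right[n + tau]
                  for n in range(len(left))
                  if 0 <= n + tau < len(right))
        if val > best_val:
            best_lag, best_val = tau, val
    return best_lag
```

For a pulse that arrives at the right ear one sample after the left, the estimate recovers a lag of one sample.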
ILD denotes the sound-pressure difference with which a source signal reaches the two ears. When the distances over which the sound travels to the left and right ears differ, a sound-pressure difference arises, and this information supplies speech separation with another cue, the ILD. Research shows that the ILD plays the greater role in the high-frequency region. Once the frequency of the speech signal exceeds 1500 Hz, the shadowing effect of peripheral auditory structures such as the pinna produces a pronounced sound shadow that hinders transmission of the signal to the inner ear. The main reason is that high-frequency speech has a shorter wavelength and diffracts around the pinna only with difficulty, while low-frequency sound can bend around it; therefore, to separate high-frequency speech signals, the interaural level difference must be extracted.
Computing the ILD requires spectral cues. Ignoring echoes, the energy spectra of the signals received at the left and right ears can be expressed by the following two formulas:
P_l(ω) = S(ω) |H_l(ω)|²
P_r(ω) = S(ω) |H_r(ω)|²
where S(ω) is the power spectrum of the source and H_l(ω) and H_r(ω) are the transfer functions of the left and right ears, respectively. The intensities at the two ears can therefore be expressed as
I_l(ω) = 10 log10 P_l(ω) = 10 log10 S(ω) + 20 log10 |H_l(ω)|
and
I_r(ω) = 10 log10 P_r(ω) = 10 log10 S(ω) + 20 log10 |H_r(ω)|.
Normally, the interaural level difference is used to extract the separation information of high-frequency speech signals, and in the logarithmic domain the relation between source and channel changes from multiplication to simple addition. This additive relation helps the subsequent ILD computation to extract the channel information.
After the intensities are computed, the speech signal passes through a cochlear filter. Extracting ILD information only in the high-frequency part not only reduces the size of the feature space but also simulates the frequency-selective resonance of the cochlea in the human auditory system.
Because ILD acts only on speech signals above 1500 Hz, there is a cutoff frequency for ILD extraction, computed as
f_cut = C / d_α,
where C is the propagation speed of sound in air and d_α is the physical aperture dimension; ILD cues are computed only for subbands that reach the cutoff frequency f_cut.
Hence, for each subband i reaching the cutoff frequency, the weighted subband powers
P_{l,i} = ∫_{Ω_i} W_i(ω) P_l(ω) dω, P_{r,i} = ∫_{Ω_i} W_i(ω) P_r(ω) dω
are formed, where Ω_i is the frequency range of subband i and W_i(ω) is the weight of the cochlear filter.
The ILD of each subband i is therefore defined as
ILD_i = 10 log10 (P_{l,i} / P_{r,i}).
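A discrete sketch of the subband ILD above (the discretized sum in place of the integral, and the small epsilon guard, are assumptions of the example):

```python
import math

def subband_ild_db(p_left, p_right, weights, eps=1e-12):
    """ILD_i = 10 log10 of the ratio of cochlear-weighted subband powers;
    p_left / p_right are per-bin power samples inside the subband."""
    pl = sum(w * p for w, p in zip(weights, p_left)) + eps
    pr = sum(w * p for w, p in zip(weights, p_right)) + eps
    return 10.0 * math.log10(pl / pr)
```

A left-ear power twice the right-ear power in every bin yields about +3 dB, as expected.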
Step S105: extract the frequency cepstral coefficients and perceptual linear prediction coefficients of the signal processed by speech separation.
It is well known that the characteristic parameters most commonly adopted in voiceprint recognition research are cepstral coefficients. Cepstral coefficients reflect the vocal-tract mechanism of human sound production, and the filter bank in the extraction process reflects human hearing characteristics. In this embodiment, the mel-frequency cepstral coefficients (MFCC) are improved upon: the frequency cepstral coefficients are extracted with a Gammatone filter bank.
In speech signal processing, a Gammatone filter bank functions much like the human auditory periphery: it can simulate the characteristics of the basilar membrane well and performs frequency division of the speech signal. The Meddis model, in turn, can complete the simulation of the inner hair cells and accurately describe the firing rate of the auditory nerve; together the two constitute a complete peripheral auditory model.
When a speech signal enters the human ear, it first passes through the frequency division of the basilar membrane, simulated here by the Gammatone filter bank, whose time-domain expression is
g_i(t) = t^(n-1) e^(-2π b_i t) cos(2π f_i t + φ_i) U(t), 1 ≤ i ≤ N,
where N is the number of filters, i the filter index, n the filter order (here n = 4), φ_i the initial phase of the filter, f_i the center frequency of each filter, b_i the decay factor, and U(t) the unit step function.
The bandwidth of each filter in the Gammatone filter bank is related to the critical bands of human hearing; the auditory critical band, measured as an equivalent rectangular bandwidth, is
ERB(f) = 24.7 × (4.37 f / 1000 + 1),
and for center frequency f_i the corresponding decay factor is
b_i = 1.019 ERB(f_i).
Applying the Laplace transform to g_i(t), converting to the z-transform, and finally taking the inverse transform yields the discrete impulse response of the Gammatone filter bank.
Step S107: using the noise-background discrimination ratio, analyze the frequency cepstral coefficients and perceptual linear prediction coefficients under different noise environments to complete feature fusion.
D_R is the ratio of the between-class variance to the within-class variance of a feature; it reflects the degree to which the features in the voiceprint feature template library discriminate between speakers, and this discrimination ratio effectively characterizes whether a voiceprint feature is suited to a noise environment. Obtaining the D_R values of a voiceprint feature under environments of different signal-to-noise ratio allows further analysis of the feature's robustness under noise. The expression for D_R is
D_R = [ (1/M) Σ_{i=1}^{M} (μ_i - μ)² ] / [ (1/(MN)) Σ_{i=1}^{M} Σ_{n=1}^{N} (x_i(n) - μ_i)² ],
where μ is the mean feature value over all speakers in the voiceprint feature template library, μ_i is the mean feature value of the i-th speaker, x_i(n) the feature value of the i-th speaker in frame n, M the number of speakers in the library, and N the number of speech signal frames per speaker.
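For one feature dimension, the ratio above can be computed as follows (a minimal sketch; the per-speaker list-of-lists input layout is an assumption of the example):

```python
def discrimination_ratio(features_per_speaker):
    """D_R = between-class variance / within-class variance for one
    feature dimension; input is one list of values per speaker."""
    means = [sum(v) / len(v) for v in features_per_speaker]
    mu = sum(means) / len(means)
    between = sum((m - mu) ** 2 for m in means) / len(means)
    within = sum(sum((x - m) ** 2 for x in v) / len(v)
                 for v, m in zip(features_per_speaker, means)) / len(means)
    return between / within
```

Two well-separated speakers with small internal spread give a large D_R, indicating a noise-robust dimension.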
After extraction, a speech feature is generally stored as a matrix and can be represented by multidimensional feature vectors. Studying the between-class discrimination ratio of each dimension of the feature vector reveals the robustness of every one-dimensional feature parameter under noise, and on this basis data fusion of different voiceprint features can be realized. Suppose features A and B are represented by X-dimensional and Y-dimensional feature vectors, respectively:
A = {α_1, α_2, ..., α_X}'
B = {β_1, β_2, ..., β_Y}'
A between-class discrimination analysis is applied to the two voiceprint features, giving the D_R matrices of feature A and feature B.
To study the per-dimension performance of the two voiceprint features under noise, features A and B are extracted for every speaker in the voiceprint feature template library under various signal-to-noise-ratio environments, and the number of times P that the maximum value D_Rmax occurs in each dimension of the feature matrix is counted.
To ensure that each vector of the fusion feature matrix carries an appropriate weight, a threshold P_th is set according to the statistics, its value chosen from the concrete results; after normalizing P_x and P_y, the dimensions satisfying
ε = max{P_x, P_y, P_th}
are retained, yielding the fused feature parameter C, expressed as
C = {γ_1, γ_2, ..., γ_Z}'.
Step S109: in the pre-built voiceprint feature template library, pattern-match the fused features with the Gaussian mixture model-universal background model to accomplish voiceprint feature recognition.
In this embodiment, the pattern-matching model is the Gaussian mixture model-universal background model (GMM-UBM). The essence of a Gaussian mixture model (GMM) is a multidimensional probability density function; for a d-dimensional GMM with mixture order M, the density can be expressed as a weighted sum of Gaussian functions:
p(x) = Σ_{i=1}^{M} w_i p_i(x),
where x is the d-dimensional observation vector, p_i is the i-th d-dimensional Gaussian component of the GMM with mean vector μ_i and covariance matrix Σ_i, and w_i is the mixture weight, satisfying Σ_{i=1}^{M} w_i = 1.
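The density above can be evaluated as follows for the common diagonal-covariance case (a sketch, not the patent's implementation; the diagonal restriction is an assumption of the example):

```python
import math

def gmm_log_density(x, weights, means, variances):
    """log p(x) for a diagonal-covariance GMM:
    p(x) = sum_i w_i * N(x; mu_i, diag(var_i))."""
    d = len(x)
    total = 0.0
    for w, mu, var in zip(weights, means, variances):
        log_det = sum(math.log(v) for v in var)          # log |Sigma_i|
        maha = sum((xj - mj) ** 2 / v                    # Mahalanobis term
                   for xj, mj, v in zip(x, mu, var))
        total += w * math.exp(-0.5 * (d * math.log(2 * math.pi)
                                      + log_det + maha))
    return math.log(total)
```

In standard GMM-UBM practice, a test utterance is scored by the difference between its average log density under the claimed speaker's GMM and under the universal background model.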
Fig. 4 is a functional block diagram of a voiceprint feature recognition system 110 provided by an embodiment of the present invention. The voiceprint feature recognition system 110 includes a preprocessing module 1102, a speech separation module 1104, a feature extraction module 1106, a feature fusion module 1108, and a feature recognition module 1110.
The preprocessing module 1102 preprocesses the input raw speech signal; the preprocessing includes pre-emphasis, framing with windowing, and endpoint detection.
The speech separation module 1104 applies auditory-property-based speech separation to the noisy mixed signal obtained after preprocessing.
The feature extraction module 1106 extracts the frequency cepstral coefficients and perceptual linear prediction coefficients of the separated signal.
The feature fusion module 1108 uses the noise-background discrimination ratio to analyze the frequency cepstral coefficients and perceptual linear prediction coefficients under different noise environments and complete feature fusion.
The feature recognition module 1110 pattern-matches the fused features, in the pre-built voiceprint feature template library, with the GMM-UBM model to accomplish voiceprint feature recognition.
For the concrete operation of each functional module described in this embodiment, refer to the detailed description of the corresponding steps shown in Fig. 2; it is not repeated here.
In sum, vocal print feature recognition methods provided in an embodiment of the present invention and system, solve to make an uproar from bionics angle
The problem of Application on Voiceprint Recognition rate reduction, effectively improves the accuracy rate and the robustness of system of Application on Voiceprint Recognition under noise circumstance under sound.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may also be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the flowcharts and block diagrams in the accompanying drawings show the possible architectures, functions, and operations of the apparatuses, methods, and computer program products of multiple embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that in some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functions involved. It should further be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks therein, may be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.
If the functions are implemented in the form of software functional modules and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present invention.
The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that can readily occur to those skilled in the art within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be determined by the appended claims.
Claims (10)
1. A voiceprint feature recognition method, characterised in that the method comprises:
preprocessing an input original speech signal, the preprocessing including pre-emphasis, framing with windowing, and endpoint detection;
performing speech separation processing based on auditory properties on the noisy mixed signal obtained after the preprocessing;
extracting frequency cepstral coefficients and perceptual linear prediction coefficients of the signal after the speech separation processing;
analyzing, using noise background discrimination, the frequency cepstral coefficients and the perceptual linear prediction coefficients under different noise environments to complete feature fusion;
in a pre-built voiceprint feature template library, performing pattern matching on the fused features using a Gaussian mixture model-universal background model, thereby realizing voiceprint feature recognition.
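The final matching step of claim 1 can be sketched with a toy GMM-UBM score. A real system would MAP-adapt the universal background model to each enrolled speaker; here, as a simplifying assumption, the speaker model is trained directly on that speaker's features, and acceptance is decided by the average log-likelihood ratio against the UBM. All data are synthetic.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
ubm_feats = rng.normal(0.0, 1.0, size=(500, 13))   # pooled background features
spk_feats = rng.normal(1.5, 1.0, size=(200, 13))   # enrolled speaker's features
test_spk = rng.normal(1.5, 1.0, size=(50, 13))     # utterance from the same speaker
test_imp = rng.normal(-1.5, 1.0, size=(50, 13))    # utterance from an impostor

# Universal background model and (simplified) speaker model
ubm = GaussianMixture(n_components=4, random_state=0).fit(ubm_feats)
spk = GaussianMixture(n_components=4, random_state=0).fit(spk_feats)

def llr(feats):
    """Average per-frame log-likelihood ratio: speaker model vs. the UBM."""
    return spk.score(feats) - ubm.score(feats)

accept = llr(test_spk) > llr(test_imp)
```

In practice a decision threshold calibrated on development data would replace the direct comparison of the two scores.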
2. The voiceprint feature recognition method according to claim 1, characterised in that the step of performing speech separation processing based on auditory properties on the noisy mixed signal obtained after the preprocessing comprises:
decomposing the noisy mixed signal to obtain multiple time-frequency units;
clustering the multiple time-frequency units obtained by the decomposition according to speech separation cues;
performing speech reconstruction on the clustered signal to be synthesized, and outputting the separated speech.
3. The voiceprint feature recognition method according to claim 2, characterised in that the speech separation cues include the interaural time difference and the interaural intensity difference.
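The two binaural cues of claim 3 can be estimated as follows: the interaural time difference (ITD) as the lag maximizing the cross-correlation between the two ear signals, and the interaural intensity difference (IID) as an energy ratio in decibels. The delayed-and-attenuated synthetic signal pair below is an assumption for illustration.

```python
import numpy as np

fs = 16000
t = np.arange(1024) / fs
left = np.sin(2 * np.pi * 300 * t)
true_lag = 8                                # samples by which the right ear lags
right = 0.5 * np.roll(left, true_lag)       # quieter, delayed copy of the left ear

def itd_samples(l, r, max_lag=32):
    """Lag (in samples) maximizing the circular cross-correlation of the ears."""
    lags = np.arange(-max_lag, max_lag + 1)
    corr = [np.dot(l, np.roll(r, -k)) for k in lags]
    return lags[int(np.argmax(corr))]

def iid_db(l, r):
    """Interaural intensity difference: left-to-right energy ratio in dB."""
    return 10 * np.log10(np.sum(l ** 2) / np.sum(r ** 2))

est_itd = itd_samples(left, right)          # recovers the 8-sample delay
est_iid = iid_db(left, right)               # 10*log10(4) ~ 6.02 dB
```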
4. The voiceprint feature recognition method according to claim 2 or 3, characterised in that the step of clustering the multiple time-frequency units obtained by the decomposition according to the speech separation cues comprises:
performing binary mask clustering on the multiple time-frequency units according to the masking model, wherein fi denotes the frequency of the i-th frequency channel, fc denotes the critical frequency between the high-frequency and mid-low-frequency regions, τ(i, j) denotes one separation cue of the i-th frequency channel and the j-th time frame, L(i, j) denotes the other separation cue of the i-th frequency channel and the j-th time frame, and Tτ(i, j) and Tl(i, j) denote the thresholds of the two separation cues, respectively.
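A hedged sketch of the frequency-dependent binary mask of claim 4: channels below the critical frequency fc are clustered by the time cue τ(i, j) against its threshold, while channels at or above fc use the level cue L(i, j) against its threshold. The cue values, thresholds, and channel layout below are made-up illustrative numbers, not values from the patent.

```python
import numpy as np

rng = np.random.default_rng(2)
n_channels, n_frames = 32, 20
freqs = np.linspace(100, 8000, n_channels)       # channel center frequencies (Hz)
f_c = 1500.0                                     # assumed critical frequency

tau = rng.uniform(0, 1, (n_channels, n_frames))  # time-based cue tau(i, j)
lvl = rng.uniform(0, 1, (n_channels, n_frames))  # level-based cue L(i, j)
t_tau, t_lvl = 0.5, 0.5                          # cue thresholds

def binary_mask(freqs, tau, lvl, f_c, t_tau, t_lvl):
    """Keep a time-frequency unit (mask = 1) when its cue passes the
    threshold selected by the channel's frequency region."""
    low = (freqs < f_c)[:, None]                 # broadcast over time frames
    return np.where(low, tau < t_tau, lvl < t_lvl).astype(int)

mask = binary_mask(freqs, tau, lvl, f_c, t_tau, t_lvl)
```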
5. The voiceprint feature recognition method according to claim 2, characterised in that the step of performing speech reconstruction on the clustered signal to be synthesized comprises:
performing prosody adjustment on the signal to be synthesized, the prosody including amplitude, duration, and pitch;
performing speech reconstruction on the prosody-adjusted signal according to the reconstruction formula, wherein tj denotes the synchronization mark of the reconstruction, hj(n) denotes the window function, a further term denotes the short-time speech signal, and gj denotes the amplitude adjustment weight.
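The reconstruction step of claim 5 is essentially windowed overlap-add: each short-time segment is weighted by a gain gj, windowed by hj, and summed at its synthesis mark tj. The sketch below adds a window-sum normalization (an assumption, since the patent's formula image is not reproduced) and, as a sanity check, recovers the interior of a reference signal exactly.

```python
import numpy as np

frame_len, hop = 400, 200
marks = hop * np.arange(10)                    # synthesis marks t_j
window = np.hanning(frame_len)                 # window function h_j(n)
gains = np.ones(len(marks))                    # amplitude adjustment weights g_j

rng = np.random.default_rng(3)
x = rng.standard_normal(marks[-1] + frame_len)            # reference signal
segments = np.stack([x[t:t + frame_len] for t in marks])  # short-time signals

def overlap_add(segments, window, marks, gains):
    """Sum g_j * h_j * s_j at each mark t_j, normalized by the window sum."""
    out = np.zeros(marks[-1] + len(window))
    norm = np.zeros_like(out)
    for seg, t, g in zip(segments, marks, gains):
        out[t:t + len(window)] += g * window * seg
        norm[t:t + len(window)] += window
    return out / np.maximum(norm, 1e-8)        # avoid division by zero at edges

y = overlap_add(segments, window, marks, gains)
```

With the 50 % hop used here, every interior sample is covered by at least one nonzero window value, so the normalized sum reproduces the reference signal away from the edges.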
6. The voiceprint feature recognition method according to claim 1, characterised in that the step of extracting the frequency cepstral coefficients and perceptual linear prediction coefficients of the signal after the speech separation processing comprises:
extracting the frequency cepstral coefficients of the signal after the speech separation processing based on a Gammatone filter bank.
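Claim 6's Gammatone-based cepstral coefficients (often called GFCCs in the literature) can be sketched by filtering with a bank of fourth-order gammatone impulse responses, taking log filter energies, and applying a DCT. The ERB bandwidth formula is the standard Glasberg-Moore one; the filter count, frequency range, and number of coefficients are assumptions for illustration.

```python
import numpy as np

fs, n_filters, n_ceps = 16000, 16, 8

def gammatone_ir(fc, fs, duration=0.025, order=4):
    """Gammatone impulse response: t^(n-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t)."""
    t = np.arange(int(duration * fs)) / fs
    b = 1.019 * (24.7 + 0.108 * fc)        # ERB bandwidth (Glasberg-Moore)
    return t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)

def gfcc(signal, fs):
    centers = np.linspace(100, 6000, n_filters)
    # Energy of each filter's output over the whole (short) signal
    energies = np.array([np.sum(np.convolve(signal, gammatone_ir(f, fs)) ** 2)
                         for f in centers])
    log_e = np.log(energies + 1e-12)
    # DCT-II of the log filter energies yields the cepstral coefficients
    k = np.arange(n_ceps)[:, None]
    n = np.arange(n_filters)[None, :]
    dct = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))
    return dct @ log_e

tone = np.sin(2 * np.pi * 500 * np.arange(800) / fs)
coeffs = gfcc(tone, fs)
```

A full implementation would compute these per frame and ERB-space the center frequencies; both are simplified here.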
7. A voiceprint feature recognition system, characterised in that the system comprises:
a preprocessing module, configured to preprocess an input original speech signal, the preprocessing including pre-emphasis, framing with windowing, and endpoint detection;
a speech separation module, configured to perform speech separation processing based on auditory properties on the noisy mixed signal obtained after the preprocessing;
a feature extraction module, configured to extract frequency cepstral coefficients and perceptual linear prediction coefficients of the signal after the speech separation processing;
a feature fusion module, configured to analyze, using noise background discrimination, the frequency cepstral coefficients and the perceptual linear prediction coefficients under different noise environments to complete feature fusion;
a feature recognition module, configured to perform, in a pre-built voiceprint feature template library, pattern matching on the fused features using a Gaussian mixture model-universal background model, thereby realizing voiceprint feature recognition.
8. The voiceprint feature recognition system according to claim 7, characterised in that the manner in which the speech separation module performs speech separation processing based on auditory properties on the noisy mixed signal obtained after the preprocessing comprises:
decomposing the noisy mixed signal to obtain multiple time-frequency units;
clustering the multiple time-frequency units obtained by the decomposition according to speech separation cues;
performing speech reconstruction on the clustered signal to be synthesized, and outputting the separated speech.
9. The voiceprint feature recognition system according to claim 8, characterised in that the manner in which the speech separation module clusters the multiple time-frequency units obtained by the decomposition according to the speech separation cues comprises:
performing binary mask clustering on the multiple time-frequency units according to the masking model, wherein fi denotes the frequency of the i-th frequency channel, fc denotes the critical frequency between the high-frequency and mid-low-frequency regions, τ(i, j) denotes one separation cue of the i-th frequency channel and the j-th time frame, L(i, j) denotes the other separation cue of the i-th frequency channel and the j-th time frame, and Tτ(i, j) and Tl(i, j) denote the thresholds of the two separation cues, respectively.
10. The voiceprint feature recognition system according to claim 7, characterised in that the manner in which the feature extraction module extracts the frequency cepstral coefficients and perceptual linear prediction coefficients of the signal after the speech separation processing comprises:
extracting the frequency cepstral coefficients of the signal after the speech separation processing based on a Gammatone filter bank.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611075677.7A CN106782565A (en) | 2016-11-29 | 2016-11-29 | A kind of vocal print feature recognition methods and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106782565A true CN106782565A (en) | 2017-05-31 |
Family
ID=58900777
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611075677.7A Pending CN106782565A (en) | 2016-11-29 | 2016-11-29 | A kind of vocal print feature recognition methods and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106782565A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9131295B2 (en) * | 2012-08-07 | 2015-09-08 | Microsoft Technology Licensing, Llc | Multi-microphone audio source separation based on combined statistical angle distributions |
CN103456312A (en) * | 2013-08-29 | 2013-12-18 | 太原理工大学 | Single channel voice blind separation method based on computational auditory scene analysis |
CN105609099A (en) * | 2015-12-25 | 2016-05-25 | 重庆邮电大学 | Speech recognition pretreatment method based on human auditory characteristic |
Non-Patent Citations (5)
Title |
---|
NICOLETA ROMAN et al.: "Speech segregation based on sound localization", IEEE *
刘继芳 (LIU Jifang): "Research on Mixed Speech Separation Based on Computational Auditory Scene Analysis", China Masters' Theses Full-text Database, Information Science and Technology *
徐鹤 (XU He): "Research on Voiceprint Recognition Algorithms in Urban Traffic Environments", China Masters' Theses Full-text Database, Information Science and Technology *
罗元 (LUO Yuan) et al.: "A New Robust Voiceprint Feature Extraction and Fusion Method", Computer Science *
陆虎敏 (LU Humin): "Aircraft Cockpit Display and Control Technology", 31 December 2015, Aviation Industry Press *
Cited By (35)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018223727A1 (en) * | 2017-06-09 | 2018-12-13 | 平安科技(深圳)有限公司 | Voiceprint recognition method, apparatus and device, and medium |
WO2019037426A1 (en) * | 2017-08-23 | 2019-02-28 | 武汉斗鱼网络科技有限公司 | Mfcc voice recognition method, storage medium, electronic device, and system |
CN107782548A (en) * | 2017-10-20 | 2018-03-09 | 韦彩霞 | One kind is based on to track vehicle parts detecting system |
CN107782548B (en) * | 2017-10-20 | 2020-07-07 | 亚太空列(河南)轨道交通有限公司 | Rail vehicle part detection system |
CN108124488A (en) * | 2017-12-12 | 2018-06-05 | 福建联迪商用设备有限公司 | A kind of payment authentication method and terminal based on face and vocal print |
CN108231082A (en) * | 2017-12-29 | 2018-06-29 | 广州势必可赢网络科技有限公司 | Updating method and device for self-learning voiceprint recognition |
CN108182945A (en) * | 2018-03-12 | 2018-06-19 | 广州势必可赢网络科技有限公司 | Voiceprint feature-based multi-person voice separation method and device |
CN110299143B (en) * | 2018-03-21 | 2023-04-11 | 现代摩比斯株式会社 | Apparatus for recognizing a speaker and method thereof |
CN110299143A (en) * | 2018-03-21 | 2019-10-01 | 现代摩比斯株式会社 | The devices and methods therefor of voice speaker for identification |
CN108564956A (en) * | 2018-03-26 | 2018-09-21 | 京北方信息技术股份有限公司 | A kind of method for recognizing sound-groove and device, server, storage medium |
CN108564956B (en) * | 2018-03-26 | 2021-04-20 | 京北方信息技术股份有限公司 | Voiceprint recognition method and device, server and storage medium |
CN108615532B (en) * | 2018-05-03 | 2021-12-07 | 张晓雷 | Classification method and device applied to sound scene |
CN108615532A (en) * | 2018-05-03 | 2018-10-02 | 张晓雷 | A kind of sorting technique and device applied to sound field scape |
CN109031202B (en) * | 2018-06-03 | 2022-10-04 | 桂林电子科技大学 | Indoor environment area positioning system and method based on auditory scene analysis |
CN109031202A (en) * | 2018-06-03 | 2018-12-18 | 桂林电子科技大学 | indoor environment area positioning system and method based on auditory scene analysis |
CN109192216A (en) * | 2018-08-08 | 2019-01-11 | 联智科技(天津)有限责任公司 | A kind of Application on Voiceprint Recognition training dataset emulation acquisition methods and its acquisition device |
CN108847253A (en) * | 2018-09-05 | 2018-11-20 | 平安科技(深圳)有限公司 | Vehicle model recognition methods, device, computer equipment and storage medium |
US11798531B2 (en) | 2018-10-25 | 2023-10-24 | Tencent Technology (Shenzhen) Company Limited | Speech recognition method and apparatus, and method and apparatus for training speech recognition model |
WO2020083110A1 (en) * | 2018-10-25 | 2020-04-30 | 腾讯科技(深圳)有限公司 | Speech recognition and speech recognition model training method and apparatus |
CN109410976B (en) * | 2018-11-01 | 2022-12-16 | 北京工业大学 | Speech enhancement method based on binaural sound source localization and deep learning in binaural hearing aid |
CN109410976A (en) * | 2018-11-01 | 2019-03-01 | 北京工业大学 | Sound enhancement method based on binaural sound sources positioning and deep learning in binaural hearing aid |
CN110364168A (en) * | 2019-07-22 | 2019-10-22 | 南京拓灵智能科技有限公司 | A kind of method for recognizing sound-groove and system based on environment sensing |
CN110364168B (en) * | 2019-07-22 | 2021-09-14 | 北京拓灵新声科技有限公司 | Voiceprint recognition method and system based on environment perception |
CN110473566A (en) * | 2019-07-25 | 2019-11-19 | 深圳壹账通智能科技有限公司 | Audio separation method, device, electronic equipment and computer readable storage medium |
CN110473553A (en) * | 2019-08-29 | 2019-11-19 | 南京理工大学 | The recognition methods of the elderly and physical disabilities speaker based on auditory system model |
WO2021042537A1 (en) * | 2019-09-04 | 2021-03-11 | 平安科技(深圳)有限公司 | Voice recognition authentication method and system |
CN110648553A (en) * | 2019-09-26 | 2020-01-03 | 北京声智科技有限公司 | Site reminding method, electronic equipment and computer readable storage medium |
CN111083284B (en) * | 2019-12-09 | 2021-06-11 | Oppo广东移动通信有限公司 | Vehicle arrival prompting method and device, electronic equipment and computer readable storage medium |
CN111083284A (en) * | 2019-12-09 | 2020-04-28 | Oppo广东移动通信有限公司 | Vehicle arrival prompting method and related product |
CN111477235A (en) * | 2020-04-15 | 2020-07-31 | 厦门快商通科技股份有限公司 | Voiceprint acquisition method, device and equipment |
CN112767949A (en) * | 2021-01-18 | 2021-05-07 | 东南大学 | Voiceprint recognition system based on binary weight convolutional neural network |
CN112863546A (en) * | 2021-01-21 | 2021-05-28 | 安徽理工大学 | Belt conveyor health analysis method based on audio characteristic decision |
CN113011506A (en) * | 2021-03-24 | 2021-06-22 | 华南理工大学 | Texture image classification method based on depth re-fractal spectrum network |
CN113011506B (en) * | 2021-03-24 | 2023-08-25 | 华南理工大学 | Texture image classification method based on deep fractal spectrum network |
CN113257266A (en) * | 2021-05-21 | 2021-08-13 | 特斯联科技集团有限公司 | Complex environment access control method and device based on voiceprint multi-feature fusion |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106782565A (en) | A kind of vocal print feature recognition methods and system | |
CN109830245B (en) | Multi-speaker voice separation method and system based on beam forming | |
CN105845127B (en) | Audio recognition method and its system | |
CN110970053B (en) | Multichannel speaker-independent voice separation method based on deep clustering | |
CN103456312B (en) | A kind of single-channel voice blind separating method based on Computational auditory scene analysis | |
CN109427328B (en) | Multichannel voice recognition method based on filter network acoustic model | |
CN110675891B (en) | Voice separation method and module based on multilayer attention mechanism | |
CN112331218B (en) | Single-channel voice separation method and device for multiple speakers | |
CN106057210B (en) | Quick speech blind source separation method based on frequency point selection under binaural distance | |
Wang et al. | On spatial features for supervised speech separation and its application to beamforming and robust ASR | |
CN110047478B (en) | Multi-channel speech recognition acoustic modeling method and device based on spatial feature compensation | |
CN107346664A (en) | A kind of ears speech separating method based on critical band | |
CN110111769A (en) | A kind of cochlear implant control method, device, readable storage medium storing program for executing and cochlear implant | |
CN108122559A (en) | Binaural sound sources localization method based on deep learning in a kind of digital deaf-aid | |
CN108091345A (en) | A kind of ears speech separating method based on support vector machines | |
CN105225672A (en) | Merge the system and method for the directed noise suppression of dual microphone of fundamental frequency information | |
CN103903632A (en) | Voice separating method based on auditory center system under multi-sound-source environment | |
CN107274887A (en) | Speaker's Further Feature Extraction method based on fusion feature MGFCC | |
CN108520756A (en) | A kind of method and device of speaker's speech Separation | |
CN111145726A (en) | Deep learning-based sound scene classification method, system, device and storage medium | |
Sainath et al. | Reducing the Computational Complexity of Multimicrophone Acoustic Models with Integrated Feature Extraction. | |
CN104778948A (en) | Noise-resistant voice recognition method based on warped cepstrum feature | |
CN105609099A (en) | Speech recognition pretreatment method based on human auditory characteristic | |
CN109448702A (en) | Artificial cochlea's auditory scene recognition methods | |
CN105845143A (en) | Speaker confirmation method and speaker confirmation system based on support vector machine |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20170531 |