CN109545229A - Speaker recognition method based on speech sample feature space trajectories - Google Patents
Speaker recognition method based on speech sample feature space trajectories Download PDF Info
- Publication number
- CN109545229A (application number CN201910027145.3A)
- Authority
- CN
- China
- Prior art keywords
- speech
- feature space
- speaker
- feature
- space
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/06—Decision making techniques; Pattern matching strategies
- G10L17/08—Use of distortion metrics or a particular distance between probe pattern and reference templates
Abstract
The invention discloses a speaker recognition method based on speech sample feature space trajectories. The method clusters unlabeled speech feature representations to obtain a speech feature space of landmark subclasses; registers speakers using labeled speech samples to obtain each speaker's distribution information and motion trajectory information in the speech feature space; and identifies speech samples to be recognized using the speakers' feature space distribution information and the samples' motion trajectory information. The invention adopts the idea of locating speaker speech features in a feature space, so its speaker recognition has low computational complexity, solving the high-complexity problem of GMM-UBM; moreover, the speaker speech feature space built for one language can serve as the feature space for speaker recognition in another language, realizing data sharing.
Description
Technical field
The present invention relates to the field of biometric recognition, and in particular to a speaker recognition method based on speech sample feature space trajectories.
Background technique
With the development of artificial intelligence, audio perception has become a hot spot of audio signal processing research, and audio classification or audio recognition is its key problem. In engineering applications, audio classification takes the form of speaker recognition, audio event recognition, audio event detection, and so on. Speaker recognition is an identity verification technology, a kind of biometric recognition. Biometric recognition is the automatic identification of individual identity from biological characteristics, and includes fingerprint recognition, iris recognition, gene identification, face recognition, etc. Compared with other identity verification technologies, speaker recognition is more convenient and natural, and is perceived as less invasive by users. It performs identification from the voice signal, offering natural human-computer interaction, easily acquired signals, and the possibility of remote identification.
Existing speaker recognition systems comprise two stages: training and recognition. In the training stage, the system builds a model for each speaker from collected speech; in the recognition stage, it matches input speech against the speaker models to reach a decision. The system must extract features from the voice signal that reflect speaker individuality, and build accurate models that distinguish one speaker from the others. Current audio classification techniques fall into two broad classes: generative statistical models, such as the Gaussian mixture model (GMM) and the hidden Markov model (HMM), and deep neural network methods such as DNN, RNN, and LSTM. Either class requires a large amount of labeled training samples, and to reach good recognition performance deep neural network methods demand even larger sample sizes. GMM- and HMM-based methods give no special treatment to the discriminative information between audio classes and do not consider sharing sample data across classes; for example, the method in the paper "Speaker Verification Using Adapted Gaussian Mixture Models" by Reynolds et al. of MIT (Digital Signal Processing 10 (2000), 19-41) has high computational complexity. With large sample support, deep neural network methods perform well; for example, Google's paper "End-to-End Text-Dependent Speaker Verification" (2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pages 5115-5119) extracts speech features and trains with a neural network. However, training a neural network requires a large amount of labeled speech, whose acquisition cost is very high, and deep neural network methods lack interpretability, being essentially black boxes.
Existing speaker recognition techniques thus tend to have high computational complexity and require large amounts of labeled speaker speech data for training, and collecting such labeled data demands enormous effort. A more convenient and effective speaker recognition method and system is therefore needed.
Summary of the invention
In view of the deficiencies of the prior art, the object of the present invention is to provide a speaker recognition method based on speech sample feature space trajectories. The speech feature space does not depend on speaker, text, or language, so it can be built from any qualified speech data, realizing data sharing; and a speaker's speech trajectory can be constructed even from a single sample, so the method does not require large amounts of labeled speech data, overcoming the prior art's defect of needing to collect large labeled corpora.
The purpose of the present invention can be achieved through the following technical solution:
A speaker recognition method based on speech sample feature space trajectories, in which a speech sample is regarded as one movement through the speech feature space, characterized by its activity region and its trajectory within that space. The method comprises the following steps:
Step 1), build the speech feature space Ω: cluster unlabeled speech samples in the feature space using a clustering method; the resulting subclasses, each represented by some expression of its data, form the expression of the speech feature space, Ω = {g_k, k = 1, 2, …, K}.
Step 2), build speaker knowledge: using clean speech samples labeled with speaker identity, obtain each speaker's distribution information and motion trajectory information on the speech feature space Ω.
Step 3), speaker recognition: for a speech sample to be identified, first obtain its feature space distribution expression and trajectory; then, using the speakers' feature space distribution information, compute the difference between the sample distribution and each prior distribution, together with the accumulated local distribution difference along the trajectory, and use these as the basis for the speaker recognition decision.
Further, in building the speech feature space Ω in step 1), any clean speech samples can be used, with no constraint on speaker or language.
Further, in step 1) the speech samples are clustered in the feature space using K-means or another clustering method. The speech feature space expression Ω = {g_k, k = 1, 2, …, K} may consist of class data distribution functions (e.g. Gaussian distribution functions), cluster centroid vectors, or generative models (e.g. hidden Markov models or neural networks), that is, any marker with localization ability, called a feature space landmark. The number K of landmark subclasses used determines the granularity of the speech feature space expression: the larger K, the finer the expression. On the other hand, the accuracy of the space expression is related to the data scale: the richer the data, the more complete the expression; meanwhile, the more targeted the data used to build the space, the more accurate the expression for a particular problem.
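The construction in step 1) can be sketched with plain K-means (one of the clustering methods the text names), keeping each subclass's centroid and a diagonal variance as its landmark. This is an illustrative sketch under assumptions, not the patent's exact procedure; the function name `build_feature_space` and the tiny K are inventions for the example.

```python
import numpy as np

def build_feature_space(features, K, iters=20, seed=0):
    """Cluster unlabeled frame features and return landmark subclasses.

    Each landmark g_k is represented by a mean vector m_k and a diagonal
    variance U_k of its subclass, one of the expressions the text allows
    (centroids, distributions, or generative models)."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), K, replace=False)]
    for _ in range(iters):  # plain K-means (Lloyd's algorithm)
        d = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = features[labels == k].mean(axis=0)
    landmarks = []
    for k in range(K):
        members = features[labels == k]
        # variance floor keeps the landmark Gaussian invertible
        var = members.var(axis=0) + 1e-6 if len(members) else np.ones(features.shape[1])
        landmarks.append((centers[k], var))
    return landmarks

# Toy usage: 2-D stand-ins for MFCC frames drawn around two clusters.
rng = np.random.default_rng(1)
frames = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(3, 0.1, (50, 2))])
omega = build_feature_space(frames, K=2)
```

In the patent's embodiment the landmarks are instead the components of a GMM with K = 4096, with the mixture weights discarded; the K-means variant here keeps the example dependency-free.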
Further, in step 2), the speech feature space is labeled using the clean speech samples that carry speaker labels, with Gaussian distributions g_k(m_k, U_k) as the space landmarks, and the speaker feature space distribution information is obtained as follows:
One, compute the position association degree of each speech sample feature f_t with the landmark g_k(m_k, U_k); each landmark is represented by a multidimensional Gaussian distribution, with m_k denoting the mean vector and U_k the covariance matrix of the k-th multidimensional Gaussian.
Two, compute the expected value of the position association degree between the speaker's sample set and the landmark g_k(m_k, U_k), taken over the association degrees of the t-th frame feature of each n-th sample with g_k(m_k, U_k).
Three, compute the speaker's feature space distribution from these expected values.
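The registration computation of step 2) can be sketched as follows. The exact association formula appears only as an image in the source; the sketch assumes (and this is an assumption, not the patent's stated formula) that it is the posterior probability of the frame under equal-weight Gaussian landmarks with diagonal covariances, computed in the log domain for stability.

```python
import numpy as np

def association(frame, landmarks):
    """Position association degree of one frame with every landmark.

    Assumed here to be the posterior under equal-weight diagonal
    Gaussians g_k(m_k, U_k), i.e. a softmax over log-likelihoods."""
    logp = []
    for m, var in landmarks:
        diff = frame - m
        logp.append(-0.5 * np.sum(diff * diff / var + np.log(2 * np.pi * var)))
    logp = np.array(logp)
    w = np.exp(logp - logp.max())  # numerically stable softmax
    return w / w.sum()

def speaker_distribution(frames, landmarks):
    """Expected association over all registration frames -> P = (p_1, ..., p_K)."""
    return np.mean([association(f, landmarks) for f in frames], axis=0)

# Toy landmarks and registration frames split evenly between them.
landmarks = [(np.zeros(2), np.ones(2)), (np.full(2, 3.0), np.ones(2))]
P = speaker_distribution(np.vstack([np.zeros((5, 2)), np.full((5, 2), 3.0)]), landmarks)
```

By symmetry the toy speaker's distribution is close to (0.5, 0.5): half its frames sit on each landmark.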
Further, in step 2), the motion trajectory timing information of a speaker's speech sample on the speech feature space Ω is expressed as the neighborhood sequence Ψ_1 Ψ_2 … Ψ_T of the sample's features, where the δ-neighborhood of feature f_t is Ψ_t = {g_k | d_tk < δ} and d_tk is the Mahalanobis distance between the speech sample feature and the landmark distribution, i.e. d_tk = √((f_t − m_k)ᵀ U_k⁻¹ (f_t − m_k)).
Further, the decision threshold δ of the neighborhood Ψ_t = {g_k | d_tk < δ} of the speech sample feature f_t follows from the properties of the normal distribution; choose 2 < δ < 3.
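The δ-neighborhood trajectory can be sketched directly from the Mahalanobis definition above; diagonal covariances are assumed here to keep the example minimal.

```python
import numpy as np

def trajectory(frames, landmarks, delta=2.5):
    """Neighborhood sequence Psi_1 ... Psi_T: for each frame, the set of
    landmark indices whose Mahalanobis distance d_tk falls below delta
    (2 < delta < 3, per the normal-distribution rule of thumb in the text)."""
    traj = []
    for f in frames:
        psi = set()
        for k, (m, var) in enumerate(landmarks):
            d = np.sqrt(np.sum((f - m) ** 2 / var))  # diagonal Mahalanobis distance
            if d < delta:
                psi.add(k)
        traj.append(psi)
    return traj

# Two well-separated landmarks; each frame lands in exactly one neighborhood.
landmarks = [(np.zeros(2), np.ones(2)), (np.full(2, 4.0), np.ones(2))]
traj = trajectory(np.array([[0.1, 0.1], [4.0, 3.9]]), landmarks)
```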
Further, in step 3), the speaker recognition process for a speech sample f = {f_1, f_2, …, f_T} comprises the following steps:
One, compute the distribution P = (p_1, p_2, …, p_K) of f on the speech feature space Ω, where each p_k is the expected association degree of the sample's frame features with the landmark g_k.
Two, determine the motion trajectory Ψ_1 Ψ_2 … Ψ_T of f in the speech feature space Ω, where Ψ_t = {g_k | d_tk < δ}.
Three, compute the distance between the sample distribution P = (p_1, p_2, …, p_K) and the prior feature space distribution of each speaker s (a Minkowski distance of order α over the component differences), then screen a candidate solution set S_p that contains the true solution.
Four, compute the trajectory distance metric of Ψ_1 Ψ_2 … Ψ_T and select the most likely solution from S_p, completing the speaker recognition.
Specifically, in step 3), using only the spatial distribution information P = (p_1, p_2, …, p_K) or only the trajectory distance can already yield good speaker recognition performance.
Compared with the prior art, the present invention has the following advantages and beneficial effects:
1. In the provided speaker recognition method based on speech sample feature space trajectories, the speech feature space is established by clustering a large amount of speech features and requires no labeled data; the data samples used to establish it may come from different speakers, with no strict requirements on speech content, speaker age, or language. This overcomes the problem in neural network methods of needing large amounts of labeled speech, and the data acquisition for establishing the speech space is easy to carry out.
2. The method is based on the localization and trajectory information of speaker speech features in the speech feature space, which differs from source-generating modeling approaches such as the hidden Markov model (HMM): localization is relative, whereas a generative model is absolute. Compared with deep neural network methods, it is interpretable: every piece of knowledge has a physical meaning. For example, the association-degree distribution P = (p_1, p_2, …, p_K) of the sample features on the space Ω expresses both the sample's activity region (the space represented by the landmark subclasses corresponding to the nonzero elements) and the distribution within that region.
3. The essence of the method is the localization of speech features in the space: the features of different speakers are located on the established speech feature space, and the association-degree location information of different speakers expresses the distinctions between speakers with a small amount of computation. Compared with GMM or HMM methods, which need to build a generative model for every speaker, it has lower computational complexity.
4. The landmark subclasses of the speech feature space form a reference frame for locating speaker speech features; their relation to the sample to be identified is relative, without strict coupling, so the feature space is shareable, and an established speech feature space can be transferred to other speaker datasets for recognition. For example, the speaker speech feature space of one language can serve as the speech feature space for speaker recognition in another language, realizing data sharing.
Detailed description of the invention
Fig. 1 is the overall flowchart of the speaker recognition method in Embodiment 1 of the present invention.
Fig. 2 is the flowchart of establishing the speech feature space in Embodiment 1 of the present invention.
Fig. 3 is the flowchart of generating the speaker's feature space distribution information and trajectory information in Embodiment 1 of the present invention.
Fig. 4 is the flowchart of identifying speech samples to be recognized in Embodiment 1 of the present invention.
Specific embodiment
The present invention will now be described in further detail with reference to the embodiments and the accompanying drawings, but the embodiments of the present invention are not limited thereto.
Embodiment 1:
This embodiment provides a speaker recognition method based on speech sample feature space trajectories; the overall flowchart is shown in Fig. 1 and comprises the following three steps:
1) Establish the speech feature space Ω. Any clean speech samples can be used, with no constraint on speaker, language, or other factors; the speech samples are then clustered in the feature space with a clustering method, and the data expression of the resulting subclasses forms the expression of the speech feature space, {g_k, k = 1, 2, …, K}.
2) Build the speaker knowledge, comprising two parts: the speaker's distribution information and motion trajectory information in the speech feature space.
3) For the speech samples to be identified, perform recognition using the speakers' feature space distribution information and the samples' motion trajectory information.
With reference to Fig. 2, the flowchart of establishing the speech feature space in this embodiment: speaker speech data from the aishell Chinese corpus is used as the unlabeled speech sample set. aishell contains 400 speakers in total; 60 wav files per speaker are selected to train the speech feature space. 12-dimensional MFCC features are extracted from the unlabeled speech sample set X = {x_1, x_2, …, x_N} to obtain the feature set F = {f_1, f_2, …, f_{t_N}}, where each f_t is a short-time frame feature and t_N is the total number of frames over all samples.
The feature sequence is then used to train a GMM with K mixture components; the GMM weights are discarded, and each Gaussian component is retained as a landmark subclass of the speech feature space. Here K is the number of audio feature space landmarks, chosen as K = 4096 to describe the audio feature space with high precision.
The speech feature space landmarks are expressed as Ω = {g_k, k = 1, 2, …, K}, where g = N(m, U) is a multidimensional Gaussian distribution function.
With reference to Fig. 3, the flowchart of generating the speaker feature space distribution information in this embodiment: for each speaker, 20 wav files from aishell are used to label the speech feature space. The target speaker sample set is Y = {(y_1, s_1), (y_2, s_2), …, (y_M, s_M)}, s_i ∈ S = {S_l, l = 1, 2, …, L} (the speaker set); the samples of speaker S_l are Y_l = {y_m | s_m = S_l, m = 1, 2, …, M}, from which the audio feature sequence F_l = {f_1, f_2, …, f_{t_l}} is extracted. Compute the position association degree of every speech sample feature f_t with each landmark g_k(m_k, U_k); then compute the expected value of the position association degree between the speaker's sample set and the landmark g_k(m_k, U_k), taken over the associations of the t-th frame feature of each n-th sample with g_k(m_k, U_k); finally compute the speaker's feature space distribution.
Processing the registration speech of every speaker in the target speaker set in this way yields the speech feature distribution information of each speaker.
The motion trajectory timing information of a speech sample is expressed as the neighborhood sequence Ψ_1 Ψ_2 … Ψ_T of the sample's features, where the δ-neighborhood of feature f_t is Ψ_t = {g_k | d_tk < δ} and d_tk is the Mahalanobis distance between the feature and the distribution, i.e. d_tk = √((f_t − m_k)ᵀ U_k⁻¹ (f_t − m_k)).
With reference to Fig. 4, the flowchart of identifying speech samples in this embodiment. The features of the speech sample to be identified are f = {f_1, f_2, …, f_T}, and the recognition process is as follows:
Compute the distribution P = (p_1, p_2, …, p_K) of f on the feature space Ω, from the position association degree of each feature f_t with each landmark g_k(m_k, U_k) and its expected value over the sample to be identified.
Determine the trajectory Ψ_1 Ψ_2 … Ψ_T of the speech features f = {f_1, f_2, …, f_T} in the feature space Ω, where Ψ_t = {g_k | d_tk < δ}.
Compute the distance between the sample distribution P = (p_1, p_2, …, p_K) and the prior feature space distribution of each speaker s, with α taken as 2, and screen the candidate solution set S_p containing the true solution: the 10 speakers with the smallest distance are selected as candidate recognition results.
Compute the trajectory distance metric of Ψ_1 Ψ_2 … Ψ_T, with α taken as 2, and from the 10 candidate speakers select the speaker with the smallest trajectory distance as the recognition result.
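The two-stage decision above can be sketched as follows. The source gives the trajectory distance metric only as an image; this sketch assumes it accumulates the α-norm distribution difference over the landmarks of each neighborhood Ψ_t, which matches the "accumulated local distribution difference along the trajectory" wording but remains an assumption. The names `recognize` and `traj_dist` are inventions for the example.

```python
import numpy as np

def recognize(P, traj, enrolled, alpha=2, top_n=2):
    """Two-stage speaker decision sketch.

    P        : distribution of the test sample over the K landmarks
    traj     : list of neighborhoods Psi_t (sets of landmark indices)
    enrolled : dict mapping speaker -> prior distribution P^s"""
    # Stage 1: screen candidates by alpha-norm distribution distance.
    dist = {s: np.sum(np.abs(P - Ps) ** alpha) ** (1 / alpha)
            for s, Ps in enrolled.items()}
    candidates = sorted(dist, key=dist.get)[:top_n]

    # Stage 2: accumulate local distribution differences along the trajectory
    # (assumed form of the trajectory distance metric).
    def traj_dist(Ps):
        return sum(np.sum(np.abs(P[list(psi)] - Ps[list(psi)]) ** alpha) ** (1 / alpha)
                   for psi in traj if psi)

    return min(candidates, key=lambda s: traj_dist(enrolled[s]))

# Toy enrollment over K = 2 landmarks; the test sample resembles "alice".
enrolled = {"alice": np.array([0.8, 0.2]), "bob": np.array([0.2, 0.8])}
who = recognize(np.array([0.75, 0.25]), [{0}, {0, 1}], enrolled)
```

In the embodiment the screening keeps the 10 nearest speakers (`top_n=10`) before the trajectory metric makes the final choice.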
Embodiment 2:
This embodiment provides a speaker recognition method based on speech sample feature space trajectories, comprising the following steps:
Step 1, establish the speech feature space landmark subclasses using speech data from the English corpus TIMIT;
Step 2, register the target speaker set using speech data from the aishell corpus, as in Embodiment 1;
Step 3, identify the speech samples to be recognized, as in Embodiment 1.
The recognition performance obtained shows only a small gap from that of Embodiment 1, which proves that the speaker speech feature space built from one language can serve as the speech feature space for speaker recognition in another language, realizing data sharing.
The above are only preferred embodiments of the present patent, but the scope of protection of the present patent is not limited thereto. Any equivalent substitution or change made by a person skilled in the art, within the scope disclosed by the present patent, according to its technical solution and inventive concept, falls within the scope of protection of the present patent.
Claims (7)
1. A speaker recognition method based on speech sample feature space trajectories, wherein a speech sample is regarded as one movement through the speech feature space, characterized by its activity region and its trajectory within that space, the method comprising the following steps:
Step 1), building the speech feature space Ω: clustering unlabeled speech samples in the feature space using a clustering method, the resulting subclass data generating, as some expression of the subclass data, the expression of the speech feature space Ω = {g_k, k = 1, 2, …, K};
Step 2), building speaker knowledge: using clean speech samples labeled with speaker identity to obtain the speaker's distribution information and motion trajectory information on the speech feature space Ω;
Step 3), speaker recognition: for a speech sample to be identified, first obtaining the sample's feature space distribution expression and trajectory, then using the speakers' feature space distribution information to compute the difference between the sample distribution and the prior distribution, together with the accumulated local distribution difference along the trajectory, as the basis for the speaker recognition decision.
2. The speaker recognition method based on speech sample feature space trajectories according to claim 1, wherein in building the speech feature space Ω in step 1), any clean speech samples can be used, with no constraint on speaker or language.
3. The speaker recognition method based on speech sample feature space trajectories according to claim 1, wherein the speech feature space expression Ω = {g_k, k = 1, 2, …, K} may consist of class data distribution functions, cluster centroid vectors, or generative models, that is, markers with localization ability, called feature space landmarks; the number K of landmark subclasses used determines the granularity of the speech feature space expression: the larger K, the finer the expression.
4. The speaker recognition method based on speech sample feature space trajectories according to claim 1, wherein in step 2) the speech feature space is labeled using clean speech samples that carry speaker labels, with Gaussian distributions g_k(m_k, U_k) as the space landmarks, and the speaker feature space distribution information is obtained as follows:
One, computing the position association degree of each speech sample feature f_t with the landmark g_k(m_k, U_k); each landmark is represented by a multidimensional Gaussian, with m_k denoting the mean vector and U_k the covariance matrix of the k-th multidimensional Gaussian;
Two, computing the expected value of the position association degree between the speaker's sample set and the landmark g_k(m_k, U_k), taken over the association degrees of the t-th frame feature of each n-th sample with g_k(m_k, U_k);
Three, computing the speaker's feature space distribution from these expected values.
5. The speaker recognition method based on speech sample feature space trajectories according to claim 4, wherein in step 2) the motion trajectory timing information of the speaker's speech sample on the speech feature space Ω is expressed as the neighborhood sequence Ψ_1 Ψ_2 … Ψ_T of the sample's features, the δ-neighborhood of feature f_t being Ψ_t = {g_k | d_tk < δ}, where d_tk is the Mahalanobis distance between the speech sample feature and the landmark distribution, i.e. d_tk = √((f_t − m_k)ᵀ U_k⁻¹ (f_t − m_k)).
6. The speaker recognition method based on speech sample feature space trajectories according to claim 5, wherein the decision threshold δ of the neighborhood Ψ_t = {g_k | d_tk < δ} of the speech sample feature f_t follows from the properties of the normal distribution, choosing 2 < δ < 3.
7. The speaker recognition method based on speech sample feature space trajectories according to claim 5, wherein in step 3) the speaker recognition process for the speech sample f = {f_1, f_2, …, f_T} comprises the following steps:
One, computing the distribution P = (p_1, p_2, …, p_K) of f on the speech feature space Ω;
Two, determining the motion trajectory Ψ_1 Ψ_2 … Ψ_T of f in the speech feature space Ω, where Ψ_t = {g_k | d_tk < δ};
Three, computing the distance between the sample distribution P = (p_1, p_2, …, p_K) and the prior feature space distribution of each speaker s, then screening the candidate solution set S_p containing the true solution;
Four, computing the trajectory distance metric of Ψ_1 Ψ_2 … Ψ_T and selecting the most likely solution from S_p, completing the speaker recognition.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910027145.3A CN109545229B (en) | 2019-01-11 | 2019-01-11 | Speaker recognition method based on voice sample characteristic space track |
PCT/CN2019/111530 WO2020143263A1 (en) | 2019-01-11 | 2019-10-16 | Speaker identification method based on speech sample feature space trajectory |
SG11202103091XA SG11202103091XA (en) | 2019-01-11 | 2019-10-16 | A Speaker Recognition Method Based on Trajectories in Feature Spaces of Voice Samples |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910027145.3A CN109545229B (en) | 2019-01-11 | 2019-01-11 | Speaker recognition method based on voice sample characteristic space track |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109545229A true CN109545229A (en) | 2019-03-29 |
CN109545229B CN109545229B (en) | 2023-04-21 |
Family
ID=65835222
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910027145.3A Active CN109545229B (en) | 2019-01-11 | 2019-01-11 | Speaker recognition method based on voice sample characteristic space track |
Country Status (3)
Country | Link |
---|---|
CN (1) | CN109545229B (en) |
SG (1) | SG11202103091XA (en) |
WO (1) | WO2020143263A1 (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111081261A (en) * | 2019-12-25 | 2020-04-28 | 华南理工大学 | Text-independent voiceprint recognition method based on LDA |
CN111128128A (en) * | 2019-12-26 | 2020-05-08 | 华南理工大学 | Voice keyword detection method based on complementary model scoring fusion |
WO2020143263A1 (en) * | 2019-01-11 | 2020-07-16 | 华南理工大学 | Speaker identification method based on speech sample feature space trajectory |
CN111933156A (en) * | 2020-09-25 | 2020-11-13 | 广州佰锐网络科技有限公司 | High-fidelity audio processing method and device based on multiple feature recognition |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112487978B (en) * | 2020-11-30 | 2024-04-16 | 清华珠三角研究院 | Method and device for positioning speaker in video and computer storage medium |
CN113611285B (en) * | 2021-09-03 | 2023-11-24 | 哈尔滨理工大学 | Language identification method based on stacked bidirectional time sequence pooling |
CN117235435B (en) * | 2023-11-15 | 2024-02-20 | 世优(北京)科技有限公司 | Method and device for determining audio signal loss function |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6067517A (en) * | 1996-02-02 | 2000-05-23 | International Business Machines Corporation | Transcription of speech data with segments from acoustically dissimilar environments |
CN1652206A (en) * | 2005-04-01 | 2005-08-10 | 郑方 | Sound veins identifying method |
JP2009063773A (en) * | 2007-09-05 | 2009-03-26 | Nippon Telegr & Teleph Corp <Ntt> | Speech feature learning device and speech recognition device, and method, program and recording medium thereof |
CN102024455A (en) * | 2009-09-10 | 2011-04-20 | 索尼株式会社 | Speaker recognition system and method |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5598507A (en) * | 1994-04-12 | 1997-01-28 | Xerox Corporation | Method of speaker clustering for unknown speakers in conversational audio data |
CN102479511A (en) * | 2010-11-23 | 2012-05-30 | 盛乐信息技术(上海)有限公司 | Large-scale voiceprint authentication method and system |
CN105845141A (en) * | 2016-03-23 | 2016-08-10 | 广州势必可赢网络科技有限公司 | Speaker confirmation model, speaker confirmation method and speaker confirmation device based on channel robustness |
US10637898B2 (en) * | 2017-05-24 | 2020-04-28 | AffectLayer, Inc. | Automatic speaker identification in calls |
CN109065028B (en) * | 2018-06-11 | 2022-12-30 | 平安科技(深圳)有限公司 | Speaker clustering method, speaker clustering device, computer equipment and storage medium |
CN109065059A (en) * | 2018-09-26 | 2018-12-21 | 新巴特(安徽)智能科技有限公司 | The method for identifying speaker with the voice cluster that audio frequency characteristics principal component is established |
CN109545229B (en) * | 2019-01-11 | 2023-04-21 | 华南理工大学 | Speaker recognition method based on voice sample characteristic space track |
2019
- 2019-01-11 CN CN201910027145.3A patent/CN109545229B/en active Active
- 2019-10-16 SG SG11202103091XA patent/SG11202103091XA/en unknown
- 2019-10-16 WO PCT/CN2019/111530 patent/WO2020143263A1/en active Application Filing
Non-Patent Citations (2)
Title |
---|
Zhou Lei et al., "Research on a Novel Text-Dependent Speaker Recognition Method", Journal of Shanghai Normal University (Natural Science Edition) * |
Deng Haojiang et al., "Research on Text-Independent Speaker Recognition Based on Cluster Statistics", Journal of Circuits and Systems * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020143263A1 (en) * | 2019-01-11 | 2020-07-16 | 华南理工大学 | Speaker identification method based on speech sample feature space trajectory |
CN111081261A (en) * | 2019-12-25 | 2020-04-28 | 华南理工大学 | Text-independent voiceprint recognition method based on LDA |
CN111081261B (en) * | 2019-12-25 | 2023-04-21 | 华南理工大学 | Text-independent voiceprint recognition method based on LDA |
CN111128128A (en) * | 2019-12-26 | 2020-05-08 | 华南理工大学 | Voice keyword detection method based on complementary model scoring fusion |
CN111128128B (en) * | 2019-12-26 | 2023-05-23 | 华南理工大学 | Voice keyword detection method based on complementary model scoring fusion |
CN111933156A (en) * | 2020-09-25 | 2020-11-13 | 广州佰锐网络科技有限公司 | High-fidelity audio processing method and device based on multiple feature recognition |
Also Published As
Publication number | Publication date |
---|---|
CN109545229B (en) | 2023-04-21 |
WO2020143263A1 (en) | 2020-07-16 |
SG11202103091XA (en) | 2021-04-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109545229A (en) | Speaker recognition method based on speech sample feature space trajectory | |
Kamper et al. | An embedded segmental k-means model for unsupervised segmentation and clustering of speech | |
Ryu et al. | Out-of-domain detection based on generative adversarial network | |
CN111597328B (en) | New event theme extraction method | |
CN105810191B (en) | Merge the Chinese dialects identification method of prosodic information | |
CN111128128B (en) | Voice keyword detection method based on complementary model scoring fusion | |
CN111696522B (en) | Tibetan language voice recognition method based on HMM and DNN | |
CN114625879A (en) | Short text clustering method based on self-adaptive variational encoder | |
Debnath et al. | RETRACTED ARTICLE: Audio-Visual Automatic Speech Recognition Towards Education for Disabilities | |
CN106448660B (en) | It is a kind of introduce big data analysis natural language smeared out boundary determine method | |
Yao et al. | Real time large vocabulary continuous sign language recognition based on OP/Viterbi algorithm | |
Rodríguez-Serrano et al. | Unsupervised writer adaptation of whole-word HMMs with application to word-spotting | |
Kesiraju et al. | Topic identification of spoken documents using unsupervised acoustic unit discovery | |
Celikyilmaz et al. | Exploiting distance based similarity in topic models for user intent detection | |
Farooq et al. | Mispronunciation detection in articulation points of Arabic letters using machine learning | |
Huang et al. | Generation of phonetic units for mixed-language speech recognition based on acoustic and contextual analysis | |
Ping | English speech recognition method based on hmm technology | |
CN114943235A (en) | Named entity recognition method based on multi-class language model | |
Martínez-Hinarejos et al. | Spanish Sign Language Recognition with Different Topology Hidden Markov Models. | |
Ge et al. | Accent classification with phonetic vowel representation | |
Xu et al. | Research on continuous sign language sentence recognition algorithm based on weighted key-frame | |
Wujisguleng | [Retracted] The Mongolian Vowel Acoustic Model Based on the Clustering Algorithm | |
Liu | Research on Tibetan Speech Endpoint Detection Method Based on Extreme Learning Machine | |
Sheng | Research on English Language Learning Algorithm Based on Speech Recognition Confidence | |
Krishnaveni et al. | Performance evaluation of Statistical classifiers using Indian Sign language datasets |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||