CN109065059A - Method for identifying a speaker with a voice cluster built from principal components of audio features - Google Patents
- Publication number
- CN109065059A (application CN201811118265.6A)
- Authority
- CN
- China
- Prior art keywords
- speaker
- audio
- principal component
- new
- cluster
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/06—Decision making techniques; Pattern matching strategies
- G10L17/08—Use of distortion metrics or a particular distance between probe pattern and reference templates
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Business, Economics & Management (AREA)
- Game Theory and Decision Science (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
The invention discloses a method for identifying speakers with a voice cluster built from principal components of audio features. The method combines principal component analysis with hierarchical clustering of Euclidean distances between audio features in principal-component space. Specifically: collect different training audio sample sets; compute the time-domain and frequency-domain audio features of each sample; compute the mean and standard deviation of these features; perform principal component analysis on the training samples from the computed data; represent each audio sample by the coordinates of its feature data projected onto the top N principal components; and cluster the speakers with the UPGMA algorithm based on distances in the N-dimensional space. The method is fast and makes it convenient to add a new speaker's voice. Applied to an intelligent language tutoring system, it realizes speaker identification, distinguishing speakers in time from sessions with multiple unknown participants, which benefits targeted teaching.
Description
Technical field:
The invention belongs to the field of speaker recognition technology, and in particular relates to a method for identifying speakers with a voice cluster built from principal components of audio features.
Background art:
Speaker identification is a pattern recognition problem. Technologies for processing and storing voiceprints include frequency estimation, hidden Markov models, Gaussian mixture models, pattern-matching algorithms, matrix representations, vector quantization, support vector machines and decision trees; some systems also use "anti-speaker" techniques such as cohort models and world models. In recent years neural networks, especially deep neural networks and convolutional neural networks, have been widely applied to speech recognition with great success, and similar techniques have also been used for speaker identification. However, existing session-identification technology not only needs large amounts of voice data but also long training times, which makes it inconvenient for some applications.
At present, service robots are not especially mature either internationally or domestically. A conversational robot must not only understand what you are saying but also follow conversations among several people at once, which is difficult for a robot: when different voices and intonations mingle, it cannot keep up with the dialogue smoothly. Session-identification technology in the prior art is therefore difficult to apply in practice, and this application provides a method for identifying speakers with a voice cluster built from principal components of audio features to break through this technical barrier.
Summary of the invention:
The purpose of the present invention is to provide a method for identifying speakers with a voice cluster built from principal components of audio features, so that an intelligent language tutoring system can recognize speakers and distinguish them in time from sessions with multiple unknown participants.
To achieve the above objectives, the present invention adopts the following technical scheme:
The method of the present invention for identifying speakers with a voice cluster built from principal components of audio features mainly combines principal component analysis (PCA) with hierarchical clustering of Euclidean distances between audio features in principal-component space, and specifically includes the following steps:
1) Collect different training audio sample sets;
2) Compute the time-domain and frequency-domain audio features of each sample with the algorithms provided by Librosa; these features mainly include zero-crossing rate, root-mean-square energy, spectral centroid and bandwidth, Mel-frequency cepstral coefficients (MFCCs), and pitch class or chroma;
3) Separately compute the mean and standard deviation of the above time-domain and frequency-domain features;
4) Perform principal component analysis on the training samples from the computed data, selecting the top N components that explain 95% of the variance;
5) Represent each audio sample by the coordinates of its feature data projected onto the above N principal components;
6) Cluster the speakers with the UPGMA algorithm based on distances in the N-dimensional space.
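As an illustrative sketch (not part of the patent text) of steps 2) through 5), the per-sample feature statistics and the PCA projection might be computed as follows. The patent names Librosa for feature extraction; to stay self-contained, this sketch computes a reduced feature set (zero-crossing rate, RMS energy, spectral centroid and bandwidth, omitting MFCCs and chroma) directly with NumPy, and every function name here is hypothetical.

```python
import numpy as np

def frame_features(y, sr, frame=2048, hop=512):
    """Per-frame time/frequency features: zero-crossing rate, RMS energy,
    spectral centroid and bandwidth (MFCCs/chroma omitted in this sketch)."""
    rows = []
    for start in range(0, len(y) - frame + 1, hop):
        w = y[start:start + frame]
        zcr = np.mean(np.abs(np.diff(np.sign(w))) > 0)
        rms = np.sqrt(np.mean(w ** 2))
        spec = np.abs(np.fft.rfft(w))
        freqs = np.fft.rfftfreq(frame, d=1.0 / sr)
        centroid = np.sum(freqs * spec) / (np.sum(spec) + 1e-12)
        bandwidth = np.sqrt(np.sum(((freqs - centroid) ** 2) * spec)
                            / (np.sum(spec) + 1e-12))
        rows.append([zcr, rms, centroid, bandwidth])
    return np.array(rows)

def sample_vector(y, sr):
    """Step 3): mean and standard deviation of each feature over all frames."""
    f = frame_features(y, sr)
    return np.concatenate([f.mean(axis=0), f.std(axis=0)])

def pca_fit(X, var_kept=0.95):
    """Step 4): PCA on the standardized feature matrix, keeping the top N
    components that explain 95% of the variance; returns the projection
    coordinates of step 5) plus the fitted statistics and basis."""
    mu, sd = X.mean(axis=0), X.std(axis=0) + 1e-12
    Z = (X - mu) / sd
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)
    n = int(np.searchsorted(np.cumsum(explained), var_kept)) + 1
    coords = Z @ Vt[:n].T  # step 5): coordinates along the N principal components
    return coords, mu, sd, Vt[:n]
```

In practice the feature list of step 2) would be filled out with Librosa's MFCC and chroma extractors; the statistics-then-PCA structure stays the same.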
Clustering by distance in the N-dimensional space proceeds by first merging the closest speakers into a cluster or branch, whose coordinates are the average of the speakers (leaves) it contains, and continuing in this way until all speakers have been added to clusters, forming a tree.
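UPGMA corresponds to the "average" linkage method of standard hierarchical-clustering libraries. Under that assumption (the patent does not name a library), the tree-building step just described might be sketched as:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def build_speaker_tree(coords):
    """UPGMA (average-linkage) hierarchical clustering over Euclidean
    distances between speaker coordinates in principal-component space.
    Each merge joins the two nearest clusters; a merged cluster's position
    is the average of the leaves it contains, as the description states."""
    return linkage(coords, method='average', metric='euclidean')

# Hypothetical example: six coordinate points forming three obvious pairs.
coords = np.array([[0.0, 0.0], [0.1, 0.0],
                   [5.0, 5.0], [5.1, 5.0],
                   [10.0, 0.0], [10.1, 0.0]])
tree = build_speaker_tree(coords)                  # (n-1) x 4 merge table
labels = fcluster(tree, t=3, criterion='maxclust')  # cut into 3 clusters
```

Here `fcluster` only illustrates that the tree can be cut into groups; the patent itself keeps the full tree and later compares new audio against its branches and leaves.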
Further, the speaker in new audio is identified as follows:
Read or record the new speech, first compute the new audio's feature data, and convert it into projection coordinates in the N-dimensional principal-component space;
Compare the branches and leaves of the existing cluster tree with the new audio to find the closest speaker, i.e., compute the similarity between the new audio and the closest speaker, specifically:
First compute the distance d, then compute the matching score s by a pair of equations (given in the original as images not reproduced in this text): one expression applies when d ≤ r_ave and the other when d > r_ave, where r_ave and r_sd are the mean and standard deviation of the distances from the closest speaker's audio-feature coordinate samples to their center, and cdf is the normal cumulative distribution function.
If the score s is at or above a specified cutoff value, the new audio and the closest speaker are the same speaker; otherwise the new audio comes from a new speaker.
The coordinates of the newly acquired audio are added to the cluster tree as a new entry, to be used for further identifying voice from this new speaker; a new voice cluster tree is thus formed.
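The score formulas themselves are images that did not survive extraction, so the sketch below is a plausible reconstruction, not the patent's actual equations: it assumes a perfect score inside the cluster's average radius and a score that decays by the normal cdf outside it, consistent with the stated ingredients (d, r_ave, r_sd, a piecewise split at d = r_ave, and the normal cumulative distribution function). All names and the cutoff value are hypothetical.

```python
import numpy as np
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal cumulative distribution function (the 'cdf' in the text)."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def match_score(d, r_ave, r_sd):
    """ASSUMED form of the piecewise score: 1 when the probe lies within the
    cluster's average radius, decaying toward 0 by the normal cdf beyond it."""
    if d <= r_ave:
        return 1.0
    return 2.0 * (1.0 - normal_cdf((d - r_ave) / r_sd))

def identify(new_coord, leaf_coords, leaf_speakers, cutoff=0.05):
    """Find the nearest enrolled leaf, score the probe against that speaker's
    cluster, and accept only if the score clears the (arbitrary) cutoff."""
    dists = np.linalg.norm(leaf_coords - new_coord, axis=1)
    i = int(np.argmin(dists))
    mask = leaf_speakers == leaf_speakers[i]
    center = leaf_coords[mask].mean(axis=0)
    radii = np.linalg.norm(leaf_coords[mask] - center, axis=1)
    r_ave, r_sd = radii.mean(), radii.std() + 1e-12
    s = match_score(np.linalg.norm(new_coord - center), r_ave, r_sd)
    return (leaf_speakers[i] if s >= cutoff else None), s
```

Under these assumptions a probe inside a tight cluster scores 1.0, while a far-away probe scores near 0 and is treated as a new speaker.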
The beneficial effects of the present invention are:
(1) Compared with the prior art, the method of the present invention for identifying speakers needs only one set of different voice files to train and build an initial cluster tree; the audio to be identified can be entirely different from the training voices, and once the initial cluster tree is built no further training is needed: new speech can be recognized directly and new speakers' voices added.
(2) The method of the present invention uses a special algorithm that resolves who is speaking concisely, quickly and accurately; it is fast and makes adding a new speaker's voice convenient.
(3) Applied to an intelligent language tutoring system, the method of the present invention realizes speaker identification, distinguishing speakers in time from sessions with multiple unknown participants, which benefits targeted teaching.
Description of the drawings:
Fig. 1 is a flowchart of building the speaker voice cluster in the specific embodiment of the invention;
Fig. 2 is a flowchart of identifying a speaker's voice in the specific embodiment of the invention.
Specific embodiment:
The technical solution of the present invention is described in further detail below with reference to the drawings and embodiments.
Referring to Fig. 1, on the basis of speaker identification the present invention first builds the speaker voice cluster by combining principal component analysis (PCA) with hierarchical clustering of Euclidean distances between audio features in principal-component space. The specific steps are as follows:
(1) Read the training voice files;
(2) Compute the voice features, i.e., the time-domain and frequency-domain audio features of each training voice file, mainly including zero-crossing rate, root-mean-square energy, spectral centroid and bandwidth, Mel-frequency cepstral coefficients (MFCCs), and pitch class or chroma;
(3) Find the principal components of the voice features, i.e., compute the mean and standard deviation of the above voice features and perform principal component analysis;
(4) Compute the coordinates in the voice-feature principal-component space, i.e., select from the principal components the top N that explain 95% of the variance and use them as the coordinates of the N-dimensional projection;
(5) Aggregate the voices based on distance in principal-component space, and save a trained voice cluster.
A voice cluster library built as above from the principal components of the speakers' speech audio features is exemplified in Table 1 below:
Table 1 Voice cluster library built from the principal components of speakers' speech audio features (the table itself is an image not reproduced in this text)
The voice cluster library of Table 1 is scored against the parameter set obtained from feature analysis to identify whether a speaker is in the voiceprint model library.
Referring to Fig. 2, the saved voice cluster is processed with the UPGMA clustering algorithm: the closest speakers are clustered into a cluster or branch whose coordinates are the average of the speakers (leaves) it contains, and this continues until all speakers have been added to clusters, forming a tree. When new speech arrives, the steps of identifying the speaker by the method of the present invention are as follows:
(1) On the basis of the read trained voice cluster, read or record the new speech;
(2) Compute the features of the new speech;
(3) Compute the coordinates of the new speech features in principal-component space, i.e., convert the new speech features into N-dimensional principal-component projection coordinates;
(4) Find the voice in the trained voice cluster nearest to the new speech, i.e., compare the branches and leaves of the existing cluster tree with the new speech to find the closest speaker;
(5) Compute the similarity between the new speech and the closest speaker, specifically: first compute the distance d, then compute the matching score s by a pair of equations (given in the original as images not reproduced in this text), one applying when d ≤ r_ave and the other when d > r_ave, where r_ave and r_sd are the mean and standard deviation of the distances from the closest speaker's audio-feature coordinate samples to their center, and cdf is the normal cumulative distribution function;
(6) If the score s is at or above the specified cutoff value, the new speech and the nearest voice are from the same speaker; otherwise the new speech comes from a new speaker;
(7) Add the acquired new speech to the cluster tree as a new entry, forming a new voice cluster tree.
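Steps (3) and (7) above, projecting a new feature vector with the saved training statistics and enrolling its coordinates as a new entry, can be sketched as follows. This is illustrative only: the names `mu`, `sd` and `basis` assume the standardization statistics and principal-component basis were kept from training, which the patent implies but does not spell out.

```python
import numpy as np

def project_new(feature_vec, mu, sd, basis):
    """Step (3): standardize a new feature vector with the training mean/std
    and project it onto the stored principal-component basis (rows of `basis`)."""
    return basis @ ((feature_vec - mu) / sd)

def enroll(coords, speakers, new_coord, speaker_id):
    """Step (7): append the new audio's coordinates and speaker label as a new
    entry; the cluster tree is then rebuilt from the enlarged coordinate set."""
    return np.vstack([coords, new_coord]), np.append(speakers, speaker_id)
```

After `enroll`, rerunning the UPGMA step over the enlarged coordinate set yields the "new voice cluster tree" of step (7).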
Claims (4)
1. A method for identifying a speaker with a voice cluster built from principal components of audio features, characterized in that the method combines principal component analysis with hierarchical clustering of Euclidean distances between audio features in principal-component space, and specifically includes the following steps:
1) collecting different training audio sample sets;
2) computing the time-domain and frequency-domain audio features of each sample with the algorithms provided by Librosa;
3) separately computing the mean and standard deviation of the above time-domain and frequency-domain audio features;
4) performing principal component analysis on the training samples from the computed data, and selecting the top N components that explain 95% of the variance;
5) representing each audio sample by the coordinates of its feature data projected onto the above N principal components;
6) clustering the speakers with the UPGMA algorithm based on distances in the N-dimensional space.
2. The method for identifying a speaker with a voice cluster built from principal components of audio features according to claim 1, characterized in that the time-domain and frequency-domain audio features of the samples in step 2) include zero-crossing rate, root-mean-square energy, spectral centroid and bandwidth, Mel-frequency cepstral coefficients, and pitch class or chroma.
3. The method for identifying a speaker with a voice cluster built from principal components of audio features according to claim 1, characterized in that clustering based on distance in the N-dimensional space in step 6) specifically means first clustering the closest speakers into a cluster or branch whose coordinates are the average of the speakers (leaves) it contains, and continuing in this way until all speakers are added to clusters, forming a tree.
4. A method for identifying the speaker in new audio using the method according to any one of claims 1 to 3, characterized in that the method includes the following steps:
reading or recording the new speech, first computing the new audio's feature data, and converting it into N-dimensional principal-component projection coordinates;
comparing the branches and leaves of the existing cluster tree with the new audio to find the closest speaker, i.e., computing the similarity between the new audio and the closest speaker, specifically: first computing the distance d, then computing the matching score s by a pair of equations (given in the original as images not reproduced in this text), one applying when d ≤ r_ave and the other when d > r_ave, where r_ave and r_sd are the mean and standard deviation of the distances from the closest speaker's audio-feature coordinate samples to their center and cdf is the normal cumulative distribution function;
if the score s is at or above a specified cutoff value, the new audio and the closest speaker are the same speaker, otherwise the new audio comes from a new speaker;
and adding the coordinates of the acquired new audio to the cluster tree as a new entry, to be used for further identifying voice from this new speaker, thus forming a new voice cluster tree.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811118265.6A CN109065059A (en) | 2018-09-26 | 2018-09-26 | The method for identifying speaker with the voice cluster that audio frequency characteristics principal component is established |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811118265.6A CN109065059A (en) | 2018-09-26 | 2018-09-26 | The method for identifying speaker with the voice cluster that audio frequency characteristics principal component is established |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109065059A true CN109065059A (en) | 2018-12-21 |
Family
ID=64765876
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811118265.6A Withdrawn CN109065059A (en) | 2018-09-26 | 2018-09-26 | The method for identifying speaker with the voice cluster that audio frequency characteristics principal component is established |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109065059A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109800299A (en) * | 2019-02-01 | 2019-05-24 | 浙江核新同花顺网络信息股份有限公司 | A kind of speaker clustering method and relevant apparatus |
CN110135492A (en) * | 2019-05-13 | 2019-08-16 | 山东大学 | Equipment fault diagnosis and method for detecting abnormality and system based on more Gauss models |
WO2020143263A1 (en) * | 2019-01-11 | 2020-07-16 | 华南理工大学 | Speaker identification method based on speech sample feature space trajectory |
CN112019786A (en) * | 2020-08-24 | 2020-12-01 | 上海松鼠课堂人工智能科技有限公司 | Intelligent teaching screen recording method and system |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1178467A1 (en) * | 2000-07-05 | 2002-02-06 | Matsushita Electric Industrial Co., Ltd. | Speaker verification and identification in their own spaces |
JP2013061402A (en) * | 2011-09-12 | 2013-04-04 | Nippon Telegr & Teleph Corp <Ntt> | Spoken language estimating device, method, and program |
CN103413551A (en) * | 2013-07-16 | 2013-11-27 | 清华大学 | Sparse dimension reduction-based speaker identification method |
CN104538035A (en) * | 2014-12-19 | 2015-04-22 | 深圳先进技术研究院 | Speaker recognition method and system based on Fisher supervectors |
CN107342077A (en) * | 2017-05-27 | 2017-11-10 | 国家计算机网络与信息安全管理中心 | A kind of speaker segmentation clustering method and system based on factorial analysis |
- 2018-09-26: CN application CN201811118265.6A filed; published as CN109065059A (en); status not active (Withdrawn)
Non-Patent Citations (2)
Title |
---|
Zhang Wenlin et al., "Regularization-based eigenvoice speaker adaptation method", Acta Automatica Sinica |
Fang Erqing et al., "Automatic age estimation method based on audio-visual information", Journal of Software |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109065059A (en) | The method for identifying speaker with the voice cluster that audio frequency characteristics principal component is established | |
CN102509547B (en) | Method and system for voiceprint recognition based on vector quantization based | |
CN106847292B (en) | Method for recognizing sound-groove and device | |
CN104036774B (en) | Tibetan dialect recognition methods and system | |
CN108922541B (en) | Multi-dimensional characteristic parameter voiceprint recognition method based on DTW and GMM models | |
CN108986824B (en) | Playback voice detection method | |
CN102324232A (en) | Method for recognizing sound-groove and system based on gauss hybrid models | |
WO2019153404A1 (en) | Smart classroom voice control system | |
CN105469784B (en) | A kind of speaker clustering method and system based on probability linear discriminant analysis model | |
CN107342077A (en) | A kind of speaker segmentation clustering method and system based on factorial analysis | |
CN107393554A (en) | In a kind of sound scene classification merge class between standard deviation feature extracting method | |
CN103811009A (en) | Smart phone customer service system based on speech analysis | |
CN105261367B (en) | A kind of method for distinguishing speek person | |
CN106128465A (en) | A kind of Voiceprint Recognition System and method | |
CN109215665A (en) | A kind of method for recognizing sound-groove based on 3D convolutional neural networks | |
CN110457432A (en) | Interview methods of marking, device, equipment and storage medium | |
CN1808567A (en) | Voice-print authentication device and method of authenticating people presence | |
CN109346084A (en) | Method for distinguishing speek person based on depth storehouse autoencoder network | |
CN110047504A (en) | Method for distinguishing speek person under identity vector x-vector linear transformation | |
CN108735200A (en) | A kind of speaker's automatic marking method | |
CN109961794A (en) | A kind of layering method for distinguishing speek person of model-based clustering | |
CN110299150A (en) | A kind of real-time voice speaker separation method and system | |
CN107358947A (en) | Speaker recognition methods and system again | |
CN109377981A (en) | The method and device of phoneme alignment | |
CN106898355A (en) | A kind of method for distinguishing speek person based on two modelings |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | ||

Application publication date: 2018-12-21