CN109065059A - Method for identifying a speaker with a voice cluster built from principal components of audio features - Google Patents
- Publication number
- CN109065059A (application CN201811118265.6A)
- Authority
- CN
- China
- Prior art keywords
- speaker
- audio
- principal component
- new
- cluster
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/06—Decision making techniques; Pattern matching strategies
- G10L17/08—Use of distortion metrics or a particular distance between probe pattern and reference templates
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Business, Economics & Management (AREA)
- Game Theory and Decision Science (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
The invention discloses a method for identifying speakers with a voice cluster built from principal components of audio features. The method combines principal component analysis with hierarchical clustering of Euclidean distances between audio features in principal-component space. Specifically: collect different training audio sample sets; compute the time-domain and frequency-domain audio features of each sample; compute the mean and standard deviation of these features; perform principal component analysis on the training samples from the computed data; represent each audio sample by the coordinates of its feature data projected onto the top N principal components; and cluster the speakers with the UPGMA algorithm based on distances in the N-dimensional space. The method is fast and makes it convenient to add a new speaker's voice. Applied to an intelligent language tutoring system, it realizes speaker identification, distinguishing speakers in time from sessions with multiple unknown participants, which benefits targeted teaching.
Description
Technical field:
The invention belongs to the field of speaker recognition technology, and in particular relates to a method for identifying speakers with a voice cluster built from principal components of audio features.
Background art:
Speaker identification is a pattern recognition problem. Technologies for processing and storing voiceprints include frequency estimation, hidden Markov models, Gaussian mixture models, pattern-matching algorithms, matrix representations, vector quantization, support vector machines and decision trees; some systems also use "anti-speaker" techniques such as cohort models and world models. In recent years neural networks, especially deep neural networks and convolutional neural networks, have been widely applied to speech recognition with great success, and similar techniques have also been used for speaker identification. However, existing session-identification technology not only needs large amounts of voice data but also long training times, which makes it inconvenient for some applications.
At present, service robots are not especially mature either internationally or domestically. A conversational robot must not only understand what you are saying but also follow conversations among several people at once, which is difficult for a robot: when different voices and intonations mingle, it cannot keep up with the dialogue smoothly. Session-identification technology in the prior art is therefore difficult to apply in practice, and this application provides a method for identifying speakers with a voice cluster built from principal components of audio features to break through this technical barrier.
Summary of the invention:
The purpose of the present invention is to provide a method for identifying speakers with a voice cluster built from principal components of audio features, so that an intelligent language tutoring system can recognize speakers and distinguish them in time from sessions with multiple unknown participants.
To achieve the above objectives, the present invention adopts the following technical scheme:
The method of the present invention for identifying speakers with a voice cluster built from principal components of audio features mainly combines principal component analysis (PCA) with hierarchical clustering of Euclidean distances between audio features in principal-component space, and specifically includes the following steps:
1) Collect different training audio sample sets;
2) Compute the time-domain and frequency-domain audio features of each sample with the algorithms provided by Librosa; these features mainly include zero-crossing rate, root-mean-square energy, spectral centroid and bandwidth, Mel-frequency cepstral coefficients (MFCCs), and pitch class or chroma;
3) Separately compute the mean and standard deviation of the above time-domain and frequency-domain features;
4) Perform principal component analysis on the training samples from the computed data, selecting the top N components that explain 95% of the variance;
5) Represent each audio sample by the coordinates of its feature data projected onto the above N principal components;
6) Cluster the speakers with the UPGMA algorithm based on distances in the N-dimensional space.
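As an illustrative sketch (not part of the patent text) of steps 2) through 5), the per-sample feature statistics and the PCA projection might be computed as follows. The patent names Librosa for feature extraction; to stay self-contained, this sketch computes a reduced feature set (zero-crossing rate, RMS energy, spectral centroid and bandwidth, omitting MFCCs and chroma) directly with NumPy, and every function name here is hypothetical.

```python
import numpy as np

def frame_features(y, sr, frame=2048, hop=512):
    """Per-frame time/frequency features: zero-crossing rate, RMS energy,
    spectral centroid and bandwidth (MFCCs/chroma omitted in this sketch)."""
    rows = []
    for start in range(0, len(y) - frame + 1, hop):
        w = y[start:start + frame]
        zcr = np.mean(np.abs(np.diff(np.sign(w))) > 0)
        rms = np.sqrt(np.mean(w ** 2))
        spec = np.abs(np.fft.rfft(w))
        freqs = np.fft.rfftfreq(frame, d=1.0 / sr)
        centroid = np.sum(freqs * spec) / (np.sum(spec) + 1e-12)
        bandwidth = np.sqrt(np.sum(((freqs - centroid) ** 2) * spec)
                            / (np.sum(spec) + 1e-12))
        rows.append([zcr, rms, centroid, bandwidth])
    return np.array(rows)

def sample_vector(y, sr):
    """Step 3): mean and standard deviation of each feature over all frames."""
    f = frame_features(y, sr)
    return np.concatenate([f.mean(axis=0), f.std(axis=0)])

def pca_fit(X, var_kept=0.95):
    """Step 4): PCA on the standardized feature matrix, keeping the top N
    components that explain 95% of the variance; returns the projection
    coordinates of step 5) plus the fitted statistics and basis."""
    mu, sd = X.mean(axis=0), X.std(axis=0) + 1e-12
    Z = (X - mu) / sd
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)
    n = int(np.searchsorted(np.cumsum(explained), var_kept)) + 1
    coords = Z @ Vt[:n].T  # step 5): coordinates along the N principal components
    return coords, mu, sd, Vt[:n]
```

In practice the feature list of step 2) would be filled out with Librosa's MFCC and chroma extractors; the statistics-then-PCA structure stays the same.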
Clustering by distance in the N-dimensional space proceeds by first merging the closest speakers into a cluster or branch, whose coordinates are the average of the speakers (leaves) it contains, and continuing in this way until all speakers have been added to clusters, forming a tree.
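UPGMA corresponds to the "average" linkage method of standard hierarchical-clustering libraries. Under that assumption (the patent does not name a library), the tree-building step just described might be sketched as:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def build_speaker_tree(coords):
    """UPGMA (average-linkage) hierarchical clustering over Euclidean
    distances between speaker coordinates in principal-component space.
    Each merge joins the two nearest clusters; a merged cluster's position
    is the average of the leaves it contains, as the description states."""
    return linkage(coords, method='average', metric='euclidean')

# Hypothetical example: six coordinate points forming three obvious pairs.
coords = np.array([[0.0, 0.0], [0.1, 0.0],
                   [5.0, 5.0], [5.1, 5.0],
                   [10.0, 0.0], [10.1, 0.0]])
tree = build_speaker_tree(coords)                  # (n-1) x 4 merge table
labels = fcluster(tree, t=3, criterion='maxclust')  # cut into 3 clusters
```

Here `fcluster` only illustrates that the tree can be cut into groups; the patent itself keeps the full tree and later compares new audio against its branches and leaves.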
Further, the speaker in new audio is identified as follows:
Read or record the new speech, first compute the new audio's feature data, and convert it into projection coordinates in the N-dimensional principal-component space;
Compare the branches and leaves of the existing cluster tree with the new audio to find the closest speaker, i.e., compute the similarity between the new audio and the closest speaker, specifically:
First compute the distance d, then compute the matching score s by a pair of equations (given in the original as images not reproduced in this text): one expression applies when d ≤ r_ave and the other when d > r_ave, where r_ave and r_sd are the mean and standard deviation of the distances from the closest speaker's audio-feature coordinate samples to their center, and cdf is the normal cumulative distribution function.
If the score s is at or above a specified cutoff value, the new audio and the closest speaker are the same speaker; otherwise the new audio comes from a new speaker.
The coordinates of the newly acquired audio are added to the cluster tree as a new entry, to be used for further identifying voice from this new speaker; a new voice cluster tree is thus formed.
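The score formulas themselves are images that did not survive extraction, so the sketch below is a plausible reconstruction, not the patent's actual equations: it assumes a perfect score inside the cluster's average radius and a score that decays by the normal cdf outside it, consistent with the stated ingredients (d, r_ave, r_sd, a piecewise split at d = r_ave, and the normal cumulative distribution function). All names and the cutoff value are hypothetical.

```python
import numpy as np
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal cumulative distribution function (the 'cdf' in the text)."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def match_score(d, r_ave, r_sd):
    """ASSUMED form of the piecewise score: 1 when the probe lies within the
    cluster's average radius, decaying toward 0 by the normal cdf beyond it."""
    if d <= r_ave:
        return 1.0
    return 2.0 * (1.0 - normal_cdf((d - r_ave) / r_sd))

def identify(new_coord, leaf_coords, leaf_speakers, cutoff=0.05):
    """Find the nearest enrolled leaf, score the probe against that speaker's
    cluster, and accept only if the score clears the (arbitrary) cutoff."""
    dists = np.linalg.norm(leaf_coords - new_coord, axis=1)
    i = int(np.argmin(dists))
    mask = leaf_speakers == leaf_speakers[i]
    center = leaf_coords[mask].mean(axis=0)
    radii = np.linalg.norm(leaf_coords[mask] - center, axis=1)
    r_ave, r_sd = radii.mean(), radii.std() + 1e-12
    s = match_score(np.linalg.norm(new_coord - center), r_ave, r_sd)
    return (leaf_speakers[i] if s >= cutoff else None), s
```

Under these assumptions a probe inside a tight cluster scores 1.0, while a far-away probe scores near 0 and is treated as a new speaker.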
The beneficial effects of the present invention are:
(1) Compared with the prior art, the method of the present invention for identifying speakers needs only one set of different voice files to train and build an initial cluster tree; the audio to be identified can be entirely different from the training voices, and once the initial cluster tree is built no further training is needed: new speech can be recognized directly and new speakers' voices added.
(2) The method of the present invention uses a special algorithm that resolves who is speaking concisely, quickly and accurately; it is fast and makes adding a new speaker's voice convenient.
(3) Applied to an intelligent language tutoring system, the method of the present invention realizes speaker identification, distinguishing speakers in time from sessions with multiple unknown participants, which benefits targeted teaching.
Description of the drawings:
Fig. 1 is a flowchart of building the speaker voice cluster in the specific embodiment of the invention;
Fig. 2 is a flowchart of identifying a speaker's voice in the specific embodiment of the invention.
Specific embodiment:
The technical solution of the present invention is described in further detail below with reference to the drawings and embodiments.
Referring to Fig. 1, on the basis of speaker identification the present invention first builds the speaker voice cluster by combining principal component analysis (PCA) with hierarchical clustering of Euclidean distances between audio features in principal-component space. The specific steps are as follows:
(1) Read the training voice files;
(2) Compute the voice features, i.e., the time-domain and frequency-domain audio features of each training voice file, mainly including zero-crossing rate, root-mean-square energy, spectral centroid and bandwidth, Mel-frequency cepstral coefficients (MFCCs), and pitch class or chroma;
(3) Find the principal components of the voice features, i.e., compute the mean and standard deviation of the above voice features and perform principal component analysis;
(4) Compute the coordinates in the voice-feature principal-component space, i.e., select from the principal components the top N that explain 95% of the variance and use them as the coordinates of the N-dimensional projection;
(5) Aggregate the voices based on distance in principal-component space, and save a trained voice cluster.
A voice cluster library built as above from the principal components of the speakers' speech audio features is exemplified in Table 1 below:
Table 1 Voice cluster library built from the principal components of speakers' speech audio features (the table itself is an image not reproduced in this text)
The voice cluster library of Table 1 is scored against the parameter set obtained from feature analysis to identify whether a speaker is in the voiceprint model library.
Referring to Fig. 2, the saved voice cluster is processed with the UPGMA clustering algorithm: the closest speakers are clustered into a cluster or branch whose coordinates are the average of the speakers (leaves) it contains, and this continues until all speakers have been added to clusters, forming a tree. When new speech arrives, the steps of identifying the speaker by the method of the present invention are as follows:
(1) On the basis of the read trained voice cluster, read or record the new speech;
(2) Compute the features of the new speech;
(3) Compute the coordinates of the new speech features in principal-component space, i.e., convert the new speech features into N-dimensional principal-component projection coordinates;
(4) Find the voice in the trained voice cluster nearest to the new speech, i.e., compare the branches and leaves of the existing cluster tree with the new speech to find the closest speaker;
(5) Compute the similarity between the new speech and the closest speaker, specifically: first compute the distance d, then compute the matching score s by a pair of equations (given in the original as images not reproduced in this text), one applying when d ≤ r_ave and the other when d > r_ave, where r_ave and r_sd are the mean and standard deviation of the distances from the closest speaker's audio-feature coordinate samples to their center, and cdf is the normal cumulative distribution function;
(6) If the score s is at or above the specified cutoff value, the new speech and the nearest voice are from the same speaker; otherwise the new speech comes from a new speaker;
(7) Add the acquired new speech to the cluster tree as a new entry, forming a new voice cluster tree.
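Steps (3) and (7) above, projecting a new feature vector with the saved training statistics and enrolling its coordinates as a new entry, can be sketched as follows. This is illustrative only: the names `mu`, `sd` and `basis` assume the standardization statistics and principal-component basis were kept from training, which the patent implies but does not spell out.

```python
import numpy as np

def project_new(feature_vec, mu, sd, basis):
    """Step (3): standardize a new feature vector with the training mean/std
    and project it onto the stored principal-component basis (rows of `basis`)."""
    return basis @ ((feature_vec - mu) / sd)

def enroll(coords, speakers, new_coord, speaker_id):
    """Step (7): append the new audio's coordinates and speaker label as a new
    entry; the cluster tree is then rebuilt from the enlarged coordinate set."""
    return np.vstack([coords, new_coord]), np.append(speakers, speaker_id)
```

After `enroll`, rerunning the UPGMA step over the enlarged coordinate set yields the "new voice cluster tree" of step (7).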
Claims (4)
1. A method for identifying a speaker with a voice cluster built from principal components of audio features, characterized in that the method combines principal component analysis with hierarchical clustering of Euclidean distances between audio features in principal-component space, and specifically includes the following steps:
1) collecting different training audio sample sets;
2) computing the time-domain and frequency-domain audio features of each sample with the algorithms provided by Librosa;
3) separately computing the mean and standard deviation of the above time-domain and frequency-domain audio features;
4) performing principal component analysis on the training samples from the computed data, and selecting the top N components that explain 95% of the variance;
5) representing each audio sample by the coordinates of its feature data projected onto the above N principal components;
6) clustering the speakers with the UPGMA algorithm based on distances in the N-dimensional space.
2. The method for identifying a speaker with a voice cluster built from principal components of audio features according to claim 1, characterized in that the time-domain and frequency-domain audio features of the samples in step 2) include zero-crossing rate, root-mean-square energy, spectral centroid and bandwidth, Mel-frequency cepstral coefficients, and pitch class or chroma.
3. The method for identifying a speaker with a voice cluster built from principal components of audio features according to claim 1, characterized in that clustering based on distance in the N-dimensional space in step 6) specifically means first clustering the closest speakers into a cluster or branch whose coordinates are the average of the speakers (leaves) it contains, and continuing in this way until all speakers are added to clusters, forming a tree.
4. A method for identifying the speaker in new audio using the method according to any one of claims 1 to 3, characterized in that the method includes the following steps:
reading or recording the new speech, first computing the new audio's feature data, and converting it into N-dimensional principal-component projection coordinates;
comparing the branches and leaves of the existing cluster tree with the new audio to find the closest speaker, i.e., computing the similarity between the new audio and the closest speaker, specifically: first computing the distance d, then computing the matching score s by a pair of equations (given in the original as images not reproduced in this text), one applying when d ≤ r_ave and the other when d > r_ave, where r_ave and r_sd are the mean and standard deviation of the distances from the closest speaker's audio-feature coordinate samples to their center and cdf is the normal cumulative distribution function;
if the score s is at or above a specified cutoff value, the new audio and the closest speaker are the same speaker, otherwise the new audio comes from a new speaker;
and adding the coordinates of the acquired new audio to the cluster tree as a new entry, to be used for further identifying voice from this new speaker, thus forming a new voice cluster tree.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811118265.6A CN109065059A (en) | 2018-09-26 | 2018-09-26 | The method for identifying speaker with the voice cluster that audio frequency characteristics principal component is established |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811118265.6A CN109065059A (en) | 2018-09-26 | 2018-09-26 | The method for identifying speaker with the voice cluster that audio frequency characteristics principal component is established |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109065059A true CN109065059A (en) | 2018-12-21 |
Family
ID=64765876
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811118265.6A Withdrawn CN109065059A (en) | 2018-09-26 | 2018-09-26 | The method for identifying speaker with the voice cluster that audio frequency characteristics principal component is established |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109065059A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109800299A (en) * | 2019-02-01 | 2019-05-24 | 浙江核新同花顺网络信息股份有限公司 | A kind of speaker clustering method and relevant apparatus |
CN110135492A (en) * | 2019-05-13 | 2019-08-16 | 山东大学 | Equipment fault diagnosis and method for detecting abnormality and system based on more Gauss models |
WO2020143263A1 (en) * | 2019-01-11 | 2020-07-16 | 华南理工大学 | Speaker identification method based on speech sample feature space trajectory |
CN112019786A (en) * | 2020-08-24 | 2020-12-01 | 上海松鼠课堂人工智能科技有限公司 | Intelligent teaching screen recording method and system |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1178467A1 (en) * | 2000-07-05 | 2002-02-06 | Matsushita Electric Industrial Co., Ltd. | Speaker verification and identification in their own spaces |
JP2013061402A (en) * | 2011-09-12 | 2013-04-04 | Nippon Telegr & Teleph Corp <Ntt> | Spoken language estimating device, method, and program |
CN103413551A (en) * | 2013-07-16 | 2013-11-27 | 清华大学 | Sparse dimension reduction-based speaker identification method |
CN104538035A (en) * | 2014-12-19 | 2015-04-22 | 深圳先进技术研究院 | Speaker recognition method and system based on Fisher supervectors |
CN107342077A (en) * | 2017-05-27 | 2017-11-10 | 国家计算机网络与信息安全管理中心 | A kind of speaker segmentation clustering method and system based on factorial analysis |
- 2018-09-26: CN application CN201811118265.6A filed; published as CN109065059A (en); status not active (Withdrawn)
Non-Patent Citations (2)
Title |
---|
Zhang Wenlin et al., "Regularization-based eigenvoice speaker adaptation method", Acta Automatica Sinica |
Fang Erqing et al., "Automatic age estimation method based on audio-visual information", Journal of Software |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109065059A (en) | The method for identifying speaker with the voice cluster that audio frequency characteristics principal component is established | |
CN102509547B (en) | Method and system for voiceprint recognition based on vector quantization based | |
CN106847292B (en) | Method for recognizing sound-groove and device | |
CN104036774B (en) | Tibetan dialect recognition methods and system | |
CN108922541B (en) | Multi-dimensional characteristic parameter voiceprint recognition method based on DTW and GMM models | |
CN108986824B (en) | Playback voice detection method | |
CN102324232A (en) | Method for recognizing sound-groove and system based on gauss hybrid models | |
WO2019153404A1 (en) | Smart classroom voice control system | |
CN105469784B (en) | A kind of speaker clustering method and system based on probability linear discriminant analysis model | |
CN107342077A (en) | A kind of speaker segmentation clustering method and system based on factorial analysis | |
CN107393554A (en) | In a kind of sound scene classification merge class between standard deviation feature extracting method | |
CN103811009A (en) | Smart phone customer service system based on speech analysis | |
CN105261367B (en) | A kind of method for distinguishing speek person | |
CN106128465A (en) | A kind of Voiceprint Recognition System and method | |
CN109215665A (en) | A kind of method for recognizing sound-groove based on 3D convolutional neural networks | |
CN110457432A (en) | Interview methods of marking, device, equipment and storage medium | |
CN1808567A (en) | Voice-print authentication device and method of authenticating people presence | |
CN109346084A (en) | Method for distinguishing speek person based on depth storehouse autoencoder network | |
CN110047504A (en) | Method for distinguishing speek person under identity vector x-vector linear transformation | |
CN108735200A (en) | A kind of speaker's automatic marking method | |
CN109961794A (en) | A kind of layering method for distinguishing speek person of model-based clustering | |
CN110299150A (en) | A kind of real-time voice speaker separation method and system | |
CN107358947A (en) | Speaker recognition methods and system again | |
CN109377981A (en) | The method and device of phoneme alignment | |
CN106898355A (en) | A kind of method for distinguishing speek person based on two modelings |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | ||

Application publication date: 2018-12-21