CN110211594B - Speaker identification method based on twin network model and KNN algorithm - Google Patents

Speaker identification method based on twin network model and KNN algorithm

Info

Publication number
CN110211594B
CN110211594B
Authority
CN
China
Prior art keywords
speaker
network model
voice
knn algorithm
twin network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910494606.8A
Other languages
Chinese (zh)
Other versions
CN110211594A (en)
Inventor
Zhang Li (张莉)
Li Wenjun (李文钧)
Li Zhu (李竹)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN201910494606.8A priority Critical patent/CN110211594B/en
Publication of CN110211594A publication Critical patent/CN110211594A/en
Application granted granted Critical
Publication of CN110211594B publication Critical patent/CN110211594B/en

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00: Speaker identification or verification techniques
    • G10L 17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L 17/04: Training, enrolment or model building
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a speaker identification method based on a twin network model and a KNN algorithm, comprising the following steps. Step S1: use a microphone to collect speaker voice information as a data set and train an RNN network model. Step S2: construct a twin network model from the trained RNN and identify the speaker in combination with the KNN algorithm. Under this scheme, the network is trained on the speaker data set in the database so that each voice signal fed into the twin network is output as a feature vector characterizing its speaker; the cosine distance measures the separation between the output feature vectors, and the KNN algorithm judges whether two utterances belong to the same speaker. As a result, speakers can be identified from only a small number of samples, the network does not need retraining as the number of speakers grows, the neural network's demand for data samples is reduced, and the real-time performance and accuracy of speaker identification are effectively improved.

Description

Speaker identification method based on twin network model and KNN algorithm
Technical Field
The invention belongs to the technical field of human-computer interaction, in particular to speaker recognition, and specifically relates to a speaker recognition method based on a twin network model and a KNN algorithm.
Background
In the field of human-computer interaction, with the rapid development of artificial intelligence, pattern recognition and related technologies, interaction between people and computers has become ever closer. Traditional contact-based interaction can no longer meet users' needs, and research into new interaction modes that match natural human communication habits has become a hotspot in recent years. Speaker recognition, as one of the main channels of human-computer interaction, is gradually becoming an important research topic in the field.
Existing speaker recognition methods fall mainly into three categories: voice feature extraction with template matching, statistical voice models, and deep learning. Traditional work has focused on feature extraction and template matching: recognized voiceprint samples are trained in advance, and the voiceprint to be identified is matched against them. This approach is simple to operate, but its accuracy is limited and it requires a large number of data samples. Statistical-model methods frame the recognition task as computing the probability of variables; they achieve high recognition accuracy but need large amounts of data for verification. Deep-learning methods use a neural network to capture hidden features of the speaker so as to represent the speaker better, but they too require large amounts of data, and the network must be retrained every time the data set is updated, which hinders the enrolment of new speakers.
Disclosure of Invention
Aiming at the prior art's need for large numbers of voice samples, the invention provides a speaker recognition method based on a twin network model and a KNN algorithm. Speaker voice information is collected through a microphone device, and a strategy combining a twin RNN (recurrent neural network) with the KNN (k-nearest-neighbor) algorithm handles updates of speaker information, so that speakers can be identified quickly and efficiently from a small amount of data. The specific technical scheme is as follows:
a speaker recognition method based on twin network model and KNN classification comprises the following steps:
step S1: using a microphone to collect voice information of a speaker as a data set to train an RNN network model;
step S2: constructing a twin network model by using the trained RNN and identifying the speaker by combining a KNN algorithm;
wherein the step S1 further includes:
step S11: acquiring a large number of voice data sets and carrying out data preprocessing;
step S12: storing the preprocessed voice data set in a voice database;
step S13: acquiring a voice signal data set from a voice database, and acquiring a feature vector of a voice signal by using a feature extraction method based on the rule that the voice signal changes along with time;
step S14: training the RNN network model with the back-propagation-through-time (BPTT) algorithm on the voice signal feature vectors v extracted in step S13, obtaining the optimal parameters θ and the initial model;
the step S13 further includes:
step S131: let X be the set of samples of a speech segment over a period of time t; framing it with a frame length of 25 ms yields the discrete speech signal X = {x_1, x_2, …, x_t};
Step S132: taking the input X = {x_1, x_2, …, x_t} from step S131, extract features from the discrete signal with MFCC, obtaining the 40-dimensional speech feature vector sequence V = {v_1, v_2, …, v_t};
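A minimal sketch of steps S131 and S132, assuming the librosa library and a 16 kHz recording; the 25 ms frame length follows step S131, while the 10 ms hop and the helper name are illustrative assumptions:

```python
import librosa

def extract_features(wav_path, sr=16000, n_mfcc=40):
    """Frame a speech signal and extract the 40-dimensional MFCC sequence V."""
    y, sr = librosa.load(wav_path, sr=sr)      # the raw speech signal X
    frame_len = int(0.025 * sr)                # 25 ms frames (step S131)
    hop_len = int(0.010 * sr)                  # 10 ms hop (assumed)
    # V = {v_1, v_2, ..., v_t}: one 40-dimensional vector per frame (step S132)
    V = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                             n_fft=frame_len, hop_length=hop_len)
    return V.T                                 # shape (t, 40)
```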
The step S14 further includes:
step S141: each speech signal is correlated in time; the input at time t contains v_t, and the RNN network model memorizes the previous speech state s_{t-1}. The hidden layer h_t at each instant depends on both the input at the current time and the state at the previous time, and is formulated as follows:
h_t = U·v_t + W·s_{t-1}
step S142: the state s_t at the current time t is a function of the hidden layer at that instant, s_t = f(h_t), where the activation function f is the tanh function, which fits the speech signal well; substituting the hidden-layer value gives:
s_t = tanh(U·v_t + W·s_{t-1})
step S143: the output vector f_t at the current time t is f_t = g(V·s_t); processing the whole segment finally yields the output vector F of a section of speech;
step S144: let the output vector be F = {f_1, f_2, …, f_t} and let θ = {W, U, V} denote the parameters shared by the RNN network model; the loss function of the RNN network model is obtained by summing, over the total time, the differences between the output values f_i and the true values y_i:
L(θ) = Σ_{i=1..t} ℓ(f_i, y_i)
step S145: the obtained loss function L(θ) is differentiated with respect to W, U and V by the back-propagation algorithm, thereby obtaining the optimal parameters θ and the initial model;
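A minimal NumPy sketch of the forward pass of steps S141 to S144 under the shared parameters θ = {W, U, V}; the output nonlinearity g and the per-step loss ℓ are not fixed by the text, so softmax and cross-entropy are assumptions here, and the derivatives of step S145 would in practice be obtained with an autograd framework rather than by hand:

```python
import numpy as np

def rnn_forward(V_seq, U, W, Vo, s0=None):
    """V_seq: (t, 40) feature sequence; returns the outputs F = {f_1 ... f_t}."""
    s = np.zeros(W.shape[0]) if s0 is None else s0   # previous state s_{t-1}
    F = []
    for v in V_seq:
        h = U @ v + W @ s                  # h_t = U.v_t + W.s_{t-1} (step S141)
        s = np.tanh(h)                     # s_t = tanh(h_t)          (step S142)
        z = Vo @ s                         # V.s_t; Vo named to avoid clashing with V_seq
        f = np.exp(z - z.max()); f /= f.sum()   # f_t = g(V.s_t), g = softmax (assumed)
        F.append(f)
    return np.stack(F)                     # F = {f_1, f_2, ..., f_t} (step S143)

def loss(F, Y):
    """L(theta) = sum_i l(f_i, y_i); per-step cross-entropy is assumed (step S144)."""
    return -np.sum(Y * np.log(F + 1e-12))  # Y: one-hot targets, same shape as F
```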
the step S2 further includes:
step S21: using the trained RNN network model to construct a twin network model whose branches share the same network parameters, and respectively inputting several different voice signals X_0, …, X_n; the set of predicted output vectors for the speech signals is FS = {F_0, …, F_n};
Step S22: calculating the cosine distances between the different output feature vectors according to the output set FS obtained in the previous step, and applying the KNN algorithm to determine whether the voices belong to the same person;
the step S22 further includes:
step S221: obtaining the output vectors F_0, F_1, …, F_n from the voice signals passed through the twin network model and using these vectors as the weight representation for classification, wherein F_1 = {f_11, f_12, …, f_1t} through F_n = {f_n1, f_n2, …, f_nt} denote the voice signals in the speaker sample set and F_0 = {f_01, f_02, …, f_0t} denotes the speaker's voice signal to be tested;
step S222: judging whether two signals belong to the same speaker by a cosine-distance score; the similarity of two vectors is reflected not in their lengths but only in the angle between them, giving the formula:
cos(F_0, F_i) = (F_0 · F_i) / (‖F_0‖ · ‖F_i‖)
step S223: feeding the cosine distances between the different voice signals calculated in step S222 to the KNN algorithm, whereby the points closest to F_0 indicate the same speaker.
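A minimal sketch of steps S221 to S223, assuming each utterance has already been reduced to a single embedding vector (for instance the last output f_t); the helper names and the choice k = 3 are illustrative, not taken from the patent:

```python
import numpy as np

def cosine_score(a, b):
    """cos(F_0, F_i): depends only on the angle between the vectors (step S222)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def knn_identify(F0, refs, labels, k=3):
    """Return the speaker label whose samples lie closest to F_0 (step S223)."""
    scored = sorted(zip(refs, labels),
                    key=lambda ref: cosine_score(F0, ref[0]),
                    reverse=True)              # high cosine = small distance
    top = [lab for _, lab in scored[:k]]
    return max(set(top), key=top.count)        # majority vote among the k nearest
```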
Compared with the prior art, the invention has the following beneficial effects:
1. The invention designs a speaker identification method in which only one or a small number of training samples, themselves subject to variability, are provided for each speaker category. Rather than directly training an output classification model, the method trains the model's similarity function, so that speakers can be identified accurately and quickly from a small number of samples.
2. The invention decomposes a continuous voice signal into discrete voice-signal vectors, whereas traditional speaker recognition requires input voice signals of equal length.
3. The proposed network extends the two-channel input of the traditional twin network to multi-channel input, so that speakers can be identified more quickly.
4. The invention designs a method that scores the similarity between utterances of the same speaker and of different speakers and uses the KNN algorithm to judge whether they belong to the same speaker.
Drawings
FIG. 1 is a block flow diagram of a twin network model and KNN classification based speaker recognition method according to the present invention;
FIG. 2 is a detailed flowchart of the speech feature extraction of a twin network model and KNN algorithm based speaker recognition method according to the present invention;
FIG. 3 is a deep circular neural network structure of a speaker recognition method based on a twin network model and a KNN algorithm according to the present invention;
FIG. 4 shows a twin network structure constructed in the speaker recognition method based on the twin network model and KNN algorithm according to the present invention;
FIG. 5 is a detailed flowchart of a twin network and KNN algorithm in a speaker recognition method based on a twin network model and a KNN algorithm according to the present invention;
Detailed Description
The technical solution provided by the present invention will be further explained with reference to the accompanying drawings.
In real life the set of speakers keeps changing as people arrive and leave. To recognize speakers by voice, each newly added person's voice signal must be enrolled, yet existing models must be retrained from scratch whenever new voices are added, which makes updating difficult. The invention therefore uses a twin network model to judge the similarity between a newly added voice signal and the voice to be tested.
The present invention provides a speaker recognition system based on a twin network model and KNN, as shown in FIG. 1. Overall, the invention comprises two major steps. Step S1: using a microphone to collect the speaker's voice information as a data set and train an RNN network model. Step S2: constructing a twin network model with the trained RNN and identifying the speaker in combination with the KNN algorithm.
Referring to FIG. 2, the large collection of acquired voice data is preprocessed: the collected voice signals undergo pre-emphasis, framing and Fourier transformation to obtain the 40-dimensional voice feature vectors v, as sketched below.
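A one-line sketch of the pre-emphasis stage named above; the coefficient 0.97 is a common default and an assumption, not a value fixed by the patent:

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """Boost high frequencies before framing: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(x[0], x[1:] - alpha * x[:-1])
```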
Referring to the RNN network model shown in FIG. 3, the 40-dimensional speech feature vectors thus obtained are fed into the RNN network model for training, yielding the initial model.
Step S141: each speech signal is correlated in time; the input at time t contains v_t, and the RNN network model memorizes the previous speech state s_{t-1}. The hidden layer h_t at each instant depends on both the input at the current time and the state at the previous time, and is formulated as follows:
h_t = U·v_t + W·s_{t-1}
Step S142: the state s_t at the current time t is a function of the hidden layer at that instant, s_t = f(h_t), where the activation function f is the tanh function, which fits the speech signal well; substituting the hidden-layer value gives:
s_t = tanh(U·v_t + W·s_{t-1})
Step S143: the output vector f_t at the current time t is f_t = g(V·s_t); processing the whole segment finally yields the output vector F of a section of speech.
Step S144: let the output vector be F = {f_1, f_2, …, f_t} and let θ = {W, U, V} denote the parameters shared by the RNN network model; the loss function of the RNN network model is obtained by summing, over the total time, the differences between the output values f_i and the true values y_i:
L(θ) = Σ_{i=1..t} ℓ(f_i, y_i)
Step S145: the obtained loss function L(θ) is differentiated with respect to W, U and V by the back-propagation algorithm, thereby obtaining the optimal parameters θ and the initial model.
With the twin network structure shown in FIG. 4, the two-channel input of the twin network is extended to multi-channel input: n sections of voice signal are input at each test, namely one section of voice to be tested and n-1 sections of reference voice samples, as shown in FIG. 5. The recurrent neural network (RNN) of FIG. 3 extracts the feature vector of the voice signal under test, the distances to the feature vectors of the reference voice signals are measured, and the KNN nearest-neighbor algorithm then assigns the label under test to the class at the closest spatial distance, thereby realizing speaker recognition.
Step S221: the voice signals passed through the twin network model yield the output vectors F_0, F_1, …, F_n, which are used as the weight representation for classification, where F_1 = {f_11, f_12, …, f_1t} through F_n = {f_n1, f_n2, …, f_nt} denote the voice signals in the speaker sample set and F_0 = {f_01, f_02, …, f_0t} denotes the speaker's voice signal to be tested.
Step S222: whether two signals belong to the same speaker is judged by a cosine-distance score; the similarity of two vectors is reflected not in their lengths but only in the angle between them, giving the formula:
cos(F_0, F_i) = (F_0 · F_i) / (‖F_0‖ · ‖F_i‖)
Step S223: the cosine distances between the different voice signals calculated in step S222 are fed to the KNN algorithm, whereby the points closest to F_0 indicate the same speaker.
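Putting the stages together, a hedged end-to-end sketch of the inference pipeline of FIGS. 4 and 5, reusing the helpers sketched earlier (extract_features, rnn_forward and knn_identify are assumed names, and the last per-frame output is taken as the utterance embedding):

```python
def identify_speaker(probe_wav, reference_wavs, labels, params):
    """One probe branch plus n-1 reference branches, all sharing the weights."""
    U, W, Vo = params                            # shared across every branch
    embed = lambda path: rnn_forward(extract_features(path), U, W, Vo)[-1]
    F0 = embed(probe_wav)                        # voice signal under test
    refs = [embed(p) for p in reference_wavs]    # reference voice samples
    return knn_identify(F0, refs, labels)        # nearest speaker wins
```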

Claims (3)

1. A speaker recognition method based on a twin network model and a KNN algorithm is characterized by comprising the following steps:
step S1: using a microphone to collect voice information of a speaker as a data set to train an RNN network model;
step S2: constructing a twin network model by using the trained RNN and identifying the speaker by combining a KNN algorithm;
wherein, the step S1 is as follows:
step S11: acquiring a large number of voice data sets and carrying out data preprocessing;
step S12: storing the preprocessed voice data set in a voice database;
step S13: acquiring a voice signal data set from a voice database, and acquiring a feature vector v of a voice signal by using a feature extraction method based on the rule that the voice signal changes along with time;
step S14: training the RNN network model with the back-propagation-through-time (BPTT) algorithm on the voice signal feature vectors v extracted in step S13, obtaining the optimal parameters θ and the initial model;
the step S2 is specifically as follows:
step S21: constructing a twin network model by using the trained RNN network model, and respectively inputting several different voice signals X_0, …, X_n; the set of predicted output vectors for the speech signals is FS = {F_0, …, F_n};
Step S22: calculating the cosine distances between the different output feature vectors according to the output set FS obtained in the previous step, and applying the KNN algorithm to determine whether the voices belong to the same person.
2. The twin network model and KNN algorithm based speaker recognition method of claim 1,
the step S13 is specifically as follows:
step S131: let X be the set of samples of a speech segment over a period of time t; framing it with a frame length of 25 ms yields the discrete speech signal X = {x_1, x_2, …, x_t};
Step S132: taking the input X = {x_1, x_2, …, x_t} from step S131, extract features from the discrete voice signal with MFCC, obtaining the 40-dimensional speech feature vector sequence V = {v_1, v_2, …, v_t}.
3. The twin network model and KNN algorithm based speaker recognition method of claim 1,
the step S22 is specifically as follows:
step S221: obtaining the output vectors F_0, F_1, …, F_n from the voice signals passed through the twin RNN network model and using these vectors as the weight representation for classification, wherein F_1 = {f_11, f_12, …, f_1t} through F_n = {f_n1, f_n2, …, f_nt} denote the voice signals in the speaker sample set and F_0 = {f_01, f_02, …, f_0t} denotes the speaker's voice signal to be tested;
step S222: judging whether two signals belong to the same speaker by a cosine-distance score; the similarity of two vectors is reflected not in their lengths but only in the angle between them, giving the formula:
cos(F_0, F_i) = (F_0 · F_i) / (‖F_0‖ · ‖F_i‖)
step S223: feeding the cosine distances between the different voice signals calculated in step S222 to the KNN algorithm, whereby the points closest to F_0 indicate the same speaker.
CN201910494606.8A 2019-06-06 2019-06-06 Speaker identification method based on twin network model and KNN algorithm Active CN110211594B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910494606.8A CN110211594B (en) 2019-06-06 2019-06-06 Speaker identification method based on twin network model and KNN algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910494606.8A CN110211594B (en) 2019-06-06 2019-06-06 Speaker identification method based on twin network model and KNN algorithm

Publications (2)

Publication Number Publication Date
CN110211594A CN110211594A (en) 2019-09-06
CN110211594B (en) 2021-05-04

Family

ID=67791537

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910494606.8A Active CN110211594B (en) 2019-06-06 2019-06-06 Speaker identification method based on twin network model and KNN algorithm

Country Status (1)

Country Link
CN (1) CN110211594B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110569908B (en) * 2019-09-10 2022-05-13 思必驰科技股份有限公司 Speaker counting method and system
CN110767239A (en) * 2019-09-20 2020-02-07 平安科技(深圳)有限公司 Voiceprint recognition method, device and equipment based on deep learning
CN111126563B (en) * 2019-11-25 2023-09-29 中国科学院计算技术研究所 Target identification method and system based on space-time data of twin network
CN111048097B (en) * 2019-12-19 2022-11-29 中国人民解放军空军研究院通信与导航研究所 Twin network voiceprint recognition method based on 3D convolution
CN111785287B (en) 2020-07-06 2022-06-07 北京世纪好未来教育科技有限公司 Speaker recognition method, speaker recognition device, electronic equipment and storage medium
CN112270931B (en) * 2020-10-22 2022-10-21 江西师范大学 Method for carrying out deceptive voice detection based on twin convolutional neural network
CN113903043B (en) * 2021-12-11 2022-05-06 绵阳职业技术学院 Method for identifying printed Chinese character font based on twin metric model

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107170445A (en) * 2017-05-10 2017-09-15 重庆大学 The parkinsonism detection means preferably differentiated is cooperateed with based on voice mixing information characteristics
US20190035431A1 (en) * 2017-07-28 2019-01-31 Adobe Systems Incorporated Apparatus, systems, and methods for integrating digital media content
CN108492294A (en) * 2018-03-23 2018-09-04 北京邮电大学 A kind of appraisal procedure and device of image color harmony degree
CN109065032A (en) * 2018-07-16 2018-12-21 杭州电子科技大学 A kind of external corpus audio recognition method based on depth convolutional neural networks
CN109543009A (en) * 2018-10-17 2019-03-29 龙马智芯(珠海横琴)科技有限公司 Text similarity assessment system and text similarity appraisal procedure
CN109243467A (en) * 2018-11-14 2019-01-18 龙马智声(珠海)科技有限公司 Sound-groove model construction method, method for recognizing sound-groove and system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
"Siamese neural network based gait recognition for human identification";Cheng Zhang;《ICASSP》;20161231;全文 *
"sketch-a-net that beats humans";YU Q;《British machine vision conference》;20151231;全文 *
"基于时序特征的草图识别方法";丁美玉;《计算机科学》;20181130;第45卷(第11A期);全文 *
"基于深度学习的足球球员跟踪算法研究";马月洁;《中国传媒大学学报自然科学版》;20180630;第25卷(第3期);全文 *

Also Published As

Publication number Publication date
CN110211594A (en) 2019-09-06

Similar Documents

Publication Publication Date Title
CN110211594B (en) Speaker identification method based on twin network model and KNN algorithm
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
Basu et al. A review on emotion recognition using speech
CN112289323B (en) Voice data processing method and device, computer equipment and storage medium
CN108074576A (en) Inquest the speaker role's separation method and system under scene
JP2002014692A (en) Device and method for generating acoustic model
US11837252B2 (en) Speech emotion recognition method and system based on fused population information
CN113066499B (en) Method and device for identifying identity of land-air conversation speaker
Bahari Speaker age estimation using Hidden Markov Model weight supervectors
CN110992988A (en) Speech emotion recognition method and device based on domain confrontation
Pardede et al. Convolutional neural network and feature transformation for distant speech recognition
Markov et al. Never-ending learning system for on-line speaker diarization
CN111091840A (en) Method for establishing gender identification model and gender identification method
CN111968628B (en) Signal accuracy adjusting system and method for voice instruction capture
Kaur et al. An efficient speaker recognition using quantum neural network
Chinmayi et al. Emotion Classification Using Deep Learning
Espi et al. Spectrogram patch based acoustic event detection and classification in speech overlapping conditions
Prakash et al. Analysis of emotion recognition system through speech signal using KNN & GMM classifier
Aggarwal et al. Application of genetically optimized neural networks for hindi speech recognition system
JPH064097A (en) Speaker recognizing method
Jati et al. An Unsupervised Neural Prediction Framework for Learning Speaker Embeddings Using Recurrent Neural Networks.
Anindya et al. Development of Indonesian speech recognition with deep neural network for robotic command
Sawakare et al. Speech recognition techniques: a review
CN114898776A (en) Voice emotion recognition method of multi-scale feature combined multi-task CNN decision tree
Utomo et al. Spoken word and speaker recognition using MFCC and multiple recurrent neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant