CN115101076B - Speaker clustering method based on multi-scale channel separation convolution feature extraction - Google Patents


Info

Publication number: CN115101076B
Authority: CN (China)
Application number: CN202210588389.0A
Other languages: Chinese (zh)
Other versions: CN115101076A (en)
Prior art keywords: channel separation, clustering, matrix, model, scale channel
Inventors: 李海滨, 张晓龙, 李雅倩, 肖存军
Assignee (original and current): Yanshan University
Legal status: Active (granted)
Filing/priority date: 2022-05-26
Grant publication date: 2023-09-12
Application filed by Yanshan University; priority to CN202210588389.0A; publication of CN115101076A; application granted; publication of CN115101076B.

Classifications

    • G: Physics
    • G10: Musical instruments; Acoustics
    • G10L: Speech analysis or synthesis; Speech recognition; Speech or voice processing; Speech or audio coding or decoding
    • G10L 17/00: Speaker identification or verification
    • G10L 17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L 17/04: Training, enrolment or model building
    • G10L 17/06: Decision making techniques; Pattern matching strategies
    • G10L 17/14: Use of phonemic categorisation or speech recognition prior to speaker recognition or verification

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a speaker clustering method based on multi-scale channel separation convolution feature extraction, belonging to the technical field of voiceprint recognition. The method comprises the following steps: dividing the VoxCeleb and AMI datasets into a training set, a validation set and a test set; preprocessing the VoxCeleb and AMI data; constructing a multi-scale channel separation convolution module on the basis of the ECAPA-TDNN network framework; training the model multiple times with the AAM-softmax loss function to obtain an optimal model; extracting features from the AMI meeting data with the multi-scale channel separation convolution model and performing cluster analysis with spectral clustering; and scoring the clustering results with the standard diarization error rate (DER). The invention can extract discriminative voiceprint features that work well with the spectral clustering algorithm, and a lower diarization error rate is obtained at the cost of a relatively small increase in parameters.

Description

Speaker clustering method based on multi-scale channel separation convolution feature extraction
Technical Field
The invention relates to the technical field of voiceprint recognition, in particular to a speaker clustering method based on multi-scale channel separation convolution feature extraction.
Background
With the progress of modern technology, research centered on artificial intelligence has attracted wide attention from researchers. Speaker diarization, also called speaker segmentation and clustering, is an important research direction in speech signal processing. Its main task is to segment and cluster audio containing multiple speakers so as to extract each speaker's information, identify speaker boundaries and identities, and label segments belonging to the same speaker with a single class. Speaker diarization has a wide range of applications: speaker clustering can be used to retrieve a specific person's audio from audio archives, providing useful information for building and indexing speaker audio files and for public security work, and enabling automatic segmentation, labeling and indexing of recorded speech in a speech library; it can produce logs of long meeting recordings, which facilitates later verification and study; and it can improve the speaker-separation performance of electronic devices such as smart speakers. Speaker segmentation and clustering is a necessary front-end step for voiceprint recognition and helps improve the voiceprint recognition rate, and the speaker embedding vector plays a key role in reducing the diarization error rate.
In the voiceprint field, conventional statistical models and machine-learning methods still account for a significant share of voiceprint feature extraction. Low-dimensional feature parameters extracted by conventional methods, such as Mel-frequency cepstral coefficients (MFCC), linear prediction cepstral coefficients (LPCC) and linear prediction coefficients (LPC), are typically modeled by conventional probabilistic models such as the hidden Markov model (HMM), the Gaussian mixture model (GMM) and the Gaussian mixture model-universal background model (GMM-UBM). Although these feature parameters can represent some basic characteristics, they are low-dimensional, and the probabilistic models used to estimate them produce serious errors when the dataset becomes very large. With the rapid development of big data and the internet, deep-learning research in the voiceprint field has attracted increasing attention, and in speaker tasks, methods based on convolutional neural networks have to a certain extent surpassed the traditional factor-analysis framework.
At present, many deep-learning convolutional neural networks are used to extract audio embedding vectors, and the ECAPA-TDNN framework is one of the mainstream feature-extraction models: convolution can extract both local and global features, contextual information can be exploited, and training is fast. However, deeper feature extraction cannot be achieved simply by increasing the depth and width of the convolutional layers. Res2Net and HS-Net therefore introduce another dimension and extract multi-scale features through channel splitting, splicing and re-convolution; but because no connection is established between non-adjacent channels, there is a risk of information loss.
Disclosure of Invention
The invention aims to solve the technical problem of providing a speaker clustering method based on multi-scale channel separation convolution feature extraction. By establishing separation convolution among multi-scale channels, discriminative voiceprint features can be extracted that work well with the spectral clustering algorithm, and a lower diarization error rate is obtained at the cost of a relatively small increase in parameters.
In order to solve the technical problems, the invention adopts the following technical scheme:
a speaker clustering method based on multi-scale channel separation convolution feature extraction comprises the following steps:
step 1: dividing the VoxCeleb and AMI datasets into a training set, a validation set and a test set;
step 2: preprocessing the VoxCeleb and AMI data;
step 3: constructing a multi-scale channel separation convolution model on the basis of an ECAPA-TDNN network frame, and improving a Res2Net multi-scale feature extraction module in the ECAPA-TDNN network frame;
step 4: selecting the AAM-softmax loss function to train the multi-scale channel separation convolution model multiple times to obtain an optimal model;
step 5: extracting features from AMI conference data by utilizing a multi-scale channel separation convolution model, and carrying out clustering analysis by utilizing spectral clustering;
step 6: scoring the clustering results using the standard diarization error rate (DER).
The technical scheme of the invention is further improved as follows: in step 2, pre-emphasis, framing, windowing, fast Fourier transform, Mel triangular filtering, logarithmic-energy computation and discrete cosine transform are performed on the VoxCeleb dataset used for model evaluation and the AMI dataset used for speaker clustering, specifically comprising the following steps:
step 2.1: pre-emphasis is applied to the input speech signal through a first-order high-pass filter, whose transfer function is:
H(z) = 1 - t·z⁻¹
wherein H(z) is the pre-emphasis function, z is the transform-domain variable, and t is the pre-emphasis coefficient with 0.9 < t < 1.0;
step 2.2: the pre-emphasized speech signal is divided into frames with a partial overlap between adjacent frames, and a Hamming window is applied:
w(n) = 0.54 - 0.46·cos(2πn/(Q-1)), 0 ≤ n ≤ Q-1
wherein w(n) is the Hamming window function, Q is the number of samples per frame, and n is the time-domain discrete index;
step 2.3: the spectrum of each windowed frame x(n) is obtained by the discrete Fourier transform (or fast Fourier transform):
X(k) = Σ_{n=0}^{N-1} x(n)·e^{-j2πnk/N}, 0 ≤ k ≤ N-1
wherein x(n) is the time-domain sampled signal of a frame, X(k) is the speech spectrum, N is the length of the discrete Fourier transform interval, and k is the frequency-domain discrete index;
step 2.4: the spectrum obtained in step 2.3 is smoothed and harmonics are eliminated by Mel triangular filtering, the frequency response of the m-th triangular filter being:
H_m(k) = 0 for k < f(m-1); H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)) for f(m-1) ≤ k ≤ f(m); H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)) for f(m) < k ≤ f(m+1); H_m(k) = 0 for k > f(m+1)
wherein H_m(k) is the frequency response after triangular filtering, m indexes the filters, and f(m) is the center frequency of the m-th filter;
step 2.5: the logarithmic energy of the triangularly filtered frequency-domain signal is computed as:
s(m) = ln( Σ_{k=0}^{N-1} |X(k)|²·H_m(k) ), 0 ≤ m < M
wherein s(m) is the filtered logarithmic energy;
step 2.6: the logarithmic energies are passed through the discrete cosine transform (DCT) to obtain the final 80-dimensional MFCC coefficients:
C(n) = Σ_{m=0}^{M-1} s(m)·cos(πn(m + 0.5)/M), n = 1, 2, ..., L
wherein M is the number of triangular filters and L is the order of the MFCC coefficients;
The technical scheme of the invention is further improved as follows: the step 3 specifically comprises the following steps:
step 3.1: constructing a single multi-scale channel separation convolution feature extraction basic block, dividing a channel into 8 parts after a first TDNN convolution layer, carrying out convolution on each part, splicing the convolved features according to the channel, and carrying out feature fusion through the TDNN convolution layer;
step 3.2: constructing a multi-scale channel separation convolution feature extraction model, connecting 3 continuous multi-scale channel separation convolution feature extraction basic blocks after the pre-processed 80-dimensional MFCC features are subjected to 1x1 convolution, then carrying out channel splicing on the output obtained by each block, and finally completing feature fusion through 1x1 convolution;
step 3.3: connecting the obtained multi-scale channel separation convolution feature extraction model to a statistics pooling layer to obtain global and local means and variances, and obtaining the final embedded feature vector through a softmax activation function and two linear fully connected layers.
The technical scheme of the invention is further improved as follows: in step 4, the AAM-softmax loss function computes the angle θ between the features of positive and negative speaker speech-segment samples, and the loss is used to update the weight coefficients of the network, specifically comprising the following steps:
step 4.1: the embedded feature vector finally extracted by the network and its corresponding weight vector are normalized:
x̂_i = x_i / ||x_i||,  Ŵ_j = W_j / ||W_j||
step 4.2: the cosine similarity is then used as the angular distance between the normalized embedding and weight vectors:
cos θ_j = Ŵ_j^T · x̂_i
from which the corresponding AAM-softmax loss is computed as the negative logarithm of the class probability:
L = -log( e^{s·cos(θ_{y_i} + q)} / ( e^{s·cos(θ_{y_i} + q)} + Σ_{j≠y_i} e^{s·cos θ_j} ) )
wherein the margin coefficient q is set to 0.2 and the scaling factor s is set to 30;
step 4.3: using the AAM-softmax loss function, the number of epochs is set to 10, the mini-batch setting for each training epoch is 16, and each mini-batch contains 400 speech pairs; the network is trained accordingly.
The technical scheme of the invention is further improved as follows: in step 5, an embedded feature vector of the specified dimension 192 is extracted with the obtained multi-scale channel separation convolution model, a similarity matrix and a degree matrix of the feature samples are constructed, the first k eigenvalues of the normalized Laplacian matrix and their corresponding eigenvectors are computed from the similarity matrix and the degree matrix, and the cluster analysis of the speech segments is finally completed by k-means, specifically comprising the following steps:
step 5.1: features are extracted from the preprocessed data with the embedding feature-extraction model, giving spectral features of the specified dimension 192;
step 5.2: the similarity of all sample features is computed from the cosine similarity to obtain a similarity matrix W with values between 0 and 1, where x_i and x_j denote two different data points in the sample space and the specified kernel parameter σ is 0.01;
step 5.3: the degree matrix D is computed from the similarity matrix as
d_i = Σ_j W_ij
wherein each value d_i of the degree matrix D is the sum of the elements W_ij in the i-th row of the similarity matrix W and represents the degree of each sample; the degree matrix D is the diagonal matrix formed by placing the values d_i on the diagonal;
step 5.4: the normalized Laplacian matrix is computed from the degree matrix and the similarity matrix:
L_sym = I - D^(-1/2)·W·D^(-1/2)
and the eigenvectors p_1, p_2, ..., p_k corresponding to the k smallest eigenvalues of L_sym are computed; let P = [p_1, p_2, ..., p_k] ∈ R^(n×k);
step 5.5: the rows of P are normalized to unit length to obtain the matrix H, with H_ij = P_ij / (Σ_j P_ij²)^(1/2);
step 5.6: for each i = 1, ..., n, let h_i ∈ R^k be the i-th row of the matrix H;
step 5.7: the points (h_i), i = 1, ..., n, are clustered into clusters C_1, ..., C_k by the k-means algorithm.
The technical scheme of the invention is further improved as follows: in step 6, the two cases where the number of speakers is known and where it is unknown (as in real conditions) are both considered, and the clustering results are evaluated and analyzed on the validation set and the test set.
By adopting the above technical scheme, the invention achieves the following technical progress:
1. Compared with the currently mainstream embedded-feature-vector extraction model Res2Net, the invention adopts an improved fully connected HS-Res2Net on the ECAPA-TDNN framework. Working at a finer granularity, it separates the voiceprint features across multi-scale channels, establishes connecting convolutions between those channels, and extracts discriminative voiceprint features through channel separation, convolution, channel splicing and feature fusion.
2. The original ECAPA-TDNN model file is 21 MB and the model file obtained by training is 24.5 MB; with the parameter count increased by only about 0.16 times, more receptive-field combinations and richer multi-scale feature expression are obtained, and cluster analysis of the fine-grained voiceprint features by spectral clustering significantly reduces the diarization error rate.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a block diagram of a multi-scale channel separation convolution feature extraction module of the present invention;
fig. 3 is an overall network architecture diagram of ECAPA-TDNN in accordance with the present invention.
Detailed Description
The invention is described in further detail below with reference to the attached drawings and examples:
As shown in FIG. 1, the speaker clustering method based on multi-scale channel separation convolution feature extraction first preprocesses the VoxCeleb and AMI datasets by pre-emphasis, framing, windowing, fast Fourier transform, Mel triangular filtering, logarithmic-energy computation and discrete cosine transform (DCT). The resulting 80-dimensional low-dimensional MFCC features are fed into the multi-scale channel separation convolution feature-extraction module, which extracts deeper feature information through channel splitting, convolution, dense connection and splicing, and re-convolution feature fusion. The final outputs of the multi-scale channel separation convolution basic blocks are channel-spliced and fused, a channel- and context-dependent statistics pooling layer aggregates global and local information, a softmax activation function and fully connected layers produce 192-dimensional multi-scale features, and the extracted features are cluster-analyzed by the spectral clustering algorithm. Experimental results on the VoxCeleb and AMI datasets show that the invention improves the feature expression of the ECAPA-TDNN model and alleviates the insufficient feature extraction and feature redundancy of conventional convolutional-neural-network feature-extraction methods.
A speaker clustering method based on multi-scale channel separation convolution feature extraction comprises the following steps:
Step 1: dividing the VoxCeleb and AMI datasets into a training set, a validation set and a test set;
considering that VoxCeleb is one of the largest speaker-recognition corpora and is mainly used for model training, the VoxCeleb2 dataset serves as the training set and the VoxCeleb1 dataset as the test set for evaluating the model; the AMI dataset is a meeting corpus with 4 speakers per meeting and is divided into a validation set and a test set for analyzing and evaluating the spectral clustering results.
Step 2: preprocessing the VoxCeleb and AMI data;
pre-emphasis, framing, windowing, fast Fourier transform, Mel triangular filtering, logarithmic-energy computation and discrete cosine transform (DCT) are performed on the VoxCeleb dataset used for model evaluation and the AMI dataset used for speaker clustering;
the preprocessing specifically comprises the following steps:
step 2.1: pre-emphasis is applied to the input speech signal through a first-order high-pass filter, whose transfer function is:
H(z) = 1 - t·z⁻¹
wherein H(z) is the pre-emphasis function, z is the transform-domain variable, and t is the pre-emphasis coefficient with 0.9 < t < 1.0;
step 2.2: the pre-emphasized speech signal is divided into frames with a partial overlap between adjacent frames, and a Hamming window is applied:
w(n) = 0.54 - 0.46·cos(2πn/(Q-1)), 0 ≤ n ≤ Q-1
wherein w(n) is the Hamming window function, Q is the number of samples per frame, and n is the time-domain discrete index;
step 2.3: the spectrum of each windowed frame x(n) is obtained by the discrete Fourier transform (or fast Fourier transform):
X(k) = Σ_{n=0}^{N-1} x(n)·e^{-j2πnk/N}, 0 ≤ k ≤ N-1
wherein x(n) is the time-domain sampled signal of a frame, X(k) is the speech spectrum, N is the length of the discrete Fourier transform interval, and k is the frequency-domain discrete index;
step 2.4: the spectrum obtained in step 2.3 is smoothed and harmonics are eliminated by Mel triangular filtering, the frequency response of the m-th triangular filter being:
H_m(k) = 0 for k < f(m-1); H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)) for f(m-1) ≤ k ≤ f(m); H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)) for f(m) < k ≤ f(m+1); H_m(k) = 0 for k > f(m+1)
wherein H_m(k) is the frequency response after triangular filtering, m indexes the filters, and f(m) is the center frequency of the m-th filter;
step 2.5: the logarithmic energy of the triangularly filtered frequency-domain signal is computed as:
s(m) = ln( Σ_{k=0}^{N-1} |X(k)|²·H_m(k) ), 0 ≤ m < M
wherein s(m) is the filtered logarithmic energy;
step 2.6: the logarithmic energies are passed through the discrete cosine transform (DCT) to obtain the final 80-dimensional MFCC coefficients:
C(n) = Σ_{m=0}^{M-1} s(m)·cos(πn(m + 0.5)/M), n = 1, 2, ..., L
wherein M is the number of triangular filters and L is the order of the MFCC coefficients;
Step 3: constructing a multi-scale channel separation convolution model on the basis of the ECAPA-TDNN network framework, improving the Res2Net multi-scale feature extraction module in the ECAPA-TDNN framework;
building the multi-scale channel separation convolution module on the ECAPA-TDNN network specifically comprises the following steps:
Step 3.1: a single multi-scale channel separation convolution feature extraction basic block is constructed, corresponding to module x1 in FIG. 2. The feature map with 1024 channels obtained from the preceding 1x1 convolution is divided into 8 parts along the channel dimension, each of 128 channels; the first part performs no operation and is kept as the original retained feature y1. Module x2 applies a convolution with kernel size 1, stride 1, dilation rate 1 and output dimension 128, and then splits its channels into y21 and y22, with y21 kept as a retained feature. Module x3 first channel-splices y22 from module x2 with the current features, applies a convolution with kernel size 1, stride 1, dilation rate 1 and output dimension 128, and then splits the channels into y31 and y32, with y31 kept as a retained feature. Module x4 channel-splices y22 from module x2, y31 from module x3 and the current features, performs the same operation, and splits the convolved features into y41 and y42, with y41 kept as a retained feature; the remaining modules proceed in the same way. Finally y1, y21, y31, y41, y51, y61, y71 and y8 are spliced along the feature dimension to obtain high-dimensional semantic features, and feature fusion is completed by the 1x1 convolution module x6 to give the final output.
Step 3.2: the multi-scale channel separation convolution feature extraction model is constructed by assembling the single module of Step 3.1 into an overall multi-scale feature extraction module, described as follows: the 80-dimensional MFCC features obtained after preprocessing first pass through a 1x1 convolution, ReLU and BN layer with a specified output dimension of 512; three consecutive multi-scale channel separation convolution feature extraction modules are then connected, with dilation rates set to 2, 3 and 4 respectively, multi-scale factor s = 8 and specified output dimension 512; the final outputs of the 3 modules are then channel-spliced, giving a spliced dimension of 2048; finally the spliced features are fused through a 1x1 convolution, ReLU layer and BN layer, and the final output dimension is specified as 1536;
Step 3.3: the overall architecture of the whole network is designed with the overall multi-scale feature extraction model of Step 3.2:
after the feature extraction layer described in Step 3.2, a channel- and context-dependent statistics pooling layer is attached: the frame-level features output by the overall multi-scale feature extraction module are combined with their mean and variance (specified dimension 1536). As shown in FIG. 3, a module is then introduced in which a linear layer reduces the dimension to 128, a tanh activation function and a second linear layer restore the original 1536 dimensions, and the result of a softmax activation function is used as weights applied to the previous statistics, giving an output dimension of 3072. Finally a linear layer reduces the dimension to 192 to obtain the final embedded feature vector. A PyTorch sketch of the basic block of Step 3.1 is given below.
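The sketch below is one plausible reading of the basic block described in Step 3.1 (8 groups of 128 channels, kernel size 1, retained halves spliced at the end and fused by a 1x1 convolution). The exact dense-connection pattern between the non-retained halves and the treatment of the last group are assumptions where the description is ambiguous, so this is an illustration rather than the patent's exact module.

```python
# Sketch of one multi-scale channel-separation basic block (Steps 3.1-3.2), in PyTorch.
# Kernel size 1 and the dilation parameter follow the text (with kernel size 1 the
# dilation has no effect and is kept only to mirror the description).
import torch
import torch.nn as nn

class ChannelSeparationBlock(nn.Module):
    def __init__(self, channels=1024, scale=8, dilation=2):
        super().__init__()
        self.scale = scale
        self.width = channels // scale              # 128 channels per group
        self.half = self.width // 2
        # one conv per processed group (x2 .. x8); its input is the current group
        # spliced with the passed-on halves of the earlier processed groups
        self.convs = nn.ModuleList(
            nn.Conv1d(self.width + i * self.half, self.width,
                      kernel_size=1, stride=1, dilation=dilation)
            for i in range(scale - 1)
        )
        # retained parts: y1 (full width) + (scale-2) half-splits + last group kept whole
        fused_ch = 2 * self.width + (scale - 2) * self.half
        self.fuse = nn.Conv1d(fused_ch, channels, kernel_size=1)  # 1x1 feature fusion
        self.relu = nn.ReLU()

    def forward(self, x):                           # x: (batch, channels, time)
        groups = torch.chunk(x, self.scale, dim=1)
        retained = [groups[0]]                      # y1, kept untouched
        passed = []                                 # y22, y32, ... handed onward
        for i in range(1, self.scale):
            inp = torch.cat([groups[i]] + passed, dim=1)
            out = self.relu(self.convs[i - 1](inp))
            if i < self.scale - 1:
                keep, give = torch.chunk(out, 2, dim=1)
                retained.append(keep)               # y21 ... y71
                passed.append(give)
            else:
                retained.append(out)                # y8: last group kept whole (assumed)
        return self.fuse(torch.cat(retained, dim=1))

# Example: a 1024-channel feature map of 200 frames passes through the block unchanged in shape
feats = torch.randn(2, 1024, 200)
block = ChannelSeparationBlock()
print(block(feats).shape)                           # torch.Size([2, 1024, 200])
```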
Step 4: selecting the AAM-softmax loss function to train the multi-scale channel separation convolution model multiple times to obtain an optimal model;
the AAM-softmax loss function computes the angle θ between the features of positive and negative speaker speech-segment samples, and the loss is used to update the weight coefficients of the network. The loss function scores each speaker label output by the classifier against the real label: if the two speech segments of a speaker pair in the test set come from the same speaker, the pair's label is 1, otherwise it is 0, so the output of the network is the probability that two speech segments come from the same speaker.
The calculation of the AAM-softmax loss specifically comprises the following steps:
step 4.1: the embedded feature vector finally extracted by the network and its corresponding weight vector are normalized:
x̂_i = x_i / ||x_i||,  Ŵ_j = W_j / ||W_j||
step 4.2: the cosine similarity is then used as the angular distance between the normalized embedding and weight vectors:
cos θ_j = Ŵ_j^T · x̂_i
from which the corresponding AAM-softmax loss is computed as the negative logarithm of the class probability:
L = -log( e^{s·cos(θ_{y_i} + q)} / ( e^{s·cos(θ_{y_i} + q)} + Σ_{j≠y_i} e^{s·cos θ_j} ) )
wherein the margin coefficient q is set to 0.2 and the scaling factor s is set to 30;
step 4.3: using the AAM-softmax loss function, the number of epochs is set to 10, the mini-batch setting for each training epoch is 16, and each mini-batch contains 400 speech pairs; the network is trained accordingly;
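The AAM-softmax (additive angular margin) loss of step 4 can be sketched as follows in PyTorch. Only the margin q = 0.2 and the scale s = 30 come from the text above; the class count and embedding dimension in the example are illustrative assumptions.

```python
# Minimal AAM-softmax (ArcFace-style) loss sketch for step 4; margin and scale follow the
# text, while the number of classes and the embedding dimension are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmax(nn.Module):
    def __init__(self, emb_dim=192, n_classes=5994, margin=0.2, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(n_classes, emb_dim))
        nn.init.xavier_uniform_(self.weight)
        self.m, self.s = margin, scale

    def forward(self, emb, labels):
        # Step 4.1: L2-normalize embeddings and class weights
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))   # cos(theta_j)
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        # Step 4.2: add the angular margin q to the target-class angle only
        one_hot = F.one_hot(labels, cos.size(1)).bool()
        logits = torch.where(one_hot, torch.cos(theta + self.m), cos)
        # negative log-probability with scaling factor s
        return F.cross_entropy(self.s * logits, labels)

# Example: a batch of 16 embeddings of dimension 192
loss_fn = AAMSoftmax()
loss = loss_fn(torch.randn(16, 192), torch.randint(0, 5994, (16,)))
print(loss.item())
```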
the structural parameters of the fully connected HS-Res2Net network are shown in Table 1:
Table 1. Fully connected HS-Res2Net network architecture parameters
Step 5: extracting features from AMI conference data by utilizing a multi-scale channel separation convolution model, and carrying out clustering analysis by utilizing spectral clustering;
an embedded feature vector of the specified dimension 192 is extracted with the obtained multi-scale channel separation convolution model, a similarity matrix and a degree matrix of the feature samples are constructed, the first k eigenvalues of the normalized Laplacian matrix and their corresponding eigenvectors are computed from the similarity matrix and the degree matrix, and the cluster analysis of the speech segments is finally completed by k-means;
the cluster analysis of the features extracted from the AMI meeting data specifically comprises the following steps:
step 5.1: features are extracted from the preprocessed data with the embedding feature-extraction model, giving spectral features of the specified dimension 192;
step 5.2: the similarity of all sample features is computed from the cosine similarity to obtain a similarity matrix W with values between 0 and 1, where x_i and x_j denote two different data points in the sample space and the specified kernel parameter σ is 0.01;
step 5.3: the degree matrix D is computed from the similarity matrix as
d_i = Σ_j W_ij
wherein each value d_i of the degree matrix D is the sum of the elements W_ij in the i-th row of the similarity matrix W and represents the degree of each sample; the degree matrix D is the diagonal matrix formed by placing the values d_i on the diagonal;
step 5.4: the normalized Laplacian matrix is computed from the degree matrix and the similarity matrix:
L_sym = I - D^(-1/2)·W·D^(-1/2)
and the eigenvectors p_1, p_2, ..., p_k corresponding to the k smallest eigenvalues of L_sym are computed; let P = [p_1, p_2, ..., p_k] ∈ R^(n×k);
step 5.5: the rows of P are normalized to unit length to obtain the matrix H, with H_ij = P_ij / (Σ_j P_ij²)^(1/2);
step 5.6: for each i = 1, ..., n, let h_i ∈ R^k be the i-th row of the matrix H;
step 5.7: the points (h_i), i = 1, ..., n, are clustered into clusters C_1, ..., C_k by the k-means algorithm. A code sketch of steps 5.2-5.7 follows.
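A compact NumPy/scikit-learn sketch of steps 5.2-5.7 is given below. The exact similarity formula of step 5.2 is not reproduced in this text, so the sketch assumes a Gaussian kernel over cosine distance with the stated σ = 0.01; that kernel form is an assumption, not the patent's formula.

```python
# Sketch of steps 5.2-5.7: similarity matrix, degree matrix, normalized Laplacian,
# eigenvectors, row normalization and k-means. The kernel (Gaussian over cosine
# distance) is an assumed form; the text only states that cosine similarity and a
# parameter sigma = 0.01 are used.
import numpy as np
from sklearn.cluster import KMeans

def spectral_cluster(embeddings, k, sigma=0.01):
    # Step 5.2: pairwise similarity matrix W in [0, 1]
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    cos_sim = X @ X.T
    W = np.exp(-(1.0 - cos_sim) / (2.0 * sigma ** 2))   # assumed kernel form
    np.fill_diagonal(W, 0.0)
    # Step 5.3: degree matrix D with d_i = sum_j W_ij
    d = W.sum(axis=1)
    # Step 5.4: normalized Laplacian L_sym = I - D^{-1/2} W D^{-1/2}
    d_inv_sqrt = 1.0 / np.sqrt(d + 1e-10)
    L_sym = np.eye(len(W)) - (d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :])
    # eigenvectors of the k smallest eigenvalues (eigh returns ascending order)
    _, vecs = np.linalg.eigh(L_sym)
    P = vecs[:, :k]
    # Steps 5.5-5.6: row-normalize P to obtain H, whose i-th row represents segment i
    H = P / (np.linalg.norm(P, axis=1, keepdims=True) + 1e-10)
    # Step 5.7: k-means on the rows of H
    return KMeans(n_clusters=k, n_init=10).fit_predict(H)

# Example: 200 segment embeddings of dimension 192, clustered into 4 speakers
labels = spectral_cluster(np.random.randn(200, 192), k=4)
print(labels[:10])
```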
Step 6: scoring the clustering results using the standard diarization error rate (DER);
the two cases where the number of speakers is known and where it is unknown (as in real conditions) are both considered, and the clustering results are evaluated and analyzed on the validation set and the test set;
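When the number of speakers is unknown, the patent does not spell out how the cluster count k is chosen. A common heuristic in spectral clustering of speaker embeddings is the eigengap criterion sketched below; it is shown purely as an illustrative assumption, not as the patent's method.

```python
# Illustrative eigengap heuristic for choosing k when the number of speakers is unknown.
# This is common practice in spectral speaker clustering and is NOT stated in the patent;
# it is shown only as an assumed way of handling the unknown-k case.
import numpy as np

def estimate_num_speakers(L_sym, max_speakers=10):
    """Pick k as the position of the largest gap between consecutive eigenvalues of L_sym."""
    eigvals = np.sort(np.linalg.eigvalsh(L_sym))[:max_speakers + 1]
    gaps = np.diff(eigvals)            # gap between eigenvalue i and i+1
    return int(np.argmax(gaps)) + 1    # number of small eigenvalues before the gap

# Example usage: reuse the L_sym built inside the spectral_cluster() sketch above, then
# call spectral_cluster(embeddings, k=estimate_num_speakers(L_sym)).
```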
the standard partition cluster error rate is as follows:
wherein T is Spk Speech duration, T, representing speaker cluster errors Miss Representing the time length of misjudgment of effective voice as non-voice, T False Representing the duration of the non-speech misjudgment as valid speech, T Total To test the total duration of the audio in the collection, here, the sum is 0 because manually labeled speech segments are used. Partition clustering errorThe error rate DER is shown in Table 2:
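The DER formula above can be computed directly from the four durations; the helper below is a minimal illustration with made-up example values (the patent defines only the quantities, not an implementation).

```python
# Direct computation of DER = (T_Spk + T_Miss + T_False) / T_Total from step 6.
# The duration values in the example are placeholders; in practice they come from
# comparing the clustering output against the manually labeled reference segments.
def diarization_error_rate(t_spk: float, t_miss: float, t_false: float, t_total: float) -> float:
    return (t_spk + t_miss + t_false) / t_total

# Example with oracle speech segments (T_Miss = T_False = 0, as in the text above):
print(diarization_error_rate(t_spk=120.0, t_miss=0.0, t_false=0.0, t_total=3600.0))  # ~0.033
```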
Table 2. Diarization error rate (DER)
In summary, after the first 1x1 convolution in Res2Net, the invention splits, splices and re-convolves the channels, retains part of the features after each channel convolution as reserved features, and establishes feature associations with the other channels through the remaining parts; the finally obtained features undergo 1x1 feature fusion, so the feature relations among channels are exploited and the embedded feature vectors are extracted more fully. The trained model is then used to extract features from the AMI meeting dataset, and finally the time intervals of the speakers are clustered by spectral clustering.

Claims (4)

1. A speaker clustering method based on multi-scale channel separation convolution feature extraction is characterized in that: the method comprises the following steps:
step 1: dividing the VoxCeleb and AMI datasets into a training set, a validation set and a test set;
step 2: preprocessing the VoxCeleb and AMI data;
step 3: constructing a multi-scale channel separation convolution model on the basis of an ECAPA-TDNN network frame, and improving a Res2Net multi-scale feature extraction module in the ECAPA-TDNN network frame;
the step 3 specifically comprises the following steps:
step 3.1: constructing a single multi-scale channel separation convolution feature extraction basic block, dividing a channel into 8 parts after a first TDNN convolution layer, carrying out convolution on each part, splicing the convolved features according to the channel, and carrying out feature fusion through the TDNN convolution layer;
step 3.2: constructing a multi-scale channel separation convolution feature extraction model, connecting 3 continuous multi-scale channel separation convolution feature extraction basic blocks after the pre-processed 80-dimensional MFCC features are subjected to 1x1 convolution, then carrying out channel splicing on the output obtained by each block, and finally completing feature fusion through 1x1 convolution;
step 3.3: the obtained multi-scale channel separation convolution feature extraction model is connected to a statistics pooling layer to obtain global and local means and variances, and the final embedded feature vector is obtained through a softmax activation function and two linear fully connected layers;
step 4: the AAM-softmax loss function is selected to train the multi-scale channel separation convolution model multiple times to obtain the optimal multi-scale channel separation convolution model;
in step 4, the AAM-softmax loss function computes the angle θ between the features of positive and negative speaker speech-segment samples, and the loss is used to update the weight coefficients of the network, specifically comprising the following steps:
step 4.1: the embedded feature vector finally extracted by the network and its corresponding weight vector are normalized:
x̂_i = x_i / ||x_i||,  Ŵ_j = W_j / ||W_j||
step 4.2: the cosine similarity is then used as the angular distance between the normalized embedding and weight vectors:
cos θ_j = Ŵ_j^T · x̂_i
from which the corresponding AAM-softmax loss is computed as the negative logarithm of the class probability:
L = -log( e^{s·cos(θ_{y_i} + q)} / ( e^{s·cos(θ_{y_i} + q)} + Σ_{j≠y_i} e^{s·cos θ_j} ) )
wherein the margin coefficient q is set to 0.2 and the scaling factor s is set to 30;
step 4.3: using the AAM-softmax loss function, the number of epochs is set to 10, the mini-batch setting for each training epoch is 16, and each mini-batch contains 400 speech pairs; the network is trained accordingly;
step 5: extracting features from AMI conference data by utilizing a multi-scale channel separation convolution model, and carrying out clustering analysis by utilizing spectral clustering;
step 6: scoring the clustering results using the standard diarization error rate (DER).
2. The speaker clustering method based on multi-scale channel separation convolution feature extraction as claimed in claim 1, wherein: in step 2, pre-emphasis, framing, windowing, fast Fourier transform, Mel triangular filtering, logarithmic-energy computation and discrete cosine transform are performed on the VoxCeleb dataset used for model evaluation and the AMI dataset used for speaker clustering, specifically comprising the following steps:
step 2.1: pre-emphasis is applied to the input speech signal through a first-order high-pass filter, whose transfer function is:
H(z) = 1 - t·z⁻¹
wherein H(z) is the pre-emphasis function, z is the transform-domain variable, and t is the pre-emphasis coefficient with 0.9 < t < 1.0;
step 2.2: the pre-emphasized speech signal is divided into frames with a partial overlap between adjacent frames, and a Hamming window is applied:
w(n) = 0.54 - 0.46·cos(2πn/(Q-1)), 0 ≤ n ≤ Q-1
wherein w(n) is the Hamming window function, Q is the number of samples per frame, and n is the time-domain discrete index;
step 2.3: the spectrum of each windowed frame x(n) is obtained by the discrete Fourier transform (or fast Fourier transform):
X(k) = Σ_{n=0}^{N-1} x(n)·e^{-j2πnk/N}, 0 ≤ k ≤ N-1
wherein x(n) is the time-domain sampled signal of a frame, X(k) is the speech spectrum, N is the length of the discrete Fourier transform interval, and k is the frequency-domain discrete index;
step 2.4: the spectrum obtained in step 2.3 is smoothed and harmonics are eliminated by Mel triangular filtering, the frequency response of the m-th triangular filter being:
H_m(k) = 0 for k < f(m-1); H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)) for f(m-1) ≤ k ≤ f(m); H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)) for f(m) < k ≤ f(m+1); H_m(k) = 0 for k > f(m+1)
wherein H_m(k) is the frequency response after triangular filtering, m indexes the filters, and f(m) is the center frequency of the m-th filter;
step 2.5: the logarithmic energy of the triangularly filtered frequency-domain signal is computed as:
s(m) = ln( Σ_{k=0}^{N-1} |X(k)|²·H_m(k) ), 0 ≤ m < M
wherein s(m) is the filtered logarithmic energy;
step 2.6: the logarithmic energies are passed through the discrete cosine transform to obtain the final 80-dimensional MFCC coefficients:
C(n) = Σ_{m=0}^{M-1} s(m)·cos(πn(m + 0.5)/M), n = 1, 2, ..., L
wherein M is the number of triangular filters and L is the order of the MFCC coefficients.
3. The speaker clustering method based on multi-scale channel separation convolution feature extraction as claimed in claim 1, wherein: in step 5, an embedded feature vector of the specified dimension 192 is extracted with the obtained multi-scale channel separation convolution model, a similarity matrix and a degree matrix of the feature samples are constructed, the first k eigenvalues of the normalized Laplacian matrix and their corresponding eigenvectors are computed from the similarity matrix and the degree matrix, and the cluster analysis of the speech segments is finally completed by k-means, specifically comprising the following steps:
step 5.1: features are extracted from the preprocessed data with the embedding feature-extraction model, giving spectral features of the specified dimension 192;
step 5.2: the similarity of all sample features is computed from the cosine similarity to obtain a similarity matrix W with values between 0 and 1, where x_i and x_j denote two different data points in the sample space and the specified kernel parameter σ is 0.01;
step 5.3: the degree matrix D is computed from the similarity matrix as
d_i = Σ_j W_ij
wherein each value d_i of the degree matrix D is the sum of the elements W_ij in the i-th row of the similarity matrix W and represents the degree of each sample; the degree matrix D is the diagonal matrix formed by placing the values d_i on the diagonal;
step 5.4: the normalized Laplacian matrix is computed from the degree matrix and the similarity matrix:
L_sym = I - D^(-1/2)·W·D^(-1/2)
and the eigenvectors p_1, p_2, ..., p_k corresponding to the k smallest eigenvalues of L_sym are computed; let P = [p_1, p_2, ..., p_k] ∈ R^(n×k);
step 5.5: the rows of P are normalized to unit length to obtain the matrix H, with H_ij = P_ij / (Σ_j P_ij²)^(1/2);
step 5.6: for each i = 1, ..., n, let h_i ∈ R^k be the i-th row of the matrix H;
step 5.7: the points (h_i), i = 1, ..., n, are clustered into clusters C_1, ..., C_k by the k-means algorithm.
4. The speaker clustering method based on multi-scale channel separation convolution feature extraction as claimed in claim 1, wherein: in step 6, the two cases where the number of speakers is known and where it is unknown (as in real conditions) are both considered, and the clustering results are evaluated and analyzed on the validation set and the test set.
CN202210588389.0A 2022-05-26 2022-05-26 Speaker clustering method based on multi-scale channel separation convolution feature extraction Active CN115101076B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210588389.0A CN115101076B (en) 2022-05-26 2022-05-26 Speaker clustering method based on multi-scale channel separation convolution feature extraction


Publications (2)

Publication Number Publication Date
CN115101076A (en) 2022-09-23
CN115101076B (en) 2023-09-12

Family

ID=83289495

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210588389.0A Active CN115101076B (en) 2022-05-26 2022-05-26 Speaker clustering method based on multi-scale channel separation convolution feature extraction

Country Status (1)

Country Link
CN (1) CN115101076B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116072125B (en) * 2023-04-07 2023-10-17 成都信息工程大学 Method and system for constructing self-supervision speaker recognition model in noise environment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102270451A (en) * 2011-08-18 2011-12-07 安徽科大讯飞信息科技股份有限公司 Method and system for identifying speaker
CN110289002A (en) * 2019-06-28 2019-09-27 四川长虹电器股份有限公司 A kind of speaker clustering method and system end to end
CN111161744A (en) * 2019-12-06 2020-05-15 华南理工大学 Speaker clustering method for simultaneously optimizing deep characterization learning and speaker classification estimation

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10133538B2 (en) * 2015-03-27 2018-11-20 Sri International Semi-supervised speaker diarization
WO2018106971A1 (en) * 2016-12-07 2018-06-14 Interactive Intelligence Group, Inc. System and method for neural network based speaker classification

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102270451A (en) * 2011-08-18 2011-12-07 安徽科大讯飞信息科技股份有限公司 Method and system for identifying speaker
CN110289002A (en) * 2019-06-28 2019-09-27 四川长虹电器股份有限公司 A kind of speaker clustering method and system end to end
CN111161744A (en) * 2019-12-06 2020-05-15 华南理工大学 Speaker clustering method for simultaneously optimizing deep characterization learning and speaker classification estimation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于多信号特征融合的煤料漏斗堵塞检测研究 [Research on coal hopper blockage detection based on multi-signal feature fusion]; 潘攀, 吴海, 李海滨; 电工技术学报 (Transactions of China Electrotechnical Society), No. 30; 561-564 *

Also Published As

Publication number Publication date
CN115101076A (en) 2022-09-23

Similar Documents

Publication Publication Date Title
CN110853654B (en) Model generation method, voiceprint recognition method and corresponding device
CN106952643A (en) A kind of sound pick-up outfit clustering method based on Gaussian mean super vector and spectral clustering
CN107221320A (en) Train method, device, equipment and the computer-readable storage medium of acoustic feature extraction model
CN111243602A (en) Voiceprint recognition method based on gender, nationality and emotional information
CN110211594B (en) Speaker identification method based on twin network model and KNN algorithm
CN105206270A (en) Isolated digit speech recognition classification system and method combining principal component analysis (PCA) with restricted Boltzmann machine (RBM)
CN107731233A (en) A kind of method for recognizing sound-groove based on RNN
CN110910891B (en) Speaker segmentation labeling method based on long-time and short-time memory deep neural network
CN113066499B (en) Method and device for identifying identity of land-air conversation speaker
CN111508524B (en) Method and system for identifying voice source equipment
CN109961794A (en) A kind of layering method for distinguishing speek person of model-based clustering
CN111653267A (en) Rapid language identification method based on time delay neural network
CN115457966B (en) Pig cough sound identification method based on improved DS evidence theory multi-classifier fusion
CN111986699A (en) Sound event detection method based on full convolution network
CN112562725A (en) Mixed voice emotion classification method based on spectrogram and capsule network
CN115101076B (en) Speaker clustering method based on multi-scale channel separation convolution feature extraction
CN116741148A (en) Voice recognition system based on digital twinning
CN110246509A (en) A kind of stack denoising self-encoding encoder and deep neural network structure for voice lie detection
CN105006231A (en) Distributed large population speaker recognition method based on fuzzy clustering decision tree
Nyodu et al. Automatic identification of Arunachal language using K-nearest neighbor algorithm
Kamble et al. Emotion recognition for instantaneous Marathi spoken words
CN114970695B (en) Speaker segmentation clustering method based on non-parametric Bayesian model
Li et al. Feature extraction with convolutional restricted boltzmann machine for audio classification
CN115064175A (en) Speaker recognition method
CN111326161B (en) Voiceprint determining method and device

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant