CN115101076B - Speaker clustering method based on multi-scale channel separation convolution feature extraction - Google Patents


Info

Publication number: CN115101076B
Authority: CN (China)
Application number: CN202210588389.0A
Other languages: Chinese (zh)
Other versions: CN115101076A (en)
Prior art keywords: channel separation, clustering, matrix, model, scale channel
Inventors: 李海滨, 张晓龙, 李雅倩, 肖存军
Assignee (original and current): Yanshan University
Legal status: Active (granted)
Filing/priority date: 2022-05-26
Grant publication date: 2023-09-12
Application filed by Yanshan University; priority to CN202210588389.0A; publication of CN115101076A; application granted; publication of CN115101076B.

Classifications

    • G: Physics
    • G10: Musical instruments; Acoustics
    • G10L: Speech analysis or synthesis; Speech recognition; Speech or voice processing; Speech or audio coding or decoding
    • G10L 17/00: Speaker identification or verification
    • G10L 17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L 17/04: Training, enrolment or model building
    • G10L 17/06: Decision making techniques; Pattern matching strategies
    • G10L 17/14: Use of phonemic categorisation or speech recognition prior to speaker recognition or verification

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a speaker clustering method based on multi-scale channel separation convolution feature extraction, belonging to the technical field of voiceprint recognition. The method comprises the following steps: dividing the VoxCeleb and AMI datasets into a training set, a validation set and a test set; preprocessing the VoxCeleb and AMI data; constructing a multi-scale channel separation convolution module on the basis of the ECAPA-TDNN network framework; training the model multiple times with the AAM-softmax loss function to obtain an optimal model; extracting features from the AMI meeting data with the multi-scale channel separation convolution model and performing cluster analysis with spectral clustering; and scoring the clustering results with the standard diarization error rate (DER). The invention can extract discriminative voiceprint features that work well with the spectral clustering algorithm, and a lower diarization error rate is obtained at the cost of a relatively small increase in parameters.

Description

Speaker clustering method based on multi-scale channel separation convolution feature extraction
Technical Field
The invention relates to the technical field of voiceprint recognition, in particular to a speaker clustering method based on multi-scale channel separation convolution feature extraction.
Background
With the progress of modern technology, research centered on artificial intelligence has attracted wide attention from researchers. Speaker diarization, also called speaker segmentation and clustering, is an important research direction in speech signal processing. Its main task is to segment and cluster audio containing multiple speakers so as to extract each speaker's information, identify speaker boundaries and identities, and label segments belonging to the same speaker with a single class. Speaker diarization has a wide range of applications: speaker clustering can be used to retrieve a specific person's audio from audio archives, providing useful information for building and indexing speaker audio files and for public security work, and enabling automatic segmentation, labeling and indexing of recorded speech in a speech library; it can produce logs of long meeting recordings, which facilitates later verification and study; and it can improve the speaker-separation performance of electronic devices such as smart speakers. Speaker segmentation and clustering is a necessary front-end step for voiceprint recognition and helps improve the voiceprint recognition rate, and the speaker embedding vector plays a key role in reducing the diarization error rate.
In the voiceprint field, conventional statistical models and machine-learning methods still account for a significant share of voiceprint feature extraction. Low-dimensional feature parameters extracted by conventional methods, such as Mel-frequency cepstral coefficients (MFCC), linear prediction cepstral coefficients (LPCC) and linear prediction coefficients (LPC), are typically modeled by conventional probabilistic models such as the hidden Markov model (HMM), the Gaussian mixture model (GMM) and the Gaussian mixture model-universal background model (GMM-UBM). Although these feature parameters can represent some basic characteristics, they are low-dimensional, and the probabilistic models used to estimate them produce serious errors when the dataset becomes very large. With the rapid development of big data and the internet, deep-learning research in the voiceprint field has attracted increasing attention, and in speaker tasks, methods based on convolutional neural networks have to a certain extent surpassed the traditional factor-analysis framework.
At present, many deep-learning convolutional neural networks are used to extract audio embedding vectors, and the ECAPA-TDNN framework is one of the mainstream feature-extraction models: convolution can extract both local and global features, contextual information can be exploited, and training is fast. However, deeper feature extraction cannot be achieved simply by increasing the depth and width of the convolutional layers. Res2Net and HS-Net therefore introduce another dimension and extract multi-scale features through channel splitting, splicing and re-convolution; but because no connection is established between non-adjacent channels, there is a risk of information loss.
Disclosure of Invention
The invention aims to solve the technical problem of providing a speaker clustering method based on multi-scale channel separation convolution feature extraction. By establishing separation convolution among multi-scale channels, discriminative voiceprint features can be extracted that work well with the spectral clustering algorithm, and a lower diarization error rate is obtained at the cost of a relatively small increase in parameters.
In order to solve the technical problems, the invention adopts the following technical scheme:
a speaker clustering method based on multi-scale channel separation convolution feature extraction comprises the following steps:
step 1: dividing the VoxCeleb and AMI datasets into a training set, a validation set and a test set;
step 2: preprocessing the VoxCeleb and AMI data;
step 3: constructing a multi-scale channel separation convolution model on the basis of an ECAPA-TDNN network frame, and improving a Res2Net multi-scale feature extraction module in the ECAPA-TDNN network frame;
step 4: selecting the AAM-softmax loss function to train the multi-scale channel separation convolution model multiple times to obtain an optimal model;
step 5: extracting features from AMI conference data by utilizing a multi-scale channel separation convolution model, and carrying out clustering analysis by utilizing spectral clustering;
step 6: scoring the clustering results using the standard diarization error rate (DER).
The technical scheme of the invention is further improved as follows: in step 2, pre-emphasis, framing, windowing, fast Fourier transform, Mel triangular filtering, logarithmic-energy computation and discrete cosine transform are performed on the VoxCeleb dataset used for model evaluation and the AMI dataset used for speaker clustering, specifically comprising the following steps:
step 2.1: pre-emphasis is applied to the input speech signal through a first-order high-pass filter, whose transfer function is:
H(z) = 1 - t·z⁻¹
wherein H(z) is the pre-emphasis function, z is the transform-domain variable, and t is the pre-emphasis coefficient with 0.9 < t < 1.0;
step 2.2: the pre-emphasized speech signal is divided into frames with a partial overlap between adjacent frames, and a Hamming window is applied:
w(n) = 0.54 - 0.46·cos(2πn/(Q-1)), 0 ≤ n ≤ Q-1
wherein w(n) is the Hamming window function, Q is the number of samples per frame, and n is the time-domain discrete index;
step 2.3: the spectrum of each windowed frame x(n) is obtained by the discrete Fourier transform (or fast Fourier transform):
X(k) = Σ_{n=0}^{N-1} x(n)·e^{-j2πnk/N}, 0 ≤ k ≤ N-1
wherein x(n) is the time-domain sampled signal of a frame, X(k) is the speech spectrum, N is the length of the discrete Fourier transform interval, and k is the frequency-domain discrete index;
step 2.4: the spectrum obtained in step 2.3 is smoothed and harmonics are eliminated by Mel triangular filtering, the frequency response of the m-th triangular filter being:
H_m(k) = 0 for k < f(m-1); H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)) for f(m-1) ≤ k ≤ f(m); H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)) for f(m) < k ≤ f(m+1); H_m(k) = 0 for k > f(m+1)
wherein H_m(k) is the frequency response after triangular filtering, m indexes the filters, and f(m) is the center frequency of the m-th filter;
step 2.5: the logarithmic energy of the triangularly filtered frequency-domain signal is computed as:
s(m) = ln( Σ_{k=0}^{N-1} |X(k)|²·H_m(k) ), 0 ≤ m < M
wherein s(m) is the filtered logarithmic energy;
step 2.6: the logarithmic energies are passed through the discrete cosine transform (DCT) to obtain the final 80-dimensional MFCC coefficients:
C(n) = Σ_{m=0}^{M-1} s(m)·cos(πn(m + 0.5)/M), n = 1, 2, ..., L
wherein M is the number of triangular filters and L is the order of the MFCC coefficients;
The technical scheme of the invention is further improved as follows: the step 3 specifically comprises the following steps:
step 3.1: constructing a single multi-scale channel separation convolution feature extraction basic block, dividing a channel into 8 parts after a first TDNN convolution layer, carrying out convolution on each part, splicing the convolved features according to the channel, and carrying out feature fusion through the TDNN convolution layer;
step 3.2: constructing a multi-scale channel separation convolution feature extraction model, connecting 3 continuous multi-scale channel separation convolution feature extraction basic blocks after the pre-processed 80-dimensional MFCC features are subjected to 1x1 convolution, then carrying out channel splicing on the output obtained by each block, and finally completing feature fusion through 1x1 convolution;
step 3.3: connecting the obtained multi-scale channel separation convolution feature extraction model to a statistics pooling layer to obtain global and local means and variances, and obtaining the final embedded feature vector through a softmax activation function and two linear fully connected layers.
The technical scheme of the invention is further improved as follows: in step 4, the AAM-softmax loss function computes the angle θ between the features of positive and negative speaker speech-segment samples, and the loss is used to update the weight coefficients of the network, specifically comprising the following steps:
step 4.1: the embedded feature vector finally extracted by the network and its corresponding weight vector are normalized:
x̂_i = x_i / ||x_i||,  Ŵ_j = W_j / ||W_j||
step 4.2: the cosine similarity is then used as the angular distance between the normalized embedding and weight vectors:
cos θ_j = Ŵ_j^T · x̂_i
from which the corresponding AAM-softmax loss is computed as the negative logarithm of the class probability:
L = -log( e^{s·cos(θ_{y_i} + q)} / ( e^{s·cos(θ_{y_i} + q)} + Σ_{j≠y_i} e^{s·cos θ_j} ) )
wherein the margin coefficient q is set to 0.2 and the scaling factor s is set to 30;
step 4.3: using the AAM-softmax loss function, the number of epochs is set to 10, the mini-batch setting for each training epoch is 16, and each mini-batch contains 400 speech pairs; the network is trained accordingly.
The technical scheme of the invention is further improved as follows: in step 5, an embedded feature vector of the specified dimension 192 is extracted with the obtained multi-scale channel separation convolution model, a similarity matrix and a degree matrix of the feature samples are constructed, the first k eigenvalues of the normalized Laplacian matrix and their corresponding eigenvectors are computed from the similarity matrix and the degree matrix, and the cluster analysis of the speech segments is finally completed by k-means, specifically comprising the following steps:
step 5.1: features are extracted from the preprocessed data with the embedding feature-extraction model, giving spectral features of the specified dimension 192;
step 5.2: the similarity of all sample features is computed from the cosine similarity to obtain a similarity matrix W with values between 0 and 1, where x_i and x_j denote two different data points in the sample space and the specified kernel parameter σ is 0.01;
step 5.3: the degree matrix D is computed from the similarity matrix as
d_i = Σ_j W_ij
wherein each value d_i of the degree matrix D is the sum of the elements W_ij in the i-th row of the similarity matrix W and represents the degree of each sample; the degree matrix D is the diagonal matrix formed by placing the values d_i on the diagonal;
step 5.4: the normalized Laplacian matrix is computed from the degree matrix and the similarity matrix:
L_sym = I - D^(-1/2)·W·D^(-1/2)
and the eigenvectors p_1, p_2, ..., p_k corresponding to the k smallest eigenvalues of L_sym are computed; let P = [p_1, p_2, ..., p_k] ∈ R^(n×k);
step 5.5: the rows of P are normalized to unit length to obtain the matrix H, with H_ij = P_ij / (Σ_j P_ij²)^(1/2);
step 5.6: for each i = 1, ..., n, let h_i ∈ R^k be the i-th row of the matrix H;
step 5.7: the points (h_i), i = 1, ..., n, are clustered into clusters C_1, ..., C_k by the k-means algorithm.
The technical scheme of the invention is further improved as follows: in step 6, the two cases where the number of speakers is known and where it is unknown (as in real conditions) are both considered, and the clustering results are evaluated and analyzed on the validation set and the test set.
By adopting the above technical scheme, the invention achieves the following technical progress:
1. Compared with the currently mainstream embedded-feature-vector extraction model Res2Net, the invention adopts an improved fully connected HS-Res2Net on the ECAPA-TDNN framework. Working at a finer granularity, it separates the voiceprint features across multi-scale channels, establishes connecting convolutions between those channels, and extracts discriminative voiceprint features through channel separation, convolution, channel splicing and feature fusion.
2. The original ECAPA-TDNN model file is 21 MB and the model file obtained by training is 24.5 MB; with the parameter count increased by only about 0.16 times, more receptive-field combinations and richer multi-scale feature expression are obtained, and cluster analysis of the fine-grained voiceprint features by spectral clustering significantly reduces the diarization error rate.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a block diagram of a multi-scale channel separation convolution feature extraction module of the present invention;
fig. 3 is an overall network architecture diagram of ECAPA-TDNN in accordance with the present invention.
Detailed Description
The invention is described in further detail below with reference to the attached drawings and examples:
As shown in FIG. 1, the speaker clustering method based on multi-scale channel separation convolution feature extraction first preprocesses the VoxCeleb and AMI datasets by pre-emphasis, framing, windowing, fast Fourier transform, Mel triangular filtering, logarithmic-energy computation and discrete cosine transform (DCT). The resulting 80-dimensional low-dimensional MFCC features are fed into the multi-scale channel separation convolution feature-extraction module, which extracts deeper feature information through channel splitting, convolution, dense connection and splicing, and re-convolution feature fusion. The final outputs of the multi-scale channel separation convolution basic blocks are channel-spliced and fused, a channel- and context-dependent statistics pooling layer aggregates global and local information, a softmax activation function and fully connected layers produce 192-dimensional multi-scale features, and the extracted features are cluster-analyzed by the spectral clustering algorithm. Experimental results on the VoxCeleb and AMI datasets show that the invention improves the feature expression of the ECAPA-TDNN model and alleviates the insufficient feature extraction and feature redundancy of conventional convolutional-neural-network feature-extraction methods.
A speaker clustering method based on multi-scale channel separation convolution feature extraction comprises the following steps:
Step 1: dividing the VoxCeleb and AMI datasets into a training set, a validation set and a test set;
considering that VoxCeleb is one of the largest speaker-recognition corpora and is mainly used for model training, the VoxCeleb2 dataset serves as the training set and the VoxCeleb1 dataset as the test set for evaluating the model; the AMI dataset is a meeting corpus with 4 speakers per meeting and is divided into a validation set and a test set for analyzing and evaluating the spectral clustering results.
Step 2: preprocessing the VoxCeleb and AMI data;
pre-emphasis, framing, windowing, fast Fourier transform, Mel triangular filtering, logarithmic-energy computation and discrete cosine transform (DCT) are performed on the VoxCeleb dataset used for model evaluation and the AMI dataset used for speaker clustering;
the preprocessing specifically comprises the following steps:
step 2.1: pre-emphasis is applied to the input speech signal through a first-order high-pass filter, whose transfer function is:
H(z) = 1 - t·z⁻¹
wherein H(z) is the pre-emphasis function, z is the transform-domain variable, and t is the pre-emphasis coefficient with 0.9 < t < 1.0;
step 2.2: the pre-emphasized speech signal is divided into frames with a partial overlap between adjacent frames, and a Hamming window is applied:
w(n) = 0.54 - 0.46·cos(2πn/(Q-1)), 0 ≤ n ≤ Q-1
wherein w(n) is the Hamming window function, Q is the number of samples per frame, and n is the time-domain discrete index;
step 2.3: the spectrum of each windowed frame x(n) is obtained by the discrete Fourier transform (or fast Fourier transform):
X(k) = Σ_{n=0}^{N-1} x(n)·e^{-j2πnk/N}, 0 ≤ k ≤ N-1
wherein x(n) is the time-domain sampled signal of a frame, X(k) is the speech spectrum, N is the length of the discrete Fourier transform interval, and k is the frequency-domain discrete index;
step 2.4: the spectrum obtained in step 2.3 is smoothed and harmonics are eliminated by Mel triangular filtering, the frequency response of the m-th triangular filter being:
H_m(k) = 0 for k < f(m-1); H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)) for f(m-1) ≤ k ≤ f(m); H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)) for f(m) < k ≤ f(m+1); H_m(k) = 0 for k > f(m+1)
wherein H_m(k) is the frequency response after triangular filtering, m indexes the filters, and f(m) is the center frequency of the m-th filter;
step 2.5: the logarithmic energy of the triangularly filtered frequency-domain signal is computed as:
s(m) = ln( Σ_{k=0}^{N-1} |X(k)|²·H_m(k) ), 0 ≤ m < M
wherein s(m) is the filtered logarithmic energy;
step 2.6: the logarithmic energies are passed through the discrete cosine transform (DCT) to obtain the final 80-dimensional MFCC coefficients:
C(n) = Σ_{m=0}^{M-1} s(m)·cos(πn(m + 0.5)/M), n = 1, 2, ..., L
wherein M is the number of triangular filters and L is the order of the MFCC coefficients;
Step 3: constructing a multi-scale channel separation convolution model on the basis of the ECAPA-TDNN network framework, improving the Res2Net multi-scale feature extraction module in the ECAPA-TDNN framework;
building the multi-scale channel separation convolution module on the ECAPA-TDNN network specifically comprises the following steps:
Step 3.1: a single multi-scale channel separation convolution feature extraction basic block is constructed, corresponding to module x1 in FIG. 2. The feature map with 1024 channels obtained from the preceding 1x1 convolution is divided into 8 parts along the channel dimension, each of 128 channels; the first part performs no operation and is kept as the original retained feature y1. Module x2 applies a convolution with kernel size 1, stride 1, dilation rate 1 and output dimension 128, and then splits its channels into y21 and y22, with y21 kept as a retained feature. Module x3 first channel-splices y22 from module x2 with the current features, applies a convolution with kernel size 1, stride 1, dilation rate 1 and output dimension 128, and then splits the channels into y31 and y32, with y31 kept as a retained feature. Module x4 channel-splices y22 from module x2, y31 from module x3 and the current features, performs the same operation, and splits the convolved features into y41 and y42, with y41 kept as a retained feature; the remaining modules proceed in the same way. Finally y1, y21, y31, y41, y51, y61, y71 and y8 are spliced along the feature dimension to obtain high-dimensional semantic features, and feature fusion is completed by the 1x1 convolution module x6 to give the final output.
Step 3.2: the multi-scale channel separation convolution feature extraction model is constructed by assembling the single module of Step 3.1 into an overall multi-scale feature extraction module, described as follows: the 80-dimensional MFCC features obtained after preprocessing first pass through a 1x1 convolution, ReLU and BN layer with a specified output dimension of 512; three consecutive multi-scale channel separation convolution feature extraction modules are then connected, with dilation rates set to 2, 3 and 4 respectively, multi-scale factor s = 8 and specified output dimension 512; the final outputs of the 3 modules are then channel-spliced, giving a spliced dimension of 2048; finally the spliced features are fused through a 1x1 convolution, ReLU layer and BN layer, and the final output dimension is specified as 1536;
Step 3.3: the overall architecture of the whole network is designed with the overall multi-scale feature extraction model of Step 3.2:
after the feature extraction layer described in Step 3.2, a channel- and context-dependent statistics pooling layer is attached: the frame-level features output by the overall multi-scale feature extraction module are combined with their mean and variance (specified dimension 1536). As shown in FIG. 3, a module is then introduced in which a linear layer reduces the dimension to 128, a tanh activation function and a second linear layer restore the original 1536 dimensions, and the result of a softmax activation function is used as weights applied to the previous statistics, giving an output dimension of 3072. Finally a linear layer reduces the dimension to 192 to obtain the final embedded feature vector. A PyTorch sketch of the basic block of Step 3.1 is given below.
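The sketch below is one plausible reading of the basic block described in Step 3.1 (8 groups of 128 channels, kernel size 1, retained halves spliced at the end and fused by a 1x1 convolution). The exact dense-connection pattern between the non-retained halves and the treatment of the last group are assumptions where the description is ambiguous, so this is an illustration rather than the patent's exact module.

```python
# Sketch of one multi-scale channel-separation basic block (Steps 3.1-3.2), in PyTorch.
# Kernel size 1 and the dilation parameter follow the text (with kernel size 1 the
# dilation has no effect and is kept only to mirror the description).
import torch
import torch.nn as nn

class ChannelSeparationBlock(nn.Module):
    def __init__(self, channels=1024, scale=8, dilation=2):
        super().__init__()
        self.scale = scale
        self.width = channels // scale              # 128 channels per group
        self.half = self.width // 2
        # one conv per processed group (x2 .. x8); its input is the current group
        # spliced with the passed-on halves of the earlier processed groups
        self.convs = nn.ModuleList(
            nn.Conv1d(self.width + i * self.half, self.width,
                      kernel_size=1, stride=1, dilation=dilation)
            for i in range(scale - 1)
        )
        # retained parts: y1 (full width) + (scale-2) half-splits + last group kept whole
        fused_ch = 2 * self.width + (scale - 2) * self.half
        self.fuse = nn.Conv1d(fused_ch, channels, kernel_size=1)  # 1x1 feature fusion
        self.relu = nn.ReLU()

    def forward(self, x):                           # x: (batch, channels, time)
        groups = torch.chunk(x, self.scale, dim=1)
        retained = [groups[0]]                      # y1, kept untouched
        passed = []                                 # y22, y32, ... handed onward
        for i in range(1, self.scale):
            inp = torch.cat([groups[i]] + passed, dim=1)
            out = self.relu(self.convs[i - 1](inp))
            if i < self.scale - 1:
                keep, give = torch.chunk(out, 2, dim=1)
                retained.append(keep)               # y21 ... y71
                passed.append(give)
            else:
                retained.append(out)                # y8: last group kept whole (assumed)
        return self.fuse(torch.cat(retained, dim=1))

# Example: a 1024-channel feature map of 200 frames passes through the block unchanged in shape
feats = torch.randn(2, 1024, 200)
block = ChannelSeparationBlock()
print(block(feats).shape)                           # torch.Size([2, 1024, 200])
```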
Step 4: selecting the AAM-softmax loss function to train the multi-scale channel separation convolution model multiple times to obtain an optimal model;
the AAM-softmax loss function computes the angle θ between the features of positive and negative speaker speech-segment samples, and the loss is used to update the weight coefficients of the network. The loss function scores each speaker label output by the classifier against the real label: if the two speech segments of a speaker pair in the test set come from the same speaker, the pair's label is 1, otherwise it is 0, so the output of the network is the probability that two speech segments come from the same speaker.
The calculation of the AAM-softmax loss specifically comprises the following steps:
step 4.1: the embedded feature vector finally extracted by the network and its corresponding weight vector are normalized:
x̂_i = x_i / ||x_i||,  Ŵ_j = W_j / ||W_j||
step 4.2: the cosine similarity is then used as the angular distance between the normalized embedding and weight vectors:
cos θ_j = Ŵ_j^T · x̂_i
from which the corresponding AAM-softmax loss is computed as the negative logarithm of the class probability:
L = -log( e^{s·cos(θ_{y_i} + q)} / ( e^{s·cos(θ_{y_i} + q)} + Σ_{j≠y_i} e^{s·cos θ_j} ) )
wherein the margin coefficient q is set to 0.2 and the scaling factor s is set to 30;
step 4.3: using the AAM-softmax loss function, the number of epochs is set to 10, the mini-batch setting for each training epoch is 16, and each mini-batch contains 400 speech pairs; the network is trained accordingly;
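The AAM-softmax (additive angular margin) loss of step 4 can be sketched as follows in PyTorch. Only the margin q = 0.2 and the scale s = 30 come from the text above; the class count and embedding dimension in the example are illustrative assumptions.

```python
# Minimal AAM-softmax (ArcFace-style) loss sketch for step 4; margin and scale follow the
# text, while the number of classes and the embedding dimension are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmax(nn.Module):
    def __init__(self, emb_dim=192, n_classes=5994, margin=0.2, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(n_classes, emb_dim))
        nn.init.xavier_uniform_(self.weight)
        self.m, self.s = margin, scale

    def forward(self, emb, labels):
        # Step 4.1: L2-normalize embeddings and class weights
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))   # cos(theta_j)
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        # Step 4.2: add the angular margin q to the target-class angle only
        one_hot = F.one_hot(labels, cos.size(1)).bool()
        logits = torch.where(one_hot, torch.cos(theta + self.m), cos)
        # negative log-probability with scaling factor s
        return F.cross_entropy(self.s * logits, labels)

# Example: a batch of 16 embeddings of dimension 192
loss_fn = AAMSoftmax()
loss = loss_fn(torch.randn(16, 192), torch.randint(0, 5994, (16,)))
print(loss.item())
```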
the structural parameters of the fully connected HS-Res2Net network are shown in Table 1:
Table 1. Fully connected HS-Res2Net network architecture parameters
Step 5: extracting features from AMI conference data by utilizing a multi-scale channel separation convolution model, and carrying out clustering analysis by utilizing spectral clustering;
an embedded feature vector of the specified dimension 192 is extracted with the obtained multi-scale channel separation convolution model, a similarity matrix and a degree matrix of the feature samples are constructed, the first k eigenvalues of the normalized Laplacian matrix and their corresponding eigenvectors are computed from the similarity matrix and the degree matrix, and the cluster analysis of the speech segments is finally completed by k-means;
the cluster analysis of the features extracted from the AMI meeting data specifically comprises the following steps:
step 5.1: features are extracted from the preprocessed data with the embedding feature-extraction model, giving spectral features of the specified dimension 192;
step 5.2: the similarity of all sample features is computed from the cosine similarity to obtain a similarity matrix W with values between 0 and 1, where x_i and x_j denote two different data points in the sample space and the specified kernel parameter σ is 0.01;
step 5.3: the degree matrix D is computed from the similarity matrix as
d_i = Σ_j W_ij
wherein each value d_i of the degree matrix D is the sum of the elements W_ij in the i-th row of the similarity matrix W and represents the degree of each sample; the degree matrix D is the diagonal matrix formed by placing the values d_i on the diagonal;
step 5.4: the normalized Laplacian matrix is computed from the degree matrix and the similarity matrix:
L_sym = I - D^(-1/2)·W·D^(-1/2)
and the eigenvectors p_1, p_2, ..., p_k corresponding to the k smallest eigenvalues of L_sym are computed; let P = [p_1, p_2, ..., p_k] ∈ R^(n×k);
step 5.5: the rows of P are normalized to unit length to obtain the matrix H, with H_ij = P_ij / (Σ_j P_ij²)^(1/2);
step 5.6: for each i = 1, ..., n, let h_i ∈ R^k be the i-th row of the matrix H;
step 5.7: the points (h_i), i = 1, ..., n, are clustered into clusters C_1, ..., C_k by the k-means algorithm. A code sketch of steps 5.2-5.7 follows.
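A compact NumPy/scikit-learn sketch of steps 5.2-5.7 is given below. The exact similarity formula of step 5.2 is not reproduced in this text, so the sketch assumes a Gaussian kernel over cosine distance with the stated σ = 0.01; that kernel form is an assumption, not the patent's formula.

```python
# Sketch of steps 5.2-5.7: similarity matrix, degree matrix, normalized Laplacian,
# eigenvectors, row normalization and k-means. The kernel (Gaussian over cosine
# distance) is an assumed form; the text only states that cosine similarity and a
# parameter sigma = 0.01 are used.
import numpy as np
from sklearn.cluster import KMeans

def spectral_cluster(embeddings, k, sigma=0.01):
    # Step 5.2: pairwise similarity matrix W in [0, 1]
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    cos_sim = X @ X.T
    W = np.exp(-(1.0 - cos_sim) / (2.0 * sigma ** 2))   # assumed kernel form
    np.fill_diagonal(W, 0.0)
    # Step 5.3: degree matrix D with d_i = sum_j W_ij
    d = W.sum(axis=1)
    # Step 5.4: normalized Laplacian L_sym = I - D^{-1/2} W D^{-1/2}
    d_inv_sqrt = 1.0 / np.sqrt(d + 1e-10)
    L_sym = np.eye(len(W)) - (d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :])
    # eigenvectors of the k smallest eigenvalues (eigh returns ascending order)
    _, vecs = np.linalg.eigh(L_sym)
    P = vecs[:, :k]
    # Steps 5.5-5.6: row-normalize P to obtain H, whose i-th row represents segment i
    H = P / (np.linalg.norm(P, axis=1, keepdims=True) + 1e-10)
    # Step 5.7: k-means on the rows of H
    return KMeans(n_clusters=k, n_init=10).fit_predict(H)

# Example: 200 segment embeddings of dimension 192, clustered into 4 speakers
labels = spectral_cluster(np.random.randn(200, 192), k=4)
print(labels[:10])
```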
Step 6: scoring the clustering results using the standard diarization error rate (DER);
the two cases where the number of speakers is known and where it is unknown (as in real conditions) are both considered, and the clustering results are evaluated and analyzed on the validation set and the test set;
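When the number of speakers is unknown, the patent does not spell out how the cluster count k is chosen. A common heuristic in spectral clustering of speaker embeddings is the eigengap criterion sketched below; it is shown purely as an illustrative assumption, not as the patent's method.

```python
# Illustrative eigengap heuristic for choosing k when the number of speakers is unknown.
# This is common practice in spectral speaker clustering and is NOT stated in the patent;
# it is shown only as an assumed way of handling the unknown-k case.
import numpy as np

def estimate_num_speakers(L_sym, max_speakers=10):
    """Pick k as the position of the largest gap between consecutive eigenvalues of L_sym."""
    eigvals = np.sort(np.linalg.eigvalsh(L_sym))[:max_speakers + 1]
    gaps = np.diff(eigvals)            # gap between eigenvalue i and i+1
    return int(np.argmax(gaps)) + 1    # number of small eigenvalues before the gap

# Example usage: reuse the L_sym built inside the spectral_cluster() sketch above, then
# call spectral_cluster(embeddings, k=estimate_num_speakers(L_sym)).
```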
the standard partition cluster error rate is as follows:
wherein T is Spk Speech duration, T, representing speaker cluster errors Miss Representing the time length of misjudgment of effective voice as non-voice, T False Representing the duration of the non-speech misjudgment as valid speech, T Total To test the total duration of the audio in the collection, here, the sum is 0 because manually labeled speech segments are used. Partition clustering errorThe error rate DER is shown in Table 2:
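The DER formula above can be computed directly from the four durations; the helper below is a minimal illustration with made-up example values (the patent defines only the quantities, not an implementation).

```python
# Direct computation of DER = (T_Spk + T_Miss + T_False) / T_Total from step 6.
# The duration values in the example are placeholders; in practice they come from
# comparing the clustering output against the manually labeled reference segments.
def diarization_error_rate(t_spk: float, t_miss: float, t_false: float, t_total: float) -> float:
    return (t_spk + t_miss + t_false) / t_total

# Example with oracle speech segments (T_Miss = T_False = 0, as in the text above):
print(diarization_error_rate(t_spk=120.0, t_miss=0.0, t_false=0.0, t_total=3600.0))  # ~0.033
```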
Table 2. Diarization error rate (DER)
In summary, after the first 1x1 convolution in Res2Net, the invention splits, splices and re-convolves the channels, retains part of the features after each channel convolution as reserved features, and establishes feature associations with the other channels through the remaining parts; the finally obtained features undergo 1x1 feature fusion, so the feature relations among channels are exploited and the embedded feature vectors are extracted more fully. The trained model is then used to extract features from the AMI meeting dataset, and finally the time intervals of the speakers are clustered by spectral clustering.

Claims (4)

1. A speaker clustering method based on multi-scale channel separation convolution feature extraction is characterized in that: the method comprises the following steps:
step 1: dividing the VoxCeleb and AMI datasets into a training set, a validation set and a test set;
step 2: preprocessing the VoxCeleb and AMI data;
step 3: constructing a multi-scale channel separation convolution model on the basis of an ECAPA-TDNN network frame, and improving a Res2Net multi-scale feature extraction module in the ECAPA-TDNN network frame;
the step 3 specifically comprises the following steps:
step 3.1: constructing a single multi-scale channel separation convolution feature extraction basic block, dividing a channel into 8 parts after a first TDNN convolution layer, carrying out convolution on each part, splicing the convolved features according to the channel, and carrying out feature fusion through the TDNN convolution layer;
step 3.2: constructing a multi-scale channel separation convolution feature extraction model, connecting 3 continuous multi-scale channel separation convolution feature extraction basic blocks after the pre-processed 80-dimensional MFCC features are subjected to 1x1 convolution, then carrying out channel splicing on the output obtained by each block, and finally completing feature fusion through 1x1 convolution;
step 3.3: the obtained multi-scale channel separation convolution feature extraction model is connected to a statistics pooling layer to obtain global and local means and variances, and the final embedded feature vector is obtained through a softmax activation function and two linear fully connected layers;
step 4: the AAM-softmax loss function is selected to train the multi-scale channel separation convolution model multiple times to obtain the optimal multi-scale channel separation convolution model;
in step 4, the AAM-softmax loss function computes the angle θ between the features of positive and negative speaker speech-segment samples, and the loss is used to update the weight coefficients of the network, specifically comprising the following steps:
step 4.1: the embedded feature vector finally extracted by the network and its corresponding weight vector are normalized:
x̂_i = x_i / ||x_i||,  Ŵ_j = W_j / ||W_j||
step 4.2: the cosine similarity is then used as the angular distance between the normalized embedding and weight vectors:
cos θ_j = Ŵ_j^T · x̂_i
from which the corresponding AAM-softmax loss is computed as the negative logarithm of the class probability:
L = -log( e^{s·cos(θ_{y_i} + q)} / ( e^{s·cos(θ_{y_i} + q)} + Σ_{j≠y_i} e^{s·cos θ_j} ) )
wherein the margin coefficient q is set to 0.2 and the scaling factor s is set to 30;
step 4.3: using the AAM-softmax loss function, the number of epochs is set to 10, the mini-batch setting for each training epoch is 16, and each mini-batch contains 400 speech pairs; the network is trained accordingly;
step 5: extracting features from AMI conference data by utilizing a multi-scale channel separation convolution model, and carrying out clustering analysis by utilizing spectral clustering;
step 6: scoring the clustering results using the standard diarization error rate (DER).
2. The speaker clustering method based on multi-scale channel separation convolution feature extraction as claimed in claim 1, wherein: in step 2, pre-emphasis, framing, windowing, fast Fourier transform, Mel triangular filtering, logarithmic-energy computation and discrete cosine transform are performed on the VoxCeleb dataset used for model evaluation and the AMI dataset used for speaker clustering, specifically comprising the following steps:
step 2.1: pre-emphasis is applied to the input speech signal through a first-order high-pass filter, whose transfer function is:
H(z) = 1 - t·z⁻¹
wherein H(z) is the pre-emphasis function, z is the transform-domain variable, and t is the pre-emphasis coefficient with 0.9 < t < 1.0;
step 2.2: the pre-emphasized speech signal is divided into frames with a partial overlap between adjacent frames, and a Hamming window is applied:
w(n) = 0.54 - 0.46·cos(2πn/(Q-1)), 0 ≤ n ≤ Q-1
wherein w(n) is the Hamming window function, Q is the number of samples per frame, and n is the time-domain discrete index;
step 2.3: the spectrum of each windowed frame x(n) is obtained by the discrete Fourier transform (or fast Fourier transform):
X(k) = Σ_{n=0}^{N-1} x(n)·e^{-j2πnk/N}, 0 ≤ k ≤ N-1
wherein x(n) is the time-domain sampled signal of a frame, X(k) is the speech spectrum, N is the length of the discrete Fourier transform interval, and k is the frequency-domain discrete index;
step 2.4: the spectrum obtained in step 2.3 is smoothed and harmonics are eliminated by Mel triangular filtering, the frequency response of the m-th triangular filter being:
H_m(k) = 0 for k < f(m-1); H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)) for f(m-1) ≤ k ≤ f(m); H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)) for f(m) < k ≤ f(m+1); H_m(k) = 0 for k > f(m+1)
wherein H_m(k) is the frequency response after triangular filtering, m indexes the filters, and f(m) is the center frequency of the m-th filter;
step 2.5: the logarithmic energy of the triangularly filtered frequency-domain signal is computed as:
s(m) = ln( Σ_{k=0}^{N-1} |X(k)|²·H_m(k) ), 0 ≤ m < M
wherein s(m) is the filtered logarithmic energy;
step 2.6: the logarithmic energies are passed through the discrete cosine transform to obtain the final 80-dimensional MFCC coefficients:
C(n) = Σ_{m=0}^{M-1} s(m)·cos(πn(m + 0.5)/M), n = 1, 2, ..., L
wherein M is the number of triangular filters and L is the order of the MFCC coefficients.
3. The speaker clustering method based on multi-scale channel separation convolution feature extraction as claimed in claim 1, wherein: in step 5, an embedded feature vector of the specified dimension 192 is extracted with the obtained multi-scale channel separation convolution model, a similarity matrix and a degree matrix of the feature samples are constructed, the first k eigenvalues of the normalized Laplacian matrix and their corresponding eigenvectors are computed from the similarity matrix and the degree matrix, and the cluster analysis of the speech segments is finally completed by k-means, specifically comprising the following steps:
step 5.1: features are extracted from the preprocessed data with the embedding feature-extraction model, giving spectral features of the specified dimension 192;
step 5.2: the similarity of all sample features is computed from the cosine similarity to obtain a similarity matrix W with values between 0 and 1, where x_i and x_j denote two different data points in the sample space and the specified kernel parameter σ is 0.01;
step 5.3: the degree matrix D is computed from the similarity matrix as
d_i = Σ_j W_ij
wherein each value d_i of the degree matrix D is the sum of the elements W_ij in the i-th row of the similarity matrix W and represents the degree of each sample; the degree matrix D is the diagonal matrix formed by placing the values d_i on the diagonal;
step 5.4: the normalized Laplacian matrix is computed from the degree matrix and the similarity matrix:
L_sym = I - D^(-1/2)·W·D^(-1/2)
and the eigenvectors p_1, p_2, ..., p_k corresponding to the k smallest eigenvalues of L_sym are computed; let P = [p_1, p_2, ..., p_k] ∈ R^(n×k);
step 5.5: the rows of P are normalized to unit length to obtain the matrix H, with H_ij = P_ij / (Σ_j P_ij²)^(1/2);
step 5.6: for each i = 1, ..., n, let h_i ∈ R^k be the i-th row of the matrix H;
step 5.7: the points (h_i), i = 1, ..., n, are clustered into clusters C_1, ..., C_k by the k-means algorithm.
4. The speaker clustering method based on multi-scale channel separation convolution feature extraction as claimed in claim 1, wherein: in step 6, the two cases where the number of speakers is known and where it is unknown (as in real conditions) are both considered, and the clustering results are evaluated and analyzed on the validation set and the test set.
CN202210588389.0A 2022-05-26 2022-05-26 Speaker clustering method based on multi-scale channel separation convolution feature extraction Active CN115101076B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210588389.0A CN115101076B (en) 2022-05-26 2022-05-26 Speaker clustering method based on multi-scale channel separation convolution feature extraction


Publications (2)

Publication Number Publication Date
CN115101076A (en) 2022-09-23
CN115101076B (en) 2023-09-12

Family

ID=83289495

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210588389.0A Active CN115101076B (en) 2022-05-26 2022-05-26 Speaker clustering method based on multi-scale channel separation convolution feature extraction

Country Status (1)

Country Link
CN (1) CN115101076B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116072125B (en) * 2023-04-07 2023-10-17 成都信息工程大学 Method and system for constructing self-supervision speaker recognition model in noise environment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102270451A (en) * 2011-08-18 2011-12-07 安徽科大讯飞信息科技股份有限公司 Method and system for identifying speaker
CN110289002A (en) * 2019-06-28 2019-09-27 四川长虹电器股份有限公司 A kind of speaker clustering method and system end to end
CN111161744A (en) * 2019-12-06 2020-05-15 华南理工大学 Speaker clustering method for simultaneously optimizing deep characterization learning and speaker classification estimation

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10133538B2 (en) * 2015-03-27 2018-11-20 Sri International Semi-supervised speaker diarization
WO2018106971A1 (en) * 2016-12-07 2018-06-14 Interactive Intelligence Group, Inc. System and method for neural network based speaker classification

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102270451A (en) * 2011-08-18 2011-12-07 安徽科大讯飞信息科技股份有限公司 Method and system for identifying speaker
CN110289002A (en) * 2019-06-28 2019-09-27 四川长虹电器股份有限公司 A kind of speaker clustering method and system end to end
CN111161744A (en) * 2019-12-06 2020-05-15 华南理工大学 Speaker clustering method for simultaneously optimizing deep characterization learning and speaker classification estimation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于多信号特征融合的煤料漏斗堵塞检测研究 [Research on coal hopper blockage detection based on multi-signal feature fusion]; 潘攀, 吴海, 李海滨; 电工技术学报 (Transactions of China Electrotechnical Society), No. 30; 561-564 *

Also Published As

Publication number Publication date
CN115101076A (en) 2022-09-23

Similar Documents

Publication Publication Date Title
CN110853654B (en) Model generation method, voiceprint recognition method and corresponding device
CN106952643A (en) A kind of sound pick-up outfit clustering method based on Gaussian mean super vector and spectral clustering
CN107221320A (en) Train method, device, equipment and the computer-readable storage medium of acoustic feature extraction model
CN111243602A (en) Voiceprint recognition method based on gender, nationality and emotional information
CN110211594B (en) Speaker identification method based on twin network model and KNN algorithm
CN105206270A (en) Isolated digit speech recognition classification system and method combining principal component analysis (PCA) with restricted Boltzmann machine (RBM)
CN107731233A (en) A kind of method for recognizing sound-groove based on RNN
CN110910891B (en) Speaker segmentation labeling method based on long-time and short-time memory deep neural network
CN113066499B (en) Method and device for identifying identity of land-air conversation speaker
CN111508524B (en) Method and system for identifying voice source equipment
CN109961794A (en) A kind of layering method for distinguishing speek person of model-based clustering
CN111653267A (en) Rapid language identification method based on time delay neural network
CN115457966B (en) Pig cough sound identification method based on improved DS evidence theory multi-classifier fusion
CN111986699A (en) Sound event detection method based on full convolution network
CN112562725A (en) Mixed voice emotion classification method based on spectrogram and capsule network
CN115101076B (en) Speaker clustering method based on multi-scale channel separation convolution feature extraction
CN116741148A (en) Voice recognition system based on digital twinning
CN110246509A (en) A kind of stack denoising self-encoding encoder and deep neural network structure for voice lie detection
CN105006231A (en) Distributed large population speaker recognition method based on fuzzy clustering decision tree
Nyodu et al. Automatic identification of Arunachal language using K-nearest neighbor algorithm
Kamble et al. Emotion recognition for instantaneous Marathi spoken words
CN114970695B (en) Speaker segmentation clustering method based on non-parametric Bayesian model
Li et al. Feature extraction with convolutional restricted boltzmann machine for audio classification
CN115064175A (en) Speaker recognition method
CN111326161B (en) Voiceprint determining method and device

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant