CN115083433A - DNN-based text-independent timbre clustering method - Google Patents

DNN-based text-independent timbre clustering method

Info

Publication number
CN115083433A
CN115083433A (application CN202210634114.6A)
Authority: CN (China)
Prior art keywords: clustering, DNN, training, embedding, voice
Legal status: Withdrawn
Application number: CN202210634114.6A
Other languages: Chinese (zh)
Inventor: 蒋竺芳
Current Assignee: Individual
Original Assignee: Individual
Application filed by Individual
Priority to CN202210634114.6A; publication of CN115083433A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 characterised by the type of extracted parameters
    • G10L25/18 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L25/27 characterised by the analysis technique
    • G10L25/30 characterised by the analysis technique, using neural networks
    • G10L25/45 characterised by the type of analysis window

Abstract

The invention discloses a DNN-based text-independent timbre clustering method, which comprises the following steps: step S10: acquiring a large amount of speaker voice data as a training data set; step S20: constructing an acoustic recognition model; step S30: training the acoustic model; step S40: extracting timbre embedding feature vectors; step S50: applying spectral clustering to the embedding feature vectors to cluster timbres. The invention constructs a multi-class cross entropy objective function, learns a data-driven timbre characterization, and completes timbre clustering with a spectral clustering algorithm better suited to high-dimensional features, thereby improving the matching rate for a specific person's voice and, in turn, the accuracy of speaker-specific voice recognition.

Description

DNN-based text-independent timbre clustering method
Technical Field
The invention relates to the technical field of voice signal processing, in particular to a DNN-based text-independent timbre clustering method.
Background
Timbre clustering is the process of partitioning the audio in an input set into homogeneous segments according to the timbres of multiple speakers, distinguishing each timbre from the others. It attempts to answer the question "what characteristics does a person's voice have", and has a wide range of applications including multimedia information retrieval, speaker analysis, and audio processing. At present, data sets with audio timbre category labels are scarce, so unsupervised clustering algorithms are mostly adopted for timbre identification. A typical timbre clustering system generally consists of two parts: (1) audio feature extraction, in which specific features such as MFCC, LPC, or LPCC are extracted from the original audio; and (2) clustering, in which the number of timbre categories is determined and the extracted audio features are grouped into those categories.
In previous research, MFCC has been widely used as a speech feature. It carries both speaker information and channel information and performs well for speech recognition and other tasks that are sensitive to speech content. Timbre clustering, however, is concerned with speaker identity, so interference from content-dependent information in the features should be reduced; timbre clustering therefore requires a text-independent feature representation.
Clustering: grouping similar data together without concern for what each group means; clustering is an unsupervised learning method. The clustering algorithm used most often in timbre clustering is K-means, and using K-means directly to classify timbres raises several problems. Speech data is typically non-Gaussian, in which case a K-means cluster center is not sufficient to represent a class. Furthermore, speech characteristics are influenced by many factors such as gender, age, and accent, and this structure makes direct clustering perform poorly: for example, the difference between a male and a female speaker is much larger than the difference between two female speakers, which in practice often leads K-means to wrongly gather all male speakers into one cluster and all female speakers into another. The preset number of clusters is also hard to fix, and the initial centroids of K-means are generally chosen at random: different initial centroids give different final clusterings, and misclassification caused by an unreasonable choice of initial centroids is the most common problem.
Disclosure of Invention
The invention aims to provide a DNN (deep neural network)-based text-independent timbre clustering method to overcome the above defects in the prior art.
In order to achieve the purpose, the invention provides the following technical scheme:
A DNN-based text-independent timbre clustering method comprises the following steps:
step S10: acquiring a large amount of speaker voice data as a training data set;
step S20: constructing an acoustic recognition model, the overall structure of which is a feedforward DNN that computes embedding features from variable-length voice data; the computation is divided into acoustic feature extraction and DNN acoustic model construction;
step S30: training the acoustic model, with a multi-class cross entropy objective function constructed as follows:
E = -Σ_{n=1}^{N} Σ_{k=1}^{K} d_nk ln P(spkr_k | x(n)_1, ..., x(n)_T)
training the network with the multi-class cross entropy objective function to classify the training speakers, wherein the training optimization assumes that each speaker's timbre forms one class, so that the K speakers in the N training segments give K timbre classes; P(spkr_k | x(n)_1, ..., x(n)_T) is defined as the probability that the timbre class is k given the T input frames x(n)_1, x(n)_2, ..., x(n)_T of segment n, and d_nk is 1 if the speaker label of the speech segment n is k and 0 otherwise;
step S40: extracting timbre embedding feature vectors, wherein the network is trained to produce embedding vectors that carry as much speaker timbre information as possible and generalize to speakers (timbres) not present in the training data, the embedding features being used to capture the timbre of the whole utterance;
step S50: and (3) applying spectral clustering on the embedding characteristic vector to perform tone clustering, and regarding a similarity matrix S, regarding Sij as the weight of an edge between nodes i and j in an undirected graph. Spectral clustering divides an original graph into subgraphs by removing weak edges with smaller weights.
Preferably, the acoustic feature extraction in step S20 comprises the following steps:
s201: receiving an unknown voice signal, and performing pre-emphasis, framing and windowing processing on the unknown voice signal;
s202: performing audio framing with a frame length of 25 ms and a frame shift of 10 ms;
s203: carrying out FFT (fast Fourier transform) on the preprocessed unknown voice signal;
s204: the unknown voice signal after FFT is passed through a Mel filter bank to obtain Mel frequency spectrum;
s205: carrying out logarithmic energy processing on the Mel frequency spectrum to obtain a logarithmic frequency spectrum;
s206: performing a DCT transform on the logarithmic spectrum, taking the first 13 cepstral coefficients together with their first-order and second-order differences, and combining them with the frame energy to form 40-dimensional features that serve as the feature parameters of the DNN model;
preferably, in the step S20, a DNN acoustic model is constructed, where the model overall architecture includes a plurality of frame-level TDNN layers, a statistical pooling layer representing aggregation at a segment level, two sentence-level fully-connected layers, and a last softmax output layer; the first 5 layers of the network are constructed at the frame level, and a time delay neural network is adopted, and the method comprises the following steps:
s211: a designated person reads a section of Chinese text within time t, and takes the section of voice as a voice signal to preprocess the voice signal;
s212: assuming that t is the current time step, firstly splicing the frames at (t-2, t-1, t, t +1, t +2) together as network input;
s213: the next two layers respectively splice the outputs of the previous layer at (t-2, t, t+2) and (t-3, t, t+3); the fourth and fifth layers also operate at the frame level but add no further temporal context.
Preferably, the statistics pooling layer receives the final frame-level output as input, aggregates it over the input segment by computing the mean and standard deviation of the TDNN network output, concatenates these segment-level statistics, and passes them to the two sentence-level fully connected layers.
Preferably, in step S30 the acoustic model's data set is cleaned by removing any audio shorter than 4 seconds, and the training data consists of hundreds of hours of speech from on the order of a thousand speakers.
Preferably, the spectral clustering in step S50 includes the following steps:
s501: constructing a similarity matrix S from the embedding feature vectors, where S_ij is the cosine similarity between the embedding vectors of the i-th and j-th voice segments, and the diagonal elements (i = j) are set to 0;
s502: the laplacian matrix L is calculated and normalization is performed:
L = D - S
L_norm = D^(-1) L
wherein D is a diagonal matrix with
D_ii = Σ_j S_ij
s503: calculating the eigenvalues and eigenvectors of L_norm;
s504: calculating the number of clusters k: in the similarity graph represented by L_norm, the number of connected components equals the algebraic multiplicity of the eigenvalue 0, so a threshold β is set and the number of eigenvalues below β is counted as k;
s505: taking the k smallest eigenvalues λ1, λ2, ..., λk of L_norm and their corresponding eigenvectors p1, p2, ..., pk, and constructing the matrix P ∈ R^(n×k) with p1, p2, ..., pk as columns;
S506: clustering the row vectors y1, y2, ..., yn of P with the k-means algorithm.
Compared with the prior art, the invention has the following beneficial effects: without any manual input, the invention completes timbre clustering by constructing a multi-class cross entropy objective function, learning a data-driven timbre characterization, and applying a spectral clustering algorithm better suited to high-dimensional features, thereby improving the matching rate for a specific person's voice and the accuracy of speaker-specific voice recognition.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic structural diagram of a DNN acoustic model of the present invention;
FIG. 3 is a flow chart of the speech feature extraction of the speech recognition technique of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and therefore the scope of the present invention is not limited by the specific embodiments disclosed below.
Referring to fig. 1 to fig. 3, a DNN-based text-independent timbre clustering method according to an embodiment of the present invention includes the following steps:
step S10: acquiring a large amount of speaker voice data as a training data set;
step S20: constructing an acoustic recognition model, the overall structure of which is a feedforward DNN that computes embedding features from variable-length voice data; the computation is divided into acoustic feature extraction and DNN acoustic model construction;
step S30: training the acoustic model, with a multi-class cross entropy objective function constructed as follows:
E = -Σ_{n=1}^{N} Σ_{k=1}^{K} d_nk ln P(spkr_k | x(n)_1, ..., x(n)_T)
training the network with the multi-class cross entropy objective function to classify the training speakers. The training optimization assumes that each speaker's timbre forms one class, so that the K speakers in the N training segments give K timbre classes; P(spkr_k | x(n)_1, ..., x(n)_T) is defined as the probability that the timbre class is k given the T input frames x(n)_1, x(n)_2, ..., x(n)_T of segment n, and d_nk is 1 if the speaker label of segment n is k and 0 otherwise. The data set is cleaned by removing any audio shorter than 4 seconds, and the training data consists of hundreds of hours of speech from on the order of a thousand speakers;
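For one-hot labels d_nk, the multi-class cross entropy objective above reduces to summing -ln P over the labelled class of each segment. A minimal NumPy sketch; the toy softmax probabilities and labels below are invented for illustration:

```python
import numpy as np

def multiclass_cross_entropy(probs, labels):
    """Multi-class cross entropy over N training segments and K speakers.

    probs  : (N, K) array, probs[n, k] = P(spkr_k | x(n)_1, ..., x(n)_T)
    labels : (N,) array of speaker indices, so d_nk = 1 iff labels[n] == k
    """
    n = np.arange(len(labels))
    # E = -sum_n sum_k d_nk * ln P; d_nk selects one probability per segment
    return -np.sum(np.log(probs[n, labels]))

# toy example: 3 segments, 2 speakers
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.7, 0.3]])
labels = np.array([0, 1, 0])
loss = multiclass_cross_entropy(probs, labels)
```

Minimizing this loss drives the softmax outputs toward the speaker labels, which is what forces the hidden layers to encode speaker timbre.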
step S40: extracting timbre embedding feature vectors. The network is trained to produce embedding vectors that carry as much speaker timbre information as possible and generalize to speakers (timbres) not present in the training data; the embedding captures the timbre of the whole utterance rather than operating at the frame level. Any fully connected layer after the statistics pooling layer is therefore a reasonable position from which to extract the embedding feature, i.e. the position of embedding a or embedding b in fig. 1, where embedding a is the output of the fully connected layer directly above the statistics pooling layer, and embedding b is extracted from the fully connected layer after the ReLU activation function, so it is a non-linear transform of the pooled statistics.
Step S50: applying spectral clustering to the embedding feature vectors to cluster timbres: each element S_ij of a similarity matrix S is regarded as the weight of the edge between nodes i and j in an undirected graph, and spectral clustering partitions the original graph into subgraphs by cutting weak edges with small weights.
Specifically, the step S20 of extracting the acoustic features includes the following steps:
s201: receiving an unknown voice signal and performing pre-emphasis, framing, and windowing on it; pre-emphasis boosts the high-frequency part of the speech, removes the influence of lip radiation, and increases the high-frequency resolution so that the spectrum becomes flat and the whole band, from low frequency to high frequency, can be analysed with the same signal-to-noise ratio;
s202: performing audio framing with a frame length of 25 ms and a frame shift of 10 ms; framing and windowing yield the smoothed frames of the input voice;
s203: carrying out FFT (fast Fourier transform) on the preprocessed unknown voice signal, namely carrying out FFT on each smooth frame of the voice;
s204: passing the unknown voice signal after FFT through a Mel filter bank to obtain Mel frequency spectrum S (m);
s205: applying logarithmic energy processing to the Mel spectrum S(m) to obtain the logarithmic spectrum L(m), i.e.
L(m) = ln(S(m))
S206: performing a DCT transform on the logarithmic spectrum, taking the first 13 cepstral coefficients together with their first-order and second-order differences, and combining them with the frame energy to form 40-dimensional features that serve as the feature parameters of the DNN model.
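Steps S201-S206 can be sketched end-to-end in NumPy. This is an illustrative reconstruction, not the patent's implementation: the 16 kHz sample rate, 512-point FFT, 24 mel filters, and 0.97 pre-emphasis coefficient are assumptions, and the delta/energy assembly into 40 dimensions is omitted for brevity:

```python
import numpy as np

def mfcc_13(signal, sr=16000, n_mels=24, n_fft=512):
    """Pre-emphasis, 25 ms frames with 10 ms shift, Hamming window, FFT,
    mel filterbank, log, DCT -> first 13 cepstral coefficients per frame."""
    # S201: pre-emphasis boosts the high-frequency part of the speech
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    # S202: framing, 25 ms frames shifted by 10 ms, with a Hamming window
    flen, fstep = int(0.025 * sr), int(0.010 * sr)
    n_frames = 1 + (len(emphasized) - flen) // fstep
    idx = np.arange(flen)[None, :] + fstep * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(flen)

    # S203: FFT power spectrum of each smoothed frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2

    # S204: triangular mel filterbank -> mel spectrum S(m)
    mel_max = 2595 * np.log10(1 + (sr / 2) / 700)
    mel_pts = 700 * (10 ** (np.linspace(0, mel_max, n_mels + 2) / 2595) - 1)
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    mel_spec = power @ fbank.T

    # S205: log energy -> logarithmic spectrum L(m)
    log_spec = np.log(np.maximum(mel_spec, 1e-10))

    # S206: DCT-II over the filter axis, keep the first 13 coefficients
    k = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(13), (2 * k + 1) / (2 * n_mels)))
    return log_spec @ dct.T

# one second of random "speech" at 16 kHz -> 98 frames of 13 MFCCs
feats = mfcc_13(np.random.default_rng(0).standard_normal(16000))
```

Appending first- and second-order differences of the 13 coefficients plus the frame energy would give the 40-dimensional vectors described above.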
Specifically, in step S20 a DNN acoustic model is constructed whose overall architecture comprises several frame-level TDNN layers, a statistics pooling layer that aggregates at the segment level, two sentence-level fully connected layers, and a final softmax output layer; the first five layers of the network operate at the frame level and use a time-delay neural network, as follows:
s211: a designated person reads a section of Chinese text within time t, and takes the section of voice as a voice signal to preprocess the voice signal;
s212: assuming that t is the current time step, firstly splicing the frames at (t-2, t-1, t, t +1, t +2) together as network input;
s213: the next two layers respectively splice the outputs of the previous layer at (t-2, t, t+2) and (t-3, t, t+3); the fourth and fifth layers also operate at the frame level but add no further temporal context.
Structurally, each TDNN layer is still a convolutional layer; the difference is that the input of each layer splices historical, current, and future features, which introduces temporal information. Compared with an LSTM, the TDNN architecture can parallelize training; compared with a plain CNN, it adds temporal context information.
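The context splicing of steps S212-S213 amounts to gathering features at fixed time offsets around each frame and concatenating them. A minimal sketch; clamping edge frames to the segment boundary is an illustrative assumption the patent does not specify:

```python
import numpy as np

def splice(feats, offsets):
    """Concatenate the feature vectors at the given time offsets around
    each frame t, mimicking TDNN context splicing (edges clamped)."""
    T = len(feats)
    cols = [feats[np.clip(np.arange(T) + o, 0, T - 1)] for o in offsets]
    return np.concatenate(cols, axis=1)

frames = np.arange(10, dtype=float).reshape(10, 1)  # 10 frames, 1-dim feature
layer1_in = splice(frames, (-2, -1, 0, 1, 2))       # {t-2, t-1, t, t+1, t+2}
layer2_in = splice(frames, (-2, 0, 2))              # {t-2, t, t+2}
layer3_in = splice(frames, (-3, 0, 3))              # {t-3, t, t+3}
```

Stacking these layers widens the effective temporal receptive field of each frame-level output without recurrent connections, which is why training parallelizes.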
Specifically, the statistics pooling layer receives the final frame-level output as input, aggregates it over the input segment by computing the mean and standard deviation of the TDNN network output, concatenates these segment-level statistics, and passes them to the two sentence-level fully connected layers; this yields a sentence-level feature expression, and each fully connected layer has dimension 512 (either one can be used to extract the embedding feature vector). The acoustic model ends with a softmax output layer whose number of nodes equals the number of speakers in the training set.
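The statistics pooling described above can be sketched as a mean/standard-deviation concatenation over the frame axis; the 64-dimensional frame output below is an arbitrary illustrative size, not the patent's:

```python
import numpy as np

def statistics_pooling(frame_outputs):
    """Aggregate frame-level TDNN outputs over a whole segment: the mean
    and standard deviation are concatenated into one segment-level vector,
    which then feeds the sentence-level fully connected layers."""
    mean = frame_outputs.mean(axis=0)
    std = frame_outputs.std(axis=0)
    return np.concatenate([mean, std])  # shape (2 * feature_dim,)

# e.g. 200 frames of 64-dim TDNN output -> one 128-dim segment statistic
rng = np.random.default_rng(1)
x = rng.standard_normal((200, 64))
pooled = statistics_pooling(x)
```

Because the pooled vector has a fixed size regardless of the number of frames, the layers after it can process variable-length utterances uniformly.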
Specifically, the spectral clustering in step S50 includes the following steps:
s501: constructing a similarity matrix S from the embedding feature vectors, where S_ij is the cosine similarity between the embedding vectors of the i-th and j-th voice segments, and the diagonal elements (i = j) are set to 0;
s502: the laplacian matrix L is calculated and normalization is performed:
L = D - S
L_norm = D^(-1) L
wherein D is a diagonal matrix with
D_ii = Σ_j S_ij
s503: calculating the eigenvalues and eigenvectors of L_norm;
s504: calculating the number of clusters k: in the similarity graph represented by L_norm, the number of connected components equals the algebraic multiplicity of the eigenvalue 0, so a threshold β is set and the number of eigenvalues below β is counted as k;
s505: taking the k smallest eigenvalues λ1, λ2, ..., λk of L_norm and their corresponding eigenvectors p1, p2, ..., pk, and constructing the matrix P ∈ R^(n×k) with p1, p2, ..., pk as columns;
S506: clustering the row vectors y1, y2, ..., yn of P with the k-means algorithm.
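Steps S501-S506 can be sketched as follows. The threshold β = 0.5, the farthest-point k-means initialization, and the toy two-cluster embedding data are illustrative assumptions, not values from the patent:

```python
import numpy as np

def spectral_cluster(emb, beta=0.5, iters=20):
    """Spectral clustering of embedding vectors via the normalized
    Laplacian, with the cluster count k inferred from small eigenvalues."""
    # S501: cosine similarity matrix S with diagonal (i = j) set to 0
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    S = unit @ unit.T
    np.fill_diagonal(S, 0.0)

    # S502: L = D - S, L_norm = D^(-1) L, with D_ii = sum_j S_ij
    d = S.sum(axis=1)
    L_norm = np.diag(1.0 / d) @ (np.diag(d) - S)

    # S503/S504: eigen-decompose; k = count of eigenvalues below beta
    vals, vecs = np.linalg.eig(L_norm)
    vals, vecs = vals.real, vecs.real
    order = np.argsort(vals)
    k = int(np.sum(vals < beta))

    # S505: P in R^{n x k} from the eigenvectors of the k smallest eigenvalues
    P = vecs[:, order[:k]]

    # S506: k-means on the rows of P (farthest-point init, Lloyd iterations)
    centers = P[[0]]
    for _ in range(k - 1):
        dist = ((P[:, None] - centers[None]) ** 2).sum(-1).min(axis=1)
        centers = np.vstack([centers, P[np.argmax(dist)]])
    for _ in range(iters):
        labels = ((P[:, None] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        centers = np.vstack([P[labels == c].mean(axis=0) for c in range(k)])
    return labels

# toy data: two tight "timbre" clusters pointing in different directions
rng = np.random.default_rng(0)
a = rng.normal(0.0, 0.05, (10, 8)); a[:, 0] += 1.0
b = rng.normal(0.0, 0.05, (10, 8)); b[:, 1] += 1.0
labels = spectral_cluster(np.vstack([a, b]))
```

On this toy input the two near-zero eigenvalues of L_norm give k = 2, and the rows of P separate cleanly, so k-means recovers the two groups.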
In conclusion, without any manual input, the invention constructs a multi-class cross entropy objective function and applies deep learning in the feature extraction module: a deep neural network learns the timbre characterization embedding, mapping timbre information into a vector space in which the embeddings of voices with similar timbres lie close together and the embeddings of voices with different timbres lie as far apart as possible. Spectral clustering, which is better suited to high-dimensional feature clustering, is then applied to the embedding feature vectors to complete the data-driven timbre clustering, thereby improving the matching rate for a specific person's voice and the accuracy of speaker-specific voice recognition.
While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that these are by way of example only, and that the scope of the invention is defined by the appended claims. Various changes and modifications to these embodiments may be made by those skilled in the art without departing from the spirit and scope of the invention, and these changes and modifications are within the scope of the invention.

Claims (6)

1. A DNN-based text-independent timbre clustering method, characterized by comprising the following steps:
step S10: acquiring a large amount of speaker voice data as a training data set;
step S20: constructing an acoustic recognition model, the overall structure of which is a feedforward DNN that computes embedding features from variable-length voice data, the computation being divided into acoustic feature extraction and DNN acoustic model construction;
step S30: training the acoustic model, with a multi-class cross entropy objective function constructed as follows:
E = -Σ_{n=1}^{N} Σ_{k=1}^{K} d_nk ln P(spkr_k | x(n)_1, ..., x(n)_T)
training the network with the multi-class cross entropy objective function to classify the training speakers, wherein the training optimization assumes that each speaker's timbre forms one class, so that the K speakers in the N training segments give K timbre classes; P(spkr_k | x(n)_1, ..., x(n)_T) is defined as the probability that the timbre class is k given the T input frames x(n)_1, x(n)_2, ..., x(n)_T of segment n, and d_nk is 1 if the speaker label of the speech segment n is k and 0 otherwise;
step S40: extracting timbre embedding feature vectors, wherein the network is trained to produce embedding vectors that carry as much speaker timbre information as possible and generalize to speakers (timbres) not present in the training data, the embedding features being used to capture the timbre of the whole utterance;
step S50: applying spectral clustering to the embedding feature vectors to cluster timbres, wherein, based on a graph clustering algorithm, each element S_ij of a similarity matrix S is regarded as the weight of the edge between nodes i and j in an undirected graph, and spectral clustering partitions the original graph into subgraphs by cutting weak edges with small weights.
2. The DNN-based text-independent timbre clustering method according to claim 1, wherein the acoustic feature extraction in step S20 comprises the following steps:
s201: receiving an unknown voice signal, and performing pre-emphasis, framing and windowing processing on the unknown voice signal;
s202: performing audio framing with a frame length of 25 ms and a frame shift of 10 ms;
s203: carrying out FFT (fast Fourier transform) on the preprocessed unknown voice signal;
s204: the unknown voice signal after FFT is passed through a Mel filter bank to obtain Mel frequency spectrum;
s205: carrying out logarithmic energy processing on the Mel frequency spectrum to obtain a logarithmic frequency spectrum;
s206: performing a DCT transform on the logarithmic spectrum, taking the first 13 cepstral coefficients together with their first-order and second-order differences, and combining them with the frame energy to form 40-dimensional features that serve as the feature parameters of the DNN model.
3. The DNN-based text-independent timbre clustering method according to claim 1, wherein the DNN acoustic model constructed in step S20 has an overall architecture comprising several frame-level TDNN layers, a statistics pooling layer that aggregates at the segment level, two sentence-level fully connected layers, and a final softmax output layer; the first five layers of the network operate at the frame level and use a time-delay neural network, as follows:
s211: a designated person reads a section of Chinese text within time t, and takes the section of voice as a voice signal to preprocess the voice signal;
s212: assuming that t is the current time step, firstly splicing the frames at (t-2, t-1, t, t +1, t +2) together as network input;
s213: the next two layers respectively splice the outputs of the previous layer at (t-2, t, t+2) and (t-3, t, t+3); the fourth and fifth layers also operate at the frame level but add no further temporal context.
4. The DNN-based text-independent timbre clustering method according to claim 3, wherein the statistics pooling layer receives the final frame-level output as input, aggregates it over the input segment by computing the mean and standard deviation of the TDNN network output, concatenates these segment-level statistics, and passes them to the two sentence-level fully connected layers.
5. The DNN-based text-independent timbre clustering method according to claim 1, wherein in step S30 the acoustic model's data set is cleaned by removing any audio shorter than 4 seconds, and the training data consists of hundreds of hours of speech from on the order of a thousand speakers.
6. The DNN-based text-independent timbre clustering method according to claim 1, wherein the spectral clustering in step S50 comprises the following steps:
s501: constructing a similarity matrix S from the embedding feature vectors, where S_ij is the cosine similarity between the embedding vectors of the i-th and j-th voice segments, and the diagonal elements (i = j) are set to 0;
s502: the laplacian matrix L is calculated and normalization is performed:
L = D - S
L_norm = D^(-1) L
wherein D is a diagonal matrix with
D_ii = Σ_j S_ij
s503: calculating the eigenvalues and eigenvectors of L_norm;
s504: calculating the number of clusters k: in the similarity graph represented by L_norm, the number of connected components equals the algebraic multiplicity of the eigenvalue 0, so a threshold β is set and the number of eigenvalues below β is counted as k;
s505: taking the k smallest eigenvalues λ1, λ2, ..., λk of L_norm and their corresponding eigenvectors p1, p2, ..., pk, and constructing the matrix P ∈ R^(n×k) with p1, p2, ..., pk as columns;
s506: clustering the row vectors y1, y2, ..., yn of P with the k-means algorithm.
CN202210634114.6A 2022-06-06 2022-06-06 DNN-based text-independent timbre clustering method Withdrawn CN115083433A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210634114.6A CN115083433A (en) 2022-06-06 2022-06-06 DNN-based text-independent timbre clustering method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210634114.6A CN115083433A (en) 2022-06-06 2022-06-06 DNN-based text-independent timbre clustering method

Publications (1)

Publication Number Publication Date
CN115083433A (en) 2022-09-20

Family

ID=83251980

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210634114.6A DNN-based text-independent timbre clustering method 2022-06-06 2022-06-06

Country Status (1)

Country Link
CN (1) CN115083433A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117727333A (en) * 2024-02-18 2024-03-19 百鸟数据科技(北京)有限责任公司 Biological diversity monitoring method and system based on acoustic recognition
CN117727333B (en) * 2024-02-18 2024-04-23 百鸟数据科技(北京)有限责任公司 Biological diversity monitoring method and system based on acoustic recognition


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20220920