CN112712099B - Speaker model compression system and method based on double-layer knowledge distillation - Google Patents

Speaker model compression system and method based on double-layer knowledge distillation

Info

Publication number
CN112712099B
Authority
CN
China
Prior art keywords
speaker
layer
teacher
network
model
Prior art date
Legal status
Active
Application number
CN202011079752.3A
Other languages
Chinese (zh)
Other versions
CN112712099A (en)
Inventor
Li Ruyun
Song Dandan
OuYang Peng
Current Assignee
Jiangsu Qingwei Intelligent Technology Co ltd
Original Assignee
Jiangsu Qingwei Intelligent Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Jiangsu Qingwei Intelligent Technology Co ltd filed Critical Jiangsu Qingwei Intelligent Technology Co ltd
Priority to CN202011079752.3A
Publication of CN112712099A
Application granted
Publication of CN112712099B


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415: Classification techniques relating to the classification model, based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a speaker model compression system and method based on double-layer knowledge distillation, belonging to the technical field of model compression by knowledge distillation. Embedding-layer knowledge distillation guides the student network to mimic the segment-level speaker representation (speaker characterization) of the teacher network, capturing the underlying distribution of each speaker's features. Logit-layer knowledge distillation guides the student network to mimic the speaker posterior probability distribution of the teacher network, exploiting the similarities between speaker classes. The method thus migrates a hierarchy of speaker characterization distributions from the teacher network. The invention addresses the problems that, in the prior art, the student network cannot achieve smaller intra-speaker differences and larger inter-speaker differences, and that the accuracy of same-speaker and different-speaker verification systems is low.

Description

Speaker model compression system and method based on double-layer knowledge distillation
Technical Field
The invention belongs to the technical field of model compression based on double-layer knowledge distillation, and in particular relates to a speaker model compression system and method based on double-layer knowledge distillation.
Background
In recent years, computing resources and data resources have become increasingly abundant, and machine learning based on deep neural networks has significantly improved the accuracy of speaker recognition systems. In situations where network connectivity is unavailable or leakage of personal privacy is a concern, it is desirable to use speaker recognition technology locally on embedded devices such as mobile phones, for example speaker recognition systems operating on embedded terminals with higher security requirements. However, existing speaker recognition techniques rely on deep neural networks, whose high computational cost and large memory footprint hinder deployment on embedded devices with limited memory resources. Accordingly, a growing body of research focuses on compressing and accelerating deep networks without significantly degrading model performance.
To compress these networks, knowledge distillation is a common approach in which a large network (the teacher) provides weighted targets to guide the training of a small network (the student). Although knowledge distillation has proven to be a practical model compression method in a variety of tasks (e.g., image classification, speech recognition, and speaker verification), previous work has only studied the impact of single-layer knowledge distillation on speaker characterization performance. As compression ratios grow larger, these methods are insufficient to close the performance gap between large and small models, and obtaining a student network that performs better than the teacher network remains a challenge.
Disclosure of Invention
The invention aims to provide a speaker model compression system and method based on double-layer knowledge distillation, so as to solve the problems that the student network in the prior art cannot achieve smaller intra-speaker differences and larger inter-speaker differences, and that the accuracy of same-speaker and different-speaker verification systems is low.
In order to achieve the above object, the present invention provides the following technical solutions:
a speaker model compression method based on double-layer knowledge distillation comprises the following steps:
s101, training a teacher model, wherein the teacher model can extract the speaker characterization of the teacher network learning, and the teacher model can predict the posterior probability distribution of the speaker of the teacher network learning.
S102, the teacher model comprises a teacher network, and the teacher network comprises a characterization layer and a posterior probability layer.
S103, training a student model by using a teacher model through knowledge distillation. The student model comprises a student network, and the student model can extract speaker characterization learned by the student network.
S104, double-layer knowledge distillation can simultaneously extract knowledge of a characterization layer and a posterior probability layer from a teacher network.
S105, carrying out characterization layer knowledge distillation through speaker characterization learned by a teacher network.
S106, the knowledge distillation of the characterization layer guides the student network to simulate the speaker characterization of the teacher network.
S107, performing posterior probability layer knowledge distillation through speaker posterior probability distribution learned by a teacher network.
S108, the posterior probability layer knowledge distillation guides the student network to simulate the speaker posterior probability distribution of the teacher network through the similarity among speaker categories.
S109, double-layer knowledge distillation can add the differences between the student network and the teacher network at the characterization layer and posterior probability layer outputs to the total classification loss.
S110, double-layer distillation can obtain the distribution of intra-speaker characterizations and the similarity of inter-class characterizations. This hierarchical distribution of speaker characterizations guides the student to achieve smaller intra-speaker differences and larger inter-speaker differences, ultimately improving the accuracy of speaker modeling.
Based on the technical scheme, the invention can also be improved as follows:
further, the token layer knowledge distillation can obtain the overall distribution of the teacher network token for each speaker, so as to directly guide the convergence of the token in the student network speaker.
Further, knowledge is extracted from the output of the posterior probability layer of the teacher network, and posterior distribution which can be predicted by the teacher model is distilled by the posterior probability layer knowledge to guide optimization of the student model. The posterior probability layer knowledge distillation is able to learn the similarity between speaker classes.
Further, knowledge is extracted from the output of the teacher's network posterior probability layer.
Further, the output of the teacher network posterior probability layer is taken as a standard, and is incorporated into the calculation of the student network loss function to guide the update of the student model parameters.
Further, posterior probability layer knowledge distillation guides optimization of student models through posterior probability distribution predicted by teacher models.
Further, the student model uses the AM-Softmax classification loss (AM-loss), introducing a parameter m to control the angular margin; the student model thus generates an angular classification margin between the characterizations of different speaker categories, which makes the requirement for correct classification stricter.
Further, the total classification loss is the sum of the cosine distance loss of the characterization layer knowledge distillation, the KL divergence loss of the posterior probability layer knowledge distillation, and the softmax loss for speaker classification.
A speaker model compression system based on double-layer knowledge distillation, comprising:
training a teacher model, the teacher model can extract the speaker characterization of the teacher network learning, and the teacher model can predict the speaker posterior probability distribution of the teacher network learning.
The teacher model comprises a teacher network, and the teacher network comprises a characterization layer and a posterior probability layer.
And training the student model by using the teacher model through knowledge distillation. The student model comprises a student network, and the student model can extract speaker characterization learned by the student network.
The double-layer knowledge distillation can extract knowledge of the characterization layer and the posterior probability layer from the teacher network simultaneously.
And carrying out characterization layer knowledge distillation through speaker characterization learned by a teacher network.
The characterization layer knowledge distillation directs the student network to mimic the speaker characterization of the teacher network.
And performing posterior probability layer knowledge distillation through speaker posterior probability distribution learned by a teacher network.
Posterior probability layer knowledge distillation directs student networks to mimic the speaker posterior probability distribution of a teacher network through similarities between speaker classes.
The double-layer knowledge distillation can add the differences between the student network and the teacher network at the characterization layer and posterior probability layer outputs to the total classification loss.
Double layer distillation can yield a distribution of intra-speaker characterization and similarity of inter-class characterization. The students are guided to realize smaller intra-speaker differences and larger inter-speaker differences through the hierarchical distribution of the speaker characterization, so that the modeling accuracy of the speakers is finally improved.
The invention has the following advantages:
The invention discloses a speaker model compression system based on double-layer knowledge distillation. Embedding-layer knowledge distillation guides the student network to mimic the segment-level speaker representation (speaker characterization) of the teacher network, capturing the underlying distribution of each speaker's features. Logit-layer knowledge distillation guides the student network to mimic the speaker posterior probability distribution of the teacher network, exploiting the similarities between speaker classes. The method migrates a hierarchy of speaker characterization distributions from the teacher network. Double-layer knowledge distillation can help the student network achieve smaller intra-speaker differences and larger inter-speaker differences, and further improve the accuracy of same-speaker and different-speaker verification systems.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of the speaker model compression method based on double-layer knowledge distillation according to the present invention.
FIG. 2 is a flow chart of a double-layer knowledge distillation method of the present invention.
FIG. 3 is a schematic diagram of the double-layer knowledge distillation principle of the present invention.
FIG. 4 is a schematic diagram of the double-layer knowledge distillation principle of the present invention.
FIG. 5 is a graphical representation of comparative data for a double layer knowledge distillation and an original single layer knowledge distillation of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in FIGS. 1-5, an embodiment of the present invention provides a speaker model compression system based on double-layer knowledge distillation, comprising:
knowledge of the basic mathematical model of distillation. Knowledge distillation aims at transferring knowledge from a large teacher network T to a small student network S. The student network is trained to mimic the behavior of the teacher network. Where HT and HS represent the behavioral functions of the teacher network and the student network, respectively. This behavior function converts network inputs into an information representation, specifically the output of any layer in the network. For example, hlS represents the output of layer l in a student's network. Layer l of the student network is matched to layer l 'in the teacher network by a mapping function f (l), which means that layer l of the student network can learn information from layer l' of the teacher network. Finally, through the difference between the minimum chemical output and the teacher output, students can well simulate the behavior of a teacher network:
$\mathcal{L}_{KD} = \sum_{l=1}^{L} \lambda_l \sum_{i=1}^{N} \mathcal{L}_l\big(H_S^{l}(x_i),\, H_T^{f(l)}(x_i)\big)$
where $x_i$ denotes the $i$-th training sample; $\mathcal{L}_l$ is a loss function that constrains the difference between the output of the student's layer $l$ and the output of the teacher's layer $f(l)$ (e.g., an embedding layer or a logit layer); $\lambda_l$ is a hyper-parameter expressing the importance of distilling layer $l$; $N$ is the number of training samples; and $L$ is the total number of student layers.
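As an illustration of the layer-matching objective above, the following Python (PyTorch) sketch computes a weighted sum of per-layer distillation losses over matched layers. The dictionary-based interface, the layer names, the per-layer loss choices, and the weights are assumptions made for this sketch only, not the patent's implementation.

    import torch
    import torch.nn.functional as F

    def layer_kd_loss(student_outputs, teacher_outputs, layer_map, layer_losses, lambdas):
        """Weighted sum of per-layer distillation losses over matched layers.

        student_outputs[l] holds H_S^l(x) for a mini-batch and teacher_outputs[l2]
        holds H_T^{l2}(x); layer_map[l] = f(l) names the teacher layer matched to
        student layer l; lambdas[l] weights the importance of distilling layer l.
        """
        total = torch.zeros(())
        for l, teacher_l in layer_map.items():
            per_layer = layer_losses[l](student_outputs[l],
                                        teacher_outputs[teacher_l].detach())
            total = total + lambdas[l] * per_layer
        return total

    # Hypothetical usage: distil only the embedding layer with an MSE penalty.
    student_out = {"embedding": torch.randn(8, 512)}
    teacher_out = {"embedding": torch.randn(8, 512)}
    loss = layer_kd_loss(student_out, teacher_out,
                         layer_map={"embedding": "embedding"},
                         layer_losses={"embedding": F.mse_loss},
                         lambdas={"embedding": 1.0})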
Matching appropriate layers between the student network and the teacher network for knowledge distillation is not easy. In most cases we have to cope with their differences in width and depth.
S101, training a teacher model.
In this step, a teacher model 10 is trained, the teacher model 10 can extract the speaker characterization of the teacher network learning, and the teacher model 10 can predict the speaker posterior probability distribution of the teacher network learning.
The model compression method based on double-layer knowledge distillation takes a large speaker model as the teacher model 10 and distills it into a very small student model 20 while preserving the performance of the teacher model 10.
S102, the teacher model comprises a teacher network.
In this step, the teacher model 10 includes a teacher network including a characterization layer and a posterior probability layer. Based on the x-vector structure, a characterization layer and a posterior probability layer are selected from a teacher network to perform knowledge distillation.
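To make the two tap points concrete, the sketch below shows a minimal x-vector-style network in Python (PyTorch) that returns both the segment-level embedding (the characterization layer output) and the speaker logits whose softmax forms the posterior probability layer. The layer sizes, kernel widths, and number of frame-level layers are illustrative assumptions rather than the configuration used in the patent.

    import torch
    import torch.nn as nn

    class XVectorLike(nn.Module):
        """x-vector-style speaker network: frame-level TDNN layers, statistics
        pooling, a segment-level embedding, and a speaker classification head."""

        def __init__(self, feat_dim=30, channels=512, embed_dim=512, num_speakers=1000):
            super().__init__()
            self.frame_layers = nn.Sequential(
                nn.Conv1d(feat_dim, channels, kernel_size=5, dilation=1), nn.ReLU(),
                nn.Conv1d(channels, channels, kernel_size=3, dilation=2), nn.ReLU(),
                nn.Conv1d(channels, channels, kernel_size=3, dilation=3), nn.ReLU(),
            )
            self.segment = nn.Linear(2 * channels, embed_dim)      # after mean+std pooling
            self.classifier = nn.Linear(embed_dim, num_speakers)   # feeds the posterior layer

        def forward(self, feats):                    # feats: (batch, feat_dim, frames)
            h = self.frame_layers(feats)
            stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)   # statistics pooling
            embedding = self.segment(stats)          # characterization layer output
            logits = self.classifier(embedding)      # softmax over logits gives posteriors
            return embedding, logits

A student network could reuse the same structure with fewer channels, e.g. XVectorLike(channels=128), keeping the embedding dimension matched to the teacher so that the two characterizations can be compared directly.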
S103, training a student model by using a teacher model through knowledge distillation.
In this step, the student model 20 is trained by the teacher model 10 through knowledge distillation, the student model 20 includes a student network, and the student model 20 is capable of extracting speaker characterization of the student network learning.
S104, double-layer knowledge distillation extracts knowledge of a characterization layer and a posterior probability layer from a teacher network simultaneously.
In this step, the double-layer knowledge distillation can extract knowledge of the characterization layer and the posterior probability layer from the teacher network at the same time.
S105, carrying out characterization layer knowledge distillation through speaker characterization learned by a teacher network.
In the step, the knowledge distillation of the characterization layer is carried out through the speaker characterization learned by the teacher network.
S106, the knowledge distillation of the characterization layer guides the student network to simulate the speaker characterization of the teacher network.
In this step, the characterization layer knowledge distillation directs the student network to mimic the speaker characterization of the teacher network.
S107, performing posterior probability layer knowledge distillation through speaker posterior probability distribution learned by a teacher network.
In the step, posterior probability layer knowledge distillation is carried out through speaker posterior probability distribution learned by a teacher network.
S108, the posterior probability layer knowledge distillation guides the student network to simulate the speaker posterior probability distribution of the teacher network through the similarity among speaker categories.
In this step, posterior probability layer knowledge distillation directs the student network to mimic the speaker posterior probability distribution of the teacher network through similarities between speaker classes.
S109, double-layer knowledge distillation can add the differences between the student network and the teacher network at the characterization layer and posterior probability layer outputs to the total classification loss.
In this step, the double-layer knowledge distillation can add the differences between the student network and the teacher network at the characterization layer and posterior probability layer outputs to the total classification loss.
S110, double-layer distillation can yield a distribution of intra-speaker characterization and similarity of inter-class characterization.
In this step, double-layer distillation can yield a distribution of intra-speaker characterization and similarity of inter-class characterization. The students are guided to realize smaller intra-speaker differences and larger inter-speaker differences through the hierarchical distribution of the speaker characterization, so that the modeling accuracy of the speakers is finally improved.
Assuming that the student and teacher networks produce speaker representations of the same dimension, the characterization (embedding) layer knowledge distillation constrains the similarity between the speaker representations learned by the teacher model 10 and the student model 20 via cosine similarity:
$L_{COS} = \sum_{i=1}^{N} \big(1 - \cos(H_S^{embd}(x_i),\, H_T^{embd}(x_i))\big)$
where $H_T^{embd}(x_i)$ denotes the embedding extracted by the teacher network for the $i$-th sample, and $H_S^{embd}(x_i)$ denotes the embedding computed by the student network. The remaining symbols are defined as in the knowledge-distillation formula above.
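A minimal Python (PyTorch) sketch of this characterization-layer term, written as the batch-averaged cosine distance between student and teacher embeddings; averaging over the mini-batch and detaching the teacher are implementation choices assumed here, not details stated in the patent.

    import torch.nn.functional as F

    def embedding_distill_loss(student_embd, teacher_embd):
        """L_COS: cosine distance 1 - cos(H_S^embd(x_i), H_T^embd(x_i)), averaged
        over the mini-batch. The teacher embedding is detached so that gradients
        only update the student."""
        cos = F.cosine_similarity(student_embd, teacher_embd.detach(), dim=1)
        return (1.0 - cos).mean()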
The comparison between the double-layer knowledge distillation of the present invention and the original single-layer knowledge distillation (Wang, Shuai, Yexin Yang, Tianzhe Wang, Yanmin Qian, and Kai Yu. "Knowledge Distillation for Small Foot-print Deep Speaker Embedding." In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6021-6025. IEEE, 2019.) is shown in FIG. 5; the test set is the XiaoAi (小爱) dataset.
The characterization layer knowledge distillation can obtain the teacher network's overall distribution of each speaker's characterization, thereby directly guiding the convergence of intra-speaker characterizations in the student network.
For speaker i, constrained by cosine similarity, the speaker characterization $S_{spk_i}$ extracted by the student model 20 converges toward the speaker characterization $T_{spk_i}$ extracted by the teacher model 10, allowing the student model 20 to achieve smaller intra-class differences.
Knowledge is extracted from the output of the teacher's network posterior probability layer, which distills the posterior distribution that can be predicted by the teacher model 10 to guide the optimization of the student model 20. The posterior probability layer knowledge distillation is able to learn the similarity between speaker classes.
By minimizing the KL divergence between the teacher network and the student network posterior probabilities:
$L_{KLD} = \sum_{i=1}^{N} \sum_{c=1}^{C} \tilde{y}_i^{c} \log\big(\tilde{y}_i^{c} / y_i^{c}\big)$
where $C$ is the number of speakers in the training set, $\tilde{y}_i$ is the posterior of the $i$-th sample predicted by the teacher network, and $y_i$ is the posterior of the $i$-th sample predicted by the student network. The remaining symbols are defined as in the cosine-distance formula above.
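A matching sketch of the posterior-probability-layer term: the KL divergence from the teacher's predicted posterior to the student's, computed from the speaker logits of both networks. Treating the teacher posterior as a fixed (detached) target and using "batchmean" averaging are assumptions of this sketch.

    import torch.nn.functional as F

    def posterior_distill_loss(student_logits, teacher_logits):
        """L_KLD: KL(teacher posterior || student posterior) over the C speaker
        classes, averaged over the mini-batch."""
        teacher_post = F.softmax(teacher_logits.detach(), dim=1)    # teacher posterior
        student_log_post = F.log_softmax(student_logits, dim=1)     # log of student posterior
        return F.kl_div(student_log_post, teacher_post, reduction="batchmean")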
The posterior probability is valuable information that encodes correlations between different classes, so the similarity between speaker classes can be learned through posterior probability layer knowledge distillation.
As shown in FIGS. 2-3, posterior probability layer knowledge distillation increases the inter-class differences of the student network; speakers with high similarity are grouped into a subclass.
Knowledge is extracted from the output of the teacher's network posterior probability layer.
The output of the teacher network posterior probability layer is taken as a standard, and is incorporated into the calculation of the student network loss function to guide the update of the parameters of the student model 20.
Posterior probability layer knowledge distillation directs optimization of student model 20 through posterior probability distributions predicted by teacher model 10.
The student model 20 uses the AM-Softmax classification loss (AM-loss), introducing a parameter m to control the angular margin; the student model 20 thus generates an angular classification margin between the characterizations of different speaker categories, which makes the requirement for correct classification stricter.
The total classification loss is the sum of the cosine distance loss of the characterization layer knowledge distillation, the KL divergence loss of the posterior probability layer knowledge distillation, and the softmax loss for speaker classification, where α and β are hyper-parameters used to balance these losses; their values are tuned experimentally.
$L_{total} = L_{A\text{-}softmax} + \alpha L_{KLD} + \beta L_{COS}$
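The sketch below combines an additive-margin softmax classification loss (one common reading of the AM-loss with margin parameter m described above) with the two distillation terms sketched earlier, weighted by α and β. The scale s, the margin m, and the reuse of embedding_distill_loss and posterior_distill_loss from the earlier sketches are assumptions of this illustration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AMSoftmaxLoss(nn.Module):
        """Additive-margin softmax: a margin m is subtracted from the target-class
        cosine score before scaled cross-entropy, which tightens the requirement
        for a sample to count as correctly classified."""

        def __init__(self, embed_dim, num_speakers, s=30.0, m=0.2):
            super().__init__()
            self.weight = nn.Parameter(torch.randn(num_speakers, embed_dim))
            self.s, self.m = s, m

        def forward(self, embedding, labels):
            cosine = F.linear(F.normalize(embedding), F.normalize(self.weight))
            margin = torch.zeros_like(cosine).scatter_(1, labels.unsqueeze(1), self.m)
            return F.cross_entropy(self.s * (cosine - margin), labels)

    def total_distillation_loss(am_loss, s_embd, t_embd, s_logits, t_logits,
                                labels, alpha=1.0, beta=1.0):
        """L_total = classification loss + alpha * L_KLD + beta * L_COS, reusing
        the embedding and posterior distillation loss sketches given above."""
        l_cls = am_loss(s_embd, labels)
        l_kld = posterior_distill_loss(s_logits, t_logits)
        l_cos = embedding_distill_loss(s_embd, t_embd)
        return l_cls + alpha * l_kld + beta * l_cos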
A speaker model compression system based on double-layer knowledge distillation, comprising:
training a teacher model 10, the teacher model 10 can extract the speaker characterization of the teacher's network learning, and the teacher model 10 can predict the speaker posterior probability distribution of the teacher's network learning.
The teacher model 10 includes a teacher network that includes a token layer and a posterior probability layer.
The student model 20 is trained with the teacher model 10 by knowledge distillation. The student model 20 includes a student network, and the student model 20 is capable of extracting speaker representations of student network learning.
The double-layer knowledge distillation can extract knowledge of the characterization layer and the posterior probability layer from the teacher network simultaneously.
And carrying out characterization layer knowledge distillation through speaker characterization learned by a teacher network.
The characterization layer knowledge distillation directs the student network to mimic the speaker characterization of the teacher network.
And performing posterior probability layer knowledge distillation through speaker posterior probability distribution learned by a teacher network.
Posterior probability layer knowledge distillation directs student networks to mimic the speaker posterior probability distribution of a teacher network through similarities between speaker classes.
The double-layer knowledge distillation can add the differences between the student network and the teacher network at the characterization layer and posterior probability layer outputs to the total classification loss.
Double layer distillation can yield a distribution of intra-speaker characterization and similarity of inter-class characterization. The students are guided to realize smaller intra-speaker differences and larger inter-speaker differences through the hierarchical distribution of the speaker characterization, so that the modeling accuracy of the speakers is finally improved.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting. Although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some of the technical features thereof can be replaced by equivalents. Such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A speaker model compression method based on double-layer knowledge distillation, comprising:
s101, training a teacher model, wherein the teacher model can extract the speaker characterization of the teacher network learning, and the teacher model can predict the posterior probability distribution of the speaker of the teacher network learning;
s102, the teacher model comprises a teacher network, and the teacher network comprises a characterization layer and a posterior probability layer;
s103, training a student model by using the teacher model through knowledge distillation; the student model comprises a student network, and the student model can extract speaker characterization learned by the student network;
s104, the double-layer knowledge distillation can simultaneously extract knowledge of a characterization layer and a posterior probability layer from the teacher network;
s105, performing characterization layer knowledge distillation through speaker characterization learned by the teacher network;
s106, the characterization layer knowledge distillation guides the student network to simulate the speaker characterization of the teacher network;
s107, performing posterior probability layer knowledge distillation through the speaker posterior probability distribution learned by the teacher network;
s108, the posterior probability layer knowledge distillation guides the student network to simulate the speaker posterior probability distribution of the teacher network through the similarity among speaker categories;
s109, the double-layer knowledge distillation is capable of adding differences in the output of the characterization layer and the posterior probability layer between the student network and the teacher network to the total classification loss;
s110, the double-layer distillation can obtain the distribution of the characterization in the speaker and the similarity of the characterization among classes; the students are guided to realize smaller intra-speaker differences and larger inter-speaker differences through the hierarchical distribution of the speaker characterization, so that the modeling accuracy of the speakers is finally improved.
2. The method of claim 1, wherein the characterization layer knowledge distillation is capable of obtaining the teacher network's overall distribution of each speaker's characterization, thereby directly guiding the convergence of intra-speaker characterizations in the student network.
3. The method of claim 2, wherein knowledge is extracted from the output of the teacher network posterior probability layer, which distills posterior distributions that can be predicted by the teacher model to guide the optimization of the student model; the posterior probability layer knowledge distillation is able to learn the similarity between speaker classes.
4. The speaker model compression method based on double-layer knowledge distillation of claim 3, wherein knowledge is extracted from the output of the posterior probability layer of said teacher network.
5. The speaker model compression method based on double-layer knowledge distillation of claim 4, wherein the output of the teacher network posterior probability layer is taken as the standard and incorporated into the calculation of the student network loss function, so as to guide the update of the student model parameters.
6. The speaker model compression method based on double-layer knowledge distillation of claim 5, wherein said posterior probability layer knowledge distillation guides the optimization of said student model through the posterior probability distribution predicted by the teacher model.
7. The speaker model compression method based on double-layer knowledge distillation of claim 6, wherein the student model uses the AM-Softmax classification loss (AM-loss), introducing a parameter m to control the angular margin; the student model generates an angular classification margin between the characterizations of different speaker categories, which makes the requirement for correct classification stricter.
8. The speaker model compression method based on double-layer knowledge distillation of claim 7, wherein the total classification loss is the sum of the cosine distance loss of the characterization layer knowledge distillation, the KL divergence loss of the posterior probability layer knowledge distillation, and the softmax loss for speaker classification.
9. A speaker model compression system based on double-layer knowledge distillation, comprising:
training a teacher model, wherein the teacher model can extract the speaker characterization of the teacher network learning, and the teacher model can predict the posterior probability distribution of the speaker of the teacher network learning;
the teacher model comprises a teacher network, and the teacher network comprises a characterization layer and a posterior probability layer;
training a student model by using the teacher model through knowledge distillation; the student model comprises a student network, and the student model can extract speaker characterization learned by the student network;
the double-layer knowledge distillation can simultaneously extract knowledge of a characterization layer and a posterior probability layer from the teacher network;
carrying out characterization layer knowledge distillation through speaker characterization learned by the teacher network;
the knowledge distillation of the characterization layer guides the student network to simulate the speaker characterization of the teacher network;
performing posterior probability layer knowledge distillation through the posterior probability distribution of the speaker learned by the teacher network;
the posterior probability layer knowledge distillation guides a student network to simulate speaker posterior probability distribution of a teacher network through similarity among speaker categories;
the double-layer knowledge distillation is capable of adding differences in the output of the characterization layer and the posterior probability layer between the student network and the teacher network to the total classification loss;
the double-layer distillation can obtain the distribution of the characterization in the speaker and the similarity of the characterization among classes; the students are guided to realize smaller intra-speaker differences and larger inter-speaker differences through the hierarchical distribution of the speaker characterization, so that the modeling accuracy of the speakers is finally improved.
CN202011079752.3A 2020-10-10 2020-10-10 Speaker model compression system and method based on double-layer knowledge distillation Active CN112712099B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011079752.3A CN112712099B (en) 2020-10-10 2020-10-10 Speaker model compression system and method based on double-layer knowledge distillation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011079752.3A CN112712099B (en) 2020-10-10 2020-10-10 Speaker model compression system and method based on double-layer knowledge distillation

Publications (2)

Publication Number Publication Date
CN112712099A CN112712099A (en) 2021-04-27
CN112712099B (en) 2024-04-12

Family

ID=75541647

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011079752.3A Active CN112712099B (en) Speaker model compression system and method based on double-layer knowledge distillation

Country Status (1)

Country Link
CN (1) CN112712099B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361396B (en) * 2021-06-04 2023-12-26 思必驰科技股份有限公司 Multi-mode knowledge distillation method and system
CN113849641B (en) * 2021-09-26 2023-10-24 中山大学 Knowledge distillation method and system for cross-domain hierarchical relationship


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3598343A1 (en) * 2018-07-17 2020-01-22 Nokia Technologies Oy Method and apparatus for processing audio data
CN109637546A (en) * 2018-12-29 2019-04-16 苏州思必驰信息科技有限公司 Knowledge distillating method and device
GB201908574D0 (en) * 2019-06-14 2019-07-31 Vision Semantics Ltd Optimised machine learning
CN111599373A (en) * 2020-04-07 2020-08-28 云知声智能科技股份有限公司 Compression method of noise reduction model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Victoria Mingote et al.; "Knowledge Distillation and Random Erasing Data Augmentation for Text-Dependent Speaker Verification"; ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); entire document *
Shuai Wang et al.; "Knowledge Distillation for Small Foot-print Deep Speaker Embedding"; ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); entire document *

Also Published As

Publication number Publication date
CN112712099A (en) 2021-04-27

Similar Documents

Publication Publication Date Title
Settle et al. Discriminative acoustic word embeddings: Recurrent neural network-based approaches
Zhang et al. Top-down tree long short-term memory networks
Huang et al. Speech emotion recognition from variable-length inputs with triplet loss function.
CN112071329A (en) Multi-person voice separation method and device, electronic equipment and storage medium
Markov et al. Robust speech recognition using generalized distillation framework.
CN108255805A (en) The analysis of public opinion method and device, storage medium, electronic equipment
CN113204952B (en) Multi-intention and semantic slot joint identification method based on cluster pre-analysis
CN111353029B (en) Semantic matching-based multi-turn spoken language understanding method
CN110110318B (en) Text steganography detection method and system based on cyclic neural network
Fang et al. Channel adversarial training for cross-channel text-independent speaker recognition
CN112712099B (en) Double-layer knowledge-based speaker model compression system and method by distillation
CN105139864A (en) Voice recognition method and voice recognition device
CN104217226A (en) Dialogue act identification method based on deep neural networks and conditional random fields
CN104200814A (en) Speech emotion recognition method based on semantic cells
Chen et al. Distilled binary neural network for monaural speech separation
CN113178193A (en) Chinese self-defined awakening and Internet of things interaction method based on intelligent voice chip
Zhou et al. ICRC-HIT: A deep learning based comment sequence labeling system for answer selection challenge
CN110633689B (en) Face recognition model based on semi-supervised attention network
CN114360502A (en) Processing method of voice recognition model, voice recognition method and device
CN116341558A (en) Multi-modal emotion recognition method and model based on multi-level graph neural network
CN114444481A (en) Sentiment analysis and generation method of news comments
CN116205227A (en) Keyword generation method and system based on variation inference theory
CN115795010A (en) External knowledge assisted multi-factor hierarchical modeling common-situation dialogue generation method
CN112287690A (en) Sign language translation method based on conditional sentence generation and cross-modal rearrangement
CN112463965A (en) Method and system for semantic understanding of text

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
Inventor after: Li Ruyun; Song Dandan; OuYang Peng
Inventor before: Li Ruyun; Song Dandan; OuYang Peng; Yin Shouyi
GR01 Patent grant