CN111160409A

CN111160409A - Heterogeneous neural network knowledge reorganization method based on common feature learning

Info

Publication number: CN111160409A
Application number: CN201911265852.2A
Authority: CN
Inventors: 宋明黎; 罗思惠; 方共凡
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2019-12-11
Filing date: 2019-12-11
Publication date: 2020-05-15

Abstract

The heterogeneous neural network knowledge reorganization method based on common feature learning comprises the following steps: acquiring a plurality of pre-trained neural network models, which are called teacher models; the characteristics output by the teacher model and the output prediction result are used for guiding the training of the student model through a common characteristic learning and soft target distillation method; in the common characteristic learning process, the characteristics of a plurality of heterogeneous networks are projected to a common characteristic interval, the student models integrate knowledge of a plurality of teacher models, and the soft target distillation method enables the prediction results of the student models to be consistent with the prediction results of the teacher models, so that a stronger student model with the task processing capacity of all the teacher models is obtained. The student model can be trained without any manual marking because only the prediction result of the teacher model needs to be simulated. The method is suitable for knowledge reorganization of the neural network model, in particular to the knowledge reorganization of the heterogeneous image classification task model.

Description

Heterogeneous neural network knowledge reorganization method based on common feature learning

Technical Field

The invention relates to the field of machine learning, in particular to a heterogeneous neural network knowledge reorganization method based on common feature learning

Background

In recent years, Deep Neural Networks (DNNs) have enjoyed dramatic success in a multitude of artificial intelligence tasks such as computer vision and natural language processing. However, despite the extraordinary results, the training of DNN models relies heavily on large-scale manually labeled datasets and its training takes a long time. To ease the reproduction effort, more and more researchers are starting to publish trained models on the internet for users to download and use them instantly. The released models are reused to obtain the customized model with multitasking capability, and manual data marking is not needed, so that the method has great significance. However, due to the rapid development of deep learning and the consequent emergence of a large number of network variables, such publicly available training models often have varying network structures, each oriented to a particular task or data set, which presents challenges to the fused reorganization of these models.

In the present invention, the inventors have addressed a deep model fusion reuse task, with the goal of training lightweight and multi-task-capable student models using a multi-task oriented heterogeneous teacher model. The method can use a plurality of pre-trained teacher models to train a student model which can be competent for all teacher model tasks without manually marking information. The traditional knowledge distillation method only aims at a single teacher model, and the goal is model compression, namely, a prediction result of a trained large network model is simulated and learned by using a small network model, which is specifically described in GeoffreyHinton, Oriol Vinyals, and Jeff dean. Therefore, the present invention resorts to another method, namely, the output characteristics of the teacher model are projected into a shared learnable characteristic space, then the student model is forced to imitate the characteristics of the teacher model after conversion, a powerful student model is obtained by training by imitating the output of the teacher network in both the characteristics and the prediction results, the comprehensive knowledge from the heterogeneous teacher model can be fused without accessing manual labels, and the tasks of all the teacher models can be solved.

Disclosure of Invention

The invention provides a heterogeneous neural network knowledge reorganization method based on common feature learning. Firstly, the task oriented by the method is defined: given several pre-trained teacher networks, the goal of the invention is to learn a student model that fuses the knowledge of all teacher models and is competent for their tasks without annotating data. The teacher models may be the same or different in architecture, and are not particularly limited.

A heterogeneous neural network knowledge reorganization method based on common feature learning comprises the following steps:

step 1, selecting a proper student model structure according to customization requirements, carrying out random initialization, inputting the same unlabeled image data to a teacher model and a student model, and respectively obtaining original output characteristics F of the teacher model and the student model_TiAnd F_sAnd converting and aligning the two by adopting an adaptation layer to obtain f with consistent size_TiAnd f_S；

Step 2, introducing a small learnable subnetwork, wherein parameters of the small learnable subnetwork are shared between teachers and students, namely parameters of shared feature extractors of models of each teacher and each student are the same and are called shared extractors, and the aligned features of the teachers and the students are converted into compatible features in a public feature space through the shared extractors; the shared extractor will f_TiAnd f_SInto a common space

And

step 3, measuring the distribution difference among the transformation characteristics obtained in the step 2 by adopting a Maximum Mean Difference (MMD) method, fusing the characteristics of the teacher model, and adapting to the domains of the transformation characteristics of the teacher model and the student models; utensil for cleaning buttockThe body includes: use of

To represent a set of all the characteristics of a teacher, where Ct is the total number of characteristics of the teacher; similarly, use is made of

To represent all the characteristics of the student, where Cs is the total number of characteristics of the student;

and

the approximate calculation formula of the MMD distance is as follows:

wherein the content of the first and second substances,

is an implicit mapping function; by extending this equation with a kernel function K (·,) the MMD loss is defined as follows:

the kernel function may project the sample vector into a higher dimensional feature space; it is noted that the features after normalization are used here

And

then, the MMD losses between the student model and the N teachers are combined to define the total loss L of the common feature space learning_MComprises the following steps:

step 4, inputting the transferred characteristics into a trainable self-encoder to reconstruct the original output characteristics of the teacher model, and setting F'_TiRepresenting teacher original features F_TiMeasure the difference between the reconstructed features and the original features and define the reconstruction loss L_RIs defined as:

by measuring L_RThe features converted into the public space can be reversely mapped into the original features, so that the loss of information as little as possible in the feature conversion process is ensured, and the learning of the public feature space is more robust;

step 5, enabling the student model to imitate the prediction result of the teacher model on the input unmarked sample, and taking the difference of the prediction results of the student model and the teacher model on the same task as a last loss function, namely the target distillation loss; specifically, on the task of image classification, the score vectors of teacher models with target classes not overlapped are directly overlapped, namely the serial score vectors are used as the learning targets of student models; in addition, the same strategy is used for overlapping teachers: during training, overlapping classes are treated as multiple different classes, but during testing, they are treated as the same class; let w_iRepresents a parameter that maps the output characteristics of the teacher model to the score map, and w_sRepresenting the respective parameters of the students, drives the response scores of the student network to approach the loss function L of the teacher's predicted objective_CComprises the following steps:

L_C＝‖w_s·F_s-[w₁·F₁,…,w_N·F_TN]‖₂(5)

and 6, combining the losses defined in the steps 3, 4 and 5 together through the hyperparametric weight to form an overall loss function of the network, and calculating the value:

L＝L_C+(1-α)(L_M+L_R),α∈[0,1](6)

and 7, calculating the gradient of the network, updating the parameters of the whole network model in the gradient direction of the minimized total loss to obtain the network after the parameters are updated, returning to the step 1, continuously iterating the whole training process until the loss function is converged, and obtaining the student model which is the target model.

Preferably, the structure of the teacher model in step 1 includes, but is not limited to, a residual error network and a VGG network, and the structure of the student model depends on actual needs.

Preferably, the composition of the adaptation layer in step 1 includes, but is not limited to, several layers of 1 × 1 convolution, and the adaptation layer parameters of each teacher model and each student model are different and are obtained through learning; the number of adaptation layer channels can be set to an empirical value 256, or can be set according to actual requirements.

Preferably, the shared feature extractor in step 2 is a small convolutional network composed of three residual modules with 1 × 1 steps; in addition to this, the present invention is,

and

the number of channels is set to 128, which is empirically set and can be adjusted during actual operation as appropriate.

Preferably, in the soft target distillation module of step 5, the target distillation loss L_CDefined as the difference between the response scores of the student network and the predicted scores of the teacher model, may be measured using methods including, but not limited to, calculating Mean Square Error (MSE).

The heterogeneous neural network knowledge reorganization method based on common feature learning comprises the following steps: acquiring a plurality of pre-trained neural network models, which are called teacher models; the characteristics output by the teacher model and the output prediction result are used for guiding the training of the student model through a common characteristic learning and soft target distillation method; in the common characteristic learning process, the characteristics of a plurality of heterogeneous networks are projected to a common characteristic interval, the student models integrate knowledge of a plurality of teacher models, and the soft target distillation method enables the prediction results of the student models to be consistent with the prediction results of the teacher models, so that a lightweight student model with the task processing capacity of all the teacher models and stronger task processing capacity is obtained. The student model can be trained without any manual marking because only the prediction result of the teacher model needs to be simulated. The method is suitable for knowledge reorganization of the neural network model, in particular to the knowledge reorganization of the heterogeneous image classification task model.

The invention has the advantages that: by reusing the published model, the customized model with the multitask processing capability can be trained without manual marking, resources are fully utilized, and a large amount of labor cost is saved.

Drawings

FIG. 1 is a general block diagram of the process of the present invention.

FIG. 2 is a schematic diagram of a specific structure of a common feature learning module in the method of the present invention.

Detailed Description

The experimental method of the present invention will be described in detail below with reference to the accompanying drawings and examples, so that how to apply the technical means to solve the technical problems and achieve the technical effects can be fully understood and implemented. It should be noted that, as long as there is no conflict, the embodiments of the present invention and the features in the embodiments may be combined with each other or in a disordered order, and the formed technical solutions are within the protection scope of the present invention.

The specific framework of the heterogeneous neural network knowledge reorganization method based on common feature learning provided by the invention is shown in figure 1, N teacher networks are assumed, and each teacher network uses T_iThe method comprises the following steps:

and step 1, aligning the output characteristics of the teacher model and the student model under the same input.

And selecting a proper student model structure according to the customization requirements and carrying out random initialization. Inputting the same unlabelled image data to the teacher model and the student model, respectivelyObtaining the original output characteristics F of the two_TiAnd F_s. Since the teacher model and the student model are different in structure, F is_TiAnd F_sMay also be inconsistent, step 1 uses adaptation layer to perform conversion to obtain f with consistent size_TiAnd f_S. The composition of the adaptation layer includes but is not limited to several 1 × 1 convolutions, and the adaptation layer parameters of each teacher model and each student model are different and are obtained through learning. In the implementation of the present invention, the number of adaptation layer channels is set to 256.

And 2, converting the public characteristics.

A small learnable subnetwork is introduced whose parameters are shared between the teacher and the students (i.e., the parameters of the shared feature extractor of each teacher and student model are identical), and hence the shared extractor, by which the aligned teacher and student features are transformed into consistent features in a common feature space. The subnet sharing the feature extractor is a small convolutional network consisting of three residual modules of 1 × 1 steps. It will f_TiAnd f_SInto a common space

And

in the course of the particular practice of the present invention,

and

And 3, calculating the characteristic learning loss.

And (3) measuring the distribution difference among the transformation characteristics obtained in the step (2) by adopting a Maximum Mean Difference (MMD) method, fusing the characteristics of the teacher model, and adapting the domains of the transformation characteristics of the teacher model and the student models. The MMD method can be regarded as a distance measure of probability distribution, which is commonly used as a measure of domain matching in a domain adaptation task, and aligns domains of the student model and the teacher model by estimating a posterior, which is described in detail in "Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Scholkopf, etc. a kernel two-sample test, journal of Machine learning research,13(Mar): 723-773, 2012", and in this step, the MMD method is used to measure a distribution difference between the transformation characteristics of the student model and the teacher model, and to use it as a loss function, and by minimizing this loss, a similarity between the transformation characteristics of the student model and the transformation characteristics of the teacher model is improved, thereby achieving a role of migrating the knowledge of the teacher model to the student model.

In the implementation, the feature similarity measurement between the student and the teacher model is used as an example

To represent the set of all the characteristics of the teacher, where Ct is the total number of characteristics of the teacher. Similarly, use is made of

To represent all the characteristics of the student, where Cs is the total number of characteristics of the student.

And

the approximate calculation formula of the MMD distance is as follows:

wherein the content of the first and second substances,

is an implicit mapping function. By extending this equation with a kernel function K (·,) the MMD loss is defined as follows:

the kernel function may project the sample vector into a higher dimensional feature space. It is noted that the features after normalization are used here

And

and 4, calculating the characteristic reconstruction loss.

The transferred features are input into a trainable self-encoder to reconstruct the original output features of the teacher model. F'_TiRepresenting original characteristics F of teacher model_TiMeasure the difference between the reconstructed features and the original features and define the reconstruction loss L_RIs defined as:

by measuring L_RThe features converted into the public space can be reversely mapped into the original features, so that the loss of information as little as possible in the feature conversion process is ensured, and the learning of the public feature space is more robust.

The steps 1 to 4 together form a common feature space learning module, and the specific implementation process of the common feature space learning module is shown in fig. 2. In general, the module transforms the characteristics of the teacher model and the characteristics of the student model to be trained into a common characteristic space through an adaptation layer and a sharing extractor, the parameters of which are learnable. During feature learning, two loss terms are applied: characteristic global loss L_MAnd reconstruction lossL_R. The former encourages student features to approach the teacher model's transformed features in the public space, while the latter ensures minimal error between the transformed features and the original features.

And 5, enabling the student model to simulate the prediction result of the teacher model on the input unmarked sample, and calculating the target distillation loss.

And the teacher model is used for guiding the training of the student model according to the input prediction result of the unlabelled sample, so that the student model can output the prediction result which is the same as or similar to that of the teacher model. Specifically, on the image classification task, the teacher models with the target classes not overlapped are directly overlapped with their score vectors, namely the serial score vectors are used as the learning targets of the student models. In fact, the same strategy is used for overlapping teachers: in training, overlapping classes are treated as multiple different classes, but during testing they are treated as the same class. Let w_iRepresents a parameter that maps the output characteristics of the teacher model to the score map, and w_sRepresenting the respective parameters of the students, drives the response scores of the student network to approach the loss function L of the teacher's predicted objective_CComprises the following steps:

L_C＝‖w_s·F_s-[w₁·F₁,…,w_N·F_TN]‖₂(5)

step 6, calculating the total loss

Combining the loss functions shown in the formulas (3), (4) and (5) to obtain an end-to-end training total loss function of the student network, wherein the total loss function is as follows:

L＝L_C+(1-α)(L_M+L_R),α∈[0,1](6)

α is a hyper-parameter, which acts to balance the three loss terms in equation (6). the overall loss function is calculated by forward propagation through the entire neural network model.

And 7, reversely transmitting and updating the network parameters.

Calculating the gradient of the trainable network shown in fig. 1, updating the parameters of the whole network model in the gradient direction which minimizes the overall loss, and returning to step 1 by using the network after the parameters are updated, wherein the whole training process is iterated continuously until the final convergence, and the obtained student model is the target model.

TABLE 1

Table 1 shows the experimental results of a specific example, given two teacher models, which are trained from two subsets of the stanford dog data set or the category 101 data set, respectively, and have network structures of 18-layer residual network (ResNet-18) and 34-layer residual network (ResNet-34), respectively. Knowledge recombination is carried out on the two teacher models by adopting the method to obtain the student models, and the method is compared with other methods on Stanford dogs and Catech 101 data sets according to classification accuracy. From table 1, it can be seen that the student model obtained by training without manual labeling by the method of the present invention is superior to the performance of two teachers on their respective tasks, and even superior to the model obtained by the methods of model integration, classical knowledge distillation, and even training by using real data labels.

TABLE 2

Model (model)	LFW data set	Agedb30 dataset	CFP-FP dataset
				T1	97.43％	84.72％	86.20％
T2	97.80％	85.87％	87.27％
				Knowledge distillation method	95.15％	84.97％	86.87％
The method of the invention	98.10％	86.93％	87.73％

Table 2 shows the experimental results of another example comparing the process of the present invention with the conventionally known distillation process. Each teacher model in the table is trained on a subset of 3000 classes in the CASIA dataset.

TABLE 3

Model (model)	Stanford dog	CUB data set	FGVC-airplane	Stanford car
					Single teacher model	87.1％	75.6％	73.2％	82.9％
2 teacher model fusion	84.3％	78.9％	-	-
					3 teacher model fusion	83.1％	77.7％	79.0％	-
4 teacher model fusion	82.5％	77.5％	78.3％	84.2％

Table 3 shows that the performance of the student model is affected by the number of teachers. Wherein, the teacher model is obtained by training four different sub-classification task data sets respectively.

TABLE 4

Table 4 shows a comparison of the classification accuracy of the present method and knowledge distillation method on a stanford data set using various teacher models and student models of different structures.

The embodiments described in this specification are merely illustrative of implementations of the inventive concept and the scope of the present invention should not be considered limited to the specific forms set forth in the embodiments but rather by the equivalents thereof as may occur to those skilled in the art upon consideration of the present inventive concept.

Claims

1. A heterogeneous neural network knowledge reorganization method based on common feature learning comprises the following steps:

And

step 3, measuring the distribution difference among the transformation characteristics obtained in the step 2 by adopting a Maximum Mean Difference (MMD) method, fusing the characteristics of the teacher model, and adapting to the domains of the transformation characteristics of the teacher model and the student models; the method specifically comprises the following steps: use of

and

the approximate calculation formula of the MMD distance is as follows:

wherein the content of the first and second substances,

And

step 4, inputting the transferred characteristics into a trainable self-encoder to reconstruct the original output characteristics of the teacher model, and setting F'_TiRepresenting teacher original features F_TiMeasure the difference between the reconstructed features and the original, and define the reconstruction loss L_RIs defined as:

L_C＝‖w_s·F_s-[w₁·F₁,…,w_N·F_TN]‖₂(5)

and 6, combining the losses defined in the steps 3, 4 and 5 together through the hyperparametric weight α to form an overall loss function of the network, and calculating the value:

L＝L_C+(1-α)(L_M+L_R),α∈[0,1](6)

2. The method of claim 1 for heterogeneous neural network knowledge reorganization based on common feature learning, wherein: the structure of the teacher model in the step 1 includes, but is not limited to, a residual error network and a VGG network, and the structure of the student model is determined according to actual requirements.

3. The method of claim 1 for heterogeneous neural network knowledge reorganization based on common feature learning, wherein: the adaptation layer in the step 1 comprises but is not limited to a plurality of layers of 1 × 1 convolution, and the parameters of the adaptation layer of each teacher model and each student model are different and are obtained through learning; the number of adaptation layer channels can be set to an empirical value 256, or can be set according to actual requirements.

4. The method of claim 1 for heterogeneous neural network knowledge reorganization based on common feature learning, wherein: the shared feature extractor in the step 2 is a small convolution network consisting of three residual modules with 1 × 1 stride; in addition to this, the present invention is,

and

5. The method of claim 1 for heterogeneous neural network knowledge reorganization based on common feature learning, wherein: the soft target steaming described in step 5In the distillation module, the target distillation loss L_CDefined as the difference between the response scores of the student network and the predicted scores of the teacher model, may be measured using methods including, but not limited to, calculating Mean Square Error (MSE).