WO2018227800A1

WO2018227800A1 - Neural network training method and device

Info

Publication number: WO2018227800A1
Application number: PCT/CN2017/102032
Authority: WO
Inventors: 王乃岩; 陈韫韬
Original assignee: 北京图森未来科技有限公司
Priority date: 2017-06-15
Filing date: 2017-09-18
Publication date: 2018-12-20
Also published as: CN110969250A; CN107358293B; CN110969250B; CN107358293A

Abstract

A neural network training method and device, the method comprising: selecting a teacher network achieving the same functions as a student network (101); and iteratively training the student network on the basis of matching data similarity of first output data and data similarity of second output data corresponding to the same training sample data to obtain a target network, so as to implement migration of output data similarity of the teacher network to the student network (102), wherein the first output data is data output from a first specific network layer of the teacher network after the training sample data is input to the teacher network, and the second output data is data output from a second specific network layer of the student network after the training sample data is input to the student network. The student network trained by the method according to the output data similarity of the teacher network has a better performance.

Description

Neural network training method and device

The present application claims the priority of the Chinese Patent Application, which is filed on June 15, 2017, to the Chinese Patent Office, the number of which is incorporated herein by reference. in.

Technical field

The present invention relates to the field of computer vision, and in particular to a neural network training method and apparatus.

Background technique

In recent years, deep neural networks have achieved great success in various applications in the field of computer vision, such as image classification, target detection, and image segmentation. However, the model of deep neural network often contains a large number of model parameters, which is computationally intensive and slow in processing, and cannot be calculated in real time on some low-power, low-computing devices (such as embedded devices, integrated devices, etc.).

At present, in order to solve this problem, some solutions are proposed. For example, the knowledge of the teacher network (ie, the teacher network, the teacher network generally has a complex network structure, high accuracy, and slow calculation speed) is transferred to the student network through knowledge migration. (ie student network, student network network structure is relatively simple, low accuracy, fast) to improve student network performance. The student network at this time can be applied to devices with low power consumption and local computing power.

Knowledge migration is a general technique for compressing and accelerating deep neural network models. At present, there are three main methods of knowledge transfer, namely the Knowledge Distill (KD) method proposed in the paper “Distilling the knowledge in a neural network” published by Hinton et al. in 2014, and the paper published by Romero et al. in 2015 “Fitnets”. :FitsNets proposed by Hints for thin deep nets, and the Attention Transfer (AT) method proposed by the paper "Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer" published by Sergey in 2016.

The existing knowledge migration method uses the information of the single data in the output data of the teacher network to train the student network. Although the trained student network has certain improvement in performance, there is still much room for improvement.

Interpretation of related terms:

Knowledge Transfer: In deep neural networks, knowledge transfer refers to the use of training sample data in the intermediate network layer of the teacher network or the final network layer to help the student network with faster training but poor performance. Migrate a high-performing teacher network to the student network.

Knowledge Distill: In deep neural networks, knowledge extraction refers to the technique of training student networks by using the smooth category posterior probability output from the teacher network in the classification problem.

Teacher Network: A high-performance neural network used to provide more accurate monitoring information for student networks during the knowledge migration process.

Student Network: A fast calculation but poor performance suitable for deployment to a single neural network in a real-time scenario with high real-time requirements. The student network has greater computational throughput than the teacher network. Fewer model parameters.

Summary of the invention

In view of the above problems, the present invention provides a neural network training method and apparatus to further improve the performance and accuracy of a student network.

In an embodiment of the present invention, an aspect provides a neural network training method, where the method includes:

Select a network of teachers that achieve the same functionality as the student network;

And iteratively training the student network to obtain a target network based on matching the similarity between the data of the first output data corresponding to the same training sample data and the data of the second output data, so as to achieve similarity between the output data of the teacher network. Sexual migration to the student network;

The first output data is data output from the first specific network layer of the teacher network after the training sample data is input into the teacher network, and the second output data is input from the student network after the training sample data is input into the student network. The second specific network layer outputs the data.

Another aspect of the embodiments of the present invention provides a neural network training device, the device comprising:

The selection unit is used to select a teacher network that implements the same function as the student network;

a training unit, configured to iteratively train the student network to obtain a target network based on matching the similarity between the data of the first output data corresponding to the same training sample data and the data of the second output data, so as to implement the teacher network The similarity between the output data is migrated to the student network;

Another aspect of an embodiment of the present invention provides a neural network training apparatus, the apparatus comprising: a processor and at least one memory, the at least one memory for storing at least one machine executable instruction, the processor performing the at least An instruction to achieve:

In the embodiment of the present invention, the sample training data can be completely migrated to the student network in the data of the output data output by the teacher network, thereby realizing the result of the training sample data output through the teacher network and the result output through the target network. Consistent. According to the good generalization performance of the neural network, the output of the target network and the output of the teacher network are basically the same in the test set, thereby improving the accuracy of the student network.

Other features and advantages of the invention will be set forth in the description which follows, The objectives and other advantages of the invention may be realized and obtained by means of the structure particularly pointed in the appended claims.

The technical solution of the present invention will be further described in detail below through the accompanying drawings and embodiments.

DRAWINGS

The drawings are intended to provide a further understanding of the invention, and are intended to be a Obviously, the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings according to the drawings without any creative work. In the drawing:

1 is a flowchart of a neural network training method according to an embodiment of the present invention;

2 is a flow chart of training a student network in an embodiment of the present invention;

3 is a schematic structural diagram of a neural network training apparatus according to an embodiment of the present invention;

4 is a schematic structural diagram of a training unit according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a neural network training apparatus according to an embodiment of the present invention.

detailed description

In order to make those skilled in the art better understand the technical solutions in the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described in conjunction with the accompanying drawings in the embodiments of the present invention. The embodiments are only a part of the embodiments of the invention, and not all of the embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative efforts shall fall within the scope of the present invention.

The above is the core idea of the present invention, and in order to enable those skilled in the art to better understand the technical solutions in the embodiments of the present invention, the above objects, features and advantages of the embodiments of the present invention can be more clearly understood. The technical solution in the embodiment of the present invention is further described in detail.

Embodiment 1

1 is a flowchart of a neural network training method according to an embodiment of the present invention, where the method includes:

Step 101: Select a teacher network that implements the same function as the student network.

Functions such as image classification, target detection, image segmentation, etc. are implemented. The teacher network has excellent performance and high accuracy, but its structure is complex relative to the student network, the parameter weight is more, and the calculation speed is slower. The student network is fast, the performance is generally poor or the network structure is simple. A network with the same functions and excellent performance as that implemented by the student network can be selected as a teacher network in a set of pre-set neural network models.

Step 102: Iteratively train the student network to obtain a target network based on matching the similarity between the data of the first output data corresponding to the same training sample data and the data of the second output data to obtain the output of the teacher network. The similarity between data is migrated to the student network.

In the embodiment of the present invention, after the training sample data is input into the teacher network, the data outputted from the first specific network layer of the teacher network is collectively referred to as the first output data; after the training sample data is input into the student network, the second specificity from the student network is obtained. The data output by the network layer is collectively referred to as second output data.

Preferably, in the embodiment of the present invention, the first specific network layer is an intermediate network layer or a last layer network layer in the teacher network.

Preferably, in the embodiment of the present invention, the second specific network layer is an intermediate network layer or a last layer network layer of the student network.

Preferably, the foregoing step 102 specifically implements the method flow that can be as shown in FIG. 2, and specifically includes:

Step 102A: Construct an objective function of the student network, where the objective function includes a matching function of the similarity between the data of the first output data corresponding to the training sample data and the data of the second output data.

Step 102B: Perform iterative training on the student network by using the training sample data.

Step 102C: When the number of iterations training reaches a threshold or the target function satisfies a preset convergence condition, the target network is obtained.

Preferably, in the foregoing step 102B, the specific implementation may be as follows:

Performing the following iterative training on the student network multiple times (hereinafter referred to as the current iterative training, the training sample data used for the iterative training is referred to as current training sample data, and the iterative training includes the following steps A and B, Step C, Step D, Step E and Step F):

Step A: input current training sample data for the current iteration training into the teacher network and the student network, respectively, to obtain corresponding first output data and second output data;

Step B: calculating a similarity between each data in the first output data and calculating a phase between each data in the second output data Similarity

Step C: Calculate a probability of all the order of the data in the first output data according to the similarity between the data in the first output data, and select a target order from all the order of the data in the first output data. ;

Step D: calculating, according to the similarity between each data in the second output data, a probability of a target arrangement order of each data in the second output data;

Step E: calculating a value of the target function according to a probability of a target arrangement order of each data in the first output data and a probability of a target arrangement order of each data in the second output data, and adjusting according to the value of the target function The weight of the student network;

Step F: Perform the next iterative training based on the student network after adjusting the weight.

Preferably, in the embodiment of the present invention, in the foregoing step C, the target arrangement order is selected from all the arrangement orders of the data in the first output data, and the implementation manner includes but is not limited to the following two types:

In the first step, the order in which the probability values are greater than the preset threshold is selected from the all sorting orders of the data in the first output data as the target sorting order.

In the mode 2, the order of the preset number of the probabilities taken in the first output data is selected as the target arrangement order.

In the embodiment of the present invention, the order of the selected objects may be one or more, which is not strictly limited in this application.

Preferably, in step B, calculating the similarity between the data in the first output data (the second output data), specifically: calculating a spatial distance between the two data in the first output data (the second output data), A similarity between the two pairs of data is obtained according to the spatial distance.

In the embodiment of the present invention, the spatial distance may be a European distance, a cosine distance, a block distance, or a Mahalanobis distance, and the present application is not strictly limited. Take the example of calculating the Euclidean distance and cosine distance between two pairs of data.

Calculate the Euclidean distance between any two data x _i and x _j by the following formula (1):

In equation (1), α is a preset scale transformation factor, β is a preset contrast expansion factor, γ is an offset, and |·| ₂ represents a l ² norm of the vector.

Calculate the cosine distance between any two data x _i and x _j by the following formula (2):

S _ij =α(x _i ·x _j ) ^β +γ (2)

In equation (2), α is a preset scale transformation factor, β is a preset contrast expansion factor, γ is an offset, and · represents a point multiplication operation between vectors.

Preferably, in step C, each data in the first output data is calculated according to the similarity between the data in the first output data. The probability of all the sorting orders, in the specific implementation: for each sorting order, the order information of the sorting order and the similarity between all adjacent two data in the sorting order of the first output data are input into a preset probability In the calculation model, the probability of the arrangement order is obtained.

A training sample data y={y ₁ , y ₂ , y ₃ } is taken as an example for description. Enter y into the teacher network to obtain corresponding first output data x={x ₁ , x ₂ , x ₃ }; calculate the similarity between the two data in x as s ₁₂ (the similarity between x ₁ and x ₂ ), s ₁₃ (similarity between x ₁ and x ₃ ), s ₂₃ (similarity between x ₂ and x ₃ ). The number of all sort orders of x ₁ , x ₂ , and x ₃ is 3! =6, the order of arrangement is π ₁ =x ₁ →x ₂ →x ₃ ,π ₂ =x ₁ →x ₃ →x ₂ ,π ₃ =x ₂ →x ₁ →x ₃ ,π ₄ =x ₂ → x ₃ →x ₁ , π ₅ =x ₃ →x ₁ →x ₂ , π ₆ =x ₃ →x ₂ →x ₁ ; The probability of calculating the above six sorting orders based on the similarity between the data is

The corresponding target arrangement order of each first output data corresponding to each training sample data may be the same or different, and the foregoing x is taken as an example, and the target output order corresponding to the first output data corresponding to the first sample training data is assumed to be π ₁ =x ₁ →x ₂ →x ₃ , π ₂ =x ₁ →x ₃ →x ₂ , π ₃ =x ₂ →x ₁ →x ₃ , the target corresponding to the first output data corresponding to the second sample training data The order of arrangement is π ₃ = x ₂ → x ₁ → x ₃ , π ₄ = x ₂ → x ₃ → x ₁ , π ₅ = x ₃ → x ₁ → x ₂ .

Preferably, in the step D, the probability of the target arrangement order of each data in the second output data is calculated according to the similarity between the data in the second output data, and the specific implementation is as follows: for each target arrangement order, the target is The order information of the arrangement order and the similarity between all adjacent two data in the target arrangement order of the second output data are input into the probability calculation model, and the probability of the target arrangement order is obtained.

In the embodiment of the present invention, the probability calculation model may be a first-order Plackett probability model, a high-order Plackett probability model, or other models capable of calculating a probability, which is not strictly limited.

The following is an example of the probability of using the first-order Plackett probability model to calculate the order of arrangement.

Assume that the first output data corresponding to a certain training sample data is x={x ₁ , x ₂ , x ₃ , x ₄ }, taking the probability of calculating the order of π ₁ and π ₂ as an example, assuming π ₁ = x ₁ → x ₂ →x ₃ →x ₄ , π ₂ =x ₁ →x ₃ →x ₄ →x ₂ , the following results are obtained by the first-order Plackett probability model:

Where f(·) is any linear or non-linear mapping function, and the sum of the probabilities of all the ordering is 1

In the embodiment of the present invention, the target arrangement order may be one or multiple.

In the embodiment of the present invention, the objective function of the student network may only include a matching function, and the objective function may also be a sum of a matching function and a task loss function, and the expression of the task loss function is related to a task to be implemented by the student network. For example, the task loss function can be the same as the objective function of the teacher network. The expression of the matching function can be, but is not limited to, the following formula (3) and formula (4).

Example 1. When the target order is one, the objective function of the student network can be set as shown in the following formula (3):

L=-log P(π ^t |X ^s ) (3)

In the formula (3), π ^t is the target arrangement order of each data in the first output data corresponding to the current training sample data, and X ^s is the second output data corresponding to the current training sample data, and P(π ^t |X ^s ) is The probability of the target arrangement order of each data in the second output data.

Preferably, the foregoing target arrangement order π ^t is an arrangement order in which the probability values are the largest among all the arrangement orders of the data in the first output data of the current training sample data.

When the target order is multiple, the embodiment of the present invention may train the student network based on a manner of matching a probability distribution of a plurality of target arrangement orders. In the embodiment of the present invention, there are various methods for matching the probability distribution of a plurality of target arrangement orders, such as a total variation distance based on a probability distribution, a Wesserstein distance, a Jensen-Shannon divergence, or a Kullback-Leibler divergence.

Taking the Kullback-Leibler divergence based on the probability distribution as an example, the objective function expression of the student network may be as follows (4):

In the formula (4), π is a target arrangement order, X ^s is the second output data corresponding to the current training sample data, X ^t is the first output data corresponding to the current training sample data, and P(π|X ^s ) is the current The probability of π of each data in the second transmission data of the training sample data, P(π|X ^t ) is the probability of π of each data in the first transmission data of the current training sample data, and Q is a set of the target arrangement order.

Preferably, adjusting the weight of the student network according to the value of the objective function in the foregoing step E includes: adopting a preset gradient descent optimization algorithm, and adjusting the weight of the student network according to the value of the target function.

Preferably, the foregoing steps A and B further include the following steps: processing the first output data and the second output data by using a downsampling algorithm and an interpolation algorithm, so that a spatial dimension of the first output data and a The spatial dimensions of the two output data are consistent, and the number of the first output data and the number of the second output data are both consistent with the number of the current training sample data. Certainly, if the spatial output of the first output data and the second output data obtained by the step A are the same, and the number of the first output data and the second output data are both consistent with the quantity of the current training sample data, the step A is not needed. This step is added to step B, that is, step B is directly executed after step A. The aforementioned spatial dimension generally refers to the number of inputs The number of channels, the number of channels, the height and width of the feature map.

It should be noted that the above steps A to F have no strict sequence, and the following steps A' to B' may be used instead of the above steps A to B.

Step A', inputting current training sample data for the iterative training into the teacher network, obtaining corresponding first output data, and calculating a similarity between the data in the first output data;

Step B', inputting the current training sample data into the student network, obtaining corresponding second output data, and calculating a similarity between the data in the second output data.

Assume that the three training sample data used to train the student network (represented by S) are y ₁ = {y ₁₁ , y ₁₂ , y ₁₃ }, y ₂ = {y ₂₁ , y ₂₂ , y ₂₃ }, y ₃ = {y ₃₁ , y ₃₂ , y ₃₃ }; the first output data of the three training sample data input to the teacher network (represented by T) is

The second output data of the three training sample data input to the student network output is

In the embodiment of the present invention, all the order of the data in the first output data is used as the target arrangement order. a set of target arrangement orders of the first output data corresponding to the i-th training sample data

among them

The probability of calculating the target arrangement order of the first output data corresponding to the i-th training data is

a set of target arrangement orders of the second output data corresponding to the i-th training data

among them

The probability of calculating the target arrangement order of the second output data corresponding to the i-th training sample data is

Since the number of the first output data and the second output data corresponding to the same training sample data is the same, the order in which the data in the first output data and the second output data are arranged in the same order is used as the same target arrangement order. For example, the second output data of the i-th training sample data

With its first output data

As the same target arrangement order, represented by π _i1 , the target output order set Q _i of the first output data and the second output data of the i-th training sample data is expressed as Q _i ={π _i1 , π _i2 , π _I3 , π _i4 , π _i5 , π _i6 }

Perform the following iterations of training:

The first iteration training: input y ₁ into the teacher network and the student network, and obtain the corresponding first output data as

And the second output data is

Calculation

Similarity and calculation between data in

Similarity between the data in the data;

Similarity calculation between data in

The probability of all the order of the data in each order, the order of all the sorting is used as the target sorting order;

The similarity between the data is calculated

The probability of the target arrangement order of each of the data; the probability of the target arrangement order of each data in the first output data corresponding to y _{1 and} the probability of the target arrangement order of each data in the second output data are input to the objective function, and the calculation is performed The value of the objective function is L ₁ , and the current weight W _{0 of the} student network is adjusted according to the L ₁ , and the adjusted weight W _{1 is obtained} ;

The second iteration training: input y ₂ into the teacher network and the student network, and obtain the corresponding first output data as

And the second output data is

Calculation

Similarity and calculation between data in

Similarity between the data in the data;

Similarity calculation between data in

The similarity between the data is calculated

The probability of the target arrangement order of each data in the data; the probability of the target arrangement order of each data in the first output data corresponding to y _{2 and} the probability of the target arrangement order of each data in the second output data are input into the objective function, and the calculation is performed The value of the objective function is L ₂ , and the current weight W _{1 of the} student network is adjusted according to the L ₂ , and the adjusted weight is W ₂ ;

The third iteration training: input y ₃ into the teacher network and the student network, and obtain the corresponding first output data as

And the second output data is

Calculation

Similarity and calculation between data in

Similarity between the data in the data;

Similarity calculation between data in

The similarity between the data is calculated

The probability of the target arrangement order of each data in the data; the probability of the target arrangement order of each data in the first output data corresponding to y _{3 and} the probability of the target arrangement order of each data in the second output data are input into the objective function, and the calculation is performed The value of the objective function is L ₃ , and the current weight W _{2 of the} student network is adjusted according to the L ₃ , and the adjusted weight is W ₃ .

Embodiment 2

Based on the same concept as the neural network training method provided in the foregoing first embodiment, the second embodiment of the present invention provides a neural network training device. The structure of the device is as shown in FIG. 3, and includes:

The selecting unit 31 is configured to select a teacher network that implements the same function as the student network;

The training unit 32 is configured to iteratively train the student network to obtain a target network based on matching the similarity between the data of the first output data corresponding to the same training sample data and the data of the second output data to obtain the target network. The similarity between the output data of the network is migrated to the student network;

In the embodiment of the present invention, functions implemented by the teacher network and the student network are image classification, target detection, image segmentation, and the like. The teacher network has excellent performance and high accuracy, but its structure is complex relative to the student network, the parameter weight is more, and the calculation speed is slower. The student network is fast, the performance is generally poor or the network structure is simple. The selecting unit 31 may select a network having the same function and excellent performance as that implemented by the student network in the set of the preset neural network models as the teacher network.

In the embodiment of the present invention, the first specific network layer is an intermediate network layer or a last layer network layer in the teacher network; and/or the second specific network layer is an intermediate network layer or last of the student network. A layer of network.

Preferably, the structure of the training unit 32 is as shown in FIG. 4, and specifically includes a construction module 321, a training module 322, and a determination module 323, where:

a building module 321 is configured to construct an objective function of the student network, where the objective function includes a matching function of the similarity between the data of the first output data corresponding to the training sample data and the data of the second output data;

The training module 322 is configured to perform iterative training on the student network by using the training sample data;

The determining module 323 is configured to obtain the target network when the training module 322 iteratively trains the number of times to reach a threshold or the target function satisfies a preset convergence condition.

Preferably, the training module 322 is specifically configured to:

Step B: calculating a similarity between each data in the first output data and calculating a similarity between the data in the second output data;

Step C: Calculating all the permutations of the data in the first output data according to the similarity between the data in the first output data a probability of order, and selecting a target arrangement order from all the arrangement orders of each data in the first output data;

Preferably, the training module 322 selects a target arrangement order from all the arrangement orders of the data in the first output data, and specifically includes: selecting, from all the arrangement orders of the data in the first output data, an arrangement whose probability value is greater than a preset threshold. The order is arranged as a target; or, from all the order of the data in the first output data, the order in which the probability values are arranged in the previous preset number is selected as the target sorting order.

Preferably, the training module 322 calculates the similarity between the data in the first output data, and specifically includes: calculating a spatial distance between the two data in the first output data, and obtaining the two-two data according to the spatial distance. Similarity between

The training module 322 calculates the similarity between the data in the second output data, and specifically includes: calculating a spatial distance between the two data in the second output data, and obtaining a similarity between the two data according to the spatial distance degree.

Preferably, the training module 322 calculates a probability of all the order of the data in the first output data according to the similarity between the data in the first output data, and specifically includes: for each sorting order, the order of the sorting The sequence information and the similarity between all adjacent two data in the arrangement order of the first output data are input into a preset probability calculation model, and the probability of the arrangement order is obtained;

The training module 322 calculates a target arrangement order probability of each data in the second output data according to the similarity between the data in the second output data, and specifically includes: sequence information of the target arrangement order for each target arrangement order And the similarity between all adjacent two data in the target arrangement order of the second output data is input into the probability calculation model, and the probability of the target arrangement order is obtained.

In the embodiment of the present invention, the target arrangement order may be one or multiple. When the target order is multiple, the embodiment of the present invention may train the student network based on a manner of matching a probability distribution of a plurality of target arrangement orders. In the embodiment of the present invention, there are various methods for matching the probability distribution of a plurality of target arrangement orders, such as a total variation distance based on a probability distribution, a Wesserstein distance, a Jensen-Shannon divergence, or a Kullback-Leibler divergence.

In the embodiment of the present invention, the objective function of the student network may only include one matching function, and the objective function may also be The sum of a matching function and a task loss function. The expression of the task loss function is related to the task to be implemented by the student network. For example, the task loss function can be the same as the objective function of the teacher network.

Preferably, the training module 322 adjusts the weight of the student network according to the value of the objective function, and specifically includes: adopting a preset gradient descent optimization algorithm, and adjusting the student network according to the value of the target function. Weights.

Preferably, the training module 322 is further configured to: before calculating the similarity between the data in the first output data and calculating the similarity between the data in the second output data, by using a downsampling algorithm and an interpolation algorithm Processing the first output data and the second output data such that a spatial dimension of the first output data coincides with a spatial dimension of the second output data, and the number of the first output data and the quantity of the second output data are both The number of current training sample data is consistent.

Embodiment 3

Based on the same concept as the neural network training method provided in the foregoing first embodiment, the third embodiment of the present invention provides a neural network training device. The structure of the device is as shown in FIG. 5, including: a processor 501 and at least one memory 502. The at least one memory 502 is configured to store at least one machine executable instruction, and the processor 501 executes the at least one instruction to: select a teacher network that implements the same function as the student network; and corresponding to matching the same training sample data Iteratively training the student network to obtain a target network by the similarity between the data of the first output data and the data of the second output data, so as to implement the migration of the similarity between the output data of the teacher network to the student network Wherein: the first output data is data output from the first specific network layer of the teacher network after the training sample data is input into the teacher network, and the second output data is input from the student after the training sample data is input into the student network The data output by the second specific network layer of the network.

The processor 501 executes the at least one instruction to iteratively train the student network to obtain a target based on matching the inter-sample similarity of the first output data corresponding to the same training sample data with the sample-to-sample similarity of the second output data. The network specifically includes: an objective function for constructing the student network, the objective function includes a matching function of the similarity between the data of the first output data corresponding to the training sample data and the data of the second output data; The training sample data performs iterative training on the student network; when the iterative training number reaches a threshold or the objective function satisfies a preset convergence condition, the target network is obtained.

The processor 501 executes the at least one instruction to implement the training sample data on the student network. The iterative training of the network specifically includes: performing the following iterative training on the student network: inputting the current training sample data used for the iterative training into the teacher network and the student network, respectively, to obtain corresponding first output data and a second output data; calculating a similarity between the data in the first output data and calculating a similarity between the data in the second output data; calculating each of the first output data according to the similarity between the data in the first output data a probability of all the order of the data, and selecting a target arrangement order from all the arrangement orders of the data in the first output data; calculating each data in the second output data according to the similarity between the data in the second output data The probability of the target arrangement order; calculating the value of the objective function according to the probability of the target arrangement order of each data in the first output data and the probability of the target arrangement order of each data in the second output data, and according to the objective function The value adjusts the weight of the student network; the next iterative training is performed based on the student network after adjusting the weight.

The processor 501 executes the at least one instruction to select a target arrangement order from all the arrangement orders of the data in the first output data, and specifically includes: selecting a probability from all the arrangement orders of the data in the first output data. The arrangement order of the values greater than the preset threshold is used as the target arrangement order; or, the order of the preset number of the probabilities taken in the first output data is selected as the target arrangement order.

The processor 501 executes the at least one instruction to implement the calculation of the similarity between the data in the first output data, and specifically includes: calculating a spatial distance between the two data in the first output data, according to the spatial distance Obtaining a similarity between the two data sets; calculating a similarity between the data in the second output data, specifically: calculating a spatial distance between the two data in the second output data, according to the spatial distance, obtaining the The similarity between two data sets.

The processor 501 executes the at least one instruction to calculate a probability of calculating all the order of the data in the first output data according to the similarity between the data in the first output data, and specifically includes: for each sorting order, Inputting the order information of the arrangement order and the similarity between all adjacent two data in the arrangement order of the first output data into a preset probability calculation model to obtain a probability of the arrangement order; according to the second output data Calculating the probability of the target arrangement order of each data in the second output data, the specificity includes: ordering the order of the target arrangement order and the target arrangement of the second output data for each target arrangement order The similarity between all adjacent two data in the sequence is input into the probability calculation model, and the probability of the target arrangement order is obtained.

Wherein, when the target arrangement order is one, the objective function of the student network is as follows:

L=-log P(π ^t |X ^s )

Where π ^t is the target arrangement order of each data in the first output data corresponding to the current training sample data, X ^s is the second output data corresponding to the current training sample data, and P(π ^t |X ^s ) is the second output The probability of the target order of each data in the data.

Wherein, when the target is arranged in a plurality of orders, the objective function of the student network is as follows:

Where π is a target arrangement order, X ^s is the second output data corresponding to the current training sample data, X ^t is the first output data corresponding to the current training sample data, and P(π|X ^s ) is the current training sample data. The order of each data in the second transmission data is a probability of π, and P(π|X ^t ) is a probability that the order of each data in the first transmission data of the current training sample data is π, and Q is the target arrangement order. set.

The processor 501 performs the at least one instruction implementation, and adjusts the weight of the student network according to the value of the target function, which includes: adopting a preset gradient descent optimization algorithm, according to the target function The value adjusts the weight of the student network.

The processor further performs the performing, after the processor 501 executes the at least one instruction to calculate a similarity between data in the first output data and calculates a similarity between data in the second output data. At least one instruction is implemented to: process the first output data and the second output data by a downsampling algorithm and an interpolation algorithm, such that a spatial dimension of the first output data is consistent with a spatial dimension of the second output data, and The number of one output data and the number of second output data are both consistent with the number of current training sample data.

The first specific network layer is an intermediate network layer or a last layer network layer in the teacher network; and the second specific network layer is an intermediate network layer or a last layer network layer of the student network.

Based on the same concept as the foregoing method, the embodiment of the present invention further provides a storage medium (which may be a non-volatile machine readable storage medium), where the computer program for storing neural network training is stored. The program has a code segment configured to perform the following steps: selecting a teacher network that implements the same function as the student network; based on matching the data between the first output data corresponding to the same training sample data and the data of the second output data Similarity to iteratively train the student network to obtain a target network to implement migration of similarity between output data of the teacher network to the student network; wherein: the first output data is used to input the training sample data into a teacher network The data output from the first specific network layer of the teacher network, the second output data is data output from the second specific network layer of the student network after the training sample data is input into the student network.

Based on the same concept as the foregoing method, an embodiment of the present invention further provides a computer program having a code segment configured to perform the following neural network training: selecting a teacher network that implements the same function as the student network; Iteratively training the student network to obtain a target network by using the similarity between the data of the first output data corresponding to the same training sample data and the data of the second output data to achieve similarity migration between the output data of the teacher network Go to the student network; wherein: the first output data is data output from a first specific network layer of a teacher network after the training sample data is input into a teacher network, and the second output data is input to the training sample data The student network then outputs data from the second specific network layer of the student network.

In summary, in the embodiment of the present invention, the sample training data can be completely migrated to the student network in the data of the output data output by the teacher network, thereby realizing the result and passing of the training sample data output through the teacher network. The results of the target network output are basically the same. According to the good generalization performance of the neural network, the output of the target network and the output of the teacher network are basically the same in the test set, thereby improving the accuracy of the student network. The basic principles of the present invention have been described above in connection with the specific embodiments, but it should be noted that those skilled in the art can understand that all or any of the steps or components of the method and apparatus of the present invention can be in any computing device (including The processor, the storage medium, or the like, or the network of computing devices, implemented in hardware firmware, software, or a combination thereof, which is the basic programming skill of those skilled in the art in the context of reading the description of the present invention. Can be achieved.

A person skilled in the art can understand that all or part of the steps carried by the method of implementing the above embodiments can be completed by a program to instruct related hardware, and the program can be stored in a computer readable storage medium. , including one or a combination of the steps of the method embodiments.

In addition, each functional unit in each embodiment of the present invention may be integrated into one processing module, or each unit may exist physically separately, or two or more units may be integrated into one module. The above integrated modules can be implemented in the form of hardware or in the form of software functional modules. The integrated modules, if implemented in the form of software functional modules and sold or used as stand-alone products, may also be stored in a computer readable storage medium.

Those skilled in the art will appreciate that embodiments of the present invention can be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or a combination of software and hardware. Moreover, the invention can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage and optical storage, etc.) including computer usable program code.

The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (system), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or FIG. These computer program instructions can be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing device to produce a machine for the execution of instructions for execution by a processor of a computer or other programmable data processing device. Means for implementing the functions specified in one or more of the flow or in a block or blocks of the flow chart.

The computer program instructions can also be stored in a computer readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer readable memory produce an article of manufacture comprising the instruction device. The apparatus implements the functions specified in one or more blocks of a flow or a flow and/or block diagram of the flowchart.

These computer program instructions can also be loaded onto a computer or other programmable data processing device such that a series of operational steps are performed on a computer or other programmable device to produce computer-implemented processing for execution on a computer or other programmable device. The instructions provide steps for implementing the functions specified in one or more of the flow or in a block or blocks of a flow diagram.

Although the above-described embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to the embodiments once they are aware of the basic inventive concept. Therefore, the appended claims are intended to be interpreted as including the above-described embodiments and all changes and modifications falling within the scope of the invention.

It is apparent that those skilled in the art can make various modifications and variations to the invention without departing from the spirit and scope of the invention. Thus, it is intended that the present invention cover the modifications and modifications of the invention

Claims

A neural network training method, comprising:

Select a network of teachers that achieve the same functionality as the student network;

And iteratively training the student network to obtain a target network based on matching the similarity between the data of the first output data corresponding to the same training sample data and the data of the second output data, so as to achieve similarity between the output data of the teacher network. Sexual migration to the student network;

The first output data is data output from the first specific network layer of the teacher network after the training sample data is input into the teacher network, and the second output data is input from the student network after the training sample data is input into the student network. The second specific network layer outputs the data.
The method according to claim 1, wherein the student network is iteratively trained to obtain a target network based on matching the inter-sample similarity of the first output data corresponding to the same training sample data with the sample-to-sample similarity of the second output data. Specifically, including:

Constructing an objective function of the student network, the objective function comprising a matching function of the similarity between the data of the first output data corresponding to the training sample data and the data of the second output data;

Performing iterative training on the student network by using the training sample data;

The target network is obtained when the number of iterative training reaches a threshold or the objective function satisfies a preset convergence condition.
The method according to claim 2, wherein the iterative training is performed on the student network by using the training sample data, which specifically includes:

Perform the following iterative training on the student network:

The current training sample data used for the iterative training is input into the teacher network and the student network, respectively, to obtain corresponding first output data and second output data;

Calculating a similarity between each data in the first output data and calculating a similarity between the data in the second output data;

Calculating a probability of all the order of the data in the first output data according to the similarity between the data in the first output data, and selecting a target arrangement order from all the arrangement orders of the data in the first output data;

Calculating a probability of a target arrangement order of each data in the second output data according to a similarity between each data in the second output data;

Calculating a value of the objective function according to a probability of a target arrangement order of each data in the first output data and a probability of a target arrangement order of each data in the second output data, and adjusting the student according to the value of the objective function The weight of the network;

The next iterative training is performed based on the student network after adjusting the weights.
The method according to claim 3, wherein the target arrangement order is selected from all the arrangement orders of the data in the first output data, and specifically includes:

Selecting an order in which the probability values are greater than a preset threshold from all the order of the data in the first output data Target arrangement order;

Alternatively, the order in which the probability values are ranked in the previous preset number is selected from the all sorting orders of the data in the first output data as the target sorting order.
The method according to claim 3, wherein calculating the similarity between the data in the first output data comprises: calculating a spatial distance between the two data in the first output data, according to the spatial distance Similarity between the two pairs of data;

Calculating the similarity between the data in the second output data, specifically: calculating a spatial distance between the two data in the second output data, and obtaining a similarity between the two data according to the spatial distance.
The method according to claim 3, wherein the probability of calculating all the order of the data in the first output data is calculated according to the similarity between the data in the first output data, and specifically includes:

For each permutation order, the order information of the arrangement order and the similarity between all adjacent two data in the arrangement order of the first output data are input into a preset probability calculation model to obtain a probability of the arrangement order. ;

The probability of calculating the target arrangement order of each data in the second output data according to the similarity between the data in the second output data, specifically includes: order information of the target arrangement order and the second output for each target arrangement order The similarity between all adjacent two data in the target arrangement order of the data is input into the probability calculation model, and the probability of the target arrangement order is obtained.
The method according to claim 3, wherein when the target arrangement order is one, the objective function of the student network is as follows:

L=-log P(π t |X s )

Where π t is the target arrangement order of each data in the first output data corresponding to the current training sample data, X s is the second output data corresponding to the current training sample data, and P(π t |X s ) is the second output The probability of the target order of each data in the data.
The method according to claim 3, wherein when the target arrangement order is plural, the objective function of the student network is as follows:

Where π is a target arrangement order, X s is the second output data corresponding to the current training sample data, X t is the first output data corresponding to the current training sample data, and P(π|X s ) is the current training sample data. The order of each data in the second transmission data is a probability of π, and P(π|X t ) is a probability that the order of each data in the first transmission data of the current training sample data is π, and Q is the target arrangement order. set.
The method according to claim 3, wherein adjusting the weight of the student network according to the value of the objective function comprises:

The weight of the student network is adjusted according to the value of the objective function by using a preset gradient descent optimization algorithm.
The method according to claim 3, wherein before calculating the similarity between the data in the first output data and calculating the similarity between the data in the second output data, the method further comprises: performing a downsampling algorithm and interpolating The algorithm processes the first output data and the second output data such that a spatial dimension of the first output data is consistent with a spatial dimension of the second output data, and the number of the first output data and the number of the second output data Both are consistent with the number of current training sample data.
The method according to claim 1, wherein the first specific network layer is an intermediate network layer or a last network layer in a teacher network;

The second specific network layer is an intermediate network layer or a last network layer of the student network.
A neural network training device, comprising:

The selection unit is used to select a teacher network that implements the same function as the student network;

a training unit, configured to iteratively train the student network to obtain a target network based on matching the similarity between the data of the first output data corresponding to the same training sample data and the data of the second output data, so as to implement the teacher network The similarity between the output data is migrated to the student network;

The first output data is data output from the first specific network layer of the teacher network after the training sample data is input into the teacher network, and the second output data is input from the student network after the training sample data is input into the student network. The second specific network layer outputs the data.
The device according to claim 12, wherein the training unit comprises:

a building module, configured to construct an objective function of the student network, the objective function comprising a matching function of the similarity between the data of the first output data corresponding to the training sample data and the data of the second output data;

a training module, configured to perform iterative training on the student network by using the training sample data;

And a determining module, configured to obtain the target network when the training module iterative training times reaches a threshold or the target function satisfies a preset convergence condition.
The device according to claim 13, wherein the training module is specifically configured to:

Perform the following iterative training on the student network:

The current training sample data used for the iterative training is input into the teacher network and the student network, respectively, to obtain corresponding first output data and second output data;

Calculating a similarity between each data in the first output data and calculating a similarity between the data in the second output data;

Calculating a probability of all the order of the data in the first output data according to the similarity between the data in the first output data, and selecting a target arrangement order from all the arrangement orders of the data in the first output data;

Calculating a probability of a target arrangement order of each data in the second output data according to a similarity between each data in the second output data;

Calculating a value of the objective function according to a probability of a target arrangement order of each data in the first output data and a probability of a target arrangement order of each data in the second output data, and adjusting the student according to the value of the objective function The weight of the network;

The next iterative training is performed based on the student network after adjusting the weights.
The apparatus according to claim 14, wherein the training module selects a target arrangement order from all the arrangement orders of the data in the first output data, and specifically includes:

Selecting an arrangement order in which the probability values are greater than a preset threshold from all the arrangement orders of the data in the first output data as the target arrangement order;

Alternatively, the order in which the probability values are ranked in the previous preset number is selected from the all sorting orders of the data in the first output data as the target sorting order.
The apparatus according to claim 14, wherein the training module calculates the similarity between the data in the first output data, specifically: calculating a spatial distance between the two data in the first output data, according to The spatial distance is obtained to obtain similarity between the two pairs of data;

The training module calculates the similarity between the data in the second output data, specifically: calculating a spatial distance between the two data in the second output data, and obtaining a similarity between the two data according to the spatial distance .
The apparatus according to claim 14, wherein the training module calculates a probability of all the order of the data in the first output data according to the similarity between the data in the first output data, and specifically includes: Arranging the order, and inputting the order information of the arrangement order and the similarity between all adjacent two data in the arrangement order of the first output data into a preset probability calculation model to obtain a probability of the arrangement order;

The training module calculates a target arrangement order probability of each data in the second output data according to the similarity between the data in the second output data, and specifically includes: sequence information of the target arrangement order for each target arrangement order and The similarity between all adjacent two data in the target arrangement order of the second output data is input into the probability calculation model, and the probability of the target arrangement order is obtained.
The device of claim 14 wherein:

When the target arrangement order is one, the objective function of the student network is as follows:

L=-log P(π t |X s )

Where π t is the target arrangement order of each data in the second output data, X s is the second output data corresponding to the current training sample data, and P(π t |X s ) is the probability of π t .
The device of claim 14 wherein:

When the target arrangement order is plural, the objective function of the student network is as follows:

Where π is a target arrangement order, X s is the second output data corresponding to the current training sample data, X t is the first output data corresponding to the current training sample data, and P(π|X s ) is the current training sample data. The order of each data in the second transmission data is a probability of π, and P(π|X t ) is a probability that the order of each data in the first transmission data of the current training sample data is π, and Q is the target arrangement order. set.
The apparatus according to claim 14, wherein the training module adjusts the weight of the student network according to the value of the objective function, which specifically includes:

The weight of the student network is adjusted according to the value of the objective function by using a preset gradient descent optimization algorithm.
The apparatus of claim 14, the training module further configured to:

Processing the first output data and the second output data by a downsampling algorithm and an interpolation algorithm before calculating a similarity between the data in the first output data and calculating a similarity between the data in the second output data, The spatial dimension of the first output data is made to coincide with the spatial dimension of the second output data, and the number of the first output data and the number of the second output data are both consistent with the number of the current training sample data.
The apparatus according to claim 12, wherein said first specific network layer is an intermediate network layer or a last network layer in a teacher network;

The second specific network layer is an intermediate network layer or a last network layer of the student network.
A neural network training apparatus, comprising: a processor and at least one memory, the at least one memory for storing at least one machine executable instruction, the processor executing the at least one instruction to:

Select a network of teachers that achieve the same functionality as the student network;

And iteratively training the student network to obtain a target network based on matching the similarity between the data of the first output data corresponding to the same training sample data and the data of the second output data, so as to achieve similarity between the output data of the teacher network. Sexual migration to the student network;

The first output data is data output from the first specific network layer of the teacher network after the training sample data is input into the teacher network, and the second output data is input from the student network after the training sample data is input into the student network. The second specific network layer outputs the data.
The apparatus according to claim 23, wherein said processor executes said at least one instruction to achieve an inter-sample similarity based on matching first output data corresponding to the same training sample data and a sample-to-sample similarity of second output data To iteratively train the student network to obtain the target network, including:

Constructing an objective function of the student network, the objective function comprising a matching function of the similarity between the data of the first output data corresponding to the training sample data and the data of the second output data;

Performing iterative training on the student network by using the training sample data;

The target network is obtained when the number of iterative training reaches a threshold or the objective function satisfies a preset convergence condition.
The apparatus according to claim 24, wherein the processor executes the at least one instruction to perform iterative training on the student network by using the training sample data, and specifically includes:

Perform the following iterative training on the student network:

The current training sample data used for the iterative training is input into the teacher network and the student network, respectively, to obtain corresponding first output data and second output data;

Calculating a similarity between each data in the first output data and calculating a similarity between the data in the second output data;

Calculating a probability of all the order of the data in the first output data according to the similarity between the data in the first output data, and selecting a target arrangement order from all the arrangement orders of the data in the first output data;

Calculating a probability of a target arrangement order of each data in the second output data according to a similarity between each data in the second output data;

Calculating a value of the objective function according to a probability of a target arrangement order of each data in the first output data and a probability of a target arrangement order of each data in the second output data, and adjusting the student according to the value of the objective function The weight of the network;

The next iterative training is performed based on the student network after adjusting the weights.
The apparatus according to claim 25, wherein the processor executes the at least one instruction to select a target arrangement order from all the arrangement orders of the data in the first output data, and specifically includes:

Selecting an arrangement order in which the probability values are greater than a preset threshold from all the arrangement orders of the data in the first output data as the target arrangement order;

Alternatively, the order in which the probability values are ranked in the previous preset number is selected from the all sorting orders of the data in the first output data as the target sorting order.
The device according to claim 25, wherein the processor executing the at least one instruction to calculate a similarity between data in the first output data comprises: calculating two or two data in the first output data. a spatial distance between the two, and the similarity between the two data sets according to the spatial distance;

Calculating the similarity between the data in the second output data, specifically: calculating a spatial distance between the two data in the second output data, and obtaining a similarity between the two data according to the spatial distance.
The apparatus according to claim 25, wherein said processor executes said at least one instruction to calculate a probability of calculating all the order of each data in the first output data based on the similarity between the data in the first output data. Specifically, including:

For each permutation order, the order information of the arrangement order and the similarity between all adjacent two data in the arrangement order of the first output data are input into a preset probability calculation model to obtain a probability of the arrangement order. ;

Calculating an outline of a target arrangement order of each data in the second output data according to the similarity between the data in the second output data The rate includes: for each target arrangement order, inputting the order information of the target arrangement order and the similarity between all adjacent two data in the target arrangement order of the second output data into the probability calculation model, The probability of obtaining the target arrangement order is obtained.
The apparatus according to claim 25, wherein when said target arrangement order is one, an objective function of said student network is as follows:

L=-log P(π t |X s )

Where π t is the target arrangement order of each data in the first output data corresponding to the current training sample data, X s is the second output data corresponding to the current training sample data, and P(π t |X s ) is the second output The probability of the target order of each data in the data.
The apparatus according to claim 25, wherein when the target arrangement order is plural, the objective function of the student network is as follows:

Where π is a target arrangement order, X s is the second output data corresponding to the current training sample data, X t is the first output data corresponding to the current training sample data, and P(π|X s ) is the current training sample data. The order of each data in the second transmission data is a probability of π, and P(π|X t ) is a probability that the order of each data in the first transmission data of the current training sample data is π, and Q is the target arrangement order. set.
The apparatus according to claim 25, wherein the processor performs the at least one instruction implementation, and adjusting the weight of the student network according to the value of the objective function, specifically:

The weight of the student network is adjusted according to the value of the objective function by using a preset gradient descent optimization algorithm.
The apparatus according to claim 25, wherein said processor executes said at least one instruction to calculate a similarity between data in said first output data and calculate a similarity between data in said second output data The processor further executes the at least one instruction to: process the first output data and the second output data by a downsampling algorithm and an interpolation algorithm, such that a spatial dimension of the first output data The spatial dimensions of the two output data are consistent, and the number of the first output data and the number of the second output data are both consistent with the number of the current training sample data.
The apparatus according to claim 23, wherein said first specific network layer is an intermediate network layer or a last layer network layer in a teacher network;

The second specific network layer is an intermediate network layer or a last network layer of the student network.