WO2018227800A1 - Neural network training method and device - Google Patents

Neural network training method and device Download PDF

Info

Publication number
WO2018227800A1
WO2018227800A1 PCT/CN2017/102032 CN2017102032W WO2018227800A1 WO 2018227800 A1 WO2018227800 A1 WO 2018227800A1 CN 2017102032 W CN2017102032 W CN 2017102032W WO 2018227800 A1 WO2018227800 A1 WO 2018227800A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
output data
network
similarity
output
Prior art date
Application number
PCT/CN2017/102032
Other languages
French (fr)
Chinese (zh)
Inventor
王乃岩
陈韫韬
Original Assignee
北京图森未来科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京图森未来科技有限公司 filed Critical 北京图森未来科技有限公司
Publication of WO2018227800A1 publication Critical patent/WO2018227800A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present invention relates to the field of computer vision, and in particular to a neural network training method and apparatus.
  • model of deep neural network often contains a large number of model parameters, which is computationally intensive and slow in processing, and cannot be calculated in real time on some low-power, low-computing devices (such as embedded devices, integrated devices, etc.).
  • the knowledge of the teacher network ie, the teacher network, the teacher network generally has a complex network structure, high accuracy, and slow calculation speed
  • the student network is transferred to the student network through knowledge migration.
  • student network, student network network structure is relatively simple, low accuracy, fast
  • the student network at this time can be applied to devices with low power consumption and local computing power.
  • Knowledge migration is a general technique for compressing and accelerating deep neural network models.
  • KD Knowledge Distill
  • AT Attention Transfer
  • the existing knowledge migration method uses the information of the single data in the output data of the teacher network to train the student network. Although the trained student network has certain improvement in performance, there is still much room for improvement.
  • knowledge transfer In deep neural networks, knowledge transfer refers to the use of training sample data in the intermediate network layer of the teacher network or the final network layer to help the student network with faster training but poor performance. Migrate a high-performing teacher network to the student network.
  • Knowledge Distill In deep neural networks, knowledge extraction refers to the technique of training student networks by using the smooth category posterior probability output from the teacher network in the classification problem.
  • Teacher Network A high-performance neural network used to provide more accurate monitoring information for student networks during the knowledge migration process.
  • Student Network A fast calculation but poor performance suitable for deployment to a single neural network in a real-time scenario with high real-time requirements.
  • the student network has greater computational throughput than the teacher network. Fewer model parameters.
  • the present invention provides a neural network training method and apparatus to further improve the performance and accuracy of a student network.
  • an aspect provides a neural network training method, where the method includes:
  • the first output data is data output from the first specific network layer of the teacher network after the training sample data is input into the teacher network
  • the second output data is input from the student network after the training sample data is input into the student network.
  • the second specific network layer outputs the data.
  • a neural network training device comprising:
  • the selection unit is used to select a teacher network that implements the same function as the student network;
  • a training unit configured to iteratively train the student network to obtain a target network based on matching the similarity between the data of the first output data corresponding to the same training sample data and the data of the second output data, so as to implement the teacher network
  • the similarity between the output data is migrated to the student network
  • the first output data is data output from the first specific network layer of the teacher network after the training sample data is input into the teacher network
  • the second output data is input from the student network after the training sample data is input into the student network.
  • the second specific network layer outputs the data.
  • a neural network training apparatus comprising: a processor and at least one memory, the at least one memory for storing at least one machine executable instruction, the processor performing the at least An instruction to achieve:
  • the first output data is data output from the first specific network layer of the teacher network after the training sample data is input into the teacher network
  • the second output data is input from the student network after the training sample data is input into the student network.
  • the second specific network layer outputs the data.
  • the sample training data can be completely migrated to the student network in the data of the output data output by the teacher network, thereby realizing the result of the training sample data output through the teacher network and the result output through the target network. Consistent. According to the good generalization performance of the neural network, the output of the target network and the output of the teacher network are basically the same in the test set, thereby improving the accuracy of the student network.
  • FIG. 1 is a flowchart of a neural network training method according to an embodiment of the present invention.
  • FIG. 2 is a flow chart of training a student network in an embodiment of the present invention
  • FIG. 3 is a schematic structural diagram of a neural network training apparatus according to an embodiment of the present invention.
  • FIG. 4 is a schematic structural diagram of a training unit according to an embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of a neural network training apparatus according to an embodiment of the present invention.
  • FIG. 1 is a flowchart of a neural network training method according to an embodiment of the present invention, where the method includes:
  • Step 101 Select a teacher network that implements the same function as the student network.
  • the teacher network has excellent performance and high accuracy, but its structure is complex relative to the student network, the parameter weight is more, and the calculation speed is slower.
  • the student network is fast, the performance is generally poor or the network structure is simple.
  • a network with the same functions and excellent performance as that implemented by the student network can be selected as a teacher network in a set of pre-set neural network models.
  • Step 102 Iteratively train the student network to obtain a target network based on matching the similarity between the data of the first output data corresponding to the same training sample data and the data of the second output data to obtain the output of the teacher network. The similarity between data is migrated to the student network.
  • the first output data is data output from the first specific network layer of the teacher network after the training sample data is input into the teacher network
  • the second output data is input from the student network after the training sample data is input into the student network.
  • the second specific network layer outputs the data.
  • the data outputted from the first specific network layer of the teacher network is collectively referred to as the first output data; after the training sample data is input into the student network, the second specificity from the student network is obtained.
  • the data output by the network layer is collectively referred to as second output data.
  • the first specific network layer is an intermediate network layer or a last layer network layer in the teacher network.
  • the second specific network layer is an intermediate network layer or a last layer network layer of the student network.
  • step 102 specifically implements the method flow that can be as shown in FIG. 2, and specifically includes:
  • Step 102A Construct an objective function of the student network, where the objective function includes a matching function of the similarity between the data of the first output data corresponding to the training sample data and the data of the second output data.
  • Step 102B Perform iterative training on the student network by using the training sample data.
  • Step 102C When the number of iterations training reaches a threshold or the target function satisfies a preset convergence condition, the target network is obtained.
  • the specific implementation may be as follows:
  • the training sample data used for the iterative training is referred to as current training sample data, and the iterative training includes the following steps A and B, Step C, Step D, Step E and Step F):
  • Step A input current training sample data for the current iteration training into the teacher network and the student network, respectively, to obtain corresponding first output data and second output data;
  • Step B calculating a similarity between each data in the first output data and calculating a phase between each data in the second output data Similarity
  • Step C Calculate a probability of all the order of the data in the first output data according to the similarity between the data in the first output data, and select a target order from all the order of the data in the first output data.
  • Step D calculating, according to the similarity between each data in the second output data, a probability of a target arrangement order of each data in the second output data;
  • Step E calculating a value of the target function according to a probability of a target arrangement order of each data in the first output data and a probability of a target arrangement order of each data in the second output data, and adjusting according to the value of the target function The weight of the student network;
  • Step F Perform the next iterative training based on the student network after adjusting the weight.
  • the target arrangement order is selected from all the arrangement orders of the data in the first output data, and the implementation manner includes but is not limited to the following two types:
  • the order in which the probability values are greater than the preset threshold is selected from the all sorting orders of the data in the first output data as the target sorting order.
  • the order of the preset number of the probabilities taken in the first output data is selected as the target arrangement order.
  • the order of the selected objects may be one or more, which is not strictly limited in this application.
  • step B calculating the similarity between the data in the first output data (the second output data), specifically: calculating a spatial distance between the two data in the first output data (the second output data), A similarity between the two pairs of data is obtained according to the spatial distance.
  • the spatial distance may be a European distance, a cosine distance, a block distance, or a Mahalanobis distance, and the present application is not strictly limited. Take the example of calculating the Euclidean distance and cosine distance between two pairs of data.
  • is a preset scale transformation factor
  • is a preset contrast expansion factor
  • is an offset
  • 2 represents a l 2 norm of the vector.
  • is a preset scale transformation factor
  • is a preset contrast expansion factor
  • is an offset
  • represents a point multiplication operation between vectors.
  • each data in the first output data is calculated according to the similarity between the data in the first output data.
  • the probability of all the sorting orders in the specific implementation: for each sorting order, the order information of the sorting order and the similarity between all adjacent two data in the sorting order of the first output data are input into a preset probability In the calculation model, the probability of the arrangement order is obtained.
  • the probability of the target arrangement order of each data in the second output data is calculated according to the similarity between the data in the second output data, and the specific implementation is as follows: for each target arrangement order, the target is The order information of the arrangement order and the similarity between all adjacent two data in the target arrangement order of the second output data are input into the probability calculation model, and the probability of the target arrangement order is obtained.
  • the probability calculation model may be a first-order Plackett probability model, a high-order Plackett probability model, or other models capable of calculating a probability, which is not strictly limited.
  • the following is an example of the probability of using the first-order Plackett probability model to calculate the order of arrangement.
  • f( ⁇ ) is any linear or non-linear mapping function, and the sum of the probabilities of all the ordering is 1
  • the target arrangement order may be one or multiple.
  • the objective function of the student network may only include a matching function, and the objective function may also be a sum of a matching function and a task loss function, and the expression of the task loss function is related to a task to be implemented by the student network.
  • the task loss function can be the same as the objective function of the teacher network.
  • the expression of the matching function can be, but is not limited to, the following formula (3) and formula (4).
  • Example 1 When the target order is one, the objective function of the student network can be set as shown in the following formula (3):
  • ⁇ t is the target arrangement order of each data in the first output data corresponding to the current training sample data
  • X s is the second output data corresponding to the current training sample data
  • X s ) is The probability of the target arrangement order of each data in the second output data.
  • the foregoing target arrangement order ⁇ t is an arrangement order in which the probability values are the largest among all the arrangement orders of the data in the first output data of the current training sample data.
  • the embodiment of the present invention may train the student network based on a manner of matching a probability distribution of a plurality of target arrangement orders.
  • there are various methods for matching the probability distribution of a plurality of target arrangement orders such as a total variation distance based on a probability distribution, a Wesserstein distance, a Jensen-Shannon divergence, or a Kullback-Leibler divergence.
  • the objective function expression of the student network may be as follows (4):
  • is a target arrangement order
  • X s is the second output data corresponding to the current training sample data
  • X t is the first output data corresponding to the current training sample data
  • X s ) is the current The probability of ⁇ of each data in the second transmission data of the training sample data
  • X t ) is the probability of ⁇ of each data in the first transmission data of the current training sample data
  • Q is a set of the target arrangement order.
  • adjusting the weight of the student network according to the value of the objective function in the foregoing step E includes: adopting a preset gradient descent optimization algorithm, and adjusting the weight of the student network according to the value of the target function.
  • the foregoing steps A and B further include the following steps: processing the first output data and the second output data by using a downsampling algorithm and an interpolation algorithm, so that a spatial dimension of the first output data and a The spatial dimensions of the two output data are consistent, and the number of the first output data and the number of the second output data are both consistent with the number of the current training sample data.
  • the step A is not needed. This step is added to step B, that is, step B is directly executed after step A.
  • the aforementioned spatial dimension generally refers to the number of inputs The number of channels, the number of channels, the height and width of the feature map.
  • steps A to F have no strict sequence, and the following steps A' to B' may be used instead of the above steps A to B.
  • Step A' inputting current training sample data for the iterative training into the teacher network, obtaining corresponding first output data, and calculating a similarity between the data in the first output data;
  • Step B' inputting the current training sample data into the student network, obtaining corresponding second output data, and calculating a similarity between the data in the second output data.
  • the first output data of the three training sample data input to the teacher network is The second output data of the three training sample data input to the student network output is
  • all the order of the data in the first output data is used as the target arrangement order.
  • a set of target arrangement orders of the first output data corresponding to the i-th training sample data among them The probability of calculating the target arrangement order of the first output data corresponding to the i-th training data is
  • the order in which the data in the first output data and the second output data are arranged in the same order is used as the same target arrangement order.
  • the second output data of the i-th training sample data With its first output data
  • the first iteration training input y 1 into the teacher network and the student network, and obtain the corresponding first output data as And the second output data is Calculation Similarity and calculation between data in Similarity between the data in the data; Similarity calculation between data in The probability of all the order of the data in each order, the order of all the sorting is used as the target sorting order; The similarity between the data is calculated The probability of the target arrangement order of each of the data; the probability of the target arrangement order of each data in the first output data corresponding to y 1 and the probability of the target arrangement order of each data in the second output data are input to the objective function, and the calculation is performed The value of the objective function is L 1 , and the current weight W 0 of the student network is adjusted according to the L 1 , and the adjusted weight W 1 is obtained ;
  • the second iteration training input y 2 into the teacher network and the student network, and obtain the corresponding first output data as And the second output data is Calculation Similarity and calculation between data in Similarity between the data in the data; Similarity calculation between data in The probability of all the order of the data in each order, the order of all the sorting is used as the target sorting order; The similarity between the data is calculated The probability of the target arrangement order of each data in the data; the probability of the target arrangement order of each data in the first output data corresponding to y 2 and the probability of the target arrangement order of each data in the second output data are input into the objective function, and the calculation is performed The value of the objective function is L 2 , and the current weight W 1 of the student network is adjusted according to the L 2 , and the adjusted weight is W 2 ;
  • the third iteration training input y 3 into the teacher network and the student network, and obtain the corresponding first output data as And the second output data is Calculation Similarity and calculation between data in Similarity between the data in the data; Similarity calculation between data in The probability of all the order of the data in each order, the order of all the sorting is used as the target sorting order; The similarity between the data is calculated The probability of the target arrangement order of each data in the data; the probability of the target arrangement order of each data in the first output data corresponding to y 3 and the probability of the target arrangement order of each data in the second output data are input into the objective function, and the calculation is performed The value of the objective function is L 3 , and the current weight W 2 of the student network is adjusted according to the L 3 , and the adjusted weight is W 3 .
  • the second embodiment of the present invention provides a neural network training device.
  • the structure of the device is as shown in FIG. 3, and includes:
  • the selecting unit 31 is configured to select a teacher network that implements the same function as the student network;
  • the training unit 32 is configured to iteratively train the student network to obtain a target network based on matching the similarity between the data of the first output data corresponding to the same training sample data and the data of the second output data to obtain the target network.
  • the similarity between the output data of the network is migrated to the student network;
  • the first output data is data output from the first specific network layer of the teacher network after the training sample data is input into the teacher network
  • the second output data is input from the student network after the training sample data is input into the student network.
  • the second specific network layer outputs the data.
  • functions implemented by the teacher network and the student network are image classification, target detection, image segmentation, and the like.
  • the teacher network has excellent performance and high accuracy, but its structure is complex relative to the student network, the parameter weight is more, and the calculation speed is slower.
  • the student network is fast, the performance is generally poor or the network structure is simple.
  • the selecting unit 31 may select a network having the same function and excellent performance as that implemented by the student network in the set of the preset neural network models as the teacher network.
  • the first specific network layer is an intermediate network layer or a last layer network layer in the teacher network; and/or the second specific network layer is an intermediate network layer or last of the student network.
  • a layer of network is an intermediate network layer or a last layer network layer in the teacher network; and/or the second specific network layer is an intermediate network layer or last of the student network.
  • the structure of the training unit 32 is as shown in FIG. 4, and specifically includes a construction module 321, a training module 322, and a determination module 323, where:
  • a building module 321 is configured to construct an objective function of the student network, where the objective function includes a matching function of the similarity between the data of the first output data corresponding to the training sample data and the data of the second output data;
  • the training module 322 is configured to perform iterative training on the student network by using the training sample data
  • the determining module 323 is configured to obtain the target network when the training module 322 iteratively trains the number of times to reach a threshold or the target function satisfies a preset convergence condition.
  • the training module 322 is specifically configured to:
  • the training sample data used for the iterative training is referred to as current training sample data, and the iterative training includes the following steps A and B, Step C, Step D, Step E and Step F):
  • Step A input current training sample data for the current iteration training into the teacher network and the student network, respectively, to obtain corresponding first output data and second output data;
  • Step B calculating a similarity between each data in the first output data and calculating a similarity between the data in the second output data;
  • Step C Calculating all the permutations of the data in the first output data according to the similarity between the data in the first output data a probability of order, and selecting a target arrangement order from all the arrangement orders of each data in the first output data;
  • Step D calculating, according to the similarity between each data in the second output data, a probability of a target arrangement order of each data in the second output data;
  • Step E calculating a value of the target function according to a probability of a target arrangement order of each data in the first output data and a probability of a target arrangement order of each data in the second output data, and adjusting according to the value of the target function The weight of the student network;
  • Step F Perform the next iterative training based on the student network after adjusting the weight.
  • the training module 322 selects a target arrangement order from all the arrangement orders of the data in the first output data, and specifically includes: selecting, from all the arrangement orders of the data in the first output data, an arrangement whose probability value is greater than a preset threshold.
  • the order is arranged as a target; or, from all the order of the data in the first output data, the order in which the probability values are arranged in the previous preset number is selected as the target sorting order.
  • the training module 322 calculates the similarity between the data in the first output data, and specifically includes: calculating a spatial distance between the two data in the first output data, and obtaining the two-two data according to the spatial distance. Similarity between
  • the training module 322 calculates the similarity between the data in the second output data, and specifically includes: calculating a spatial distance between the two data in the second output data, and obtaining a similarity between the two data according to the spatial distance degree.
  • the spatial distance may be a European distance, a cosine distance, a block distance, or a Mahalanobis distance, and the present application is not strictly limited. Take the example of calculating the Euclidean distance and cosine distance between two pairs of data.
  • the training module 322 calculates a probability of all the order of the data in the first output data according to the similarity between the data in the first output data, and specifically includes: for each sorting order, the order of the sorting The sequence information and the similarity between all adjacent two data in the arrangement order of the first output data are input into a preset probability calculation model, and the probability of the arrangement order is obtained;
  • the training module 322 calculates a target arrangement order probability of each data in the second output data according to the similarity between the data in the second output data, and specifically includes: sequence information of the target arrangement order for each target arrangement order And the similarity between all adjacent two data in the target arrangement order of the second output data is input into the probability calculation model, and the probability of the target arrangement order is obtained.
  • the probability calculation model may be a first-order Plackett probability model, a high-order Plackett probability model, or other models capable of calculating a probability, which is not strictly limited.
  • the target arrangement order may be one or multiple.
  • the embodiment of the present invention may train the student network based on a manner of matching a probability distribution of a plurality of target arrangement orders.
  • there are various methods for matching the probability distribution of a plurality of target arrangement orders such as a total variation distance based on a probability distribution, a Wesserstein distance, a Jensen-Shannon divergence, or a Kullback-Leibler divergence.
  • the objective function of the student network may only include one matching function, and the objective function may also be The sum of a matching function and a task loss function.
  • the expression of the task loss function is related to the task to be implemented by the student network.
  • the task loss function can be the same as the objective function of the teacher network.
  • the training module 322 adjusts the weight of the student network according to the value of the objective function, and specifically includes: adopting a preset gradient descent optimization algorithm, and adjusting the student network according to the value of the target function. Weights.
  • the training module 322 is further configured to: before calculating the similarity between the data in the first output data and calculating the similarity between the data in the second output data, by using a downsampling algorithm and an interpolation algorithm Processing the first output data and the second output data such that a spatial dimension of the first output data coincides with a spatial dimension of the second output data, and the number of the first output data and the quantity of the second output data are both The number of current training sample data is consistent.
  • steps A to F have no strict sequence, and the following steps A' to B' may be used instead of the above steps A to B.
  • Step A' inputting current training sample data for the iterative training into the teacher network, obtaining corresponding first output data, and calculating a similarity between the data in the first output data;
  • Step B' inputting the current training sample data into the student network, obtaining corresponding second output data, and calculating a similarity between the data in the second output data.
  • the third embodiment of the present invention provides a neural network training device.
  • the structure of the device is as shown in FIG. 5, including: a processor 501 and at least one memory 502.
  • the at least one memory 502 is configured to store at least one machine executable instruction, and the processor 501 executes the at least one instruction to: select a teacher network that implements the same function as the student network; and corresponding to matching the same training sample data Iteratively training the student network to obtain a target network by the similarity between the data of the first output data and the data of the second output data, so as to implement the migration of the similarity between the output data of the teacher network to the student network
  • the first output data is data output from the first specific network layer of the teacher network after the training sample data is input into the teacher network
  • the second output data is input from the student after the training sample data is input into the student network The data output by the second specific network layer of the network.
  • the processor 501 executes the at least one instruction to iteratively train the student network to obtain a target based on matching the inter-sample similarity of the first output data corresponding to the same training sample data with the sample-to-sample similarity of the second output data.
  • the network specifically includes: an objective function for constructing the student network, the objective function includes a matching function of the similarity between the data of the first output data corresponding to the training sample data and the data of the second output data;
  • the training sample data performs iterative training on the student network; when the iterative training number reaches a threshold or the objective function satisfies a preset convergence condition, the target network is obtained.
  • the processor 501 executes the at least one instruction to implement the training sample data on the student network.
  • the iterative training of the network specifically includes: performing the following iterative training on the student network: inputting the current training sample data used for the iterative training into the teacher network and the student network, respectively, to obtain corresponding first output data and a second output data; calculating a similarity between the data in the first output data and calculating a similarity between the data in the second output data; calculating each of the first output data according to the similarity between the data in the first output data a probability of all the order of the data, and selecting a target arrangement order from all the arrangement orders of the data in the first output data; calculating each data in the second output data according to the similarity between the data in the second output data The probability of the target arrangement order; calculating the value of the objective function according to the probability of the target arrangement order of each data in the first output data and the probability of the target arrangement order of each data in the second output data, and according to the objective function The value adjusts the weight of
  • the processor 501 executes the at least one instruction to select a target arrangement order from all the arrangement orders of the data in the first output data, and specifically includes: selecting a probability from all the arrangement orders of the data in the first output data.
  • the arrangement order of the values greater than the preset threshold is used as the target arrangement order; or, the order of the preset number of the probabilities taken in the first output data is selected as the target arrangement order.
  • the processor 501 executes the at least one instruction to implement the calculation of the similarity between the data in the first output data, and specifically includes: calculating a spatial distance between the two data in the first output data, according to the spatial distance Obtaining a similarity between the two data sets; calculating a similarity between the data in the second output data, specifically: calculating a spatial distance between the two data in the second output data, according to the spatial distance, obtaining the The similarity between two data sets.
  • the processor 501 executes the at least one instruction to calculate a probability of calculating all the order of the data in the first output data according to the similarity between the data in the first output data, and specifically includes: for each sorting order, Inputting the order information of the arrangement order and the similarity between all adjacent two data in the arrangement order of the first output data into a preset probability calculation model to obtain a probability of the arrangement order; according to the second output data Calculating the probability of the target arrangement order of each data in the second output data, the specificity includes: ordering the order of the target arrangement order and the target arrangement of the second output data for each target arrangement order The similarity between all adjacent two data in the sequence is input into the probability calculation model, and the probability of the target arrangement order is obtained.
  • the objective function of the student network is as follows:
  • ⁇ t is the target arrangement order of each data in the first output data corresponding to the current training sample data
  • X s is the second output data corresponding to the current training sample data
  • X s ) is the second output The probability of the target order of each data in the data.
  • the objective function of the student network is as follows:
  • is a target arrangement order
  • X s is the second output data corresponding to the current training sample data
  • X t is the first output data corresponding to the current training sample data
  • X s ) is the current training sample data.
  • the order of each data in the second transmission data is a probability of ⁇
  • X t ) is a probability that the order of each data in the first transmission data of the current training sample data is ⁇
  • Q is the target arrangement order. set.
  • the processor 501 performs the at least one instruction implementation, and adjusts the weight of the student network according to the value of the target function, which includes: adopting a preset gradient descent optimization algorithm, according to the target function The value adjusts the weight of the student network.
  • the processor further performs the performing, after the processor 501 executes the at least one instruction to calculate a similarity between data in the first output data and calculates a similarity between data in the second output data.
  • At least one instruction is implemented to: process the first output data and the second output data by a downsampling algorithm and an interpolation algorithm, such that a spatial dimension of the first output data is consistent with a spatial dimension of the second output data, and The number of one output data and the number of second output data are both consistent with the number of current training sample data.
  • the first specific network layer is an intermediate network layer or a last layer network layer in the teacher network; and the second specific network layer is an intermediate network layer or a last layer network layer of the student network.
  • the embodiment of the present invention further provides a storage medium (which may be a non-volatile machine readable storage medium), where the computer program for storing neural network training is stored.
  • the program has a code segment configured to perform the following steps: selecting a teacher network that implements the same function as the student network; based on matching the data between the first output data corresponding to the same training sample data and the data of the second output data Similarity to iteratively train the student network to obtain a target network to implement migration of similarity between output data of the teacher network to the student network; wherein: the first output data is used to input the training sample data into a teacher network The data output from the first specific network layer of the teacher network, the second output data is data output from the second specific network layer of the student network after the training sample data is input into the student network.
  • an embodiment of the present invention further provides a computer program having a code segment configured to perform the following neural network training: selecting a teacher network that implements the same function as the student network; Iteratively training the student network to obtain a target network by using the similarity between the data of the first output data corresponding to the same training sample data and the data of the second output data to achieve similarity migration between the output data of the teacher network Go to the student network; wherein: the first output data is data output from a first specific network layer of a teacher network after the training sample data is input into a teacher network, and the second output data is input to the training sample data The student network then outputs data from the second specific network layer of the student network.
  • the sample training data can be completely migrated to the student network in the data of the output data output by the teacher network, thereby realizing the result and passing of the training sample data output through the teacher network.
  • the results of the target network output are basically the same.
  • the output of the target network and the output of the teacher network are basically the same in the test set, thereby improving the accuracy of the student network.
  • each functional unit in each embodiment of the present invention may be integrated into one processing module, or each unit may exist physically separately, or two or more units may be integrated into one module.
  • the above integrated modules can be implemented in the form of hardware or in the form of software functional modules.
  • the integrated modules, if implemented in the form of software functional modules and sold or used as stand-alone products, may also be stored in a computer readable storage medium.
  • embodiments of the present invention can be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or a combination of software and hardware. Moreover, the invention can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage and optical storage, etc.) including computer usable program code.
  • the computer program instructions can also be stored in a computer readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer readable memory produce an article of manufacture comprising the instruction device.
  • the apparatus implements the functions specified in one or more blocks of a flow or a flow and/or block diagram of the flowchart.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing device such that a series of operational steps are performed on a computer or other programmable device to produce computer-implemented processing for execution on a computer or other programmable device.
  • the instructions provide steps for implementing the functions specified in one or more of the flow or in a block or blocks of a flow diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A neural network training method and device, the method comprising: selecting a teacher network achieving the same functions as a student network (101); and iteratively training the student network on the basis of matching data similarity of first output data and data similarity of second output data corresponding to the same training sample data to obtain a target network, so as to implement migration of output data similarity of the teacher network to the student network (102), wherein the first output data is data output from a first specific network layer of the teacher network after the training sample data is input to the teacher network, and the second output data is data output from a second specific network layer of the student network after the training sample data is input to the student network. The student network trained by the method according to the output data similarity of the teacher network has a better performance.

Description

一种神经网络训练方法及装置Neural network training method and device
本申请要求在2017年6月15日提交中国专利局、申请号为201710450211.9、发明名称为“一种神经网络训练方法及装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。The present application claims the priority of the Chinese Patent Application, which is filed on June 15, 2017, to the Chinese Patent Office, the number of which is incorporated herein by reference. in.
技术领域Technical field
本发明涉及计算机视觉领域,特别涉及一种神经网络训练方法及装置。The present invention relates to the field of computer vision, and in particular to a neural network training method and apparatus.
背景技术Background technique
近几年来,深度神经网络在计算机视觉领域的各类应用中取得了巨大的成功,如图像分类、目标检测、图像分割等。但深度神经网络的模型往往包含大量的模型参数,计算量大、处理速度慢,无法在一些低功耗、低计算能力的设备(如嵌入式设备、集成设备等)上进行实时计算。In recent years, deep neural networks have achieved great success in various applications in the field of computer vision, such as image classification, target detection, and image segmentation. However, the model of deep neural network often contains a large number of model parameters, which is computationally intensive and slow in processing, and cannot be calculated in real time on some low-power, low-computing devices (such as embedded devices, integrated devices, etc.).
目前,为解决该问题,提出一些解决方案,例如,通过知识迁移方式将教师网络的知识(即教师网络,教师网络一般具有复杂的网络结构、准确性高、计算速度慢)迁移到学生网络中(即学生网络,学生网络的网络结构相对简单、准确性低、速度快),以提高学生网络性能。此时的学生网络可应用到低功耗、地计算能力的设备中。At present, in order to solve this problem, some solutions are proposed. For example, the knowledge of the teacher network (ie, the teacher network, the teacher network generally has a complex network structure, high accuracy, and slow calculation speed) is transferred to the student network through knowledge migration. (ie student network, student network network structure is relatively simple, low accuracy, fast) to improve student network performance. The student network at this time can be applied to devices with low power consumption and local computing power.
知识迁移是一种通用的对深度神经网络模型进行压缩以及加速的技术。目前知识迁移的方法主要包括三种,分别是2014年Hinton等人发表的论文“Distilling the knowledge in a neural network”中提出的Knowledge Distill(简称KD)方法,2015年Romero等人发表的论文“Fitnets:Hints for thin deep nets”提出的FitNets,以及2016年Sergey发表的论文“Paying more attention to attention:Improving the performance of convolutional neural networks via attention transfer”提出的Attention Transfer(简称AT)方法。Knowledge migration is a general technique for compressing and accelerating deep neural network models. At present, there are three main methods of knowledge transfer, namely the Knowledge Distill (KD) method proposed in the paper “Distilling the knowledge in a neural network” published by Hinton et al. in 2014, and the paper published by Romero et al. in 2015 “Fitnets”. :FitsNets proposed by Hints for thin deep nets, and the Attention Transfer (AT) method proposed by the paper "Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer" published by Sergey in 2016.
现有的知识迁移方式,利用教师网络中输出数据中的单个数据的信息来训练学生网络,训练得到的学生网络虽然在性能上有一定的提高,但仍然还有很大的提升空间。The existing knowledge migration method uses the information of the single data in the output data of the teacher network to train the student network. Although the trained student network has certain improvement in performance, there is still much room for improvement.
相关术语解释:Interpretation of related terms:
知识迁移(Knowledge Transfer):在深度神经网络中,知识迁移是指利用训练样本数据在教师网络的中间网络层或最终网络层的输出数据,辅助训练速度较快但性能较差的学生网络,从而将性能优良的教师网络迁移到学生网络上。Knowledge Transfer: In deep neural networks, knowledge transfer refers to the use of training sample data in the intermediate network layer of the teacher network or the final network layer to help the student network with faster training but poor performance. Migrate a high-performing teacher network to the student network.
知识提取(Knowledge Distill):在深度神经网络中,知识提取是指在分类问题中利用教师网络输出的平滑类别后验概率训练学生网络的技术。 Knowledge Distill: In deep neural networks, knowledge extraction refers to the technique of training student networks by using the smooth category posterior probability output from the teacher network in the classification problem.
教师网络(Teacher Network):知识迁移过程中用以为学生网络提供更加准确的监督信息的高性能神经网络。Teacher Network: A high-performance neural network used to provide more accurate monitoring information for student networks during the knowledge migration process.
学生网络(Student Network):计算速度快但性能较差的适合部署到对实时性要求较高的实际应用场景中的单个神经网络,学生网络相比于教师网络,具有更大的运算吞吐量和更少的模型参数。Student Network: A fast calculation but poor performance suitable for deployment to a single neural network in a real-time scenario with high real-time requirements. The student network has greater computational throughput than the teacher network. Fewer model parameters.
发明内容Summary of the invention
鉴于上述问题,本发明提供一种神经网络训练方法及装置,以更进一步提升学生网络的性能和准确性。In view of the above problems, the present invention provides a neural network training method and apparatus to further improve the performance and accuracy of a student network.
本发明实施例,一方面提供一种神经网络训练方法,该方法包括:In an embodiment of the present invention, an aspect provides a neural network training method, where the method includes:
选取一个与学生网络实现相同功能的教师网络;Select a network of teachers that achieve the same functionality as the student network;
基于匹配同一训练样本数据对应的第一输出数据的数据间相似性与第二输出数据的数据间相似性来迭代训练所述学生网络得到目标网络,以实现将所述教师网络的输出数据间相似性迁移到所述学生网络;And iteratively training the student network to obtain a target network based on matching the similarity between the data of the first output data corresponding to the same training sample data and the data of the second output data, so as to achieve similarity between the output data of the teacher network. Sexual migration to the student network;
其中:所述第一输出数据为所述训练样本数据输入教师网络后从教师网络的第一特定网络层输出的数据,所述第二输出数据为所述训练样本数据输入学生网络后从学生网络的第二特定网络层输出的数据。The first output data is data output from the first specific network layer of the teacher network after the training sample data is input into the teacher network, and the second output data is input from the student network after the training sample data is input into the student network. The second specific network layer outputs the data.
本发明实施例另一方面提供一种神经网络训练装置,该装置包括:Another aspect of the embodiments of the present invention provides a neural network training device, the device comprising:
选取单元,用于选取一个与学生网络实现相同功能的教师网络;The selection unit is used to select a teacher network that implements the same function as the student network;
训练单元,用于基于匹配同一训练样本数据对应的第一输出数据的数据间相似性与第二输出数据的数据间相似性来迭代训练所述学生网络得到目标网络,以实现将所述教师网络的输出数据间相似性迁移到所述学生网络;a training unit, configured to iteratively train the student network to obtain a target network based on matching the similarity between the data of the first output data corresponding to the same training sample data and the data of the second output data, so as to implement the teacher network The similarity between the output data is migrated to the student network;
其中:所述第一输出数据为所述训练样本数据输入教师网络后从教师网络的第一特定网络层输出的数据,所述第二输出数据为所述训练样本数据输入学生网络后从学生网络的第二特定网络层输出的数据。The first output data is data output from the first specific network layer of the teacher network after the training sample data is input into the teacher network, and the second output data is input from the student network after the training sample data is input into the student network. The second specific network layer outputs the data.
本发明实施例另一方面提供一种神经网络训练装置,该装置包括:一个处理器和至少一个存储器,所述至少一个存储器用于存储至少一条机器可执行指令,所述处理器执行所述至少一条指令以实现:Another aspect of an embodiment of the present invention provides a neural network training apparatus, the apparatus comprising: a processor and at least one memory, the at least one memory for storing at least one machine executable instruction, the processor performing the at least An instruction to achieve:
选取一个与学生网络实现相同功能的教师网络;Select a network of teachers that achieve the same functionality as the student network;
基于匹配同一训练样本数据对应的第一输出数据的数据间相似性与第二输出数据的数据间相似性来迭代训练所述学生网络得到目标网络,以实现将所述教师网络的输出数据间相似性迁移到所述学生网络; And iteratively training the student network to obtain a target network based on matching the similarity between the data of the first output data corresponding to the same training sample data and the data of the second output data, so as to achieve similarity between the output data of the teacher network. Sexual migration to the student network;
其中:所述第一输出数据为所述训练样本数据输入教师网络后从教师网络的第一特定网络层输出的数据,所述第二输出数据为所述训练样本数据输入学生网络后从学生网络的第二特定网络层输出的数据。The first output data is data output from the first specific network layer of the teacher network after the training sample data is input into the teacher network, and the second output data is input from the student network after the training sample data is input into the student network. The second specific network layer outputs the data.
本发明实施例中,能够将样本训练数据在教师网络输出的输出数据的各数据间相似信息全面迁移到学生网络中,从而实现训练样本数据通过教师网络输出的结果与通过目标网络输出的结果基本一致。根据神经网络良好的泛化性能,训练得到的目标网络的输出与教师网络的输出在测试集上也基本相同,从而提高了学生网络的准确性。In the embodiment of the present invention, the sample training data can be completely migrated to the student network in the data of the output data output by the teacher network, thereby realizing the result of the training sample data output through the teacher network and the result output through the target network. Consistent. According to the good generalization performance of the neural network, the output of the target network and the output of the teacher network are basically the same in the test set, thereby improving the accuracy of the student network.
本发明的其它特征和优点将在随后的说明书中阐述,并且,部分地从说明书中变得显而易见,或者通过实施本发明而了解。本发明的目的和其他优点可通过在所写的说明书、权利要求书、以及附图中所特别指出的结构来实现和获得。Other features and advantages of the invention will be set forth in the description which follows, The objectives and other advantages of the invention may be realized and obtained by means of the structure particularly pointed in the appended claims.
下面通过附图和实施例,对本发明的技术方案做进一步的详细描述。The technical solution of the present invention will be further described in detail below through the accompanying drawings and embodiments.
附图说明DRAWINGS
附图用来提供对本发明的进一步理解,并且构成说明书的一部分,与本发明的实施例一起用于解释本发明,并不构成对本发明的限制。显而易见地,下面描述中的附图仅仅是本发明一些实施例,对于本领域普通技术人员而言,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。在附图中:The drawings are intended to provide a further understanding of the invention, and are intended to be a Obviously, the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings according to the drawings without any creative work. In the drawing:
图1为本发明实施例中神经网络训练方法的流程图;1 is a flowchart of a neural network training method according to an embodiment of the present invention;
图2为本发明实施例中训练学生网络的流程图;2 is a flow chart of training a student network in an embodiment of the present invention;
图3为本发明实施例中神经网络训练装置的结构示意图;3 is a schematic structural diagram of a neural network training apparatus according to an embodiment of the present invention;
图4为本发明实施例中训练单元的结构示意图;4 is a schematic structural diagram of a training unit according to an embodiment of the present invention;
图5为本发明实施例中神经网络训练装置的结构示意图。FIG. 5 is a schematic structural diagram of a neural network training apparatus according to an embodiment of the present invention.
具体实施方式detailed description
为了使本技术领域的人员更好地理解本发明中的技术方案,下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都应当属于本发明保护的范围。In order to make those skilled in the art better understand the technical solutions in the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described in conjunction with the accompanying drawings in the embodiments of the present invention. The embodiments are only a part of the embodiments of the invention, and not all of the embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative efforts shall fall within the scope of the present invention.
以上是本发明的核心思想,为了使本技术领域的人员更好地理解本发明实施例中的技术方案,并使本发明实施例的上述目的、特征和优点能够更加明显易懂,下面结合附图对本发明实施例中技术方案作进一步详细的说明。The above is the core idea of the present invention, and in order to enable those skilled in the art to better understand the technical solutions in the embodiments of the present invention, the above objects, features and advantages of the embodiments of the present invention can be more clearly understood. The technical solution in the embodiment of the present invention is further described in detail.
实施例一 Embodiment 1
参见图1,为本发明实施例中神经网络训练方法的流程图,该方法包括:1 is a flowchart of a neural network training method according to an embodiment of the present invention, where the method includes:
步骤101、选取一个与学生网络实现相同功能的教师网络。Step 101: Select a teacher network that implements the same function as the student network.
实现的功能如图像分类、目标检测、图像分割等。教师网络性能优良、准确率高,但是相对学生网络其结构复杂、参数权重较多、计算速度较慢。学生网络计算速度快、性能一般或者较差、网络结构简单。可以在预先设置的神经网络模型的集合中选取一个与学生网络实现的功能相同且性能优良的网络作为教师网络。Functions such as image classification, target detection, image segmentation, etc. are implemented. The teacher network has excellent performance and high accuracy, but its structure is complex relative to the student network, the parameter weight is more, and the calculation speed is slower. The student network is fast, the performance is generally poor or the network structure is simple. A network with the same functions and excellent performance as that implemented by the student network can be selected as a teacher network in a set of pre-set neural network models.
步骤102、基于匹配同一训练样本数据对应的第一输出数据的数据间相似性与第二输出数据的数据间相似性来迭代训练所述学生网络得到目标网络,以实现将所述教师网络的输出数据间相似性迁移到所述学生网络。Step 102: Iteratively train the student network to obtain a target network based on matching the similarity between the data of the first output data corresponding to the same training sample data and the data of the second output data to obtain the output of the teacher network. The similarity between data is migrated to the student network.
其中,所述第一输出数据为所述训练样本数据输入教师网络后从教师网络的第一特定网络层输出的数据,所述第二输出数据为所述训练样本数据输入学生网络后从学生网络的第二特定网络层输出的数据。The first output data is data output from the first specific network layer of the teacher network after the training sample data is input into the teacher network, and the second output data is input from the student network after the training sample data is input into the student network. The second specific network layer outputs the data.
本发明实施例中,将训练样本数据输入教师网络后,从教师网络的第一特定网络层输出的数据统称为第一输出数据;将训练样本数据输入学生网络后,从学生网络的第二特定网络层输出的数据统称为第二输出数据。In the embodiment of the present invention, after the training sample data is input into the teacher network, the data outputted from the first specific network layer of the teacher network is collectively referred to as the first output data; after the training sample data is input into the student network, the second specificity from the student network is obtained. The data output by the network layer is collectively referred to as second output data.
优选地,本发明实施例中,所述第一特定网络层为教师网络中的一个中间网络层或最后一层网络层。Preferably, in the embodiment of the present invention, the first specific network layer is an intermediate network layer or a last layer network layer in the teacher network.
优选地,本发明实施例中,所述第二特定网络层为学生网络的一个中间网络层或最后一层网络层。Preferably, in the embodiment of the present invention, the second specific network layer is an intermediate network layer or a last layer network layer of the student network.
优选地,前述步骤102具体实现可如图2所示的方法流程,具体包括:Preferably, the foregoing step 102 specifically implements the method flow that can be as shown in FIG. 2, and specifically includes:
步骤102A、构建所述学生网络的目标函数,所述目标函数包含训练样本数据对应的第一输出数据的数据间相似性与第二输出数据的数据间相似性的匹配函数。 Step 102A: Construct an objective function of the student network, where the objective function includes a matching function of the similarity between the data of the first output data corresponding to the training sample data and the data of the second output data.
步骤102B、采用所述训练样本数据对所述学生网络进行迭代训练。 Step 102B: Perform iterative training on the student network by using the training sample data.
步骤102C、当迭代训练次数达到阈值或者所述目标函数满足预置的收敛条件时,得到所述目标网络。 Step 102C: When the number of iterations training reaches a threshold or the target function satisfies a preset convergence condition, the target network is obtained.
优选地,前述步骤102B,具体实现可如下:Preferably, in the foregoing step 102B, the specific implementation may be as follows:
对所述学生网络进行多次以下迭代训练(以下称为本次迭代训练,将用于本次迭代训练的训练样本数据称为当前训练样本数据,本次迭代训练包括以下步骤A、步骤B、步骤C、步骤D、步骤E和步骤F):Performing the following iterative training on the student network multiple times (hereinafter referred to as the current iterative training, the training sample data used for the iterative training is referred to as current training sample data, and the iterative training includes the following steps A and B, Step C, Step D, Step E and Step F):
步骤A、将用于本次迭代训练的当前训练样本数据分别输入所述教师网络和学生网络,得到对应的第一输出数据和第二输出数据;Step A: input current training sample data for the current iteration training into the teacher network and the student network, respectively, to obtain corresponding first output data and second output data;
步骤B、计算第一输出数据中各数据间的相似度以及计算第二输出数据中各数据间的相 似度;Step B: calculating a similarity between each data in the first output data and calculating a phase between each data in the second output data Similarity
步骤C、根据第一输出数据中各数据间的相似度计算第一输出数据中各数据的所有排列顺序的概率,并从所述第一输出数据中各数据的所有排列顺序中选取目标排列顺序;Step C: Calculate a probability of all the order of the data in the first output data according to the similarity between the data in the first output data, and select a target order from all the order of the data in the first output data. ;
步骤D、根据第二输出数据中各数据间的相似度计算第二输出数据中各数据的目标排列顺序的概率;Step D: calculating, according to the similarity between each data in the second output data, a probability of a target arrangement order of each data in the second output data;
步骤E、根据第一输出数据中各数据的目标排列顺序的概率和第二输出数据中各数据的目标排列顺序的概率计算所述目标函数的取值,并根据所述目标函数的取值调整所述学生网络的权重;Step E: calculating a value of the target function according to a probability of a target arrangement order of each data in the first output data and a probability of a target arrangement order of each data in the second output data, and adjusting according to the value of the target function The weight of the student network;
步骤F、基于调整权重后的学生网络进行下一次迭代训练。Step F: Perform the next iterative training based on the student network after adjusting the weight.
优选地,本发明实施例中,前述步骤C中从第一输出数据中各数据的所有排列顺序中选取目标排列顺序,实现方式包括但不仅限于以下两种:Preferably, in the embodiment of the present invention, in the foregoing step C, the target arrangement order is selected from all the arrangement orders of the data in the first output data, and the implementation manner includes but is not limited to the following two types:
方式1、从第一输出数据中各数据的所有排列顺序中选取概率取值大于预置阈值的排列顺序作为目标排列顺序。In the first step, the order in which the probability values are greater than the preset threshold is selected from the all sorting orders of the data in the first output data as the target sorting order.
方式2、从第一输出数据中各数据的所有排列顺序中选取概率取值排在前面的预置数量的排列顺序作为目标排列顺序。In the mode 2, the order of the preset number of the probabilities taken in the first output data is selected as the target arrangement order.
本发明实施例中,选取的目标排列顺序可以是一个也可以是多个,本申请不作严格限定。In the embodiment of the present invention, the order of the selected objects may be one or more, which is not strictly limited in this application.
优选地,步骤B中,计算第一输出数据(第二输出数据)中各数据间的相似度,具体包括:计算第一输出数据(第二输出数据)中两两数据之间的空间距离,根据所述空间距离得到所述两两数据间的相似度。Preferably, in step B, calculating the similarity between the data in the first output data (the second output data), specifically: calculating a spatial distance between the two data in the first output data (the second output data), A similarity between the two pairs of data is obtained according to the spatial distance.
本发明实施例中,所述空间距离可以是欧式距离、余弦距离、街区距离或马氏距离等,本申请不做严格限定。以计算两两数据之间的欧氏距离和余弦距离为例。In the embodiment of the present invention, the spatial distance may be a European distance, a cosine distance, a block distance, or a Mahalanobis distance, and the present application is not strictly limited. Take the example of calculating the Euclidean distance and cosine distance between two pairs of data.
通过以下公式(1)计算第任意两个数据xi和xj之间的欧式距离:Calculate the Euclidean distance between any two data x i and x j by the following formula (1):
Figure PCTCN2017102032-appb-000001
Figure PCTCN2017102032-appb-000001
式(1)中,α为预置的尺度变换因子,β为预置的对比伸缩因子,γ为偏移量,|·|2代表向量的l2范数。In equation (1), α is a preset scale transformation factor, β is a preset contrast expansion factor, γ is an offset, and |·| 2 represents a l 2 norm of the vector.
通过以下公式(2)计算任意两个数据xi和xj之间的余弦距离:Calculate the cosine distance between any two data x i and x j by the following formula (2):
Sij=α(xi·xj)β+γ           式(2)S ij =α(x i ·x j ) β +γ (2)
式(2)中,α为预置的尺度变换因子,β为预置的对比伸缩因子,γ为偏移量,·代表向量间的点乘操作。In equation (2), α is a preset scale transformation factor, β is a preset contrast expansion factor, γ is an offset, and · represents a point multiplication operation between vectors.
优选地,步骤C中,根据第一输出数据中各数据间的相似度计算第一输出数据中各数据 的所有排列顺序的概率,具体实现下:针对每个排列顺序,将所述排列顺序的顺序信息以及第一输出数据的该排列顺序中所有相邻两个数据间的相似度输入预置的概率计算模型中,得到所述排列顺序的概率。Preferably, in step C, each data in the first output data is calculated according to the similarity between the data in the first output data. The probability of all the sorting orders, in the specific implementation: for each sorting order, the order information of the sorting order and the similarity between all adjacent two data in the sorting order of the first output data are input into a preset probability In the calculation model, the probability of the arrangement order is obtained.
以一个训练样本数据y={y1,y2,y3}为例进行描述。将y输入教师网络得到对应的第一输出数据x={x1,x2,x3};计算x中两两数据之间的相似度为s12(x1与x2的相似度)、s13(x1与x3的相似度)、s23(x2与x3的相似度)。x1、x2、x3的所有排列顺序的数量为3!=6个,排列顺序分别为π1=x1→x2→x3、π2=x1→x3→x2、π3=x2→x1→x3、π4=x2→x3→x1、π5=x3→x1→x2、π6=x3→x2→x1;根据各数据间的相似度计算得到前述六种排列顺序的概率分别为
Figure PCTCN2017102032-appb-000002
A training sample data y={y 1 , y 2 , y 3 } is taken as an example for description. Enter y into the teacher network to obtain corresponding first output data x={x 1 , x 2 , x 3 }; calculate the similarity between the two data in x as s 12 (the similarity between x 1 and x 2 ), s 13 (similarity between x 1 and x 3 ), s 23 (similarity between x 2 and x 3 ). The number of all sort orders of x 1 , x 2 , and x 3 is 3! =6, the order of arrangement is π 1 =x 1 →x 2 →x 32 =x 1 →x 3 →x 23 =x 2 →x 1 →x 34 =x 2 → x 3 →x 1 , π 5 =x 3 →x 1 →x 2 , π 6 =x 3 →x 2 →x 1 ; The probability of calculating the above six sorting orders based on the similarity between the data is
Figure PCTCN2017102032-appb-000002
各训练样本数据对应的各第一输出数据选取的对应的目标排列顺序可以相同也可以不相同,以前述x为例,假设第一样本训练数据对应的第一输出数据对应的目标排列顺序为π1=x1→x2→x3、π2=x1→x3→x2、π3=x2→x1→x3,第二样本训练数据对应的第一输出数据对应的目标排列顺序为π3=x2→x1→x3、π4=x2→x3→x1、π5=x3→x1→x2The corresponding target arrangement order of each first output data corresponding to each training sample data may be the same or different, and the foregoing x is taken as an example, and the target output order corresponding to the first output data corresponding to the first sample training data is assumed to be π 1 =x 1 →x 2 →x 3 , π 2 =x 1 →x 3 →x 2 , π 3 =x 2 →x 1 →x 3 , the target corresponding to the first output data corresponding to the second sample training data The order of arrangement is π 3 = x 2 → x 1 → x 3 , π 4 = x 2 → x 3 → x 1 , π 5 = x 3 → x 1 → x 2 .
优选地,所述步骤D中根据第二输出数据中各数据间的相似度计算第二输出数据中各数据的目标排列顺序的概率,具体实现如下:针对每一个目标排列顺序,将所述目标排列顺序的顺序信息以及第二输出数据的该目标排列顺序中所有相邻两个数据间的相似度输入所述概率计算模型中,得到所述目标排列顺序的概率。Preferably, in the step D, the probability of the target arrangement order of each data in the second output data is calculated according to the similarity between the data in the second output data, and the specific implementation is as follows: for each target arrangement order, the target is The order information of the arrangement order and the similarity between all adjacent two data in the target arrangement order of the second output data are input into the probability calculation model, and the probability of the target arrangement order is obtained.
本发明实施例中,所述概率计算模型可以为一阶Plackett概率模型,也可以为高阶Plackett概率模型,还可以是其他能够计算概率的模型,本申请不做严格限定。In the embodiment of the present invention, the probability calculation model may be a first-order Plackett probability model, a high-order Plackett probability model, or other models capable of calculating a probability, which is not strictly limited.
下面以采用一阶Plackett概率模型计算排列顺序的概率为例进行描述。The following is an example of the probability of using the first-order Plackett probability model to calculate the order of arrangement.
假设某一训练样本数据对应的第一输出数据为x={x1,x2,x3,x4},以计算排列顺序π1和π2的概率为例,假设π1=x1→x2→x3→x4、π2=x1→x3→x4→x2,通过一阶Plackett概率模型得到以下结果:Assume that the first output data corresponding to a certain training sample data is x={x 1 , x 2 , x 3 , x 4 }, taking the probability of calculating the order of π 1 and π 2 as an example, assuming π 1 = x 1 → x 2 →x 3 →x 4 , π 2 =x 1 →x 3 →x 4 →x 2 , the following results are obtained by the first-order Plackett probability model:
Figure PCTCN2017102032-appb-000003
Figure PCTCN2017102032-appb-000003
Figure PCTCN2017102032-appb-000004
Figure PCTCN2017102032-appb-000004
其中,f(·)为任意一种线性或非线性的映射函数,且所有排列顺序的概率的和值为1Where f(·) is any linear or non-linear mapping function, and the sum of the probabilities of all the ordering is 1
本发明实施例中,所述目标排列顺序可以为一个,也可以为多个。 In the embodiment of the present invention, the target arrangement order may be one or multiple.
本发明实施例中,学生网络的目标函数可以仅包含一个匹配函数,该目标函数还可以是一个匹配函数与任务损失函数的和值,该任务损失函数的表达式与学生网络所要实现的任务相关,例如该任务损失函数可以与教师网络的目标函数相同。匹配函数的表达式可以但不仅限于以下的公式(3)和公式(4)。In the embodiment of the present invention, the objective function of the student network may only include a matching function, and the objective function may also be a sum of a matching function and a task loss function, and the expression of the task loss function is related to a task to be implemented by the student network. For example, the task loss function can be the same as the objective function of the teacher network. The expression of the matching function can be, but is not limited to, the following formula (3) and formula (4).
实例1、当目标顺序为一个时,所述学生网络的目标函数可设置为如以下公式(3)所示:Example 1. When the target order is one, the objective function of the student network can be set as shown in the following formula (3):
L=-log P(πt|Xs)        式(3)L=-log P(π t |X s ) (3)
式(3)中,πt为当前训练样本数据对应的第一输出数据中各数据的目标排列顺序,Xs为当前训练样本数据对应的第二输出数据,P(πt|Xs)为第二输出数据中各数据的目标排列顺序的概率。In the formula (3), π t is the target arrangement order of each data in the first output data corresponding to the current training sample data, and X s is the second output data corresponding to the current training sample data, and P(π t |X s ) is The probability of the target arrangement order of each data in the second output data.
优选地,前述目标排列顺序πt为当前训练样本数据的第一输出数据中各数据所有排列顺序中概率取值最大的排列顺序。Preferably, the foregoing target arrangement order π t is an arrangement order in which the probability values are the largest among all the arrangement orders of the data in the first output data of the current training sample data.
当目标顺序为多个时,本发明实施例可以基于匹配多个目标排列顺序的概率分布的方式训练得到所述学生网络。本发明实施例中匹配多个目标排列顺序的概率分布的方法有多种,例如基于概率分布的全变分距离、Wesserstein距离、Jensen-Shannon散度或Kullback-Leibler散度等。When the target order is multiple, the embodiment of the present invention may train the student network based on a manner of matching a probability distribution of a plurality of target arrangement orders. In the embodiment of the present invention, there are various methods for matching the probability distribution of a plurality of target arrangement orders, such as a total variation distance based on a probability distribution, a Wesserstein distance, a Jensen-Shannon divergence, or a Kullback-Leibler divergence.
以下以基于概率分布的Kullback-Leibler散度为例,所述学生网络的目标函数表达式可如以下如下式(4)所示:Taking the Kullback-Leibler divergence based on the probability distribution as an example, the objective function expression of the student network may be as follows (4):
Figure PCTCN2017102032-appb-000005
Figure PCTCN2017102032-appb-000005
式(4)中,π为一个目标排列顺序,Xs为当前训练样本数据对应的第二输出数据,Xt为当前训练样本数据对应的第一输出数据,P(π|Xs)为当前训练样本数据的第二传输数据中各数据的π的概率,P(π|Xt)为当前训练样本数据的第一传输数据中各数据的π的概率,Q为目标排列顺序的集合。In the formula (4), π is a target arrangement order, X s is the second output data corresponding to the current training sample data, X t is the first output data corresponding to the current training sample data, and P(π|X s ) is the current The probability of π of each data in the second transmission data of the training sample data, P(π|X t ) is the probability of π of each data in the first transmission data of the current training sample data, and Q is a set of the target arrangement order.
优选地,前述步骤E中根据所述目标函数的取值调整学生网络的权重,具体包括:采用预置的梯度下降优化算法,根据所述目标函数的取值调整所述学生网络的权重。Preferably, adjusting the weight of the student network according to the value of the objective function in the foregoing step E includes: adopting a preset gradient descent optimization algorithm, and adjusting the weight of the student network according to the value of the target function.
优选地,前述步骤A与步骤B之间还包括以下步骤:通过下采样算法与插值算法对所述第一输出数据和第二输出数据进行处理,使得所述第一输出数据的空间维度与第二输出数据的空间维度一致,且第一输出数据的数量和第二输出数据的数量均与所述当前训练样本数据的数量一致。当然,如果步骤A得到的第一输出数据与第二输出数据的空间维度相同,且第一输出数据与第二输出数据的数量均与所述当前训练样本数据的数量一致,则无需在步骤A与步骤B之间增加该步骤,即在步骤A之后直接执行步骤B。前述空间维度一般是指输入数 据的数量、频道数、特征图的高度和宽度。Preferably, the foregoing steps A and B further include the following steps: processing the first output data and the second output data by using a downsampling algorithm and an interpolation algorithm, so that a spatial dimension of the first output data and a The spatial dimensions of the two output data are consistent, and the number of the first output data and the number of the second output data are both consistent with the number of the current training sample data. Certainly, if the spatial output of the first output data and the second output data obtained by the step A are the same, and the number of the first output data and the second output data are both consistent with the quantity of the current training sample data, the step A is not needed. This step is added to step B, that is, step B is directly executed after step A. The aforementioned spatial dimension generally refers to the number of inputs The number of channels, the number of channels, the height and width of the feature map.
需要说明的是,前述步骤A~步骤F没有严格的先后顺序,也可以用以下的步骤A’~步骤B’替代前述步骤A~步骤B。It should be noted that the above steps A to F have no strict sequence, and the following steps A' to B' may be used instead of the above steps A to B.
步骤A’、将用于本次迭代训练的当前训练样本数据输入教师网络,得到对应的第一输出数据,并计算第一输出数据中各数据间的相似度;Step A', inputting current training sample data for the iterative training into the teacher network, obtaining corresponding first output data, and calculating a similarity between the data in the first output data;
步骤B’、将所述当前训练样本数据输入学生网络,得到对应的第二输出数据,并计算第二输出数据中各数据间的相似度。Step B', inputting the current training sample data into the student network, obtaining corresponding second output data, and calculating a similarity between the data in the second output data.
假设用于训练学生网络(用S表示)的三个训练样本数据分别为y1={y11,y12,y13},y2={y21,y22,y23},y3={y31,y32,y33};该三个训练样本数据输入到教师网络(用T表示)输出的第一输出数据依次为
Figure PCTCN2017102032-appb-000006
该三个训练样本数据输入到学生网络输出的第二输出数据依次为
Assume that the three training sample data used to train the student network (represented by S) are y 1 = {y 11 , y 12 , y 13 }, y 2 = {y 21 , y 22 , y 23 }, y 3 = {y 31 , y 32 , y 33 }; the first output data of the three training sample data input to the teacher network (represented by T) is
Figure PCTCN2017102032-appb-000006
The second output data of the three training sample data input to the student network output is
Figure PCTCN2017102032-appb-000007
Figure PCTCN2017102032-appb-000007
本发明实施例以第一输出数据中各数据的所有排列顺序作为目标排列顺序。第i个训练样本数据对应的第一输出数据的目标排列顺序的集合
Figure PCTCN2017102032-appb-000008
其中
Figure PCTCN2017102032-appb-000009
Figure PCTCN2017102032-appb-000010
计算得到第i个训练数据对应的第一输出数据的目标排列顺序的概率为
In the embodiment of the present invention, all the order of the data in the first output data is used as the target arrangement order. a set of target arrangement orders of the first output data corresponding to the i-th training sample data
Figure PCTCN2017102032-appb-000008
among them
Figure PCTCN2017102032-appb-000009
Figure PCTCN2017102032-appb-000010
The probability of calculating the target arrangement order of the first output data corresponding to the i-th training data is
Figure PCTCN2017102032-appb-000011
Figure PCTCN2017102032-appb-000011
第i个训练数据对应的第二输出数据的目标排列顺序的集合
Figure PCTCN2017102032-appb-000012
其中
Figure PCTCN2017102032-appb-000013
Figure PCTCN2017102032-appb-000014
Figure PCTCN2017102032-appb-000015
计算得到第i个训练样本数据对应的第二输出数据的目标排列顺序的概率为
a set of target arrangement orders of the second output data corresponding to the i-th training data
Figure PCTCN2017102032-appb-000012
among them
Figure PCTCN2017102032-appb-000013
Figure PCTCN2017102032-appb-000014
Figure PCTCN2017102032-appb-000015
The probability of calculating the target arrangement order of the second output data corresponding to the i-th training sample data is
Figure PCTCN2017102032-appb-000016
Figure PCTCN2017102032-appb-000016
由于同一个训练样本数据对应的第一输出数据和第二输出数据的数量一致,则将第一输出数据与第二输出数据中数据排列顺序相同的排列顺序作为同一个目标排列顺序。例如将第i 个训练样本数据的第二输出数据的
Figure PCTCN2017102032-appb-000017
与其第一输出数据的
Figure PCTCN2017102032-appb-000018
作为同一个目标排列顺序,用πi1表示,则得到第i个训练样本数据的第一输出数据和第二输出数据的目标排列顺序集合Qi表示为Qi={πi1i2i3i4i5i6}
Since the number of the first output data and the second output data corresponding to the same training sample data is the same, the order in which the data in the first output data and the second output data are arranged in the same order is used as the same target arrangement order. For example, the second output data of the i-th training sample data
Figure PCTCN2017102032-appb-000017
With its first output data
Figure PCTCN2017102032-appb-000018
As the same target arrangement order, represented by π i1 , the target output order set Q i of the first output data and the second output data of the i-th training sample data is expressed as Q i ={π i1 , π i2 , π I3 , π i4 , π i5 , π i6 }
执行以下多次迭代训练:Perform the following iterations of training:
第一次迭代训练:将y1输入教师网络和学生网络,得到对应的第一输出数据为
Figure PCTCN2017102032-appb-000019
和第二输出数据为
Figure PCTCN2017102032-appb-000020
计算
Figure PCTCN2017102032-appb-000021
中各数据之间的相似度以及计算
Figure PCTCN2017102032-appb-000022
中各数据之间的相似度;根据
Figure PCTCN2017102032-appb-000023
中各数据间的相似度计算
Figure PCTCN2017102032-appb-000024
中各数据的所有排列顺序的概率,将该所有排列顺序作为目标排列顺序;根据
Figure PCTCN2017102032-appb-000025
中各数据间的相似度计算得到
Figure PCTCN2017102032-appb-000026
中各数据的目标排列顺序的概率;将y1对应的第一输出数据中各数据的目标排列顺序的概率和第二输出数据中各数据的目标排列顺序的概率输入至目标函数中,计算得到目标函数的取值为L1,根据该L1调整学生网络当前权重W0,得到调整后的权重W1
The first iteration training: input y 1 into the teacher network and the student network, and obtain the corresponding first output data as
Figure PCTCN2017102032-appb-000019
And the second output data is
Figure PCTCN2017102032-appb-000020
Calculation
Figure PCTCN2017102032-appb-000021
Similarity and calculation between data in
Figure PCTCN2017102032-appb-000022
Similarity between the data in the data;
Figure PCTCN2017102032-appb-000023
Similarity calculation between data in
Figure PCTCN2017102032-appb-000024
The probability of all the order of the data in each order, the order of all the sorting is used as the target sorting order;
Figure PCTCN2017102032-appb-000025
The similarity between the data is calculated
Figure PCTCN2017102032-appb-000026
The probability of the target arrangement order of each of the data; the probability of the target arrangement order of each data in the first output data corresponding to y 1 and the probability of the target arrangement order of each data in the second output data are input to the objective function, and the calculation is performed The value of the objective function is L 1 , and the current weight W 0 of the student network is adjusted according to the L 1 , and the adjusted weight W 1 is obtained ;
第二次迭代训练:将y2输入教师网络和学生网络,得到对应的第一输出数据为
Figure PCTCN2017102032-appb-000027
和第二输出数据为
Figure PCTCN2017102032-appb-000028
计算
Figure PCTCN2017102032-appb-000029
中各数据之间的相似度以及计算
Figure PCTCN2017102032-appb-000030
中各数据之间的相似度;根据
Figure PCTCN2017102032-appb-000031
中各数据间的相似度计算
Figure PCTCN2017102032-appb-000032
中各数据的所有排列顺序的概率,将该所有排列顺序作为目标排列顺序;根据
Figure PCTCN2017102032-appb-000033
中各数据间的相似度计算得到
Figure PCTCN2017102032-appb-000034
中各数据的目标排列顺序的概率;将y2对应的第一输出数据中各数据的目标排列顺序的概率和第二输出数据中各数据的目标排列顺序的概率输入至目标函数中,计算得到目标函数的取值为L2,根据该L2调整学生网络当前权重W1,得到调整后的权重为W2
The second iteration training: input y 2 into the teacher network and the student network, and obtain the corresponding first output data as
Figure PCTCN2017102032-appb-000027
And the second output data is
Figure PCTCN2017102032-appb-000028
Calculation
Figure PCTCN2017102032-appb-000029
Similarity and calculation between data in
Figure PCTCN2017102032-appb-000030
Similarity between the data in the data;
Figure PCTCN2017102032-appb-000031
Similarity calculation between data in
Figure PCTCN2017102032-appb-000032
The probability of all the order of the data in each order, the order of all the sorting is used as the target sorting order;
Figure PCTCN2017102032-appb-000033
The similarity between the data is calculated
Figure PCTCN2017102032-appb-000034
The probability of the target arrangement order of each data in the data; the probability of the target arrangement order of each data in the first output data corresponding to y 2 and the probability of the target arrangement order of each data in the second output data are input into the objective function, and the calculation is performed The value of the objective function is L 2 , and the current weight W 1 of the student network is adjusted according to the L 2 , and the adjusted weight is W 2 ;
第三次迭代训练:将y3输入教师网络和学生网络,得到对应的第一输出数据为
Figure PCTCN2017102032-appb-000035
和第二输出数据为
Figure PCTCN2017102032-appb-000036
计算
Figure PCTCN2017102032-appb-000037
中各数据之间的相似度以及计算
Figure PCTCN2017102032-appb-000038
中各数据之间的相似度;根据
Figure PCTCN2017102032-appb-000039
中各数据间的相似度计算
Figure PCTCN2017102032-appb-000040
中各数据的所有排列顺序的概率,将该所有排列顺序作为目标排列顺序;根据
Figure PCTCN2017102032-appb-000041
中各数据间的相似度计算得到
Figure PCTCN2017102032-appb-000042
中各数据的目标排列顺序的概率;将y3对应的第一输出数据中各数据的目标排列顺序的概率和第二输出数据中各数据的目标排列顺序的概率输入至目标函数中,计算得到目标函数的取值为L3,根据该L3调整学生网络当前权重W2,得到调整后的权重为W3
The third iteration training: input y 3 into the teacher network and the student network, and obtain the corresponding first output data as
Figure PCTCN2017102032-appb-000035
And the second output data is
Figure PCTCN2017102032-appb-000036
Calculation
Figure PCTCN2017102032-appb-000037
Similarity and calculation between data in
Figure PCTCN2017102032-appb-000038
Similarity between the data in the data;
Figure PCTCN2017102032-appb-000039
Similarity calculation between data in
Figure PCTCN2017102032-appb-000040
The probability of all the order of the data in each order, the order of all the sorting is used as the target sorting order;
Figure PCTCN2017102032-appb-000041
The similarity between the data is calculated
Figure PCTCN2017102032-appb-000042
The probability of the target arrangement order of each data in the data; the probability of the target arrangement order of each data in the first output data corresponding to y 3 and the probability of the target arrangement order of each data in the second output data are input into the objective function, and the calculation is performed The value of the objective function is L 3 , and the current weight W 2 of the student network is adjusted according to the L 3 , and the adjusted weight is W 3 .
实施例二 Embodiment 2
基于与前述实施例一提供的神经网络训练方法的相同构思,本发明实施例二提供一种神经网络训练装置,该装置的结构如图3所示,包括:Based on the same concept as the neural network training method provided in the foregoing first embodiment, the second embodiment of the present invention provides a neural network training device. The structure of the device is as shown in FIG. 3, and includes:
选取单元31,用于选取一个与学生网络实现相同功能的教师网络;The selecting unit 31 is configured to select a teacher network that implements the same function as the student network;
训练单元32,用于基于匹配同一训练样本数据对应的第一输出数据的数据间相似性与第二输出数据的数据间相似性来迭代训练所述学生网络得到目标网络,以实现将所述教师网络的输出数据间相似性迁移到所述学生网络;The training unit 32 is configured to iteratively train the student network to obtain a target network based on matching the similarity between the data of the first output data corresponding to the same training sample data and the data of the second output data to obtain the target network. The similarity between the output data of the network is migrated to the student network;
其中:所述第一输出数据为所述训练样本数据输入教师网络后从教师网络的第一特定网络层输出的数据,所述第二输出数据为所述训练样本数据输入学生网络后从学生网络的第二特定网络层输出的数据。The first output data is data output from the first specific network layer of the teacher network after the training sample data is input into the teacher network, and the second output data is input from the student network after the training sample data is input into the student network. The second specific network layer outputs the data.
本发明实施例中,教师网络和学生网络所实现的功能如图像分类、目标检测、图像分割等。教师网络性能优良、准确率高,但是相对学生网络其结构复杂、参数权重较多、计算速度较慢。学生网络计算速度快、性能一般或者较差、网络结构简单。选取单元31可以在预先设置的神经网络模型的集合中选取一个与学生网络实现的功能相同且性能优良的网络作为教师网络。In the embodiment of the present invention, functions implemented by the teacher network and the student network are image classification, target detection, image segmentation, and the like. The teacher network has excellent performance and high accuracy, but its structure is complex relative to the student network, the parameter weight is more, and the calculation speed is slower. The student network is fast, the performance is generally poor or the network structure is simple. The selecting unit 31 may select a network having the same function and excellent performance as that implemented by the student network in the set of the preset neural network models as the teacher network.
本发明实施例中,所述第一特定网络层为教师网络中的一个中间网络层或最后一层网络层;和/或,所述第二特定网络层为学生网络的一个中间网络层或最后一层网络层。In the embodiment of the present invention, the first specific network layer is an intermediate network layer or a last layer network layer in the teacher network; and/or the second specific network layer is an intermediate network layer or last of the student network. A layer of network.
优选地,训练单元32的结构如图4所示,具体包括构建模块321、训练模块322和确定模块323,其中:Preferably, the structure of the training unit 32 is as shown in FIG. 4, and specifically includes a construction module 321, a training module 322, and a determination module 323, where:
构建模块321,用于构建所述学生网络的目标函数,所述目标函数包含训练样本数据对应的第一输出数据的数据间相似性与第二输出数据的数据间相似性的匹配函数;a building module 321 is configured to construct an objective function of the student network, where the objective function includes a matching function of the similarity between the data of the first output data corresponding to the training sample data and the data of the second output data;
训练模块322,用于采用所述训练样本数据对所述学生网络进行迭代训练;The training module 322 is configured to perform iterative training on the student network by using the training sample data;
确定模块323,用于当训练模块322迭代训练次数达到阈值或者所述目标函数满足预置的收敛条件时,得到所述目标网络。The determining module 323 is configured to obtain the target network when the training module 322 iteratively trains the number of times to reach a threshold or the target function satisfies a preset convergence condition.
优选地,训练模块322,具体用于:Preferably, the training module 322 is specifically configured to:
对所述学生网络进行多次以下迭代训练(以下称为本次迭代训练,将用于本次迭代训练的训练样本数据称为当前训练样本数据,本次迭代训练包括以下步骤A、步骤B、步骤C、步骤D、步骤E和步骤F):Performing the following iterative training on the student network multiple times (hereinafter referred to as the current iterative training, the training sample data used for the iterative training is referred to as current training sample data, and the iterative training includes the following steps A and B, Step C, Step D, Step E and Step F):
步骤A、将用于本次迭代训练的当前训练样本数据分别输入所述教师网络和学生网络,得到对应的第一输出数据和第二输出数据;Step A: input current training sample data for the current iteration training into the teacher network and the student network, respectively, to obtain corresponding first output data and second output data;
步骤B、计算第一输出数据中各数据间的相似度以及计算第二输出数据中各数据间的相似度;Step B: calculating a similarity between each data in the first output data and calculating a similarity between the data in the second output data;
步骤C、根据第一输出数据中各数据间的相似度计算第一输出数据中各数据的所有排列 顺序的概率,并从所述第一输出数据中各数据的所有排列顺序中选取目标排列顺序;Step C: Calculating all the permutations of the data in the first output data according to the similarity between the data in the first output data a probability of order, and selecting a target arrangement order from all the arrangement orders of each data in the first output data;
步骤D、根据第二输出数据中各数据间的相似度计算第二输出数据中各数据的目标排列顺序的概率;Step D: calculating, according to the similarity between each data in the second output data, a probability of a target arrangement order of each data in the second output data;
步骤E、根据第一输出数据中各数据的目标排列顺序的概率和第二输出数据中各数据的目标排列顺序的概率计算所述目标函数的取值,并根据所述目标函数的取值调整所述学生网络的权重;Step E: calculating a value of the target function according to a probability of a target arrangement order of each data in the first output data and a probability of a target arrangement order of each data in the second output data, and adjusting according to the value of the target function The weight of the student network;
步骤F、基于调整权重后的学生网络进行下一次迭代训练。Step F: Perform the next iterative training based on the student network after adjusting the weight.
优选地,训练模块322从第一输出数据中各数据的所有排列顺序中选取目标排列顺序,具体包括:从第一输出数据中各数据的所有排列顺序中选取概率取值大于预置阈值的排列顺序作为目标排列顺序;或者,从第一输出数据中各数据的所有排列顺序中选取概率取值排在前面的预置数量的排列顺序作为目标排列顺序。Preferably, the training module 322 selects a target arrangement order from all the arrangement orders of the data in the first output data, and specifically includes: selecting, from all the arrangement orders of the data in the first output data, an arrangement whose probability value is greater than a preset threshold. The order is arranged as a target; or, from all the order of the data in the first output data, the order in which the probability values are arranged in the previous preset number is selected as the target sorting order.
优选地,所述训练模块322计算第一输出数据中各数据间的相似度,具体包括:计算第一输出数据中两两数据之间的空间距离,根据所述空间距离得到所述两两数据间的相似度;Preferably, the training module 322 calculates the similarity between the data in the first output data, and specifically includes: calculating a spatial distance between the two data in the first output data, and obtaining the two-two data according to the spatial distance. Similarity between
所述训练模块322计算第二输出数据中各数据间的相似度,具体包括:计算第二输出数据中两两数据之间的空间距离,根据所述空间距离得到所述两两数据间的相似度。The training module 322 calculates the similarity between the data in the second output data, and specifically includes: calculating a spatial distance between the two data in the second output data, and obtaining a similarity between the two data according to the spatial distance degree.
本发明实施例中,所述空间距离可以是欧式距离、余弦距离、街区距离或马氏距离等,本申请不做严格限定。以计算两两数据之间的欧氏距离和余弦距离为例。In the embodiment of the present invention, the spatial distance may be a European distance, a cosine distance, a block distance, or a Mahalanobis distance, and the present application is not strictly limited. Take the example of calculating the Euclidean distance and cosine distance between two pairs of data.
优选地,所述训练模块322根据第一输出数据中各数据间的相似度计算第一输出数据中各数据的所有排列顺序的概率,具体包括:针对每个排列顺序,将所述排列顺序的顺序信息以及第一输出数据的该排列顺序中所有相邻两个数据间的相似度输入预置的概率计算模型中,得到所述排列顺序的概率;Preferably, the training module 322 calculates a probability of all the order of the data in the first output data according to the similarity between the data in the first output data, and specifically includes: for each sorting order, the order of the sorting The sequence information and the similarity between all adjacent two data in the arrangement order of the first output data are input into a preset probability calculation model, and the probability of the arrangement order is obtained;
所述训练模块322根据第二输出数据中各数据间的相似度计算第二输出数据中各数据的目标排列顺序概率,具体包括:针对每一个目标排列顺序,将所述目标排列顺序的顺序信息以及第二输出数据的该目标排列顺序中所有相邻两个数据间的相似度输入所述概率计算模型中,得到所述目标排列顺序的概率。The training module 322 calculates a target arrangement order probability of each data in the second output data according to the similarity between the data in the second output data, and specifically includes: sequence information of the target arrangement order for each target arrangement order And the similarity between all adjacent two data in the target arrangement order of the second output data is input into the probability calculation model, and the probability of the target arrangement order is obtained.
本发明实施例中,所述概率计算模型可以为一阶Plackett概率模型,也可以为高阶Plackett概率模型,还可以是其他能够计算概率的模型,本申请不做严格限定。In the embodiment of the present invention, the probability calculation model may be a first-order Plackett probability model, a high-order Plackett probability model, or other models capable of calculating a probability, which is not strictly limited.
本发明实施例中,所述目标排列顺序可以为一个,也可以为多个。当目标顺序为多个时,本发明实施例可以基于匹配多个目标排列顺序的概率分布的方式训练得到所述学生网络。本发明实施例中匹配多个目标排列顺序的概率分布的方法有多种,例如基于概率分布的全变分距离、Wesserstein距离、Jensen-Shannon散度或Kullback-Leibler散度等。In the embodiment of the present invention, the target arrangement order may be one or multiple. When the target order is multiple, the embodiment of the present invention may train the student network based on a manner of matching a probability distribution of a plurality of target arrangement orders. In the embodiment of the present invention, there are various methods for matching the probability distribution of a plurality of target arrangement orders, such as a total variation distance based on a probability distribution, a Wesserstein distance, a Jensen-Shannon divergence, or a Kullback-Leibler divergence.
本发明实施例中,学生网络的目标函数可以仅包含一个匹配函数,该目标函数还可以是 一个匹配函数与任务损失函数的和值,该任务损失函数的表达式与学生网络所要实现的任务相关,例如该任务损失函数可以与教师网络的目标函数相同。In the embodiment of the present invention, the objective function of the student network may only include one matching function, and the objective function may also be The sum of a matching function and a task loss function. The expression of the task loss function is related to the task to be implemented by the student network. For example, the task loss function can be the same as the objective function of the teacher network.
优选地,所述训练模块322根据所述目标函数的取值调整所述学生网络的权重,具体包括:采用预置的梯度下降优化算法,根据所述目标函数的取值调整所述学生网络的权重。Preferably, the training module 322 adjusts the weight of the student network according to the value of the objective function, and specifically includes: adopting a preset gradient descent optimization algorithm, and adjusting the student network according to the value of the target function. Weights.
优选地,所述训练模块322进一步用于:在计算第一输出数据中各数据间的相似度以及计算第二输出数据中各数据间的相似度之前,通过下采样算法与插值算法对所述第一输出数据和第二输出数据进行处理,使得所述第一输出数据的空间维度与第二输出数据的空间维度一致,且第一输出数据的数量和第二输出数据的数量均与所述当前训练样本数据的数量一致。Preferably, the training module 322 is further configured to: before calculating the similarity between the data in the first output data and calculating the similarity between the data in the second output data, by using a downsampling algorithm and an interpolation algorithm Processing the first output data and the second output data such that a spatial dimension of the first output data coincides with a spatial dimension of the second output data, and the number of the first output data and the quantity of the second output data are both The number of current training sample data is consistent.
需要说明的是,前述步骤A~步骤F没有严格的先后顺序,也可以用以下的步骤A’~步骤B’替代前述步骤A~步骤B。It should be noted that the above steps A to F have no strict sequence, and the following steps A' to B' may be used instead of the above steps A to B.
步骤A’、将用于本次迭代训练的当前训练样本数据输入教师网络,得到对应的第一输出数据,并计算第一输出数据中各数据间的相似度;Step A', inputting current training sample data for the iterative training into the teacher network, obtaining corresponding first output data, and calculating a similarity between the data in the first output data;
步骤B’、将所述当前训练样本数据输入学生网络,得到对应的第二输出数据,并计算第二输出数据中各数据间的相似度。Step B', inputting the current training sample data into the student network, obtaining corresponding second output data, and calculating a similarity between the data in the second output data.
实施例三Embodiment 3
基于与前述实施例一提供的神经网络训练方法的相同构思,本发明实施例三提供一种神经网络训练装置,该装置的结构如图5所示,包括:一个处理器501和至少一个存储器502,所述至少一个存储器502用于存储至少一条机器可执行指令,所述处理器501执行所述至少一条指令以实现:选取一个与学生网络实现相同功能的教师网络;基于匹配同一训练样本数据对应的第一输出数据的数据间相似性与第二输出数据的数据间相似性来迭代训练所述学生网络得到目标网络,以实现将所述教师网络的输出数据间相似性迁移到所述学生网络;其中:所述第一输出数据为所述训练样本数据输入教师网络后从教师网络的第一特定网络层输出的数据,所述第二输出数据为所述训练样本数据输入学生网络后从学生网络的第二特定网络层输出的数据。Based on the same concept as the neural network training method provided in the foregoing first embodiment, the third embodiment of the present invention provides a neural network training device. The structure of the device is as shown in FIG. 5, including: a processor 501 and at least one memory 502. The at least one memory 502 is configured to store at least one machine executable instruction, and the processor 501 executes the at least one instruction to: select a teacher network that implements the same function as the student network; and corresponding to matching the same training sample data Iteratively training the student network to obtain a target network by the similarity between the data of the first output data and the data of the second output data, so as to implement the migration of the similarity between the output data of the teacher network to the student network Wherein: the first output data is data output from the first specific network layer of the teacher network after the training sample data is input into the teacher network, and the second output data is input from the student after the training sample data is input into the student network The data output by the second specific network layer of the network.
其中,所述处理器501执行所述至少一条指令实现基于匹配同一训练样本数据对应的第一输出数据的样本间相似性与第二输出数据的样本间相似性来迭代训练所述学生网络得到目标网络,具体包括:构建所述学生网络的目标函数,所述目标函数包含训练样本数据对应的第一输出数据的数据间相似性与第二输出数据的数据间相似性的匹配函数;采用所述训练样本数据对所述学生网络进行迭代训练;当迭代训练次数达到阈值或者所述目标函数满足预置的收敛条件时,得到所述目标网络。The processor 501 executes the at least one instruction to iteratively train the student network to obtain a target based on matching the inter-sample similarity of the first output data corresponding to the same training sample data with the sample-to-sample similarity of the second output data. The network specifically includes: an objective function for constructing the student network, the objective function includes a matching function of the similarity between the data of the first output data corresponding to the training sample data and the data of the second output data; The training sample data performs iterative training on the student network; when the iterative training number reaches a threshold or the objective function satisfies a preset convergence condition, the target network is obtained.
其中,所述处理器501执行所述至少一条指令实现采用所述训练样本数据对所述学生网 络进行迭代训练,具体包括:对所述学生网络进行多次以下迭代训练:将用于本次迭代训练的当前训练样本数据分别输入所述教师网络和学生网络,得到对应的第一输出数据和第二输出数据;计算第一输出数据中各数据间的相似度以及计算第二输出数据中各数据间的相似度;根据第一输出数据中各数据间的相似度计算第一输出数据中各数据的所有排列顺序的概率,并从所述第一输出数据中各数据的所有排列顺序中选取目标排列顺序;根据第二输出数据中各数据间的相似度计算第二输出数据中各数据的目标排列顺序的概率;根据第一输出数据中各数据的目标排列顺序的概率和第二输出数据中各数据的目标排列顺序的概率计算所述目标函数的取值,并根据所述目标函数的取值调整所述学生网络的权重;基于调整权重后的学生网络进行下一次迭代训练。The processor 501 executes the at least one instruction to implement the training sample data on the student network. The iterative training of the network specifically includes: performing the following iterative training on the student network: inputting the current training sample data used for the iterative training into the teacher network and the student network, respectively, to obtain corresponding first output data and a second output data; calculating a similarity between the data in the first output data and calculating a similarity between the data in the second output data; calculating each of the first output data according to the similarity between the data in the first output data a probability of all the order of the data, and selecting a target arrangement order from all the arrangement orders of the data in the first output data; calculating each data in the second output data according to the similarity between the data in the second output data The probability of the target arrangement order; calculating the value of the objective function according to the probability of the target arrangement order of each data in the first output data and the probability of the target arrangement order of each data in the second output data, and according to the objective function The value adjusts the weight of the student network; the next iterative training is performed based on the student network after adjusting the weight.
其中,所述处理器501执行所述至少一条指令实现从第一输出数据中各数据的所有排列顺序中选取目标排列顺序,具体包括:从第一输出数据中各数据的所有排列顺序中选取概率取值大于预置阈值的排列顺序作为目标排列顺序;或者,从第一输出数据中各数据的所有排列顺序中选取概率取值排在前面的预置数量的排列顺序作为目标排列顺序。The processor 501 executes the at least one instruction to select a target arrangement order from all the arrangement orders of the data in the first output data, and specifically includes: selecting a probability from all the arrangement orders of the data in the first output data. The arrangement order of the values greater than the preset threshold is used as the target arrangement order; or, the order of the preset number of the probabilities taken in the first output data is selected as the target arrangement order.
其中,所述处理器501执行所述至少一条指令实现计算第一输出数据中各数据间的相似度,具体包括:计算第一输出数据中两两数据之间的空间距离,根据所述空间距离得到所述两两数据间的相似度;计算第二输出数据中各数据间的相似度,具体包括:计算第二输出数据中两两数据之间的空间距离,根据所述空间距离得到所述两两数据间的相似度。The processor 501 executes the at least one instruction to implement the calculation of the similarity between the data in the first output data, and specifically includes: calculating a spatial distance between the two data in the first output data, according to the spatial distance Obtaining a similarity between the two data sets; calculating a similarity between the data in the second output data, specifically: calculating a spatial distance between the two data in the second output data, according to the spatial distance, obtaining the The similarity between two data sets.
其中,所述处理器501执行所述至少一条指令实现根据第一输出数据中各数据间的相似度计算第一输出数据中各数据的所有排列顺序的概率,具体包括:针对每个排列顺序,将所述排列顺序的顺序信息以及第一输出数据的该排列顺序中所有相邻两个数据间的相似度输入预置的概率计算模型中,得到所述排列顺序的概率;根据第二输出数据中各数据间的相似度计算第二输出数据中各数据的目标排列顺序的概率,具体包括:针对每一个目标排列顺序,将所述目标排列顺序的顺序信息以及第二输出数据的该目标排列顺序中所有相邻两个数据间的相似度输入所述概率计算模型中,得到所述目标排列顺序的概率。The processor 501 executes the at least one instruction to calculate a probability of calculating all the order of the data in the first output data according to the similarity between the data in the first output data, and specifically includes: for each sorting order, Inputting the order information of the arrangement order and the similarity between all adjacent two data in the arrangement order of the first output data into a preset probability calculation model to obtain a probability of the arrangement order; according to the second output data Calculating the probability of the target arrangement order of each data in the second output data, the specificity includes: ordering the order of the target arrangement order and the target arrangement of the second output data for each target arrangement order The similarity between all adjacent two data in the sequence is input into the probability calculation model, and the probability of the target arrangement order is obtained.
其中,当所述目标排列顺序为一个时,所述学生网络的目标函数如下:Wherein, when the target arrangement order is one, the objective function of the student network is as follows:
L=-log P(πt|Xs)L=-log P(π t |X s )
式中,πt为当前训练样本数据对应的第一输出数据中各数据的目标排列顺序,Xs为当前训练样本数据对应的第二输出数据,P(πt|Xs)为第二输出数据中各数据的目标排列顺序的概率。Where π t is the target arrangement order of each data in the first output data corresponding to the current training sample data, X s is the second output data corresponding to the current training sample data, and P(π t |X s ) is the second output The probability of the target order of each data in the data.
其中,当所述目标排列顺序为多个时,所述学生网络的目标函数如下:Wherein, when the target is arranged in a plurality of orders, the objective function of the student network is as follows:
Figure PCTCN2017102032-appb-000043
Figure PCTCN2017102032-appb-000043
式中,π为一个目标排列顺序,Xs为当前训练样本数据对应的第二输出数据,Xt为当前训练样本数据对应的第一输出数据,P(π|Xs)为当前训练样本数据的第二传输数据中各数据的排列顺序为π的概率,P(π|Xt)为当前训练样本数据的第一传输数据中各数据的排列顺序为π的概率,Q为目标排列顺序的集合。Where π is a target arrangement order, X s is the second output data corresponding to the current training sample data, X t is the first output data corresponding to the current training sample data, and P(π|X s ) is the current training sample data. The order of each data in the second transmission data is a probability of π, and P(π|X t ) is a probability that the order of each data in the first transmission data of the current training sample data is π, and Q is the target arrangement order. set.
其中,所述处理器501执行所述至少一条指令实现,根据所述目标函数的取值调整所述学生网络的权重,具体包括:采用预置的梯度下降优化算法,根据所述目标函数的取值调整所述学生网络的权重。The processor 501 performs the at least one instruction implementation, and adjusts the weight of the student network according to the value of the target function, which includes: adopting a preset gradient descent optimization algorithm, according to the target function The value adjusts the weight of the student network.
其中,在所述处理器501执行所述至少一条指令实现计算第一输出数据中各数据间的相似度以及计算第二输出数据中各数据间的相似度之前,所述处理器还执行所述至少一条指令以实现:通过下采样算法与插值算法对所述第一输出数据和第二输出数据进行处理,使得所述第一输出数据的空间维度与第二输出数据的空间维度一致,且第一输出数据的数量和第二输出数据的数量均与所述当前训练样本数据的数量一致。The processor further performs the performing, after the processor 501 executes the at least one instruction to calculate a similarity between data in the first output data and calculates a similarity between data in the second output data. At least one instruction is implemented to: process the first output data and the second output data by a downsampling algorithm and an interpolation algorithm, such that a spatial dimension of the first output data is consistent with a spatial dimension of the second output data, and The number of one output data and the number of second output data are both consistent with the number of current training sample data.
其中,所述第一特定网络层为教师网络中的一个中间网络层或最后一层网络层;所述第二特定网络层为学生网络的一个中间网络层或最后一层网络层。The first specific network layer is an intermediate network layer or a last layer network layer in the teacher network; and the second specific network layer is an intermediate network layer or a last layer network layer of the student network.
基于与前述方法相同的构思,本发明实施例还提供一种存储介质(该存储介质可以是非易失性机器可读存储介质),该存储介质中存储有用于神经网络训练的计算机程序,该计算机程序具有被配置用于执行以下步骤的代码段:选取一个与学生网络实现相同功能的教师网络;基于匹配同一训练样本数据对应的第一输出数据的数据间相似性与第二输出数据的数据间相似性来迭代训练所述学生网络得到目标网络,以实现将所述教师网络的输出数据间相似性迁移到所述学生网络;其中:所述第一输出数据为所述训练样本数据输入教师网络后从教师网络的第一特定网络层输出的数据,所述第二输出数据为所述训练样本数据输入学生网络后从学生网络的第二特定网络层输出的数据。Based on the same concept as the foregoing method, the embodiment of the present invention further provides a storage medium (which may be a non-volatile machine readable storage medium), where the computer program for storing neural network training is stored. The program has a code segment configured to perform the following steps: selecting a teacher network that implements the same function as the student network; based on matching the data between the first output data corresponding to the same training sample data and the data of the second output data Similarity to iteratively train the student network to obtain a target network to implement migration of similarity between output data of the teacher network to the student network; wherein: the first output data is used to input the training sample data into a teacher network The data output from the first specific network layer of the teacher network, the second output data is data output from the second specific network layer of the student network after the training sample data is input into the student network.
基于与前述方法相同的构思,本发明实施例还提供一种计算机程序,该计算机程序具有被配置用于执行以下神经网络训练的代码段:选取一个与学生网络实现相同功能的教师网络;基于匹配同一训练样本数据对应的第一输出数据的数据间相似性与第二输出数据的数据间相似性来迭代训练所述学生网络得到目标网络,以实现将所述教师网络的输出数据间相似性迁移到所述学生网络;其中:所述第一输出数据为所述训练样本数据输入教师网络后从教师网络的第一特定网络层输出的数据,所述第二输出数据为所述训练样本数据输入学生网络后从学生网络的第二特定网络层输出的数据。Based on the same concept as the foregoing method, an embodiment of the present invention further provides a computer program having a code segment configured to perform the following neural network training: selecting a teacher network that implements the same function as the student network; Iteratively training the student network to obtain a target network by using the similarity between the data of the first output data corresponding to the same training sample data and the data of the second output data to achieve similarity migration between the output data of the teacher network Go to the student network; wherein: the first output data is data output from a first specific network layer of a teacher network after the training sample data is input into a teacher network, and the second output data is input to the training sample data The student network then outputs data from the second specific network layer of the student network.
综上所述,本发明实施例中,能够将样本训练数据在教师网络输出的输出数据的各数据间相似信息全面迁移到学生网络中,从而实现训练样本数据通过教师网络输出的结果与通过 目标网络输出的结果基本一致。根据神经网络良好的泛化性能,训练得到的目标网络的输出与教师网络的输出在测试集上也基本相同,从而提高了学生网络的准确性。以上结合具体实施例描述了本发明的基本原理,但是,需要指出的是,对本领域普通技术人员而言,能够理解本发明的方法和装置的全部或者任何步骤或者部件可以在任何计算装置(包括处理器、存储介质等)或者计算装置的网络中,以硬件固件、软件或者他们的组合加以实现,这是本领域普通技术人员在阅读了本发明的说明的情况下运用它们的基本编程技能就能实现的。In summary, in the embodiment of the present invention, the sample training data can be completely migrated to the student network in the data of the output data output by the teacher network, thereby realizing the result and passing of the training sample data output through the teacher network. The results of the target network output are basically the same. According to the good generalization performance of the neural network, the output of the target network and the output of the teacher network are basically the same in the test set, thereby improving the accuracy of the student network. The basic principles of the present invention have been described above in connection with the specific embodiments, but it should be noted that those skilled in the art can understand that all or any of the steps or components of the method and apparatus of the present invention can be in any computing device (including The processor, the storage medium, or the like, or the network of computing devices, implemented in hardware firmware, software, or a combination thereof, which is the basic programming skill of those skilled in the art in the context of reading the description of the present invention. Can be achieved.
本领域普通技术人员可以理解实现上述实施例方法携带的全部或部分步骤是可以通过程序来指令相关的硬件完成,所述的程序可以存储于一种计算机可读存储介质中,该程序在执行时,包括方法实施例的步骤之一或其组合。A person skilled in the art can understand that all or part of the steps carried by the method of implementing the above embodiments can be completed by a program to instruct related hardware, and the program can be stored in a computer readable storage medium. , including one or a combination of the steps of the method embodiments.
另外,在本发明各个实施例中的各功能单元可以集成在一个处理模块中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个模块中。上述集成的模块既可以采用硬件的形式实现,也可以采用软件功能模块的形式实现。所述集成的模块如果以软件功能模块的形式实现并作为独立的产品销售或使用时,也可以存储在一个计算机可读取存储介质中。In addition, each functional unit in each embodiment of the present invention may be integrated into one processing module, or each unit may exist physically separately, or two or more units may be integrated into one module. The above integrated modules can be implemented in the form of hardware or in the form of software functional modules. The integrated modules, if implemented in the form of software functional modules and sold or used as stand-alone products, may also be stored in a computer readable storage medium.
本领域内的技术人员应明白,本发明的实施例可提供为方法、系统、或计算机程序产品。因此,本发明可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且,本发明可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器和光学存储器等)上实施的计算机程序产品的形式。Those skilled in the art will appreciate that embodiments of the present invention can be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or a combination of software and hardware. Moreover, the invention can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage and optical storage, etc.) including computer usable program code.
本发明是参照根据本发明实施例的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器,使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (system), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or FIG. These computer program instructions can be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing device to produce a machine for the execution of instructions for execution by a processor of a computer or other programmable data processing device. Means for implementing the functions specified in one or more of the flow or in a block or blocks of the flow chart.
这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中,使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品,该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。The computer program instructions can also be stored in a computer readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer readable memory produce an article of manufacture comprising the instruction device. The apparatus implements the functions specified in one or more blocks of a flow or a flow and/or block diagram of the flowchart.
这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上,使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理,从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。 These computer program instructions can also be loaded onto a computer or other programmable data processing device such that a series of operational steps are performed on a computer or other programmable device to produce computer-implemented processing for execution on a computer or other programmable device. The instructions provide steps for implementing the functions specified in one or more of the flow or in a block or blocks of a flow diagram.
尽管已描述了本发明的上述实施例,但本领域内的技术人员一旦得知了基本创造性概念,则可对这些实施例做出另外的变更和修改。所以,所附权利要求意欲解释为包括上述实施例以及落入本发明范围的所有变更和修改。Although the above-described embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to the embodiments once they are aware of the basic inventive concept. Therefore, the appended claims are intended to be interpreted as including the above-described embodiments and all changes and modifications falling within the scope of the invention.
显然,本领域的技术人员可以对本发明进行各种改动和变型而不脱离本发明的精神和范围。这样,倘若本发明的这些修改和变型属于本发明权利要求及其等同技术的范围之内,则本发明也意图包含这些改动和变型在内。 It is apparent that those skilled in the art can make various modifications and variations to the invention without departing from the spirit and scope of the invention. Thus, it is intended that the present invention cover the modifications and modifications of the invention

Claims (33)

  1. 一种神经网络训练方法,其特征在于,包括:A neural network training method, comprising:
    选取一个与学生网络实现相同功能的教师网络;Select a network of teachers that achieve the same functionality as the student network;
    基于匹配同一训练样本数据对应的第一输出数据的数据间相似性与第二输出数据的数据间相似性来迭代训练所述学生网络得到目标网络,以实现将所述教师网络的输出数据间相似性迁移到所述学生网络;And iteratively training the student network to obtain a target network based on matching the similarity between the data of the first output data corresponding to the same training sample data and the data of the second output data, so as to achieve similarity between the output data of the teacher network. Sexual migration to the student network;
    其中:所述第一输出数据为所述训练样本数据输入教师网络后从教师网络的第一特定网络层输出的数据,所述第二输出数据为所述训练样本数据输入学生网络后从学生网络的第二特定网络层输出的数据。The first output data is data output from the first specific network layer of the teacher network after the training sample data is input into the teacher network, and the second output data is input from the student network after the training sample data is input into the student network. The second specific network layer outputs the data.
  2. 根据权利要求1所述的方法,其特征在于,基于匹配同一训练样本数据对应的第一输出数据的样本间相似性与第二输出数据的样本间相似性来迭代训练所述学生网络得到目标网络,具体包括:The method according to claim 1, wherein the student network is iteratively trained to obtain a target network based on matching the inter-sample similarity of the first output data corresponding to the same training sample data with the sample-to-sample similarity of the second output data. Specifically, including:
    构建所述学生网络的目标函数,所述目标函数包含训练样本数据对应的第一输出数据的数据间相似性与第二输出数据的数据间相似性的匹配函数;Constructing an objective function of the student network, the objective function comprising a matching function of the similarity between the data of the first output data corresponding to the training sample data and the data of the second output data;
    采用所述训练样本数据对所述学生网络进行迭代训练;Performing iterative training on the student network by using the training sample data;
    当迭代训练次数达到阈值或者所述目标函数满足预置的收敛条件时,得到所述目标网络。The target network is obtained when the number of iterative training reaches a threshold or the objective function satisfies a preset convergence condition.
  3. 根据权利要求2所述的方法,其特征在于,采用所述训练样本数据对所述学生网络进行迭代训练,具体包括:The method according to claim 2, wherein the iterative training is performed on the student network by using the training sample data, which specifically includes:
    对所述学生网络进行多次以下迭代训练:Perform the following iterative training on the student network:
    将用于本次迭代训练的当前训练样本数据分别输入所述教师网络和学生网络,得到对应的第一输出数据和第二输出数据;The current training sample data used for the iterative training is input into the teacher network and the student network, respectively, to obtain corresponding first output data and second output data;
    计算第一输出数据中各数据间的相似度以及计算第二输出数据中各数据间的相似度;Calculating a similarity between each data in the first output data and calculating a similarity between the data in the second output data;
    根据第一输出数据中各数据间的相似度计算第一输出数据中各数据的所有排列顺序的概率,并从所述第一输出数据中各数据的所有排列顺序中选取目标排列顺序;Calculating a probability of all the order of the data in the first output data according to the similarity between the data in the first output data, and selecting a target arrangement order from all the arrangement orders of the data in the first output data;
    根据第二输出数据中各数据间的相似度计算第二输出数据中各数据的目标排列顺序的概率;Calculating a probability of a target arrangement order of each data in the second output data according to a similarity between each data in the second output data;
    根据第一输出数据中各数据的目标排列顺序的概率和第二输出数据中各数据的目标排列顺序的概率计算所述目标函数的取值,并根据所述目标函数的取值调整所述学生网络的权重;Calculating a value of the objective function according to a probability of a target arrangement order of each data in the first output data and a probability of a target arrangement order of each data in the second output data, and adjusting the student according to the value of the objective function The weight of the network;
    基于调整权重后的学生网络进行下一次迭代训练。The next iterative training is performed based on the student network after adjusting the weights.
  4. 根据权利要求3所述的方法,其特征在于,从第一输出数据中各数据的所有排列顺序中选取目标排列顺序,具体包括:The method according to claim 3, wherein the target arrangement order is selected from all the arrangement orders of the data in the first output data, and specifically includes:
    从第一输出数据中各数据的所有排列顺序中选取概率取值大于预置阈值的排列顺序作为 目标排列顺序;Selecting an order in which the probability values are greater than a preset threshold from all the order of the data in the first output data Target arrangement order;
    或者,从第一输出数据中各数据的所有排列顺序中选取概率取值排在前面的预置数量的排列顺序作为目标排列顺序。Alternatively, the order in which the probability values are ranked in the previous preset number is selected from the all sorting orders of the data in the first output data as the target sorting order.
  5. 根据权利要求3所述的方法,其特征在于,计算第一输出数据中各数据间的相似度,具体包括:计算第一输出数据中两两数据之间的空间距离,根据所述空间距离得到所述两两数据间的相似度;The method according to claim 3, wherein calculating the similarity between the data in the first output data comprises: calculating a spatial distance between the two data in the first output data, according to the spatial distance Similarity between the two pairs of data;
    计算第二输出数据中各数据间的相似度,具体包括:计算第二输出数据中两两数据之间的空间距离,根据所述空间距离得到所述两两数据间的相似度。Calculating the similarity between the data in the second output data, specifically: calculating a spatial distance between the two data in the second output data, and obtaining a similarity between the two data according to the spatial distance.
  6. 根据权利要求3所述的方法,其特征在于,根据第一输出数据中各数据间的相似度计算第一输出数据中各数据的所有排列顺序的概率,具体包括:The method according to claim 3, wherein the probability of calculating all the order of the data in the first output data is calculated according to the similarity between the data in the first output data, and specifically includes:
    针对每个排列顺序,将所述排列顺序的顺序信息以及第一输出数据的该排列顺序中所有相邻两个数据间的相似度输入预置的概率计算模型中,得到所述排列顺序的概率;For each permutation order, the order information of the arrangement order and the similarity between all adjacent two data in the arrangement order of the first output data are input into a preset probability calculation model to obtain a probability of the arrangement order. ;
    根据第二输出数据中各数据间的相似度计算第二输出数据中各数据的目标排列顺序的概率,具体包括:针对每一个目标排列顺序,将所述目标排列顺序的顺序信息以及第二输出数据的该目标排列顺序中所有相邻两个数据间的相似度输入所述概率计算模型中,得到所述目标排列顺序的概率。The probability of calculating the target arrangement order of each data in the second output data according to the similarity between the data in the second output data, specifically includes: order information of the target arrangement order and the second output for each target arrangement order The similarity between all adjacent two data in the target arrangement order of the data is input into the probability calculation model, and the probability of the target arrangement order is obtained.
  7. 根据权利要求3所述的方法,其特征在于,当所述目标排列顺序为一个时,所述学生网络的目标函数如下:The method according to claim 3, wherein when the target arrangement order is one, the objective function of the student network is as follows:
    L=-log P(πt|Xs)L=-log P(π t |X s )
    式中,πt为当前训练样本数据对应的第一输出数据中各数据的目标排列顺序,Xs为当前训练样本数据对应的第二输出数据,P(πt|Xs)为第二输出数据中各数据的目标排列顺序的概率。Where π t is the target arrangement order of each data in the first output data corresponding to the current training sample data, X s is the second output data corresponding to the current training sample data, and P(π t |X s ) is the second output The probability of the target order of each data in the data.
  8. 根据权利要求3所述的方法,其特征在于,当所述目标排列顺序为多个时,所述学生网络的目标函数如下:The method according to claim 3, wherein when the target arrangement order is plural, the objective function of the student network is as follows:
    Figure PCTCN2017102032-appb-100001
    Figure PCTCN2017102032-appb-100001
    式中,π为一个目标排列顺序,Xs为当前训练样本数据对应的第二输出数据,Xt为当前训练样本数据对应的第一输出数据,P(π|Xs)为当前训练样本数据的第二传输数据中各数据的排列顺序为π的概率,P(π|Xt)为当前训练样本数据的第一传输数据中各数据的排列顺序为π的概率,Q为目标排列顺序的集合。 Where π is a target arrangement order, X s is the second output data corresponding to the current training sample data, X t is the first output data corresponding to the current training sample data, and P(π|X s ) is the current training sample data. The order of each data in the second transmission data is a probability of π, and P(π|X t ) is a probability that the order of each data in the first transmission data of the current training sample data is π, and Q is the target arrangement order. set.
  9. 根据权利要求3所述的方法,其特征在于,根据所述目标函数的取值调整所述学生网络的权重,具体包括:The method according to claim 3, wherein adjusting the weight of the student network according to the value of the objective function comprises:
    采用预置的梯度下降优化算法,根据所述目标函数的取值调整所述学生网络的权重。The weight of the student network is adjusted according to the value of the objective function by using a preset gradient descent optimization algorithm.
  10. 根据权利要求3所述的方法,其特征在于,在计算第一输出数据中各数据间的相似度以及计算第二输出数据中各数据间的相似度之前,还包括:通过下采样算法与插值算法对所述第一输出数据和第二输出数据进行处理,使得所述第一输出数据的空间维度与第二输出数据的空间维度一致,且第一输出数据的数量和第二输出数据的数量均与所述当前训练样本数据的数量一致。The method according to claim 3, wherein before calculating the similarity between the data in the first output data and calculating the similarity between the data in the second output data, the method further comprises: performing a downsampling algorithm and interpolating The algorithm processes the first output data and the second output data such that a spatial dimension of the first output data is consistent with a spatial dimension of the second output data, and the number of the first output data and the number of the second output data Both are consistent with the number of current training sample data.
  11. 根据权利要求1所述的方法,其特征在于,所述第一特定网络层为教师网络中的一个中间网络层或最后一层网络层;The method according to claim 1, wherein the first specific network layer is an intermediate network layer or a last network layer in a teacher network;
    所述第二特定网络层为学生网络的一个中间网络层或最后一层网络层。The second specific network layer is an intermediate network layer or a last network layer of the student network.
  12. 一种神经网络训练装置,其特征在于,包括:A neural network training device, comprising:
    选取单元,用于选取一个与学生网络实现相同功能的教师网络;The selection unit is used to select a teacher network that implements the same function as the student network;
    训练单元,用于基于匹配同一训练样本数据对应的第一输出数据的数据间相似性与第二输出数据的数据间相似性来迭代训练所述学生网络得到目标网络,以实现将所述教师网络的输出数据间相似性迁移到所述学生网络;a training unit, configured to iteratively train the student network to obtain a target network based on matching the similarity between the data of the first output data corresponding to the same training sample data and the data of the second output data, so as to implement the teacher network The similarity between the output data is migrated to the student network;
    其中:所述第一输出数据为所述训练样本数据输入教师网络后从教师网络的第一特定网络层输出的数据,所述第二输出数据为所述训练样本数据输入学生网络后从学生网络的第二特定网络层输出的数据。The first output data is data output from the first specific network layer of the teacher network after the training sample data is input into the teacher network, and the second output data is input from the student network after the training sample data is input into the student network. The second specific network layer outputs the data.
  13. 根据权利要求12所述的装置,其特征在于,所述训练单元,具体包括:The device according to claim 12, wherein the training unit comprises:
    构建模块,用于构建所述学生网络的目标函数,所述目标函数包含训练样本数据对应的第一输出数据的数据间相似性与第二输出数据的数据间相似性的匹配函数;a building module, configured to construct an objective function of the student network, the objective function comprising a matching function of the similarity between the data of the first output data corresponding to the training sample data and the data of the second output data;
    训练模块,用于采用所述训练样本数据对所述学生网络进行迭代训练;a training module, configured to perform iterative training on the student network by using the training sample data;
    确定模块,用于当训练模块迭代训练次数达到阈值或者所述目标函数满足预置的收敛条件时,得到所述目标网络。And a determining module, configured to obtain the target network when the training module iterative training times reaches a threshold or the target function satisfies a preset convergence condition.
  14. 根据权利要求13所述的装置,其特征在于,所述训练模块,具体用于:The device according to claim 13, wherein the training module is specifically configured to:
    对所述学生网络进行多次以下迭代训练:Perform the following iterative training on the student network:
    将用于本次迭代训练的当前训练样本数据分别输入所述教师网络和学生网络,得到对应的第一输出数据和第二输出数据;The current training sample data used for the iterative training is input into the teacher network and the student network, respectively, to obtain corresponding first output data and second output data;
    计算第一输出数据中各数据间的相似度以及计算第二输出数据中各数据间的相似度;Calculating a similarity between each data in the first output data and calculating a similarity between the data in the second output data;
    根据第一输出数据中各数据间的相似度计算第一输出数据中各数据的所有排列顺序的概率,并从所述第一输出数据中各数据的所有排列顺序中选取目标排列顺序; Calculating a probability of all the order of the data in the first output data according to the similarity between the data in the first output data, and selecting a target arrangement order from all the arrangement orders of the data in the first output data;
    根据第二输出数据中各数据间的相似度计算第二输出数据中各数据的目标排列顺序的概率;Calculating a probability of a target arrangement order of each data in the second output data according to a similarity between each data in the second output data;
    根据第一输出数据中各数据的目标排列顺序的概率和第二输出数据中各数据的目标排列顺序的概率计算所述目标函数的取值,并根据所述目标函数的取值调整所述学生网络的权重;Calculating a value of the objective function according to a probability of a target arrangement order of each data in the first output data and a probability of a target arrangement order of each data in the second output data, and adjusting the student according to the value of the objective function The weight of the network;
    基于调整权重后的学生网络进行下一次迭代训练。The next iterative training is performed based on the student network after adjusting the weights.
  15. 根据权利要求14所述的装置,其特征在于,所述训练模块从第一输出数据中各数据的所有排列顺序中选取目标排列顺序,具体包括:The apparatus according to claim 14, wherein the training module selects a target arrangement order from all the arrangement orders of the data in the first output data, and specifically includes:
    从第一输出数据中各数据的所有排列顺序中选取概率取值大于预置阈值的排列顺序作为目标排列顺序;Selecting an arrangement order in which the probability values are greater than a preset threshold from all the arrangement orders of the data in the first output data as the target arrangement order;
    或者,从第一输出数据中各数据的所有排列顺序中选取概率取值排在前面的预置数量的排列顺序作为目标排列顺序。Alternatively, the order in which the probability values are ranked in the previous preset number is selected from the all sorting orders of the data in the first output data as the target sorting order.
  16. 根据权利要求14所述的装置,其特征在于,所述训练模块计算第一输出数据中各数据间的相似度,具体包括:计算第一输出数据中两两数据之间的空间距离,根据所述空间距离得到所述两两数据间的相似度;The apparatus according to claim 14, wherein the training module calculates the similarity between the data in the first output data, specifically: calculating a spatial distance between the two data in the first output data, according to The spatial distance is obtained to obtain similarity between the two pairs of data;
    所述训练模块计算第二输出数据中各数据间的相似度,具体包括:计算第二输出数据中两两数据之间的空间距离,根据所述空间距离得到所述两两数据间的相似度。The training module calculates the similarity between the data in the second output data, specifically: calculating a spatial distance between the two data in the second output data, and obtaining a similarity between the two data according to the spatial distance .
  17. 根据权利要求14所述的装置,其特征在于,所述训练模块根据第一输出数据中各数据间的相似度计算第一输出数据中各数据的所有排列顺序的概率,具体包括:针对每个排列顺序,将所述排列顺序的顺序信息以及第一输出数据的该排列顺序中所有相邻两个数据间的相似度输入预置的概率计算模型中,得到所述排列顺序的概率;The apparatus according to claim 14, wherein the training module calculates a probability of all the order of the data in the first output data according to the similarity between the data in the first output data, and specifically includes: Arranging the order, and inputting the order information of the arrangement order and the similarity between all adjacent two data in the arrangement order of the first output data into a preset probability calculation model to obtain a probability of the arrangement order;
    所述训练模块根据第二输出数据中各数据间的相似度计算第二输出数据中各数据的目标排列顺序概率,具体包括:针对每一个目标排列顺序,将所述目标排列顺序的顺序信息以及第二输出数据的该目标排列顺序中所有相邻两个数据间的相似度输入所述概率计算模型中,得到所述目标排列顺序的概率。The training module calculates a target arrangement order probability of each data in the second output data according to the similarity between the data in the second output data, and specifically includes: sequence information of the target arrangement order for each target arrangement order and The similarity between all adjacent two data in the target arrangement order of the second output data is input into the probability calculation model, and the probability of the target arrangement order is obtained.
  18. 根据权利要求14所述的装置,其特征在于,The device of claim 14 wherein:
    当所述目标排列顺序为一个时,所述学生网络的目标函数如下:When the target arrangement order is one, the objective function of the student network is as follows:
    L=-log P(πt|Xs)L=-log P(π t |X s )
    式中,πt为第二输出数据中各数据的目标排列顺序,Xs为当前训练样本数据对应的第二输出数据,P(πt|Xs)为πt的概率。Where π t is the target arrangement order of each data in the second output data, X s is the second output data corresponding to the current training sample data, and P(π t |X s ) is the probability of π t .
  19. 根据权利要求14所述的装置,其特征在于,The device of claim 14 wherein:
    当所述目标排列顺序为多个时,所述学生网络的目标函数如下: When the target arrangement order is plural, the objective function of the student network is as follows:
    Figure PCTCN2017102032-appb-100002
    Figure PCTCN2017102032-appb-100002
    式中,π为一个目标排列顺序,Xs为当前训练样本数据对应的第二输出数据,Xt为当前训练样本数据对应的第一输出数据,P(π|Xs)为当前训练样本数据的第二传输数据中各数据的排列顺序为π的概率,P(π|Xt)为当前训练样本数据的第一传输数据中各数据的排列顺序为π的概率,Q为目标排列顺序的集合。Where π is a target arrangement order, X s is the second output data corresponding to the current training sample data, X t is the first output data corresponding to the current training sample data, and P(π|X s ) is the current training sample data. The order of each data in the second transmission data is a probability of π, and P(π|X t ) is a probability that the order of each data in the first transmission data of the current training sample data is π, and Q is the target arrangement order. set.
  20. 根据权利要求14所述的装置,其特征在于,所述训练模块根据所述目标函数的取值调整所述学生网络的权重,具体包括:The apparatus according to claim 14, wherein the training module adjusts the weight of the student network according to the value of the objective function, which specifically includes:
    采用预置的梯度下降优化算法,根据所述目标函数的取值调整所述学生网络的权重。The weight of the student network is adjusted according to the value of the objective function by using a preset gradient descent optimization algorithm.
  21. 根据权利要求14所述的装置,所述训练模块进一步用于:The apparatus of claim 14, the training module further configured to:
    在计算第一输出数据中各数据间的相似度以及计算第二输出数据中各数据间的相似度之前,通过下采样算法与插值算法对所述第一输出数据和第二输出数据进行处理,使得所述第一输出数据的空间维度与第二输出数据的空间维度一致,且第一输出数据的数量和第二输出数据的数量均与所述当前训练样本数据的数量一致。Processing the first output data and the second output data by a downsampling algorithm and an interpolation algorithm before calculating a similarity between the data in the first output data and calculating a similarity between the data in the second output data, The spatial dimension of the first output data is made to coincide with the spatial dimension of the second output data, and the number of the first output data and the number of the second output data are both consistent with the number of the current training sample data.
  22. 根据权利要求12所述的装置,其特征在于,所述第一特定网络层为教师网络中的一个中间网络层或最后一层网络层;The apparatus according to claim 12, wherein said first specific network layer is an intermediate network layer or a last network layer in a teacher network;
    所述第二特定网络层为学生网络的一个中间网络层或最后一层网络层。The second specific network layer is an intermediate network layer or a last network layer of the student network.
  23. 一种神经网络训练装置,其特征在于,包括:一个处理器和至少一个存储器,所述至少一个存储器用于存储至少一条机器可执行指令,所述处理器执行所述至少一条指令以实现:A neural network training apparatus, comprising: a processor and at least one memory, the at least one memory for storing at least one machine executable instruction, the processor executing the at least one instruction to:
    选取一个与学生网络实现相同功能的教师网络;Select a network of teachers that achieve the same functionality as the student network;
    基于匹配同一训练样本数据对应的第一输出数据的数据间相似性与第二输出数据的数据间相似性来迭代训练所述学生网络得到目标网络,以实现将所述教师网络的输出数据间相似性迁移到所述学生网络;And iteratively training the student network to obtain a target network based on matching the similarity between the data of the first output data corresponding to the same training sample data and the data of the second output data, so as to achieve similarity between the output data of the teacher network. Sexual migration to the student network;
    其中:所述第一输出数据为所述训练样本数据输入教师网络后从教师网络的第一特定网络层输出的数据,所述第二输出数据为所述训练样本数据输入学生网络后从学生网络的第二特定网络层输出的数据。The first output data is data output from the first specific network layer of the teacher network after the training sample data is input into the teacher network, and the second output data is input from the student network after the training sample data is input into the student network. The second specific network layer outputs the data.
  24. 根据权利要求23所述的装置,其特征在于,所述处理器执行所述至少一条指令实现基于匹配同一训练样本数据对应的第一输出数据的样本间相似性与第二输出数据的样本间相似性来迭代训练所述学生网络得到目标网络,具体包括:The apparatus according to claim 23, wherein said processor executes said at least one instruction to achieve an inter-sample similarity based on matching first output data corresponding to the same training sample data and a sample-to-sample similarity of second output data To iteratively train the student network to obtain the target network, including:
    构建所述学生网络的目标函数,所述目标函数包含训练样本数据对应的第一输出数据的数据间相似性与第二输出数据的数据间相似性的匹配函数; Constructing an objective function of the student network, the objective function comprising a matching function of the similarity between the data of the first output data corresponding to the training sample data and the data of the second output data;
    采用所述训练样本数据对所述学生网络进行迭代训练;Performing iterative training on the student network by using the training sample data;
    当迭代训练次数达到阈值或者所述目标函数满足预置的收敛条件时,得到所述目标网络。The target network is obtained when the number of iterative training reaches a threshold or the objective function satisfies a preset convergence condition.
  25. 根据权利要求24所述的装置,其特征在于,所述处理器执行所述至少一条指令实现采用所述训练样本数据对所述学生网络进行迭代训练,具体包括:The apparatus according to claim 24, wherein the processor executes the at least one instruction to perform iterative training on the student network by using the training sample data, and specifically includes:
    对所述学生网络进行多次以下迭代训练:Perform the following iterative training on the student network:
    将用于本次迭代训练的当前训练样本数据分别输入所述教师网络和学生网络,得到对应的第一输出数据和第二输出数据;The current training sample data used for the iterative training is input into the teacher network and the student network, respectively, to obtain corresponding first output data and second output data;
    计算第一输出数据中各数据间的相似度以及计算第二输出数据中各数据间的相似度;Calculating a similarity between each data in the first output data and calculating a similarity between the data in the second output data;
    根据第一输出数据中各数据间的相似度计算第一输出数据中各数据的所有排列顺序的概率,并从所述第一输出数据中各数据的所有排列顺序中选取目标排列顺序;Calculating a probability of all the order of the data in the first output data according to the similarity between the data in the first output data, and selecting a target arrangement order from all the arrangement orders of the data in the first output data;
    根据第二输出数据中各数据间的相似度计算第二输出数据中各数据的目标排列顺序的概率;Calculating a probability of a target arrangement order of each data in the second output data according to a similarity between each data in the second output data;
    根据第一输出数据中各数据的目标排列顺序的概率和第二输出数据中各数据的目标排列顺序的概率计算所述目标函数的取值,并根据所述目标函数的取值调整所述学生网络的权重;Calculating a value of the objective function according to a probability of a target arrangement order of each data in the first output data and a probability of a target arrangement order of each data in the second output data, and adjusting the student according to the value of the objective function The weight of the network;
    基于调整权重后的学生网络进行下一次迭代训练。The next iterative training is performed based on the student network after adjusting the weights.
  26. 根据权利要求25所述的装置,其特征在于,所述处理器执行所述至少一条指令实现从第一输出数据中各数据的所有排列顺序中选取目标排列顺序,具体包括:The apparatus according to claim 25, wherein the processor executes the at least one instruction to select a target arrangement order from all the arrangement orders of the data in the first output data, and specifically includes:
    从第一输出数据中各数据的所有排列顺序中选取概率取值大于预置阈值的排列顺序作为目标排列顺序;Selecting an arrangement order in which the probability values are greater than a preset threshold from all the arrangement orders of the data in the first output data as the target arrangement order;
    或者,从第一输出数据中各数据的所有排列顺序中选取概率取值排在前面的预置数量的排列顺序作为目标排列顺序。Alternatively, the order in which the probability values are ranked in the previous preset number is selected from the all sorting orders of the data in the first output data as the target sorting order.
  27. 根据权利要求25所述的装置,其特征在于,所述处理器执行所述至少一条指令实现计算第一输出数据中各数据间的相似度,具体包括:计算第一输出数据中两两数据之间的空间距离,根据所述空间距离得到所述两两数据间的相似度;The device according to claim 25, wherein the processor executing the at least one instruction to calculate a similarity between data in the first output data comprises: calculating two or two data in the first output data. a spatial distance between the two, and the similarity between the two data sets according to the spatial distance;
    计算第二输出数据中各数据间的相似度,具体包括:计算第二输出数据中两两数据之间的空间距离,根据所述空间距离得到所述两两数据间的相似度。Calculating the similarity between the data in the second output data, specifically: calculating a spatial distance between the two data in the second output data, and obtaining a similarity between the two data according to the spatial distance.
  28. 根据权利要求25所述的装置,其特征在于,所述处理器执行所述至少一条指令实现根据第一输出数据中各数据间的相似度计算第一输出数据中各数据的所有排列顺序的概率,具体包括:The apparatus according to claim 25, wherein said processor executes said at least one instruction to calculate a probability of calculating all the order of each data in the first output data based on the similarity between the data in the first output data. Specifically, including:
    针对每个排列顺序,将所述排列顺序的顺序信息以及第一输出数据的该排列顺序中所有相邻两个数据间的相似度输入预置的概率计算模型中,得到所述排列顺序的概率;For each permutation order, the order information of the arrangement order and the similarity between all adjacent two data in the arrangement order of the first output data are input into a preset probability calculation model to obtain a probability of the arrangement order. ;
    根据第二输出数据中各数据间的相似度计算第二输出数据中各数据的目标排列顺序的概 率,具体包括:针对每一个目标排列顺序,将所述目标排列顺序的顺序信息以及第二输出数据的该目标排列顺序中所有相邻两个数据间的相似度输入所述概率计算模型中,得到所述目标排列顺序的概率。Calculating an outline of a target arrangement order of each data in the second output data according to the similarity between the data in the second output data The rate includes: for each target arrangement order, inputting the order information of the target arrangement order and the similarity between all adjacent two data in the target arrangement order of the second output data into the probability calculation model, The probability of obtaining the target arrangement order is obtained.
  29. 根据权利要求25所述的装置,其特征在于,当所述目标排列顺序为一个时,所述学生网络的目标函数如下:The apparatus according to claim 25, wherein when said target arrangement order is one, an objective function of said student network is as follows:
    L=-log P(πt|Xs)L=-log P(π t |X s )
    式中,πt为当前训练样本数据对应的第一输出数据中各数据的目标排列顺序,Xs为当前训练样本数据对应的第二输出数据,P(πt|Xs)为第二输出数据中各数据的目标排列顺序的概率。Where π t is the target arrangement order of each data in the first output data corresponding to the current training sample data, X s is the second output data corresponding to the current training sample data, and P(π t |X s ) is the second output The probability of the target order of each data in the data.
  30. 根据权利要求25所述的装置,其特征在于,当所述目标排列顺序为多个时,所述学生网络的目标函数如下:The apparatus according to claim 25, wherein when the target arrangement order is plural, the objective function of the student network is as follows:
    Figure PCTCN2017102032-appb-100003
    Figure PCTCN2017102032-appb-100003
    式中,π为一个目标排列顺序,Xs为当前训练样本数据对应的第二输出数据,Xt为当前训练样本数据对应的第一输出数据,P(π|Xs)为当前训练样本数据的第二传输数据中各数据的排列顺序为π的概率,P(π|Xt)为当前训练样本数据的第一传输数据中各数据的排列顺序为π的概率,Q为目标排列顺序的集合。Where π is a target arrangement order, X s is the second output data corresponding to the current training sample data, X t is the first output data corresponding to the current training sample data, and P(π|X s ) is the current training sample data. The order of each data in the second transmission data is a probability of π, and P(π|X t ) is a probability that the order of each data in the first transmission data of the current training sample data is π, and Q is the target arrangement order. set.
  31. 根据权利要求25所述的装置,其特征在于,所述处理器执行所述至少一条指令实现,根据所述目标函数的取值调整所述学生网络的权重,具体包括:The apparatus according to claim 25, wherein the processor performs the at least one instruction implementation, and adjusting the weight of the student network according to the value of the objective function, specifically:
    采用预置的梯度下降优化算法,根据所述目标函数的取值调整所述学生网络的权重。The weight of the student network is adjusted according to the value of the objective function by using a preset gradient descent optimization algorithm.
  32. 根据权利要求25所述的装置,其特征在于,在所述处理器执行所述至少一条指令实现计算第一输出数据中各数据间的相似度以及计算第二输出数据中各数据间的相似度之前,所述处理器还执行所述至少一条指令以实现:通过下采样算法与插值算法对所述第一输出数据和第二输出数据进行处理,使得所述第一输出数据的空间维度与第二输出数据的空间维度一致,且第一输出数据的数量和第二输出数据的数量均与所述当前训练样本数据的数量一致。The apparatus according to claim 25, wherein said processor executes said at least one instruction to calculate a similarity between data in said first output data and calculate a similarity between data in said second output data The processor further executes the at least one instruction to: process the first output data and the second output data by a downsampling algorithm and an interpolation algorithm, such that a spatial dimension of the first output data The spatial dimensions of the two output data are consistent, and the number of the first output data and the number of the second output data are both consistent with the number of the current training sample data.
  33. 根据权利要求23所述的装置,其特征在于,所述第一特定网络层为教师网络中的一个中间网络层或最后一层网络层;The apparatus according to claim 23, wherein said first specific network layer is an intermediate network layer or a last layer network layer in a teacher network;
    所述第二特定网络层为学生网络的一个中间网络层或最后一层网络层。 The second specific network layer is an intermediate network layer or a last network layer of the student network.
PCT/CN2017/102032 2017-06-15 2017-09-18 Neural network training method and device WO2018227800A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710450211.9A CN107358293B (en) 2017-06-15 2017-06-15 Neural network training method and device
CN201710450211.9 2017-06-15

Publications (1)

Publication Number Publication Date
WO2018227800A1 true WO2018227800A1 (en) 2018-12-20

Family

ID=60273856

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/102032 WO2018227800A1 (en) 2017-06-15 2017-09-18 Neural network training method and device

Country Status (2)

Country Link
CN (2) CN110969250B (en)
WO (1) WO2018227800A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291836A (en) * 2020-03-31 2020-06-16 中国科学院计算技术研究所 Method for generating student network model
CN111340221A (en) * 2020-02-25 2020-06-26 北京百度网讯科技有限公司 Method and device for sampling neural network structure
CN111435424A (en) * 2019-01-14 2020-07-21 北京京东尚科信息技术有限公司 Image processing method and device
CN111444958A (en) * 2020-03-25 2020-07-24 北京百度网讯科技有限公司 Model migration training method, device, equipment and storage medium
CN111598213A (en) * 2020-04-01 2020-08-28 北京迈格威科技有限公司 Network training method, data identification method, device, equipment and medium

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304915B (en) * 2018-01-05 2020-08-11 大国创新智能科技(东莞)有限公司 Deep learning neural network decomposition and synthesis method and system
CN108830288A (en) * 2018-04-25 2018-11-16 北京市商汤科技开发有限公司 Image processing method, the training method of neural network, device, equipment and medium
CN108921282B (en) * 2018-05-16 2022-05-31 深圳大学 Construction method and device of deep neural network model
CN108764462A (en) * 2018-05-29 2018-11-06 成都视观天下科技有限公司 A kind of convolutional neural networks optimization method of knowledge based distillation
CN110598504B (en) * 2018-06-12 2023-07-21 北京市商汤科技开发有限公司 Image recognition method and device, electronic equipment and storage medium
CN108830813B (en) * 2018-06-12 2021-11-09 福建帝视信息科技有限公司 Knowledge distillation-based image super-resolution enhancement method
CN108898168B (en) * 2018-06-19 2021-06-01 清华大学 Compression method and system of convolutional neural network model for target detection
CN108985920A (en) * 2018-06-22 2018-12-11 阿里巴巴集团控股有限公司 Arbitrage recognition methods and device
CN109783824B (en) * 2018-12-17 2023-04-18 北京百度网讯科技有限公司 Translation method, device and storage medium based on translation model
CN109637546B (en) * 2018-12-29 2021-02-12 苏州思必驰信息科技有限公司 Knowledge distillation method and apparatus
CN109840588B (en) * 2019-01-04 2023-09-08 平安科技(深圳)有限公司 Neural network model training method, device, computer equipment and storage medium
CN109800821A (en) * 2019-01-31 2019-05-24 北京市商汤科技开发有限公司 Method, image processing method, device, equipment and the medium of training neural network
CN110009052B (en) * 2019-04-11 2022-11-18 腾讯科技(深圳)有限公司 Image recognition method, image recognition model training method and device
CN110163344B (en) * 2019-04-26 2021-07-09 北京迈格威科技有限公司 Neural network training method, device, equipment and storage medium
CN111401406B (en) * 2020-02-21 2023-07-18 华为技术有限公司 Neural network training method, video frame processing method and related equipment
CN112116441B (en) * 2020-10-13 2024-03-12 腾讯科技(深圳)有限公司 Training method, classification method, device and equipment for financial risk classification model
CN112712052A (en) * 2021-01-13 2021-04-27 安徽水天信息科技有限公司 Method for detecting and identifying weak target in airport panoramic video
CN112365886B (en) * 2021-01-18 2021-05-07 深圳市友杰智新科技有限公司 Training method and device of speech recognition model and computer equipment
CN113378940B (en) * 2021-06-15 2022-10-18 北京市商汤科技开发有限公司 Neural network training method and device, computer equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104090971A (en) * 2014-07-17 2014-10-08 中国科学院自动化研究所 Cross-network behavior association method for individual application
CN104657596A (en) * 2015-01-27 2015-05-27 中国矿业大学 Model-transfer-based large-sized new compressor performance prediction rapid-modeling method
CN105844331A (en) * 2015-01-15 2016-08-10 富士通株式会社 Neural network system and training method thereof
US20170024641A1 (en) * 2015-07-22 2017-01-26 Qualcomm Incorporated Transfer learning in neural networks

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7062476B2 (en) * 2002-06-17 2006-06-13 The Boeing Company Student neural network
CN103020711A (en) * 2012-12-25 2013-04-03 中国科学院深圳先进技术研究院 Classifier training method and classifier training system
US20150046181A1 (en) * 2014-02-14 2015-02-12 Brighterion, Inc. Healthcare fraud protection and management
US20160328644A1 (en) * 2015-05-08 2016-11-10 Qualcomm Incorporated Adaptive selection of artificial neural networks
CN105787513A (en) * 2016-03-01 2016-07-20 南京邮电大学 Transfer learning design method and system based on domain adaptation under multi-example multi-label framework

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104090971A (en) * 2014-07-17 2014-10-08 中国科学院自动化研究所 Cross-network behavior association method for individual application
CN105844331A (en) * 2015-01-15 2016-08-10 富士通株式会社 Neural network system and training method thereof
CN104657596A (en) * 2015-01-27 2015-05-27 中国矿业大学 Model-transfer-based large-sized new compressor performance prediction rapid-modeling method
US20170024641A1 (en) * 2015-07-22 2017-01-26 Qualcomm Incorporated Transfer learning in neural networks

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111435424A (en) * 2019-01-14 2020-07-21 北京京东尚科信息技术有限公司 Image processing method and device
CN111340221A (en) * 2020-02-25 2020-06-26 北京百度网讯科技有限公司 Method and device for sampling neural network structure
CN111340221B (en) * 2020-02-25 2023-09-12 北京百度网讯科技有限公司 Neural network structure sampling method and device
CN111444958A (en) * 2020-03-25 2020-07-24 北京百度网讯科技有限公司 Model migration training method, device, equipment and storage medium
CN111444958B (en) * 2020-03-25 2024-02-13 北京百度网讯科技有限公司 Model migration training method, device, equipment and storage medium
CN111291836A (en) * 2020-03-31 2020-06-16 中国科学院计算技术研究所 Method for generating student network model
CN111291836B (en) * 2020-03-31 2023-09-08 中国科学院计算技术研究所 Method for generating student network model
CN111598213A (en) * 2020-04-01 2020-08-28 北京迈格威科技有限公司 Network training method, data identification method, device, equipment and medium
CN111598213B (en) * 2020-04-01 2024-01-23 北京迈格威科技有限公司 Network training method, data identification method, device, equipment and medium

Also Published As

Publication number Publication date
CN110969250A (en) 2020-04-07
CN107358293B (en) 2021-04-02
CN110969250B (en) 2023-11-10
CN107358293A (en) 2017-11-17

Similar Documents

Publication Publication Date Title
WO2018227800A1 (en) Neural network training method and device
US11651259B2 (en) Neural architecture search for convolutional neural networks
CN108805258B (en) Neural network training method and device and computer server
US11295208B2 (en) Robust gradient weight compression schemes for deep learning applications
WO2018090706A1 (en) Method and device of pruning neural network
US20230259784A1 (en) Regularized neural network architecture search
CN104346629B (en) A kind of model parameter training method, apparatus and system
CN110503192A (en) The effective neural framework of resource
CN111406267A (en) Neural architecture search using performance-predictive neural networks
CN113168559A (en) Automated generation of machine learning models
WO2018227801A1 (en) Method and device for building neural network
US11093714B1 (en) Dynamic transfer learning for neural network modeling
WO2022105108A1 (en) Network data classification method, apparatus, and device, and readable storage medium
CN113449859A (en) Data processing method and device
US20210117781A1 (en) Method and apparatus with neural network operation
CN111008631A (en) Image association method and device, storage medium and electronic device
US20210049474A1 (en) Neural network method and apparatus
CN114072809A (en) Small and fast video processing network via neural architectural search
EP4009239A1 (en) Method and apparatus with neural architecture search based on hardware performance
Wang et al. Towards efficient convolutional neural networks through low-error filter saliency estimation
CN112052865A (en) Method and apparatus for generating neural network model
CN116384471A (en) Model pruning method, device, computer equipment, storage medium and program product
CN110457155A (en) A kind of modification method, device and the electronic equipment of sample class label
US20220138554A1 (en) Systems and methods utilizing machine learning techniques for training neural networks to generate distributions
CN114861671A (en) Model training method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17913592

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17913592

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 29.04.2020)

122 Ep: pct application non-entry in european phase

Ref document number: 17913592

Country of ref document: EP

Kind code of ref document: A1