Neural network training method and device
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on June 15, 2017 under Application No. 201710450211.9 and entitled "Neural network training method and device", the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the field of computer vision, and in particular to a neural network training method and device.
Background
In recent years, deep neural networks have achieved great success in various computer vision applications, such as image classification, object detection, and image segmentation. However, deep neural network models often contain a large number of parameters, are computationally intensive, and are slow to run, so they cannot perform real-time computation on low-power devices with limited computing capability (such as embedded devices and integrated devices).
At present, several solutions have been proposed to address this problem. For example, knowledge transfer can be used to migrate the knowledge of a teacher network (which generally has a complex network structure, high accuracy, and slow computation) into a student network (whose network structure is relatively simple, with lower accuracy and fast computation), thereby improving the performance of the student network. The resulting student network can then be applied on devices with low power consumption and limited computing capability.
Knowledge transfer is a general technique for compressing and accelerating deep neural network models. At present there are three main knowledge transfer methods: the Knowledge Distill (KD) method proposed in the paper "Distilling the knowledge in a neural network" published by Hinton et al. in 2014; FitNets, proposed in the paper "Fitnets: Hints for thin deep nets" published by Romero et al. in 2015; and the Attention Transfer (AT) method proposed in the paper "Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer" published by Sergey Zagoruyko in 2016.
Existing knowledge transfer methods train the student network using information from individual samples in the teacher network's output data. Although the trained student network improves somewhat in performance, there is still considerable room for improvement.
Interpretation of related terms:
Knowledge Transfer: In deep neural networks, knowledge transfer refers to using the output of an intermediate or final network layer of a teacher network on training sample data to assist in training a student network that is faster but performs worse, thereby transferring the good performance of the teacher network to the student network.
Knowledge Distill: In deep neural networks, knowledge distillation refers to the technique of training a student network in classification problems using the smoothed class posterior probabilities output by a teacher network.
Teacher Network: A high-performance neural network used during knowledge transfer to provide more accurate supervision information for the student network.
Student Network: A single neural network that computes quickly but performs worse, suitable for deployment in practical application scenarios with high real-time requirements. Compared with the teacher network, the student network has greater computational throughput and fewer model parameters.
Summary of the Invention
In view of the above problems, the present invention provides a neural network training method and device to further improve the performance and accuracy of the student network.
In one aspect, an embodiment of the present invention provides a neural network training method, the method including:
selecting a teacher network that implements the same function as the student network;
iteratively training the student network to obtain a target network by matching the inter-data similarity of the first output data with the inter-data similarity of the second output data corresponding to the same training sample data, so as to transfer the inter-data similarity of the teacher network's output to the student network;
wherein the first output data is the data output from a first specific network layer of the teacher network after the training sample data is input into the teacher network, and the second output data is the data output from a second specific network layer of the student network after the training sample data is input into the student network.
In another aspect, an embodiment of the present invention provides a neural network training device, the device including:
a selection unit, configured to select a teacher network that implements the same function as the student network;
a training unit, configured to iteratively train the student network to obtain a target network by matching the inter-data similarity of the first output data with the inter-data similarity of the second output data corresponding to the same training sample data, so as to transfer the inter-data similarity of the teacher network's output to the student network;
wherein the first output data is the data output from a first specific network layer of the teacher network after the training sample data is input into the teacher network, and the second output data is the data output from a second specific network layer of the student network after the training sample data is input into the student network.
In another aspect, an embodiment of the present invention provides a neural network training device, the device including a processor and at least one memory, the at least one memory storing at least one machine-executable instruction, the processor executing the at least one instruction to implement:
selecting a teacher network that implements the same function as the student network;
iteratively training the student network to obtain a target network by matching the inter-data similarity of the first output data with the inter-data similarity of the second output data corresponding to the same training sample data, so as to transfer the inter-data similarity of the teacher network's output to the student network;
wherein the first output data is the data output from a first specific network layer of the teacher network after the training sample data is input into the teacher network, and the second output data is the data output from a second specific network layer of the student network after the training sample data is input into the student network.
In embodiments of the present invention, the inter-data similarity information of the teacher network's output on the training sample data can be comprehensively transferred to the student network, so that the output of the target network on the training sample data is essentially consistent with the output of the teacher network. Owing to the good generalization ability of neural networks, the outputs of the trained target network and of the teacher network are also essentially the same on the test set, thereby improving the accuracy of the student network.
Other features and advantages of the invention will be set forth in the description that follows, and will in part become apparent from the description or be understood by practicing the invention. The objectives and other advantages of the invention may be realized and obtained by the structures particularly pointed out in the written description, the claims, and the accompanying drawings.
The technical solution of the present invention is described in further detail below with reference to the accompanying drawings and embodiments.
Brief Description of the Drawings
The accompanying drawings are provided to facilitate a further understanding of the invention and constitute a part of the specification; together with the embodiments of the invention, they serve to explain the invention and do not limit it. Obviously, the drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort. In the drawings:
Figure 1 is a flowchart of a neural network training method according to an embodiment of the present invention;
Figure 2 is a flowchart of training a student network according to an embodiment of the present invention;
Figure 3 is a schematic structural diagram of a neural network training device according to an embodiment of the present invention;
Figure 4 is a schematic structural diagram of a training unit according to an embodiment of the present invention;
Figure 5 is a schematic structural diagram of a neural network training device according to an embodiment of the present invention.
Detailed Description
To enable those skilled in the art to better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
The above is the core idea of the present invention. To enable those skilled in the art to better understand the technical solutions in the embodiments of the present invention, and to make the above objects, features, and advantages of the embodiments more apparent, the technical solutions in the embodiments are described in further detail below with reference to the accompanying drawings.
Embodiment 1
Referring to Figure 1, which is a flowchart of a neural network training method according to an embodiment of the present invention, the method includes:
Step 101: Select a teacher network that implements the same function as the student network.
The function implemented may be, for example, image classification, object detection, or image segmentation. The teacher network has excellent performance and high accuracy, but compared with the student network its structure is complex, it has more parameter weights, and it computes more slowly. The student network computes quickly, performs moderately or poorly, and has a simple network structure. A network that implements the same function as the student network and performs well can be selected as the teacher network from a preset collection of neural network models.
Step 102: Iteratively train the student network to obtain a target network by matching the inter-data similarity of the first output data with the inter-data similarity of the second output data corresponding to the same training sample data, so as to transfer the inter-data similarity of the teacher network's output to the student network.
Here, the first output data is the data output from a first specific network layer of the teacher network after the training sample data is input into the teacher network, and the second output data is the data output from a second specific network layer of the student network after the training sample data is input into the student network.
In embodiments of the present invention, after the training sample data is input into the teacher network, the data output from the first specific network layer of the teacher network is collectively referred to as the first output data; after the training sample data is input into the student network, the data output from the second specific network layer of the student network is collectively referred to as the second output data.
Preferably, in embodiments of the present invention, the first specific network layer is an intermediate network layer or the last network layer of the teacher network.
Preferably, in embodiments of the present invention, the second specific network layer is an intermediate network layer or the last network layer of the student network.
Preferably, step 102 may be implemented as the method flow shown in Figure 2, which specifically includes:
Step 102A: Construct the objective function of the student network, the objective function including a matching function between the inter-data similarity of the first output data and the inter-data similarity of the second output data corresponding to the training sample data.
Step 102B: Iteratively train the student network using the training sample data.
Step 102C: When the number of training iterations reaches a threshold or the objective function satisfies a preset convergence condition, the target network is obtained.
Preferably, step 102B may be implemented as follows:
Perform the following iterative training on the student network multiple times (hereinafter referred to as the current iteration; the training sample data used in the current iteration is referred to as the current training sample data; each iteration includes the following steps A, B, C, D, E, and F):
Step A: Input the current training sample data for this iteration into the teacher network and the student network respectively, obtaining the corresponding first output data and second output data;
Step B: Compute the similarities between the data in the first output data and the similarities between the data in the second output data;
Step C: Compute, from the similarities between the data in the first output data, the probabilities of all permutations of the data in the first output data, and select one or more target permutations from all permutations of the data in the first output data;
Step D: Compute, from the similarities between the data in the second output data, the probabilities of the target permutations of the data in the second output data;
Step E: Compute the value of the objective function from the probabilities of the target permutations of the data in the first output data and the probabilities of the target permutations of the data in the second output data, and adjust the weights of the student network according to the value of the objective function;
Step F: Perform the next training iteration with the student network whose weights have been adjusted.
Preferably, in embodiments of the present invention, the target permutations in step C are selected from all permutations of the data in the first output data in ways including, but not limited to, the following two:
Mode 1: From all permutations of the data in the first output data, select the permutations whose probability exceeds a preset threshold as the target permutations.
Mode 2: From all permutations of the data in the first output data, select a preset number of permutations with the highest probabilities as the target permutations.
In embodiments of the present invention, one or more target permutations may be selected; this application does not strictly limit the number.
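The two selection modes can be sketched as follows; the permutation labels and probability values below are illustrative placeholders, not values prescribed by the embodiment:

```python
def select_by_threshold(perm_probs, threshold):
    """Mode 1: keep every permutation whose probability exceeds a preset threshold."""
    return [p for p, prob in perm_probs.items() if prob > threshold]

def select_top_k(perm_probs, k):
    """Mode 2: keep the k permutations with the highest probabilities."""
    ranked = sorted(perm_probs, key=perm_probs.get, reverse=True)
    return ranked[:k]

# Illustrative probabilities over the 6 permutations of 3 data.
probs = {"123": 0.40, "132": 0.25, "213": 0.15,
         "231": 0.10, "312": 0.06, "321": 0.04}
print(select_by_threshold(probs, 0.2))  # permutations with probability > 0.2
print(select_top_k(probs, 3))           # the three most probable permutations
```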
Preferably, in step B, computing the similarities between the data in the first output data (or second output data) specifically includes: computing the spatial distance between each pair of data in the first output data (or second output data), and obtaining the similarity between the pair of data from the spatial distance.
In embodiments of the present invention, the spatial distance may be the Euclidean distance, cosine distance, city-block distance, Mahalanobis distance, or the like; this application does not strictly limit it. Computing the Euclidean distance and the cosine distance between pairs of data is taken as an example.
The similarity based on the Euclidean distance between any two data xi and xj is computed by the following formula (1):
Sij = -α‖xi - xj‖2^β + γ   (1)
In formula (1), α is a preset scale factor, β is a preset contrast stretching factor, γ is an offset, and ‖·‖2 denotes the l2 norm of a vector.
The similarity based on the cosine distance between any two data xi and xj is computed by the following formula (2):
Sij = α(xi·xj)^β + γ   (2)
In formula (2), α is a preset scale factor, β is a preset contrast stretching factor, γ is an offset, and · denotes the dot product between vectors.
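As a sketch of the two similarity measures above (a minimal NumPy implementation; the α, β, γ values and the sign convention in the Euclidean form are illustrative assumptions, with the negative sign reflecting that a smaller distance means a higher similarity):

```python
import numpy as np

def euclidean_similarity(X, alpha=1.0, beta=1.0, gamma=0.0):
    """Pairwise similarity from Euclidean distances, after formula (1):
    S_ij = -alpha * ||x_i - x_j||_2 ** beta + gamma."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # l2 distances
    return -alpha * d ** beta + gamma

def cosine_similarity(X, alpha=1.0, beta=1.0, gamma=0.0):
    """Pairwise similarity from dot products, after formula (2):
    S_ij = alpha * (x_i . x_j) ** beta + gamma."""
    dots = X @ X.T
    return alpha * dots ** beta + gamma

# Three 2-dimensional data as rows of a matrix (illustrative values).
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
S = cosine_similarity(X)
print(S)  # symmetric 3x3 similarity matrix
```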
Preferably, in step C, the probabilities of all permutations of the data in the first output data are computed from the similarities between the data in the first output data as follows: for each permutation, the order information of the permutation and the similarities between all adjacent pairs of data in that permutation of the first output data are input into a preset probability calculation model to obtain the probability of the permutation.
Take one training sample data y = {y1, y2, y3} as an example. Inputting y into the teacher network yields the corresponding first output data x = {x1, x2, x3}; the pairwise similarities within x are s12 (between x1 and x2), s13 (between x1 and x3), and s23 (between x2 and x3). The number of permutations of x1, x2, x3 is 3! = 6, namely π1 = x1→x2→x3, π2 = x1→x3→x2, π3 = x2→x1→x3, π4 = x2→x3→x1, π5 = x3→x1→x2, π6 = x3→x2→x1; the probabilities of these six permutations are computed from the similarities between the data.
The target permutations selected for the first output data corresponding to different training sample data may be the same or different. Taking the foregoing x as an example, the target permutations for the first output data of a first training sample may be π1 = x1→x2→x3, π2 = x1→x3→x2, π3 = x2→x1→x3, while the target permutations for the first output data of a second training sample may be π3 = x2→x1→x3, π4 = x2→x3→x1, π5 = x3→x1→x2.
Preferably, in step D, the probabilities of the target permutations of the data in the second output data are computed from the similarities between the data in the second output data as follows: for each target permutation, the order information of the target permutation and the similarities between all adjacent pairs of data in that permutation of the second output data are input into the probability calculation model to obtain the probability of the target permutation.
In embodiments of the present invention, the probability calculation model may be a first-order Plackett probability model, a higher-order Plackett probability model, or another model capable of computing probabilities; this application does not strictly limit it.
Computing permutation probabilities with the first-order Plackett probability model is described below as an example.
Assume the first output data corresponding to a certain training sample is x = {x1, x2, x3, x4}, and take the probabilities of the permutations π1 and π2 as examples, where π1 = x1→x2→x3→x4 and π2 = x1→x3→x4→x2. The first-order Plackett probability model yields P(π1|x) and P(π2|x), where f(·) is an arbitrary linear or nonlinear mapping function, and the probabilities of all permutations sum to 1.
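The concrete probability expressions are not reproduced above. A minimal instantiation consistent with the description, assuming f(·) = exp, multiplying the mapped similarities of adjacent pairs along a permutation, and normalizing so that the probabilities of all permutations sum to 1 (a hypothetical sketch, not necessarily the patent's exact factorization), is:

```python
import itertools
import math

def perm_probabilities(S, f=math.exp):
    """Probability of every permutation of n data from a pairwise similarity
    matrix S: weight(pi) = product over adjacent pairs of f(S[a][b]),
    normalized over all n! permutations so the probabilities sum to 1."""
    n = len(S)
    weights = {}
    for pi in itertools.permutations(range(n)):
        w = 1.0
        for a, b in zip(pi, pi[1:]):   # adjacent pairs in the permutation
            w *= f(S[a][b])
        weights[pi] = w
    total = sum(weights.values())
    return {pi: w / total for pi, w in weights.items()}

# Four data as in the x = {x1, x2, x3, x4} example (similarities illustrative).
S = [[0.0, 0.9, 0.2, 0.1],
     [0.9, 0.0, 0.5, 0.3],
     [0.2, 0.5, 0.0, 0.8],
     [0.1, 0.3, 0.8, 0.0]]
P = perm_probabilities(S)
print(sum(P.values()))                     # all 4! = 24 probabilities sum to 1
print(P[(0, 1, 2, 3)] > P[(0, 2, 3, 1)])  # pi1 vs pi2 from the example
```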
In embodiments of the present invention, there may be one target permutation or multiple target permutations.
In embodiments of the present invention, the objective function of the student network may contain only a matching function, or it may be the sum of a matching function and a task loss function; the expression of the task loss function is related to the task the student network is to implement, and may, for example, be the same as the objective function of the teacher network. The matching function may take, but is not limited to, the forms of formulas (3) and (4) below.
Example 1: When there is a single target permutation, the objective function of the student network may be set as shown in the following formula (3):
L = -log P(πt|Xs)   (3)
In formula (3), πt is the target permutation of the data in the first output data corresponding to the current training sample data, Xs is the second output data corresponding to the current training sample data, and P(πt|Xs) is the probability of the target permutation of the data in the second output data.
Preferably, the target permutation πt is the permutation with the highest probability among all permutations of the data in the first output data of the current training sample data.
When there are multiple target permutations, embodiments of the present invention may train the student network by matching the probability distributions over the target permutations. There are various ways to match the probability distributions over multiple target permutations, for example based on the total variation distance, Wasserstein distance, Jensen-Shannon divergence, or Kullback-Leibler divergence between distributions.
Taking the Kullback-Leibler divergence between probability distributions as an example, the objective function of the student network may be expressed as in the following formula (4):
L = Σπ∈Q P(π|Xt) log( P(π|Xt) / P(π|Xs) )   (4)
In formula (4), π is a target permutation, Xs is the second output data corresponding to the current training sample data, Xt is the first output data corresponding to the current training sample data, P(π|Xs) is the probability of the permutation π of the data in the second output data of the current training sample data, P(π|Xt) is the probability of the permutation π of the data in the first output data of the current training sample data, and Q is the set of target permutations.
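Formula (4) can be sketched directly; the two permutation distributions below are illustrative placeholder values:

```python
import math

def kl_matching_loss(p_teacher, p_student, Q):
    """Formula (4): L = sum over pi in Q of P(pi|X_t) * log(P(pi|X_t) / P(pi|X_s))."""
    return sum(p_teacher[pi] * math.log(p_teacher[pi] / p_student[pi]) for pi in Q)

Q = ["pi1", "pi2", "pi3"]
p_t = {"pi1": 0.5, "pi2": 0.3, "pi3": 0.2}  # teacher permutation probabilities
p_s = {"pi1": 0.4, "pi2": 0.4, "pi3": 0.2}  # student permutation probabilities
print(kl_matching_loss(p_t, p_s, Q))        # positive when the distributions differ
print(kl_matching_loss(p_t, p_t, Q))        # 0.0 when they match exactly
```

Minimizing this loss over the student's weights drives the student's permutation distribution toward the teacher's.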
Preferably, adjusting the weights of the student network according to the value of the objective function in step E specifically includes: using a preset gradient-descent optimization algorithm to adjust the weights of the student network according to the value of the objective function.
Preferably, the following step is further included between step A and step B: processing the first output data and the second output data by a downsampling algorithm and an interpolation algorithm, so that the spatial dimensions of the first output data agree with the spatial dimensions of the second output data, and the number of first output data and the number of second output data each agree with the number of current training sample data. Of course, if the first output data and the second output data obtained in step A already have the same spatial dimensions, and the numbers of first output data and second output data agree with the number of current training sample data, there is no need to add this step between step A and step B; that is, step B is executed directly after step A. The aforementioned spatial dimensions generally refer to the number of input data, the number of channels, and the height and width of the feature maps.
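A sketch of this dimension-matching step, assuming feature maps shaped (N, C, H, W) whose sizes differ by powers of two; the choice of 2x2 average pooling here is illustrative (the embodiment equally allows interpolation):

```python
import numpy as np

def average_pool_2x2(x):
    """Downsample an (N, C, H, W) feature map by 2x2 average pooling."""
    n, c, h, w = x.shape
    return x.reshape(n, c, h // 2, 2, w // 2, 2).mean(axis=(3, 5))

def match_spatial_dims(teacher_out, student_out):
    """Repeatedly pool whichever map is larger until H and W agree
    (assumes the two sizes differ by a power of two)."""
    while teacher_out.shape[2] > student_out.shape[2]:
        teacher_out = average_pool_2x2(teacher_out)
    while student_out.shape[2] > teacher_out.shape[2]:
        student_out = average_pool_2x2(student_out)
    return teacher_out, student_out

t = np.random.rand(3, 8, 16, 16)   # teacher: 3 samples, 8 channels, 16x16 maps
s = np.random.rand(3, 8, 8, 8)     # student: 3 samples, 8 channels, 8x8 maps
t2, s2 = match_spatial_dims(t, s)
print(t2.shape, s2.shape)          # both (3, 8, 8, 8)
```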
It should be noted that the foregoing steps A to F need not follow a strict order; steps A and B may also be replaced by the following steps A' and B'.
步骤A’、将用于本次迭代训练的当前训练样本数据输入教师网络,得到对应的第一输出数据,并计算第一输出数据中各数据间的相似度;Step A', inputting current training sample data for the iterative training into the teacher network, obtaining corresponding first output data, and calculating a similarity between the data in the first output data;
步骤B’、将所述当前训练样本数据输入学生网络,得到对应的第二输出数据,并计算第二输出数据中各数据间的相似度。Step B', inputting the current training sample data into the student network, obtaining corresponding second output data, and calculating a similarity between the data in the second output data.
Assume the three training sample data used to train the student network (denoted S) are y1 = {y11, y12, y13}, y2 = {y21, y22, y23}, and y3 = {y31, y32, y33}; the three training sample data are input into the teacher network (denoted T) to obtain the corresponding first output data in turn, and input into the student network to obtain the corresponding second output data in turn.
In this embodiment of the present invention, all permutations of the data in the first output data are taken as the target permutations. For the i-th training sample data, the set of target permutations of its first output data is formed, and the probability of each of these target permutations is computed from the similarities between the data in the first output data; likewise, the set of target permutations of its second output data is formed, and the probability of each of these target permutations is computed from the similarities between the data in the second output data.
Since the first output data and the second output data corresponding to the same training sample data contain the same number of data, a permutation of the first output data and the permutation of the second output data that arrange the data in the same order are regarded as the same target permutation. For example, a permutation of the second output data of the i-th training sample and the corresponding permutation of its first output data are regarded as the same target permutation, denoted πi1; the set of target permutations of the first output data and the second output data of the i-th training sample is then expressed as Qi = {πi1, πi2, πi3, πi4, πi5, πi6}.
执行以下多次迭代训练:Perform the following iterations of training:
第一次迭代训练:将y1输入教师网络和学生网络,得到对应的第一输出数据和第二输出数据;计算第一输出数据中各数据之间的相似度以及计算第二输出数据中各数据之间的相似度;根据第一输出数据中各数据间的相似度计算其中各数据的所有排列顺序的概率,将该所有排列顺序作为目标排列顺序;根据第二输出数据中各数据间的相似度计算得到其中各数据的目标排列顺序的概率;将y1对应的第一输出数据中各数据的目标排列顺序的概率和第二输出数据中各数据的目标排列顺序的概率输入至目标函数中,计算得到目标函数的取值为L1,根据该L1调整学生网络当前权重W0,得到调整后的权重W1;First iteration: input y1 into the teacher network and the student network to obtain the corresponding first output data and second output data; calculate the similarities between the data in the first output data and the similarities between the data in the second output data; calculate, from the similarities between the data in the first output data, the probabilities of all arrangement orders of those data, and take all of these arrangement orders as the target arrangement orders; calculate, from the similarities between the data in the second output data, the probability of each target arrangement order of those data; input the probabilities of the target arrangement orders of the first output data and of the second output data corresponding to y1 into the objective function to obtain its value L1, and adjust the current weights W0 of the student network according to L1 to obtain the adjusted weights W1.
第二次迭代训练:将y2输入教师网络和学生网络,得到对应的第一输出数据和第二输出数据;计算第一输出数据中各数据之间的相似度以及计算第二输出数据中各数据之间的相似度;根据第一输出数据中各数据间的相似度计算其中各数据的所有排列顺序的概率,将该所有排列顺序作为目标排列顺序;根据第二输出数据中各数据间的相似度计算得到其中各数据的目标排列顺序的概率;将y2对应的第一输出数据中各数据的目标排列顺序的概率和第二输出数据中各数据的目标排列顺序的概率输入至目标函数中,计算得到目标函数的取值为L2,根据该L2调整学生网络当前权重W1,得到调整后的权重W2;Second iteration: input y2 into the teacher network and the student network to obtain the corresponding first output data and second output data; calculate the similarities between the data in the first output data and the similarities between the data in the second output data; calculate, from the similarities between the data in the first output data, the probabilities of all arrangement orders of those data, and take all of these arrangement orders as the target arrangement orders; calculate, from the similarities between the data in the second output data, the probability of each target arrangement order of those data; input the probabilities of the target arrangement orders of the first output data and of the second output data corresponding to y2 into the objective function to obtain its value L2, and adjust the current weights W1 of the student network according to L2 to obtain the adjusted weights W2.
第三次迭代训练:将y3输入教师网络和学生网络,得到对应的第一输出数据和第二输出数据;计算第一输出数据中各数据之间的相似度以及计算第二输出数据中各数据之间的相似度;根据第一输出数据中各数据间的相似度计算其中各数据的所有排列顺序的概率,将该所有排列顺序作为目标排列顺序;根据第二输出数据中各数据间的相似度计算得到其中各数据的目标排列顺序的概率;将y3对应的第一输出数据中各数据的目标排列顺序的概率和第二输出数据中各数据的目标排列顺序的概率输入至目标函数中,计算得到目标函数的取值为L3,根据该L3调整学生网络当前权重W2,得到调整后的权重W3。Third iteration: input y3 into the teacher network and the student network to obtain the corresponding first output data and second output data; calculate the similarities between the data in the first output data and the similarities between the data in the second output data; calculate, from the similarities between the data in the first output data, the probabilities of all arrangement orders of those data, and take all of these arrangement orders as the target arrangement orders; calculate, from the similarities between the data in the second output data, the probability of each target arrangement order of those data; input the probabilities of the target arrangement orders of the first output data and of the second output data corresponding to y3 into the objective function to obtain its value L3, and adjust the current weights W2 of the student network according to L3 to obtain the adjusted weights W3.
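The three iterations above follow the same sequence of operations. The following is a minimal pure-Python sketch of one such iteration; the negative-Euclidean-distance similarity, the sequential-choice Plackett-style probability model, the cross-entropy matching objective, and the toy two-dimensional output vectors are all illustrative assumptions, since the embodiments leave these choices open.

```python
import itertools
import math

def pairwise_similarity(outputs):
    """sim[i][j] = negative Euclidean distance between output vectors i and j."""
    n = len(outputs)
    return [[-math.dist(outputs[i], outputs[j]) for j in range(n)] for i in range(n)]

def order_probability(order, sim):
    """Plackett-style probability of one arrangement order: the first datum is
    chosen uniformly, and each next datum is chosen from the remaining data with
    probability proportional to exp(similarity to the previous datum)."""
    prob = 1.0 / len(order)
    prev, remaining = order[0], list(order[1:])
    for nxt in order[1:]:
        z = sum(math.exp(sim[prev][r]) for r in remaining)
        prob *= math.exp(sim[prev][nxt]) / z
        remaining.remove(nxt)
        prev = nxt
    return prob

# Toy first and second output data for one training sample (three data each).
teacher_out = [(0.0, 1.0), (0.1, 0.9), (1.0, 0.0)]
student_out = [(0.0, 0.8), (0.3, 0.6), (0.9, 0.1)]

sim_t = pairwise_similarity(teacher_out)
sim_s = pairwise_similarity(student_out)

# All six arrangement orders of the three data serve as target orders here.
targets = list(itertools.permutations(range(3)))
p_t = [order_probability(pi, sim_t) for pi in targets]
p_s = [order_probability(pi, sim_s) for pi in targets]

# One possible matching objective: cross-entropy between the two order
# distributions over the target arrangement orders.
loss = -sum(pt * math.log(ps) for pt, ps in zip(p_t, p_s))
```

The resulting `loss` plays the role of L1; in a real implementation the current student weights would then be adjusted along its gradient.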
实施例二Embodiment 2
基于与前述实施例一提供的神经网络训练方法的相同构思,本发明实施例二提供一种神经网络训练装置,该装置的结构如图3所示,包括:Based on the same concept as the neural network training method provided in the foregoing first embodiment, the second embodiment of the present invention provides a neural network training device. The structure of the device is as shown in FIG. 3, and includes:
选取单元31,用于选取一个与学生网络实现相同功能的教师网络;The selecting unit 31 is configured to select a teacher network that implements the same function as the student network;
训练单元32,用于基于匹配同一训练样本数据对应的第一输出数据的数据间相似性与第二输出数据的数据间相似性来迭代训练所述学生网络得到目标网络,以实现将所述教师网络的输出数据间相似性迁移到所述学生网络;The training unit 32 is configured to iteratively train the student network to obtain a target network based on matching the inter-data similarity of the first output data corresponding to the same training sample data with the inter-data similarity of the second output data, so as to migrate the similarity between the output data of the teacher network to the student network;
其中:所述第一输出数据为所述训练样本数据输入教师网络后从教师网络的第一特定网络层输出的数据,所述第二输出数据为所述训练样本数据输入学生网络后从学生网络的第二特定网络层输出的数据。The first output data is the data output from a first specific network layer of the teacher network after the training sample data is input into the teacher network, and the second output data is the data output from a second specific network layer of the student network after the training sample data is input into the student network.
本发明实施例中,教师网络和学生网络所实现的功能如图像分类、目标检测、图像分割等。教师网络性能优良、准确率高,但是相对学生网络其结构复杂、参数权重较多、计算速度较慢。学生网络计算速度快、性能一般或者较差、网络结构简单。选取单元31可以在预先设置的神经网络模型的集合中选取一个与学生网络实现的功能相同且性能优良的网络作为教师网络。In this embodiment of the present invention, the functions implemented by the teacher network and the student network include image classification, target detection, image segmentation, and the like. The teacher network has excellent performance and high accuracy, but compared with the student network its structure is complex, it has more weight parameters, and its calculation speed is slower. The student network computes quickly, has ordinary or poor performance, and has a simple network structure. The selecting unit 31 may select, from a preset set of neural network models, a network that implements the same function as the student network and has excellent performance as the teacher network.
本发明实施例中,所述第一特定网络层为教师网络中的一个中间网络层或最后一层网络层;和/或,所述第二特定网络层为学生网络的一个中间网络层或最后一层网络层。In this embodiment of the present invention, the first specific network layer is an intermediate network layer or the last network layer of the teacher network; and/or the second specific network layer is an intermediate network layer or the last network layer of the student network.
优选地,训练单元32的结构如图4所示,具体包括构建模块321、训练模块322和确定模块323,其中:Preferably, the structure of the training unit 32 is as shown in FIG. 4, and specifically includes a construction module 321, a training module 322, and a determination module 323, where:
构建模块321,用于构建所述学生网络的目标函数,所述目标函数包含训练样本数据对应的第一输出数据的数据间相似性与第二输出数据的数据间相似性的匹配函数;The construction module 321 is configured to construct an objective function of the student network, where the objective function includes a matching function between the inter-data similarity of the first output data corresponding to the training sample data and the inter-data similarity of the second output data;
训练模块322,用于采用所述训练样本数据对所述学生网络进行迭代训练;The training module 322 is configured to perform iterative training on the student network by using the training sample data;
确定模块323,用于当训练模块322迭代训练次数达到阈值或者所述目标函数满足预置的收敛条件时,得到所述目标网络。The determining module 323 is configured to obtain the target network when the number of iterative trainings performed by the training module 322 reaches a threshold or the objective function satisfies a preset convergence condition.
优选地,训练模块322,具体用于:Preferably, the training module 322 is specifically configured to:
对所述学生网络进行多次以下迭代训练(以下称为本次迭代训练,将用于本次迭代训练的训练样本数据称为当前训练样本数据,本次迭代训练包括以下步骤A、步骤B、步骤C、步骤D、步骤E和步骤F):Perform the following iterative training on the student network multiple times (hereinafter referred to as the current iterative training; the training sample data used for the current iterative training is referred to as the current training sample data; the current iterative training includes the following Step A, Step B, Step C, Step D, Step E, and Step F):
步骤A、将用于本次迭代训练的当前训练样本数据分别输入所述教师网络和学生网络,得到对应的第一输出数据和第二输出数据;Step A: input current training sample data for the current iteration training into the teacher network and the student network, respectively, to obtain corresponding first output data and second output data;
步骤B、计算第一输出数据中各数据间的相似度以及计算第二输出数据中各数据间的相似度;Step B: calculating a similarity between each data in the first output data and calculating a similarity between the data in the second output data;
步骤C、根据第一输出数据中各数据间的相似度计算第一输出数据中各数据的所有排列顺序的概率,并从所述第一输出数据中各数据的所有排列顺序中选取目标排列顺序;Step C: Calculate the probabilities of all arrangement orders of the data in the first output data according to the similarities between the data in the first output data, and select target arrangement orders from all the arrangement orders of the data in the first output data;
步骤D、根据第二输出数据中各数据间的相似度计算第二输出数据中各数据的目标排列顺序的概率;Step D: calculating, according to the similarity between each data in the second output data, a probability of a target arrangement order of each data in the second output data;
步骤E、根据第一输出数据中各数据的目标排列顺序的概率和第二输出数据中各数据的目标排列顺序的概率计算所述目标函数的取值,并根据所述目标函数的取值调整所述学生网络的权重;Step E: calculating a value of the target function according to a probability of a target arrangement order of each data in the first output data and a probability of a target arrangement order of each data in the second output data, and adjusting according to the value of the target function The weight of the student network;
步骤F、基于调整权重后的学生网络进行下一次迭代训练。Step F: Perform the next iterative training based on the student network after adjusting the weight.
优选地,训练模块322从第一输出数据中各数据的所有排列顺序中选取目标排列顺序,具体包括:从第一输出数据中各数据的所有排列顺序中选取概率取值大于预置阈值的排列顺序作为目标排列顺序;或者,从第一输出数据中各数据的所有排列顺序中选取概率取值排在前面的预置数量的排列顺序作为目标排列顺序。Preferably, the training module 322 selects the target arrangement orders from all the arrangement orders of the data in the first output data, specifically including: selecting, from all the arrangement orders of the data in the first output data, the arrangement orders whose probability values are greater than a preset threshold as the target arrangement orders; or selecting, from all the arrangement orders of the data in the first output data, a preset number of arrangement orders whose probability values rank highest as the target arrangement orders.
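As an illustration of the two selection rules, the following sketch returns either the orders above a probability threshold or a preset number of the most probable orders; the example permutation probabilities are hypothetical values assumed for demonstration.

```python
def select_target_orders(order_probs, threshold=None, top_k=None):
    """Select target arrangement orders either by a probability threshold or
    by keeping a preset number of the most probable orders.

    order_probs maps an arrangement order (a tuple of data indices) to the
    probability computed for it from the first output data.
    """
    if threshold is not None:
        return [o for o, p in order_probs.items() if p > threshold]
    ranked = sorted(order_probs, key=order_probs.get, reverse=True)
    return ranked[:top_k]

# Hypothetical probabilities of the six arrangement orders of three data.
probs = {(0, 1, 2): 0.40, (0, 2, 1): 0.25, (1, 0, 2): 0.15,
         (1, 2, 0): 0.10, (2, 0, 1): 0.06, (2, 1, 0): 0.04}
```

For example, a threshold of 0.2 keeps the first two orders, while `top_k=3` keeps the three most probable ones.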
优选地,所述训练模块322计算第一输出数据中各数据间的相似度,具体包括:计算第一输出数据中两两数据之间的空间距离,根据所述空间距离得到所述两两数据间的相似度;所述训练模块322计算第二输出数据中各数据间的相似度,具体包括:计算第二输出数据中两两数据之间的空间距离,根据所述空间距离得到所述两两数据间的相似度。Preferably, the training module 322 calculates the similarities between the data in the first output data by calculating the spatial distance between every two data in the first output data and deriving the similarity between the two data from the spatial distance; the training module 322 calculates the similarities between the data in the second output data by calculating the spatial distance between every two data in the second output data and deriving the similarity between the two data from the spatial distance.
本发明实施例中,所述空间距离可以是欧氏距离、余弦距离、街区距离或马氏距离等,本申请不做严格限定。以计算两两数据之间的欧氏距离和余弦距离为例。In this embodiment of the present invention, the spatial distance may be a Euclidean distance, a cosine distance, a city-block distance, a Mahalanobis distance, or the like, which is not strictly limited in the present application. Calculating the Euclidean distance and the cosine distance between every two data is taken as an example.
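The two example distances can be sketched as follows; the mapping 1/(1+d) from Euclidean distance to similarity is an illustrative choice, not one fixed by the embodiments.

```python
import math

def euclidean_similarity(a, b):
    """Map the Euclidean distance between two output vectors to a similarity
    in (0, 1]; the mapping 1 / (1 + d) is assumed here for illustration."""
    return 1.0 / (1.0 + math.dist(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two output vectors (the cosine distance
    is 1 minus this value)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))
```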
优选地,所述训练模块322根据第一输出数据中各数据间的相似度计算第一输出数据中各数据的所有排列顺序的概率,具体包括:针对每个排列顺序,将所述排列顺序的顺序信息以及第一输出数据的该排列顺序中所有相邻两个数据间的相似度输入预置的概率计算模型中,得到所述排列顺序的概率;Preferably, the training module 322 calculates the probabilities of all arrangement orders of the data in the first output data according to the similarities between the data in the first output data, specifically including: for each arrangement order, inputting the order information of the arrangement order and the similarities between all adjacent pairs of data of the first output data in that arrangement order into a preset probability calculation model to obtain the probability of the arrangement order;
所述训练模块322根据第二输出数据中各数据间的相似度计算第二输出数据中各数据的目标排列顺序的概率,具体包括:针对每一个目标排列顺序,将所述目标排列顺序的顺序信息以及第二输出数据的该目标排列顺序中所有相邻两个数据间的相似度输入所述概率计算模型中,得到所述目标排列顺序的概率。The training module 322 calculates the probabilities of the target arrangement orders of the data in the second output data according to the similarities between the data in the second output data, specifically including: for each target arrangement order, inputting the order information of the target arrangement order and the similarities between all adjacent pairs of data of the second output data in that target arrangement order into the probability calculation model to obtain the probability of the target arrangement order.
本发明实施例中,所述概率计算模型可以为一阶Plackett概率模型,也可以为高阶Plackett概率模型,还可以是其他能够计算概率的模型,本申请不做严格限定。In the embodiment of the present invention, the probability calculation model may be a first-order Plackett probability model, a high-order Plackett probability model, or other models capable of calculating a probability, which is not strictly limited.
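For illustration, one possible first-order Plackett-style model consistent with the description above, using only the order information and the similarities between adjacent data in the order, might look as follows; the exact functional form is an assumption, since the embodiments do not fix it.

```python
import math

def plackett_order_probability(order, sim):
    """First-order Plackett-style probability of an arrangement order.

    order: a tuple of data indices, e.g. (0, 2, 1); the first datum is
           taken as given.
    sim:   sim[i][j] is the similarity between data i and data j.
    At each step the next datum is drawn from the still-unplaced data with
    probability proportional to exp(similarity to the previous datum), so
    only the similarities between adjacent data in the order are used.
    """
    prob = 1.0
    prev, remaining = order[0], list(order[1:])
    for nxt in order[1:]:
        z = sum(math.exp(sim[prev][r]) for r in remaining)
        prob *= math.exp(sim[prev][nxt]) / z
        remaining.remove(nxt)
        prev = nxt
    return prob

# With all similarities equal, the orders sharing a first datum are equally likely.
uniform_sim = [[0.0] * 3 for _ in range(3)]
```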
本发明实施例中,所述目标排列顺序可以为一个,也可以为多个。当目标排列顺序为多个时,本发明实施例可以基于匹配多个目标排列顺序的概率分布的方式训练得到所述学生网络。本发明实施例中匹配多个目标排列顺序的概率分布的方法有多种,例如基于概率分布的全变分距离、Wasserstein距离、Jensen-Shannon散度或Kullback-Leibler散度等。In this embodiment of the present invention, there may be one target arrangement order or multiple target arrangement orders. When there are multiple target arrangement orders, the student network may be trained by matching the probability distributions over the multiple target arrangement orders. There are various ways to match the probability distributions over multiple target arrangement orders, for example, based on the total variation distance, the Wasserstein distance, the Jensen-Shannon divergence, or the Kullback-Leibler divergence between the probability distributions.
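As one concrete example of matching two distributions over target arrangement orders, the Kullback-Leibler divergence can be sketched as follows; the example probability values are hypothetical.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) between two probability
    distributions over the same set of target arrangement orders."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical teacher-side and student-side probabilities of three target orders.
p_teacher = [0.5, 0.3, 0.2]
p_student = [0.4, 0.4, 0.2]
```

Minimizing this divergence drives the student's order distribution toward the teacher's; it is zero exactly when the two distributions coincide.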
本发明实施例中,学生网络的目标函数可以仅包含一个匹配函数,该目标函数还可以是一个匹配函数与任务损失函数的和值,该任务损失函数的表达式与学生网络所要实现的任务相关,例如该任务损失函数可以与教师网络的目标函数相同。In this embodiment of the present invention, the objective function of the student network may include only a matching function, or may be the sum of a matching function and a task loss function. The expression of the task loss function is related to the task to be implemented by the student network; for example, the task loss function may be the same as the objective function of the teacher network.
优选地,所述训练模块322根据所述目标函数的取值调整所述学生网络的权重,具体包括:采用预置的梯度下降优化算法,根据所述目标函数的取值调整所述学生网络的权重。Preferably, the training module 322 adjusts the weights of the student network according to the value of the objective function, specifically including: adjusting the weights of the student network according to the value of the objective function by using a preset gradient descent optimization algorithm.
优选地,所述训练模块322进一步用于:在计算第一输出数据中各数据间的相似度以及计算第二输出数据中各数据间的相似度之前,通过下采样算法与插值算法对所述第一输出数据和第二输出数据进行处理,使得所述第一输出数据的空间维度与第二输出数据的空间维度一致,且第一输出数据的数量和第二输出数据的数量均与所述当前训练样本数据的数量一致。Preferably, the training module 322 is further configured to: before calculating the similarities between the data in the first output data and the similarities between the data in the second output data, process the first output data and the second output data by a downsampling algorithm and an interpolation algorithm, so that the spatial dimension of the first output data is consistent with the spatial dimension of the second output data, and the number of the first output data and the number of the second output data are both consistent with the number of the current training sample data.
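A minimal sketch of this alignment step, assuming 1-D output vectors and simple linear interpolation (real implementations would typically operate on multi-dimensional feature maps):

```python
def resize_linear(vec, new_len):
    """Resize a 1-D output vector to new_len by linear interpolation
    (downsampling when new_len is smaller, upsampling when it is larger),
    so that teacher and student outputs share the same spatial dimension."""
    if new_len == 1:
        return [float(vec[0])]
    scale = (len(vec) - 1) / (new_len - 1)
    out = []
    for i in range(new_len):
        pos = i * scale
        lo = int(pos)
        hi = min(lo + 1, len(vec) - 1)
        frac = pos - lo
        out.append(vec[lo] * (1.0 - frac) + vec[hi] * frac)
    return out
```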
需要说明的是,前述步骤A~步骤F没有严格的先后顺序,也可以用以下的步骤A’~步骤B’替代前述步骤A~步骤B。It should be noted that the above steps A to F have no strict sequence, and the following steps A' to B' may be used instead of the above steps A to B.
步骤A’、将用于本次迭代训练的当前训练样本数据输入教师网络,得到对应的第一输出数据,并计算第一输出数据中各数据间的相似度;Step A', inputting current training sample data for the iterative training into the teacher network, obtaining corresponding first output data, and calculating a similarity between the data in the first output data;
步骤B’、将所述当前训练样本数据输入学生网络,得到对应的第二输出数据,并计算第二输出数据中各数据间的相似度。Step B', inputting the current training sample data into the student network, obtaining corresponding second output data, and calculating a similarity between the data in the second output data.
实施例三Embodiment 3
基于与前述实施例一提供的神经网络训练方法的相同构思,本发明实施例三提供一种神经网络训练装置,该装置的结构如图5所示,包括:一个处理器501和至少一个存储器502,所述至少一个存储器502用于存储至少一条机器可执行指令,所述处理器501执行所述至少一条指令以实现:选取一个与学生网络实现相同功能的教师网络;基于匹配同一训练样本数据对应的第一输出数据的数据间相似性与第二输出数据的数据间相似性来迭代训练所述学生网络得到目标网络,以实现将所述教师网络的输出数据间相似性迁移到所述学生网络;其中:所述第一输出数据为所述训练样本数据输入教师网络后从教师网络的第一特定网络层输出的数据,所述第二输出数据为所述训练样本数据输入学生网络后从学生网络的第二特定网络层输出的数据。Based on the same concept as the neural network training method provided in the foregoing first embodiment, the third embodiment of the present invention provides a neural network training device. As shown in FIG. 5, the device includes a processor 501 and at least one memory 502. The at least one memory 502 is configured to store at least one machine-executable instruction, and the processor 501 executes the at least one instruction to: select a teacher network that implements the same function as the student network; and iteratively train the student network to obtain a target network based on matching the inter-data similarity of the first output data corresponding to the same training sample data with the inter-data similarity of the second output data, so as to migrate the inter-output-data similarity of the teacher network to the student network; where the first output data is the data output from a first specific network layer of the teacher network after the training sample data is input into the teacher network, and the second output data is the data output from a second specific network layer of the student network after the training sample data is input into the student network.
其中,所述处理器501执行所述至少一条指令实现基于匹配同一训练样本数据对应的第一输出数据的样本间相似性与第二输出数据的样本间相似性来迭代训练所述学生网络得到目标网络,具体包括:构建所述学生网络的目标函数,所述目标函数包含训练样本数据对应的第一输出数据的数据间相似性与第二输出数据的数据间相似性的匹配函数;采用所述训练样本数据对所述学生网络进行迭代训练;当迭代训练次数达到阈值或者所述目标函数满足预置的收敛条件时,得到所述目标网络。The processor 501 executes the at least one instruction to iteratively train the student network to obtain the target network based on matching the inter-sample similarity of the first output data corresponding to the same training sample data with the inter-sample similarity of the second output data, specifically including: constructing an objective function of the student network, where the objective function includes a matching function between the inter-data similarity of the first output data corresponding to the training sample data and the inter-data similarity of the second output data; performing iterative training on the student network by using the training sample data; and obtaining the target network when the number of iterative trainings reaches a threshold or the objective function satisfies a preset convergence condition.
其中,所述处理器501执行所述至少一条指令实现采用所述训练样本数据对所述学生网络进行迭代训练,具体包括:对所述学生网络进行多次以下迭代训练:将用于本次迭代训练的当前训练样本数据分别输入所述教师网络和学生网络,得到对应的第一输出数据和第二输出数据;计算第一输出数据中各数据间的相似度以及计算第二输出数据中各数据间的相似度;根据第一输出数据中各数据间的相似度计算第一输出数据中各数据的所有排列顺序的概率,并从所述第一输出数据中各数据的所有排列顺序中选取目标排列顺序;根据第二输出数据中各数据间的相似度计算第二输出数据中各数据的目标排列顺序的概率;根据第一输出数据中各数据的目标排列顺序的概率和第二输出数据中各数据的目标排列顺序的概率计算所述目标函数的取值,并根据所述目标函数的取值调整所述学生网络的权重;基于调整权重后的学生网络进行下一次迭代训练。The processor 501 executes the at least one instruction to perform iterative training on the student network by using the training sample data, specifically including: performing the following iterative training on the student network multiple times: inputting the current training sample data for the current iterative training into the teacher network and the student network respectively to obtain the corresponding first output data and second output data; calculating the similarities between the data in the first output data and the similarities between the data in the second output data; calculating the probabilities of all arrangement orders of the data in the first output data according to the similarities between the data in the first output data, and selecting target arrangement orders from all the arrangement orders of the data in the first output data; calculating the probabilities of the target arrangement orders of the data in the second output data according to the similarities between the data in the second output data; calculating the value of the objective function according to the probabilities of the target arrangement orders of the data in the first output data and the probabilities of the target arrangement orders of the data in the second output data, and adjusting the weights of the student network according to the value of the objective function; and performing the next iterative training based on the student network with adjusted weights.
其中,所述处理器501执行所述至少一条指令实现从第一输出数据中各数据的所有排列顺序中选取目标排列顺序,具体包括:从第一输出数据中各数据的所有排列顺序中选取概率取值大于预置阈值的排列顺序作为目标排列顺序;或者,从第一输出数据中各数据的所有排列顺序中选取概率取值排在前面的预置数量的排列顺序作为目标排列顺序。The processor 501 executes the at least one instruction to select the target arrangement orders from all the arrangement orders of the data in the first output data, specifically including: selecting, from all the arrangement orders of the data in the first output data, the arrangement orders whose probability values are greater than a preset threshold as the target arrangement orders; or selecting, from all the arrangement orders of the data in the first output data, a preset number of arrangement orders whose probability values rank highest as the target arrangement orders.
其中,所述处理器501执行所述至少一条指令实现计算第一输出数据中各数据间的相似度,具体包括:计算第一输出数据中两两数据之间的空间距离,根据所述空间距离得到所述两两数据间的相似度;计算第二输出数据中各数据间的相似度,具体包括:计算第二输出数据中两两数据之间的空间距离,根据所述空间距离得到所述两两数据间的相似度。The processor 501 executes the at least one instruction to calculate the similarities between the data in the first output data, specifically including: calculating the spatial distance between every two data in the first output data, and deriving the similarity between the two data from the spatial distance; and to calculate the similarities between the data in the second output data, specifically including: calculating the spatial distance between every two data in the second output data, and deriving the similarity between the two data from the spatial distance.
其中,所述处理器501执行所述至少一条指令实现根据第一输出数据中各数据间的相似度计算第一输出数据中各数据的所有排列顺序的概率,具体包括:针对每个排列顺序,将所述排列顺序的顺序信息以及第一输出数据的该排列顺序中所有相邻两个数据间的相似度输入预置的概率计算模型中,得到所述排列顺序的概率;根据第二输出数据中各数据间的相似度计算第二输出数据中各数据的目标排列顺序的概率,具体包括:针对每一个目标排列顺序,将所述目标排列顺序的顺序信息以及第二输出数据的该目标排列顺序中所有相邻两个数据间的相似度输入所述概率计算模型中,得到所述目标排列顺序的概率。The processor 501 executes the at least one instruction to calculate the probabilities of all arrangement orders of the data in the first output data according to the similarities between the data in the first output data, specifically including: for each arrangement order, inputting the order information of the arrangement order and the similarities between all adjacent pairs of data of the first output data in that arrangement order into a preset probability calculation model to obtain the probability of the arrangement order; and to calculate the probabilities of the target arrangement orders of the data in the second output data according to the similarities between the data in the second output data, specifically including: for each target arrangement order, inputting the order information of the target arrangement order and the similarities between all adjacent pairs of data of the second output data in that target arrangement order into the probability calculation model to obtain the probability of the target arrangement order.
其中,当所述目标排列顺序为一个时,所述学生网络的目标函数如下:Wherein, when the target arrangement order is one, the objective function of the student network is as follows:
L = -log P(πt|Xs)
式中,πt为当前训练样本数据对应的第一输出数据中各数据的目标排列顺序,Xs为当前训练样本数据对应的第二输出数据,P(πt|Xs)为第二输出数据中各数据的目标排列顺序的概率。Here, πt is the target arrangement order of the data in the first output data corresponding to the current training sample data, Xs is the second output data corresponding to the current training sample data, and P(πt|Xs) is the probability of the target arrangement order of the data in the second output data.
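In code, this single-target objective reduces to a negative log-probability; the probability itself would come from the chosen probability calculation model.

```python
import math

def single_target_loss(p_target_given_student):
    """L = -log P(pi_t | X_s): the negative log of the probability that the
    student's second output data assigns to the teacher's target order."""
    return -math.log(p_target_given_student)
```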
其中,当所述目标排列顺序为多个时,所述学生网络的目标函数如下:When there are multiple target arrangement orders, the objective function of the student network is as follows:
式中,π为一个目标排列顺序,Xs为当前训练样本数据对应的第二输出数据,Xt为当前训练样本数据对应的第一输出数据,P(π|Xs)为当前训练样本数据的第二输出数据中各数据的排列顺序为π的概率,P(π|Xt)为当前训练样本数据的第一输出数据中各数据的排列顺序为π的概率,Q为目标排列顺序的集合。Here, π is one target arrangement order, Xs is the second output data corresponding to the current training sample data, Xt is the first output data corresponding to the current training sample data, P(π|Xs) is the probability that the data in the second output data of the current training sample data are arranged in the order π, P(π|Xt) is the probability that the data in the first output data of the current training sample data are arranged in the order π, and Q is the set of target arrangement orders.
其中,所述处理器501执行所述至少一条指令实现根据所述目标函数的取值调整所述学生网络的权重,具体包括:采用预置的梯度下降优化算法,根据所述目标函数的取值调整所述学生网络的权重。The processor 501 executes the at least one instruction to adjust the weights of the student network according to the value of the objective function, specifically including: adjusting the weights of the student network according to the value of the objective function by using a preset gradient descent optimization algorithm.
其中,在所述处理器501执行所述至少一条指令实现计算第一输出数据中各数据间的相似度以及计算第二输出数据中各数据间的相似度之前,所述处理器还执行所述至少一条指令以实现:通过下采样算法与插值算法对所述第一输出数据和第二输出数据进行处理,使得所述第一输出数据的空间维度与第二输出数据的空间维度一致,且第一输出数据的数量和第二输出数据的数量均与所述当前训练样本数据的数量一致。Before the processor 501 executes the at least one instruction to calculate the similarities between the data in the first output data and the similarities between the data in the second output data, the processor further executes the at least one instruction to: process the first output data and the second output data by a downsampling algorithm and an interpolation algorithm, so that the spatial dimension of the first output data is consistent with the spatial dimension of the second output data, and the number of the first output data and the number of the second output data are both consistent with the number of the current training sample data.
其中,所述第一特定网络层为教师网络中的一个中间网络层或最后一层网络层;所述第二特定网络层为学生网络的一个中间网络层或最后一层网络层。The first specific network layer is an intermediate network layer or a last layer network layer in the teacher network; and the second specific network layer is an intermediate network layer or a last layer network layer of the student network.
基于与前述方法相同的构思,本发明实施例还提供一种存储介质(该存储介质可以是非易失性机器可读存储介质),该存储介质中存储有用于神经网络训练的计算机程序,该计算机程序具有被配置用于执行以下步骤的代码段:选取一个与学生网络实现相同功能的教师网络;基于匹配同一训练样本数据对应的第一输出数据的数据间相似性与第二输出数据的数据间相似性来迭代训练所述学生网络得到目标网络,以实现将所述教师网络的输出数据间相似性迁移到所述学生网络;其中:所述第一输出数据为所述训练样本数据输入教师网络后从教师网络的第一特定网络层输出的数据,所述第二输出数据为所述训练样本数据输入学生网络后从学生网络的第二特定网络层输出的数据。Based on the same concept as the foregoing method, an embodiment of the present invention further provides a storage medium (which may be a non-volatile machine-readable storage medium) storing a computer program for neural network training, where the computer program has code segments configured to perform the following steps: selecting a teacher network that implements the same function as the student network; and iteratively training the student network to obtain a target network based on matching the inter-data similarity of the first output data corresponding to the same training sample data with the inter-data similarity of the second output data, so as to migrate the inter-output-data similarity of the teacher network to the student network; where the first output data is the data output from a first specific network layer of the teacher network after the training sample data is input into the teacher network, and the second output data is the data output from a second specific network layer of the student network after the training sample data is input into the student network.
基于与前述方法相同的构思,本发明实施例还提供一种计算机程序,该计算机程序具有被配置用于执行以下神经网络训练的代码段:选取一个与学生网络实现相同功能的教师网络;基于匹配同一训练样本数据对应的第一输出数据的数据间相似性与第二输出数据的数据间相似性来迭代训练所述学生网络得到目标网络,以实现将所述教师网络的输出数据间相似性迁移到所述学生网络;其中:所述第一输出数据为所述训练样本数据输入教师网络后从教师网络的第一特定网络层输出的数据,所述第二输出数据为所述训练样本数据输入学生网络后从学生网络的第二特定网络层输出的数据。Based on the same concept as the foregoing method, an embodiment of the present invention further provides a computer program having code segments configured to perform the following neural network training: selecting a teacher network that implements the same function as the student network; and iteratively training the student network to obtain a target network based on matching the inter-data similarity of the first output data corresponding to the same training sample data with the inter-data similarity of the second output data, so as to migrate the inter-output-data similarity of the teacher network to the student network; where the first output data is the data output from a first specific network layer of the teacher network after the training sample data is input into the teacher network, and the second output data is the data output from a second specific network layer of the student network after the training sample data is input into the student network.
综上所述,本发明实施例中,能够将样本训练数据在教师网络输出的输出数据的各数据间相似信息全面迁移到学生网络中,从而实现训练样本数据通过教师网络输出的结果与通过目标网络输出的结果基本一致。根据神经网络良好的泛化性能,训练得到的目标网络的输出与教师网络的输出在测试集上也基本相同,从而提高了学生网络的准确性。以上结合具体实施例描述了本发明的基本原理,但是,需要指出的是,对本领域普通技术人员而言,能够理解本发明的方法和装置的全部或者任何步骤或者部件可以在任何计算装置(包括处理器、存储介质等)或者计算装置的网络中,以硬件、固件、软件或者他们的组合加以实现,这是本领域普通技术人员在阅读了本发明的说明的情况下运用它们的基本编程技能就能实现的。In summary, in the embodiments of the present invention, the inter-data similarity information of the output data produced by the teacher network for the training sample data can be fully migrated to the student network, so that the result of passing the training sample data through the teacher network is basically consistent with the result of passing it through the target network. Owing to the good generalization performance of neural networks, the output of the trained target network and the output of the teacher network are also basically the same on the test set, thereby improving the accuracy of the student network. The basic principles of the present invention have been described above with reference to specific embodiments. However, it should be noted that a person of ordinary skill in the art can understand that all or any steps or components of the method and apparatus of the present invention can be implemented in hardware, firmware, software, or a combination thereof, in any computing device (including a processor, a storage medium, and the like) or in a network of computing devices, which can be achieved by a person of ordinary skill in the art using basic programming skills after reading the description of the present invention.
本领域普通技术人员可以理解实现上述实施例方法携带的全部或部分步骤是可以通过程序来指令相关的硬件完成,所述的程序可以存储于一种计算机可读存储介质中,该程序在执行时,包括方法实施例的步骤之一或其组合。A person of ordinary skill in the art can understand that all or part of the steps of the methods of the foregoing embodiments can be completed by a program instructing related hardware; the program can be stored in a computer-readable storage medium and, when executed, performs one of the steps of the method embodiments or a combination thereof.
另外,在本发明各个实施例中的各功能单元可以集成在一个处理模块中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个模块中。上述集成的模块既可以采用硬件的形式实现,也可以采用软件功能模块的形式实现。所述集成的模块如果以软件功能模块的形式实现并作为独立的产品销售或使用时,也可以存储在一个计算机可读取存储介质中。In addition, each functional unit in each embodiment of the present invention may be integrated into one processing module, or each unit may exist physically separately, or two or more units may be integrated into one module. The above integrated modules can be implemented in the form of hardware or in the form of software functional modules. The integrated modules, if implemented in the form of software functional modules and sold or used as stand-alone products, may also be stored in a computer readable storage medium.
本领域内的技术人员应明白,本发明的实施例可提供为方法、系统、或计算机程序产品。因此,本发明可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且,本发明可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器和光学存储器等)上实施的计算机程序产品的形式。Those skilled in the art will appreciate that embodiments of the present invention can be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or a combination of software and hardware. Moreover, the invention can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage and optical storage, etc.) including computer usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the invention. It should be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, causing a series of operational steps to be performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although the above embodiments of the present invention have been described, those skilled in the art, once aware of the basic inventive concept, may make additional changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including the above embodiments and all changes and modifications falling within the scope of the present invention.
It is apparent that those skilled in the art can make various modifications and variations to the present invention without departing from its spirit and scope. Thus, if these modifications and variations of the present invention fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to encompass them.