JP2023028232A

JP2023028232A - Learning device and learning method

Info

Publication number: JP2023028232A
Application number: JP2021133805A
Authority: JP
Inventors: 竜介関; Ryusuke Seki; 康貴岡田; Yasutaka Okada; 雄喜片山; Yuki Katayama; 怜広見; Rei Hiromi; 葵荻島; Aoi Ogishima
Original assignee: Denso Ten Ltd
Current assignee: Denso Ten Ltd
Priority date: 2021-08-19
Filing date: 2021-08-19
Publication date: 2023-03-03

Abstract

To provide a technique capable of appropriately improving the performance of a student model in deep learning using knowledge distillation.
SOLUTION: An exemplary learning device trains a student model using a teacher model and includes: an evaluation unit that determines the performance difference between the teacher model and the student model; and a changing unit that, based on the performance difference, changes the teacher model or changes a weighting factor used when calculating a loss of the student model.
SELECTED DRAWING: Figure 3

Description

The present invention relates to a learning device and a learning method.

Conventionally, a technique called knowledge distillation has been known as one of the training techniques for deep learning (see, for example, Patent Document 1). In knowledge distillation, a large-scale teacher model with many parameters is used to train a student model with few parameters. Knowledge distillation makes it possible to reduce the size of a model while suppressing the loss of model performance.

Patent Document 1: JP 2020-71883 A

In a learning method using knowledge distillation, the performance of the student model may saturate in the latter half of training, when it approaches the performance of the teacher model, so that the performance improvement expected from knowledge distillation is not fully obtained. Also, if the performance of the teacher model used is too high, it may not be possible to sufficiently improve the performance of the student model.

In addition, in a learning method using knowledge distillation, the feature spaces of the teacher model and the student model differ. At the end of training there is therefore concern that, for example, bringing the feature map of the student model closer to the feature map of the teacher model adversely affects the training of the student model. This adverse effect may in turn degrade the performance of the student model.

In view of the above points, an object of the present invention is to provide a technique capable of appropriately improving the performance of a student model in deep learning using knowledge distillation.

An exemplary learning device of the present invention is a learning device that trains a student model using a teacher model, and includes an evaluation unit that obtains a performance difference between the teacher model and the student model, and a changing unit that, based on the performance difference, changes the teacher model or changes a weighting factor used when calculating the loss of the student model.

An exemplary learning method of the present invention is a learning method that trains a student model using a teacher model, and changes the teacher model based on a performance difference between the teacher model and the student model.

An exemplary learning method of the present invention is a learning method that trains a student model using a teacher model, in which the loss of the student model is obtained using a loss with respect to the learning data and a loss with respect to the teacher model, and the student model is trained while a weighting factor that adjusts the balance between the loss with respect to the learning data and the loss with respect to the teacher model is changed at a predetermined timing.

According to the exemplary present invention, the performance of a student model can be appropriately improved in deep learning using knowledge distillation.

FIG. 1 is a block diagram showing the hardware configuration of the learning device.
FIG. 2 is a block diagram showing the functional configuration of the processor included in the learning device.
FIG. 3 is a schematic diagram for explaining the configuration for changing the teacher model.
FIG. 4 is a schematic diagram showing how the performance of the student model changes when the student model is trained using the learning device of the first embodiment.
FIG. 5 is a flowchart showing the flow of student model training in the learning device according to the first embodiment.
FIG. 6 is a flowchart showing the flow of student model training in the learning device according to the second embodiment.
FIG. 7 is a diagram showing an example in which the weighting factor λ is decayed every several epochs.
FIG. 8 is a diagram showing an example in which the weighting factor λ is decayed every epoch.

Exemplary embodiments of the present invention are described in detail below with reference to the drawings.

<1. First Embodiment>
FIG. 1 is a block diagram showing the hardware configuration of a learning device 10 according to the first embodiment of the present invention. The learning device 10 is a learning device for machine learning and uses knowledge distillation as its learning technique. Hereinafter, knowledge distillation may be referred to simply as "distillation". The learning device 10 is a learning device that trains a student model using a teacher model, and executes a learning method of training a student model using a teacher model. Specifically, the learning device 10 trains the student model using learning data and the teacher model. In this embodiment, the learning device 10 trains the student model using a plurality of teacher models.

In this embodiment, the learning data is learning data with ground-truth labels. The teacher model is a large-scale deep neural network with many parameters that has already been trained. The student model is a small-scale deep neural network with fewer parameters than the teacher model, and is the model that is about to be trained.

As shown in FIG. 1, the learning device 10 includes a processor 11 and a storage unit 12. The storage unit 12 stores the teacher model and the student model. The teacher model contains trained parameters. The student model contains parameters that are updated by training. In this embodiment there are a plurality of teacher models, and the storage unit 12 stores the plurality of teacher models.

The processor 11 performs inference on the learning data using the teacher model stored in the storage unit 12, and inference on the learning data using the student model. The processor 11 also updates the parameters of the student model so that the error between the inference result of the teacher model and the training result of the student model becomes small. Note that a feature map of an intermediate layer may be used instead of the inference result.

FIG. 2 is a block diagram showing the functional configuration of the processor 11 included in the learning device 10 according to the first embodiment of the present invention. The functions of the processor 11 are realized by executing arithmetic processing in accordance with a program stored in the storage unit 12. As shown in FIG. 2, the processor 11 of this embodiment includes, as its functions, an acquisition unit 110, a first inference unit 111, a second inference unit 112, a learning unit 113, an evaluation unit 114, and a changing unit 115. In other words, the learning device 10 includes the acquisition unit 110, the first inference unit 111, the second inference unit 112, the learning unit 113, the evaluation unit 114, and the changing unit 115. The acquisition unit 110 acquires the learning data.

The first inference unit 111 performs inference on the input learning data using the trained teacher model. The function of the first inference unit 111 is realized when the processor 11 loads the teacher model stored in the storage unit 12. Inference yields an inference result, which is the final result of the inference, as well as intermediate outputs such as feature maps. In this embodiment, the learning data is image data, and the first inference unit 111 performs object detection on the input image data. The inference result of the first inference unit 111 includes identification of the type and position of each object. The types of objects are, for example, pedestrians, bicycles, automobiles, and traffic lights. Bounding boxes are used to specify positions.

The second inference unit 112 performs inference on the input learning data using the student model, which is the target of training by the teacher model. The function of the second inference unit 112 is realized when the processor 11 loads the student model stored in the storage unit 12. Inference yields an inference result, which is the final result of the inference, as well as intermediate outputs such as feature maps. Note that the parameters of the student model are updated successively by training. In this embodiment, the learning data is image data as described above, and the second inference unit 112 performs object detection on the input image data. As with the first inference unit 111, the inference result of the second inference unit 112 includes identification of the type and position of each object.

The learning unit 113 obtains the loss of the student model with respect to the teacher model (distillation loss Ldstl). The loss with respect to the teacher model (distillation loss Ldstl) is the error between the teacher model and the student model in their inference results or in their intermediate-layer feature maps (intermediate outputs). This makes it possible to obtain the loss function easily using conventional techniques.

In this embodiment, the distillation loss Ldstl is the error between the feature map of the teacher model and the feature map of the student model. The distillation loss Ldstl may be, for example, the L2 loss expressed by Equation (1) below. However, the distillation loss Ldstl may be something other than the L2 loss, for example the KL divergence, which represents the loss between the output distributions of the teacher model and the student model.

In Equation (1), n denotes the total number of samples (data). f_xi^s denotes the output of the student model for sample xi. f_xi^t denotes the output of the teacher model for sample xi.
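Equation (1) itself is not reproduced in this text, so the following is only a minimal sketch of an L2 distillation loss built from the symbols defined above. It assumes the squared differences are averaged over the n samples; the exact normalization in the original equation is not known here. The function name `l2_distillation_loss` and the list-of-lists feature layout are hypothetical.

```python
def l2_distillation_loss(student_feats, teacher_feats):
    """Mean squared L2 distance between student and teacher outputs.

    student_feats[i] plays the role of f_xi^s and teacher_feats[i] of f_xi^t.
    Averaging over n is an assumption, not taken from the source.
    """
    if len(student_feats) != len(teacher_feats):
        raise ValueError("student and teacher must cover the same samples")
    n = len(student_feats)
    total = 0.0
    for f_s, f_t in zip(student_feats, teacher_feats):
        # Squared L2 norm of the difference for one sample.
        total += sum((s - t) ** 2 for s, t in zip(f_s, f_t))
    return total / n
```

The loss is zero when the student reproduces the teacher's feature maps exactly and grows as the two diverge.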

The learning unit 113 also obtains the loss of the student model with respect to the learning data. Specifically, the learning unit 113 obtains the task-specific error (loss) of the student model with respect to the ground-truth labels of the learning data. In this embodiment, the learning task is object detection, and the task-specific loss Luq includes, for example, a classification loss Lcls and a bounding box regression loss Lbbox. The task-specific loss Luq of this embodiment is obtained, for example, by Equation (2) below.

Note that Equation (2) below assumes the use of YOLO, which is one example of an object detection algorithm. YOLO divides the input image into an S×S grid and, for each grid cell, predicts the center coordinates (x, y) and the width and height scales (w, h) of B rectangular regions having a certain aspect ratio, together with the probability (C) that an object exists within each rectangular region. Furthermore, for the case where some object exists in a rectangular region, it also predicts the posterior probability p(c) indicating which class the object belongs to. YOLO integrates the predictions of the rectangular regions, the object existence probabilities, and the class posterior probabilities into a single loss function.

Variables without a subscript denote predicted values, and variables with the subscript "truth" denote ground-truth values. λcoord and λnoobj are coefficients. In Equation (2), the first and second terms are the loss functions for the center coordinates and the size of the rectangular regions. The third and fourth terms are the loss functions for the existence probability. The fifth term is the loss function for the class posterior probability.
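Equation (2) itself does not survive in this text. The term-by-term description above matches the loss published with the original YOLO detector, so the reconstruction below is offered only as an assumption about its form, rewritten with the "truth" subscript convention used here. The indicator symbols (whether rectangular region j of grid cell i is, or is not, responsible for an object) and the square roots on w and h are taken from the YOLO paper, not from this document.

```latex
\begin{aligned}
L_{uq} ={}& \lambda_{coord} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}_{ij}^{obj}
      \left[ (x_i - x_{i,truth})^2 + (y_i - y_{i,truth})^2 \right] \\
 &+ \lambda_{coord} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}_{ij}^{obj}
      \left[ (\sqrt{w_i} - \sqrt{w_{i,truth}})^2 + (\sqrt{h_i} - \sqrt{h_{i,truth}})^2 \right] \\
 &+ \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}_{ij}^{obj} (C_i - C_{i,truth})^2 \\
 &+ \lambda_{noobj} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}_{ij}^{noobj} (C_i - C_{i,truth})^2 \\
 &+ \sum_{i=1}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} \left( p_i(c) - p_{i,truth}(c) \right)^2
\end{aligned}
```

Read against the description, the first two lines are the center-coordinate and size terms, the next two are the existence-probability terms for regions with and without an object, and the last line is the class posterior term.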

The learning unit 113 trains the student model so as to minimize a loss L calculated using the loss of the student model with respect to the teacher model (distillation loss Ldstl) and the loss of the student model with respect to the learning data (task-specific loss Luq). Specifically, the learning unit 113 updates the parameters of the student model using backpropagation.

Note that the loss L calculated using the loss with respect to the teacher model and the loss with respect to the learning data may be calculated, for example, by Equation (3) below. In Equation (3), λ is a weighting factor. The weighting factor λ adjusts the balance between the loss with respect to the learning data and the loss with respect to the teacher model. In this embodiment, the weighting factor λ is a constant.
L = Luq + λ·Ldstl (3)
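Equation (3) can be written directly as a one-line helper. This is a minimal sketch: the function and parameter names are hypothetical, and the two input losses are assumed to be already computed scalars.

```python
def total_loss(task_loss, distill_loss, lam):
    """Equation (3): L = Luq + lambda * Ldstl.

    task_loss    -- Luq, loss against the ground-truth labels
    distill_loss -- Ldstl, loss against the teacher model
    lam          -- weighting factor lambda (constant in the first embodiment)
    """
    return task_loss + lam * distill_loss
```

With `lam = 0` the student is trained on the labels alone, and larger values pull it more strongly toward the teacher.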

The evaluation unit 114 obtains the difference in performance (performance difference) between the teacher model and the student model. As described above, in this embodiment, performance means object detection performance. According to this embodiment, a high-performance trained model capable of performing object detection processing at high speed can be generated using knowledge distillation.

Specifically, the evaluation unit 114 calculates the performance difference using mAP (mean Average Precision), an evaluation metric for object detection tasks. That is, the evaluation unit 114 obtains the mAP of the student model and calculates its difference from the mAP of the teacher model. The mAP of the teacher model may be calculated each time or may be stored in the storage unit 12 in advance. If the learning task is not object detection, a performance metric other than mAP may be used. For example, if the learning task is image segmentation, mIoU (mean Intersection over Union) may be used as the performance metric.

The changing unit 115 changes the teacher model based on the performance difference between the teacher model and the student model. FIG. 3 is a schematic diagram for explaining the configuration for changing the teacher model. As shown in FIG. 3, in this embodiment, not one but a plurality of teacher models are prepared in advance. In FIG. 3, N > 1. The plurality of teacher models differ from one another in performance. The teacher model used to train the student model is changed successively according to the performance difference between the teacher model and the student model.

With this configuration, when the performance of the student model approaches that of the teacher model, the student model can be trained after switching to a teacher model with higher performance. The performance of the student model can therefore be improved compared with training the student model with only one teacher model. In addition, the performance of the teacher model can be changed gradually in accordance with the performance of the student model, so the student model can be trained appropriately.

In FIG. 3, the larger the teacher model number, the higher the performance of the teacher model. That is, in the example shown in FIG. 3, when the performance difference between the teacher model and the student model becomes smaller than a predetermined threshold, the changing unit 115 changes the teacher model used to train the student model to a teacher model with higher performance than the one currently in use. The predetermined threshold is an arbitrary value determined, for example, by trial and error.
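The switching rule described here can be sketched as a small function. It assumes the teacher models are indexed in order of increasing performance and that each model's mAP is available as a number; `next_teacher_index` and its parameters are hypothetical names, not part of the source.

```python
def next_teacher_index(current, teacher_maps, student_map, threshold):
    """Return the index of the teacher to use next.

    teacher_maps is ordered from lowest to highest performance (an
    assumption matching FIG. 3). The teacher is advanced when the gap to
    the student falls below the threshold and a stronger teacher exists.
    """
    gap = teacher_maps[current] - student_map
    if gap < threshold and current + 1 < len(teacher_maps):
        return current + 1
    return current
```

For example, with teachers at mAP 0.5, 0.6 and 0.7 and a threshold of 0.05, a student at 0.47 would move from the first teacher to the second, while a student at 0.40 would stay with the first.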

FIG. 4 is a schematic diagram showing how the performance of the student model changes when training is performed using the learning device 10 according to the first embodiment of the present invention. The horizontal axis of FIG. 4 is the iteration, specifically the number of times the parameters of the student model have been updated. The vertical axis of FIG. 4 is performance, for example the detection rate of the objects to be detected.

As shown in FIG. 4, when training with teacher model 1, the first teacher model, brings the performance of the student model close to that of teacher model 1, the teacher model is changed from teacher model 1 to teacher model 2, which has higher performance than teacher model 1, and training continues. Then, when the performance of the student model approaches that of teacher model 2, the teacher model is changed from teacher model 2 to teacher model 3, which has higher performance than teacher model 2, and training continues. This is repeated as many times as the number of teacher models prepared in advance (N).

Thus, in this configuration, training of the student model starts with a teacher model whose performance is matched to that of the student model, and the performance of the teacher model used for training is raised gradually as the performance of the student model improves. The performance of the student model can therefore be improved more appropriately than when the student model is trained with a high-performance teacher model from the beginning. This configuration also allows the student model to be trained efficiently.

FIG. 5 is a flowchart showing the flow of student model training in the learning device 10 according to the first embodiment of the present invention. Note that N (N > 1) trained teacher models have been created before the flowchart shown in FIG. 5 starts. Any known technique may be used to train the teacher models.

In step S1, the variable "i", which corresponds to the number of the teacher model used for training, is set to zero. The N teacher models are assigned the numbers 0, 1, ..., N-1 in order from lowest to highest performance. When the processing of step S1 is completed, the process proceeds to the next step S2.

In step S2, the variable "epoch", which represents the number of epochs, is set to zero. The maximum number of epochs is set to M. M is, for example, 100 epochs. When the processing of step S2 is completed, the process proceeds to the next step S3.

In step S3, the student model is trained using the i-th teacher model. This training is the training using knowledge distillation described above. Here, one epoch of training is performed. Specifically, the learning data set is divided into a plurality of subsets according to the batch size, and training using knowledge distillation is performed for each subset (one iteration per subset). For each subset, inference by the i-th teacher model and inference by the student model are performed. Then, for each subset, the parameters of the student model are updated using the loss function (see Equation (3)) obtained from the inference by the i-th teacher model and the student model. One epoch of training is complete when all subsets have been processed. When the processing of step S3 is completed, the process proceeds to the next step S4.
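The division of the learning data into subsets for one epoch can be sketched as follows. This is an illustrative helper only: `iter_batches` is a hypothetical name, and the source does not say whether the data is shuffled or how a final partial subset is handled (here it is kept).

```python
def iter_batches(samples, batch_size):
    """Yield consecutive subsets of at most batch_size samples.

    One pass over all yielded subsets corresponds to one epoch;
    each subset corresponds to one iteration (one parameter update).
    """
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]
```

For five samples and a batch size of two, this gives three iterations per epoch, the last one containing a single sample.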

In step S4, the performance of the student model trained in step S3 is evaluated. In this embodiment, mAP, the evaluation metric for the object detection task, is obtained for the student model after the training in step S3. Once the mAP has been obtained, the process proceeds to the next step S5.

In step S5, it is determined whether the performance difference between the teacher model and the student model is less than a threshold. Specifically, it is determined whether the difference between the mAP of the teacher model and the mAP of the student model is less than the threshold. The threshold is a constant value stored in the storage unit 12 in advance. If the performance difference is greater than or equal to the threshold (No in step S5), the process proceeds to the next step S6. If the performance difference is less than the threshold (Yes in step S5), the process proceeds to step S8. That is, steps S6 and S7 are skipped.

In step S6, 1 is added to the variable "epoch". If the variable "epoch" was zero at this point, it becomes 1. That is, one epoch of training (knowledge distillation) using the i-th teacher model has been completed. Likewise, if the variable "epoch" was 50, it becomes 51. That is, 51 epochs of training using the i-th teacher model have been completed. When the processing of step S6 is completed, the process proceeds to the next step S7.

In step S7, it is checked whether the variable "epoch" is smaller than the maximum number of epochs M. If the variable "epoch" is smaller than M (Yes in step S7), the process returns to step S3. That is, training using the i-th teacher model is repeated. If the variable "epoch" has reached M (No in step S7), the process proceeds to the next step S8.

In step S8, 1 is added to the variable "i". That is, the teacher model used for training is changed from the i-th teacher model to the (i+1)-th teacher model. For example, if i = 0, then i becomes 1, and the teacher model is changed from teacher model 0 to teacher model 1. When the processing of step S8 is completed, the process proceeds to the next step S9.

In step S9, it is checked whether the variable "i" is smaller than N. That is, it is checked whether all of the teacher models prepared in advance have been used to train the student model. If the variable "i" is smaller than N (Yes in step S9), not all teacher models have yet been used to train the student model, so the process returns to step S2. Training of the student model by knowledge distillation then starts with a new teacher model whose performance is higher than that of the teacher model used in the immediately preceding training. Note that the threshold used in step S5 may also be changed when the teacher model is changed. If the variable "i" has reached N (No in step S9), the training of the student model shown in FIG. 5 is complete, and a new trained model is obtained.
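Steps S1 to S9 can be condensed into a short control loop. This is a sketch of the flowchart's control flow only: the actual training and evaluation are passed in as callbacks, and every name here (`train_with_teacher_sequence`, `train_one_epoch`, `evaluate_student`, the toy demo values) is hypothetical. The demo uses integer percentage points for mAP so the arithmetic is exact.

```python
def train_with_teacher_sequence(teacher_maps, train_one_epoch,
                                evaluate_student, max_epochs, threshold):
    """Run the FIG. 5 loop and return (teacher, epoch, student mAP) records."""
    history = []
    i = 0                                         # S1
    while i < len(teacher_maps):                  # S9: teachers left?
        epoch = 0                                 # S2
        while True:
            train_one_epoch(i)                    # S3: distill from teacher i
            student_map = evaluate_student()      # S4: measure student mAP
            history.append((i, epoch, student_map))
            if teacher_maps[i] - student_map < threshold:
                break                             # S5 Yes: skip S6 and S7
            epoch += 1                            # S6
            if epoch >= max_epochs:               # S7 No: epoch budget used
                break
        i += 1                                    # S8: next, stronger teacher
    return history


# Toy demo (not real training): the student gains 10 points per epoch.
state = {"map": 0}

def train_one_epoch(teacher_index):
    state["map"] += 10

def evaluate_student():
    return state["map"]

history = train_with_teacher_sequence(
    teacher_maps=[30, 50], train_one_epoch=train_one_epoch,
    evaluate_student=evaluate_student, max_epochs=100, threshold=5)
```

In the toy run the student spends three epochs with teacher 0 and two with teacher 1, switching each time the gap drops below the threshold rather than waiting for the full epoch budget.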

The trained student model obtained by the process of FIG. 5 can achieve performance close to that of the highest-performing teacher model among those prepared in advance. In addition, this trained model has fewer parameters than the teacher models and can therefore perform processing quickly.

<2. Second Embodiment>
Next, the learning device of the second embodiment will be described. Like the first embodiment, the learning device of the second embodiment trains a student model by knowledge distillation using a teacher model. The hardware configuration of the learning device of the second embodiment is the same as that of the learning device 10 of the first embodiment shown in FIG. 1. The functional configuration of the processor included in the learning device of the second embodiment is also the same as that of the processor 11 included in the learning device 10 of the first embodiment shown in FIG. 2. However, the second embodiment differs from the first embodiment in the details of the functions of the learning unit 113 and the changing unit 115 included in the processor. The following description focuses on these differences. Content that overlaps with the first embodiment is omitted unless an explanation is particularly necessary.

As in the first embodiment, the learning unit 113 trains the student model so as to minimize the loss L given by Equation (3). However, whereas the weighting factor λ in Equation (3) is constant in the first embodiment, it is not constant in the second embodiment. This is the difference from the first embodiment.

変更部１１５は、教師モデルと生徒モデルとの性能差に基づいて、生徒モデルの損失の算出時における重み係数の変更を行う。このような構成とすることで、例えば生徒モデルの性能が教師モデルの性能に近づいた段階で重み係数λを変更して、生徒モデルが教師モデルの影響を受け難い状態として学習させることができる。 The changing unit 115 changes the weighting factor used when calculating the loss of the student model, based on the performance difference between the teacher model and the student model. With this configuration, for example, the weighting factor λ can be changed once the performance of the student model approaches that of the teacher model, so that the student model is trained in a state in which it is less susceptible to the influence of the teacher model.

詳細には、生徒モデルにおける損失Ｌは、上述のように式（３）で算出される。すなわち、本実施形態では、重み係数は、学習データに対する損失Ｌuqと、教師モデルに対する損失Ｌdstlとのバランスを調整する係数λである。生徒モデルの性能が教師モデルの性能に近づいた場合にλの値を小さくすることにより、教師モデルの影響を受け難くして学習させることができる。これにより、教師モデルが生徒モデルの学習に悪影響を及ぼす可能性を低減することができる。 Specifically, the loss L in the student model is calculated by equation (3) as described above. That is, in the present embodiment, the weighting factor is a factor λ for adjusting the balance between the loss Luq for the training data and the loss Ldstl for the teacher model. By reducing the value of λ when the performance of the student model approaches the performance of the teacher model, learning can be made less susceptible to the teacher model. This can reduce the possibility that the teacher model will adversely affect the learning of the student model.
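As a non-limiting illustration of the role of λ described above, the following minimal Python sketch assumes that Equation (3) has the additive form L = Luq + λ×Ldstl. The text confirms only that a term λ×Ldstl appears in Equation (3); the additive form, the function name `student_loss`, and the numeric values are assumptions made for this sketch.

```python
def student_loss(l_uq: float, l_dstl: float, lam: float) -> float:
    """Combine the loss for the learning data and the loss for the teacher model.

    Assumption of this sketch: Equation (3) is L = Luq + lam * Ldstl.
    """
    return l_uq + lam * l_dstl


# Illustrative values only: halving lam halves the teacher-related contribution.
loss_before = student_loss(l_uq=0.8, l_dstl=0.4, lam=1.0)
loss_after = student_loss(l_uq=0.8, l_dstl=0.4, lam=0.5)
```

Under this assumed form, reducing λ leaves the loss for the learning data untouched and scales down only the term that ties the student model to the teacher model.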

図６は、本発明の第２実施形態に係る学習装置における生徒モデルの訓練の流れを示すフローチャートである。なお、図６に示すフローチャートの開始前に、公知の手法により学習が行われた学習済みの教師モデルが作成されている。本実施形態では、第１実施形態と異なり、教師モデルの数は１つである。 FIG. 6 is a flow chart showing the flow of student model training in the learning device according to the second embodiment of the present invention. Before starting the flow chart shown in FIG. 6, a trained teacher model which has been trained by a known method is created. In this embodiment, unlike the first embodiment, the number of teacher models is one.

ステップＳ１１では、エポック数を表す変数「ｅｐｏｃｈ」がゼロに設定される。なお、学習予定のエポック数はＭに設定されている。エポック数Ｍは、例えば１００エポック等である。ステップＳ１１の処理が完了すると、次のステップＳ１２に処理が進められる。 In step S11, a variable "epoch" representing the number of epochs is set to zero. Note that the number of epochs to be learned is set to M. The number of epochs M is, for example, 100 epochs. When the process of step S11 is completed, the process proceeds to the next step S12.

ステップＳ１２では、教師モデルを用いて生徒モデルの学習が行われる。当該学習は、上述した知識蒸留を用いた学習である。ここでは、１エポック分の学習が行われる。詳細には、学習データのセットをバッチサイズにしたがって複数のサブセットに分け、サブセット毎（１イタレーション毎）に知識蒸留を用いた学習が行われる。サブセット毎に、教師モデルによる推論と、生徒モデルによる推論とが行われる。そして、サブセット毎に、教師モデルおよび生徒モデルを用いた推論により得られる損失関数（式（３）参照）を用いて、生徒モデルのパラメータ更新が行われる。全てのサブセットの学習が完了することで、１エポック分の学習が完了したことになる。ステップＳ１２の処理が完了すると、次のステップＳ１３に処理が進められる。 In step S12, the student model is trained using the teacher model. This training uses the knowledge distillation described above. Here, training for one epoch is performed. Specifically, the learning data set is divided into a plurality of subsets according to the batch size, and training using knowledge distillation is performed for each subset (that is, for each iteration). For each subset, inference by the teacher model and inference by the student model are performed. Then, for each subset, the parameters of the student model are updated using the loss function (see Equation (3)) obtained from the inference results of the teacher model and the student model. When training on all subsets is complete, training for one epoch is complete. When the process of step S12 is completed, the process proceeds to the next step S13.
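As a non-limiting illustration, the division into subsets and the per-iteration processing of step S12 may be sketched in Python as follows. The names `split_into_batches`, `train_one_epoch`, and `distill_step` are hypothetical; `distill_step` stands in for the teacher inference, student inference, loss calculation, and parameter update performed for one subset, none of which is implemented here.

```python
from typing import Callable, List, Sequence


def split_into_batches(dataset: Sequence, batch_size: int) -> List[Sequence]:
    """Divide the learning data set into subsets according to the batch size."""
    return [dataset[i:i + batch_size] for i in range(0, len(dataset), batch_size)]


def train_one_epoch(
    dataset: Sequence,
    batch_size: int,
    distill_step: Callable[[Sequence], None],
) -> int:
    """Run one epoch: one call to distill_step per subset (one iteration each).

    Returns the number of iterations performed.
    """
    iterations = 0
    for batch in split_into_batches(dataset, batch_size):
        # Placeholder for teacher inference, student inference,
        # loss calculation, and the student parameter update.
        distill_step(batch)
        iterations += 1
    return iterations
```

The last subset may be smaller than the batch size when the data set size is not a multiple of it; how such a remainder is handled is not specified in the text and is a choice of this sketch.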

ステップＳ１３では、ステップＳ１２で学習が行われた生徒モデルの性能の評価が行われる。本実施形態では、ステップＳ１２における学習後の生徒モデルについて、物体検出タスクの評価指標であるｍＡＰを求める。ｍＡＰを求めると、次のステップＳ１４に処理が進められる。 In step S13, the performance of the student model trained in step S12 is evaluated. In this embodiment, mAP, which is an evaluation index of the object detection task, is obtained for the student model after learning in step S12. After obtaining the mAP, the process proceeds to the next step S14.

ステップＳ１４では、教師モデルと生徒モデルの性能差が閾値未満であるか否かが判定される。具体的には、教師モデルのｍＡＰと、生徒モデルのｍＡＰとの差が閾値未満であるか否かが判定される。閾値は、予め記憶部１２に記憶された一定値である。性能差が閾値未満である場合（ステップＳ１４でＹｅｓ）、次のステップＳ１５に処理が進められる。性能差が閾値以上である場合（ステップＳ１４でＮｏ）、ステップＳ１６に処理が進められる。すなわち、ステップＳ１５の処理がとばされる。 In step S14, it is determined whether or not the performance difference between the teacher model and the student model is less than a threshold. Specifically, it is determined whether or not the difference between the mAP of the teacher model and the mAP of the student model is less than a threshold. The threshold is a constant value stored in the storage unit 12 in advance. If the performance difference is less than the threshold (Yes in step S14), the process proceeds to next step S15. If the performance difference is greater than or equal to the threshold (No in step S14), the process proceeds to step S16. That is, the process of step S15 is skipped.

ステップＳ１５では、式（３）に含まれる重み係数λが変更される。詳細には、重み係数λの値は小さくされる。例えば、現在のλの値に対して１よりも小さい定数（例えば０．５等）が乗じられる。すなわち、式（３）において、λ×Ｌdstlで表される項の重みが小さくなる。この結果、生徒モデルの損失Ｌにおける教師モデルに対する損失Ｌdstlの重みが小さくなり、教師モデルの影響を小さくして生徒モデルを学習させることができる。ステップＳ１５の処理が完了すると、次のステップＳ１６に処理が進められる。 In step S15, the weighting factor λ included in Equation (3) is changed. Specifically, the value of λ is reduced. For example, the current value of λ is multiplied by a constant smaller than 1 (for example, 0.5). That is, the weight of the term λ×Ldstl in Equation (3) is reduced. As a result, the loss Ldstl for the teacher model contributes less to the loss L of the student model, and the student model can be trained with less influence from the teacher model. When the process of step S15 is completed, the process proceeds to the next step S16.

なお、ステップＳ１５で重み係数λが変更された場合、当該変更に合わせてステップＳ１４の閾値も変更される構成としてよい。また、重み係数λが変更された後も、閾値を一定としてもよい。この場合には、最初に重み係数λが変更された後は、１エポック毎に重み係数λが徐々に小さくなることになる。また、一度重み係数λが変更された後は、その後は重み係数λが変更されない構成としてもよい。 Note that when the weighting factor λ is changed in step S15, the threshold in step S14 may be changed in accordance with the change. Also, the threshold may be kept constant even after the weighting factor λ is changed. In this case, after the weighting factor λ is changed for the first time, the weighting factor λ is gradually decreased for each epoch. Further, after the weighting factor λ is changed once, the weighting factor λ may not be changed thereafter.
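As a non-limiting illustration, two of the variations above (a constant threshold with repeated reduction, and a single reduction only) may be sketched in Python as follows. The function name `update_lambda`, the `change_once` flag, and the numeric values are hypothetical. The variation in which the threshold is changed together with λ is not sketched, because the text does not specify how the threshold would be changed.

```python
from typing import Tuple


def update_lambda(
    lam: float,
    perf_gap: float,
    threshold: float,
    already_changed: bool,
    change_once: bool = False,
    decay: float = 0.5,
) -> Tuple[float, bool]:
    """One pass through steps S14 and S15.

    Returns the (possibly reduced) lam and a flag indicating whether
    lam has been changed at least once so far.
    """
    if perf_gap < threshold and not (change_once and already_changed):
        return lam * decay, True
    return lam, already_changed


# Illustrative performance gaps for three consecutive epochs.
gaps = (0.20, 0.03, 0.02)

lam_repeat, changed = 1.0, False
for gap in gaps:
    lam_repeat, changed = update_lambda(lam_repeat, gap, 0.05, changed)

lam_once, changed = 1.0, False
for gap in gaps:
    lam_once, changed = update_lambda(lam_once, gap, 0.05, changed, change_once=True)
```

With a constant threshold, λ keeps shrinking in every epoch in which the gap stays below the threshold; with `change_once=True`, it is reduced only the first time.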

ステップＳ１６では、変数「ｅｐｏｃｈ」に１が加算される。この時点で変数「ｅｐｏｃｈ」がゼロであった場合には、変数「ｅｐｏｃｈ」が１となる。すなわち、教師モデルを用いた学習（知識蒸留）が１エポック完了したことになる。また、変数「ｅｐｏｃｈ」が５０であった場合には、変数「ｅｐｏｃｈ」が５１となる。すなわち、教師モデルを用いた学習が５１エポック完了したことになる。ステップＳ１６の処理が完了すると、次のステップＳ１７に処理が進められる。 In step S16, 1 is added to the variable "epoch". If the variable "epoch" was zero at this time, the variable "epoch" becomes one. That is, one epoch of learning (knowledge distillation) using the teacher model is completed. Also, if the variable “epoch” was 50, the variable “epoch” becomes 51. That is, 51 epochs of learning using the teacher model have been completed. When the process of step S16 is completed, the process proceeds to the next step S17.

ステップＳ１７では、変数「ｅｐｏｃｈ」が学習予定のエポック数Ｍより小さいか否かが確認される。変数「ｅｐｏｃｈ」がエポック数Ｍより小さい場合（ステップＳ１７でＹｅｓ）、ステップＳ１２に処理が戻され、学習が繰り返されることになる。変数「ｅｐｏｃｈ」がエポック数Ｍに到達している場合（ステップＳ１７でＮｏ）、図６に示す生徒モデルの訓練（学習）が完了し、新たな学習済みモデルが完成する。 In step S17, it is checked whether or not the variable "epoch" is smaller than the number M of epochs to be learned. If the variable "epoch" is smaller than the number of epochs M (Yes in step S17), the process returns to step S12 and learning is repeated. When the variable "epoch" has reached the number of epochs M (No in step S17), the training (learning) of the student model shown in FIG. 6 is completed, and a new trained model is completed.
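As a non-limiting illustration, the flow of steps S11 to S17 may be sketched in Python as follows. The callables `run_epoch` and `evaluate_student` are hypothetical stand-ins for the one-epoch training of step S12 and the mAP evaluation of step S13. The sketch also assumes that the performance difference is the teacher mAP minus the student mAP, and the mAP values used in the usage example are placeholders, not measured results.

```python
from typing import Callable, List


def train_student(
    run_epoch: Callable[[float], None],
    evaluate_student: Callable[[], float],
    teacher_map: float,
    lam: float,
    threshold: float,
    total_epochs: int,
    decay: float = 0.5,
) -> List[float]:
    """Follow the flow of FIG. 6 and return the lam used in each epoch."""
    lam_per_epoch = []
    epoch = 0                                      # S11: reset the epoch counter
    while True:
        lam_per_epoch.append(lam)
        run_epoch(lam)                             # S12: one epoch of distillation
        student_map = evaluate_student()           # S13: evaluate the student model
        if teacher_map - student_map < threshold:  # S14: performance gap below threshold?
            lam *= decay                           # S15: reduce the weighting factor
        epoch += 1                                 # S16: increment the epoch counter
        if epoch >= total_epochs:                  # S17: planned number of epochs reached
            return lam_per_epoch


# Placeholder student mAP values for four epochs (illustrative only).
fake_student_maps = iter([0.30, 0.45, 0.58, 0.59])
history = train_student(
    run_epoch=lambda lam: None,
    evaluate_student=lambda: next(fake_student_maps),
    teacher_map=0.60,
    lam=1.0,
    threshold=0.05,
    total_epochs=4,
)
```

In this usage example the gap first falls below the threshold after the third epoch, so the fourth epoch is the first one trained with the reduced λ.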

図６の処理により得られた学習済みの生徒モデルは、教師モデルから悪影響を受ける可能性が低く、知識蒸留を用いることで適切に性能を向上させることができる。また、当該学習済みモデルは、教師モデルに比べてパラメータの数が少なく、処理を迅速に行うことができる。 The trained student model obtained by the process of FIG. 6 is less likely to be adversely affected by the teacher model, and its performance can be appropriately improved by using knowledge distillation. In addition, the trained model has fewer parameters than the teacher model and can therefore perform processing quickly.

以上からわかるように、第２実施形態の学習方法においても、生徒モデルの損失Ｌは、学習データに対する損失Ｌuqと、教師モデルに対する損失Ｌdstlとを用いて求められる。また、第２実施形態の学習方法においては、学習データに対する損失Ｌuqと、教師モデルに対する損失Ｌdstlとのバランスを調整する重み係数λを、所定のタイミングで変更して生徒モデルが訓練される。このような構成とすることで、生徒モデルの訓練時における教師モデルの影響を調整しながら生徒モデルの訓練を行うことができる。 As can be seen from the above, in the learning method of the second embodiment as well, the loss L of the student model is obtained using the loss Luq for the learning data and the loss Ldstl for the teacher model. Further, in the learning method of the second embodiment, the student model is trained while the weighting factor λ, which adjusts the balance between the loss Luq for the learning data and the loss Ldstl for the teacher model, is changed at a predetermined timing. With this configuration, the student model can be trained while the influence of the teacher model during training is adjusted.

上述のように、本実施形態では、所定のタイミングは、教師モデルと生徒モデルとの性能差が所定の閾値よりも小さくなった場合である。これにより、生徒モデルの性能が教師モデルの性能に近づいた段階で重み係数λを変更して、生徒モデルが教師モデルの影響を受け難い状態として学習させることができる。 As described above, in this embodiment, the predetermined timing is when the performance difference between the teacher model and the student model becomes smaller than a predetermined threshold. As a result, the weighting factor λ can be changed once the performance of the student model approaches that of the teacher model, so that the student model is trained in a state in which it is less susceptible to the influence of the teacher model.

ただし、別の例として、所定のタイミングは、所定エポック数ごとであってもよい。このように構成すると、所定エポック数ごとに重み係数λを減衰させることができ、知識蒸留を用いた学習の進行に合わせて教師モデルの影響を徐々に減らすことができる。 However, as another example, the predetermined timing may be every predetermined number of epochs. With this configuration, the weighting factor λ can be attenuated every predetermined number of epochs, and the influence of the teacher model can be gradually reduced in accordance with the progress of learning using knowledge distillation.

所定エポック数ごとは、例えば複数エポックごとであっても、１エポックごとであってもよい。図７は、複数エポックごとに重み係数λが減衰される例を示す図である。図７において、横軸はエポック数であり、縦軸はλの値である。図７に示す例においては、エポック数に対するλの値は、以下の式（４）で表すことができる。
λ ＝ λ_int＊（０．９＾（ｅｐｏｃｈｍｏｄ５０)）（４）
λ_intは、λの初期値である。ｅｐｏｃｈは、現在のエポック数を示す。式（４）に示す例では、重み係数λが、５０エポックごとに０．９倍される。０．９倍は一例であり、１より小さい他の数であってよい。 The predetermined number of epochs may be, for example, multiple epochs or one epoch. FIG. 7 is a diagram showing an example in which the weighting factor λ is attenuated every multiple epochs. In FIG. 7, the horizontal axis is the number of epochs and the vertical axis is the value of λ. In the example shown in FIG. 7, the value of λ with respect to the number of epochs can be expressed by Equation (4) below.
λ = λ_int * (0.9^(epoch mod 50)) (4)
λ_int is the initial value of λ, and epoch indicates the current epoch number. In the example shown in Equation (4), the weighting factor λ is multiplied by 0.9 every 50 epochs. The factor 0.9 is only an example; any other value less than 1 may be used.
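As a non-limiting illustration, the stepwise attenuation described for FIG. 7 may be sketched in Python as follows. The sketch assumes that the exponent is the number of completed 50-epoch intervals (integer division of the epoch number by 50), which is the behavior the prose describes ("multiplied by 0.9 every 50 epochs"); the expression as printed in Equation (4) uses `mod`, so this reading is an assumption. The function and parameter names are hypothetical.

```python
def step_decay_lambda(
    lam_init: float,
    epoch: int,
    interval: int = 50,
    factor: float = 0.9,
) -> float:
    """Weighting factor that is multiplied by `factor` once per `interval` epochs.

    Assumption of this sketch: the exponent is epoch // interval.
    """
    return lam_init * factor ** (epoch // interval)


# lam stays at its initial value for epochs 0 to 49, then drops at epoch 50.
lam_at_0 = step_decay_lambda(1.0, 0)
lam_at_49 = step_decay_lambda(1.0, 49)
lam_at_50 = step_decay_lambda(1.0, 50)
lam_at_100 = step_decay_lambda(1.0, 100)
```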

図８は、１エポックごとに重み係数λが減衰される例を示す図である。図８において、横軸はエポック数であり、縦軸はλの値である。図８に示す例においては、エポック数に対するλの値は、以下の式（５）で表すことができる。
λ = λ_int*（ｃｏｓ（π＊ｅｐｏｃｈ／Ｍ) ＋１）／２（５）
λ_intは、λの初期値である。ｅｐｏｃｈは、現在のエポック数を示す。Ｍは学習予定のエポック数である。図８に示す例では、λは、いわゆるＣｏｓｉｎｅ Ａｎｎｅａｌｉｎｇが施されて連続的（１エポックごと）に減衰される。 FIG. 8 is a diagram showing an example in which the weighting factor λ is attenuated every epoch. In FIG. 8, the horizontal axis is the number of epochs and the vertical axis is the value of λ. In the example shown in FIG. 8, the value of λ with respect to the number of epochs can be expressed by Equation (5) below.
λ = λ_int*(cos(π*epoch/M) + 1)/2 (5)
λ_int is the initial value of λ, epoch indicates the current epoch number, and M is the number of epochs to be learned. In the example shown in FIG. 8, λ is attenuated continuously (every epoch) by so-called cosine annealing.
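As a non-limiting illustration, Equation (5) may be evaluated in Python as follows. The function name `cosine_lambda` and its parameter names are hypothetical; the expression itself follows Equation (5) term by term.

```python
import math


def cosine_lambda(lam_init: float, epoch: int, total_epochs: int) -> float:
    """Equation (5): lam = lam_init * (cos(pi * epoch / M) + 1) / 2."""
    return lam_init * (math.cos(math.pi * epoch / total_epochs) + 1) / 2


# One value of lam per epoch for a planned M = 100 epochs.
schedule = [cosine_lambda(1.0, e, 100) for e in range(100)]
```

With this schedule, λ equals λ_int at epoch 0, is half of λ_int at the midpoint epoch = M/2, and approaches zero as epoch approaches M.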

＜３．留意事項等＞ <3. Notes, etc.>
本明細書中に開示されている種々の技術的特徴は、上記実施形態のほか、その技術的創作の主旨を逸脱しない範囲で種々の変更を加えることが可能である。すなわち、上記実施形態は、全ての点で例示であって、制限的なものではないと考えられるべきであり、本発明の技術的範囲は、上記実施形態の説明ではなく、特許請求の範囲によって示されるものであり、特許請求の範囲と均等の意味及び範囲内に属する全ての変更が含まれると理解されるべきである。また、本明細書中に示される複数の実施形態及び変形例は可能な範囲で適宜組み合わせて実施されてよい。 In addition to the embodiments described above, the technical features disclosed in this specification may be modified in various ways without departing from the gist of the technical creation. That is, the above embodiments should be considered illustrative in all respects and not restrictive. The technical scope of the present invention is defined by the claims rather than by the description of the above embodiments, and should be understood to include all modifications that fall within the meaning and range of equivalency of the claims. In addition, the embodiments and modifications described in this specification may be combined as appropriate to the extent possible.

１０・・・学習装置
１１４・・・評価部
１１５・・・変更部
10... Learning device
114... Evaluation unit
115... Changing unit

Claims

1. A learning device for training a student model using a teacher model, the learning device comprising:
an evaluation unit that obtains a performance difference between the teacher model and the student model; and
a changing unit that, based on the performance difference, changes the teacher model or changes a weighting factor used when calculating a loss of the student model.

2. The learning device according to claim 1, wherein, when the performance difference becomes smaller than a predetermined threshold, the changing unit changes the teacher model used for training the student model to a teacher model having higher performance than the teacher model in use.

3. The learning device according to claim 1, wherein the weighting factor is a coefficient for adjusting a balance between a loss for learning data and a loss for the teacher model.

4. The learning device according to claim 3, wherein the loss for the teacher model is an error in an inference result or an error in an intermediate layer feature map between the teacher model and the student model.

5. The learning device according to any one of claims 1 to 4, wherein the performance is object detection performance.

6. A learning method for training a student model using a teacher model, wherein the teacher model is changed based on a performance difference between the teacher model and the student model.

7. A learning method for training a student model using a teacher model, the learning method comprising:
obtaining a loss of the student model using a loss for learning data and a loss for the teacher model; and
training the student model while changing, at a predetermined timing, a weighting factor that adjusts a balance between the loss for the learning data and the loss for the teacher model.

8. The learning method according to claim 7, wherein the predetermined timing is when a performance difference between the teacher model and the student model becomes smaller than a predetermined threshold.

9. The learning method according to claim 7, wherein the predetermined timing is every predetermined number of epochs.