JP2022152023A

JP2022152023A - Feature quantity data generation device and method thereof, and machine learning device and method thereof

Info

Publication number: JP2022152023A
Application number: JP2021054630A
Authority: JP
Inventors: 竜介関; Ryusuke Seki; 康貴岡田; Yasutaka Okada; 雄喜片山; Yuki Katayama; 怜広見; Rei Hiromi; 葵荻島; Aoi Ogishima
Original assignee: Denso Ten Ltd
Current assignee: Denso Ten Ltd
Priority date: 2021-03-29
Filing date: 2021-03-29
Publication date: 2022-10-12

Abstract

PROBLEM TO BE SOLVED: To provide a feature quantity data generation device and a method thereof, and a machine learning device and a method thereof, that reduce learning time in machine learning. SOLUTION: In a data processor including a first learning data acquisition unit, a first coupling unit, a first learning unit, a second learning data acquisition unit, a second coupling unit, and a second learning unit, the second learning unit 60 acquires, as input data (IN_B), second coupled data composed of image data of a plurality of images each including a recognition target object. By compressing the input data IN_B with a learned encoder 32a, compressed data (E_B) including each feature quantity of a plurality of recognition target objects in the plurality of images is generated. The second learning unit 60 uses the compressed data E_B as input data to a neural network (NN) 61 to perform machine learning of the NN 61. SELECTED DRAWING: Figure 9

Description

The present invention relates to a feature quantity data generation device and method, and to a machine learning device and method.

Mini-batch learning is widely used when training an inference model for image recognition. In mini-batch learning, the image data of the learning images that make up the learning data is divided into mini-batches of a predetermined mini-batch size, and learning is performed one mini-batch at a time. For example, suppose the learning images are RGB color images whose horizontal pixel count W and vertical pixel count H are both 100. The data size of one learning image is then (W×H×3), and a mini-batch is formed by combining the image data of 32 learning images along the batch-size direction. The mini-batch size in this case is (W×H×3×32).

If, for example, the learning data contains 10,240 learning images, then because 10,240/32 = 320, executing mini-batch learning 320 times completes one pass of learning over all the learning images. That is, the number of iterations (repetitions) is 320, and 320 rounds of mini-batch learning correspond to one epoch.

Japanese Patent Application Laid-Open No. 2020-71808

In the above method, increasing the number of learning images in one mini-batch increases the mini-batch size proportionally but reduces the number of mini-batch learning executions per epoch. For example, with a mini-batch size of (W×H×3×320), executing mini-batch learning 32 times completes one pass of learning over all the learning images; that is, one epoch is completed in 32 rounds of mini-batch learning. Reducing the number of mini-batch learning rounds per epoch may shorten the learning time of the inference model (for example, the time required for the value of the loss function to fall to or below a predetermined threshold).
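The arithmetic in the preceding paragraphs can be checked with a short sketch. The function names below are illustrative and do not appear in the specification; the sketch assumes the number of images per mini-batch divides the dataset evenly, as in the 10,240-image example.

```python
def iterations_per_epoch(num_images: int, images_per_batch: int) -> int:
    """Number of mini-batch learning rounds needed to see every image once."""
    if num_images % images_per_batch != 0:
        raise ValueError("this sketch assumes the batch divides the dataset evenly")
    return num_images // images_per_batch


def minibatch_elements(width: int, height: int, channels: int, images_per_batch: int) -> int:
    """Element count of one mini-batch, i.e. W x H x 3 x (images per mini-batch)."""
    return width * height * channels * images_per_batch


W, H, C = 100, 100, 3
for n in (32, 320):
    # Prints the images per mini-batch, iterations per epoch, and mini-batch element count.
    print(n, iterations_per_epoch(10240, n), minibatch_elements(W, H, C, n))
```

Going from 32 to 320 images per mini-batch cuts the iterations per epoch tenfold, while the mini-batch itself becomes ten times larger.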

However, because the memory capacity of a device that performs machine learning is limited, the mini-batch size cannot be increased without limit. Although it depends on the size of each learning image, in practice the upper limit is often about 32 learning images per mini-batch. Increasing the number of learning images per mini-batch beyond 32 therefore raises the cost of the device as the required memory capacity grows. It would be beneficial to reduce the learning time without increasing the required memory capacity.

An object of the present invention is to provide a feature quantity data generation device and method, and a machine learning device and method, that contribute to reducing learning time.

A feature quantity data generation device according to the present invention includes: an image data acquisition unit that acquires image data of a plurality of images each including a recognition target object; and a feature quantity data generation unit that compresses the image data of the plurality of images to generate feature quantity data including the feature quantities of the plurality of recognition target objects in the plurality of images (first configuration).

In the feature quantity data generation device according to the first configuration, the plurality of images may include two or more images captured consecutively in time by a predetermined camera (second configuration).

A machine learning device according to the present invention includes: a first combining unit that generates first combined data by combining image data of a plurality of first input images in a channel direction; a first learning unit that receives the first combined data and trains an autoencoder having an encoder that compresses the first combined data in the channel direction and a decoder that restores the compression; a second combining unit that generates second combined data by combining image data of a plurality of second input images in the channel direction; and a second learning unit that inputs the second combined data to the encoder trained by the first learning unit, inputs the compressed data output from that encoder to a neural network, and thereby trains the neural network (third configuration).

In the machine learning device according to the third configuration, the second learning unit may train the neural network using teacher data including a plurality of label data items associated with the plurality of second input images (fourth configuration).

In the machine learning device according to the fourth configuration, the second learning unit may create an inference model capable of object detection by training the neural network (fifth configuration).

In the machine learning device according to the fifth configuration, each first input image and each second input image may include a recognition target object of the object detection (sixth configuration).

In the machine learning device according to any one of the third to sixth configurations, the image data of the plurality of first input images may be arranged in the channel direction in the first combined data, and the image data of the plurality of second input images may be arranged in the channel direction in the second combined data. In learning by the first learning unit, the encoder compresses the first combined data by reducing its number of dimensions in the channel direction. In learning by the second learning unit, the encoder trained by the first learning unit compresses the second combined data by reducing its number of dimensions in the channel direction, thereby producing the compressed data (seventh configuration).

In the machine learning device according to the seventh configuration, the image data of each first input image and the image data of each second input image may include image data for a plurality of colors. In the first combined data, the image data for the plurality of colors of each first input image is arranged in the channel direction, and in the second combined data, the image data for the plurality of colors of each second input image is arranged in the channel direction (eighth configuration).

A feature quantity data generation method according to the present invention includes: an image data acquisition step of acquiring image data of a plurality of images each including a recognition target object; and a feature quantity data generation step of compressing the image data of the plurality of images to generate feature quantity data including the feature quantities of the plurality of recognition target objects in the plurality of images (ninth configuration).

A machine learning method according to the present invention includes: a first combining step of generating first combined data by combining image data of a plurality of first input images in a channel direction; a first learning step of receiving the first combined data and training an autoencoder having an encoder that compresses the first combined data in the channel direction and a decoder that restores the compression; a second combining step of generating second combined data by combining image data of a plurality of second input images in the channel direction; and a second learning step of inputting the second combined data to the encoder trained in the first learning step, inputting the compressed data output from that encoder to a neural network, and thereby training the neural network (tenth configuration).

According to the present invention, it is possible to provide a data recording device and method that contribute to improved convenience in data recording.

FIG. 1 is a configuration diagram of a data processing device according to an embodiment of the present invention.
FIG. 2 is a configuration diagram of first learning data according to the embodiment of the present invention.
FIG. 3 is a configuration diagram of second learning data according to the embodiment of the present invention.
FIG. 4 is a diagram showing one input image and its corresponding label data according to the embodiment of the present invention.
FIG. 5 is a configuration diagram of one input image as an RGB color image according to the embodiment of the present invention.
FIG. 6 is a configuration diagram of first combined data according to the embodiment of the present invention.
FIG. 7 is an explanatory diagram of the configuration and operation of an autoencoder according to the embodiment of the present invention.
FIG. 8 is a configuration diagram of second combined data according to the embodiment of the present invention.
FIG. 9 is an explanatory diagram of the learning operation of the second learning unit according to the embodiment of the present invention.
FIG. 10 is an explanatory diagram of input data to the neural network of the second learning unit according to the embodiment of the present invention.
FIG. 11 is a diagram for explaining the contents of teacher data according to the embodiment of the present invention.
FIG. 12 is an operation flowchart of the data processing device according to the embodiment of the present invention.
FIG. 13 is a diagram for explaining the effect of data compression according to the embodiment of the present invention.
FIG. 14 is a configuration diagram of a feature quantity data generation device according to the embodiment of the present invention.

Examples of embodiments of the present invention are described in detail below with reference to the drawings. In the referenced drawings, identical parts are given identical reference numerals, and duplicate descriptions of identical parts are omitted in principle. In this specification, to simplify the description, a symbol or reference numeral that refers to information, a signal, a physical quantity, a member, or the like may be written so that the name of the corresponding information, signal, physical quantity, or member is omitted or abbreviated. For example, the second learning data acquisition unit referred to by "40" below (see FIG. 1) may be written as the second learning data acquisition unit 40 or abbreviated as the acquisition unit 40; these all refer to the same thing.

Although details are given later, in this embodiment the first learning data is used to generate a trained encoder 32a that can extract feature quantities contained in images (see FIG. 9). The trained encoder 32a is then used to generate data (compressed data) in which the feature quantities of recognition target objects are extracted from the second learning data. The trained encoder 32a extracts these feature quantities by means of compression. Next, the NN 61 is trained using this compressed data, and through this training the NN 61 becomes an inference model for object detection. Because the NN 61 is trained on the compressed data containing the extracted feature quantities, the learning time of the NN 61 can be reduced. Details follow.

FIG. 1 shows a configuration diagram of a data processing device 1 according to this embodiment. The data processing device 1 is an example of a machine learning device. The data processing device 1 includes a first learning data acquisition unit 10, a first combining unit 20, a first learning unit 30, a second learning data acquisition unit 40, a second combining unit 50, and a second learning unit 60. The data processing device 1 may be configured as a single computer device or as a plurality of physically separate computer devices. The data processing device 1 may also be configured using cloud computing.

The first learning data acquisition unit 10 acquires first learning data including image data of a plurality of images. Because the image data of each image in the first learning data is input to the first combining unit 20, each image in the first learning data is called a first input image. As shown in FIG. 2, the first learning data includes image data of a total of P first input images IA[1] to IA[P]. P is any integer of 2 or more and has a value of, for example, tens to thousands. The first input image IA[i] may be written simply as the input image IA[i], where i represents an arbitrary integer.

The second learning data acquisition unit 40 acquires second learning data including image data of a plurality of images. Because the image data of each image in the second learning data is input to the second combining unit 50, each image in the second learning data is called a second input image. As shown in FIG. 3, the second learning data includes image data of a total of Q second input images IB[1] to IB[Q]. Q is any integer of 2 or more and has a value of, for example, thousands to tens of thousands. The second input image IB[i] may be written simply as the input image IB[i]. In this embodiment, the second learning data contains more images than the first learning data; that is, P < Q holds.

Any image, such as a first input image or a second input image, includes the image data of that image and other data (hereinafter called additional data). Any image may be an image captured by a camera. The additional data of an image comprises the data in the image other than its image data, and further includes capture time information indicating when the image was captured.

As described later, in the data processing device 1 an inference model (algorithm) is created through learning in the second learning unit 60, and the inference model can perform object detection as a form of image recognition. Object detection involves localization, which identifies the position of an object in the image being recognized, and classification, which identifies the class (type) of that object. Each first input image and each second input image includes one or more objects to be recognized, although some of the first input images may contain no recognition target object, and likewise some of the second input images may contain none. In this embodiment, "object" means a recognition target object, that is, an object subject to image recognition in object detection. One or more of the first input images may also contain things other than recognition target objects, and the same holds for the second input images.

In this embodiment, the fact that an image contains image data of an object may be expressed by saying that the object is included or present in the image. Similarly, the fact that an image region of interest within an image (for example, an object region described later) contains image data of an object may be expressed by saying that the object is included or present in that image region.

The second learning data includes label data for each second input image. In the second learning data, the label data associated with the second input image IB[i] is referred to by the symbol "LB[i]". For each object included in the second input image IB[i], the label data LB[i] includes position information specifying the position of the object and class information specifying the class of the object.

FIG. 4 shows an input image 610, which is an example of the second input image IB[i]. The input image 610 in FIG. 4 includes three objects 611 to 613. The objects 611, 612, and 613 are a vehicle, a person, and a traffic light, respectively, and all are recognition target objects. Here, vehicles, people, and traffic lights are classified into the first, second, and third classes, respectively, and the inference model is assumed to be able to perform object detection on objects of a plurality of classes including the first to third classes. The vehicle is assumed here to be an automobile that can travel on roads.

For the input image 610 in FIG. 4, an object region 611B surrounding the image of the object 611, an object region 612B surrounding the image of the object 612, and an object region 613B surrounding the image of the object 613 are set. The object region of an object is a rectangular region (preferably the smallest such region) surrounding the image of that object, and is also called a bounding box.

Label information 620 corresponding to the input image 610 in FIG. 4 includes position information POS_611 and class information CLS_611 for the object 611, position information POS_612 and class information CLS_612 for the object 612, and position information POS_613 and class information CLS_613 for the object 613. If the input image 610 is the second input image IB[i], the label information 620 is the label data LB[i]. The position information POS_611, POS_612, and POS_613 represents the positions of the object regions 611B, 612B, and 613B in the input image 610, respectively. Specifically, the position information POS_611 defines the coordinates of the two end points of one diagonal of the rectangular object region 611B (corresponding to the coordinates (x_1, y_1) and (x_2, y_2) in FIG. 4). The same applies to the other position information. The class information CLS_611, CLS_612, and CLS_613 represents the classes to which the objects 611, 612, and 613 belong, respectively. In the example of FIG. 4, CLS_611, CLS_612, and CLS_613 indicate the first class (vehicles), the second class (people), and the third class (traffic lights), respectively.
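One way to picture label information such as 620 is as a list with one record per object, each holding the two diagonal end points and a class index. The following is a minimal sketch; the type name `ObjectLabel`, its field names, and all pixel coordinates are hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ObjectLabel:
    # Two end points of one diagonal of the bounding box: (x_1, y_1) and (x_2, y_2).
    corner_a: Tuple[int, int]
    corner_b: Tuple[int, int]
    # Class index: 1 = vehicle, 2 = person, 3 = traffic light (as in the FIG. 4 example).
    class_id: int

    def width(self) -> int:
        return abs(self.corner_b[0] - self.corner_a[0])

    def height(self) -> int:
        return abs(self.corner_b[1] - self.corner_a[1])


# Hypothetical label information for an image like input image 610.
# The coordinates are made-up values for illustration only.
label_620: List[ObjectLabel] = [
    ObjectLabel((40, 120), (200, 220), class_id=1),   # plays the role of POS_611 / CLS_611
    ObjectLabel((230, 100), (270, 210), class_id=2),  # plays the role of POS_612 / CLS_612
    ObjectLabel((300, 20), (330, 90), class_id=3),    # plays the role of POS_613 / CLS_613
]
```

Storing two diagonal end points is enough to recover the full rectangle, which is why the position information needs only four numbers per object.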

For example, the first input images IA[1] to IA[P] and the second input images IB[1] to IB[Q] may be selected from images captured by a camera mounted on a vehicle such as an automobile. The first input images IA[1] to IA[P] and the second input images IB[1] to IB[Q] may partially overlap.

The first learning data acquisition unit 10 may be a functional block that creates the first learning data itself, or it may receive first learning data created in advance from an external device (not shown) other than the data processing device 1 via wired or wireless communication. Similarly, the second learning data acquisition unit 40 may be a functional block that creates the second learning data itself, or it may receive second learning data created in advance from an external device (not shown) other than the data processing device 1 via wired or wireless communication.

Each first input image and each second input image is a two-dimensional still image with extent in the horizontal and vertical directions. One or more of the first input images may be frames of a moving image, and likewise one or more of the second input images may be frames of a moving image. Let W denote the number of pixels in the horizontal direction and H the number of pixels in the vertical direction of each first input image and each second input image. Each first input image and each second input image then consists of (W×H) pixels. The first and second input images are assumed to be color images in RGB format. That is, each pixel of a first input image and each pixel of a second input image has an R signal representing the red component, a G signal representing the green component, and a B signal representing the blue component.

An input image 650, which is any one first input image or any one second input image, can therefore be regarded, as shown in FIG. 5, as consisting of a red grayscale image 650R made up of (W×H) pixels and having only the R signal as its color signal, a green grayscale image 650G made up of (W×H) pixels and having only the G signal as its color signal, and a blue grayscale image 650B made up of (W×H) pixels and having only the B signal as its color signal. The images 650R, 650G, and 650B are arranged in a channel direction that differs from both the horizontal and vertical directions. The number of color signal types making up the image data of the input image 650 is denoted by C; here, C = 3.
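The decomposition of one RGB image into three single-color planes along the channel direction can be sketched with NumPy. The array layout (height, width, channel), the small sizes, and the random pixel values are assumptions made for illustration; the specification does not prescribe a memory layout.

```python
import numpy as np

H, W, C = 4, 6, 3  # small illustrative sizes; C = 3 for RGB
rng = np.random.default_rng(0)

# Stand-in for input image 650, stored as (height, width, channel).
image_650 = rng.integers(0, 256, size=(H, W, C), dtype=np.uint8)

# Single-color planes corresponding to 650R, 650G, and 650B.
red_650r = image_650[:, :, 0]
green_650g = image_650[:, :, 1]
blue_650b = image_650[:, :, 2]

# Stacking the planes along the last axis (the channel direction) restores the image.
restacked = np.stack([red_650r, green_650g, blue_650b], axis=-1)
```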

Referring again to FIG. 1, the first combining unit 20 divides the first input images IA[1] to IA[P] into mini-batches of a predetermined mini-batch size. For each mini-batch, the first combining unit 20 generates first combined data by combining the first input images belonging to that mini-batch in the channel direction. Here, the mini-batch size is the data size of N first input images, where N is any integer of 2 or more, for example N = 32. (P/N) items of first combined data are formed from the first input images IA[1] to IA[P], where (P/N) is any integer of 2 or more.
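The division into (P/N) mini-batches can be sketched as follows. The function name and the choice of consecutive 1-based index ranges are assumptions for illustration; the specification does not say how images are assigned to mini-batches.

```python
from typing import List


def split_into_minibatches(p: int, n: int) -> List[range]:
    """Return 1-based image index ranges, one per mini-batch of n images."""
    if n < 2 or p % n != 0 or p // n < 2:
        raise ValueError("this sketch assumes N >= 2 and that P/N is an integer >= 2")
    return [range(start, start + n) for start in range(1, p + 1, n)]


batches = split_into_minibatches(p=96, n=32)
for b in batches:
    # Each line names the first and last first input image of one mini-batch.
    print(f"IA[{b[0]}] to IA[{b[-1]}]")
```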

FIG. 6 shows one item of first combined data generated by the first combining unit 20 (that is, the structure of one mini-batch). The first combined data shown in FIG. 6 is composed of the image data of the N input images IA[i] to IA[i+N−1], which is arranged in the channel direction. Each of the input images IA[i] to IA[i+N−1] is composed of a red grayscale image, a green grayscale image, and a blue grayscale image arranged in the channel direction. Therefore, in the first combined data, the following are arranged in the channel direction: the image data for the plurality of colors of the input image IA[i] (the image data of its red, green, and blue grayscale images), the image data for the plurality of colors of the input image IA[i+1], and so on up to the image data for the plurality of colors of the input image IA[i+N−1].

The first combined data therefore corresponds to (C×N) single-color two-dimensional images, each consisting of (H×W) single-color pixels, arranged along the channel direction. The first combined data has a data amount of (W×H×C×N) single-color pixels. In the first combined data, the number of channels is (C×N), and hence the number of dimensions in the channel direction is (C×N).
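Combining N images in the channel direction can be sketched with NumPy as a concatenation along the last axis. The (height, width, channel) layout, the small image size, and the random pixel values are assumptions for illustration only.

```python
import numpy as np

H, W, C, N = 4, 6, 3, 32  # small illustrative image size; N images per mini-batch
rng = np.random.default_rng(0)

# Stand-ins for the N first input images of one mini-batch, each of shape (H, W, C).
images = [rng.random((H, W, C), dtype=np.float32) for _ in range(N)]

# Concatenating along the channel axis yields one array with C*N channels:
# R, G, B of the first image, then R, G, B of the second image, and so on.
first_combined = np.concatenate(images, axis=-1)

print(first_combined.shape)  # (H, W, C*N)
print(first_combined.size)   # W*H*C*N single-color pixels
```

Channels 3k to 3k+2 of the combined array hold the R, G, and B planes of the k-th image (counting from zero), so no pixel data is lost or reordered within an image.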

Referring again to FIG. 1, the first learning unit 30 has a neural network 31 (hereinafter, NN 31) and performs machine learning of the NN 31 using the first learning data. This machine learning is performed in units of mini-batches. That is, the (P/N) items of first combined data based on the first learning data are used one after another as input data to the NN 31, and machine learning of the NN 31 is performed mini-batch by mini-batch (mini-batch learning). The machine learning in the first learning unit 30 may be classified as deep learning, so the NN 31 may be a deep neural network. The machine learning in the first learning unit 30 is unsupervised learning, and the NN 31 forms an autoencoder. In other words, the first learning unit 30 trains an autoencoder (that is, it creates an autoencoder by training the NN 31).

FIG. 7 shows the configuration of the autoencoder. The NN 31 forming the autoencoder includes an encoder 32 and a decoder 33. The type of autoencoder is arbitrary and may be, for example, a variational autoencoder (VAE) or a convolutional autoencoder (CAE). The first combined data is input to the encoder 32 as input data IN_A, and the encoder 32 compresses the input data IN_A to generate compressed data E_A. The decoder 33 obtains output data OUT_A by restoring the compressed data E_A (that is, by undoing the compression performed by the encoder 32). In the machine learning in the first learning unit 30, each parameter (bias and weight) of the NN 31 is adjusted so that the output data OUT_A matches the input data IN_A.

At this time, the encoder 32 is designed so that the input data IN_A (and hence the first combined data) is compressed in the channel direction, and the decoder 33 is designed so that the compressed data E_A is restored in the channel direction. That is, compression by the encoder 32 corresponds to dimensionality reduction in the channel direction: the encoder 32 reduces the number of dimensions in the channel direction of the input data IN_A (and hence the first combined data) from (C×N) to J. In other words, the number of dimensions in the channel direction of the input data IN_A is (C×N), and by reducing the dimensionality of the input data IN_A in the channel direction, the encoder 32 obtains compressed data E_A whose number of dimensions in the channel direction is J. The encoder 32 can also be regarded as reducing the number of channels from (C×N) to J.
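To make the channel-direction reduction concrete, the sketch below applies a per-pixel linear map from C×N channels to J channels and back, and computes the reconstruction error that the training would try to reduce. This is an assumption for illustration only: the specification leaves the autoencoder architecture open (VAE, CAE, or other), the weights here are random and untrained, and all names (matvec, enc_weights, dec_weights) are hypothetical.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m (rows x cols) by vector v (length cols)."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def mean_squared_error(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


C, N, J = 3, 32, 3
rng = random.Random(0)

# Stand-in for encoder 32: maps (C*N) channels to J channels at one pixel.
enc_weights = random_matrix(J, C * N, rng)
# Stand-in for decoder 33: maps J channels back to (C*N) channels.
dec_weights = random_matrix(C * N, J, rng)

pixel_in = [rng.random() for _ in range(C * N)]  # one pixel of IN_A
pixel_code = matvec(enc_weights, pixel_in)       # one pixel of E_A
pixel_out = matvec(dec_weights, pixel_code)      # one pixel of OUT_A

# Training would adjust the weights so that OUT_A approaches IN_A,
# i.e. so that this reconstruction loss becomes small.
loss = mean_squared_error(pixel_out, pixel_in)
```

Applying the same map at every pixel position leaves the H×W spatial extent untouched, which is the sense in which the compression acts only in the channel direction.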

Here, C×N > J holds. For example, if (C, N, J) = (3, 32, 3), the encoder 32 reduces the number of dimensions in the channel direction of the input data IN_A (and hence the first combined data) from 96 to 3. In this case, since 3/(3×32) = 1/32, the data size of the compressed data E_A is 1/32 of the data size of the input data IN_A.
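The ratio in this numerical example can be checked with a few lines of exact arithmetic. The function name channel_compression_ratio is a hypothetical helper introduced for this sketch.

```python
from fractions import Fraction


def channel_compression_ratio(c: int, n: int, j: int) -> Fraction:
    """Size of the compressed data relative to the input data, J / (C*N)."""
    if not c * n > j:
        raise ValueError("the description requires C*N > J")
    return Fraction(j, c * n)


channels_in = 3 * 32                          # C * N = 96 dimensions
ratio = channel_compression_ratio(3, 32, 3)   # J / (C*N)
```

Because only the channel count changes, the same ratio applies to the overall data size of one mini-batch.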

The first learning unit 30 continues the machine learning of the NN 31 until the training error (the value of the loss function) of the NN 31 functioning as the autoencoder becomes equal to or less than a predetermined value. The encoder 32 after completion of this machine learning is hereinafter referred to specifically as the learned encoder 32a (see FIG. 9).

Referring again to FIG. 1, the second combining unit 50 divides the second input images IB[1] to IB[Q] into mini-batches having a predetermined mini-batch size. The mini-batch size in the second combining unit 50 is the same as the mini-batch size in the first combining unit 20. Therefore, the mini-batch size in the second combining unit 50 is the data size of N second input images (for example, N = 32). For each mini-batch, the second combining unit 50 generates second combined data by combining the plurality of second input images belonging to that mini-batch in the channel direction. From the second input images IB[1] to IB[Q], (Q/N) pieces of second combined data are formed. (Q/N) is an arbitrary integer of 2 or more and takes a value of, for example, several hundred to several thousand.
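The division of Q second input images into (Q/N) mini-batches can be sketched as below. Integer identifiers stand in for the images IB[1] to IB[Q], the value Q = 10240 is taken from the numerical example given later in the description, and split_into_minibatches is a hypothetical helper name.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_minibatches(items: Sequence[T], n: int) -> List[List[T]]:
    """Split items into consecutive groups of exactly n elements."""
    if n <= 0 or len(items) % n != 0:
        raise ValueError("Q must be a positive multiple of N")
    return [list(items[i:i + n]) for i in range(0, len(items), n)]


Q, N = 10240, 32
image_ids = list(range(1, Q + 1))                  # stand-ins for IB[1]..IB[Q]
minibatches = split_into_minibatches(image_ids, N)
num_second_combined = len(minibatches)             # Q / N
```

Each group would then be combined in the channel direction to give one piece of second combined data.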

FIG. 8 shows one piece of second combined data (that is, the structure of one mini-batch) generated by the second combining unit 50. The structure of the second combined data is the same as that of the first combined data. That is, the second combined data shown in FIG. 8 is composed of the image data of the input images IB[i] to IB[i+N], which are arranged in the channel direction. Each of the input images IB[i] to IB[i+N] is composed of a red grayscale image, a green grayscale image, and a blue grayscale image arranged in the channel direction. Therefore, in the second combined data, the image data for the plurality of colors of the input image IB[i] (the image data of its red, green, and blue grayscale images), the image data for the plurality of colors of the input image IB[i+1] (likewise red, green, and blue), ..., and the image data for the plurality of colors of the input image IB[i+N] (likewise red, green, and blue) are arranged in the channel direction.

Therefore, the second combined data corresponds to (C×N) monochromatic two-dimensional images, each consisting of (H×W) monochromatic pixels, arranged along the channel direction. The second combined data thus has a data amount of (W×H×C×N) monochromatic pixels. In the second combined data, the number of channels is (C×N), and hence the number of dimensions in the channel direction is (C×N).

Referring again to FIG. 1, the second learning unit 60 has a neural network 61 (hereinafter, NN 61) and performs machine learning of the NN 61 using the second learning data. The machine learning in the second learning unit 60 may be classified as deep learning, and therefore the NN 61 may be a deep neural network. The machine learning in the second learning unit 60 is supervised learning, and the NN 61 forms an inference model for object detection. That is, the second learning unit 60 trains an inference model for object detection (in other words, it creates an inference model capable of object detection by training the NN 61).

The machine learning by the second learning unit 60 will be described with reference to FIG. 9. This machine learning uses the learned encoder 32a described above. The second combined data is input to the learned encoder 32a as input data IN_B, and the learned encoder 32a compresses the input data IN_B to generate compressed data E_B. In the compression by the learned encoder 32a, the number of dimensions in the channel direction of the input data IN_B (and hence the second combined data) is reduced from (C×N) to J. In other words, the number of dimensions in the channel direction of the input data IN_B is (C×N), and by reducing the dimensionality of the input data IN_B in the channel direction, the learned encoder 32a obtains compressed data E_B whose number of dimensions in the channel direction is J. The learned encoder 32a can also be regarded as reducing the number of channels from (C×N) to J.

By sequentially inputting the plurality of pieces of second combined data based on the second input images IB[1] to IB[Q] to the learned encoder 32a as input data IN_B, a plurality of pieces of compressed data E_B based on those pieces of second combined data are obtained.

The second learning unit 60 performs machine learning of the NN 61 using the compressed data E_B as input data to the NN 61. This machine learning is performed in units of mini-batches (that is, mini-batch learning is performed). The mini-batch size in the machine learning of the NN 61 may differ from the mini-batch size in the machine learning of the NN 31, but here they are assumed to be the same. The mini-batch size in the machine learning of the NN 61 is then the data size of N second input images, which is (W×H×C×N).

FIG. 10 shows an outline of the flow up to the generation of the input data to the NN 61. In FIG. 10, data DTa contains (C×N)/J sets of image data, each set consisting of the image data of N second input images. Data DTb consists of (C×N)/J pieces of input data IN_B, that is, (C×N)/J pieces of second combined data. Data DTc consists of (C×N)/J pieces of compressed data E_B based on the data DTb.

The data DTb is obtained by inputting the data DTa to the second combining unit 50. That is, (C×N)/J sets of image data, each consisting of the image data of N second input images, are sequentially input to the second combining unit 50, and the second combining unit 50 outputs (C×N)/J pieces of second combined data. The data size of each piece of second combined data is (W×H×C×N). Therefore, the data size of the data DTb is (W×H×C×N)×(C×N)/J. The same applies to the data size of the data DTa.

Each piece of second combined data is input to the learned encoder 32a as input data IN_B, so that compressed data E_B is generated for each piece of second combined data. As a result, the data DTc consisting of (C×N)/J pieces of compressed data E_B is obtained. Since the learned encoder 32a reduces the number of dimensions in the channel direction from (C×N) to J, the data size of one piece of compressed data E_B is (W×H×J). Therefore, the data size of the data DTc is (W×H×J)×(C×N)/J = (W×H×C×N).
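The size bookkeeping for DTa, DTb, and DTc can be verified numerically. The sketch below counts monochromatic pixel values and uses W = H = 100 from the example in the background section together with (C, N, J) = (3, 32, 3); the function name data_sizes and the dictionary keys are hypothetical.

```python
from typing import Dict


def data_sizes(w: int, h: int, c: int, n: int, j: int) -> Dict[str, int]:
    """Sizes, in monochromatic pixel values, of the data in FIG. 10."""
    if (c * n) % j != 0:
        raise ValueError("this sketch assumes (C*N)/J is an integer")
    groups = (c * n) // j            # number of second combined data pieces
    one_combined = w * h * c * n     # size of one piece of IN_B
    one_compressed = w * h * j       # size of one piece of E_B
    return {
        "groups": groups,
        "DTa": one_combined * groups,
        "DTb": one_combined * groups,
        "DTc": one_compressed * groups,
        "minibatch": one_combined,   # mini-batch size of NN 61
    }


sizes = data_sizes(w=100, h=100, c=3, n=32, j=3)
```

With these values DTc comes out equal to the mini-batch size (W×H×C×N), so the compressed data for all (C×N)/J groups fits into a single mini-batch of the NN 61.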

Compressed data E_B amounting to a data size of (W×H×C×N) is input to the NN 61 as the data for one round of mini-batch learning. This corresponds to inputting the information of N×(C×N)/J input images to the NN 61 in each round of mini-batch learning of the NN 61. For example, in a numerical example in which (C, N, J) = (3, 32, 3) and Q = 10240, the information of 32^2 (= 1024) input images is input to the NN 61 in one round of mini-batch learning. Then, since 10240/32^2 = 10, performing the mini-batch learning of the NN 61 ten times completes one pass of learning that uses all the second input images constituting the second learning data (that is, the number of iterations is 10).

In a hypothetical case in which the image data of N second input images is itself input to the NN 61, the mini-batch learning of the NN 61 must be performed 320 times in the above numerical example (10240/32 = 320) to complete one pass of learning that uses all the second input images constituting the second learning data, so the learning time is longer than with the data processing device 1.
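The two iteration counts in this numerical example can be reproduced as follows. The function iterations_per_pass is a hypothetical helper; its use_encoder flag selects between the embodiment (compressed data E_B fed to the NN 61) and the hypothetical case (raw image data fed to the NN 61).

```python
def iterations_per_pass(q: int, n: int, c: int, j: int, use_encoder: bool) -> int:
    """Mini-batch steps needed to use all Q second input images once."""
    if use_encoder:
        if (c * n) % j != 0:
            raise ValueError("this sketch assumes (C*N)/J is an integer")
        images_per_step = n * ((c * n) // j)   # N * (C*N)/J images per step
    else:
        images_per_step = n                    # N raw images per step
    if q % images_per_step != 0:
        raise ValueError("Q must be a multiple of the images used per step")
    return q // images_per_step


Q, N, C, J = 10240, 32, 3, 3
with_encoder = iterations_per_pass(Q, N, C, J, use_encoder=True)
without_encoder = iterations_per_pass(Q, N, C, J, use_encoder=False)
reduction_factor = without_encoder // with_encoder
```

The reduction factor equals (C×N)/J, which is the same factor by which the encoder shrinks the channel dimension.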

In the machine learning of the NN 61 in the second learning unit 60, the NN 61 generates output data OUT_B on the basis of compressed data E_B having the mini-batch size (see FIG. 9). For each mini-batch (that is, for each round of mini-batch learning), the second learning unit 60 derives the value of a loss function corresponding to the error between the output data OUT_B and the teacher data, and adjusts the parameters (weights and biases) of the NN 61 by backpropagation so that the value of the loss function is reduced. The machine learning of the NN 61 (that is, the machine learning of the inference model for object detection) is continued until the value of the loss function becomes equal to or less than a predetermined threshold.

In the mini-batch learning of the NN 61, the teacher data is composed of the label data for all the second input images used in that round of mini-batch learning. For example, if in a certain round of mini-batch learning the data DTa (see FIG. 10) is composed of the image data of the second input images IB[1] to IB[1024], the teacher data for that round is composed of the label data LB[1] to LB[1024]. That is, if, for example, the data DTa contains the image data of the second input images IB[1] and IB[2], then, as shown in FIG. 11, the corresponding label data LB[1] and LB[2] are combined and included in the teacher data corresponding to the data DTa (FIG. 11 shows only the information of the label data LB[1] and LB[2]).
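Assembling the teacher data for one round of mini-batch learning amounts to collecting the label data of every image contained in DTa. The sketch below assumes a hypothetical label layout (an image index, a class name, and a bounding box); the excerpt here does not define the actual format of the label data LB[i], and build_teacher_data is an illustrative name.

```python
from typing import Dict, List

# Hypothetical label layout; the real structure of LB[i] is not given here.
Label = Dict[str, object]


def build_teacher_data(image_indices: List[int],
                       label_store: Dict[int, Label]) -> List[Label]:
    """Collect the label data for every image used in one mini-batch step."""
    missing = [i for i in image_indices if i not in label_store]
    if missing:
        raise KeyError(f"no label data for images: {missing}")
    return [label_store[i] for i in image_indices]


# Stand-ins for LB[1]..LB[1024] with placeholder contents.
label_store = {
    i: {"image": i, "class": "vehicle", "box": (0, 0, 10, 10)}
    for i in range(1, 1025)
}
dta_indices = list(range(1, 1025))   # DTa holds IB[1]..IB[1024]
teacher_data = build_teacher_data(dta_indices, label_store)
```

Keeping the labels in the same order as the images in DTa is a design choice of this sketch, made so that each label can be matched to its image after the channel-direction combination.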

In the machine learning using the data DTa, the NN 61, which is to function as the inference model, localizes and classifies the objects in each input image constituting the data DTa, and outputs the results of the localization and classification as the output data OUT_B. The value of the loss function is derived by comparing this output data OUT_B with the teacher data corresponding to the data DTa.

FIG. 12 shows an operation flowchart of the data processing device 1. First, in step S1, the first learning data acquisition unit 10 acquires the first learning data. Next, in step S2, the first combining unit 20 generates the first combined data on the basis of the first learning data. Next, in step S3, the first learning unit 30 trains the autoencoder (that is, trains the NN 31) on the basis of the first combined data, thereby creating the learned encoder 32a. Next, in step S4, the second learning data acquisition unit 40 acquires the second learning data. Note that the second learning data may be acquired at any time before step S5.

After the second learning data is acquired, in step S5 the second combining unit 50 generates the second combined data on the basis of the second learning data. The teacher data described above is also created at this time. The teacher data may be created by either the second combining unit 50 or the second learning unit 60. Then, in step S6, the second combined data is input to the learned encoder 32a to generate the compressed data (E_B), and the second learning unit 60 trains the inference model for object detection on the basis of the generated compressed data (E_B) (in other words, it creates the inference model for object detection by training the NN 61).

In this embodiment, compressing the learning data (the second learning data) as described above increases the amount of information in the data per mini-batch in the second learning unit 60. That is, compared with the hypothetical case in which the image data of N second input images is itself input to the NN 61, the amount of information per mini-batch in the second learning unit 60 increases by a factor of (C×N)/J (for example, by a factor of 32). Therefore, in comparison with the hypothetical case, the learning time in the second learning unit 60 (for example, the time required for the value of the loss function of the inference model formed by the NN 61 to become equal to or less than a predetermined threshold) can be shortened. Viewed differently, if the learning time is held constant, the required memory capacity can be reduced.

FIG. 13 shows the data size compression effect and related effects for the case where (C×N)/J = 3×32/3 = 32. In the numerical example of FIG. 13, using the learned encoder 32a compresses the data size (data amount) of the input data IN_B to 1/32 to obtain the compressed data E_B. Therefore, if the learning time per unit amount of data input to the NN 61 is regarded as constant, the learning time of the NN 61 is shortened to 1/32 of that in the hypothetical case. The number of iterations in the learning of the NN 61 is likewise reduced to 1/32 of the number required in the hypothetical case. On the other hand, if the learning time of the NN 61 according to this embodiment is made equal to the learning time in the hypothetical case, the required memory capacity can be reduced to 1/32 of that in the hypothetical case.

Since each first input image includes a recognition target object of the inference model, the autoencoder extracts the feature amount of the recognition target object of each first input image from the first combined data (IN_A) and includes it in the compressed data E_A. In other words, the training of the autoencoder proceeds so that the feature amount of the recognition target object is extracted from an input image including that object, and the learned encoder 32a is formed as a result. Therefore, when the second combined data (IN_B) based on the second input images including recognition target objects is input to the learned encoder 32a, the learned encoder 32a extracts the feature amount of the recognition target object in each second input image and includes it in the compressed data E_B. Inputting this compressed data E_B to the inference model (NN 61) enables efficient learning that contributes to shortening the learning time.

When the functions of the data processing device 1 are considered from the viewpoint of the feature amounts of recognition target objects, the data processing device 1 can be regarded as functioning as, or including, the feature amount data generation device 2 of FIG. 14. The feature amount data generation device 2 includes an image data acquisition unit 2A that acquires a plurality of images II each including a recognition target object, and a feature amount data generation unit 2B that generates feature amount data including each feature amount of the plurality of recognition target objects in the plurality of images II by compressing the image data of the plurality of images II. The plurality of images II handled by the acquisition unit 2A and the generation unit 2B correspond to the plurality of second input images. The acquisition unit 2A corresponds to the acquisition unit 40 of FIG. 1, and the generation unit 2B corresponds to a functional block including the combining unit 50 of FIG. 1 and the learned encoder 32a of FIG. 9. The feature amount data generated by the generation unit 2B corresponds to the compressed data E_B.

That is, the feature amount data generation device 2 acquires a plurality of second input images each including a recognition target object, and generates feature amount data (E_B) including each feature amount of the plurality of recognition target objects in the plurality of second input images. Training the inference model for object detection (NN 61) with this feature amount data enables efficient learning that contributes to shortening the learning time.

Here, the plurality of images II preferably include two or more images captured in temporal succession by a predetermined camera (not shown). That is, at least some of the second input images IB[1] to IB[Q] may be two or more images captured in temporal succession by the predetermined camera. The predetermined camera captures the scene (subject) within its own imaging area and generates the image data of a camera image, which is the captured image. The predetermined camera captures images periodically at a predetermined frame rate. The predetermined camera thus acquires a plurality of camera images arranged in time series at intervals equal to the reciprocal of the frame rate. The plurality of camera images arranged in time series (hereinafter, a camera image sequence) correspond to the two or more images captured in temporal succession by the predetermined camera.

The predetermined camera may be a fixed-point camera fixed at a given location. In this case, the scenery in the camera images (the portion other than the recognition target objects) is expected to hardly change across the camera image sequence, with only the vehicles or people serving as recognition target objects moving within the sequence. The compression effect of the encoder 32 (the learned encoder 32a) is then enhanced, so that the feature amounts of the recognition target objects can be extracted efficiently, which in turn promotes efficient learning of the inference model (NN 61). The predetermined camera may instead be a camera mounted on a moving body such as a vehicle.

From the collection of images captured by the predetermined camera, the image data acquisition unit 2A may extract the two or more temporally consecutive images on the basis of the capture time information included in the additional data of each image. For example, when the images captured by the predetermined camera include a first captured image and a second captured image, and the time difference between the capture time of the first captured image and the capture time of the second captured image is equal to or less than a predetermined time, the first and second captured images may be extracted as two images captured in temporal succession.
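The extraction rule based on capture times can be sketched as below. The description states only the pairwise criterion (time difference at or below a predetermined time); grouping images into runs of neighbours that each satisfy that criterion is an assumption of this sketch, as are the name extract_consecutive_runs and the use of seconds as the time unit.

```python
from typing import List


def extract_consecutive_runs(timestamps: List[float],
                             max_gap: float) -> List[List[int]]:
    """Return groups of image indices (two or more per group) whose
    neighbouring capture times differ by at most max_gap."""
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    runs: List[List[int]] = []
    current: List[int] = []
    for idx in order:
        # A gap larger than max_gap ends the current run.
        if current and timestamps[idx] - timestamps[current[-1]] > max_gap:
            if len(current) >= 2:
                runs.append(current)
            current = []
        current.append(idx)
    if len(current) >= 2:
        runs.append(current)
    return runs


# Hypothetical capture times in seconds, as read from image metadata.
capture_times = [0.0, 0.1, 0.2, 5.0, 9.0, 9.1]
runs = extract_consecutive_runs(capture_times, max_gap=0.15)
```

In this example the image captured at 5.0 s has no neighbour within the allowed gap, so it is not treated as part of any temporally consecutive group.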

Supplementary matters, applied techniques, modified techniques, and the like for the configuration described above are given below.

Each first input image and each second input image may be a monochrome image (a grayscale image having no color information). In this case, C = 1.

The encoder 32 (including the learned encoder 32a) compresses the input data (IN_A or IN_B) in the channel direction; in doing so, the input data may also be compressed in the horizontal or vertical direction of the input images.

As hardware, the data processing device 1 includes arithmetic processing units such as a CPU (Central Processing Unit) and a GPU (Graphics Processing Unit), as well as a ROM (Read Only Memory), a RAM (Random Access Memory), and the like. The data processing device 1 may realize the functions of the units shown in FIG. 1, and hence the processing of steps S1 to S6 of FIG. 12, by having the CPU execute a program stored in the ROM or a program acquired from another device through communication.

The inference model created by the data processing device 1 may be applied to an in-vehicle device (not shown). An in-vehicle device is a type of electronic device mounted on a vehicle such as an automobile. In this case, the inference model formed by the NN 61 through the machine learning of the NN 61 by the second learning unit 60 may be applied to the in-vehicle device. The in-vehicle device may then perform object detection using the inference model, and the inference results may be used for automated driving, driving assistance, or the like carried out in the vehicle.

Note that the data processing device 1 itself may be an in-vehicle device. Some vehicles (for example, broadcast relay vehicles) are equipped with in-vehicle devices having abundant computational resources, and in such cases in particular the data processing device 1 itself can be an in-vehicle device.

Part or all of the processing executed by the data processing device 1 may be realized by a combination of software and hardware. A computer program that causes a computer to execute the method described above, and a computer-readable recording medium on which that program is recorded, fall within the scope of this embodiment. Examples of the computer-readable recording medium include a flexible disk, a hard disk, a CD-ROM, an MO, a DVD, a DVD-ROM, a DVD-RAM, a large-capacity DVD, a next-generation DVD, and a semiconductor memory.

The embodiments of the present invention can be modified in various ways as appropriate within the scope of the technical idea set forth in the claims. The above embodiments are merely examples of embodiments of the present invention, and the meanings of the terms of the present invention and of each constituent element are not limited to those described in the above embodiments. The specific numerical values given in the above description are merely illustrative and can naturally be changed to various other values.

1 data processing device
2 feature amount data generation device
2A image data acquisition unit
2B feature amount data generation unit
10 first learning data acquisition unit
20 first combining unit
30 first learning unit
31 neural network (autoencoder)
32 encoder
33 decoder
40 second learning data acquisition unit
50 second combining unit
60 second learning unit
61 neural network (inference model)

Claims

1. A feature amount data generation device comprising:
an image data acquisition unit that acquires image data of a plurality of images each including a recognition target object; and
a feature amount data generation unit configured to generate feature amount data including each feature amount of a plurality of recognition target objects in the plurality of images by compressing the image data of the plurality of images.

2. The feature amount data generation device according to claim 1, wherein said plurality of images includes two or more images captured temporally continuously by a predetermined camera.

3. A machine learning device comprising:
a first combining unit that generates first combined data by combining image data of a plurality of first input images in a channel direction;
a first learning unit that receives the first combined data and trains an autoencoder having an encoder that compresses the first combined data in the channel direction and a decoder that restores the data compressed by the encoder;
a second combining unit that generates second combined data by combining image data of a plurality of second input images in the channel direction; and
a second learning unit that trains a neural network by inputting, to the neural network, compressed data that the encoder outputs when the second combined data is input to the encoder after learning by the first learning unit.
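
As a non-limiting illustration of combining image data "in a channel direction", the following minimal sketch concatenates several images along their channel axis. It assumes NumPy arrays laid out as (height, width, channels); the function name `combine_in_channel_direction` and the sizes (32 RGB images of 100 x 100 pixels) are hypothetical choices for this example and are not recited in the claims.

```python
import numpy as np


def combine_in_channel_direction(images):
    """Concatenate images of shape (H, W, C) along the channel axis."""
    return np.concatenate(images, axis=-1)


# Hypothetical sizes: 32 RGB images of 100 x 100 pixels.
rng = np.random.default_rng(0)
images = [rng.random((100, 100, 3), dtype=np.float32) for _ in range(32)]

combined = combine_in_channel_direction(images)
print(combined.shape)  # (100, 100, 96)
```

Under this assumed layout, channels 3k to 3k+2 of the combined data hold the three color planes of the k-th input image, so the spatial size is unchanged and only the channel count grows.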

4. The machine learning device according to claim 3, wherein said second learning unit trains said neural network using teacher data including a plurality of label data associated with said plurality of second input images.

5. The machine learning device according to claim 4, wherein said second learning unit creates an inference model capable of object detection by training said neural network.

6. The machine learning device according to claim 5, wherein each first input image and each second input image include a recognition target object in said object detection.

7. The machine learning device according to any one of claims 3 to 6, wherein
in the first combined data, the image data of the plurality of first input images are arranged in the channel direction,
in the second combined data, the image data of the plurality of second input images are arranged in the channel direction,
in learning by the first learning unit, the encoder compresses the first combined data by reducing the number of dimensions of the first combined data in the channel direction, and
in learning by the second learning unit, the encoder after learning by the first learning unit compresses the second combined data by reducing the number of dimensions of the second combined data in the channel direction, whereby said compressed data is obtained.
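
Reducing the number of dimensions in the channel direction can be pictured, purely as a sketch, with a per-pixel linear map over channels (comparable to a 1 x 1 convolution). The claims do not fix the layer type, so the linear map, the function names `encode` and `decode`, the untrained random weights, and the channel counts (96 in, 12 out) are all assumptions made for this example.

```python
import numpy as np


def encode(x, w_enc):
    """Per-pixel linear map over channels: (H, W, C_in) -> (H, W, C_code)."""
    return np.einsum("hwc,cd->hwd", x, w_enc)


def decode(z, w_dec):
    """Per-pixel linear map back to the original channel count."""
    return np.einsum("hwd,dc->hwc", z, w_dec)


rng = np.random.default_rng(0)
c_in, c_code = 96, 12  # assumed channel counts, not taken from the claims

# Untrained placeholder weights; a real autoencoder would learn these.
w_enc = rng.normal(scale=0.1, size=(c_in, c_code))
w_dec = rng.normal(scale=0.1, size=(c_code, c_in))

x = rng.random((100, 100, c_in))  # stands in for channel-combined data
z = encode(x, w_enc)              # compressed data
x_hat = decode(z, w_dec)          # restored data

print(z.shape, x_hat.shape)  # (100, 100, 12) (100, 100, 96)
```

In this sketch the height and width are untouched and only the channel dimension shrinks, which is the sense of "in the channel direction" assumed here.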

8. The machine learning device according to claim 7, wherein
the image data of each first input image and the image data of each second input image include image data for a plurality of colors,
in the first combined data, the image data for the plurality of colors of each first input image are arranged in the channel direction, and
in said second combined data, the image data for said plurality of colors of each second input image are arranged in said channel direction.

9. A feature amount data generation method comprising:
an image data acquisition step of acquiring image data of a plurality of images each including a recognition target object; and
a feature amount data generating step of compressing the image data of the plurality of images to generate feature amount data including the feature amounts of the plurality of recognition target objects in the plurality of images.

10. A machine learning method comprising:
a first combining step of generating first combined data by combining image data of a plurality of first input images in a channel direction;
a first learning step of receiving the first combined data and training an autoencoder having an encoder that compresses the first combined data in the channel direction and a decoder that restores the data compressed by the encoder;
a second combining step of generating second combined data by combining image data of a plurality of second input images in the channel direction; and
a second learning step of training a neural network by inputting, to the neural network, compressed data that the encoder outputs when the second combined data is input to the encoder after learning in the first learning step.
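
The two learning steps can be sketched end to end on toy data. This is only an illustration under strong assumptions: a linear autoencoder and a logistic-regression classifier are swapped in for the autoencoder and the neural network, each row of the toy arrays stands in for the channel vector at one pixel position of channel-combined data, and every name, size, label, and learning rate below is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 256 channel vectors of 24 channels each.
n, c_in, c_code = 256, 24, 4
basis = rng.normal(size=(c_code, c_in))
x1 = rng.normal(size=(n, c_code)) @ basis            # first combined data
x2 = rng.normal(size=(n, c_code)) @ basis            # second combined data
y2 = (x2 @ rng.normal(size=c_in) > 0).astype(float)  # hypothetical labels


def reconstruction_loss(x, w_enc, w_dec):
    return float(np.mean((x @ w_enc @ w_dec - x) ** 2))


def logistic_loss(e, y, w):
    p = np.clip(1.0 / (1.0 + np.exp(-(e @ w))), 1e-12, 1.0 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


# First learning step: train encoder and decoder to reconstruct x1.
w_enc = rng.normal(scale=0.1, size=(c_in, c_code))
w_dec = rng.normal(scale=0.1, size=(c_code, c_in))
ae_loss_before = reconstruction_loss(x1, w_enc, w_dec)
for _ in range(500):
    z = x1 @ w_enc
    r = z @ w_dec - x1                      # reconstruction residual
    grad_dec = 2.0 * z.T @ r / r.size
    grad_enc = 2.0 * x1.T @ (r @ w_dec.T) / r.size
    w_enc -= 0.01 * grad_enc
    w_dec -= 0.01 * grad_dec
ae_loss_after = reconstruction_loss(x1, w_enc, w_dec)

# Second learning step: the trained encoder is frozen; only the classifier
# weights are updated, and the classifier sees compressed data only.
e2 = x2 @ w_enc
w_cls = np.zeros(c_code)
cls_loss_before = logistic_loss(e2, y2, w_cls)
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(e2 @ w_cls)))
    w_cls -= 0.01 * e2.T @ (p - y2) / n
cls_loss_after = logistic_loss(e2, y2, w_cls)

print(e2.shape)  # (256, 4)
```

In this sketch the second stage operates on 4-dimensional compressed vectors instead of the 24-dimensional inputs, which illustrates why training on the encoder output can involve less data per sample than training on the combined image data directly.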