JP7168485B2

JP7168485B2 - LEARNING DATA GENERATION METHOD, LEARNING DATA GENERATION DEVICE, AND PROGRAM

Info

Publication number: JP7168485B2
Application number: JP2019028270A
Authority: JP
Inventors: 晃下山
Original assignee: Hitachi Solutions Create Ltd
Current assignee: Hitachi Solutions Create Ltd
Priority date: 2019-02-20
Filing date: 2019-02-20
Publication date: 2022-11-09
Anticipated expiration: 2039-02-20
Also published as: JP2020135432A

Description

The present invention relates to a learning data generation method, a learning data generation device, and a program.

To build an image recognition model by machine learning, a large number of still images of the recognition targets, such as people and objects, must be prepared as learning data.

To improve the accuracy of image recognition using such a model, it is preferable to train the model on high-quality learning data that contains a rich variety of still images captured under different shooting conditions, such as shooting direction, distance, and tilt.

For this reason, techniques for efficiently generating such learning data have been developed (see, for example, Patent Document 1).

Patent Document 1: JP 2016-062524 A

Meanwhile, learning data containing a large number of still images is also generated by shooting the image recognition target with a video camera and extracting the still image of each frame from the resulting video data.

In this case, however, the still images in the learning data were captured only a few hundredths of a second apart. The generated learning data therefore contains a large number of redundant still images whose shooting conditions, such as direction and distance, are almost identical. As a result, the data volume grows and training takes longer.

Conversely, if shooting is kept short in an attempt to reduce the volume of learning data, the still images needed to train the model may not be obtained in sufficient number.

The present invention has been made in view of these problems, and one of its objects is to provide a learning data generation method, a learning data generation device, and a program that efficiently generate high-quality learning data for training an image recognition model.

A learning data generation method according to one embodiment of the present invention is a method of generating learning data for training an image recognition model, in which a computer executes: a process of acquiring video data in which the target of the image recognition is captured; and a process of generating the learning data by obtaining a feature amount of the still image of each frame included in the video data, obtaining, for each combination of two still images selected from those still images, the difference between the feature amounts of the two still images forming the combination as an index value representing the degree of difference between the two still images, and removing redundant still images from the still images included in the video data on the basis of the index values.

Other problems disclosed in the present application, and their solutions, will become apparent from the description of the embodiments and from the drawings.

High-quality learning data for training an image recognition model can be generated efficiently.

FIG. 1 is an overall configuration diagram of the information system.
FIG. 2 is a diagram showing the hardware configuration of the user terminal.
FIG. 3 is a diagram showing the hardware configuration of the learning data generation device.
FIG. 4 is a diagram showing the storage device.
FIG. 5 is a diagram showing the neural network model.
FIG. 6 is a diagram showing an outline of the learning data generation processing.
FIG. 7 is a diagram showing the functional configuration of the learning data generation device.
FIG. 8 is a flowchart showing the flow of the learning data generation processing.
FIG. 9 is a flowchart showing the flow of the similar image deletion processing.
FIG. 10 is a diagram for explaining the similar image deletion processing.
FIG. 11 is a flowchart showing the flow of the similar image deletion processing.
FIG. 12 is a diagram for explaining the similar image deletion processing.
FIG. 13 is a diagram for explaining the similar image deletion processing.
FIG. 14 is a flowchart showing the flow of the learning data generation processing.
FIG. 15 is a diagram for explaining the grouping of still images.

At least the following matters will become apparent from this specification and the accompanying drawings. The present invention is described below on the basis of one embodiment, with reference to the accompanying drawings.

[First embodiment]
== Overall configuration ==
FIG. 1 shows an information system 1000 including a learning data generation device 200 and a user terminal 100 according to one embodiment of the present invention. The learning data generation device 200 and the user terminal 100 are communicably connected through a network 500 such as the Internet, a LAN (Local Area Network), or a telephone network.

The learning data generation device 200 is a computer or information processing device, such as a server, a personal computer, or a cloud computer, that generates learning data for training an image recognition model (hereinafter also referred to as the image recognition model 610 or the first neural network model 610).

The image recognition model 610 comprises mathematical expressions, such as discriminants or functions, for determining from image data whether the subject appearing in an image is a specific recognition target. When the image recognition model 610 is trained, the coefficients of these expressions are adjusted and the accuracy of image recognition changes.

As described in detail later, and as shown in FIG. 6, the learning data generation device 200 according to this embodiment efficiently generates learning data 640 for training the image recognition model 610 by removing redundant still images 630 from the still images 621 of the frames included in video data 620 in which an image recognition target, such as a person or an object, is captured.

The redundant still images 630 are selected, for example, from among mutually similar still images 621 in the video data 620. As described in detail later, in this embodiment the similarity (degree of difference) of the still images 621 in the video data 620 is determined by comparing the feature amounts 650 of the still images 621 with one another, and the redundant still images 630 are identified on that basis.

By generating the learning data 640 from the video data 620, the learning data generation device 200 can efficiently collect a large amount of image data for training. By removing the redundant still images 630 from the video data 620, it can generate high-quality learning data 640 with high learning efficiency. This makes it possible to build an image recognition model 610 with high recognition accuracy efficiently.

Returning to FIG. 1, the user terminal 100 is a computer used by a user. It is, for example, a portable information processing device such as a smartphone, a mobile phone, a notebook computer, or a tablet, or a stationary information processing device such as a personal computer that the user uses at work or at home.

When the user transmits the video data 620 to the learning data generation device 200 from the user terminal 100, the learning data generation device 200 returns the learning data 640 to the user terminal 100. The user then trains the image recognition model 610 with this learning data 640, using the user terminal 100 or another computer (not shown).

A detailed description follows.

== User terminal ==
First, the user terminal 100 is described with reference to FIG. 2.

FIG. 2 shows an example of the hardware configuration of the user terminal 100. The user terminal 100 according to this embodiment is a computer, such as a smartphone or a personal computer, comprising a CPU (Central Processing Unit) 110, a memory 120, a communication device 130, a storage device 140, an input device 150, an output device 160, and a recording medium reading device 170.

The storage device 140 stores the user terminal control program 710 and various data, such as the video data 620, that are executed or processed by the user terminal 100.

The various functions of the user terminal 100 are realized when the user terminal control program 710 and the various data stored in the storage device 140 are read into the memory 120 and executed or processed by the CPU 110. For example, the user terminal 100 transmits the video data 620 to the learning data generation device 200.

Here, the storage device 140 is a nonvolatile storage device such as a hard disk, an SSD (Solid State Drive), or a flash memory.

The user terminal control program 710 is a collective term for the programs that realize the various functions of the user terminal 100 according to this embodiment. It includes, for example, application programs running on the user terminal 100, an OS (Operating System), and various libraries.

The recording medium reading device 170 reads various programs and data recorded on a recording medium 800, such as an SD card or a DVD, and stores them in the storage device 140.

The communication device 130 exchanges various programs and data with the learning data generation device 200 and with other computers (not shown) via the network 500. For example, the user terminal control program 710 and the video data 620 described above may be stored on another computer, and the user terminal 100 may download them from that computer.

The input device 150 is a device that accepts commands and data entered by the user. It includes input interfaces such as buttons, switches, a keyboard, a touch sensor that detects the touch position on a touch panel display, and a microphone, as well as an acceleration sensor, a temperature sensor, position detection sensors such as a GPS receiver and a compass, and a camera.

The output device 160 is an output user interface, for example a display device such as a display, a speaker, a vibrator, or a light.

== Learning data generation device ==
The learning data generation device 200 is a computer that generates the learning data 640 from the video data 620. As shown in FIG. 3, the learning data generation device 200 comprises a CPU 210, a memory 220, a communication device 230, a storage device 240, an input device 250, an output device 260, and a recording medium reading device 270. The hardware configuration of the learning data generation device 200 is not necessarily identical to that of the user terminal 100, but the basic configuration is the same, so a duplicate description of these hardware components is omitted.

As shown in FIG. 4, the storage device 240 of the learning data generation device 200 stores various programs and data, such as the learning data generation device control program 720 executed by the learning data generation device 200, the video data 620 acquired from the user terminal 100, a neural network model 600 (hereinafter also referred to as the second neural network model 600), the learning data 640, and the feature amounts 650 (vector data 651).

The various functions of the learning data generation device 200 are realized when the learning data generation device control program 720 and the various data, such as the video data 620, stored in the storage device 240 are read into the memory 220 and executed or processed by the CPU 210.

The video data 620 is data in which the image recognition target is captured. This embodiment places no particular restrictions on the specifications of the video data 620 (standard, frame rate, resolution, screen size, and so on), and video data 620 of any specification may be used. For example, the video data 620 may be a recording of video shot with the video recording function of the user terminal 100, or a recording of video shot with a video camera (not shown).

The neural network model 600 is used to obtain the feature amount 650 of each still image 621 in the video data 620. In this embodiment, as an example, the neural network model 600 is a CNN (Convolutional Neural Network), and as shown in FIG. 5, vector data 651 can be taken from an intermediate layer as the feature amount 650. So that redundant still images 630 can be correctly identified from the similarity (degree of difference) of the still images 621 in the video data 620, the neural network model 600 has been trained in advance, to some extent, to output similar vector data 651 when the still images 621 are similar.

In this embodiment, the similarity of still images 621 is determined by comparing the Euclidean distance between the vector data 651 of the two still images 621 being compared with a threshold A (a predetermined judgment value) described later. Specifically, when the Euclidean distance between the vector data 651 of two still images 621 is equal to or less than the threshold A, those still images 621 are judged to be similar.
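The similarity test described above can be sketched as follows. This is a minimal illustration, not the implementation in the specification: the function name `is_similar`, the parameter name `threshold_a`, and the sample vectors are assumptions introduced here, and the vectors stand in for vector data 651.

```python
import math


def is_similar(vec_a, vec_b, threshold_a):
    """Judge two still images similar when the Euclidean distance between
    their feature vectors is equal to or less than threshold A."""
    return math.dist(vec_a, vec_b) <= threshold_a


# Hypothetical feature vectors (real ones would come from an intermediate layer).
v1 = [0.0, 0.0, 0.0]
v2 = [3.0, 4.0, 0.0]  # Euclidean distance from v1 is 5.0

similar_at_5 = is_similar(v1, v2, threshold_a=5.0)
similar_at_4_9 = is_similar(v1, v2, threshold_a=4.9)
print(similar_at_5, similar_at_4_9)
```

Because the comparison is "equal to or less than", a pair whose distance is exactly the threshold counts as similar.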

The neural network model 600 may also be another type of model, such as an RNN (Recurrent Neural Network), as long as vector data 651 (the feature amount 650) can be taken from an intermediate layer.

The vector data 651 may be output from any of the intermediate layers of the neural network model 600. However, vector data output from the intermediate layer immediately before the output layer, or from an intermediate layer close to it, is preferable because the characteristics of the recognition target are quantified more clearly in the feature amount 650.

In this way, the similarity of the still images 621 in the video data 620 is determined and the redundant still images 630 are removed, so that the learning data 640 is generated and stored in the storage device 240 as shown in FIG. 4.

<Functional configuration>
Next, FIG. 7 shows an example of the functional configuration of the learning data generation device 200. The learning data generation device 200 according to this embodiment includes the functions of a video data acquisition unit 201 and a learning data generation unit 202.

Each of these functions is realized when the hardware of the learning data generation device 200 executes the learning data generation device control program 720 according to this embodiment.

The video data acquisition unit 201 acquires the video data 620 in which a recognition target, such as a person or an object, to be recognized by the image recognition model 610 is captured. The video data acquisition unit 201 may acquire the video data 620 from the user terminal 100. Alternatively, when the video data 620 is stored on another computer (not shown), it may acquire the video data 620 from that computer in response to an instruction from the user terminal 100.

As described above, there are no particular restrictions on the specifications of the video data 620, such as the standard and the frame rate, and the video data acquisition unit 201 can acquire video data 620 of various specifications.

For each combination of two still images 621 selected from the still images 621 of the frames included in the video data 620, the learning data generation unit 202 obtains an index value representing the degree of difference (similarity) between the two still images 621. On the basis of these index values, it generates the learning data 640 by removing the redundant still images 630 from the still images 621 included in the video data 620.

In this way, high-quality learning data 640 for training the image recognition model 610 can be generated efficiently.

When obtaining the index values, the learning data generation unit 202 may obtain the feature amount 650 (vector data 651 in this embodiment) of each still image 621 included in the video data 620 and, for each combination, obtain the difference between the feature amounts 650 of the two still images 621 forming the combination as the index value.
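Computing one index value per combination of two still images can be sketched as below. This is an illustrative sketch under stated assumptions: the function name `pairwise_index_values`, the dictionary layout, and the two-dimensional sample vectors are hypothetical, and the Euclidean distance is used as the "difference" between feature amounts, following the similarity test described earlier in this embodiment.

```python
import math
from itertools import combinations


def pairwise_index_values(features):
    """Return a dict mapping each unordered pair of image names to the
    Euclidean distance between their feature vectors (the index value)."""
    return {
        frozenset((a, b)): math.dist(features[a], features[b])
        for a, b in combinations(sorted(features), 2)
    }


# Hypothetical feature vectors keyed by still image name.
features = {"A": [0.0, 0.0], "B": [3.0, 4.0], "C": [6.0, 8.0]}
index_values = pairwise_index_values(features)

for pair, value in index_values.items():
    print(sorted(pair), value)
```

With n still images this produces n(n-1)/2 index values, so the table grows quadratically with the number of frames.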

In this way, two still images 621 whose feature amounts 650 differ less can be judged to differ less from each other, so both or either of these still images 621 can be identified as a redundant still image 630. This makes it possible to identify the redundant still images 630 more accurately.

Besides the vector data 651 taken from an intermediate layer of the neural network model 600, various other feature amounts 650 can be used for the still images 621, such as HOG features, EOH features, Haar-like features, pixel difference features, or combinations of these. With any of these feature amounts 650, two still images 621 whose feature amounts 650 differ less can be judged to differ less from each other.

As described above, the learning data generation unit 202 inputs each still image 621 included in the video data 620 into the neural network model 600, such as a CNN, and uses the output data from an intermediate layer of the neural network model 600 to obtain vector data 651 as the feature amount 650 of the still image 621. If the image recognition model 610 and the neural network model 600 are the same type of neural network (a CNN in this embodiment), the two models share the same characteristics, so learning data 640 suited to the characteristics of the image recognition model 610 can be generated. This makes it possible to train the image recognition model 610 more efficiently.

For example, when a CNN is given two still images in which the same object appears at positions that differ by a translation within the frame, the feature amounts (vector data) obtained from the intermediate layer for the two images are almost equal, so the CNN can correctly recognize the objects as the same object. When it is given two still images in which the same object appears rotated within the frame, however, the difference between the feature amounts (vector data) tends to be large, so the CNN is prone to misrecognizing the objects as different objects.

Consequently, even if two still images 621 whose vector data 651 output from the neural network model 600 are almost equal are both left in the learning data 640, the image recognition model 610 can already recognize the subject correctly from either of them, so they contribute little to training. It is therefore preferable to remove at least one of these still images 621 as a redundant still image 630.

Conversely, it is preferable to leave in the learning data 640 two still images 621 whose vector data 651 output from the neural network model 600 differ greatly (images that are likely to be misrecognized as different subjects even though the subject is the same), because this allows the image recognition model 610 to learn that they show the same subject.

The learning data generation unit 202 may generate the learning data 640 as follows. Taking one still image 621 in the video data 620, it checks whether, among the combinations of that still image 621 with each of the other still images 621, there is a combination whose index value representing the degree of difference (similarity) between the two still images 621 is equal to or less than a predetermined judgment value (threshold A). If there is, it removes that one still image 621 as a redundant still image 630. It repeats this process with each still image 621 included in the video data 620 taken in turn as the one still image 621.

For example, suppose that, as shown in FIG. 12, there are five still images 621 (A, B, C, D, E), and that the index values representing the degree of difference between any two of them are the values shown in FIG. 12 (for example, the index value of still images A and B is 70). First, still image A 621 is taken as the one still image 621, and it is determined whether any of its combinations with the other still images 621 (B, C, D, E) has an index value equal to or less than the judgment value (for example, 100). In the example of FIG. 12, the index value of still image A 621 and still image B 621 is 70 (100 or less), so still image A 621 is removed.

Next, still image B 621 is taken as the one still image 621, and it is determined whether any of its combinations with the other still images 621 (C, D, E) has an index value equal to or less than the judgment value (100). The index value of still image B 621 and still image C 621 is 60, so still image B 621 is also removed.

The same processing is then performed with still images C, D, and E 621 taken in turn as the one still image 621. As a result, still image C 621 is removed, while still images D and E 621 remain. The learning data generation unit 202 therefore generates learning data 640 consisting of still images D and E 621.

In this way, learning data 640 that does not include redundant still images 630 can be generated.
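The in-order removal walked through above can be sketched as follows. This is a hedged illustration rather than the implementation in the specification. The names `remove_sequentially`, `INDEX_VALUES`, and `index_value` are hypothetical. The index values for A-B (70), B-C (60), B-E (40), C-D (50), and C-E (90) are the ones quoted in the text; the remaining values are assumptions chosen only to be consistent with it (FIG. 12 itself is not reproduced here).

```python
# Index values for the five still images. Values marked "assumed" do not
# appear in the text and were picked to be consistent with it.
INDEX_VALUES = {
    frozenset("AB"): 70,
    frozenset("AC"): 120,  # assumed
    frozenset("AD"): 150,  # assumed (text only implies > 100)
    frozenset("AE"): 130,  # assumed (text only implies > 100)
    frozenset("BC"): 60,
    frozenset("BD"): 110,  # assumed
    frozenset("BE"): 40,
    frozenset("CD"): 50,
    frozenset("CE"): 90,
    frozenset("DE"): 140,  # assumed (text only implies > 100)
}


def index_value(a, b):
    """Look up the degree-of-difference index value of two still images."""
    return INDEX_VALUES[frozenset((a, b))]


def remove_sequentially(names, threshold):
    """Take each image in turn; remove it if any image still remaining
    is within the threshold of it."""
    remaining = list(names)
    for current in names:
        others = [n for n in remaining if n != current]
        if any(index_value(current, other) <= threshold for other in others):
            remaining.remove(current)  # judged redundant
    return remaining


kept = remove_sequentially(["A", "B", "C", "D", "E"], threshold=100)
print(kept)  # ['D', 'E']
```

Note that the result depends on the order in which the images are visited: here A, B, and C are each removed because a later image is still close to them.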

Returning to FIG. 7, the learning data generation unit 202 may instead generate the learning data 640 as follows. From among the combinations of two still images 621 in the video data 620, it identifies a first combination, which has the smallest index value representing the degree of difference. It then identifies a second combination, which has the smallest index value among the other combinations that include one of the two still images 621 forming the first combination. It removes the still image 621 common to the first and second combinations as a redundant still image 630. It repeats this process until no combination has an index value equal to or less than the predetermined judgment value.

Using the example of FIG. 12 again, the learning data generation unit 202 first identifies the combination of still image B 621 and still image E 621 (index value 40) as the combination with the smallest index value (the first combination). This is the combination marked "α" in the list of index values for the combinations of still images 621 shown in FIG. 13(a).

The learning data generation unit 202 then identifies the combination with the smallest index value (the second combination) among the other combinations that include still image B 621 and the other combinations that include still image E 621. In the example of FIG. 12, among the other combinations that include still image B 621, the smallest index value belongs to the combination of still image B 621 and still image C 621 (index value 60). Among the other combinations that include still image E 621, the smallest index value belongs to the combination of still image E 621 and still image C 621 (index value 90). The learning data generation unit 202 therefore identifies the combination of still image B 621 and still image C 621, which has the smaller index value, as the second combination. This is the combination marked "β" in the list of index values shown in FIG. 13(a).

そして学習データ生成部６０２は、図１３（ａ）において、第１の組み合わせ（α）と第２の組みわせ（β）に共通する静止画像Ｂ６２１を冗長な静止画像６３０として取り除く。 Then, the learning data generation unit 602 removes the still image B621 common to the first combination (α) and the second combination (β) as a redundant still image 630 in FIG. 13(a).

続いて、学習データ生成部６０２は、図１３（ｂ）に示すように、静止画像Ｂ６２１を取り除いた各組み合わせのうち、指標値が最小となる組み合わせ（第１の組み合わせ）として、静止画像Ｃ６２１及び静止画像Ｄ６２１の組み合わせ（指標値５０）を特定する。そして学習データ生成部６０２は、第２の組み合わせとして、静止画像Ｃ６２１及び静止画像Ｅ６２１の組み合わせ（指標値９０）を特定する。 Subsequently, as shown in FIG. 13(b), the learning data generation unit 602 identifies the combination of the still image C621 and the still image D621 (index value 50) as the combination with the smallest index value (first combination) among the combinations that remain after the still image B621 is removed. The learning data generation unit 602 then identifies the combination of the still image C621 and the still image E621 (index value 90) as the second combination.

そして学習データ生成部６０２は、第１の組み合わせ（α）と第２の組みわせ（β）に共通する静止画像Ｃ６２１を冗長な静止画像６３０として取り除く。 The learning data generation unit 602 then removes the still image C621 common to the first combination (α) and the second combination (β) as a redundant still image 630 .

ここで図１３（ｃ）に示す様に、静止画像Ｂ６２１及び静止画像Ｃ６２１を取り除いた各組み合わせは、いずれの指標値も判定値（１００）よりも大きい。 Here, as shown in FIG. 13(c), every combination that remains after the still image B621 and the still image C621 are removed has an index value greater than the determination value (100).

そのため、学習データ生成部６０２は、静止画像Ａ、Ｄ、Ｅ６２１からなる学習データ６４０を生成する。 Therefore, the learning data generation unit 602 generates learning data 640 composed of still images A, D, and E 621 .

このように、動画データ６２０内の静止画像６２１の相違の程度が最小の静止画像６２１を優先的に削除することにより、より適切に、冗長な静止画像６３０を含まない学習データ６４０を生成することができる。 In this way, by preferentially deleting the still images 621 in the moving image data 620 that differ least from other still images 621, learning data 640 that does not include redundant still images 630 can be generated more appropriately.

＝＝処理の流れ＝＝ == Process flow ==
次に、本実施形態に係る情報システム１０００による処理の流れを、図８～図１５を参照しながら説明する。 Next, the flow of processing by the information system 1000 according to this embodiment will be described with reference to FIGS. 8 to 15.

まず学習データ生成装置２００は、画像認識の対象が撮影されている動画データ６２０を取得する（S1000）。学習データ生成装置２００は、ユーザ端末１００から動画データ６２０を取得しても良いし、ユーザ端末１００からの指示により不図示の他のコンピュータから取得しても良い。 First, the learning data generation device 200 acquires the moving image data 620 in which the target of image recognition is captured (S1000). The learning data generation device 200 may acquire the moving image data 620 from the user terminal 100, or may acquire it from another computer (not shown) in accordance with an instruction from the user terminal 100.

そして学習データ生成装置２００は、動画データ６２０から、各フレームの静止画像６２１を抽出する（S1010）。例えば学習データ生成装置２００は、フレームレートが３０fpsの５分間の動画データ６２０から９０００（３０×６０×５）枚の静止画像６２１を抽出する。 The learning data generation device 200 then extracts the still image 621 of each frame from the moving image data 620 (S1010). For example, the learning data generation device 200 extracts 9000 (30×60×5) still images 621 from five-minute video data 620 with a frame rate of 30 fps.

つぎに学習データ生成装置２００は、各静止画像６２１をニューラルネットワークモデル６００に入力し、それぞれのベクトルデータ６５１を求める（S1020）。 Next, the learning data generating device 200 inputs each still image 621 to the neural network model 600 to obtain respective vector data 651 (S1020).

そして学習データ生成装置２００は、閾値Ａ（上述した所定の判定値）を求める（S1030）。閾値Ａは、動画データ６２０内の静止画像６２１から選んだ２枚の静止画像６２１の類似性（相違の程度）を判定する際の判定値である。本実施形態では、２枚の静止画像６２１の各ベクトルデータ６５１の差分のノルム（例えば各ベクトルデータ６５１のユークリッド距離）が閾値Ａ以下である場合に、これら２枚の静止画像６２１は類似していると判定される。 Then, the learning data generation device 200 obtains a threshold A (the predetermined determination value described above) (S1030). The threshold A is the determination value used when judging the similarity (degree of difference) between two still images 621 selected from the still images 621 in the moving image data 620. In this embodiment, when the norm of the difference between the vector data 651 of two still images 621 (for example, the Euclidean distance between the vector data 651) is equal to or less than the threshold A, the two still images 621 are determined to be similar.
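The similarity test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `euclidean_distance` and `is_similar` are hypothetical, and the feature vectors are plain lists of floats standing in for the vector data 651.

```python
import math
from typing import Sequence


def euclidean_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Norm of the difference between two feature vectors (Euclidean distance)."""
    # math.dist raises ValueError if the two vectors differ in dimension.
    return math.dist(u, v)


def is_similar(u: Sequence[float], v: Sequence[float], threshold_a: float) -> bool:
    """Two frames count as similar when the distance is at most threshold A."""
    return euclidean_distance(u, v) <= threshold_a


if __name__ == "__main__":
    print(is_similar([0.0, 0.0], [3.0, 4.0], 5.0))  # True: distance is exactly 5.0
    print(is_similar([0.0, 0.0], [3.0, 4.0], 4.9))  # False
```

Note that the comparison is inclusive (equal to or less than the threshold), matching the wording of the determination in S2030.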

なお、閾値Ａは、各静止画像６２１のベクトルデータ６５１を元に決めると良い。例えば学習データ生成装置２００は、各ベクトルデータ６５１のＬ２ノルムの平均値を閾値Ａとして求めると良い。この理由の一つは、ニューラルネットワークモデル６００が静止画像６２１に写っている認識対象をうまく認識できる程、ベクトルデータ６５１すなわち特徴量６５０の大きさ（Ｌ２ノルム）の値が大きくなるからである。 Note that the threshold A is preferably determined based on the vector data 651 of the still images 621. For example, the learning data generation device 200 may use the average of the L2 norms of the vector data 651 as the threshold A. One reason is that the better the neural network model 600 recognizes the recognition target in the still image 621, the larger the magnitude (L2 norm) of the vector data 651, that is, of the feature quantity 650, becomes.

つまり、例えばニューラルネットワークモデル６００の学習が適切になされていれば、認識対象である物体Ｘが静止画像６２１に写っている場合のベクトルデータ６５１は、物体Ｘが静止画像６２１に写っていない場合のベクトルデータ６５１よりも大きな値になるはずだからである（こうなるように学習がなされる）。 In other words, if the neural network model 600 has been trained appropriately, the vector data 651 obtained when the object X to be recognized appears in the still image 621 should have a larger value than the vector data 651 obtained when the object X does not appear in the still image 621 (the model is trained so that this is the case).
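As a rough illustration of the example given above, the threshold A could be computed as the mean L2 norm of the per-frame feature vectors. The function name `mean_l2_norm` is hypothetical, and the vectors are assumed to be plain lists of floats.

```python
import math
from typing import Sequence


def mean_l2_norm(vectors: Sequence[Sequence[float]]) -> float:
    """Average L2 norm of the feature vectors, usable as a candidate threshold A."""
    if not vectors:
        raise ValueError("at least one feature vector is required")
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    return sum(norms) / len(norms)


if __name__ == "__main__":
    # Norms are 5.0 and 10.0, so the mean is 7.5.
    print(mean_l2_norm([[3.0, 4.0], [6.0, 8.0]]))  # 7.5
```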

なお、閾値Ａの値を大きくすると、類似と判断される静止画像６２１の枚数が増加するため、動画データ６２０から冗長な静止画像６３０として取り除かれる静止画像６２１の枚数が増加し、学習データ６４０のデータ量が減少する。逆に、閾値Ａの値を小さくすると、類似と判断される静止画像６２１の枚数が減少するため、動画データ６２０から冗長な静止画像６３０として取り除かれる静止画像６２１の枚数が減少し、学習データ６４０のデータ量が増加する。 Note that increasing the threshold A increases the number of still images 621 judged to be similar, so more still images 621 are removed from the moving image data 620 as redundant still images 630 and the data amount of the learning data 640 decreases. Conversely, decreasing the threshold A decreases the number of still images 621 judged to be similar, so fewer still images 621 are removed from the moving image data 620 as redundant still images 630 and the data amount of the learning data 640 increases.

このため、学習データ６４０に含まれる静止画像の枚数、あるいは冗長な静止画像６３０として取り除く静止画像６２１の枚数に応じて、閾値Ａを調整するとなおよい。このような態様により、学習データ６４０のデータサイズを適切に調整することが可能となる。 Therefore, it is more preferable to adjust the threshold A according to the number of still images included in the learning data 640 or the number of still images 621 to be removed as redundant still images 630 . Such an aspect makes it possible to appropriately adjust the data size of the learning data 640 .
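One possible way to adjust the threshold A to a target data size is to try several candidate values and keep the smallest one that brings the frame count down to the target. The sketch below is an assumption, not a procedure stated in the text: the names `count_kept` and `pick_threshold`, the candidate list, and the use of the sequential deletion rule (a frame is dropped when any later frame lies within the threshold) are all illustrative choices.

```python
import math
from typing import Optional, Sequence


def count_kept(vectors: Sequence[Sequence[float]], threshold_a: float) -> int:
    """Frames that survive when frame i is dropped if some later frame is within threshold_a."""
    n = len(vectors)
    return sum(
        1
        for i in range(n)
        if not any(
            math.dist(vectors[i], vectors[j]) <= threshold_a
            for j in range(i + 1, n)
        )
    )


def pick_threshold(
    vectors: Sequence[Sequence[float]],
    candidates: Sequence[float],
    max_frames: int,
) -> Optional[float]:
    """Smallest candidate threshold that keeps at most max_frames frames, or None."""
    for threshold_a in sorted(candidates):
        if count_kept(vectors, threshold_a) <= max_frames:
            return threshold_a
    return None


if __name__ == "__main__":
    features = [[0.0], [1.0], [2.0], [10.0]]  # hypothetical 1-D feature vectors
    print(pick_threshold(features, [0.5, 1.0, 8.0], max_frames=2))  # 1.0
```

Because a larger threshold can only cause more frames to be dropped under this rule, scanning the candidates in ascending order returns the least aggressive setting that meets the target.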

つぎに、学習データ生成装置２００は、類似画像削除処理を実行する（S1040）。これにより動画データ６２０から冗長な静止画像６３０を取り除くことができる。 Next, the learning data generation device 200 executes similar image deletion processing (S1040). This removes the redundant still images 630 from the moving image data 620.

類似画像削除処理は、図１０に示す様に、動画データ６２０内の２つの静止画像６２１の各組み合わせの内、一の静止画像６２１（図１０において符号iが付された静止画像６２１）と他の静止画像６２１（図１０において符号j,j+1,…,MAXが付された静止画像６２１）との組み合わせの中に、２つの静止画像６２１の相違の程度を表す指標値が所定の判定値（閾値Ａ）以下となる組み合わせがある場合に、上記一の静止画像６２１を冗長な静止画像６３０として取り除く処理を、動画データ６２０に含まれる各静止画像６２１を順に上記一の静止画像６２１として繰り返し行うことにより、学習データ６４０を生成するようにする処理である。 As shown in FIG. 10, the similar image deletion processing works as follows. Among the combinations of two still images 621 in the moving image data 620, if the combinations of one still image 621 (the still image 621 labeled i in FIG. 10) with the other still images 621 (the still images 621 labeled j, j+1, ..., MAX in FIG. 10) include a combination whose index value representing the degree of difference between the two still images 621 is equal to or less than the predetermined determination value (threshold A), that one still image 621 is removed as a redundant still image 630. The learning data 640 is generated by repeating this, taking each still image 621 included in the moving image data 620 in turn as the one still image 621.

類似画像削除処理の流れを、図９のフローチャートを参照しながら説明すると、まず学習データ生成装置２００は、制御変数として、i=1、j=i+1を設定する（S2000、S2010）。制御変数iは一の静止画像６２１を示し、制御変数jは他の静止画像６２１を示す。 The flow of similar image deletion processing will be described with reference to the flowchart of FIG. 9. First, the learning data generation device 200 sets i=1 and j=i+1 as control variables (S2000, S2010). A control variable i indicates one still image 621 and a control variable j indicates another still image 621 .

つぎに学習データ生成装置２００は、i番目の静止画像６２１とj番目の静止画像６２１のそれぞれのベクトルデータ６５１の差分のノルムを算出する（S2020）。具体的には各ベクトルデータ６５１のユークリッド距離を算出する。 Next, the learning data generation device 200 calculates the norm of the difference between the vector data 651 of the i-th still image 621 and that of the j-th still image 621 (S2020). Specifically, it calculates the Euclidean distance between the two vector data 651.

これらのノルム（ユークリッド距離）が閾値Ａ以下である場合には（S2030においてYES）、学習データ生成装置２００は、i番目の静止画像６２１がj番目の静止画像６２１と類似であると判定し、i番目の静止画像６２１を削除する（S2060）。 If these norms (Euclidean distance) are equal to or less than the threshold A (YES in S2030), the learning data generation device 200 determines that the i-th still image 621 is similar to the j-th still image 621, The i-th still image 621 is deleted (S2060).

一方、これらのノルムが閾値Ａ以下でなければ（S2030においてNO）、学習データ生成装置２００は、制御変数jに1を加えて（S2040）、i番目の静止画像６２１と次のj番目の静止画像６２１との間で同様の処理を行う（S2020、S2030）。 On the other hand, if the norm is not equal to or less than the threshold A (NO in S2030), the learning data generation device 200 adds 1 to the control variable j (S2040) and performs the same processing between the i-th still image 621 and the next j-th still image 621 (S2020, S2030).

ただし、S2040において制御変数jに1を加えた結果、jがMAXを超えた場合には、全ての静止画像６２１との比較を終えたので、学習データ生成装置２００は、iに1を加える（S2070）。 However, if j exceeds MAX as a result of adding 1 to the control variable j in S2040, the comparison with all the still images 621 has been completed, so the learning data generation device 200 adds 1 to i (S2070).

そして学習データ生成装置２００は、iがMAXを超えるまで（S2080）、i番目の静止画像６２１及びj番目の静止画像６２１のユークリッド距離と、閾値Ａと、の比較を行い、ユークリッド距離が閾値Ａよりも小さい場合にi番目の静止画像６２１を削除する処理を繰り返し行う。 Then, until i exceeds MAX (S2080), the learning data generation device 200 repeats the processing of comparing the Euclidean distance between the i-th still image 621 and the j-th still image 621 with the threshold A and deleting the i-th still image 621 when the Euclidean distance is smaller than the threshold A.
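The loop of FIG. 9 described above can be sketched as follows. This is a minimal illustration, not the device's actual implementation: the function name `remove_similar_sequential` is hypothetical, indices are 0-based rather than 1-based, and the inclusive comparison from S2030 (equal to or less than threshold A) is assumed.

```python
import math
from typing import List, Sequence


def remove_similar_sequential(
    vectors: Sequence[Sequence[float]], threshold_a: float
) -> List[int]:
    """Return the 0-based indices of the frames that are kept."""
    kept: List[int] = []
    n = len(vectors)
    for i in range(n):  # S2000, S2070, S2080: take each frame in turn
        redundant = False
        for j in range(i + 1, n):  # S2010, S2040: compare with later frames
            # S2020, S2030: norm of the difference against threshold A
            if math.dist(vectors[i], vectors[j]) <= threshold_a:
                redundant = True  # S2060: frame i is deleted
                break
        if not redundant:
            kept.append(i)
    return kept


if __name__ == "__main__":
    features = [[0.0], [1.0], [2.0], [10.0]]  # hypothetical feature vectors
    print(remove_similar_sequential(features, 1.0))  # [2, 3]
```

Since frame i is only ever compared with later frames, the last frame is always kept, and of a run of mutually similar frames the latest one survives.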

図８に戻って、このようにして学習データ生成装置２００は動画データ６２０から冗長な静止画像６３０を取り除くことで、学習データ６４０を生成する（S1050）。 Returning to FIG. 8, the learning data generation device 200 thus removes the redundant still image 630 from the video data 620 to generate the learning data 640 (S1050).

このような態様によって、画像認識モデル６１０の学習を行うための学習データ６４０を効率的に生成することができる。 With such an aspect, the learning data 640 for learning the image recognition model 610 can be efficiently generated.

なお学習データ生成装置２００は、類似画像削除処理を、図１１のフローチャートに示す様な手順で行うことも可能である。 Note that the learning data generation device 200 can also perform similar image deletion processing in accordance with the procedure shown in the flowchart of FIG. 11 .

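この場合、学習データ生成装置２００は、動画データ６２０内の２つの静止画像６２１の各組み合わせの中から、相違の程度を表す指標値が最小の第１の組み合わせ（図１３に示したαで示す組み合わせ）を特定した上で、さらに、第１の組み合わせを成す２つの静止画像６２１のうちの一つを含む他の組み合わせの中で指標値が最小となる第２の組み合わせ（図１３に示したβで示す組み合わせ）を特定し、第１の組み合わせ（α）と第２の組み合わせ（β）に共通する静止画像６２１を、冗長な静止画像６３０として取り除く処理を、指標値が所定の判定値以下となる組み合わせがなくなるまで繰り返し行う。 In this case, from among the combinations of two still images 621 in the moving image data 620, the learning data generation device 200 identifies a first combination with the smallest index value representing the degree of difference (the combination indicated by α in FIG. 13). It then identifies a second combination with the smallest index value among the other combinations that include one of the two still images 621 forming the first combination (the combination indicated by β in FIG. 13), and removes the still image 621 common to the first combination (α) and the second combination (β) as a redundant still image 630. It repeats this processing until no combination has an index value equal to or less than the predetermined determination value.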

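図１１において、学習データ生成装置２００は、まず、動画データ６２０内の静止画像６２１から２枚の静止画像６２１を選ぶ各組み合わせについて、各静止画像６２１のベクトルデータ６５１の差分のノルム（例えば各ベクトルデータ６５１のユークリッド距離）を計算する（S3000）。 In FIG. 11, the learning data generation device 200 first calculates, for each combination of two still images 621 selected from the still images 621 in the moving image data 620, the norm of the difference between the vector data 651 of the two still images 621 (for example, the Euclidean distance between the vector data 651) (S3000).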

そして学習データ生成装置２００は、ノルムが閾値Ａ以下となる組み合わせがない場合には（S3010においてNO）、処理を終了して図８のS1050に進み、学習データを出力する。 If there is no combination whose norm is equal to or less than the threshold A (NO in S3010), the learning data generation device 200 ends the processing, proceeds to S1050 in FIG. 8, and outputs the learning data.

一方、S3010においてノルムが閾値Ａ以下となる組み合わせがあった場合には、学習データ生成装置２００は、それらの組み合わせの中でノルムが最小の組み合わせ（上述した例でαで示した組み合わせ）を特定する（S3020）。 On the other hand, if there is a combination whose norm is equal to or less than the threshold A in S3010, the learning data generation device 200 identifies the combination with the smallest norm among those combinations (the combination indicated by α in the above example) (S3020).

次に学習データ生成装置２００は、この組み合わせ（α）を成す２つの静止画像６２１のうちの一つを含む他の組み合わせのうち、指標値が最小となる組み合わせ（上述したβで示した組み合わせ）を特定する（S3030）。 Next, the learning data generation device 200 selects the combination (the combination indicated by β described above) with the smallest index value among the other combinations including one of the two still images 621 forming the combination (α). is identified (S3030).

そして学習データ生成装置２００は、これらの組み合わせ（α、β）に共通する静止画像６２１を冗長な静止画像６３０として削除する（S3040）。 The learning data generation device 200 then deletes the still images 621 common to these combinations (α, β) as redundant still images 630 (S3040).

以下、学習データ生成装置２００は、ノルムが閾値Ａ以下となる組み合わせがなくなるまで（S3010においてNO）、S3020～S3040の処理を繰り返す。 Thereafter, the learning data generation device 200 repeats the processing of S3020 to S3040 until there is no combination whose norm is equal to or less than the threshold A (NO in S3010).

このような態様により、学習データ生成装置２００は、より適切に冗長な静止画像６３０を含まない学習データ６４０を生成することが可能となる。 With such an aspect, the learning data generation device 200 can more appropriately generate the learning data 640 that does not include the redundant still image 630 .
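The procedure of FIG. 11 can be sketched as below and checked against the example of FIGS. 12 and 13. Several things here are assumptions: the function name `remove_by_closest_pairs` is hypothetical, ties are broken by iteration order, and when only two frames remain (so no second combination exists) the first frame of the pair is removed. Only four index values come from the text (B-E 40, B-C 60, C-E 90, C-D 50) together with the determination value 100; the other six values in the table are made up so that the table is complete and consistent with the described outcome.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Sequence, Tuple


def remove_by_closest_pairs(
    labels: Sequence[str],
    index_value: Dict[FrozenSet[str], float],
    threshold_a: float,
) -> Tuple[List[str], List[str]]:
    """Return (kept frames, removed frames in removal order)."""
    remaining = list(labels)
    removed: List[str] = []

    def value(pair: Tuple[str, str]) -> float:
        return index_value[frozenset(pair)]

    while True:
        pairs = list(combinations(remaining, 2))
        close = [p for p in pairs if value(p) <= threshold_a]  # S3010
        if not close:
            break
        first = min(close, key=value)  # S3020: combination alpha
        others = [p for p in pairs if p != first and (first[0] in p or first[1] in p)]
        if others:
            second = min(others, key=value)  # S3030: combination beta
            common = first[0] if first[0] in second else first[1]
        else:
            common = first[0]  # assumption: only two frames are left
        remaining.remove(common)  # S3040
        removed.append(common)
    return remaining, removed


if __name__ == "__main__":
    table = {
        "BE": 40, "BC": 60, "CE": 90, "CD": 50,  # values stated in the text
        "AB": 110, "AC": 120, "AD": 150,         # hypothetical filler values
        "AE": 160, "BD": 70, "DE": 130,          # hypothetical filler values
    }
    values = {frozenset(k): float(v) for k, v in table.items()}
    print(remove_by_closest_pairs("ABCDE", values, 100.0))
    # (['A', 'D', 'E'], ['B', 'C'])
```

With these values the sketch removes B first and then C, leaving A, D, and E, which matches the result described for FIG. 13.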

[第２実施形態] [Second embodiment]
なお、学習データ生成装置２００は、図１４及び図１５に示すような態様で処理を行っても良い。 Note that the learning data generation device 200 may also perform processing in the manner shown in FIGS. 14 and 15.

本実施形態では、学習データ生成装置２００は、動画データ６２０に含まれる静止画像６２１を時系列順に複数のグループに分け、第１実施形態で説明した冗長な静止画像６３０を取り除く処理をグループ単位に行う。図１５に、動画データ６２０に含まれる静止画像６２１をＮ個のグループに分ける様子を示す。 In this embodiment, the learning data generation device 200 divides the still images 621 included in the moving image data 620 into a plurality of groups in chronological order, and performs the processing of removing the redundant still images 630 described in the first embodiment on each group. FIG. 15 shows how the still images 621 included in the moving image data 620 are divided into N groups.

そして学習データ生成装置２００は、グループ単位に冗長な静止画像６３０を取り除く処理を行うことにより中間データを生成した後に、この中間データの全体に対してさらに第１実施形態で説明した冗長な静止画像６３０を取り除く処理を行う。このようにして学習データ生成装置２００は学習データ６４０を生成する。 After generating intermediate data by removing the redundant still images 630 group by group, the learning data generation device 200 further applies the processing of removing the redundant still images 630 described in the first embodiment to the intermediate data as a whole. The learning data generation device 200 generates the learning data 640 in this manner.

このような態様により、動画データ６２０から２枚の静止画像６２１を選ぶ組み合わせの数を減らすことができるので、学習データ６４０を生成するための処理時間を短縮することが可能となる。 With this aspect, the number of combinations for selecting two still images 621 from the moving image data 620 can be reduced, so the processing time for generating the learning data 640 can be shortened.
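The reduction in the number of combinations can be made concrete with a short calculation. The figure of 9000 frames comes from the earlier 30 fps, five-minute example; the choice of N = 30 equal groups is purely an assumption for illustration, and the helper names are hypothetical.

```python
from math import comb
from typing import Sequence


def pair_count(n: int) -> int:
    """Number of ways to choose two frames out of n."""
    return comb(n, 2)


def grouped_pair_count(group_sizes: Sequence[int]) -> int:
    """Total number of within-group pairs when frames are split into groups."""
    return sum(pair_count(size) for size in group_sizes)


if __name__ == "__main__":
    print(pair_count(9000))                # 40495500 pairs without grouping
    print(grouped_pair_count([300] * 30))  # 1345500 pairs within 30 groups of 300
```

The second pass over the intermediate data adds further pairs, and how many depends on how many frames survive the per-group pass, so the overall saving varies with the content of the video.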

また各グループ内の静止画像６２１は、撮影されたタイミングが相互に時間的に近いため、類似である可能性が高い。そのため、本実施形態のように、一旦グループ内で各静止画像６２１の類似性を判断することで、効率よく冗長な静止画像６３０を取り除くことが可能となる。 In addition, the still images 621 in each group are likely to be similar because the timings at which they were shot are close to each other. Therefore, by once determining the similarity of each still image 621 within a group as in the present embodiment, redundant still images 630 can be removed efficiently.

図１４のフローチャートに沿って本実施形態に係る処理の流れを説明する。 The flow of processing according to this embodiment will be described along the flowchart of FIG. 14 .

まず学習データ生成装置２００は、画像認識の対象が撮影されている動画データ６２０を取得する（S4000）。 First, learning data generation device 200 acquires video data 620 in which an object for image recognition is captured (S4000).

そして学習データ生成装置２００は、動画データ６２０から、各フレームの静止画像６２１を抽出し（S4010）、各静止画像６２１をニューラルネットワークモデル６００に入力し、それぞれのベクトルデータ６５１を求める（S4020）。そして学習データ生成装置２００は、閾値Ａを求める（S4030）。以上の処理は、第１実施形態と同様である。 The learning data generation device 200 then extracts the still image 621 of each frame from the moving image data 620 (S4010), inputs each still image 621 to the neural network model 600, and obtains the respective vector data 651 (S4020). The learning data generation device 200 then obtains the threshold A (S4030). The above processing is the same as in the first embodiment.

学習データ生成装置２００は、各静止画像６２１を時系列順にＮ個のグループに分割する（S4040）。 The learning data generation device 200 divides each still image 621 into N groups in chronological order (S4040).

そして学習データ生成装置２００は、グループ単位に類似画像削除処理（冗長な静止画像６３０を取り除く処理）を行うことにより中間データを生成する（S4050）。 Then, the learning data generation device 200 generates intermediate data by performing similar image deletion processing (processing for removing redundant still images 630) on a group-by-group basis (S4050).

学習データ生成装置２００は、この中間データの全体に対してさらに類似画像削除処理を行う（S4060）。 The learning data generation device 200 further performs similar image deletion processing on the entire intermediate data (S4060).

そして学習データ生成装置２００は、学習データ６４０を生成する（S4070）。その後学習データ生成装置２００は、学習データ６４０をユーザ端末１００に送信する。 The learning data generation device 200 then generates the learning data 640 (S4070). After that, the learning data generation device 200 transmits the learning data 640 to the user terminal 100.

このような態様により、学習データ６４０を生成するための処理時間をさらに短縮することが可能となる。 Such an aspect makes it possible to further shorten the processing time for generating the learning data 640 .
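The two-stage flow of S4040 to S4060 might look roughly like the sketch below. It is illustrative only: the names `dedup`, `split_chronologically`, and `two_stage_dedup` are hypothetical, frames are represented as (frame id, feature vector) tuples, equal-sized chunks are assumed for the grouping, and the sequential deletion rule of the first embodiment is used for both passes.

```python
import math
from typing import List, Sequence, Tuple

Frame = Tuple[int, Sequence[float]]  # (frame id, feature vector)


def dedup(frames: Sequence[Frame], threshold_a: float) -> List[Frame]:
    """Drop a frame when some later frame lies within threshold_a of it."""
    kept: List[Frame] = []
    for i, (frame_id, vec) in enumerate(frames):
        if not any(math.dist(vec, other) <= threshold_a for _, other in frames[i + 1:]):
            kept.append((frame_id, vec))
    return kept


def split_chronologically(frames: Sequence[Frame], n_groups: int) -> List[List[Frame]]:
    """Split frames, already in time order, into at most n_groups consecutive chunks."""
    if not frames:
        return []
    size = math.ceil(len(frames) / n_groups)
    return [list(frames[k:k + size]) for k in range(0, len(frames), size)]


def two_stage_dedup(frames: Sequence[Frame], n_groups: int, threshold_a: float) -> List[Frame]:
    intermediate: List[Frame] = []
    for group in split_chronologically(frames, n_groups):  # S4040
        intermediate.extend(dedup(group, threshold_a))      # S4050
    return dedup(intermediate, threshold_a)                 # S4060


if __name__ == "__main__":
    frames = [(0, [0.0]), (1, [0.5]), (2, [5.0]), (3, [5.2]), (4, [0.2]), (5, [9.0])]
    print([fid for fid, _ in two_stage_dedup(frames, 2, 1.0)])  # [3, 4, 5]
```

In this toy input the per-group pass removes only frame 0, and the pass over the intermediate data then removes frames 1 and 2, whose near-duplicates sit in the other group.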

以上、学習データ６４０の生成方法、学習データ生成装置２００及びプログラムについて説明したが、上述した実施の形態は本発明の理解を容易にするためのものであり、本発明を限定して解釈するためのものではない。本発明はその趣旨を逸脱することなく変更、改良され得るとともに、本発明にはその等価物も含まれる。 The method of generating the learning data 640, the learning data generation device 200, and the program have been described above. The embodiments described above are intended to facilitate understanding of the present invention and are not to be construed as limiting it. The present invention may be modified and improved without departing from its spirit, and the present invention also includes equivalents thereof.

例えば上記実施形態では、学習データ生成装置２００が学習データ６４０を生成後、この学習データ６４０をユーザ端末１００に送信する場合を例示したが、画像認識モデル６１０を記憶している不図示のコンピュータに学習データ６４０を送信するようにしても良い。このような態様により、ユーザ端末１００が、画像認識モデル６１０を記憶している不図示のコンピュータに学習データ６４０を送信する手間を省くことができ、画像認識モデル６１０の学習を行う際の作業効率を向上させることが可能となる。 For example, the above embodiment illustrates the case where the learning data generation device 200 generates the learning data 640 and then transmits it to the user terminal 100, but the learning data 640 may instead be transmitted to a computer (not shown) that stores the image recognition model 610. This saves the user terminal 100 the trouble of transmitting the learning data 640 to the computer (not shown) that stores the image recognition model 610, and can improve work efficiency when training the image recognition model 610.

あるいは、学習データ生成装置２００が画像認識モデル６１０を記憶するようにしておき、学習データ生成装置２００が自ら画像認識モデル６１０の学習を行うようにしても良い。このような態様により、学習データ生成装置２００が学習データ６４０をユーザ端末１００や他のコンピュータに送信することが不要になるので、画像認識モデル６１０の学習を行う際の作業効率をさらに向上させることが可能となる。 Alternatively, the learning data generation device 200 may store the image recognition model 610 and train the image recognition model 610 itself. This eliminates the need for the learning data generation device 200 to transmit the learning data 640 to the user terminal 100 or another computer, so work efficiency when training the image recognition model 610 can be further improved.

また上記実施形態では、画像認識モデル６１０とは別に用意したニューラルネットワークモデル６００に、動画データ６２０内の各静止画像６２１を入力してベクトルデータ６５１を取得する場合を説明したが、ニューラルネットワークモデル６００を用いずに、画像認識モデル６１０に動画データ６２０の各静止画像６２１を入力し、画像認識モデル６１０の中間層からベクトルデータ６５１を取得するようにしても良い。 Further, the above embodiment describes the case where each still image 621 in the moving image data 620 is input to the neural network model 600, prepared separately from the image recognition model 610, to obtain the vector data 651. However, each still image 621 of the moving image data 620 may instead be input to the image recognition model 610 without using the neural network model 600, and the vector data 651 may be acquired from an intermediate layer of the image recognition model 610.

この場合、学習が未完了の状態の画像認識モデル６１０を用いてベクトルデータ６５１を取得し、このベクトルデータ６５１を用いて動画データ６２０から冗長な静止画像６３０を取り除くことになるが、機械学習に用いられる画像認識モデル６１０は、多くの場合、ある程度の精度で一般的な物体についての画像認識が可能な程度に学習済みの状態で配布されているため、このような画像認識モデル６１０を用いるようにすれば、ニューラルネットワークモデル６００を用いずにベクトルデータ６５１を取得することができる。このような態様により、ニューラルネットワークモデル６００を別途利用する場合に必要となる様々な設定作業等の手間が省けるので、学習データ６４０を作成する際の作業者の負担軽減を図ることが可能となる。 In this case, the vector data 651 is obtained using the image recognition model 610 whose learning has not been completed, and the redundant still image 630 is removed from the video data 620 using this vector data 651. In many cases, the image recognition model 610 used is distributed in a trained state to the extent that image recognition of general objects is possible with a certain degree of accuracy. , the vector data 651 can be obtained without using the neural network model 600 . With such an aspect, it is possible to reduce the burden on the operator when creating the learning data 640 because it saves the trouble of various setting work and the like that are necessary when using the neural network model 600 separately. .

なお、本実施形態における学習データの生成方法において、コンピュータが、前記学習データを生成する処理において、前記動画データに含まれる各静止画像の特徴量を求め、前記各組み合わせ毎に、前記組み合わせを成す２つの静止画像の前記特徴量の差分を前記指標値として求める、としてもよい。 In the learning data generation method according to the present embodiment, in the process of generating the learning data, the computer obtains the feature amount of each still image included in the moving image data, and determines the combination for each of the combinations. A difference between the feature amounts of the two still images may be obtained as the index value.

これによれば、冗長な静止画像をより的確に特定することが可能となる。 According to this, redundant still images can be specified more accurately.

また、本実施形態における学習データの生成方法において、前記画像認識用のモデルは、第１のニューラルネットワークモデルであり、前記コンピュータが、前記学習データを生成する処理において、前記第１のニューラルネットワークモデルと同じ種類の第２のニューラルネットワークモデルに前記動画データに含まれる各静止画像を入力し、前記第２のニューラルネットワークモデル内の中間層からの出力データを用いて前記特徴量を求める、としてもよい。 Further, in the learning data generation method according to the present embodiment, the model for image recognition may be a first neural network model, and in the process of generating the learning data, the computer may input each still image included in the moving image data to a second neural network model of the same type as the first neural network model and obtain the feature amount using output data from an intermediate layer in the second neural network model.

これによれば、モデルの特性に合った学習データを得ることが可能となる。 According to this, it is possible to obtain learning data that matches the characteristics of the model.

また、本実施形態における学習データの生成方法において、前記コンピュータが、前記学習データを生成する処理において、一の静止画像と他の静止画像との前記組み合わせの中に、前記指標値が所定の判定値以下となる組み合わせがある場合に、前記一の静止画像を前記冗長な静止画像として取り除く処理を、前記動画データに含まれる各静止画像を順に前記一の静止画像として繰り返し行うことにより、前記学習データを生成する、としてもよい。 Further, in the learning data generation method according to the present embodiment, in the process of generating the learning data, the computer may generate the learning data by repeating the following processing with each still image included in the moving image data taken in turn as the one still image: when the combinations of one still image and another still image include a combination in which the index value is equal to or less than a predetermined judgment value, the one still image is removed as the redundant still image.

これによれば、冗長な静止画像を含まない学習データを生成することが可能となる。 According to this, it is possible to generate learning data that does not include redundant still images.

また、本実施形態における学習データの生成方法において、前記コンピュータが、前記学習データを生成する処理において、前記各組み合わせの中から、前記指標値が最小の第１の組み合わせを特定した上で、さらに、前記第１の組み合わせを成す２つの静止画像のうちの一つを含む他の組み合わせの中で前記指標値が最小の第２の組み合わせを特定し、前記第１の組み合わせと前記第２の組み合わせに共通する静止画像を、前記冗長な静止画像として取り除く処理を、前記指標値が所定の判定値以下となる組み合わせがなくなるまで繰り返し行うことにより、前記学習データを生成する、としてもよい。 Further, in the method of generating learning data according to the present embodiment, in the process of generating the learning data, the computer identifies a first combination with the smallest index value from among the combinations, and further , identifying a second combination having the smallest index value among other combinations including one of the two still images forming the first combination, and determining the first combination and the second combination The learning data may be generated by repeatedly performing the processing of removing the still images common to the above as the redundant still images until there is no combination in which the index value is equal to or less than a predetermined judgment value.

これによれば、より適切に、冗長な静止画像を含まない学習データを生成することが可能となる。 According to this, it is possible to more appropriately generate learning data that does not include redundant still images.

また、本実施形態における学習データの生成方法において、前記コンピュータが、前記学習データを生成する処理において、前記動画データに含まれる静止画像を時系列順に複数のグループに分け、前記グループ単位に、前記冗長な静止画像を取り除く処理を行うことにより中間データを生成した後に、前記中間データの全体に対してさらに前記冗長な静止画像を取り除く処理を行うことにより、前記学習データを生成する、としてもよい。 Further, in the method of generating learning data according to the present embodiment, in the process of generating the learning data, the computer divides the still images included in the moving image data into a plurality of groups in chronological order, and for each group, the After intermediate data is generated by performing a process of removing redundant still images, the learning data may be generated by further performing a process of removing the redundant still images on the entire intermediate data. .

これによれば、学習データをより短時間に生成することが可能となる。 According to this, it becomes possible to generate learning data in a shorter time.

１００ユーザ端末 100 User terminal
１１０ＣＰＵ 110 CPU
１２０メモリ 120 Memory
１３０通信装置 130 Communication device
１４０記憶装置 140 Storage device
１５０入力装置 150 Input device
１６０出力装置 160 Output device
１７０記録媒体読取装置 170 Recording medium reading device
２００学習データ生成装置 200 Learning data generation device
２０１動画データ取得部 201 Moving image data acquisition unit
２０２学習データ生成部 202 Learning data generation unit
２１０ＣＰＵ 210 CPU
２２０メモリ 220 Memory
２３０通信装置 230 Communication device
２４０記憶装置 240 Storage device
２５０入力装置 250 Input device
２６０出力装置 260 Output device
２７０記録媒体読取装置 270 Recording medium reading device
５００ネットワーク 500 Network
６００ニューラルネットワークモデル 600 Neural network model
６１０画像認識モデル 610 Image recognition model
６２０動画データ 620 Moving image data
６２１静止画像 621 Still image
６３０冗長な静止画像 630 Redundant still image
６４０学習データ 640 Learning data
６５０特徴量 650 Feature quantity
６５１ベクトルデータ 651 Vector data
７１０ユーザ端末制御プログラム 710 User terminal control program
７２０学習データ生成装置制御プログラム 720 Learning data generation device control program
８００記録媒体 800 Recording medium
１０００情報システム 1000 Information system

Claims

A method of generating learning data for training a model for image recognition, the method comprising a computer executing:
a process of acquiring moving image data in which a target of the image recognition is captured; and
a process of generating the learning data by obtaining a feature amount of the still image of each frame included in the moving image data, obtaining, for each combination of two still images selected from the still images, the difference between the feature amounts of the two still images forming the combination as an index value representing the degree of difference between the two still images, and removing redundant still images from the still images included in the moving image data based on the index value.

The learning data generation method according to claim 1 ,
The model for image recognition is a first neural network model,
In the process of generating the learning data, the computer inputs each still image included in the moving image data to a second neural network model of the same type as the first neural network model, and the second neural network A learning data generation method for determining the feature amount using output data from an intermediate layer in a model.

The learning data generation method according to claim 1,
wherein, in the process of generating the learning data, the computer generates the learning data by repeatedly performing, with each still image included in the moving image data taken in turn as one still image, a process of removing the one still image as the redundant still image when the combinations of the one still image and another still image include a combination in which the index value is equal to or less than a predetermined judgment value.

The learning data generation method according to claim 1,
In the process of generating the learning data, the computer identifies, from among the combinations, a first combination with the smallest index value, and then further selects two still images that form the first combination. A second combination having the smallest index value among other combinations including one of the above is specified, and a still image common to the first combination and the second combination is defined as the redundant still image. A method of generating learning data, wherein the learning data is generated by repeatedly performing the removing process until there is no combination in which the index value is equal to or less than a predetermined judgment value.

The learning data generation method according to claim 1,
wherein, in the process of generating the learning data, the computer divides the still images included in the moving image data into a plurality of groups in chronological order, generates intermediate data by performing the process of removing the redundant still images on each group, and then generates the learning data by further performing the process of removing the redundant still images on the entire intermediate data.

A learning data generation device for generating learning data for training a model for image recognition, the device comprising:
a moving image data acquisition unit that acquires moving image data in which a target of the image recognition is captured; and
a learning data generation unit that generates the learning data by obtaining a feature amount of the still image of each frame included in the moving image data, obtaining, for each combination of two still images selected from the still images, the difference between the feature amounts of the two still images forming the combination as an index value representing the degree of difference between the two still images, and removing redundant still images from the still images included in the moving image data based on the index value.

A program for causing a computer that generates learning data for training a model for image recognition to execute:
a procedure for acquiring moving image data in which a target of the image recognition is captured; and
a procedure for generating the learning data by obtaining a feature amount of the still image of each frame included in the moving image data, obtaining, for each combination of two still images selected from the still images, the difference between the feature amounts of the two still images forming the combination as an index value representing the degree of difference between the two still images, and removing redundant still images from the still images included in the moving image data based on the index value.