JP7074723B2

JP7074723B2 - Learning device and program

Info

Publication number: JP7074723B2
Application number: JP2019124957A
Authority: JP
Inventors: 和之田坂; 広昌柳原
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2019-07-04
Filing date: 2019-07-04
Publication date: 2022-05-24
Anticipated expiration: 2039-07-04
Also published as: JP2021012446A

Description

The present invention relates to a learning device and a program that, using video captured while moving by a camera mounted on a moving means, can perform learning that improves the detection accuracy for distant targets in object detection.

In recent years, connected car technology has been advancing. A connected car is defined as an automobile that functions as an ICT (information and communication technology) terminal; it is always connected to a network and enables new services such as autonomous driving. A connected car can collect various kinds of information about its surroundings, and the information collected by many cars can be shared and used by a system on the network. One kind of information that can be collected is video captured by an in-vehicle camera. Patent Documents 1 to 3 are examples of prior art on object recognition for such in-vehicle camera video.

In Patent Document 1, the recognition conditions of a recognition dictionary used for object recognition are changed based on the light distribution state of the own vehicle's headlights, and object recognition is performed on video of the area ahead of the own vehicle based on the changed dictionary. Patent Document 2 discloses an object recognition device that, when the driving environment changes abruptly, can use another detection means to detect an object that is difficult to recognize with a camera. Specifically, it detects the disappearance of an object that was detected in the previous frame, emits a laser in the direction of that object, and outputs distance information for the object. In Patent Document 3, captured objects are authenticated using the captured image and an image model; when an object cannot be authenticated, images of the unrecognized object are searched for and acquired via a network, image data is generated from the acquired images, and the object name is recognized based on the generated image data.

JP-A-2017-165345; JP-A-2015-172934; JP-A-2018-169746

However, the prior art has not considered the following approach, which improves object recognition accuracy by exploiting the characteristics of in-vehicle camera video.

In general, object recognition from an image uses the parameters of a model trained in advance. The closer the camera is to an object, the larger and more clearly the object appears in the image, so recognition accuracy is high at close range and low at long range. Consider the characteristics of in-vehicle camera video: because objects around the vehicle are captured while the vehicle travels along the road, the video usually shows the same object continuously, from when it is close to the camera to when it is far away. Even if an object is recognized successfully at close range but recognition fails in the temporally continuous part of the video where the object is far away, it can still be inferred that an object to be recognized very likely exists in the distance and appears in the image. A connected car system can also collect such video in large quantities.

An approach that improves the recognition accuracy (detection accuracy) for distant objects by training object recognition on in-vehicle camera video with these characteristics is therefore conceivable, but the prior art has not taken such an approach. The same applies not only to in-vehicle camera video but also to any video with similar characteristics, that is, video captured while moving by a camera installed on a moving means such as a robot or a drone.

In view of the above problem of the prior art, an object of the present invention is to provide a learning device and a program that, using video captured while moving by a camera mounted on a moving means, can perform learning that improves the detection accuracy for distant targets in object recognition.

To achieve the above object, the present invention is a learning device comprising: a preparation unit that applies a predetermined object recognition method, under an existing model, to each frame of video captured while moving by a camera mounted on a moving means, thereby detecting a detection success section in which a predetermined target is determined to be detected continuously and a detection failure section that is adjacent to the success section and in which the predetermined target is determined to be captured but continuously not detected, and that prepares a success image set in which each frame of the detection success section is associated with the detected region of the predetermined target and a failure image set in which each frame of the detection failure section is associated with an estimated region where the predetermined target should have been detected; and a learning unit that trains and evaluates a model of the predetermined object recognition method using, as ground truth, image sets selected from the success image set and the failure image set, thereby obtaining a model that improves on the existing model. The present invention is also a program that causes a computer to function as the learning device.

According to the present invention, each frame of the detection failure section is associated with the region where the predetermined target should have been detected, so images in which the target is distant and could not be detected by the existing model are also used as ground truth for training. A model that improves on the existing model can thereby be obtained.

FIG. 1 is a functional block diagram of a learning device and a detection device according to an embodiment.
FIG. 2 is a schematic diagram of the system configuration of connected cars.
FIG. 3 is a functional block diagram of a preparation unit according to an embodiment.
FIG. 4 is a diagram showing a schematic example of the learning data prepared by the preparation unit.
FIG. 5 is a flowchart of learning by a learning unit according to an embodiment.
FIG. 6 is a diagram showing, in correspondence with the example of FIG. 4, a schematic example in which the model and the learning data are updated by repeating the flow of FIG. 5.
FIG. 7 is a diagram showing an example of the hardware configuration of a general computer device.

FIG. 1 is a functional block diagram of a learning device and a detection device according to an embodiment. The learning device 1 includes a preparation unit 2 and a learning unit 3, and the detection device 10 includes a detection processing unit 11. Their overall processing is as follows. In the learning device 1, the preparation unit 2 reads video acquired by an in-vehicle camera and uses it to prepare learning data for a target (for example, road signs) whose type is specified, for instance manually by an administrator. The learning unit 3 then trains on this learning data to obtain an improved model from an existing model whose recognition accuracy for distant targets is low. In the detection device 10, the detection processing unit 11 performs image recognition with this improved model, so that targets in an image are detected with better accuracy at long range than with the existing model. The processing of the detection processing unit 11 is the same as that of the detection unit 21 in FIG. 3, described later. (The detection unit 21 and the detection processing unit 11 differ only in the model they use and perform the same detection processing on the input video or image.)

FIG. 2 is a schematic diagram of a connected car system. The system SY contains multiple connected cars CC1, CC2, and CC3 (three here as an example, but in general any number). Each carries an in-vehicle camera C1, C2, or C3 and acquires video while driving; the video is transmitted over the network NW and collected on the server SV. The learning device 1 can obtain learning data from the video collected on the server SV in this way, and the server SV itself may function as the learning device 1. The learning device 1 is not limited to acquiring in-vehicle camera video in this manner. For example, instead of being transmitted over the network NW, the video may be recorded on site on a physical storage medium (for example, a semiconductor memory such as an SSD or an optical disc such as a DVD) and then read from that medium through manual work.

The preparation unit 2 and the learning unit 3 are described below as the details of the processing by the learning device 1.

FIG. 3 is a functional block diagram of the preparation unit 2 according to an embodiment. The preparation unit 2 includes a detection unit 21, an identification unit 22, and an estimation unit 23. The detection unit 21 reads in-vehicle camera video, which can be acquired in any of the ways described with reference to FIG. 2, detects in each frame a target of a predetermined type specified by an administrator or the like, associates each frame with the region occupied by the target as the detection result, and outputs the result to the identification unit 22. A frame in which the target was not detected is also output to the identification unit 22, with the detection result recording that no detected region exists (that is, that detection failed).

When detecting the specified target in each frame, the detection unit 21 uses an existing, already trained model (trained parameters) of an object recognition method. Any existing object recognition method may be used, such as machine-learning-based recognition using an SVM (support vector machine) or deep-learning-based recognition using various CNNs (convolutional neural networks), for example SSD, YOLOv3, or Mask R-CNN. The detection unit 21 detects the target in each frame using the trained parameters (for deep-learning-based recognition, the weights of each network layer and the like). The type of target that the administrator or the like specifies to the detection unit 21 is selected from the types that the object recognition method has already been trained to detect and for which trained parameters exist.

Given the type of object to be recognized, an object recognition method yields, from an image, the region occupied by the object (for example, a rectangular bounding box) and a score indicating how likely the region is to be that object. The detection unit 21 applies the object recognition method to each frame. If the score is at or above a decision threshold, detection is treated as successful and the region is associated with the frame; if the score is below the threshold, detection is treated as failed and no region is associated with the frame.
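The per-frame decision described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `FrameResult` record, the `label_frames` function, the `(box, score)` input format, and the default threshold of 0.5 are all assumptions introduced here, and the detector itself is left outside the sketch.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class FrameResult:
    t: int                # frame number, starting at 1 as in the description
    box: Optional[Box]    # detected region, or None when detection failed


def label_frames(raw: List[Tuple[Optional[Box], float]],
                 score_threshold: float = 0.5) -> List[FrameResult]:
    """Turn raw (box, score) detector outputs into per-frame detection results.

    A frame counts as a successful detection only when a box exists and its
    score is at or above the threshold; otherwise no region is attached.
    """
    results = []
    for t, (box, score) in enumerate(raw, start=1):
        success = box is not None and score >= score_threshold
        results.append(FrameResult(t=t, box=box if success else None))
    return results
```

Frames whose `box` is `None` are the ones the identification unit would later treat as detection failures.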

From the video in which each frame carries the detection result of the detection unit 21, the identification unit 22 identifies a detection success section, in which target detection is determined to succeed continuously over frame time, and a detection failure section, which is adjacent to the success section and in which target detection is determined to fail continuously over frame time. It outputs the identified detection success section and detection failure section to the estimation unit 23. As data, the output sections are video whose frames are each associated with the target region information produced by the detection unit 21.

The identification unit 22 can identify the detection success section and the detection failure section as follows. First, in the video obtained from the detection unit 21, a section in which the target is determined to be detected continuously for at least a certain time (for example, 5 seconds) is identified as a detection success section. Because sudden noise or the like may cause the target to go undetected for a short time, an undetected portion no longer than a predetermined short threshold may be treated as missing data within a single detection success section and deleted, so that it is not used by the estimation unit 23 and the learning unit 3 downstream of the identification unit 22.

For example, suppose that a section with at least 5 seconds of continuous successful detection is a detection success section and that a non-detection (detection failure) of 0.2 seconds or less is treated as missing data. If there is a first section with 2.9 seconds of continuous successful detection, followed by a second section with 0.1 seconds of failed detection, followed by a third section with 3 seconds of continuous successful detection, the first and third sections may be treated as one detection success section (total length 2.9 s + 3 s = 5.9 s, which meets the 5-second threshold), and the second section, in which detection fails only briefly (0.1 s, within the 0.2-second threshold), may be deleted as missing data.
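The gap-tolerant identification in the example above can be sketched as below. This is an illustrative sketch under stated assumptions: the function name `find_success_sections`, the use of frame counts instead of seconds, and the 0-based frame indices are choices made here, not taken from the source. At a hypothetical 10 frames per second, the 5-second and 0.2-second thresholds become 50 and 2 frames.

```python
from typing import List, Tuple


def find_success_sections(detected: List[bool], min_len: int,
                          max_gap: int) -> List[Tuple[int, int, List[int]]]:
    """Return (start, end, dropped) for each detection success section.

    start and end are inclusive frame indices; dropped lists the briefly
    undetected frames inside the section that are deleted as missing data.
    """
    # 1. Collect maximal runs of consecutive successful frames.
    runs = []
    start = None
    for i, ok in enumerate(detected):
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(detected) - 1))

    # 2. Merge runs separated by a failure gap of at most max_gap frames.
    merged = []
    for s, e in runs:
        if merged and s - merged[-1][1] - 1 <= max_gap:
            prev_s, prev_e, dropped = merged[-1]
            merged[-1] = (prev_s, e, dropped + list(range(prev_e + 1, s)))
        else:
            merged.append((s, e, []))

    # 3. Keep sections whose successfully detected frame count meets min_len
    #    (the gap itself is not counted, matching 2.9 s + 3 s = 5.9 s).
    return [(s, e, d) for s, e, d in merged if (e - s + 1) - len(d) >= min_len]
```

With 29 successful frames, 1 failed frame, and 30 successful frames, the call `find_success_sections([True] * 29 + [False] + [True] * 30, min_len=50, max_gap=2)` yields a single section spanning frames 0 to 59 with frame 29 dropped.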

In judging the continuity between frames inside a section when identifying a detection success section, an additional condition may be imposed: the movement of the region detected by the detection unit 21 between temporally adjacent frames must be at or below a threshold. This may be judged from the displacement of a representative position of the region (such as its centroid), or from whether the overlap between the detected regions in adjacent frames is at or above a certain value.
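Both continuity criteria can be sketched in a few lines. The source does not say how overlap is measured; intersection over union (IoU) is used here as one common choice and is an assumption, as are the function names and the `(x_min, y_min, x_max, y_max)` box format.

```python
import math
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def center(box: Box) -> Tuple[float, float]:
    """Centre of the box, used as its representative position."""
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def is_continuous(prev: Box, curr: Box,
                  max_shift: Optional[float] = None,
                  min_iou: Optional[float] = None) -> bool:
    """Check adjacent-frame continuity with whichever criteria are supplied."""
    if max_shift is not None:
        (px, py), (cx, cy) = center(prev), center(curr)
        if math.hypot(cx - px, cy - py) > max_shift:
            return False
    if min_iou is not None and iou(prev, curr) < min_iou:
        return False
    return True
```

Passing only `max_shift` applies the centroid criterion, passing only `min_iou` applies the overlap criterion, and passing both requires the pair of boxes to satisfy each.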

After identifying the detection success section as described above, the identification unit 22 identifies a section of predetermined length adjacent to it as the detection failure section. For the explanation, let the frame time of the video be t (t = 1, 2, 3, ...), let F(t) be the frame at time t, and suppose that the section t1 ≤ t ≤ t2 has been identified as the detection success section R1 = {F(t) | t1 ≤ t ≤ t2}. (In the following examples as well, frame times are treated as integers, that is, as frame numbers.) For this detection success section R1, the identification unit 22 identifies, as the corresponding detection failure section R2, either the section R2_[future] of predetermined length L adjacent on the future side or the section R2_[past] of predetermined length L adjacent on the past side. By the above definitions, R2_[future] and R2_[past] can be written as sets of video frames as follows.
R2_[future] = {F(t) | t2+1 ≤ t ≤ t2+1+L} …(1)
R2_[past] = {F(t) | t1-1-L ≤ t ≤ t1-1} …(2)
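Equations (1) and (2) translate directly into frame-number ranges. The sketch below is illustrative only: the function name `failure_section`, the `side` strings, and the choice to return `None` when the range would fall outside the video are assumptions made here.

```python
from typing import List, Optional


def failure_section(t1: int, t2: int, L: int, side: str,
                    num_frames: int) -> Optional[List[int]]:
    """Frame numbers of the candidate detection failure section R2.

    The bounds are copied from equations (1) and (2). Frame numbers run from
    1 to num_frames; None is returned if the range leaves the video.
    """
    if side == "future":
        lo, hi = t2 + 1, t2 + 1 + L      # equation (1)
    elif side == "past":
        lo, hi = t1 - 1 - L, t1 - 1      # equation (2)
    else:
        raise ValueError("side must be 'future' or 'past'")
    if lo < 1 or hi > num_frames:
        return None
    return list(range(lo, hi + 1))
```

For a success section with t1 = 100 and t2 = 200 and L = 10, the past-side candidate covers frames 89 to 99 and the future-side candidate covers frames 201 to 211.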

On the detection success section R1 = {F(t) | t1 ≤ t ≤ t2}, the identification unit 22 examines how the size S(t) of the region in which the target is detected at each time t (its area, or its height or width, for example) changes over time. If S(t) increases as time t advances, the past-side section R2_[past] is identified as the detection failure section R2; conversely, if S(t) decreases as time t advances, the future-side section R2_[future] is identified as the detection failure section R2.

This identification can be summarized in a single rule: of the two ends t1 and t2 of the detection success section R1 on the time axis, the detection failure section R2 is taken to lie adjacent to the end toward which the size S(t) of the detected region in frame F(t) is determined to decrease within R1.
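The rule above needs some way to decide whether S(t) is increasing or decreasing, which the source leaves open. One possible sketch, assuming area as the size measure and the sign of a least-squares slope as the trend test (both assumptions, as are the function names):

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def area(box: Box) -> float:
    """Size S(t), taken here as the area of the detected region."""
    return (box[2] - box[0]) * (box[3] - box[1])


def trend_slope(values: List[float]) -> float:
    """Least-squares slope of values against their index 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two frames to judge a trend")
    mean_i = (n - 1) / 2
    mean_v = sum(values) / n
    num = sum((i - mean_i) * (v - mean_v) for i, v in enumerate(values))
    den = sum((i - mean_i) ** 2 for i in range(n))
    return num / den


def failure_side(boxes: List[Box]) -> Optional[str]:
    """Side of R1 on which the failure section R2 lies.

    Growing size means the target was farther away earlier ('past');
    shrinking size means it gets farther away later ('future').
    """
    slope = trend_slope([area(b) for b in boxes])
    if slope > 0:
        return "past"
    if slope < 0:
        return "future"
    return None  # no clear trend; the rule does not apply
```

A more robust variant might require the slope to exceed a small margin before committing to a side, so that noisy but essentially constant sizes are not misclassified.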

As typical capture conditions for in-vehicle camera video, the camera is installed facing either the front or the rear of the vehicle, and the vehicle is moving either forward or backward along the road, giving 2 × 2 = 4 cases. In every case, the above method can estimate the detection failure section R2, in which the target is captured from far away, corresponding to the detection success section R1, in which the target is captured at close range.

For example, when the in-vehicle camera is installed to capture the area ahead of the vehicle and the vehicle is moving forward along the road, a target that is in front of the camera, captured by it, and stationary near the road (such as a road sign) is always captured as approaching the camera over time. Once it comes closer than a certain distance, it leaves the capture range (the camera's angle of view), passes behind the vehicle, and is no longer visible to the camera. Meanwhile, the target appears larger as time advances and it approaches the camera. In this case, relative to the detection success section R1 with t1 ≤ t ≤ t2, the past side t < t1 is the situation where the target is ahead of the vehicle and captured but too far away to be detected; time t = t1 is the situation where detection first succeeds because the target has come close enough to be captured at a sufficient size; and the later time t = t2 is the situation just before the target, having come sufficiently close, leaves the camera's capture range. The increasing behavior of the size S(t) therefore determines the past-side section R2_[past] as the detection failure section R2.

In other situations as well, the detection failure section can be placed on the past side or the future side of the detection success section based on the increase or decrease of the size S(t). The same holds when the target is not stationary like a road sign but is moving relative to the own vehicle at a roughly constant relative speed, such as another vehicle in motion (that is, not a situation in which the two vehicles frequently overtake each other).

When the capture conditions of the in-vehicle camera are known (for example, a front camera capturing while the vehicle moves forward), the detection failure section may be placed on the past side or the future side of the detection success section in the same way as if the increase/decrease judgment of the size S(t) had been made, without actually making that judgment.

The identification unit 22 identifies that the detection failure section R2 lies on either the past side or the future side of the detection success section R1. If, at that location, there is no group of frames F(t) in which target detection fails continuously for at least the fixed length L as defined by equation (1) or (2), the pair of detection success section R1 and detection failure section R2 may be treated as not detected. (In that case, a detection success section R1 and detection failure section R2 that were detected as a pair elsewhere in the video are processed by the estimation unit 23 downstream of the identification unit 22.) In judging whether a group of frames F(t) with continuously failed detection over the fixed length L exists, the same technique as above, in which brief detection failures inside a detection success section are treated as missing data, may be applied. That is, if detection fails more or less continuously but succeeds abruptly for a short time in between, those frames may be deleted as missing data within a continuous detection failure section.

The estimation unit 23 takes the detection success section and detection failure section obtained by the identification unit 22, which correspond to each other by being adjacent on the time axis. Using the target region information detected by the detection unit 21 for each frame of the detection success section, it estimates, for each frame of the detection failure section, the region where the target should have been detected and associates it with that frame. The resulting learning data is output to the learning unit 3. That is, the output learning data consists of mutually corresponding detection success and detection failure sections: each frame of the detection success section is associated with the target region detected by the detection unit 21, and each frame of the detection failure section is associated with the region estimated by the estimation unit 23 as where the target should have been detected (a region that the detection unit 21, using the existing model, failed to detect because the target was distant).

The estimation unit 23 estimates the region where the target should have been detected in each frame of the detection failure section as follows. As an example, let the detection success section be the section R1 = {F(t) | t1 ≤ t ≤ t2} described above, and let the target region detected in each frame F(t) be a rectangle B1(t) enclosing the target. Suppose the detection failure section is the past-side section of length L, R2_[past] = {F(t) | t1-1-L ≤ t ≤ t1-1}, and let the region that should have been detected in each of its frames be a rectangle B2(t). (The explanation uses the past side, but the rectangle B2(t) can be estimated in the same way when the detection failure section is the future-side section R2_[future].)

In one embodiment, a predetermined function of time is fitted to the time series of the position and size, within the image frame F(t), of the future-side rectangles B1(t) (t1 ≤ t ≤ t2), for example a straight line as a linear function. The rectangle B2(t) may then be estimated by using the fitted function to predict its position and size over the past-side range t1-1-L ≤ t ≤ t1-1.
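A minimal sketch of the straight-line variant of this fitting follows. The parameterization of each rectangle by centre, width, and height, the independent least-squares fit per parameter, and the clamp to a minimum size are assumptions made for illustration; `fit_line` and `extrapolate_boxes` are hypothetical names.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def fit_line(ts: Sequence[float], vs: Sequence[float]) -> Tuple[float, float]:
    """Least-squares line v = slope * t + intercept (needs two distinct t)."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_v = sum(vs) / n
    den = sum((t - mean_t) ** 2 for t in ts)
    slope = sum((t - mean_t) * (v - mean_v) for t, v in zip(ts, vs)) / den
    return slope, mean_v - slope * mean_t


def extrapolate_boxes(ts: Sequence[int], boxes: List[Box],
                      target_ts: Sequence[int],
                      min_size: float = 1.0) -> Dict[int, Box]:
    """Fit B1(t) on the success section and predict B2(t) at target_ts."""
    series = {
        "cx": [(b[0] + b[2]) / 2 for b in boxes],
        "cy": [(b[1] + b[3]) / 2 for b in boxes],
        "w": [b[2] - b[0] for b in boxes],
        "h": [b[3] - b[1] for b in boxes],
    }
    coef = {name: fit_line(ts, vals) for name, vals in series.items()}
    predicted = {}
    for t in target_ts:
        cx, cy, w, h = (coef[k][0] * t + coef[k][1] for k in ("cx", "cy", "w", "h"))
        # Assumption: a linear size model can go negative far from R1, so clamp.
        w, h = max(w, min_size), max(h, min_size)
        predicted[t] = (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
    return predicted
```

Apparent size under perspective is not linear in time, so a straight line is only a local approximation; the source allows any predetermined function of time, and a different function may fit better over longer failure sections.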

In one embodiment, the in-vehicle camera video read by the detection unit 21 is assumed to carry, as additional information associated in advance, the camera position C(t) at each capture time t (three-dimensional coordinates C(t) in real-world world coordinates, including the direction of the camera's optical axis), and the rectangle B2(t) may be estimated using this camera position information C(t). To do so, conversions among world coordinates (X, Y, Z), camera coordinates (x, y, z), and image coordinates (u, v) are computed as described below. The camera parameters for these coordinate conversions are assumed to be known. The camera position C(t) can be acquired on the vehicle side, with the positioning time matched to the frame time of the video, using any existing positioning method such as GPS (Global Positioning System) together with an orientation sensor that measures the direction of the camera's optical axis.

First, for the future-side rectangle B1(t) (t1≤t≤t2), the rectangular range b1(t) that the rectangle B1(t) occupies in the three-dimensional camera coordinate system is obtained by finding point correspondences inside the rectangle between the adjacent frame images F(t) and F(t+1) and then applying an existing method such as stereo parallax. (Here, the target is assumed to have a planar shape, as a road sign does, and the rectangular range b1(t) is obtained as the range enclosing this planar shape.)

Next, on the premise that the target is a road sign or the like and is stationary in the real world, the rectangular range b that it occupies in world coordinates (a constant range b that does not depend on the time t) is obtained. Specifically, using the camera position C(t) (world coordinates), the camera-coordinate rectangular range b1(t), which was obtained relative to that camera position, is converted into a world-coordinate rectangle b1'(t), and the world-coordinate rectangular range b is obtained as the average of b1'(t) over t1≤t≤t2. (Ideally the rectangle b1'(t) would have a fixed position and size and would not vary with time, but because of noise and the like it is obtained as a time average.)

Finally, at each time t in the past-side detection failure section R2_[past]={F(t)|t1-1-L≤t≤t1-1}, the world-coordinate rectangular range b (the stationary position and extent of the target) is converted, using the camera position C(t) (world coordinates), into a rectangular range b2(t) in camera coordinates, which is further converted into a range within the frame image. The rectangle B2(t) at the estimated position is obtained as the rectangle that encloses this converted range within the frame image.
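The last two steps (time-averaging the world-coordinate rectangle, then projecting it into a past frame) can be sketched as below. The text only states that the camera parameters are known, so the pinhole model, the world-to-camera rotation matrix `rot`, the intrinsics `fx, fy, cx, cy`, and all function names here are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]


def average_corners(per_frame: Sequence[Sequence[Point3]]) -> List[Point3]:
    """Time-average the world-coordinate corners b1'(t) to get the static range b."""
    n = len(per_frame)
    return [
        tuple(sum(frame[k][axis] for frame in per_frame) / n for axis in range(3))
        for k in range(len(per_frame[0]))
    ]


def world_to_camera(p: Point3, cam_pos: Point3, rot) -> Point3:
    """World -> camera coordinates: rotate the offset from the camera position.
    `rot` is an assumed 3x3 world-to-camera rotation derived from C(t)."""
    d = [p[a] - cam_pos[a] for a in range(3)]
    return tuple(sum(rot[r][c] * d[c] for c in range(3)) for r in range(3))


def project_rectangle(corners_world, cam_pos, rot, fx, fy, cx, cy):
    """Project the corners of b into the image and return the enclosing
    rectangle B2(t) as (u_min, v_min, u_max, v_max), using a pinhole model."""
    us, vs = [], []
    for p in corners_world:
        x, y, z = world_to_camera(p, cam_pos, rot)
        if z <= 0:
            raise ValueError("target is behind the camera")
        us.append(fx * x / z + cx)
        vs.append(fy * y / z + cy)
    return (min(us), min(vs), max(us), max(vs))


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
sign = [(-0.5, -0.5, 20.0), (0.5, -0.5, 20.0), (0.5, 0.5, 20.0), (-0.5, 0.5, 20.0)]
near = project_rectangle(sign, (0.0, 0.0, 0.0), identity, 1000, 1000, 640, 360)
far = project_rectangle(sign, (0.0, 0.0, -20.0), identity, 1000, 1000, 640, 360)
```

In this toy setup the camera placed 20 units farther back yields a rectangle half as wide, which mirrors the smaller appearance of the sign in the earlier (past-side) frames.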

FIG. 4 is a diagram showing a schematic example of the learning data prepared by the preparation unit 2 as described above. In the example of FIG. 4, the learning data SD(0) consists of a detection success section R1 made up of the 16 frames F(17) to F(32) at times t=17 to 32, and the corresponding past-side detection failure section R2_[past] made up of the 16 frames F(1) to F(16) at times t=1 to 16.

In the learning data SD(0), each frame F(t) (t=17 to 32) of the detection success section R1 is associated with the region B1(t) successfully detected by the detection unit 21, and each frame F(t) (t=1 to 16) of the detection failure section R2_[past] is associated with the region B2(t) estimated by the estimation unit 23. As a partial example out of all times t=1 to 32, FIG. 4 shows, for t=11, 19, 29, the estimated region B2(11) on frame F(11), the detected region B1(19) on frame F(19), and the detected region B1(29) on frame F(29), drawn as schematic illustrations of a road-sign region in road video. These illustrations show the road sign appearing closer to the camera as the time t advances. The example learning data SD(0) of FIG. 4 is referred to below, where appropriate, as a common explanatory example.

The learning unit 3 performs learning using the learning data prepared by the preparation unit 2 as described above, and thereby obtains a model that improves on the existing model used by the detection unit 21. The improvement is that the new model can detect targets located farther away than the existing model can.

FIG. 5 is a flowchart of learning by the learning unit 3 according to one embodiment. In step S0, the existing model used by the detection unit 21 is set as the initial value M(0) of the learning parameters (model), the learning data obtained from the preparation unit 2 is set as the initial value SD(0) of the learning data, and the process proceeds to step S1.

As shown in FIG. 5, the flow contains a loop over steps S1 to S6 (the large loop) and, nested inside it, a loop over steps S1 to S3 (the small loop). In the description of each step below, the current iteration count of the large loop is referred to as i (i=1, 2, 3, ...). When step S1 is first reached from step S0, this count corresponds to the first iteration, that is, i=1. The count i is incremented by 1 to the next value i+1 each time the flow returns from step S6 to step S1.

In step S1, a training image set, a verification image set, and an evaluation image set are selected from the current learning data SD(i-1) (at iteration i), and the process proceeds to step S2. The selection may be made at random, subject to the following conditions (1) and (2).

Condition (1): The training image set, the verification image set, and the evaluation image set are selected from the images of the learning data SD(i-1) as sets of predetermined sizes n1, n2, and n3, respectively, so that they do not overlap one another; that is, no image belongs to two or more of the training image set, the verification image set, and the evaluation image set.

Condition (2): The training image set, the verification image set, and the evaluation image set are each selected so that a fixed proportion r (0<r<1) consists of failure images and the remaining proportion (1-r) consists of success images. Since the learning data SD(i-1) consists of the success image set D1(i-1) and the failure image set D2(i-1), the three sets may be obtained by randomly selecting (1-r)*n1, (1-r)*n2, and (1-r)*n3 success images from the former and r*n1, r*n2, and r*n3 failure images from the latter. (The selection is random, but in accordance with condition (1), images already selected are excluded from subsequent selection.) Instead of sharing a fixed proportion r, the training image set, the verification image set, and the evaluation image set may be selected so as to contain failure images at proportions r1, r2, and r3, respectively, which need not be equal to one another.

For the learning data SD(0) corresponding to iteration i=1 (the first iteration) of FIG. 5, the success image set D1(0) and the failure image set D2(0) are the image set of the detection success section and the image set of the detection failure section, respectively, and these are also set as initial values in step S0.
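The selection in step S1 under conditions (1) and (2) can be sketched as follows. The function name `select_image_sets` is hypothetical, and rounding r*n to the nearest integer is an assumption, since the text does not say how non-integer counts are handled.

```python
import random
from typing import Iterable, List, Sequence


def select_image_sets(success: Iterable[int], failure: Iterable[int],
                      sizes: Sequence[int], r: float,
                      rng: random.Random) -> List[List[int]]:
    """Draw disjoint training / verification / evaluation sets of sizes
    (n1, n2, n3), each with a proportion r of failure images."""
    success_pool = list(success)
    failure_pool = list(failure)
    rng.shuffle(success_pool)
    rng.shuffle(failure_pool)
    sets = []
    for n in sizes:
        n_fail = round(r * n)      # assumption: round to the nearest integer
        n_succ = n - n_fail
        if n_fail > len(failure_pool) or n_succ > len(success_pool):
            raise ValueError("not enough images left to satisfy conditions (1) and (2)")
        # pop() removes each chosen image from its pool, which enforces condition (1).
        chosen = [failure_pool.pop() for _ in range(n_fail)]
        chosen += [success_pool.pop() for _ in range(n_succ)]
        sets.append(chosen)
    return sets


# SD(0) of FIG. 4: frames 17..32 are success images, frames 1..16 failure images.
train_set, verif_set, eval_set = select_image_sets(
    range(17, 33), range(1, 17), sizes=(8, 4, 4), r=0.5, rng=random.Random(0))
```

Supporting per-set proportions r1, r2, r3 would only require passing a sequence of ratios alongside `sizes`.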

In step S2, learning is performed using the training image set and the verification image set selected in step S1, the model obtained by this learning is evaluated using the evaluation image set selected in step S1, and the process proceeds to step S3. When the model is a deep-learning model such as a CNN, the learning and evaluation may be performed as follows.

(Learning): Learning is performed by an existing method. That is, one epoch consists of adjusting (training) the weights of each layer, which are the network parameters, a predetermined number of times by a gradient method or the like using the training image set, and then verifying the result by running detection on the verification image set in the same manner as the detection unit 21, with the obtained parameters used as the model. This one-epoch process is repeated a predetermined number of times, or until the verification result (detection accuracy) on the verification image set is judged to have converged, and the parameters finally obtained are taken as the model obtained by learning.

The learning method above is an existing one, but in this embodiment it is applied with the following particular treatment. Both the training image set and the verification image set contain success images and failure images. For a success image, as usual, the region in the associated frame where the target was detected is regarded as the detection result for that target and is treated as the ground truth during learning. For a failure image, the region estimated by the estimation unit 23 is regarded as the detection result for the target and is treated as the ground truth during learning.

The training image set and the verification image set together constitute the learning image set used for learning (training and verification).

(Evaluation): The learned parameters (model) are evaluated using the evaluation image set, with success images and failure images handled in the same way as in the training and verification image sets above. That is, for success images in the evaluation image set the target detection region in the associated frame is taken as the ground truth, and for failure images the region estimated by the estimation unit 23 is taken as the ground truth. Detection is then run on the evaluation image set with the learned parameters in the same manner as the detection unit 21, and the detection accuracy is evaluated.

In step S3, the detection accuracy of the model evaluated in step S2 is compared with the detection accuracy of the current model M(i-1) (at iteration i). If the accuracy has improved, the process proceeds to step S4; otherwise it returns to step S1. The detection accuracy may be evaluated by the F-measure or the like. Whether it has improved may be decided by comparing the increase in the evaluation value against a threshold, or the accuracy may be regarded as improved whenever the evaluation value shows any positive increase, however small. The detection accuracy of the current model M(i-1) used for comparison may be computed, in the same way as the evaluation in step S2, from the same evaluation image set (the one selected in step S1).
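The improvement test of step S3 can be sketched with the F-measure as follows. The text does not specify how detections are matched against ground-truth regions, so this sketch assumes that counts of true positives, false positives, and false negatives are already available; the names `f_measure` and `is_improved` are hypothetical.

```python
def f_measure(tp: int, fp: int, fn: int) -> float:
    """F-measure (harmonic mean of precision and recall) from detection counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def is_improved(candidate_score: float, current_score: float,
                min_gain: float = 0.0) -> bool:
    """Step S3: accept the candidate only if its gain exceeds min_gain.
    min_gain = 0.0 corresponds to accepting any positive increase."""
    return candidate_score - current_score > min_gain


# Both scores are computed on the same evaluation image set.
current = f_measure(tp=6, fp=2, fn=4)
candidate = f_measure(tp=8, fp=2, fn=2)
accepted = is_improved(candidate, current)
```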

In step S4, the model judged to be improved in step S3 is set as the current model M(i) (=M(i+1-1)), which serves as the comparison target in the next iteration (iteration i+1) following the present one (iteration i). After the model is updated in this way, the process proceeds to step S5.

In step S5, it is determined whether the model update has converged. If so, the process proceeds to step S7; if not, it proceeds to step S6. Convergence may be declared when the current iteration count i reaches a fixed number, or when a threshold test shows that the degree of improvement of the latest current model M(i), obtained by the update in step S4, over the immediately preceding model M(i-1) has become small.

In step S7, from the series of models M(0), M(1), ..., M(i-1), M(i) obtained so far, M(i) is output as the best model, that is, as the final result of the learning unit 3 (the improved model in FIG. 1), and the flow of FIG. 5 ends.

In step S6, the current learning data SD(i-1) (at iteration i) is updated to obtain the learning data SD(i) (=SD(i+1-1)) used in the next iteration (iteration i+1), and the process returns to step S1. This update revises the division of the learning data SD(i-1) into the success image set D1(i-1) and the failure image set D2(i-1), yielding learning data SD(i) consisting of the success image set D1(i) and the failure image set D2(i). Specifically, a changed image set D3(i-1)⊂D2(i-1), which is a part of the failure image set D2(i-1), is changed to success images (that is, treated as success images from then on). The updated success image set D1(i) is the previous success image set D1(i-1) with the changed image set D3(i-1) added (their union), and the updated failure image set D2(i) is the previous failure image set D2(i-1) with the changed image set D3(i-1) removed (their set difference). Writing the union and set-difference operations as + and -, the update is expressed by the following set equations.
D1(i)=D1(i-1)+D3(i-1)
D2(i)=D2(i-1)-D3(i-1)

The changed image set D3(i-1), whose images are changed from failure images to success images, may be all or part of the failure images contained in the training image set, the verification image set, or the evaluation image set that were used in step S2 to learn and evaluate the model M(i), that is, the model judged to be improved in step S3 and adopted in step S4. (All of the failure images contained in the training, verification, and evaluation image sets may be used as the changed image set D3(i-1).) When only part of them is used, the changed image set D3(i-1) may be obtained by choosing a subset at random, or by choosing a predetermined number of images in order from the side temporally closest to the detection success section R1 (the initial success image set D1(0)).

When a failure image is changed to a success image, the target region that the estimation unit 23 estimated for the original failure image is regarded as a region detected by the detection unit 21 in the resulting success image, and the image is treated as a success image in the subsequent steps of the iterative flow of FIG. 5 (the steps following the step S6 in which the change was made).
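The control flow of FIG. 5 (steps S1 to S7) can be summarized in the sketch below. It is a skeleton under stated assumptions rather than the patent's implementation: `select`, `train`, `evaluate`, and `pick_changed` are hypothetical callables standing in for steps S1, S2, the accuracy measurement, and the choice of D3(i-1); convergence in step S5 is decided only by an iteration cap `max_outer`; and the cap `max_inner` on the small loop is also an assumption.

```python
from typing import Callable, Set


def improve_model(model, success: Set[int], failure: Set[int],
                  select: Callable, train: Callable, evaluate: Callable,
                  pick_changed: Callable, max_outer: int = 3,
                  max_inner: int = 10, min_gain: float = 0.0):
    """Return (final model, success image set, failure image set)."""
    success, failure = set(success), set(failure)
    for i in range(1, max_outer + 1):            # large loop (steps S1 to S6)
        for _ in range(max_inner):               # small loop (steps S1 to S3)
            train_set, verif_set, eval_set = select(success, failure)  # S1
            candidate = train(model, train_set, verif_set)             # S2
            gain = evaluate(candidate, eval_set) - evaluate(model, eval_set)
            if gain > min_gain:                                        # S3
                break
        else:
            return model, success, failure       # no improvement within the cap
        model = candidate                        # S4: M(i) replaces M(i-1)
        if i == max_outer:                       # S5: convergence by count
            break
        used = set(train_set) | set(verif_set) | set(eval_set)
        changed = set(pick_changed(used & failure))   # D3(i-1)
        success |= changed                       # S6: D1(i) = D1(i-1) + D3(i-1)
        failure -= changed                       #     D2(i) = D2(i-1) - D3(i-1)
    return model, success, failure               # S7


# Toy stand-ins: the "model" is a number and training always adds 1.
toy_select = lambda s, f: (sorted(f)[-2:], sorted(s)[:1], sorted(s)[1:2])
toy_train = lambda m, train_set, verif_set: m + 1
toy_eval = lambda m, eval_set: m / 10
final_model, d1, d2 = improve_model(0, set(range(17, 33)), set(range(1, 17)),
                                    toy_select, toy_train, toy_eval,
                                    pick_changed=lambda imgs: imgs)
```

Passing the steps in as callables keeps the loop structure visible and lets a real detector, sampler, and metric be plugged in without touching the control flow.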

FIG. 6 is a diagram showing, in correspondence with the example of FIG. 4, a schematic example of how the model M(i-1) and the learning data SD(i-1) are updated as the flow of FIG. 5 is repeated. As in FIG. 4, success images are drawn as white frames and failure images as gray frames. The example shows the flow of FIG. 5 repeated three times, for i=1, 2, 3, so that the model is updated as "M(0)→M(1)→M(2)→M(3)" and the learning data is correspondingly updated as "SD(0)→SD(1)→SD(2)→SD(3)".

In FIG. 6, as explained for FIG. 4, the initial learning data SD(0) has frames F(1) to F(16) as the failure image set D2(0) and frames F(17) to F(32) as the success image set D1(0). When the model and learning data are updated to M(1) and SD(1) at i=1 (the first iteration), frames F(14) and F(16) come to be treated as success images. Similarly, frames F(12) and F(13) come to be treated as success images at i=2 (the second iteration), and frames F(8) and F(10) at i=3 (the third iteration). Written as set equations, these updates of the success and failure image sets are as follows.

D1(1)=D1(0)+{F(14),F(16)}, D2(1)=D2(0)-{F(14),F(16)} …(update at i=1)
D1(2)=D1(1)+{F(12),F(13)}, D2(2)=D2(1)-{F(12),F(13)} …(update at i=2)
D1(3)=D1(2)+{F(8),F(10)}, D2(3)=D2(2)-{F(8),F(10)} …(update at i=3)
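The three updates above can be reproduced directly with set operations. In this small sketch frames are represented by their integer frame numbers, and the function name `update_sets` is a hypothetical choice for the example.

```python
from typing import Set, Tuple


def update_sets(d1: Set[int], d2: Set[int], d3: Set[int]) -> Tuple[Set[int], Set[int]]:
    """Step S6: D1(i) = D1(i-1) + D3(i-1), D2(i) = D2(i-1) - D3(i-1)."""
    if not d3 <= d2:
        raise ValueError("D3(i-1) must be a subset of D2(i-1)")
    return d1 | d3, d2 - d3


d1 = set(range(17, 33))   # D1(0): F(17)..F(32)
d2 = set(range(1, 17))    # D2(0): F(1)..F(16)
for d3 in ({14, 16}, {12, 13}, {8, 10}):   # changed sets at i = 1, 2, 3
    d1, d2 = update_sets(d1, d2, d3)

print(sorted(d2))  # [1, 2, 3, 4, 5, 6, 7, 9, 11, 15]
```

The remaining failure frames after the third update are F(1) to F(7), F(9), F(11), and F(15).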

Through the flow of FIG. 5, the learning unit 3 treats even failure images, which the existing model could not detect because the target was too distant, as ground-truth images after estimating their ground-truth regions, and repeatedly performs learning (training and verification) and its evaluation on randomly selected ground-truth images. When the evaluation shows that the detection accuracy of the model has improved, the corresponding failure images are reclassified as success images, and learning and evaluation continue in the same manner. As a result, as shown schematically in FIG. 6, the models M(1), M(2), M(3), which can also detect failure images that were previously undetectable at a distance, are obtained through successive updates, each one improved so that detection reaches farther as the iteration count i increases.

Some failure images may be in a condition in which they cannot contribute to improving distant recognition accuracy. Because of the small loop of steps S1 to S3 in FIG. 5, if a certain number or more of such failure images are selected, step S3 gives a negative result, and the flow waits until failure images that can contribute to improving distant recognition accuracy happen to be selected at random in a later pass through step S1. Such failure images can also be excluded by the method of supplementary note (3) below.

Supplementary notes on additional embodiments and related matters follow.

(1) In step S1 of FIG. 5 (at iteration i), the learning unit 3 may select failure images for the training, verification, and evaluation image sets not from the whole failure image set D2(i-1) but from within a predetermined number of top-ranked images on the side whose frame numbers are closest to the success image set D1(i-1).

For example, when selecting from the top four in the example of FIG. 6, for the learning data SD(0) of the first iteration (i=1), failure images may be selected not from the whole failure image set D2(0)=F(1) to F(16) but from F(16), F(15), F(14), F(13), the four frames closest to the success image set D1(0)=F(17) to F(32). Likewise, for the learning data SD(1), SD(2), and SD(3) of iterations i=2, 3, 4, failure images may be selected not from the whole of the failure image sets D2(1), D2(2), D2(3) (the frame images shown in gray in FIG. 6, as noted above) but from the following top four frames of each, taken from the side whose frame numbers are closer to the success image set (in the example of FIG. 6, the side at later times).
For the failure image set D2(1): F(15), F(13), F(12), F(11)
For the failure image set D2(2): F(15), F(11), F(10), F(9)
For the failure image set D2(3): F(15), F(11), F(9), F(7)
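The top-four candidates listed above can be reproduced with a short sketch. It ranks failure frames by their time difference from the first frame of the original detection success section (t1 = 17 in FIG. 6), which is one reading of "closest to the success image set"; the function name `nearest_failures` is hypothetical.

```python
from typing import Iterable, List


def nearest_failures(failure: Iterable[int], t1: int, k: int) -> List[int]:
    """Return the k failure frames closest in time to the start t1 of the
    detection success section (smallest time difference first)."""
    return sorted(failure, key=lambda t: t1 - t)[:k]


d2_0 = set(range(1, 17))       # D2(0)
d2_1 = d2_0 - {14, 16}         # D2(1)
d2_2 = d2_1 - {12, 13}         # D2(2)
d2_3 = d2_2 - {8, 10}          # D2(3)

print(nearest_failures(d2_0, 17, 4))  # [16, 15, 14, 13]
print(nearest_failures(d2_1, 17, 4))  # [15, 13, 12, 11]
print(nearest_failures(d2_2, 17, 4))  # [15, 11, 10, 9]
print(nearest_failures(d2_3, 17, 4))  # [15, 11, 9, 7]
```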

In this way, at iteration i, failure images are selected from within a predetermined number of top-ranked images on the side whose frame numbers are closest to the success image set D1(i-1). Among all failure images, in which the target is distant, this gives priority for learning and evaluation to those in which the target is as close to the camera as possible and therefore presumably appears at the largest size. It can thus be expected that images in which the target appears at a size far smaller than what the current model M(i-1) can detect are kept out of learning and evaluation. This in turn is expected to raise the likelihood of quickly obtaining a positive result in the accuracy improvement test of step S3, and so to speed up the improvement of model accuracy by the flow of FIG. 5.

(2) The description so far has used the case in which, as in the schematic example of FIG. 4, the learning data SD(0) consists of a success image set from one detection success section and a failure image set from the one corresponding detection failure section. To increase the number of images in the learning data SD(0), the preparation unit 2 may read one or more videos and obtain learning data SD(0) (a success image set D1(0) and a failure image set D2(0)) made up of the images of two or more mutually corresponding pairs of detection success and detection failure sections for a specified target of the same type.

When the method of supplementary note (1) is applied in this case, each frame (failure image) of a detection failure section may be tagged with its time difference from the end frame of the corresponding detection success section (the end adjacent to the detection failure section), and failure images may be selected from a predetermined number of top-ranked images with the smallest time differences.

(3) In the update process of step S6, the learning unit 3 may additionally perform the following processing. As a premise, for each image of the failure image set, the learning unit 3 counts and records the number of times the image was selected in step S1 as a training, verification, or evaluation image but the corresponding step S3 gave a negative result (that is, the number of times it was selected for learning or evaluation but the resulting model did not improve in detection accuracy). Then, in step S6, after changing part of the failure image set to success images, the learning unit 3 may, as additional processing, delete from the failure image set every image whose recorded count exceeds a threshold, so that it is excluded from selection in subsequent executions of step S1. This threshold may increase with the count i of the large loop of FIG. 5 (the loop over steps S1 to S6) or with the count of the small loop over steps S1 to S3.

A schematic example of this additional processing is as follows. The failure image F(15) in FIG. 6 lies close to the success image set (the initial detection success section R1), yet it remains a failure image and is not changed to a success image even as the learning data is updated from SD(1) to SD(3). It is therefore likely to be an image that does not contribute to improving the detection accuracy of the model (for example, one in which the target was captured under poor conditions), and its count may exceed the threshold at around loop count i=3 or 4, at which point it can be deleted from the failure image set.
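The counting and pruning of supplementary note (3) can be sketched as follows. The class name `FailureTracker`, its methods, and the fixed threshold are hypothetical; the text also allows a threshold that grows with the loop count, which would only change the value passed to `prune`.

```python
from collections import Counter
from typing import Iterable, Set


class FailureTracker:
    """Counts, per failure image, how often it was selected in step S1
    while the corresponding step S3 gave a negative result."""

    def __init__(self) -> None:
        self.rejections: Counter = Counter()

    def record_rejection(self, selected_failures: Iterable[int]) -> None:
        """Call after a negative step S3 with the failure images used in that pass."""
        self.rejections.update(selected_failures)

    def prune(self, failure: Set[int], threshold: int) -> Set[int]:
        """Step S6 add-on: drop images whose count exceeds the threshold."""
        return {f for f in failure if self.rejections[f] <= threshold}


tracker = FailureTracker()
for _ in range(3):                 # F(15) is selected in three rejected passes
    tracker.record_rejection([15, 11])
tracker.record_rejection([9])
remaining = tracker.prune({7, 9, 11, 15}, threshold=3)   # nothing exceeds 3 yet
tracker.record_rejection([15])
remaining = tracker.prune(remaining, threshold=3)        # F(15) is now removed
```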

(4) The video that the preparation unit 2 reads and the learning device 1 uses for learning has been described, with reference to FIG. 2, as in-vehicle camera video, but the learning device 1 can handle any video with similar characteristics. That is, the learning device 1 can handle video captured while moving along a road or the like by a camera mounted on any moving means, such as a robot or a drone.

(5) The following modification of the flowchart of FIG. 5 for the learning unit 3 is also possible. Steps S5 and S6 may be omitted, so that the flow proceeds from step S4 to step S7 and then ends. In this case the large loop described above runs only once (i=1), only the small loop (steps S1 to S3) can repeat, and the learning device 1 outputs the model M(1) as an improvement on the existing model M(0).

(6) In the description so far, when the preparation unit 2 (specifically, the identification unit 22) identifies a detection success section and the detection failure section adjacent to it, the two sections are identified as being contiguous. However, the two adjacent sections may instead be identified as non-contiguous and output to the estimation unit 23 as such. That is, one or more frames of the video may lie between the detection success section and the detection failure section, so that the sections are not contiguous, yet the sections may still be identified as adjacent because one precedes the other on the time axis. In this case, the identification unit 22 first detects the detection success section and the detection failure section as contiguous and adjacent, in the same manner as described above, and then deletes a predetermined number of frames (set in advance by an administrator or the like) from the detection success section side and/or the detection failure section side of the point where they meet. This yields a detection success section and a detection failure section that are non-contiguous but adjacent to each other.

(7) If the preparation unit 2 could not obtain appropriate training data, for example because the video used was unsuitable, or if the current model M(i-1) already has the highest accuracy attainable with the training data, then repeating the small loop of steps S1 to S3 in the learning unit 3 may never yield a determination in step S3 that the model has improved. Therefore, an upper limit may be set on the number of small-loop iterations within each i-th pass of the large loop of steps S1 to S6, and the learning process may be terminated when the limit is reached. If the limit is reached with i ≥ 2, the best model obtained so far, M(i-1), may be used as the output of the learning unit 3. If the limit is reached with i = 1, an error may be output indicating, for example, that the training data may be unsuitable.
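The iteration cap of modification (7) can be sketched as follows. This is an illustrative outline under stated assumptions, not the processing of the learning unit 3 itself: the callback train_and_evaluate stands in for the selection, training, and evaluation of steps S1 to S3, a higher score is assumed to mean a better model, and the names run_learning, max_attempts, and max_rounds are hypothetical. The update of the image sets in steps S5 and S6 is omitted.

```python
from typing import Callable, Tuple, TypeVar

Model = TypeVar("Model")


def run_learning(
    initial_model: Model,
    initial_score: float,
    train_and_evaluate: Callable[[Model, int], Tuple[Model, float]],
    max_attempts: int,
    max_rounds: int,
) -> Model:
    """Large loop over i = 1..max_rounds with a capped small loop inside."""
    best_model, best_score = initial_model, initial_score

    for i in range(1, max_rounds + 1):
        improved = False
        # Small loop (steps S1-S3), limited to max_attempts iterations.
        for attempt in range(max_attempts):
            model, score = train_and_evaluate(best_model, attempt)
            if score > best_score:
                best_model, best_score = model, score
                improved = True
                break

        if not improved:
            if i == 1:
                # No model better than M(0) was ever obtained.
                raise RuntimeError(
                    "no improvement over M(0); the training data may be unsuitable"
                )
            # For i >= 2, output the best model obtained so far, M(i-1).
            return best_model

    return best_model
```

Raising an exception for the i = 1 case is one way to signal the error mentioned in the text; returning a status value alongside the model would serve equally well.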

(8) FIG. 7 is a diagram showing an example of the hardware configuration of a general computer device 70. The learning device 1 and the detection device 10 can each be implemented as one or more computer devices 70 having this configuration. The computer device 70 includes: a CPU (central processing unit) 71 that executes predetermined instructions; a dedicated processor 72 (a GPU (graphics processing unit), a dedicated deep-learning processor, or the like) that executes some or all of the instructions of the CPU 71 in place of or in cooperation with the CPU 71; a RAM 73 serving as a main storage device that provides a work area for the CPU 71 and the dedicated processor 72; a ROM 74 serving as an auxiliary storage device; a communication interface 75; a display 76; an input interface 77 that accepts user input through a mouse, keyboard, touch panel, or the like; and a bus BS for exchanging data among these components.

Each unit of the learning device 1 and the detection device 10 can be implemented by the CPU 71 and/or the dedicated processor 72 reading from the ROM 74, and executing, a predetermined program corresponding to the function of that unit. When display-related processing is performed, the display 76 additionally operates in conjunction; when communication-related processing involving data transmission and reception is performed, the communication interface 75 additionally operates in conjunction.

1: learning device, 2: preparation unit, 3: learning unit, 21: detection unit, 22: identification unit, 23: estimation unit

Claims

A learning device comprising:
a preparation unit that identifies, by applying a predetermined object recognition method under an existing model to each frame of a video captured while moving by a camera provided on a moving means, a detection success section in which a predetermined target is determined to be continuously detected and a detection failure section, adjacent to the detection success section, in which the predetermined target is captured but is determined to be continuously not detected, and that prepares a success image set in which a detection area of the predetermined target is associated with each frame of the detection success section and a failure image set in which a region where the predetermined target should be detected is estimated and associated, as an estimated area, with each frame of the detection failure section; and
a learning unit that obtains a model improved from the existing model by training and evaluating a model of the predetermined object recognition method using, as ground truth, images selected from the success image set and the failure image set.

The learning device according to claim 1, wherein the preparation unit identifies, as the detection failure section, a section in which the predetermined target is captured but is not detected because the predetermined target is farther from the camera, and therefore appears smaller in the frame, than in the detection success section.

The learning device according to claim 1 or 2, wherein the preparation unit obtains the temporal change in the size of the detection area of the predetermined target over the detection success section and identifies the adjacent detection failure section at whichever of the two ends of the detection success section on the time axis the obtained temporal change indicates the size to be smaller.

The learning device according to any one of claims 1 to 3, wherein the preparation unit obtains the estimated area to be associated with each frame of the detection failure section by predicting the position and size of the region where the predetermined target should be detected in that frame from time-series data of the position and size of the detection area associated with each frame of the detection success section.

The learning device according to any one of claims 1 to 3, wherein each frame of the video is associated with position information of the camera in a world coordinate system at the time of capture, and
the preparation unit, under the assumption that the predetermined target is stationary in the world coordinate system, estimates stationary position information of the predetermined target in the world coordinate system, using known camera parameters, based on the detection area associated with each frame of the success image set and the position information of the camera at the time each of those frames was captured, and further
obtains the estimated area, in which the region where the predetermined target should be detected is estimated, based on the estimated stationary position information and the position information of the camera at the time each frame of the failure image set was captured.

The learning device according to any one of claims 1 to 5, wherein the learning unit determines a training image set and an evaluation image set by selecting images from the success image set and the failure image set so as to include at least one image of the success image set and at least one image of the failure image set, obtains a model by training with the training image set, and evaluates the model with the evaluation image set, and
repeats the selection, training, and evaluation until the evaluation of the obtained model is determined to have improved over the evaluation of the existing model.

The learning device according to claim 6, wherein, when selecting images from the failure image set, the learning unit selects from among a predetermined number of images whose frame times are closest to the frame times of the success image set.

The learning device according to claim 6 or 7, wherein, when the evaluation of the obtained model is determined to have improved over the evaluation of the existing model, the learning unit further
changes the treatment of all or some of the images belonging to the failure image set that were included in the training image set or the evaluation image set used for the obtained model, so that they belong to the success image set instead of the failure image set, replaces the existing model with the obtained model, and then repeats the selection, training, and evaluation.

The learning device according to claim 8, wherein, for an image whose treatment is changed so that it belongs to the success image set, the learning unit adopts, as the detection area associated with that image, the estimated area that was associated with the image while it was treated as belonging to the failure image set.

The learning device according to claim 8 or 9, wherein the learning unit counts, for each image of the failure image set, the number of times the image was selected for the training image set or the evaluation image set without the evaluation of the obtained model improving, and deletes from the failure image set any image for which this count exceeds a threshold.

The learning device according to any one of claims 1 to 10, wherein the preparation unit identifies the mutually adjacent detection success section and detection failure section either as sections that are continuous on the video or as sections that are discontinuous on the video.

The learning device according to any one of claims 1 to 11, wherein the learning unit obtains stepwise improved models by repeating training and evaluation and, each time an improved model is obtained, continues the training and evaluation after updating some of the images of the failure image set so that they belong to the success image set.

A program that causes a computer to function as the learning device according to any one of claims 1 to 12.