JP2022025878A

JP2022025878A - Object detection device and program

Info

Publication number: JP2022025878A
Application number: JP2020129039A
Authority: JP
Inventors: 仁宣牧野; Kiminobu Makino; 伶遠藤; Rei Endo; 貴裕望月; Takahiro Mochizuki
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2020-07-30
Filing date: 2020-07-30
Publication date: 2022-02-10

Abstract

To provide an object detection device capable of good and sufficient machine learning even when correct-answer images are difficult to collect.

SOLUTION: A detection unit detects a specific object included in an input frame image by using a machine-learnable model. A training data supply unit supplies training data used for training the model of the detection unit, the training data being a set of pairs of an input frame image and correct-answer data indicating whether or not the specific object is included in that frame image. An amplification unit creates an input frame image for the training data by pasting an image of the specific object at a prescribed position in an input frame image, generates correct-answer data indicating that the specific object is included in the created frame image, and adds the pair of the created frame image and its corresponding correct-answer data to the training data supplied by the training data supply unit.

SELECTED DRAWING: Figure 1

Description

Application of Article 30, Paragraph 2 of the Patent Act has been requested. (1) Date of issue: May 2020 (Reiwa 2); publication: "73rd Broadcasting Technology Report Meeting, Kansai Broadcasting and Technology Forum 2020" (proceedings); paper title: "Development and Verification of an AI-Based Election Poster Checker"; pages: 5-6; publisher: "Kansai Broadcasting and Technology Forum 2020" Executive Committee, Engineering Department, NHK Osaka Base Broadcasting Station. (2) Date held: May 19, 2020 (Reiwa 2); name of the meeting: "73rd Broadcasting Technology Report Meeting, Kansai Broadcasting and Technology Forum 2020"; organizer: "Kansai Broadcasting and Technology Forum 2020" Executive Committee, Engineering Department, NHK Osaka Base Broadcasting Station; venue: 17th-floor conference room, NHK Osaka Base Broadcasting Station.

The present invention relates to an object detection device and a program.

It is useful to be able to automatically detect a specific object appearing in video (or in a still image). If detection of a specific object in video can be automated, video can be tagged at low cost, which can be used, for example, to improve the accuracy of video search. In addition, in a business that produces video content, for example, it becomes possible to confirm at low cost that a specific object does or does not appear in the produced video content or in video content used as material. In the prior art, machine learning is also used to detect specific objects in video.

Non-Patent Documents 1, 2, and 3 describe techniques for detecting objects in video.

Patent Document 1 describes a technique for automatically cutting out a region containing a target object from a material image. It states that the automatically cut-out images are used for machine learning to detect that object.

Patent Document 2 describes a technique in which a robot patrolling a predetermined route detects objects. It also describes a technique for excluding noise (false detections) when the robot detects an object.

Patent Document 1: Japanese Unexamined Patent Publication No. 2020-046858
Patent Document 2: Japanese Unexamined Patent Publication No. 2020-042562

Non-Patent Document 1: Dong Lao, Ganesh Sundaramoorthi, "Minimum Delay Object Detection From Video", ICCV 2019, 2019, [online], [downloaded May 19, 2020], Internet <URL: https://arxiv.org/abs/1908.11092>
Non-Patent Document 2: Huizi Mao, Xiaodong Yang, William J. Dally, "A Delay Metric for Video Object Detection: What Average Precision Fails to Tell", ICCV 2019, 2019, [online], [downloaded May 19, 2020], Internet <URL: https://arxiv.org/abs/1908.06368>
Non-Patent Document 3: Kai Kang, Hongsheng Li, Tong Xiao, Wanli Ouyang, Junjie Yan, Xihui Liu, Xiaogang Wang, "Object Detection in Videos with Tubelet Proposal Networks", CVPR 2017, 2017, [online], [downloaded May 19, 2020], Internet <URL: https://arxiv.org/abs/1702.06355>

To detect a specific object or the like in an image using machine learning techniques, a large amount (at least a predetermined amount) of good learning data is required. However, it can be difficult to collect correct-answer images in which the object to be detected appears. As one example, when video content is produced, camera operators often frame their shots so that posters, signboards, and the like serving particular purposes do not appear in the frame. Consequently, among the material for such video content, frame images usable as correct-answer data for detecting the specific object either do not exist or are extremely few. When correct-answer images cannot be collected in sufficient numbers, machine learning based on them cannot be carried out adequately, and as a result the accuracy of object detection cannot be improved.

The present invention has been made in recognition of the above problem, and aims to provide an object detection device and a program capable of good and sufficient machine learning even when correct-answer images are difficult to collect.

[1] To solve the above problem, an object detection device according to one aspect of the present invention comprises: a detection unit that uses a machine-learnable model to detect a specific object included in an input frame image; a learning data supply unit that supplies learning data for training the model of the detection unit, the learning data being a set of pairs of an input frame image and correct-answer data indicating whether or not the specific object is included in that frame image; and an amplification unit that creates an input frame image for the learning data by pasting an image of the specific object at a predetermined position in an input frame image, generates correct-answer data indicating that the specific object is included in the created frame image, and adds the pair of the created frame image and its corresponding correct-answer data to the learning data supplied by the learning data supply unit.

[2] In another aspect of the present invention, in the above object detection device, when the specific object is included in the frame image, the correct-answer data includes position data representing the position within the frame image at which the specific object appears.

[3] In another aspect of the present invention, the above object detection device further comprises a filter processing unit that performs filter processing for blurring at least a partial region of the frame images of at least one of the learning data supplied by the learning data supply unit and the learning data added by the amplification unit.

[4] In another aspect of the present invention, in the above object detection device, the detection unit detects the specific object in each frame image included in a series of frame images, and the device further comprises a tracking unit that, within the series of frame images, identifies the specific object detected in each frame image and tracks that specific object across a plurality of the frame images.

[5] In another aspect of the present invention, in the above object detection device, the detection unit outputs a score representing how likely the specific object detected in the frame image is to be the specific object, and the device further comprises a tracking sequence determination unit that takes the scores, in each frame image included in the series of frame images, of the specific object identified by the tracking unit as a series of scores corresponding to the series of frame images, and determines on the basis of that series of scores whether or not the object is truly the specific object.

[6] In another aspect of the present invention, in the above object detection device, the detection unit outputs a score representing how likely the specific object detected in the frame image is to be the specific object.

[7] Another aspect of the present invention is a program for causing a computer to function as an object detection device comprising: a detection unit that uses a machine-learnable model to detect a specific object included in an input frame image; a learning data supply unit that supplies learning data for training the model of the detection unit, the learning data being a set of pairs of an input frame image and correct-answer data indicating whether or not the specific object is included in that frame image; and an amplification unit that creates an input frame image for the learning data by pasting an image of the specific object at a predetermined position in an input frame image, generates correct-answer data indicating that the specific object is included in the created frame image, and adds the pair of the created frame image and its corresponding correct-answer data to the learning data supplied by the learning data supply unit.

According to the present invention, machine learning of the detection unit's model can be performed using abundant learning data. An improvement in detection accuracy when the detection unit operates can therefore be expected.

A block diagram showing the schematic functional configuration of an object detection device according to a first embodiment of the present invention.
A schematic diagram showing a configuration example of learning data in the same embodiment.
A schematic diagram showing the processing by which the amplification unit of the same embodiment pastes an image of a specific object.
A schematic diagram showing an example of an image after the amplification unit of the same embodiment has pasted an image of a specific object.
A schematic diagram showing an example of amplification of learning data by the amplification unit of the same embodiment.
A block diagram showing the schematic functional configuration of an object detection device according to a second embodiment.
A block diagram showing the schematic functional configuration of an object detection device according to a third embodiment.
A block diagram showing an example of the functional configuration of a computer that implements the object detection device according to the first, second, and third embodiments.
A schematic diagram for explaining a method of identifying objects when object tracking is not performed.

Next, several embodiments of the present invention will be described with reference to the drawings. In every embodiment described below, image-based learning data is amplified (padded out). The object detection device obtains abundant learning data by this technique and performs machine learning processing.

[First Embodiment]
FIG. 1 is a block diagram showing the schematic functional configuration of an object detection device according to the first embodiment. As illustrated, the object detection device 1 includes a frame image acquisition unit 10, a learning data supply unit 20, an amplification unit 30, a filter processing unit 40, a detection unit 50, and a result output unit 80. Each of these functional units can be implemented by, for example, a computer and a program. Each functional unit also has storage means as needed. The storage means is, for example, a program variable or memory allocated when the program runs. Non-volatile storage means such as a magnetic hard disk drive or a solid state drive (SSD) may also be used as needed. At least some of the functions of each functional unit may be implemented as dedicated electronic circuits rather than as a program.

The frame image acquisition unit 10 acquires frame images from outside. It may acquire a single frame image or a series of frame images. One kind of series of frame images is video, which consists of frame images at a predetermined frame rate (number of frames per unit time). The frame image acquisition unit 10 may, for example, acquire a series of frame images captured by a video camera or the like via a cable or the like. It may also acquire a series of frame images recorded on an information recording medium such as a magnetic hard disk drive (HDD). The frame image acquisition unit 10 passes the acquired frame images to the detection unit 50.

The learning data supply unit 20 supplies data for machine learning of the model held by the detection unit 50. The learning data may include positive examples and negative examples. A positive example is learning data using an image (for example, a landscape image) in which the specific object appears. A negative example is learning data using an image (for example, a landscape image) in which the specific object does not appear. The learning data supply unit 20 attaches correct-answer data to each positive and negative image. Positive-example learning data has position information for the region in which the specific object appears (the coordinates of the four corner points of the region) and a score value of 1.0 for that region. Negative-example learning data has no such position information, and may be given a score value of 0.0. In other words, the learning data supply unit 20 supplies learning data for training the model held by the detection unit 50, the learning data being a set of pairs of an input frame image and correct-answer data indicating whether or not the specific object is included in that frame image. When the specific object is included in the frame image, the correct-answer data includes position data representing the position within the frame image at which the specific object appears.
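The pairing of an input frame image with its correct-answer data can be sketched as a small data structure. This is only an illustration under assumptions: the names `CorrectAnswer`, `TrainingPair`, `make_positive`, and `make_negative` are hypothetical, and an image identifier string stands in for the pixel data.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (x, y) pixel coordinates


@dataclass
class CorrectAnswer:
    """Correct-answer data for one frame image (hypothetical layout)."""
    contains_object: bool
    corners: Optional[List[Point]] = None  # four corner points, positive examples only
    score: float = 0.0                     # 1.0 for positive, 0.0 for negative


@dataclass
class TrainingPair:
    """One element of the learning data: an input image and its correct answer."""
    image_id: str  # stand-in for the actual pixel data
    answer: CorrectAnswer


def make_positive(image_id: str, corners: List[Point]) -> TrainingPair:
    """Build a positive example with a four-corner region and score 1.0."""
    if len(corners) != 4:
        raise ValueError("a quadrilateral region needs exactly four corners")
    return TrainingPair(image_id, CorrectAnswer(True, list(corners), 1.0))


def make_negative(image_id: str) -> TrainingPair:
    """Build a negative example: no position information, score 0.0."""
    return TrainingPair(image_id, CorrectAnswer(False, None, 0.0))


learning_data = [
    make_positive("street_001", [(120, 80), (220, 80), (220, 230), (120, 230)]),
    make_negative("park_014"),
]
```

Keeping the position information optional mirrors the text: only positive examples carry a region, while the score field is present in both cases.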

The amplification unit 30 amplifies the learning data. Here, "amplification" means increasing the amount of data having the same nature (for example, data used for the same purpose). In other words, the amplification unit 30 pads out the amount of learning data by performing image pasting. That is, the amplification unit 30 creates additional data to be added to the learning data supplied by the learning data supply unit 20. For this purpose, the amplification unit 30 may prepare in advance images of the specific object to be detected and images that do not contain the specific object. It is desirable for the amplification unit 30 to prepare in advance a large number of images of the specific object and a large number of images that do not contain it.

That is, the amplification unit 30 creates an input frame image for the learning data by pasting an image of the specific object at a predetermined position in an input frame image. The amplification unit 30 also generates correct-answer data indicating that the specific object is included in the created frame image, and adds the pair of the created frame image and its corresponding correct-answer data to the learning data supplied by the learning data supply unit 20. In the learning data added by the amplification unit 30 as well, when the specific object is included in the frame image, the correct-answer data includes position data representing the position within the frame image at which the specific object appears.

Examples of the specific object are as follows. As one example, the specific object is a poster or signboard serving a particular purpose. The "image that does not contain the specific object" mentioned above is, for example, a landscape image in which the specific object does not appear. However, the specific object is not limited to the posters and signboards given here as examples and may be anything. Likewise, the "image that does not contain the specific object" need not be a landscape image. The "poster or signboard serving a particular purpose" is, for example, a poster or signboard announcing or promoting a candidate or political party in an election for public office, although the specific object need not relate to elections. Examples of specific objects are varied: people (possibly a particular person), animals, plants, structures such as buildings, vehicles, food and drink, books, machines, electrical appliances, and so on. The specific object is not limited to those listed here.

The amplification unit 30 pastes the image of the specific object into a predetermined position in the "image that does not contain the specific object" so that it is fitted into that image. For example, when the specific object is a poster or signboard, the amplification unit 30 uses a planar surface appearing in the "image that does not contain the specific object" as the paste destination for the image of the specific object.

Detecting planes that appear in an image can itself be done with existing technology. For example, SynthText (https://github.com/ankush-me/SynthText) has a function for detecting planes in an image. To detect a plane, the amplification unit 30, for example, estimates the depth of what appears in the image and segments the image into regions based on the distribution of estimated depth. If, in a given region of the image, the depth changes linearly (or almost linearly) in every direction within the image, the amplification unit 30 judges that region to be a plane.
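The linearity criterion can be illustrated with a small check on an estimated depth map. This is a sketch under assumptions, not the method used by SynthText or by the device: the depth map is a plain list of rows, the candidate region is an axis-aligned rectangle, and `is_planar` and its tolerance `tol` are hypothetical. Depth is linear in both image directions exactly when the horizontal, vertical, and mixed second differences vanish, so the function tests those against the tolerance.

```python
from typing import List

DepthMap = List[List[float]]


def is_planar(depth: DepthMap, top: int, left: int, bottom: int, right: int,
              tol: float = 1e-2) -> bool:
    """Return True if depth changes (almost) linearly over rows top..bottom-1
    and columns left..right-1 of the depth map."""
    for y in range(top, bottom):
        for x in range(left, right):
            if x + 2 < right:
                # Second difference along the horizontal direction.
                horizontal = depth[y][x] - 2 * depth[y][x + 1] + depth[y][x + 2]
                if abs(horizontal) > tol:
                    return False
            if y + 2 < bottom:
                # Second difference along the vertical direction.
                vertical = depth[y][x] - 2 * depth[y + 1][x] + depth[y + 2][x]
                if abs(vertical) > tol:
                    return False
            if x + 1 < right and y + 1 < bottom:
                # Mixed difference: the horizontal slope must not vary with y.
                mixed = (depth[y][x] - depth[y][x + 1]
                         - depth[y + 1][x] + depth[y + 1][x + 1])
                if abs(mixed) > tol:
                    return False
    return True


# A synthetic wall whose depth is linear in x and y, and a copy with a bump.
wall = [[3.0 + 0.5 * x + 0.2 * y for x in range(6)] for y in range(5)]
bumpy = [row[:] for row in wall]
bumpy[2][3] += 0.5
```

With these inputs, `is_planar(wall, 0, 0, 5, 6)` accepts the whole map, while the bump makes the same call on `bumpy` fail.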

The amplification unit 30 decides, as appropriate, the region within the detected plane in which to paste the image of the specific object, and then pastes the image into that region. When the specific object is, for example, a poster or signboard, the amplification unit 30 may apply an affine transformation to the image of the specific object before pasting it. With an affine transformation, the amplification unit 30 can, for example, enlarge or reduce, rotate, or translate the poster image. As one example, the value of each element of the matrix used for the affine transformation may be set within the range of 0.9 to 1.1 inclusive. In this way, the amplification unit 30 can produce, from an image of the specific object and an image of a landscape or the like in which the specific object does not appear, an image of a landscape or the like in which the specific object appears.
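A random affine perturbation of the poster's corner points might look like the following sketch. The text gives only the range 0.9 to 1.1 for the matrix elements; applying that range to the two diagonal (scale) terms, drawing the off-diagonal (shear or rotation) terms from ±0.1, and leaving translation at zero are assumptions made here for illustration. `random_affine` and `apply_affine` are hypothetical names.

```python
import random
from typing import List, Tuple

Point = Tuple[float, float]
Affine = List[List[float]]  # 2x3 matrix [[a, b, tx], [c, d, ty]]


def random_affine(rng: random.Random) -> Affine:
    """Sample a mild affine transform.

    Assumption: the 0.9..1.1 range is applied to the diagonal terms only,
    and the off-diagonal terms stay within +/-0.1.
    """
    a = rng.uniform(0.9, 1.1)
    d = rng.uniform(0.9, 1.1)
    b = rng.uniform(-0.1, 0.1)
    c = rng.uniform(-0.1, 0.1)
    return [[a, b, 0.0], [c, d, 0.0]]


def apply_affine(matrix: Affine, points: List[Point]) -> List[Point]:
    """Map each (x, y) to (a*x + b*y + tx, c*x + d*y + ty)."""
    return [
        (matrix[0][0] * x + matrix[0][1] * y + matrix[0][2],
         matrix[1][0] * x + matrix[1][1] * y + matrix[1][2])
        for x, y in points
    ]


# Corners of a 100 x 150 poster image, clockwise from the top-left.
poster_corners = [(0.0, 0.0), (100.0, 0.0), (100.0, 150.0), (0.0, 150.0)]
matrix = random_affine(random.Random(0))
warped_corners = apply_affine(matrix, poster_corners)
```

The warped corner points can be reused directly as the position information of the pasted object, which is why the sketch transforms coordinates rather than pixels.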

When the amplification unit 30 pastes an image of the specific object into an image, it also creates the correct-answer data corresponding to the resulting image. The correct-answer data includes position information for the pasted specific object within the image. The position information is expressed using coordinate values within the image. For example, when the region containing the pasted specific object is a quadrilateral, the coordinates of its four corner points are one form of this position information. The region is not limited to a quadrilateral. It may be a polygon other than a quadrilateral, or a region bounded by a curve such as a circle or an ellipse.
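Pasting and label generation can be done in a single step, as in this sketch. It assumes an image is a list of rows of grey values, the paste region is an axis-aligned rectangle, and the correct answer is a plain dictionary; `paste_object` and the dictionary keys are hypothetical.

```python
from typing import Dict, List, Tuple

Image = List[List[int]]


def paste_object(background: Image, patch: Image, top: int, left: int
                 ) -> Tuple[Image, Dict[str, object]]:
    """Paste patch into a copy of background and return the new image
    together with correct-answer data describing where the patch landed."""
    patch_h, patch_w = len(patch), len(patch[0])
    back_h, back_w = len(background), len(background[0])
    if top < 0 or left < 0 or top + patch_h > back_h or left + patch_w > back_w:
        raise ValueError("patch does not fit inside the background")

    image = [row[:] for row in background]  # leave the original untouched
    for dy in range(patch_h):
        image[top + dy][left:left + patch_w] = patch[dy]

    right, bottom = left + patch_w - 1, top + patch_h - 1
    answer = {
        "contains_object": True,
        "score": 1.0,
        # Corner points as (x, y), clockwise from the top-left.
        "corners": [(left, top), (right, top), (right, bottom), (left, bottom)],
    }
    return image, answer


background = [[0] * 8 for _ in range(6)]   # 6 rows x 8 columns, no object
patch = [[9] * 3 for _ in range(2)]        # 2 rows x 3 columns "poster"
image, answer = paste_object(background, patch, top=1, left=4)
```

Because the position is known at paste time, the correct-answer data needs no manual annotation, which is the point of amplifying the learning data this way.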

The amplification unit 30 may generate learning data by pasting an image of the specific object onto a landscape image that it has prepared itself. The amplification unit 30 may also increase the amount of learning data by pasting the specific object onto an image from the learning data supplied by the learning data supply unit 20 (an image that contains the specific object or one that does not).

The filter processing unit 40 applies filter processing to the images in the amplified learning data output by the amplification unit 30. Specifically, the filter processing unit 40 performs filter processing (a Gaussian filter or the like) for blurring at least a partial region of the frame images of at least one of the learning data supplied by the learning data supply unit 20 and the learning data added by the amplification unit 30. For the individual images output by the amplification unit 30, the filter processing unit 40 may filter some and leave others unfiltered. For a single image output by the amplification unit 30, the filter processing unit 40 may also output both the filtered image and the unfiltered image. The filter processing performed by the filter processing unit 40 applies a Gaussian filter to at least part of the image. The filter processing unit 40 may, for example, filter only part of the image (for example, the location of the specific object). This produces the effect of blurring at least part of the image. The parts of the image to which the Gaussian filter has been applied have a Gaussian blur, giving an image that looks out of focus at the time of shooting, or something similar. That is, the processing of the filter processing unit 40 makes it possible to include in the learning data images in which at least part of the image is out of focus (or in a similar state).
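Blurring only part of an image can be sketched with a small Gaussian convolution restricted to a rectangle. This is a dependency-free illustration under assumptions: the image is a list of rows of grey values, borders are handled by clamping coordinates, and `gaussian_kernel`, `blur_region`, and the default `sigma` are hypothetical. A production system would more likely call an image library.

```python
import math
from typing import List

Image = List[List[float]]


def gaussian_kernel(sigma: float, radius: int) -> List[float]:
    """Normalized 1-D Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_region(image: Image, top: int, left: int, bottom: int, right: int,
                sigma: float = 1.0) -> Image:
    """Gaussian-blur rows top..bottom-1 and columns left..right-1.
    Pixels outside that rectangle are copied unchanged."""
    radius = max(1, int(3 * sigma))
    kernel = gaussian_kernel(sigma, radius)
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(top, bottom):
        for x in range(left, right):
            acc = 0.0
            for j, ky in enumerate(kernel):
                yy = min(max(y + j - radius, 0), height - 1)  # clamp at borders
                for i, kx in enumerate(kernel):
                    xx = min(max(x + i - radius, 0), width - 1)
                    acc += ky * kx * image[yy][xx]
            out[y][x] = acc
    return out


# A single bright pixel in the middle of a dark 7 x 7 image.
image = [[0.0] * 7 for _ in range(7)]
image[3][3] = 255.0
blurred = blur_region(image, top=2, left=2, bottom=5, right=5)
```

After the call, the bright pixel is spread over its neighbours inside the rectangle while everything outside stays as it was, which imitates a locally out-of-focus object.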

The detection unit 50 has the function of detecting the specific object in a frame image. It is configured to include a machine-learnable model, whose parameters can be updated by machine learning. The model is implemented using, for example, a neural network. That is, the detection unit 50 uses a machine-learnable model to detect the specific object included in an input frame image. When a series of frame images is passed from the frame image acquisition unit 10, the detection unit 50 detects the specific object in each frame image of the series. The detection unit 50 also outputs a score value representing how likely an object detected in the frame image is to be the specific object. The score value is a number within a predetermined range, for example a real number from 0 to 1 inclusive. It represents the likelihood that a candidate (or a region in the image) is the specific object. The score is calculated automatically by the model, which is to be trained by machine learning so that it outputs appropriate scores.

The detection unit 50 detects objects in the input frame image. It outputs position information for the region in which a detected object appears, together with the score at the time of detection. The score is a value representing the degree (likelihood) to which what appears in that region is the specific object. As one example, the score takes a value from 0.0 to 1.0 inclusive, although a different range may be defined. The position information is expressed, for example, by treating the region in which the object appears as a quadrilateral and giving the coordinates of its four vertices. These coordinate values are numbers representing pixel positions in the vertical and horizontal directions of the image. The detection unit 50 operates in one of two modes: a machine learning mode or a detection execution mode. Its operation in each mode is as follows.

(1) Machine learning mode
When operating in the machine learning mode, the detection unit 50 performs machine learning using the training data supplied from the training data supply unit 20 side. The training data is given as a set of pairs, each consisting of an image to be input to the detection unit 50 and the correct-answer data corresponding to that image. The training data may include data amplified by the amplification unit 30, and may also include image data filtered by the filter processing unit 40. The detection unit 50 performs detection processing on the input images contained in the training data and calculates the loss (error) between the detection result and the correct-answer data for each image. The loss calculated here reflects the difference in score for each region of the image. The detection unit 50 then updates the model parameters based on the calculated loss. When the model is configured as a neural network, the detection unit 50 updates the parameters by backpropagation, which is itself an existing technique. The detection unit 50 carries out this machine learning process using a large amount of training data.
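The description states only that the loss reflects the per-region score difference; it does not fix a formula. The following is a minimal sketch under the assumption of a mean squared error over regions. The function name `region_score_loss` is hypothetical and is not taken from the specification.

```python
from typing import Sequence


def region_score_loss(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared difference between predicted and correct-answer scores.

    ASSUMPTION: squared error is used only for illustration; the
    specification does not say which loss function the model uses.
    """
    if len(predicted) != len(target):
        raise ValueError("predicted and target must cover the same regions")
    if not predicted:
        raise ValueError("at least one region is required")
    # One term per region: difference between the model's score and the
    # correct-answer score (1.0 for the specific object, 0.0 otherwise).
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)
```

In a neural-network implementation, a value of this kind would be the quantity that backpropagation differentiates with respect to the model parameters.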

(2) Detection execution mode
When operating in the detection execution mode, the detection unit 50 performs processing to detect the specific object appearing in a frame image passed from the frame image acquisition unit 10. The detection unit 50 may process a sequence of frame images passed from the frame image acquisition unit 10 one after another. Such a sequence of frame images is a video, and its frame rate may be, for example, 30 fps (frames per second). Each frame image is associated with identification information that uniquely identifies it. The frame identification information is, for example, a combination of video identification information, which identifies the video (content), and a frame number. The frame number is expressed, for example, as the relative time of the frame (hours, minutes, seconds, and a serial number of the frame within the second), in the format hh:mm:ss.nnn, where hh is hours, mm is minutes, ss is seconds, and nnn is the serial number within the second (for example, 000 to 029). The detection unit 50 passes the result of the detection processing to the result output unit 80. The result output by the detection unit 50 is a set of pairs of frame identification information and detection result data. The detection result data is a set of pairs, each consisting of position information indicating the position of a region within the frame and the score of that region. The position information is expressed by the coordinate values of the four corners (four points) of the region.
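The output described above can be sketched as a small data structure together with a helper that formats a frame number as hh:mm:ss.nnn. This is an illustration only: the class and field names (`RegionResult`, `FrameResult`, `video_id`) and the helper `frame_number` are hypothetical, and a constant frame rate of 30 fps is assumed.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[int, int]  # (horizontal pixel position, vertical pixel position)


@dataclass
class RegionResult:
    corners: Tuple[Point, Point, Point, Point]  # four corners of the region
    score: float  # likelihood that the region shows the specific object


@dataclass
class FrameResult:
    video_id: str       # hypothetical name for the video identification information
    frame_number: str   # "hh:mm:ss.nnn"
    regions: List[RegionResult]


def frame_number(index: int, fps: int = 30) -> str:
    """Convert a zero-based frame index into the hh:mm:ss.nnn format.

    ASSUMPTION: the video has a constant frame rate of `fps`.
    """
    seconds, nnn = divmod(index, fps)   # nnn: serial number within the second
    minutes, ss = divmod(seconds, 60)
    hh, mm = divmod(minutes, 60)
    return f"{hh:02d}:{mm:02d}:{ss:02d}.{nnn:03d}"
```

At 30 fps the serial number nnn runs from 000 to 029, which matches the example range given in the text.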

The result output unit 80 outputs the determination result. In this embodiment, the result output unit 80 outputs the detection result from the detection unit 50 as is. Specifically, it outputs information identifying the frame image in which the specific object was detected, information specifying the region within that frame image (the coordinate values of four points), and the score of that region.

FIG. 2 is a schematic diagram showing a configuration example of the training data. The training data may include positive examples (images in which the specific object to be detected appears) and negative examples (images in which the specific object does not appear). In the figure, (A) is a positive example and (B) is a negative example. Both positive and negative examples are pairs of an image and its correct answer.

In the positive example (A), the hatched region in the image indicates where the specific object appears. The correct-answer data corresponding to this image holds information about that region. The position information in the correct-answer data consists of the coordinates of the four corner points of the region, each expressed as a pixel position in the horizontal and vertical directions. In the illustrated example, the four corner points are (320, 80), (399, 80), (320, 119), and (399, 119), and the score of the region is 1.000. In a positive example of the training data, a region in which the specific object appears has a score of 1.000. Conversely, a region in which the specific object does not appear has a score of 0.000. Any region for which no position information exists has a score of 0.000.

In the negative example (B), the specific object does not appear in the image. The correct-answer data corresponding to this image contains no region with a score of 1.000.
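The convention that regions without position data implicitly score 0.000 can be illustrated as follows. This is a sketch only; the names `Region`, `TrainingExample`, `image_path`, and `target_score`, and the file names, are hypothetical. The coordinates are those of the positive example in FIG. 2(A).

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[int, int]  # (horizontal, vertical) pixel position


@dataclass
class Region:
    corners: Tuple[Point, Point, Point, Point]
    score: float = 1.0  # regions listed in the correct-answer data score 1.000


@dataclass
class TrainingExample:
    image_path: str  # hypothetical reference to the input image
    regions: List[Region] = field(default_factory=list)


def target_score(example: TrainingExample, x: int, y: int) -> float:
    """Correct-answer score at pixel (x, y); 0.0 where no region is recorded."""
    for region in example.regions:
        xs = [p[0] for p in region.corners]
        ys = [p[1] for p in region.corners]
        if min(xs) <= x <= max(xs) and min(ys) <= y <= max(ys):
            return region.score
    return 0.0


# Positive example (A): one region with the corner coordinates from FIG. 2.
positive = TrainingExample(
    "positive.png", [Region(((320, 80), (399, 80), (320, 119), (399, 119)))]
)
# Negative example (B): no region data at all.
negative = TrainingExample("negative.png")
```

With this representation a negative example needs no special flag: an empty region list already means a score of 0.000 everywhere.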

FIG. 3 is a schematic diagram showing the process by which the amplification unit 30 pastes an image of the specific object. Part (A) shows an example of an image before amplification, in which the specific object does not appear. Part (B) shows the result of plane detection performed by the amplification unit 30 on the image of (A). In (B), the dark black regions are the portions detected as planes by this processing: wall surfaces, the floor, the top of a table, the side panels of furniture, and so on in the image of (A). The remaining regions (light gray) are portions that were not detected as planes. In this embodiment, the amplification unit 30 pastes the image of the specific object onto a plane region. The dark black regions in (B) are therefore the candidate locations for pasting the image of the specific object.

FIG. 4 is a schematic diagram showing an example of an image after the amplification unit 30 has pasted an image of the specific object. The image in FIG. 4 corresponds to the image of FIG. 3(A) and is the result of the amplification unit 30 pasting an image of the specific object onto it. The image of the specific object is pasted in the lower left part of FIG. 4. More precisely, it lies within the region enclosed by the four points whose (horizontal, vertical) coordinates are (30, 350), (80, 350), (30, 400), and (80, 400). In this example, the image pasted by the amplification unit 30 is an image of a poster, which is the specific object.

FIG. 5 is a schematic diagram showing an example of amplification of the training data by the amplification unit 30. In the figure, (A) is an example of training data before amplification; the specific object does not appear in its image. (B) is amplified training data (a positive example) obtained by the amplification unit 30 pasting the specific object onto part of the image of (A). The hatched region in the lower right part of the image of (B) is the region containing the pasted image of the specific object. (C) is amplified training data (a negative example) that uses the image of (A) as is, without pasting an image of the specific object. The correct-answer data in (A) has no data on a region containing the specific object, which is equivalent to assigning a score of 0.000 for the presence of the specific object to the entire image of (A). The correct-answer data in (C), like that in (A), has no data on a region containing the specific object. The correct-answer data in (B), on the other hand, does have data on a region containing the specific object. As illustrated, the position of that region is expressed by the coordinate values of its four corner points, whose (horizontal, vertical) coordinates are (480, 280), (559, 280), (480, 359), and (559, 359).

The amplification unit 30 can, for example, take the training data of (A) as input and output the training data of (B). It can also take the training data of (A) as input and output both (B) and (C). Alternatively, it may take the training data of (A) as input and output either (B) or (C) stochastically.
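The pasting step and the stochastic choice between outputs (B) and (C) can be sketched as below. This is a simplified illustration, not the disclosed implementation: images are plain nested lists of grayscale pixels, and the names `paste`, `amplify`, and `p_positive` are hypothetical.

```python
import random
from typing import List, Tuple

Image = List[List[int]]  # grayscale pixels, indexed as image[y][x]
Point = Tuple[int, int]
Corners = Tuple[Point, Point, Point, Point]


def paste(background: Image, patch: Image, left: int, top: int) -> Tuple[Image, Corners]:
    """Return a copy of `background` with `patch` pasted at (left, top),
    together with the four corner coordinates of the pasted region."""
    h, w = len(patch), len(patch[0])
    if top < 0 or left < 0 or top + h > len(background) or left + w > len(background[0]):
        raise ValueError("patch does not fit inside the background")
    out = [row[:] for row in background]  # leave the original image untouched
    for dy in range(h):
        out[top + dy][left:left + w] = patch[dy]
    right, bottom = left + w - 1, top + h - 1
    return out, ((left, top), (right, top), (left, bottom), (right, bottom))


def amplify(background: Image, patch: Image, left: int, top: int,
            p_positive: float = 0.5, rng=random):
    """Stochastically return a positive example (B) or a negative example (C).

    The second element is the correct-answer data: a list of
    (corners, score) pairs, empty for a negative example.
    ASSUMPTION: `p_positive` is an illustrative parameter.
    """
    if rng.random() < p_positive:
        image, corners = paste(background, patch, left, top)
        return image, [(corners, 1.0)]
    return [row[:] for row in background], []
```

For instance, pasting an 80 x 80 pixel patch at left=480, top=280 yields the corners (480, 280), (559, 280), (480, 359), and (559, 359), the values given for FIG. 5(B).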

As described above, in this embodiment the amplification unit 30 increases the amount of training data. The detection unit 50 can therefore perform machine learning with more abundant training data, which makes it possible to improve its accuracy.

The amplification unit 30 pastes the image of the specific object at a location where the specific object to be detected is likely to appear. To do so, it searches the image for such locations. As one example, a location where the specific object is likely to appear is on a plane. In this case, the amplification unit 30 performs processing to detect planes in the image onto which the object is to be pasted.
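Given a plane-detection result, the search for paste locations can be sketched as follows. The plane detection itself is not shown; the sketch assumes its output is available as a binary mask (truthy where a plane was detected, as in the dark regions of FIG. 3(B)). The function name `candidate_positions` is hypothetical, and the exhaustive scan is chosen for clarity rather than speed.

```python
from typing import List, Sequence, Tuple


def candidate_positions(plane_mask: Sequence[Sequence[int]],
                        patch_w: int, patch_h: int) -> List[Tuple[int, int]]:
    """Return every (left, top) at which a patch of the given size would lie
    entirely on pixels marked as plane in `plane_mask`.

    ASSUMPTION: `plane_mask[y][x]` is truthy for plane pixels.
    """
    height, width = len(plane_mask), len(plane_mask[0])
    positions = []
    for top in range(height - patch_h + 1):
        for left in range(width - patch_w + 1):
            if all(plane_mask[y][x]
                   for y in range(top, top + patch_h)
                   for x in range(left, left + patch_w)):
                positions.append((left, top))
    return positions
```

One of the returned positions could then be chosen, for example at random, as the place where the image of the specific object is pasted.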

Further, according to this embodiment, the filter processing unit 40 adds images blurred by filter processing to the training data. The detection unit 50 thus trains the model using training data that includes blur, and can learn cases in which the input frame image contains blur (blur caused by the image not being fully in focus). The detection unit 50 is therefore expected to become able to detect the specific object even in input images that contain blur.
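As an illustration of how a blurred training image might be produced, the following sketch applies a simple box (mean) filter. This passage does not specify which filter the filter processing unit 40 uses, so the box filter is a stand-in chosen for brevity; the name `box_blur` and the `radius` parameter are hypothetical.

```python
from typing import List, Sequence


def box_blur(image: Sequence[Sequence[float]], radius: int = 1) -> List[List[float]]:
    """Blur a grayscale image by averaging each pixel with its neighbours.

    ASSUMPTION: a box filter stands in for the unspecified filter.
    Near the border, only the pixels that exist are averaged.
    """
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            total = sum(image[j][i] for j in ys for i in xs)
            out[y][x] = total / (len(ys) * len(xs))
    return out
```

The correct-answer data of a blurred image can be copied from the original unchanged, because blurring does not move the region in which the specific object appears.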

[Second Embodiment]
Next, a second embodiment of the present invention will be described. Matters already described in the preceding embodiment may be omitted below; the description focuses on matters specific to this embodiment.

FIG. 6 is a block diagram showing the schematic functional configuration of the object detection device according to this embodiment. As illustrated, the object detection device 2, like the object detection device 1 of the first embodiment, has a frame image acquisition unit 10, a training data supply unit 20, an amplification unit 30, a filter processing unit 40, a detection unit 50, and a result output unit 80. The distinguishing feature of this embodiment is that the object detection device 2 also includes a tracking unit 60. The tracking unit 60 can likewise be implemented using, for example, a computer.

The tracking unit 60 tracks the specific objects detected by the detection unit 50 through the sequence of frame images processed by the detection unit 50. That is, the tracking unit 60 identifies the specific object detected in each frame image and tracks it across the multiple frame images in the sequence. For example, among the specific objects (candidates) detected by the detection unit 50, the tracking unit 60 tracks across frame images those whose score is equal to or greater than a predetermined threshold. To track a specific object, the tracking unit 60 uses, for example, the KCF (Kernelized Correlation Filter) algorithm. KCF is described in the following reference.
[Reference] Joao F. Henriques, Rui Caseiro, Pedro Martins, Jorge Batista, "High-Speed Tracking with Kernelized Correlation Filters", IEEE Transactions on Pattern Analysis and Machine Intelligence, https://arxiv.org/abs/1404.7584, 2014.

Instead of the KCF algorithm, the tracking unit 60 may perform tracking using another existing method, such as TLD (Tracking Learning Detection).

The tracking unit 60 passes the tracking result to the result output unit 80 together with the detection result data received from the detection unit 50. The result output unit 80 of this embodiment outputs the detection result obtained by the detection unit 50 (the position data within the image and the score for that position) as well as the tracking result. The tracking result data indicates, for example, which of the specific-object positions detected in another frame (for example, an adjacent frame) corresponds to the position of a specific object detected in a given frame. The tracking unit 60 can also readily pass to the result output unit 80, for each identified specific object, the sequence (time series) of per-frame scores obtained by the detection unit 50.
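The correspondence data described above (which detection in one frame matches which detection in an adjacent frame) can be illustrated with a deliberately simple association rule based on box overlap. Note that this is a stand-in and is not KCF or TLD; it only shows the shape of the tracking result. Boxes are assumed to be axis-aligned `(left, top, right, bottom)` tuples with inclusive pixel coordinates, and the names `iou`, `associate`, and `min_iou` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), inclusive pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0]) + 1
    ih = min(a[3], b[3]) - max(a[1], b[1]) + 1
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0] + 1) * (a[3] - a[1] + 1)
    area_b = (b[2] - b[0] + 1) * (b[3] - b[1] + 1)
    return inter / (area_a + area_b - inter)


def associate(prev_boxes: List[Box], curr_boxes: List[Box],
              min_iou: float = 0.5) -> Dict[int, Optional[int]]:
    """Map each current-frame box index to the best-overlapping previous-frame
    box index, or None when no previous box overlaps by at least `min_iou`.

    ASSUMPTION: simple overlap matching, used here in place of KCF. It does
    not prevent two current boxes from linking to the same previous box.
    """
    links: Dict[int, Optional[int]] = {}
    for j, curr in enumerate(curr_boxes):
        best, best_iou = None, min_iou
        for i, prev in enumerate(prev_boxes):
            value = iou(prev, curr)
            if value >= best_iou:
                best, best_iou = i, value
        links[j] = best
    return links
```

Following these links frame by frame gives, for each identified object, the list of per-frame scores that forms its score time series.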

According to this embodiment, the result output unit 80 can output not only the data of the specific objects detected in each frame image, but also the information on specific objects identified by tracking between frames and the time-series score values for each identified specific object. A user of the object detection device 2 according to this embodiment can make use of the information output by the result output unit 80.

[Third Embodiment]
Next, a third embodiment of the present invention will be described. Matters already described in the preceding embodiments may be omitted below; the description focuses on matters specific to this embodiment.

FIG. 7 is a block diagram showing the schematic functional configuration of the object detection device according to this embodiment. As illustrated, the object detection device 3, like the object detection device 2 of the second embodiment, has a frame image acquisition unit 10, a training data supply unit 20, an amplification unit 30, a filter processing unit 40, a detection unit 50, a tracking unit 60, and a result output unit 80. The distinguishing feature of this embodiment is that the object detection device 3 also includes a tracking sequence determination unit 70. The tracking sequence determination unit 70 can likewise be implemented using, for example, a computer.

The tracking sequence determination unit 70 determines, based on the result of tracking a specific object (candidate) in the frame images, whether the sequence of frame images (the video) contains the specific object. Specifically, it receives the data of the specific object (candidate) detected by the detection unit 50, including the score at detection, and the tracking information for the specific object (candidate) identified by the tracking unit 60. For each identified specific object (candidate), the tracking sequence determination unit 70 obtains the score in each of the frame images through which the tracking unit 60 tracked it. In other words, it obtains a sequence of scores. The tracking sequence determination unit 70 treats the scores of the specific object identified by the tracking unit 60 in the individual frame images as a score sequence corresponding to the sequence of frame images, and determines from that score sequence whether the candidate truly is the specific object. The determination is thus based on the score sequence rather than on the score in a single frame image.

As one example, the tracking sequence determination unit 70 calculates the average of the score sequence and makes the determination according to whether that average is equal to or greater than a predetermined threshold. If, and only if, the average of the score sequence is equal to or greater than the threshold, the tracking sequence determination unit 70 determines that the specific object (candidate) truly is the specific object.
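The average-based rule can be written directly. In the sketch below, the function name `is_specific_object_by_mean` and the default threshold of 0.5 are hypothetical; the specification says only that the threshold is predetermined.

```python
from statistics import mean
from typing import Sequence


def is_specific_object_by_mean(scores: Sequence[float], threshold: float = 0.5) -> bool:
    """Accept a tracked candidate only if the mean of its per-frame scores
    is at or above the threshold.

    ASSUMPTION: an empty score sequence is treated as "not detected".
    """
    if not scores:
        return False
    return mean(scores) >= threshold
```

A candidate that scores highly in only one frame of a longer track, for example 0.9 followed by several values near 0.1, falls below the threshold on average and is rejected, whereas a per-frame rule would have accepted it.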

As another example, the tracking sequence determination unit 70 may make the determination based on a statistic of the score sequence other than the average, for example its maximum or minimum value. In this case, the tracking sequence determination unit 70 determines whether the maximum or the minimum of the score sequence (whichever has been chosen in advance) is equal to or greater than a predetermined threshold. If, and only if, that value is equal to or greater than the threshold, the tracking sequence determination unit 70 determines that the specific object (candidate) truly is the specific object. The tracking sequence determination unit 70 may also make the determination based on one of the average, maximum, or minimum of the score sequence together with the variance of the scores in the sequence. For example, if, and only if, the chosen one of the average, maximum, or minimum is equal to or greater than a predetermined threshold and the variance is equal to or less than a predetermined variance threshold, the tracking sequence determination unit 70 determines that the specific object (candidate) truly is the specific object. Furthermore, the tracking sequence determination unit 70 may determine whether the specific object (candidate) truly is the specific object based on whether some other statistic of the score sequence falls within a predetermined range.
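The combined rule (a chosen statistic at or above one threshold, and the variance at or below another) can be sketched as follows. The function name `judge`, the default thresholds, and the use of population variance are assumptions made for illustration; the specification does not state which variance estimator is used.

```python
from statistics import mean, pvariance
from typing import Sequence

# The statistic is selected in advance, as in the description.
STATS = {"mean": mean, "max": max, "min": min}


def judge(scores: Sequence[float], stat: str = "mean",
          score_threshold: float = 0.5, variance_threshold: float = 0.05) -> bool:
    """True only if the chosen statistic is >= score_threshold and the
    variance of the scores is <= variance_threshold.

    ASSUMPTION: population variance (statistics.pvariance) and the default
    threshold values are illustrative choices.
    """
    if not scores:
        return False
    return (STATS[stat](scores) >= score_threshold
            and pvariance(scores) <= variance_threshold)
```

The variance condition rejects tracks whose scores swing widely between frames even when their maximum or average is high.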

As yet another example, the tracking sequence determination unit 70 may make the determination by inputting data based on the values of the score sequence into an SVM (Support Vector Machine) or an FFNN (Feed Forward Neural Network). In this case, the tracking sequence determination unit 70 inputs statistics of the score sequence, such as the average, variance, maximum, and minimum, into the SVM or FFNN, which outputs a determination result (information indicating whether the candidate is the specific object) based on those inputs. The SVM or FFNN is trained in advance (its internal parameters are adjusted) using positive-example and negative-example data. The adjustment of the internal parameters of an SVM or FFNN from positive and negative examples can itself be carried out with existing techniques. Each positive or negative example used for this adjustment is a pair consisting of the score-sequence statistics to be input to the SVM or FFNN and the correct answer (whether or not the candidate is the specific object).
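The input to such a classifier can be sketched as a fixed-length feature vector computed from a score sequence of any length. Only the feature extraction is shown; the function name `score_features`, the feature order, and the use of population variance are assumptions.

```python
from statistics import mean, pvariance
from typing import List, Sequence


def score_features(scores: Sequence[float]) -> List[float]:
    """Return [mean, variance, max, min] of a score sequence.

    ASSUMPTION: this order and the population variance are illustrative;
    the specification lists these statistics only as examples.
    """
    if not scores:
        raise ValueError("score sequence must not be empty")
    return [mean(scores), pvariance(scores), max(scores), min(scores)]
```

A training set would then consist of pairs such as `(score_features(track_scores), is_specific_object)`, which could be passed to any off-the-shelf SVM or feed-forward network implementation.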

As yet another example, the tracking sequence determination unit 70 may make the determination by inputting the values of the score sequence themselves, rather than statistics derived from them, into the SVM or FFNN. In this case, the SVM or FFNN performs its calculation on the input score sequence and outputs a determination result (information indicating whether the candidate is the specific object). Here too, the SVM or FFNN is trained in advance (its internal parameters are adjusted) using positive-example and negative-example data. Each positive or negative example used for this adjustment is a pair consisting of the score-sequence values to be input to the SVM or FFNN and the correct answer (whether or not the candidate is the specific object).

The tracking sequence determination unit 70 passes the result of its determination on the score sequence to the result output unit 80, together with the result data received from the detection unit 50 and the tracking unit 60. The result output unit 80 of this embodiment outputs the detection result obtained by the detection unit 50 (the position data within the image and the score for that position), the tracking result (including the score sequence data), and the determination result based on the score sequence.

According to this embodiment, the result output unit 80 can output not only the data of the specific objects detected in each frame image, but also the result determined by the tracking sequence determination unit 70 from the score sequence of each specific object (candidate) identified by tracking between frames. A user of the object detection device 3 according to this embodiment can therefore obtain a determination result based on a sequence of frame images of a given length, rather than on a single frame image.

In the second and third embodiments, a tracking method is used to identify objects across frame images within a sequence of frame images. In the third embodiment, a score sequence corresponding to the sequence of frame images is additionally obtained, and whether to treat the candidate as detected or not detected is determined from that score sequence. This makes it possible to eliminate false detections (over-detections) that occur only momentarily in the video, that is, in a single frame image or a small number of frame images.

FIG. 8 is a block diagram showing an example of the internal configuration of the object detection device 1, 2, or 3 in the embodiments described above. The object detection device 1, 2, or 3 can be implemented using a computer. As illustrated, the computer includes a central processing unit 901, a RAM 902, an input/output port 903, input/output devices 904 and 905, and a bus 906. The computer itself can be realized using existing technology. The central processing unit 901 executes the instructions contained in a program read from the RAM 902 or the like. In accordance with each instruction, it writes data to the RAM 902, reads data from the RAM 902, and performs arithmetic and logical operations. The RAM 902 stores data and programs. Each element of the RAM 902 has an address and can be accessed using that address. RAM is an abbreviation of "random access memory". The input/output port 903 is the port through which the central processing unit 901 exchanges data with external input/output devices and the like. The input/output devices 904 and 905 exchange data with the central processing unit 901 via the input/output port 903. The bus 906 is a common communication path used inside the computer. For example, the central processing unit 901 reads and writes data in the RAM 902 via the bus 906, and accesses the input/output port via the bus 906.

At least some of the functions of the object detection device 1, 2, or 3 can be implemented by a program running on a computer. In that case, the functions may be realized by recording a program for implementing them on a computer-readable recording medium and having a computer system read and execute the program recorded on that medium. The term "computer system" as used here includes an OS and hardware such as peripheral devices. The term "computer-readable recording medium" refers to a portable medium such as a flexible disk, a magneto-optical disk, a ROM, a CD-ROM, a DVD-ROM, or a USB memory, or to a storage device such as a hard disk built into a computer system. The "computer-readable recording medium" may further include a medium that holds the program dynamically for a short time, such as a communication line used when the program is transmitted over a network such as the Internet or over a communication line such as a telephone line, as well as a medium that holds the program for a certain period of time, such as volatile memory inside the computer system that serves as the server or client in that case. The program may implement only some of the functions described above, and may implement those functions in combination with a program already recorded in the computer system.

Embodiments of the present invention have been described above in detail with reference to the drawings; however, the specific configuration is not limited to these embodiments and also includes designs and the like within a scope that does not depart from the gist of the invention.

[Evaluation experiment]
The results of experiments evaluating the embodiments are described below.

Table 1 shows the results of detecting objects in still images using a trained model. Here, 200 items of learning data were collected, and amplification increased this to approximately 200,000 items. In this experiment, the learning data and the evaluation data resemble each other.

The first row of Table 1 is the baseline for comparison, in which the learning data was neither amplified nor filtered. The F1 value in this case was 0.80. The second row is the case where the learning data was amplified but not filtered. The F1 value in this case was 0.81, a better result than the baseline. The third row is the case where the learning data was both amplified and filtered. The F1 value in this case was 0.77, a worse result than the baseline. Specifically, in the third-row experiment, training on blurred images increased the number of correct detections but also increased the number of false detections. The lower score with filtering is attributable to the fact that blurred images are extremely rare in both the original learning data and the evaluation data. In practice, the model becomes able to detect the specific object (here, a poster) even when it appears blurred. In other words, detection performance on blurred images improves while the overall loss in detection performance is kept small.
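For reference, the F1 value is conventionally the harmonic mean of precision and recall. The following is a minimal sketch of that standard calculation; the function name and the counts are hypothetical and are not taken from Table 1. It illustrates why an increase in false detections can lower the F1 value even when correct detections also increase.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return the F1 value, the harmonic mean of precision and recall."""
    if true_positives == 0:
        # With no correct detections, precision and recall are both zero.
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only (not the experiment's data).
before = f1_score(true_positives=6, false_positives=2, false_negatives=4)
# One more correct detection, but four more false detections.
after = f1_score(true_positives=7, false_positives=6, false_negatives=3)
print(round(before, 2), round(after, 2))
```

In this made-up example the second configuration finds more of the objects, yet its F1 value is lower because precision drops faster than recall rises.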

In this experiment, amplifying the data was confirmed to improve object detection performance.
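The amplification evaluated here, in which an image of the specific object is pasted at a predetermined position in an input frame image and correct answer data is generated for the result, might be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the experiment: images are represented as nested lists of pixel values, and the names `paste_object`, `amplify`, `contains_object`, and `box` are hypothetical.

```python
def paste_object(frame, obj, top, left):
    """Return a copy of `frame` with `obj` pasted at (top, left), plus correct answer data.

    `frame` and `obj` are assumed to be non-empty, rectangular lists of pixel rows.
    """
    obj_h, obj_w = len(obj), len(obj[0])
    if top < 0 or left < 0 or top + obj_h > len(frame) or left + obj_w > len(frame[0]):
        raise ValueError("object does not fit inside the frame at this position")
    new_frame = [row[:] for row in frame]  # copy so the source frame stays unchanged
    for dy in range(obj_h):
        new_frame[top + dy][left:left + obj_w] = obj[dy]
    # Correct answer data: the object is present, with its position in the frame.
    label = {"contains_object": True, "box": (left, top, left + obj_w, top + obj_h)}
    return new_frame, label


def amplify(frames, obj, positions):
    """Create one (frame image, correct answer data) pair per frame and paste position."""
    return [paste_object(frame, obj, top, left)
            for frame in frames
            for (top, left) in positions]


background = [[0] * 5 for _ in range(4)]   # hypothetical 4x5 frame image
poster = [[1, 1], [1, 1]]                  # hypothetical 2x2 object image
pairs = amplify([background], poster, positions=[(0, 0), (2, 3)])
print(len(pairs), pairs[1][1]["box"])
```

Each collected frame yields as many new pairs as there are paste positions, which is how a small collected set can grow by orders of magnitude.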

Table 2 shows the results of detecting objects in video using a trained model. In this experiment, the evaluation data (video) contained images entirely different from those in the learning data.

The first row of Table 2 is the baseline for comparison, in which the learning data was neither amplified nor filtered and no tracking was performed. The F1 value in this case was 0.16. The second through fourth rows are cases where the learning data was both amplified and filtered. In each of these rows, performance is better than the first-row baseline. The second row is the case without tracking (first embodiment), with an F1 value of 0.19. The third row is the case with simple tracking (second embodiment), with an F1 value of 0.24. The fourth row is the case where tracking was performed and the determination was based on the average of the resulting score series (third embodiment), with an F1 value of 0.26. That is, the tracking described in the second and third embodiments improves detection performance substantially. Comparing the third and fourth rows further shows that the fourth row (third embodiment) achieves the higher detection performance.
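The determination used in the third embodiment, based on the average of the score series obtained by tracking, might look like the following minimal sketch. The function name, the threshold of 0.5, and the use of "greater than or equal to" are assumptions made for illustration; the text above states only that the average of the score series is used.

```python
from statistics import mean


def is_true_object(score_series, threshold=0.5):
    """Judge a tracked object from the mean of its per-frame detection scores.

    `threshold` is a hypothetical value chosen for this sketch.
    """
    if not score_series:
        raise ValueError("score series must not be empty")
    return mean(score_series) >= threshold


# Hypothetical score series for two tracked objects.
print(is_true_object([0.9, 0.3, 0.8, 0.7]))  # one weak frame is outweighed by the others
print(is_true_object([0.6, 0.2, 0.3]))       # one strong frame is not enough
```

Averaging over the series lets a single low-scoring frame, or a single spurious high-scoring frame, be outweighed by the other frames in which the same object was tracked.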

FIG. 9 is a schematic diagram for explaining the method of identifying an object between frames when the object detection device does not track objects. In each of parts (A) and (B) of the figure, the two rectangles are objects, or object candidates, detected in different frame images. To identify an object, an IoU (Intersection over Union) value is calculated between frames (for example, between adjacent frames). The hatched region in part (A) is the intersection of the two rectangles. The hatched region in part (B) is the union of the two rectangles. The IoU value is calculated by finding the areas of the intersection and of the union within the frame image, as illustrated, and dividing the former by the latter. Objects whose calculated IoU value is equal to or greater than a predetermined threshold (for example, 0.5) are identified as the same object and tracked.
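The IoU calculation described for FIG. 9 can be sketched as follows. The box representation `(x_min, y_min, x_max, y_max)` with axis-aligned rectangles and the function names are assumptions made for illustration; the threshold of 0.5 is the example value given in the text.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap; zero when the boxes do not overlap.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    intersection = inter_w * inter_h
    # Area of the union = sum of both areas minus the area counted twice.
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - intersection
    return intersection / union if union > 0 else 0.0


def is_same_object(box_a, box_b, threshold=0.5):
    """Identify two detections in different frames as the same object."""
    return iou(box_a, box_b) >= threshold


# Hypothetical detections in two adjacent frames.
print(is_same_object((0, 0, 10, 10), (2, 0, 12, 10)))  # IoU = 80 / 120
print(is_same_object((0, 0, 10, 10), (5, 0, 15, 10)))  # IoU = 50 / 150
```

A small shift between adjacent frames keeps the IoU above the threshold, while a larger shift drops it below, so the second pair is not treated as the same object.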

The present invention can be used in any industry that needs, for example, to detect a specific object in an image or to check whether a specific object is present in an image. As one example, in the business of producing video content, it can be used to confirm that a specific object does or does not appear in each frame image making up the video content. In particular, it can be used in systems that perform object detection on video for which learning data is difficult to collect. However, the scope of use of the present invention is not limited to the examples given here.

1, 2, 3 Object detection device
10 Frame image acquisition unit
20 Learning data supply unit
30 Amplification unit
40 Filter processing unit
50 Detection unit
60 Tracking unit
70 Tracking sequence determination unit
80 Result output unit
901 Central processing unit
902 RAM
903 Input/output port
904, 905 Input/output device
906 Bus

Claims

An object detection device comprising:
a detection unit that detects a specific object contained in an input frame image using a machine-learnable model;
a learning data supply unit that supplies learning data for training the model of the detection unit, the learning data being a set of pairs each consisting of an input frame image and correct answer data indicating whether or not the specific object is included in that frame image; and
an amplification unit that creates an input frame image of the learning data by pasting an image of the specific object at a predetermined position in an input frame image, generates correct answer data indicating that the created frame image includes the specific object, and adds the pair of the created frame image and its corresponding correct answer data to the learning data supplied by the learning data supply unit.

The object detection device according to claim 1, wherein, when the specific object is included in the frame image, the correct answer data includes position data representing the position in the frame image at which the specific object appears.

The object detection device according to claim 1 or 2, further comprising a filter processing unit that performs filtering to blur at least part of the frame images of at least one of the learning data supplied by the learning data supply unit and the learning data added by the amplification unit.

The object detection device according to any one of claims 1 to 3, wherein the detection unit detects the specific object in each frame image of a series of frame images, the device further comprising a tracking unit that identifies the specific object detected in each frame image of the series and tracks the specific object across a plurality of the frame images.

The object detection device according to claim 4, wherein the detection unit outputs a score representing how likely the specific object detected in the frame image is to be the specific object, the device further comprising a tracking sequence determination unit that takes the scores, in the respective frame images, of the specific object identified by the tracking unit across the series of frame images as a score series corresponding to the series of frame images, and determines, based on the score series, whether the specific object is truly the specific object.

The object detection device according to any one of claims 1 to 4, wherein the detection unit outputs a score representing how likely the specific object detected in the frame image is to be the specific object.

A program for causing a computer to function as an object detection device comprising:
a detection unit that detects a specific object contained in an input frame image using a machine-learnable model;
a learning data supply unit that supplies learning data for training the model of the detection unit, the learning data being a set of pairs each consisting of an input frame image and correct answer data indicating whether or not the specific object is included in that frame image; and
an amplification unit that creates an input frame image of the learning data by pasting an image of the specific object at a predetermined position in an input frame image, generates correct answer data indicating that the created frame image includes the specific object, and adds the pair of the created frame image and its corresponding correct answer data to the learning data supplied by the learning data supply unit.