JP7012673B2

JP7012673B2 - Video identification program, device, and method that select a context identification engine according to the video

Info

Publication number: JP7012673B2
Application number: JP2019002325A
Authority: JP
Inventors: 和之田坂; 広昌柳原
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2019-01-10
Filing date: 2019-01-10
Publication date: 2022-01-28
Anticipated expiration: 2039-01-10
Also published as: JP2020112961A

Description

The present invention relates to a technique for identifying contexts from video by using context identification engines whose learning models have been constructed in advance.

Conventionally, there is a technique that implements a plurality of machine learning models and automatically compares their evaluations (see, for example, Patent Document 1).
Similarly, there is a technique that implements a plurality of machine learning models and selects among them using a rule base (see, for example, Patent Document 2).
There is also a technique that trains and runs inference for a plurality of tasks with a single model (see, for example, Patent Document 3).
Further, there is a technique that, when identifying simultaneously with a plurality of machine learning models, improves identification accuracy by removing information unnecessary for learning, regardless of the types of the objects involved (see, for example, Patent Document 4).

FIG. 1 is a functional configuration diagram of a video identification device in the prior art.

According to FIG. 1, the video identification device 1 has a plurality of context identification engines whose learning models have been constructed in advance. A single video is input to the different context identification engines, and the identification result for each context is obtained at the same time. Specifically, the device can simultaneously identify not only the object type but also different contexts such as specific objects, edges, and behaviors. In this case, each machine learning model must be trained to identify with high accuracy over the entire video, in which a plurality of objects and their behaviors appear.

Patent Document 1: Japanese Unexamined Patent Publication No. 2017-004509
Patent Document 2: Japanese Patent No. 6224811
Patent Document 3: Japanese Unexamined Patent Publication No. 2018-055377
Patent Document 4: Japanese Unexamined Patent Publication No. 2014-106685

According to Patent Document 1 described above, the target video must be input to every machine learning model, so the more machine learning models there are, the more server computational resources are required.
Further, in both Patent Documents 1 and 2, the target video is input to each machine learning model as it is, so unnecessary information contained in the video may lower the identification accuracy.
Further, according to Patent Document 3, the learning model must be rebuilt each time a new task is added.
Similarly, Patent Document 4 also requires many machine learning models to be prepared for the same input video.

Thus, to improve the identification accuracy of each context for the same video, a dedicated machine learning engine must be prepared for each context, and the computational resources must be increased. Moreover, the closer to real time the contexts are to be identified from the video, the further the computational resources must be increased.

As another problem, when, for example, video of a room containing a photograph of an animal is processed by a machine learning engine, the engine may not only identify the animal as a chair but also get other identifications wrong. With existing machine learning engines, when an object that does not fit the predetermined context is present in the image, object detection itself tends to go wrong. That is, it is difficult for the engine to judge that identifying that context in that environment is itself inappropriate.

Accordingly, an object of the present invention is to provide a video identification program, device, and method capable of identifying a plurality of contexts from a video with high accuracy while using few computational resources.

According to the present invention, there is provided a video identification program that causes a computer to select a context identification engine according to a video, wherein
a plurality of context identification engines, including at least a behavior identification engine, are started in advance, each identifying a different predetermined context from the video, and
the program causes the computer to function as:
an object detection engine that detects, from an input video, an object image in which an object is enclosed by a frame, and an object type for that object image;
a selection means that, for each object image, selects a behavior identification engine corresponding to the object type from among the plurality of context identification engines when the object type of that object image is a moving object; and
an image editing means that re-trims the video so as to obtain an image wider than the object image, edits the object image into an edited image having a frame rate suitable for the behavior identification engine, and inputs the edited image to that behavior identification engine.
According to another embodiment of the video identification program of the present invention, it is also preferable that
the plurality of context identification engines include an edge identification engine,
the selection means selects, for each object image, an edge identification engine corresponding to the object type from among the plurality of context identification engines, and
the image editing means inputs the entire video as it is to that edge identification engine.
According to yet another embodiment of the video identification program of the present invention, it is also preferable that
the plurality of context identification engines include an object identification engine,
the selection means selects, for each object image, an object identification engine corresponding to the object type from among the plurality of context identification engines, and
the image editing means edits the object image into an edited image having a frame rate and/or resolution suitable for the object identification engine, and inputs the edited image to that object identification engine.

According to another embodiment of the video identification program of the present invention, it is also preferable that one or more context identification engines are assigned to each service item, and that the program further causes the computer to function as an activation means that starts one or more context identification engines according to the service item of the application that uses the identified contexts.

According to another embodiment of the video identification program of the present invention, it is also preferable that one or more context identification engines are assigned to each object type, and that the program further causes the computer to function as an activation means that starts one or more context identification engines in advance according to the one or more object types detected by the object detection engine.

According to another embodiment of the video identification program of the present invention, it is also preferable that the activation means updates, at predetermined time intervals, the one or more running context identification engines according to the one or more object types detected by the object detection engine.

According to another embodiment of the video identification program of the present invention, it is also preferable that one or more context identification engines are assigned to each predetermined time period, and that the program further causes the computer to function as an activation means that starts one or more context identification engines in advance according to the predetermined time period.

According to another embodiment of the video identification program of the present invention, it is also preferable that the object detection engine detects a bounding box and uses the image inside that bounding box as the object image.

According to the present invention, there is provided a video identification device that selects a context identification engine according to a video, wherein
a plurality of context identification engines, including at least a behavior identification engine, are started in advance, each identifying a different predetermined context from the video, and
the device comprises:
an object detection engine that detects, from an input video, an object image in which an object is enclosed by a frame, and an object type for that object image;
a selection means that, for each object image, selects a behavior identification engine corresponding to the object type from among the plurality of context identification engines when the object type of that object image is a moving object; and
an image editing means that re-trims the video so as to obtain an image wider than the object image, edits the object image into an edited image having a frame rate suitable for the behavior identification engine, and inputs the edited image to that behavior identification engine.

According to the present invention, there is provided a video identification method for a device that selects a context identification engine according to a video, wherein
the device has started in advance a plurality of context identification engines, including at least a behavior identification engine, each identifying a different predetermined context from the video, and
the device executes:
a first step of detecting, with an object detection engine, from an input video, an object image in which an object is enclosed by a frame, and an object type for that object image;
a second step of selecting, for each object image, a behavior identification engine corresponding to the object type from among the plurality of context identification engines when the object type of that object image is a moving object; and
a third step of re-trimming the video so as to obtain an image wider than the object image, editing the object image into an edited image having a frame rate suitable for the behavior identification engine, and inputting the edited image to that behavior identification engine.

According to the video identification program, device, and method of the present invention, a plurality of contexts can be identified from a video with high accuracy while using few computational resources.

FIG. 1 is a functional configuration diagram of a video identification device in the prior art.
FIG. 2 is a functional configuration diagram of the video identification device according to the present invention.
FIG. 3 is an explanatory diagram showing a video input to the video identification device.
FIG. 4 is an explanatory diagram showing the objects detected by the object detection engine from the video of FIG. 3.
FIG. 5 is an explanatory diagram showing the processing flow of the selection unit, the image editing unit, and the context identification unit.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

FIG. 2 is a functional configuration diagram of the video identification device according to the present invention.

The video identification device 1 can select a context identification engine according to the video. According to FIG. 2, the video identification device 1 includes an object detection engine 11, a selection unit 12, an image editing unit 13, a plurality of context identification engines 14, an activation unit 15, and an application 16. These functional components can be realized by executing a program that causes a computer mounted on the device to function. The processing flow of these functional components can also be understood as a video identification method.

The video identification device 1 may function, for example, as a server connected to the Internet. In that case, captured video is input to the video identification device 1 from various terminals 2 equipped with cameras. For example, the following terminals 2 can be assumed.
A drive recorder mounted on an automobile
A smartphone or mobile terminal carried by a user
A Web camera installed in a home
Of course, the functions of the video identification device 1 may themselves be incorporated in the terminal 2.

The video identification device 1 acting as a server may receive captured video via an access network such as a mobile phone network or a wireless LAN. Alternatively, the video may be input from an SD card on which video captured by a Web camera has been recorded. The video identification device 1 identifies a plurality of contexts from the input video, and those contexts can be used by various applications.

FIG. 3 is an explanatory diagram showing a video input to the video identification device.

According to FIG. 3, the video input to the video identification device 1 was captured by the camera of a drive recorder mounted on an automobile and shows the view outside the vehicle. Assume that the following objects appear in this video.
A "sign" installed at the side of the road
A "person" walking along the side of the road
A "vehicle" traveling ahead
A "vehicle" parked in a parking lot

[Object detection engine 11]
The object detection engine 11 detects, from the input video, an "object image" in which an object is enclosed by a frame, and an "object type" for that object image.
For the "object image", it detects a bounding box and takes the image inside that bounding box.
The "object type" may be an object detection category. For example, in the case of video captured by a drive recorder, objects such as signs, persons, and vehicles are detected from the video.

An example of the object detection engine 11 is SSD (Single Shot Multibox Detector). SSD divides the image into a grid and, from how well a plurality of fixed bounding boxes at each grid cell fit, detects the bounding box at that position. Each bounding box contains one object.

FIG. 4 is an explanatory diagram showing the objects detected by the object detection engine from the video of FIG. 3.

According to FIG. 4, four object images, each enclosed by a bounding box, and their respective object types (sign ID, person ID, vehicle ID) are detected from the video of FIG. 3.
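As an illustration only, the detection result of FIG. 4 could be represented as a list of records, one per bounding box. The class name `Detection`, the field names, the type labels, and all coordinates below are assumptions made for this sketch and do not come from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detected object: a bounding box plus its object type."""
    object_type: str  # illustrative labels: "sign", "person", "vehicle"
    x: int            # left edge of the bounding box, in pixels
    y: int            # top edge of the bounding box, in pixels
    width: int
    height: int


# Hypothetical detections for the scene of FIG. 3; coordinates are made up.
detections = [
    Detection("sign", 40, 60, 50, 50),
    Detection("person", 120, 200, 40, 110),
    Detection("vehicle", 300, 180, 160, 120),  # vehicle traveling ahead
    Detection("vehicle", 520, 210, 140, 100),  # parked vehicle
]


def types_present(dets):
    """Return the set of object types that appear among the detections."""
    return {d.object_type for d in dets}
```

Four bounding boxes yield three distinct object types here, because both vehicles share the same type.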

[Context identification engine 14]
According to FIG. 2 described above, assume that a plurality of context identification engines 14 that identify, for example, the following contexts have been started in advance.
Object identification engines 141, each dedicated to one object type
Edge identification engines 142 for the objects, corresponding to a plurality of object types
Behavior identification engines 143, each dedicated to one object type

<Object identification engine 141>
The object identification engine 141 can identify objects that appear in the captured video.
An example of the object identification engine 141 is YOLO (You Only Look Once) (registered trademark), a neural network of the CNN (Convolutional Neural Network) type based on RGB recognition. Like SSD, described above as the object detection engine, it identifies objects.
However, unlike the object detection engine 11, the object identification engine 141 here has a learning model built specifically for each object type and identifies the object type in detail. For example, when the object detection engine 11 identifies the object type as "sign ID", the object identification engine 141 uses a learning model built specifically for signs and distinguishes, for example, a "30 km/h speed limit sign", a "stop sign", and an "under construction sign".

<Edge identification engine 142>
The edge identification engine 142 can identify each region (edge) of the objects in the video pixel by pixel.
For example, when a person is detected as an object, it can identify the pedestrian's position, such as whether the person is on a pedestrian crossing on the road or somewhere other than a pedestrian crossing.

An example of the edge identification engine is DeepLabV3 (registered trademark), which is based on semantic segmentation. This is a kind of image-based deep learning that can classify objects at the pixel level. Ordinary classification reduces the pixel information to the class dimension, whereas semantic segmentation classifies on a pixel-by-pixel basis. That is, each pixel can be labeled (annotated) with what it is.
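The pedestrian example can be sketched as a lookup in a per-pixel label mask. This is a minimal sketch assuming the segmentation result is already available as a 2D grid of class labels. The function name `ground_label_under`, the label strings, and the rule of inspecting the row at the bottom of the bounding box are all assumptions for illustration, not the patent's method.

```python
from collections import Counter


def ground_label_under(mask, box):
    """Return the most common label on the row at the bottom of a bounding box.

    mask: 2D list of per-pixel class labels (a list of rows).
    box:  (x, y, width, height) of the detected person, in pixels.
    """
    x, y, w, h = box
    row = min(y + h, len(mask) - 1)  # row at the person's feet, clamped to the mask
    labels = mask[row][x:x + w]
    return Counter(labels).most_common(1)[0][0]


# Tiny hypothetical mask: a person standing directly above a crosswalk.
mask = [
    ["sky", "sky", "sky", "sky", "sky", "sky"],
    ["road", "road", "person", "person", "road", "road"],
    ["road", "crosswalk", "crosswalk", "crosswalk", "crosswalk", "road"],
]
on_crosswalk = ground_label_under(mask, (2, 1, 2, 1)) == "crosswalk"
```

The lookup needs pixels outside the person's own bounding box, which is consistent with the edge identification engine receiving the whole frame rather than the cropped object image.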

<Behavior identification engine 143>
The behavior identification engine 143 can identify, from the behavior of an object, what kind of action it is taking. In addition to the RGB images of the video, it uses movement features to identify the movement of the object type detected by the object detection engine 11.

An example of the behavior identification engine 143 is Two-Stream CNN (registered trademark). It uses a spatial CNN (Spatial stream ConvNet) and a temporal CNN (Temporal stream ConvNet) to extract both the appearance features of the objects and background in the image and the motion features in the sequences of horizontal and vertical components. For a vehicle, for example, it can identify behavior such as "weaving left and right" or "swaying unsteadily".
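The text does not say how the two streams are combined. One common choice for two-stream networks is late fusion, in which the per-class scores of the two streams are averaged. The sketch below assumes that choice; the function name `fuse_scores`, the weighting parameter, and the class labels are hypothetical.

```python
def fuse_scores(spatial, temporal, temporal_weight=0.5):
    """Average per-class scores from the spatial and temporal streams.

    spatial, temporal: dicts mapping a behavior label to a score in [0, 1].
    Returns (best_label, fused_scores) over the labels both streams share.
    """
    labels = spatial.keys() & temporal.keys()
    fused = {
        label: (1 - temporal_weight) * spatial[label]
        + temporal_weight * temporal[label]
        for label in labels
    }
    return max(fused, key=fused.get), fused


# Appearance alone favors "straight", but the motion stream favors "weaving".
best, fused = fuse_scores(
    {"straight": 0.7, "weaving": 0.3},
    {"straight": 0.2, "weaving": 0.8},
)
```

With equal weights the motion evidence outweighs the appearance evidence in this example, so the fused label is "weaving".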

FIG. 5 is an explanatory diagram showing the processing flow of the selection unit, the image editing unit, and the context identification unit.

[Selection unit 12]
The selection unit 12 holds a table that assigns one or more context identification engines 14 to each object type. Using this table, the selection unit 12 selects, for each object image (bounding box), one or more context identification engines 14 corresponding to the object type of that object image. For example, assume that object types and context identification engines are linked as follows.
<Object type> -> <Context identification engine>
Sign ID -> Object identification engine
Person ID -> Edge identification engine
Vehicle ID -> Behavior identification engine
Note that this table is updated according to the context identification engines 14 that are currently running, as started by the activation unit 15 described later.
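The table lookup can be sketched as a dictionary keyed by object type, filtered by the set of engines that are currently running. The identifiers `ENGINES_BY_OBJECT_TYPE` and `select_engines`, and the string names used for object types and engines, are assumptions for this sketch.

```python
# Hypothetical table mirroring the example: object type -> assigned engines.
ENGINES_BY_OBJECT_TYPE = {
    "sign": ["object_identification"],
    "person": ["edge_identification"],
    "vehicle": ["behavior_identification"],
}


def select_engines(object_type, running):
    """Return the engines assigned to object_type that are currently running.

    running: set of engine names reported as started by the activation unit.
    Unknown object types select no engine.
    """
    assigned = ENGINES_BY_OBJECT_TYPE.get(object_type, [])
    return [engine for engine in assigned if engine in running]


running = {"behavior_identification", "edge_identification"}
selected = select_engines("vehicle", running)
```

Filtering by `running` is one way to reflect the note that the table follows the engines started by the activation unit: an object type whose engine has been stopped simply selects nothing.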

[Image editing unit 13]
For each object image (bounding box), the image editing unit 13 edits the image into an edited image suited to each selected context identification engine 14 and inputs it to the context identification engine 14 selected by the selection unit 12. This makes it possible to maintain the identification accuracy of each context identification engine 14.

For the object identification engine 141, the resolution of the object image (bounding box) is enlarged. Specifically, an enlarged box obtained by scaling the object image by a predetermined ratio is derived as the "enclosing region". Further, because the object identification engine 141 identifies on the basis of RGB images, the object image may be edited so that its frame rate is thinned out to, for example, 1 fps.
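A minimal sketch of these two edits follows, assuming boxes are given as (x, y, width, height) in pixels and frames as a plain list. The function names, the choice to scale about the box center, the clamping to the frame, and the every-n-th-frame rule are assumptions for illustration.

```python
def enlarge_box(box, ratio, frame_size):
    """Scale a box about its center by `ratio`, clamped to the frame bounds."""
    x, y, w, h = box
    frame_w, frame_h = frame_size
    new_w, new_h = w * ratio, h * ratio
    cx, cy = x + w / 2, y + h / 2
    left = max(0, round(cx - new_w / 2))
    top = max(0, round(cy - new_h / 2))
    right = min(frame_w, round(cx + new_w / 2))
    bottom = min(frame_h, round(cy + new_h / 2))
    return (left, top, right - left, bottom - top)


def thin_frames(frames, source_fps, target_fps):
    """Keep roughly target_fps frames per second by taking every n-th frame."""
    step = max(1, round(source_fps / target_fps))
    return frames[::step]


enclosing = enlarge_box((100, 100, 40, 20), 1.5, (640, 480))
kept = thin_frames(list(range(60)), source_fps=30, target_fps=1)
```

Two seconds of 30 fps video reduce to two frames at a 1 fps target, which is the kind of saving that makes a per-type object identification engine inexpensive to run.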

For the edge identification engine 142, the entire original video input to the video identification device 1 is input as it is, rather than the object image. This is because, for example, when a person is detected by the object detection engine 11, the positional relationship between that pedestrian and the road is also needed, so an object image consisting only of the bounding box is insufficient.

For the behavior identification engine 143, the original video is re-trimmed over a range wider than the object image and edited to a frame rate that matches the behavior to be identified. For example, when the object image is a vehicle, the input frame rate is kept as it is so that the behavior can be recognized with high accuracy, and the video is re-trimmed wider than the object image to allow for movement to the left and right.
Note that the behavior identification engine 143 may extract locations where the same feature point moves between frames and identify the movement of the object as a "vector".
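The wider re-trim and the feature-point "vector" can be sketched as follows. The margin values, the function names, and the (x, y, width, height) box convention are assumptions for this sketch. The patent states only that the crop is wider than the object image and that left-right movement is taken into account.

```python
def widen_for_behavior(box, frame_size, side_margin=0.5, vertical_margin=0.1):
    """Re-trim wider than the object image, with extra room to the left and right.

    Margins are fractions of the box width and height; the result is clamped
    to the frame bounds.
    """
    x, y, w, h = box
    frame_w, frame_h = frame_size
    dx, dy = round(w * side_margin), round(h * vertical_margin)
    left, top = max(0, x - dx), max(0, y - dy)
    right, bottom = min(frame_w, x + w + dx), min(frame_h, y + h + dy)
    return (left, top, right - left, bottom - top)


def motion_vector(prev_point, next_point):
    """Displacement of one tracked feature point between two frames."""
    return (next_point[0] - prev_point[0], next_point[1] - prev_point[1])


crop = widen_for_behavior((300, 180, 160, 120), (640, 480))
vector = motion_vector((10, 20), (13, 18))
```

The horizontal margin is deliberately larger than the vertical one, so that a vehicle drifting sideways stays inside the crop across consecutive frames.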

[Activation unit 15]
The activation unit 15 starts one or more context identification engines 14 according to, for example, the following three patterns.
(Activation pattern 1) The activation unit 15 assigns one or more context identification engines 14 to each service item. In this case, it starts one or more context identification engines according to the service item of the application that uses the identified contexts.
For example, when the application is a "drive recorder" and the service item is "grasping traffic flow on the road", it suffices to identify in detail only the contexts that appear on the road (for example, vehicles and persons).
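Activation pattern 1 can be sketched as a lookup from service item to the engines it needs. The service item names, the engine names, and the particular assignments below are invented for illustration; the patent does not list which engines belong to which service item.

```python
# Hypothetical assignment of engines to service items.
ENGINES_BY_SERVICE_ITEM = {
    "traffic_flow": {"behavior_identification", "edge_identification"},
    "sign_inventory": {"object_identification"},
}


def engines_for_services(service_items):
    """Return the union of engines assigned to the requested service items."""
    required = set()
    for item in service_items:
        required |= ENGINES_BY_SERVICE_ITEM.get(item, set())
    return required


required = engines_for_services(["traffic_flow"])
```

An application offering only the traffic flow service item would then never start the engine assigned to signs.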

(Activation pattern 2) The activation unit 15 assigns one or more context identification engines 14 to each object type. In this case, it starts one or more context identification engines 14 according to the one or more object types detected by the object detection engine 11.
At this time, the activation unit 15 updates, at predetermined intervals (for example, every 10 minutes), the one or more context identification engines 14 to be started according to the one or more object types detected by the object detection engine 11.

In this way, contexts based on objects that do not appear in the video need not be identified, so the corresponding context identification engines 14 need not be run at all.
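Activation pattern 2 can be sketched as a small state holder that collects the object types seen during each interval and refreshes the set of engines to run when the interval elapses. The class name, the explicit `now_s` timestamp argument, and the choice to start with no engines running until the first interval has passed are assumptions of this sketch.

```python
class ActivationByObjectType:
    """Refresh the engines to run from the object types seen in each interval."""

    def __init__(self, table, interval_s=600.0):
        self.table = table            # object type -> set of engine names
        self.interval_s = interval_s  # e.g. 600 s for the 10-minute example
        self.seen = set()             # object types seen in the current interval
        self.window_start = None
        self.running = set()

    def observe(self, object_types, now_s):
        """Record detected types; update the running set once per interval."""
        if self.window_start is None:
            self.window_start = now_s
        self.seen |= set(object_types)
        if now_s - self.window_start >= self.interval_s:
            self.running = set().union(
                *(self.table.get(t, set()) for t in self.seen)
            )
            self.seen = set()
            self.window_start = now_s
        return self.running


table = {"sign": {"object"}, "person": {"edge"}, "vehicle": {"behavior"}}
activation = ActivationByObjectType(table, interval_s=600.0)
activation.observe(["vehicle"], now_s=0.0)
after_first_interval = activation.observe(["sign"], now_s=600.0)
```

Passing the time in explicitly keeps the sketch deterministic. A real implementation would more likely read a monotonic clock.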

(Activation pattern 3) The activation unit 15 assigns one or more context identification engines to each predetermined time period. In this case, it starts one or more context identification engines 14 according to the predetermined time period. For example, for video of a road, given the characteristics of the camera, it is possible to switch between context identification engines 14 that should run only during the day and context identification engines 14 that should run only at night.
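Activation pattern 3 can be sketched as a schedule of time-of-day bands. The band boundaries (06:00 and 18:00) and the engine names are made-up values for illustration. A band whose start is later than its end is treated as wrapping past midnight.

```python
from datetime import time

# Hypothetical schedule: (band start, band end, engines to run in that band).
SCHEDULE = [
    (time(6, 0), time(18, 0), {"daytime_engine"}),
    (time(18, 0), time(6, 0), {"night_engine"}),  # wraps past midnight
]


def in_band(t, start, end):
    """True if t lies in [start, end), allowing bands that wrap past midnight."""
    if start <= end:
        return start <= t < end
    return t >= start or t < end


def engines_for_time(t, schedule=SCHEDULE):
    """Return the engines assigned to every band that contains time t."""
    required = set()
    for start, end, engines in schedule:
        if in_band(t, start, end):
            required |= engines
    return required


at_noon = engines_for_time(time(12, 0))
late_evening = engines_for_time(time(23, 30))
```

Half-open bands ensure that exactly one of the two engines is selected at the 06:00 and 18:00 boundaries.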

The activation unit 15 notifies the selection unit 12 of the context identification engines 14 that are running.
Meanwhile, the activation unit 15 stops any context identification engine 14 that it determines does not need to be running.
This limits the context identification engines 14 that are started and keeps the overall computational resources down.
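The start and stop decision reduces to two set differences between the engines that are running and the engines that are required. The function name `reconcile` and the engine names are assumptions for this sketch. How the engines are actually started, stopped, and reported to the selection unit is outside its scope.

```python
def reconcile(running, required):
    """Return (to_start, to_stop) so that the running set becomes `required`."""
    to_start = required - running
    to_stop = running - required
    return to_start, to_stop


running = {"object_identification", "behavior_identification"}
required = {"behavior_identification", "edge_identification"}
to_start, to_stop = reconcile(running, required)

# The set that would be reported to the selection unit after applying the changes.
now_running = (running | to_start) - to_stop
```

Engines already running and still required appear in neither set, so they are left untouched rather than restarted.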

As described in detail above, according to the video identification program, device, and method of the present invention, a context identification engine can be selected according to the input video, and a plurality of contexts can be identified with high accuracy while using few computational resources.

Those skilled in the art can readily make various changes, modifications, and omissions to the embodiments of the present invention described above within the scope of the technical idea and viewpoint of the present invention. The foregoing description is merely an example and is not intended to be limiting in any way. The present invention is limited only by the claims and their equivalents.

1 Video identification device
11 Object detection engine
12 Selection unit
13 Image editing unit
14 Context identification engine
141 Object identification engine
142 Edge identification engine
143 Behavior identification engine
15 Activation unit
16 Application
2 Terminal

Claims

A video identification program that causes a computer to select a context identification engine according to a video, wherein
a plurality of context identification engines, including at least a behavior identification engine, are started in advance, each identifying a different predetermined context from the video, and
the program causes the computer to function as:
an object detection engine that detects, from an input video, an object image in which an object is enclosed by a frame, and an object type for that object image;
a selection means that, for each object image, selects a behavior identification engine corresponding to the object type from among the plurality of context identification engines when the object type of that object image is a moving object; and
an image editing means that re-trims the video so as to obtain an image wider than the object image, edits the object image into an edited image having a frame rate suitable for the behavior identification engine, and inputs the edited image to that behavior identification engine.

The video identification program according to claim 1, wherein
the plurality of context identification engines includes an edge identification engine,
the selection means selects, for each object image, an edge identification engine corresponding to the object type from the plurality of context identification engines, and
the image editing means causes the computer to function so as to input the entire image as is to the edge identification engine.

The video identification program according to claim 1 or 2, wherein
the plurality of context identification engines includes an object identification engine,
the selection means selects, for each object image, an object identification engine corresponding to the object type from the plurality of context identification engines, and
the image editing means causes the computer to function so as to edit the object image into an edited image having a frame rate and/or resolution suitable for the object identification engine and to input the edited image to the object identification engine.

The video identification program according to any one of claims 1 to 3, wherein one or more context identification engines are assigned to each service item, and the program further causes the computer to function as activation means for starting the one or more context identification engines corresponding to the service item of an application that uses the identified contexts.

The video identification program according to any one of claims 1 to 3, wherein one or more context identification engines are assigned to each object type, and the program further causes the computer to function as activation means for starting in advance the one or more context identification engines corresponding to the one or more object types detected by the object detection engine.

The video identification program according to claim 5, wherein the activation means causes the computer to function so as to update, at predetermined time intervals, the one or more context identification engines that are started according to the one or more object types detected by the object detection engine.

The video identification program according to any one of claims 1 to 3, wherein one or more context identification engines are assigned to each predetermined time slot, and the program further causes the computer to function as activation means for starting in advance the one or more context identification engines corresponding to the predetermined time slot.

The video identification program according to any one of claims 1 to 7, wherein the object detection engine causes the computer to function so as to detect a bounding box and use the image within the bounding box as the object image.

A video identification device that selects a context identification engine according to the video, wherein
a plurality of context identification engines, including at least a behavior identification engine, are started in advance in order to identify different predetermined contexts from the video, the device comprising:
an object detection engine that detects, from the input video, an object image in which an object is enclosed by a frame, and the object type of the object image;
selection means for selecting, for each object image, a behavior identification engine corresponding to the object type from the plurality of context identification engines when the object type of the object image is a moving object; and
image editing means for trimming the video so as to be wider than the object image, editing the object image into an edited image having a frame rate suitable for the behavior identification engine, and inputting the edited image to the behavior identification engine.

A video identification method for a device that selects a context identification engine according to the video, wherein
the device has started in advance a plurality of context identification engines, including at least a behavior identification engine, in order to identify different predetermined contexts from the video, and
the device executes:
a first step of detecting, from the input video and using an object detection engine, an object image in which an object is enclosed by a frame, and the object type of the object image;
a second step of selecting, for each object image, a behavior identification engine corresponding to the object type from the plurality of context identification engines when the object type of the object image is a moving object; and
a third step of trimming the video so as to be wider than the object image, editing the object image into an edited image having a frame rate suitable for the behavior identification engine, and inputting the edited image to the behavior identification engine.