JP6340675B1 - Object extraction device, object recognition system, and metadata creation system

Info

Publication number: JP6340675B1
Application number: JP2017038203A
Authority: JP
Inventors: 明雄小金; 恒利田中; 俊雄石松
Original assignee: J-STREAM INC.
Current assignee: J-STREAM INC.
Priority date: 2017-03-01
Filing date: 2017-03-01
Publication date: 2018-06-13
Anticipated expiration: 2037-03-01
Also published as: JP2018147019A

Abstract

[Problem] To extract and recognize objects from moving image data with high accuracy, and to make the resulting object information more convenient, thereby promoting the use and spread of moving images. [Solution] An object extraction device processes a processing frame 22, which is a two-dimensional image of moving image data 21, as follows. It performs feature point extraction processing to extract a plurality of feature points 23 and detect a first feature amount of each feature point 23 on the two-dimensional image. It performs depth detection processing to detect the depth of each feature point 23 relative to the surrounding feature points 23. It performs three-dimensional space estimation processing to estimate a real three-dimensional space 26 of the processing frame 22 based on the first feature amount and the depth of each feature point 23, and to detect a second feature amount of each feature point 23 in the real three-dimensional space 26. It then performs object extraction processing based on the second feature amount and the color distribution of each feature point 23 to detect a feature point group 24 consisting of two or more feature points 23 having feature amounts in the real three-dimensional space 26, and extracts that group as a candidate object 25. [Selected drawing] FIG. 2

Description

The present invention relates to an object extraction device that extracts objects, such as people and things, displayed in images such as moving images; an object recognition system that uses the object extraction device; and a metadata creation system that uses the object recognition system.

Conventionally, on networks such as the Internet, computers such as video servers and video databases store moving image data and distribute it to viewer terminals. To promote such video distribution, devices and systems have been proposed that create metadata related to moving image data and deliver it to viewers.

For example, the metadata distribution device described in Patent Document 1 uses an extraction conversion table and station-specific data to extract and convert, from the metadata of a key station's content, the metadata for the content of a network program that the local station broadcasts, and then distributes the converted metadata. The key station's content metadata is thus delivered to receivers as the local station's content metadata, so that network stations other than the key station can provide server-type broadcasting for network programs by using the key station's content metadata.

Patent Document 1: JP 2006-325134 A

However, with devices and systems such as the metadata distribution device described above, video information cannot be provided unless the broadcasting station prepares metadata for that information in advance. Video information therefore cannot be provided for moving image data that has no such metadata.

In addition, because various objects such as people and things appear in moving image data, it is desirable for the metadata to describe information that identifies these objects, the time ranges in which they appear, and so on. An operator who creates metadata can identify the objects and determine their appearance times by watching the moving image data and checking which objects appear, but this work places a heavy burden on the operator. A device or system that automatically recognizes objects from moving image data is therefore desired.

A device or system that recognizes objects does so by, for example, cutting still image data out of moving image data, extracting an object from the still image data, and comparing the extracted object with learning data prepared in advance. However, such still image data is normally a two-dimensional planar image, whereas an actual object has features in a three-dimensional space with depth and is photographed from various angles. It has therefore been difficult to extract accurate features of an object from the still image data of a two-dimensional planar image.

Moreover, preparing learning data for each of the angles from which an object may be photographed requires a storage device of enormous capacity to hold the large amount of learning data, and an enormous number of comparison operations, which increases equipment cost and processing effort. For these reasons, the accuracy of object recognition falls and, furthermore, the desired metadata cannot be generated. Because the metadata that viewers want cannot be distributed, the use and spread of moving image data may stagnate.

When objects are recognized from moving image data that already contains three-dimensional information about the objects, the moving image data must be generated beforehand by photographing the objects with an imaging device that supports three dimensions, which increases equipment cost.

In view of the above circumstances, an object of the present invention is to extract objects displayed in moving image data with high accuracy, recognize them with high accuracy, and make the information on the recognized objects more convenient, thereby promoting the use and spread of moving images.

To solve the above problem, a first object extraction device of the present invention is characterized in that it: performs feature point extraction processing on a processing frame, which is the object extraction target among the plurality of frames of two-dimensional images constituting moving image data, to extract a plurality of feature points of the processing frame and detect a first feature amount of each feature point on the two-dimensional image; performs depth detection processing on the processing frame to detect, for each feature point of the processing frame, its depth relative to the surrounding feature points; performs three-dimensional space estimation processing on the processing frame to estimate a real three-dimensional space within the processing frame based on at least the first feature amount and the depth of each of the plurality of feature points, and to detect a second feature amount of each of the feature points in the real three-dimensional space; and performs object extraction processing based on at least the second feature amount and the color distribution of each of the plurality of feature points to detect a feature point group consisting of a set of two or more feature points of the processing frame, and extracts the feature point group, which has feature amounts in the real three-dimensional space, as a candidate object of the processing frame.
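The final grouping step of the first object extraction device can be illustrated with a minimal sketch. The text does not specify how the second feature amount or the grouping rule is computed, so everything below is an assumption made for illustration: the `FeaturePoint` class, the `to_3d` lifting of a 2D position by its relative depth, the thresholds `max_gap` and `max_color_gap`, and the use of union-find clustering are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from itertools import combinations
from math import dist


@dataclass(frozen=True)
class FeaturePoint:
    x: float       # position on the 2D image (stand-in for the first feature amount)
    y: float
    depth: float   # depth relative to the surrounding feature points
    color: tuple   # (r, g, b) sampled at the point


def to_3d(p):
    """Hypothetical second feature amount: the 2D position lifted by depth."""
    return (p.x, p.y, p.depth)


def extract_candidates(points, max_gap=10.0, max_color_gap=40.0):
    """Group points that are close in 3D and similar in color (union-find)."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        near = dist(to_3d(points[i]), to_3d(points[j])) <= max_gap
        similar = dist(points[i].color, points[j].color) <= max_color_gap
        if near and similar:
            parent[find(i)] = find(j)

    groups = {}
    for i, p in enumerate(points):
        groups.setdefault(find(i), []).append(p)
    # Only groups of two or more feature points count as candidate objects.
    return [g for g in groups.values() if len(g) >= 2]
```

Under these assumptions, two points that are adjacent on the image but far apart in depth fall into different groups, which is the benefit the text attributes to using depth in addition to the two-dimensional features.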

A second object extraction device of the present invention is the first object extraction device described above, characterized in that, when the plurality of processing frames constituting the moving image data include two or more common processing frames that are consecutive on the time axis and constitute the same scene, and the depth detection processing, the three-dimensional space estimation processing, and the object extraction processing are performed on each of the two or more common processing frames, a feature point group that has feature amounts in the real three-dimensional space and is detected in common across the two or more common processing frames is extracted as a candidate object of the same scene.

A third object extraction device of the present invention is the second object extraction device described above, characterized in that, when the feature point extraction processing is performed on each of the two or more common processing frames, the feature points of one common processing frame are increased by using the difference between the plurality of feature points extracted from that common processing frame and the plurality of feature points extracted from another common processing frame.

A fourth object extraction device of the present invention is the second or third object extraction device described above, characterized in that the number of feature points in the feature point group extracted as the candidate object in the processing frame is increased by applying image quality enhancement processing to the moving image data.

A fifth object extraction device of the present invention is any one of the first to fourth object extraction devices described above, characterized in that the candidate object of the same scene has, in addition to the feature amounts in the real three-dimensional space, the amount of temporal displacement of those feature amounts within the same scene.

A sixth object extraction device of the present invention is the fifth object extraction device described above, characterized in that the image quality enhancement processing stores in advance learning data of various color distributions, obtained by sampling the differences between high-quality data and low-quality data for each of a plurality of template images having various color distributions in each minute region, and enhances the image quality of the processing frame by using the learning data that best fits each minute region of the processing frame.

Furthermore, to solve the above problem, a first object recognition system of the present invention is characterized by comprising: any one of the first to sixth object extraction devices described above; an object database that stores a plurality of reference objects for recognizing the candidate object, together with the original image of each reference object and attached information related to each reference object, each reference object having been extracted as a feature point group having feature amounts in the real three-dimensional space of its original image by applying the feature point extraction processing, the depth detection processing, the three-dimensional space estimation processing, and the object extraction processing to that original image; and an object recognition device that performs object recognition processing to determine which of the plurality of reference objects stored in the object database the candidate object extracted by the object extraction device corresponds to, wherein, when the object recognition device determines that the candidate object corresponds to one of the plurality of reference objects, it adds to the candidate object the object information generated based on the attached information of that reference object.

A second object recognition system of the present invention is the first object recognition system described above, characterized in that the object recognition processing is performed by comparing the feature point group of the candidate object and the color distribution in the processing frame with the feature point group of the reference object and the color distribution in the original image.

A third object recognition system of the present invention is the first or second object recognition system described above, characterized in that the object database classifies the plurality of reference objects based on their respective attached information, and stores two or more reference objects that share attached information in a common category that uses the shared attached information as its classification information.

A fourth object recognition system of the present invention is any one of the first to third object recognition systems described above, characterized in that a candidate object determined to correspond to one of the plurality of reference objects is stored in the object database as a new reference object in the category into which that reference object is classified.

A fifth object recognition system of the present invention is the fourth object recognition system described above, characterized in that a candidate object determined not to correspond to any of the plurality of reference objects is stored in the object database as a reference object of a new category into which that candidate object is classified.
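The recognition and database-update behavior of the object recognition systems can be sketched as follows. This is a simplified illustration, not the method of the invention: the feature point group and color distribution are collapsed into a single numeric tuple, matching uses plain Euclidean distance against a hypothetical `threshold`, and the `ObjectDB` class, the `recognize` function, and the `unknown-N` category naming are all assumptions.

```python
from math import dist


class ObjectDB:
    """Toy object database: category name -> list of reference objects."""

    def __init__(self):
        self.categories = {}

    def add(self, category, feature, info):
        self.categories.setdefault(category, []).append(
            {"feature": feature, "info": info}
        )

    def best_match(self, feature):
        """Return (distance, category, reference) for the closest reference."""
        best = None
        for category, refs in self.categories.items():
            for ref in refs:
                d = dist(feature, ref["feature"])
                if best is None or d < best[0]:
                    best = (d, category, ref)
        return best


def recognize(db, feature, threshold=1.0):
    """Attach object info on a match, then store the candidate as a reference."""
    match = db.best_match(feature)
    if match is not None and match[0] <= threshold:
        _, category, ref = match
        info = dict(ref["info"])  # object info derived from the attached info
    else:
        # No reference is close enough: open a new category (hypothetical name).
        category = f"unknown-{len(db.categories) + 1}"
        info = {}
    db.add(category, feature, info)
    return category, info
```

In this sketch a matched candidate joins the category of the reference it matched, and an unmatched candidate seeds a new category, so the database grows with every recognition pass.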

Furthermore, to solve the above problem, a first metadata creation system of the present invention is characterized by comprising any one of the first to fifth object recognition systems described above, and by aggregating the video information of given moving image data, the frame information of the plurality of processing frames constituting that moving image data, and the object information of the candidate objects extracted and recognized from each of the processing frames, and creating metadata for the moving image data based on the aggregated result.
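One way to picture the aggregation performed by the metadata creation system is to turn per-frame recognition results into appearance time ranges for each object. The sketch below is an assumption about what such an aggregation could look like; the function name `build_metadata`, the dictionary keys, and the interval representation are hypothetical and are not taken from the text.

```python
def build_metadata(video_info, frames):
    """Aggregate per-frame object info into per-object appearance intervals.

    frames: list of dicts with "timestamp" (seconds) and "objects" (names),
    ordered by time. The field names are illustrative assumptions.
    """
    appearances = {}
    active = {}  # object name -> [start, last_seen]
    for frame in frames:
        t = frame["timestamp"]
        seen = set(frame["objects"])
        # Close the interval of every object that is absent from this frame.
        for name in list(active):
            if name not in seen:
                start, end = active.pop(name)
                appearances.setdefault(name, []).append((start, end))
        for name in seen:
            if name in active:
                active[name][1] = t
            else:
                active[name] = [t, t]
    # Close the intervals that are still open at the end of the video.
    for name, (start, end) in active.items():
        appearances.setdefault(name, []).append((start, end))
    return {"video": video_info, "frame_count": len(frames), "objects": appearances}
```

The resulting structure combines the three inputs named in the text (video information, frame information, and object information) and records when each recognized object is on screen.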

According to the present invention, objects displayed in moving image data can be extracted with high accuracy and recognized with high accuracy, and the information on the recognized objects can be made more convenient, thereby promoting the use and spread of moving images.

FIG. 1 is a block diagram showing an outline of an object recognition system and a metadata creation system according to an embodiment of the present invention.

FIG. 2 is a schematic diagram showing an example of moving image data processed by the object recognition system according to the embodiment of the present invention.

FIG. 3 is a flowchart showing the object recognition operation and the metadata creation operation of the object recognition system and the metadata creation system according to the embodiment of the present invention.

First, the overall configuration of an object recognition system 1 according to an embodiment of the present invention will be described with reference to FIG. 1. As shown in FIG. 1, the object recognition system 1 includes an object extraction device 2 that performs object extraction processing on images such as moving images and still images, an object database (DB) 3 that stores objects used in object recognition processing, and an object recognition device 4 that performs the object recognition processing.

Objects include not only those with shape, color, and shading features identifiable on a two-dimensional plane, such as living things (people, animals), stationary things (buildings, ornaments), and displayed items (characters, symbols, logos), but also those with shape, color, and shading features identifiable in a three-dimensional space. Hereinafter, an object that is the target of the object extraction processing and the object recognition processing, that is, an object extracted by the object extraction device 2 and recognized by the object recognition device 4, is referred to as a candidate object 25 (see FIG. 2). An object that the candidate object 25 is compared against during recognition processing, and that is stored in the object DB 3, is referred to as a reference object 3a.

In this embodiment, an example is described in which the object extraction device 2, the object DB 3, and the object recognition device 4 are connected so that they can communicate with one another via a predetermined network 5 such as the Internet or a LAN (Local Area Network). However, as long as they can exchange data with one another, they may be connected directly, or any two or more of them may be integrated into a single unit.

Also, although this embodiment describes an example with one each of the object extraction device 2, the object DB 3, and the object recognition device 4, a plurality of object extraction devices 2, object DBs 3, and object recognition devices 4 may be provided. When there are a plurality of object DBs 3, the reference objects 3a stored in each object DB 3 are managed collectively, so that specifying a keyword or category causes the reference objects 3a corresponding to that keyword or category to be searched for across all of the object DBs 3. One reference object 3a based on one image is stored in only one of the object DBs 3 and is not duplicated in two or more of them.
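The collective management of several object DBs described above can be sketched as a thin wrapper that enforces the no-duplication rule and searches across all DBs. The class name `ObjectDBGroup`, the record layout, and the rule of placing a new reference in the least-loaded DB are assumptions for illustration only; the text does not say how a DB is chosen.

```python
class ObjectDBGroup:
    """Several object DBs managed together; each reference lives in one DB only."""

    def __init__(self, db_count):
        self.dbs = [dict() for _ in range(db_count)]  # image_id -> reference record

    def store(self, image_id, category, keywords):
        # Refuse to store a second copy of a reference based on the same image.
        if any(image_id in db for db in self.dbs):
            return False
        target = min(self.dbs, key=len)  # hypothetical placement rule: least-loaded DB
        target[image_id] = {"category": category, "keywords": set(keywords)}
        return True

    def search(self, term):
        """Return image IDs whose category or keywords match, across all DBs."""
        return sorted(
            image_id
            for db in self.dbs
            for image_id, ref in db.items()
            if term == ref["category"] or term in ref["keywords"]
        )
```

A caller sees a single search interface regardless of how many DBs sit behind it, which mirrors the cross-DB keyword and category search in the text.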

First, the object extraction device 2 will be described. As described above, the object extraction device 2 is configured to extract a candidate object 25 displayed in an image such as a moving image or a still image. For example, the object extraction device 2 includes a control unit 10, a storage unit 11, and a communication unit 12. It also includes an image input unit 13, a frame acquisition unit 14, an image quality enhancement unit 15, a frame adjustment unit 16, a feature point extraction unit 17, a depth detection unit 18, a three-dimensional space estimation unit 19, and an object extraction unit 20. The image input unit 13, the frame acquisition unit 14, the image quality enhancement unit 15, the frame adjustment unit 16, the feature point extraction unit 17, the depth detection unit 18, the three-dimensional space estimation unit 19, and the object extraction unit 20 may be implemented as programs that are stored in the storage unit 11 and run under the control of the control unit 10.

The object extraction device 2 is also connected via the network 5 to an extraction processing database (DB) 8 that stores extraction processing data 8a used in the object extraction processing (for example, template images for the image quality enhancement processing described later, feature point distribution data for judging feature point distributions, color displacement-depth data for judging color distributions, and object extraction data for extracting objects). As long as the object extraction device 2 can exchange data with the extraction processing DB 8, the two may be connected directly or integrated into a single unit.

The control unit 10 has a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), and the like, and is configured to control the overall operation of the object extraction device 2. The storage unit 11 has memory such as ROM (Read Only Memory) and RAM (Random Access Memory) and a recording medium such as a hard disk, and is configured to store the information, data, programs, and the like handled by the control unit 10.

The communication unit 12 is an interface through which the object extraction device 2 connects to the network 5; that is, it connects the object extraction device 2 to the object DB 3 and the object recognition device 4 via the network 5.

The image input unit 13 inputs image data, such as moving image data 21 (see FIG. 2) or still image data, that is to undergo object extraction processing. For example, the image input unit 13 communicates via the communication unit 12 with an external video database (DB) 6 that stores a plurality of moving image data 21, or with another external computer, so that the moving image data 21 to be processed can be selected from the video DB 6 and input. Alternatively, the image input unit 13 may read the moving image data 21 from the storage unit 11, or may use a reading device (not shown) to read moving image data 21 stored on a storage medium such as a DVD (Digital Versatile Disc) or Blu-ray Disc (registered trademark), and input it as the moving image data 21 to be processed. In addition to video data and audio data, the moving image data 21 records preset video information such as the video title and a description of its content.

The image input unit 13 also extracts image data information from the input image data. For moving image data 21, the image data information includes, for example, the video ID, number of frames, frame size, and format of the moving image data 21, as well as video information such as its title, author information, creation date and time, video category, performer information, and thumbnail (URL). For still image data, it includes still image information such as the title, data size, and format of the still image. When the image data is obtained from a website, information about the image data contained in the website's text may also be used as image data information.

As shown in FIG. 2, when moving image data 21 is the target of object extraction processing, the frame acquisition unit 14 acquires the plurality of still image frames constituting the moving image data 21 according to its frame rate, and each of these still image frames becomes a processing frame 22 to undergo object extraction processing. When the image input unit 13 has input still image data, that still image data is used as-is as the processing frame 22. The processing frame 22 corresponds to a two-dimensional planar image such as one obtained by photographing a subject from one direction with a single imaging device.

The frame acquisition unit 14 also extracts frame information for each acquired processing frame 22. The frame information includes, for example, the relation ID and playback time (timestamp) of the processing frame 22 within the moving image data 21, and the frame number (unique ID) of the processing frame 22.
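The frame information listed above can be modeled as a small record derived from the frame rate. The sketch below is illustrative only: the `FrameInfo` class and `frame_infos` function are hypothetical names, the relation ID is assumed to be the video ID, and the timestamp is assumed to be the frame number divided by a constant frame rate.

```python
from dataclasses import dataclass


@dataclass
class FrameInfo:
    relation_id: str    # ties the frame back to its moving image data (assumed: the video ID)
    frame_number: int   # unique ID of the frame within the video
    timestamp: float    # playback time in seconds


def frame_infos(video_id, frame_count, frame_rate):
    """Derive frame information for every frame from a constant frame rate."""
    if frame_rate <= 0:
        raise ValueError("frame_rate must be positive")
    return [
        FrameInfo(relation_id=video_id, frame_number=n, timestamp=n / frame_rate)
        for n in range(frame_count)
    ]
```

Keeping the timestamp alongside the frame number lets later stages report when an object appears without re-reading the video.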

The image quality enhancement unit 15 performs image quality enhancement processing on the processing frame 22. In this embodiment in particular, it enhances the processing frame 22 so that more feature points 23 can be extracted from it.

For example, the image quality improvement unit 15 stores in advance, in the extraction processing DB 8, a plurality of template images for image quality improvement processing that have various color distributions (color displacements) for each minute region (for example, an a × a pixel range, where a is an odd number of 3 or more). High-quality data and low-quality data are prepared for each template image, and the low-quality data is provided at the resolution of each template image. The image quality improvement unit 15 also stores in advance, in the extraction processing DB 8, color displacement data for image quality improvement in association with each template image; this data is obtained by sampling, for each minute region, the difference (color displacement) between the high-quality data and the low-quality data of that template image. The image quality improvement unit 15 then performs a convolution operation on each minute region of the processing frame 22 using the various color displacement data that match the resolution of the processing frame 22. From the color displacement data corresponding to the color displacement of each minute region, it determines the color displacement data with the highest probability (the best fit) and fits (composites) it into the region, thereby improving the image quality of the processing frame 22. This convolution need not always use all of the color displacement data; it may use only the color displacement data that approximates the color data of each minute region of the processing frame 22.
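The template matching described above can be illustrated with a minimal sketch. It assumes a single-channel (luminance) frame, uses a sum of squared differences as a stand-in for the "highest probability" determination, and represents each entry of the color displacement data as a hypothetical pair `(low_quality_pattern, correction)`; none of these names or choices come from the specification.

```python
def improve_quality(frame, displacement_data, a=3):
    """Apply the best-fitting correction to every interior pixel.

    frame: 2D list of luminance values.
    displacement_data: list of (pattern, correction) pairs, where pattern is a
        flattened a*a list of displacements relative to the center pixel
        (hypothetical layout, assumed for this sketch).
    """
    h, w = len(frame), len(frame[0])
    r = a // 2
    out = [row[:] for row in frame]  # border pixels are left unchanged
    for y in range(r, h - r):
        for x in range(r, w - r):
            center = frame[y][x]
            # Displacement of each pixel in the minute region from its center.
            local = [frame[y + dy][x + dx] - center
                     for dy in range(-r, r + 1)
                     for dx in range(-r, r + 1)]
            # Best fit = smallest squared difference (assumed scoring rule).
            _, correction = min(
                displacement_data,
                key=lambda entry: sum((l - p) ** 2
                                      for l, p in zip(local, entry[0])),
            )
            out[y][x] = center + correction
    return out
```

Restricting `displacement_data` to entries that roughly resemble the region, as the text permits, would only shrink the list passed in; the selection logic stays the same.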

In addition, the image quality improvement unit 15 stores in the extraction processing DB 8, as training data for machine learning of the image quality improvement processing, a template image whose low-quality data is the original processing frame 22 (or a local region of it) and whose high-quality data is the processing frame 22 after image quality improvement (or the corresponding local region). Because the image quality improvement unit 15 uses the template images accumulated in the extraction processing DB 8 through machine learning, it can perform image quality improvement with higher accuracy each time the processing is performed.

Furthermore, when the moving image data 21 is the target of the object extraction processing, the image quality improvement unit 15 takes two or more common processing frames 22, that is, processing frames 22 of the moving image data 21 that are consecutive on the time axis and constitute the same scene, and improves the image quality of one common processing frame 22 on the basis of the other common processing frames 22. For example, the image quality improvement unit 15 uses the differences between the feature points 23 extracted from one common processing frame 22 and the feature points 23 extracted from another common processing frame 22 to improve the image quality of the former so that its number of feature points 23 increases. Feature points 23 that are contained in the other common processing frame 22 but not in the one common processing frame 22 are added to the one common processing frame 22, which increases its feature points 23.

The frame adjustment unit 16 performs various kinds of image processing on the processing frame 22 according to the properties of the processing frame 22 and of the moving image data 21.

For example, the frame adjustment unit 16 reduces mosquito noise and block noise in the processing frame 22. When the frame adjustment unit 16 detects mosquito noise in the processing frame 22, it reduces the noise by smoothing with the surrounding information. When the frame adjustment unit 16 detects block noise in the processing frame 22, it matches the block noise portion against the plurality of template images described above and reduces the block noise by improving the image quality with the training data of the best-matching template image. If no template image matches the block noise portion, the frame adjustment unit 16 reduces the block noise by applying unsharp masking, blurring, or similar processing to that portion.

In addition, when the processing frame 22 contains a high-contrast region, that region may lose much of its image detail, so the frame adjustment unit 16 applies HDR processing locally to the region. In the HDR processing, a plurality of local contrast data sets are prepared, and the best-fitting contrast data is composited onto the high-contrast region to generate an image with a high-quality tone balance.

In addition, when the moving image data 21 is the target of the object extraction processing and its frame rate is low, the frame adjustment unit 16 performs frame interpolation processing. In the frame interpolation processing, if the processing frame 22 itself is blurred because of the low frame rate, the blur is first removed by sharpening or the like. An interpolation frame is then generated as an intermediate image at a given time between two consecutive processing frames 22 and inserted between them. For example, when the two consecutive processing frames 22 are common processing frames 22 of the same scene and only a common candidate object 25 is moving, the feature points 23 of that candidate object 25 and their depths at the given time between the two processing frames 22 are estimated and calculated from the depths and motion vectors of the feature points 23 between the two processing frames 22. The candidate object 25 having the calculated feature points 23 and depths is then composited onto a processing frame 22 similar to the two processing frames 22, which yields the interpolation frame for the given time. Such frame interpolation processing may be performed after the feature point extraction processing by the feature point extraction unit 17 and the depth detection processing by the depth detection unit 18.
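As a minimal sketch of the estimation step, the following assumes that each feature point is tracked by a hypothetical identifier across the two frames and that its position and depth change linearly between them. Linear interpolation is an assumption made here for illustration; the specification only states that the values are estimated from the depths and motion vectors.

```python
def interpolate_feature_points(points_a, points_b, t=0.5):
    """Estimate feature points at fraction t between two frames.

    points_a, points_b: dicts mapping a feature point id (hypothetical) to
        an (x, y, depth) tuple in the earlier and later frame respectively.
    t: 0.0 gives the earlier frame, 1.0 the later one.
    """
    estimated = {}
    for point_id, (xa, ya, da) in points_a.items():
        if point_id not in points_b:
            continue  # the point is not tracked in both frames
        xb, yb, db = points_b[point_id]
        estimated[point_id] = (
            xa + t * (xb - xa),   # position moves along the motion vector
            ya + t * (yb - ya),
            da + t * (db - da),   # depth is interpolated the same way
        )
    return estimated
```

The estimated points would then be composited onto a background frame to form the interpolation frame; that compositing step is outside this sketch.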

In addition, when the frame rate of the moving image data 21 is low, the frame adjustment unit 16 performs afterimage reduction processing to reduce the afterimages that the low frame rate causes in the processing frame 22.

In addition, when the frame rate of the moving image data 21 is high, the frame adjustment unit 16 performs thinning processing to reduce the number of processing frames 22 in a given period and thereby lighten the load and time of the subsequent image processing. In the thinning processing, it is preferable to delete processing frames 22 with little influence, such as those among two or more consecutive processing frames 22 in which the motion vectors of the feature points 23 (the motion of the candidate object 25) are small, and to keep processing frames 22 with large influence, such as those immediately before and after a scene change.
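The thinning preference can be sketched as a simple selection rule. The per-frame motion values, the scene-change flags, and the threshold are all assumed inputs introduced for this illustration; how they are measured is not specified here.

```python
def thin_frames(motion, scene_change, threshold):
    """Return the indices of the frames to keep.

    motion[i]: mean motion-vector magnitude of the feature points between
        frame i-1 and frame i (assumed to be precomputed).
    scene_change[i]: True if frame i is the first frame of a new scene.
    threshold: frames with motion below this value are candidates for removal.
    """
    kept = []
    n = len(motion)
    for i in range(n):
        # Frames immediately before and after a scene change are kept.
        at_boundary = scene_change[i] or (i + 1 < n and scene_change[i + 1])
        if i == 0 or at_boundary or motion[i] >= threshold:
            kept.append(i)
    return kept
```

With `motion = [0, 0.1, 3.0, 0.2, 0.1, 0.1]`, a scene change at index 4, and a threshold of 1.0, frames 1 and 5 are dropped while the high-motion frame 2 and the boundary frames 3 and 4 are retained.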

In addition, when moving image data 21 compressed in a given format is input to the image input unit 13, the frame adjustment unit 16 evaluates the robustness of that format's compression algorithm. When the moving image data 21 is decoded, the frame adjustment unit 16 compensates for the information about the processing frame 22 that was lost when the moving image data 21 was encoded into that format, and thereby reproduces the original processing frame 22.

The feature point extraction unit 17 performs feature point extraction processing on the processing frame 22 to extract a plurality of feature points 23 from it and to detect the first feature amount of each feature point 23 on the two-dimensional image. When the moving image data 21 is the target of the object extraction processing, the feature point extraction unit 17 performs the feature point extraction processing on the two-dimensional image of each of the processing frames 22 constituting the moving image data 21. The first feature amount of each feature point 23 includes, for example, two-dimensional coordinates, luminance and color variables (RGB), and a luminance gradient vector (the luminance gradient relative to the surrounding image or the entire image).

For example, as preprocessing for the feature point extraction processing, the feature point extraction unit 17 applies sharpening to the processing frame 22: it calculates the amount of luminance displacement between pixels and generates an edge-enhanced frame in which an edge is emphasized more strongly the larger the acceleration derived from that displacement. Then, as the feature point extraction processing, the feature point extraction unit 17 extracts a plurality of feature points 23 on the basis of the edges emphasized in the edge-enhanced frame and calculates the first feature amount of each feature point 23.
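One way to read "acceleration derived from the luminance displacement" is as a second difference of luminance. The sketch below takes that reading and uses a 4-neighbor discrete Laplacian for sharpening; this choice, the `gain` parameter, and the thresholding rule for picking feature points are assumptions for illustration, not the method fixed by the specification.

```python
def edge_enhance(lum, gain=1.0):
    """Sharpen a 2D luminance grid using a discrete Laplacian.

    The Laplacian is the second difference of luminance, so pixels where the
    luminance changes fastest are emphasized most. Border pixels are copied.
    """
    h, w = len(lum), len(lum[0])
    out = [row[:] for row in lum]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            laplacian = (lum[y - 1][x] + lum[y + 1][x]
                         + lum[y][x - 1] + lum[y][x + 1]
                         - 4 * lum[y][x])
            out[y][x] = lum[y][x] - gain * laplacian
    return out


def pick_feature_points(original, enhanced, threshold):
    """Return (x, y) of pixels whose emphasis is at least the threshold."""
    return [(x, y)
            for y, row in enumerate(enhanced)
            for x, value in enumerate(row)
            if abs(value - original[y][x]) >= threshold]
```

For a vertical step edge, the pixels on both sides of the step receive the largest change and are the ones selected as feature point candidates.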

The depth detection unit 18 performs depth detection processing on the processing frame 22 from which the feature point extraction unit 17 has extracted the feature points 23, and detects, for each feature point 23 of the processing frame 22, its depth relative to the surrounding feature points 23.

For example, the depth detection unit 18 first performs a convolution operation on each local region of the processing frame 22 using various feature point distribution data, and thereby determines the distribution of the number (existence probability) of feature points 23 per local region in the processing frame 22. The feature point distribution data need not identify an object, but it is created so as to represent the distribution of the feature points 23 of an object. The depth detection unit 18 then detects, from the distribution of the feature points 23 in the processing frame 22, the distributions that are judged with higher probability to correspond to some feature point distribution data. Performing this convolution operation in the two-dimensional directions also makes it possible to determine the distribution of the feature points 23 in the actual three-dimensional space within the image of the processing frame 22 (the real three-dimensional space 26; see FIG. 2).

For example, the feature point distribution data is created so as to have the distribution of the feature points 23 of the reference objects 3a stored in the object DB 3, and feature point distribution data for objects of various categories and sizes is stored in advance in the extraction processing DB 8. The feature point distribution data may be created as training data for machine learning of the feature point distribution determination when a reference object 3a recognized with high accuracy by the object recognition device 4 is stored in the object DB 3. The feature point distribution data may also be created from those feature point distributions, determined from the processing frame 22 by the feature point distribution determination, that were determined with high accuracy. Because the depth detection unit 18 uses the feature point distribution data accumulated in the extraction processing DB 8 through machine learning, it can perform the feature point distribution determination with higher accuracy each time the processing is performed.

In addition, the depth detection unit 18 performs a convolution operation on each minute region of the processing frame 22 using color displacement-depth data, which indicates the correspondence between the color displacement of minute regions of various sizes (for example, an a × a pixel range, where a is an integer of 3 or more) and the depth corresponding to that color displacement, and thereby determines the color distribution of the pixels in the processing frame 22. For example, the color displacement in the color displacement-depth data is the displacement of the color data (for example, RGB) of the surrounding pixels as seen from the center pixel of the minute region, and the depth is the relative depth of the surrounding pixels as seen from the center pixel of the minute region. For the distribution of the feature points 23 of the processing frame 22 obtained from the feature point distribution determination described above, the depth detection unit 18 uses the color displacement-depth data of various objects having the same category and size, and detects the data that is judged with higher probability to fit the color displacement of each minute region. In this way, the depth detection unit 18 detects, for each feature point 23, its depth relative to the surrounding feature points 23.

For example, the color displacement-depth data is created so as to hold, for each minute region of the reference objects 3a stored in the object DB 3, the correspondence between a color displacement and the depth corresponding to that color displacement, and color displacement-depth data for objects of various categories and sizes is stored in advance in the extraction processing DB 8. The color displacement-depth data may be created as training data for machine learning of the color distribution determination when a reference object 3a recognized with high accuracy by the object recognition device 4 is stored in the object DB 3. The color displacement-depth data may also be created from those color distributions, determined from the processing frame 22 by the color distribution determination, that were determined with high accuracy. Because the depth detection unit 18 uses the color displacement-depth data accumulated in the extraction processing DB 8 through machine learning, it can perform the color distribution determination with higher accuracy each time the processing is performed.

Furthermore, the depth detection unit 18 calculates a direction vector for each feature point 23 of the processing frame 22 on the basis of the result of the color distribution determination described above. For example, the direction vector of a given feature point 23 holds the coordinates of that feature point 23 and the relative depth, as seen from that feature point 23, of the surrounding pixels (the pixels in the minute region centered on the feature point 23). In other words, the direction vector of a given feature point 23 indicates the direction of the luminance gradient and color displacement (color gradient) relative to the surrounding pixels in the actual three-dimensional space within the image of the processing frame 22 (the real three-dimensional space 26; see FIG. 2).

In addition, on the basis of the direction vectors of the feature points 23, the depth detection unit 18 creates curves passing between the feature points 23 in the processing frame 22, using a method suited to the feature point distribution of the region in which each feature point 23 lies. For example, the depth detection unit 18 generates a spline curve, a Bezier curve, or the like that passes through each feature point 23 and other feature points 23 (surrounding feature points) as control points. When the distribution of the feature points 23 is ill-suited to generating a spline curve, a Bezier curve, or the like, for example when the feature points 23 in a given region are so numerous or so dense that Runge's phenomenon occurs, the depth detection unit 18 generates the curve between the feature points 23 by approximation, using a regression curve or the like. Curves generated in these ways follow the direction of the luminance gradient and color displacement (color gradient) between the feature points 23 in the actual three-dimensional space within the image of the processing frame 22 (the real three-dimensional space 26; see FIG. 2).
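The two curve-generation options can be sketched as follows. The interpolating case uses a uniform Catmull-Rom spline and the fallback uses an ordinary least-squares line; both are stand-ins chosen for this illustration, since the specification names only "a spline curve, a Bezier curve, or the like" and "a regression curve or the like". Each feature point is assumed to be an `(x, y, depth)` tuple.

```python
def catmull_rom(p0, p1, p2, p3, t):
    """Point on the uniform Catmull-Rom segment between p1 and p2.

    t = 0 returns p1 and t = 1 returns p2, so the curve passes through the
    feature points themselves. Points are (x, y, depth) tuples.
    """
    return tuple(
        0.5 * (2 * b
               + (-a + c) * t
               + (2 * a - 5 * b + 4 * c - d) * t * t
               + (-a + 3 * b - 3 * c + d) * t ** 3)
        for a, b, c, d in zip(p0, p1, p2, p3)
    )


def fit_line(samples):
    """Least-squares line depth = m * s + c through (s, depth) samples.

    Used here as a simple regression fallback for overly dense point sets.
    """
    n = len(samples)
    sx = sum(s for s, _ in samples)
    sy = sum(d for _, d in samples)
    sxx = sum(s * s for s, _ in samples)
    sxy = sum(s * d for s, d in samples)
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("samples need at least two distinct positions")
    m = (n * sxy - sx * sy) / denom
    return m, (sy - m * sx) / n
```

A caller would use `catmull_rom` while the number of control points in a region stays small and switch to `fit_line` (or a higher-order regression) once the region becomes too dense for stable interpolation.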

The depth detection unit 18 then calculates the depth of a given feature point 23 relative to the surrounding feature points 23 on the basis of the curve generated for each surrounding feature point 23. In this way, the depth detection unit 18 detects the relative positional relationship, in the real three-dimensional space 26, between each feature point 23 and its surrounding feature points 23. After the depth (positional relationship) has been calculated in this way for each feature point 23 of the processing frame 22, the depth (positional relationship) of a given feature point 23 may be adjusted as appropriate on the basis of the depths (positional relationships) calculated for the surrounding feature points 23.

The three-dimensional space estimation unit 19 estimates the actual three-dimensional space within the image of the processing frame 22 (the real three-dimensional space 26) on the basis of the first feature amount and the depth (the depth relative to the surrounding feature points 23) of each of the feature points 23 of the processing frame 22. For example, the three-dimensional space estimation unit 19 compares the first feature amounts and depths of the feature points 23 of the processing frame 22 with one another, and thereby estimates and calculates a real three-dimensional space 26 with which the depths of the feature points 23 are consistent. The three-dimensional space estimation unit 19 also detects, for the feature points 23 of the processing frame 22, the second feature amount in the real three-dimensional space 26. The second feature amount of each feature point 23 includes, for example, three-dimensional coordinates in the real three-dimensional space 26.

The object extraction unit 20 performs object extraction processing on the basis of the second feature amount and the color distribution of each of the feature points 23 of the processing frame 22. Through the object extraction processing, the object extraction unit 20 detects feature point groups 24, each a set of two or more feature points 23, according to how the feature points 23 are distributed in the processing frame 22. For example, the object extraction unit 20 detects one or more feature point groups 24 on the basis of the distribution of the feature points 23 when the processing frame 22 is spatially partitioned by a quadtree. Each feature point group 24 has feature amounts (coordinates and the like) in the real three-dimensional space 26, and the object extraction unit 20 extracts each feature point group 24 detected in this way as a candidate object 25 of the processing frame 22 from which it was extracted.
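The quadtree-based grouping can be illustrated with a short recursive sketch. The stopping rule (a cell is accepted once it holds at most `capacity` points or reaches `min_size`) and the decision to treat each accepted cell with two or more points as one feature point group are assumptions made here; the specification only says that the groups are detected from the distribution under quadtree partitioning.

```python
def quadtree_groups(points, x0, y0, size, capacity=4, min_size=1):
    """Group feature points by recursive quadtree subdivision.

    points: list of tuples whose first two items are x and y.
    (x0, y0, size): the square cell being examined.
    Returns a list of groups; each group has two or more points.
    """
    inside = [p for p in points
              if x0 <= p[0] < x0 + size and y0 <= p[1] < y0 + size]
    if len(inside) < 2:
        return []          # a group needs at least two feature points
    if len(inside) <= capacity or size <= min_size:
        return [inside]    # the cell is sparse or small enough: one group
    half = size / 2
    groups = []
    for qx, qy in ((x0, y0), (x0 + half, y0),
                   (x0, y0 + half), (x0 + half, y0 + half)):
        groups.extend(
            quadtree_groups(inside, qx, qy, half, capacity, min_size))
    return groups
```

A fuller version would also merge neighboring cells and weigh the second feature amount and color distribution; this sketch covers only the spatial partitioning.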

The object extraction unit 20 may also detect feature point groups 24 corresponding to object extraction data by performing a convolution operation on each local region of the processing frame 22 using various object extraction data. For example, the object extraction data is created so as to represent the feature point groups 24 of the reference objects 3a stored in the object DB 3, and object extraction data for objects of various categories and sizes is stored in advance in the extraction processing DB 8. As object extraction data, the reference objects 3a recognized with high accuracy by the object recognition device 4 are also used as training data for machine learning of the object extraction processing. The object extraction data may also be created from those candidate objects 25, extracted from the processing frame 22 by the object extraction processing, that were determined with high accuracy. Because the object extraction unit 20 uses the reference objects 3a accumulated in the object DB 3 through machine learning, it can perform the object extraction processing with higher accuracy each time the processing is performed.

Furthermore, the object extraction unit 20 stores each extracted candidate object 25 in the storage unit 11 in association with the processing frame 22 from which it was extracted and, when the processing frame 22 was obtained from the moving image data 21, also associates the extracted candidate object 25 with the moving image data 21. As information on the corresponding feature point group 24, the candidate object 25 contains the first feature amount, the depth, and the second feature amount of each feature point 23 constituting the feature point group 24. The object extraction unit 20 also attaches the source processing frame 22 to the candidate object 25.

When the moving image data 21 is the target of the object extraction processing and the object extraction unit 20 performs the object extraction processing on each of two or more common processing frames 22, that is, processing frames 22 of the moving image data 21 that are consecutive on the time axis and constitute the same scene, it treats a feature point group 24 detected in common across the common processing frames 22 as a candidate object 25 common to that scene. If the candidate object 25 moves between the common processing frames 22 of the scene, the candidate object 25 common to the scene also contains the amount of movement (the amount of temporal displacement within the scene) of the feature point group 24 (its feature amounts in the real three-dimensional space 26).

The object extraction device 2 then outputs the candidate objects 25 extracted as described above, together with the image data information and the frame information, to the object recognition device 4 for object recognition processing.

When the moving image data 21 is the target of the object extraction processing, the object extraction device 2 may perform an approximation determination on the processing frames 22 adjusted by the frame adjustment unit 16 and exclude processing frames 22 that approximate one another from the processing by the feature point extraction unit 17, the depth detection unit 18, the three-dimensional space estimation unit 19, and the object extraction unit 20. For example, of two consecutive processing frames 22 that approximate each other, the preceding processing frame 22 is processed and the subsequent processing frame 22 is excluded. If the subsequent processing frame 22 was excluded in the previous approximation determination, the processing frame 22 compared with the next subsequent processing frame 22 in the current approximation determination is the one that was kept as a processing target in the previous determination.
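The rule that each frame is compared against the last frame kept, rather than against its immediate predecessor, can be sketched as follows. The `is_similar` predicate is a hypothetical placeholder for the approximation determination, whose criteria are not given here.

```python
def select_processing_targets(frames, is_similar):
    """Return the indices of frames to pass on to the later stages.

    frames: sequence of frames in time order.
    is_similar: callable(frame_a, frame_b) -> bool, a stand-in for the
        approximation determination (assumed for this sketch).
    """
    kept = []
    for index, frame in enumerate(frames):
        # Compare with the most recent frame that was kept, so that a slow
        # drift across many frames is eventually detected.
        if kept and is_similar(frames[kept[-1]], frame):
            continue
        kept.append(index)
    return kept
```

For example, with frames represented by the numbers `[0, 1, 2, 10, 11, 30]` and `is_similar = lambda a, b: abs(a - b) <= 1`, the result is `[0, 2, 3, 5]`: frame 2 is kept because it is compared with frame 0, not with the excluded frame 1.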

Next, the object DB 3 will be described. As described above, the object DB 3 stores a plurality of reference objects 3a used in the object recognition processing. When the object recognition device 4 performs the object recognition processing and the object DB 3 receives a reference object output instruction from the object recognition device 4, the object DB 3 outputs the stored reference objects 3a to the object recognition device 4 one after another.

Like the candidate objects 25 extracted by the object extraction device 2, each reference object 3a is extracted from a two-dimensional planar image (hereinafter referred to as the original image) as a feature point group of feature points having feature amounts in the real three-dimensional space, and may contain the first feature amount, the depth, and the second feature amount of each feature point constituting the feature point group. The original image is attached to the reference object 3a, as is attached information related to the reference object 3a. The attached information includes, for example, identification information that identifies the reference object 3a, original image information attached to the original image, video information attached to the video from which the original image was obtained, and information contained in the text of the website from which the original image or video was obtained.

Furthermore, the object DB 3 stores the reference objects 3a classified into a plurality of categories on the basis of their attached information. The categories may be divided into multiple levels, such as broad categories (for example, persons or things) and narrow categories (for example, a specific person or a specific thing). Two or more reference objects 3a that share attached information are classified and stored in a common category that uses the shared attached information as its classification information. When the object recognition device 4 performs the object recognition processing and the object DB 3 receives a reference object output instruction that specifies a category, the object DB 3 can also output to the object recognition device 4 the reference objects 3a stored in the specified category.

The reference objects 3a stored in the object DB 3 can be created by the object extraction device 2 and the object recognition device 4, and may also be created by other means as long as they have the structure described above. For example, the object DB 3 can receive a candidate object 25 that has undergone object recognition processing by the object recognition device 4 as training data for machine learning of the object recognition processing, and store it as a reference object 3a. Because the object recognition device 4 uses the reference objects 3a accumulated in the object DB 3 through machine learning, it can perform the object recognition processing with higher accuracy each time the processing is performed.

In this case, a candidate object 25 that the object recognition device 4 has determined to correspond to a given reference object 3a has the object information described later added to it based on that reference object 3a, and is stored in the object DB 3 as a new reference object 3a in the category into which the given reference object 3a is classified. On the other hand, a candidate object 25 that the object recognition device 4 has determined not to correspond to any reference object 3a is stored in the object DB 3 as a reference object 3a of a new category into which this candidate object 25 is classified.

Next, the object recognition device 4 will be described. The object recognition device 4 is configured to perform the object recognition processing described above, and determines which of the plurality of reference objects 3a stored in the object DB 3 the candidate object 25 extracted by the object extraction device 2 corresponds to.

For example, upon receiving a candidate object 25 from the object extraction device 2, the object recognition device 4 instructs the object DB 3 to output reference objects. Then, upon receiving a reference object 3a from the object DB 3, the object recognition device 4 determines whether the candidate object 25 corresponds to that reference object 3a. For example, the object recognition device 4 compares the feature point group 24 of the candidate object 25 (feature amounts in the real three-dimensional space) and the color distribution in the processing frame 22 from which it was extracted with the feature point group of the reference object 3a (feature amounts in the real three-dimensional space) and the color distribution in its original image, and calculates the similarity of the reference object 3a to the candidate object 25. Because the processing frame 22 is attached to the candidate object 25 when the candidate object 25 is extracted, the color distribution is also available whenever the candidate object 25 is used. Note that because this comparison is made between feature amounts in the real three-dimensional space, there is no need to match the orientation or size of one object to the other.
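The specification does not state how the similarity is computed. As one possible illustration, the sketch below combines a shape term that is unaffected by rotation, translation, and overall scale (sorted pairwise distances between 3D feature points, normalized by the largest distance) with a color term (histogram intersection). The function names, the equal weighting, and the choice of both measures are assumptions, not the method of the specification.

```python
import math
from itertools import combinations


def shape_signature(points):
    """Sorted pairwise distances, scaled so that the largest one is 1.

    Pairwise distances do not change under rotation or translation, and the
    scaling removes overall size, so no alignment step is needed. Assumes at
    least two distinct points.
    """
    dists = sorted(math.dist(p, q) for p, q in combinations(points, 2))
    longest = dists[-1]
    return [d / longest for d in dists]


def shape_similarity(points_a, points_b):
    sig_a, sig_b = shape_signature(points_a), shape_signature(points_b)
    if len(sig_a) != len(sig_b):
        return 0.0  # simplification: only groups of equal size are compared
    mean_err = sum(abs(a - b) for a, b in zip(sig_a, sig_b)) / len(sig_a)
    return 1.0 - mean_err


def color_similarity(hist_a, hist_b):
    """Histogram intersection of two normalized color histograms."""
    return sum(min(a, b) for a, b in zip(hist_a, hist_b))


def similarity(cand_pts, cand_hist, ref_pts, ref_hist, w_shape=0.5):
    # Weighted sum of the shape term and the color term (weights assumed).
    return (w_shape * shape_similarity(cand_pts, ref_pts)
            + (1.0 - w_shape) * color_similarity(cand_hist, ref_hist))


# The reference is the candidate rotated 90 degrees, doubled in size, and moved.
candidate = [(0, 0, 0), (1, 0, 0), (0, 2, 0)]
reference = [(5, 0, 1), (5, 2, 1), (1, 0, 1)]
hist = [0.5, 0.25, 0.25]
score = similarity(candidate, hist, reference, hist)
print(score)  # very close to 1.0 despite the different pose and size
```

A production system would need a descriptor that tolerates missing or extra feature points; the equal-size restriction here only keeps the sketch short.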

The object recognition device 4 then calculates the similarity to the candidate object 25 for each of a plurality of reference objects 3a, and determines that a reference object 3a with a higher similarity, for example one whose similarity is at or above a predetermined similarity threshold, corresponds to the candidate object 25. At this time, the object recognition device 4 acquires information related to the candidate object 25 from the attached information of the highly similar reference object 3a, generates object information, and adds it to the candidate object 25. The object information may be created from the attached information of a single reference object 3a, or from the attached information of two or more highly similar reference objects 3a. The classification information of the reference object 3a in the object DB 3 may also be used to create the object information.

On the other hand, when the similarity of every reference object 3a to the candidate object 25 is below the predetermined similarity threshold, the object recognition device 4 determines that the candidate object 25 does not correspond to any reference object 3a.
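The threshold decision can be sketched as follows. The function name `recognize`, the dictionary of precomputed similarities, and the example values are hypothetical; only the rule itself (at or above the threshold counts as a match, everything below the threshold means no match) comes from the description.

```python
def recognize(similarities, threshold):
    """Return the names of reference objects whose similarity meets the
    threshold, best match first. An empty list means the candidate object
    corresponds to no reference object.
    """
    matches = [(score, name) for name, score in similarities.items()
               if score >= threshold]
    matches.sort(reverse=True)
    return [name for _, name in matches]


scores = {"ref-actor-A": 0.91, "ref-actor-B": 0.62, "ref-car": 0.15}
print(recognize(scores, threshold=0.8))   # ['ref-actor-A']
print(recognize(scores, threshold=0.6))   # ['ref-actor-A', 'ref-actor-B']
print(recognize(scores, threshold=0.95))  # []
```

Returning every match rather than only the best one leaves room for building object information from two or more highly similar reference objects.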

Whichever determination is made, the object recognition device 4 outputs the candidate object 25 to the object DB 3 as learning data for machine learning of the object recognition processing, and has it stored as a reference object 3a.

The object recognition system 1 also includes an image collection device 7 for increasing the number of reference objects 3a in the object DB 3. The image collection device 7 includes an image collection crawler that searches for and collects, via the network 5, moving images and still images containing reference objects 3a. When it executes the image collection crawler function, the image collection device 7 sequentially collects image data, such as moving image data 21 (see FIG. 2) and still image data, stored in the external moving image DB 6 and other terminals connected to the image collection device 7 via the network 5. The image collection device 7 may collect all image data published on the network 5, or it may search for and collect image data based on a category (industry) or keyword selected by an operator.

The image collection device 7 is connected to the object extraction device 2 and outputs the collected image data to the object extraction device 2. In the object extraction device 2, candidate objects 25 are extracted from the image data as described above and output to the object recognition device 4. In the object recognition device 4, object recognition processing is performed on the candidate objects 25 using the reference objects 3a in the object DB 3 as described above, and each candidate object 25 that has undergone object recognition processing becomes learning data and is stored in the object DB 3 as a reference object 3a. Using the image collection device 7 in this way increases the number of reference objects 3a stored in the object DB 3, which can improve the accuracy of the object recognition processing performed by the object recognition device 4.

Note that the image collection device 7 may be provided independently of the object extraction device 2, the object DB 3, and the object recognition device 4, or may be configured integrally with any of them.

The object recognition system 1 described above is applied to a metadata creation system 30 that performs metadata creation processing for moving image data 21 (see FIG. 2). The metadata creation system 30 includes a metadata creation device 31 that creates metadata 32a and a metadata database (DB) 32 that stores the created metadata 32a. The present embodiment describes an example with one metadata creation device 31 and one metadata DB 32, but a plurality of metadata creation devices 31 and a plurality of metadata DBs 32 may be provided. A metadata DB 32 is not limited to use by a single metadata creation device 31 and may be made available to a plurality of metadata creation devices 31.

The metadata creation device 31 is connected to the metadata DB 32 via the network 5 so that the two can communicate with each other, and is likewise connected to the object extraction device 2 and the object recognition device 4 of the object recognition system 1. As long as it can exchange data with the metadata DB 32, the metadata creation device 31 may instead be connected to it directly or configured integrally with it.

The metadata DB 32 stores a plurality of metadata 32a such that the metadata 32a corresponding to given moving image data 21 can be retrieved by using the title or moving image ID of that moving image data 21 as a search keyword. The metadata DB 32 preferably stores the metadata 32a so that metadata 32a for recently created moving image data 21, frequently searched moving image data 21, recommended moving image data 21, and the like is retrieved preferentially.

Where a plurality of metadata DBs 32 are provided, they collectively manage the metadata 32a stored in each metadata DB 32, and when the title or moving image ID of moving image data 21 is specified, the metadata 32a is searched for across the plurality of metadata DBs 32. The plurality of metadata DBs 32 store one item of metadata 32a based on one item of moving image data 21 in exactly one of the metadata DBs 32, without duplicating it in two or more of them. The plurality of metadata DBs 32 may also be provided per category of moving image data 21.
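A minimal sketch of this arrangement, assuming in-memory dictionaries in place of real databases: the class name `MetadataStore`, its methods, and the per-category DB names are hypothetical. It shows the two properties stated above, namely a search that spans every DB and a single stored copy per item of moving image data.

```python
class MetadataStore:
    """Toy federation of several metadata DBs 32 managed as one unit."""

    def __init__(self, db_names):
        self._dbs = {name: {} for name in db_names}

    def store(self, db_name, video_id, metadata):
        # Keep one copy only: refuse to store a video's metadata twice.
        if self.find(video_id) is not None:
            raise ValueError(f"metadata for {video_id!r} is already stored")
        self._dbs[db_name][video_id] = metadata

    def find(self, video_id):
        # Search across every DB; at most one of them holds the entry.
        for db in self._dbs.values():
            if video_id in db:
                return db[video_id]
        return None


store = MetadataStore(["sports", "news"])  # one DB per category (assumed)
store.store("sports", "vid-001", {"title": "Match"})
print(store.find("vid-001"))  # {'title': 'Match'}
print(store.find("vid-999"))  # None
```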

When the metadata creation device 31 receives the moving image information of predetermined moving image data 21, the frame information of each of the plurality of processing frames 22 constituting that moving image data 21, and the object information of the candidate objects 25 extracted and recognized from each processing frame 22, it aggregates this information to create the metadata 32a of the predetermined moving image data 21. The metadata creation device 31 then stores the metadata 32a created for the predetermined moving image data 21 in the metadata DB 32.

For example, the metadata 32a describes moving image information such as the title of the moving image data 21 and the names of its performers, and further describes frame information, such as the playback time of each processing frame 22, in the playback order of the plurality of processing frames 22 constituting the moving image data 21. Accompanying the frame information of each processing frame 22, the metadata 32a also describes the object information of the candidate objects 25 extracted from that processing frame 22. In other words, the metadata 32a presents frame information and object information on a timeline.

For two or more common processing frames 22 belonging to the same scene, the metadata 32a describes scene information such as the time span of that scene and, accompanying the scene information, the object information of the candidate objects 25 common to the scene. Such scene information is also presented on the timeline in the metadata 32a. For each processing frame 22 other than the first and last processing frames 22 of the same scene, the description of frame information and object information may be omitted.
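To make the timeline structure concrete, here is one hypothetical layout for the metadata 32a. The specification does not define a serialization format; the key names (`video`, `timeline`, `start`, `end`, `objects`) and the function `build_metadata` are assumptions made for this sketch.

```python
import json


def build_metadata(video_info, frames, scenes):
    """Assemble metadata as one timeline sorted by playback time.

    Each frame entry needs a "start" time; each scene entry needs "start"
    and "end". Object information rides along in "objects".
    """
    timeline = [{"type": "frame", **f} for f in frames]
    timeline += [{"type": "scene", **s} for s in scenes]
    timeline.sort(key=lambda entry: entry["start"])  # stable sort
    return {"video": video_info, "timeline": timeline}


metadata = build_metadata(
    {"title": "Sample", "performers": ["A"]},
    frames=[
        {"start": 12.0, "objects": ["car"]},
        {"start": 0.0, "objects": ["person A"]},
    ],
    scenes=[{"start": 0.0, "end": 12.0, "objects": ["person A"]}],
)
print(json.dumps(metadata, indent=2))
```

Frames inside a scene, other than its first and last, could simply be left out of `frames`, matching the omission the description allows.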

Next, the metadata creation operation for predetermined moving image data 21 in the object recognition system 1 and the metadata creation system 30 configured as described above will be described with reference to the flowchart of FIG. 3.

In the metadata creation system 30, when metadata creation processing is performed for predetermined moving image data 21, this moving image data 21 is input to the object extraction device 2 of the object recognition system 1 (step S1), and the image input unit 13 extracts the image data information of the moving image data 21, that is, the moving image information.

In the object extraction device 2, the frame acquisition unit 14 acquires the plurality of processing frames 22 constituting the moving image data 21 (step S2) and extracts the frame information of each processing frame 22. Each processing frame 22 is then enhanced in image quality by the image quality enhancement unit 15 and adjusted by the frame adjustment unit 16 so that feature point extraction processing and depth detection processing can be performed optimally (step S3).

The feature point extraction unit 17 then extracts a plurality of feature points 23 from each processing frame 22 and detects the first feature amount of each feature point 23 on the two-dimensional image (step S4), and the depth detection unit 18 detects the depth of each feature point 23 relative to the surrounding feature points 23 (step S5). The three-dimensional space estimation unit 19 estimates the real three-dimensional space 26 of each processing frame 22 based on the first feature amounts and depths of the plurality of feature points 23 of that processing frame 22, and detects the second feature amount of each feature point 23 in the real three-dimensional space 26 (step S6).

Next, the object extraction unit 20 extracts a feature point group 24, that is, a candidate object 25, based on the second feature amounts and color distribution of the plurality of feature points 23 of each processing frame 22 (step S7), and the corresponding processing frame 22 is attached to the candidate object 25.

The object extraction device 2 then outputs the candidate objects 25 extracted from each processing frame 22 to the object recognition device 4, together with the moving image information of the predetermined moving image data 21 and the frame information of each of the plurality of processing frames 22 constituting that moving image data 21.

In the object recognition device 4, object recognition processing is performed as described above on the candidate object 25 received from the object extraction device 2 (step S8), and it is determined whether this candidate object 25 corresponds to a reference object 3a stored in the object DB 3.

When the candidate object 25 is determined to correspond to one reference object 3a, object information is generated based on the attached information of that reference object 3a and added to the candidate object 25 (step S9). On the other hand, when the candidate object 25 is determined not to correspond to any reference object 3a, object information generated based on the moving image information of the predetermined moving image data 21, the frame information of the processing frame 22 corresponding to this candidate object 25, and the like is added to the candidate object 25.

The candidate object 25 that has undergone object recognition processing is then stored in the object DB 3 as a reference object 3a in order to generate learning data (step S10).

Furthermore, the object recognition device 4 outputs to the metadata creation device 31 the moving image information of the predetermined moving image data 21, the frame information of each of the plurality of processing frames 22 constituting the moving image data 21, and the object information of the candidate objects 25 of each processing frame 22.

In the metadata creation device 31, the moving image information, frame information, and object information received from the object recognition device 4 are aggregated, and the metadata 32a of the predetermined moving image data 21 is created based on the aggregated result (step S11). This metadata 32a is stored in the metadata DB 32 (step S12).

The metadata creation system 30 described above is applied to a metadata distribution system 40 that performs metadata distribution processing for predetermined moving image data 21. The metadata distribution system 40 includes a metadata distribution device 41 that distributes the metadata 32a.

The metadata distribution device 41 is connected to the metadata DB 32 via the network 5 so that the two can communicate with each other, and is likewise connected to the viewer terminal 42. As long as it can exchange data with the metadata DB 32, the metadata distribution device 41 may instead be connected to it directly or configured integrally with it. The metadata distribution device 41 may also be configured integrally with the metadata creation device 31.

The metadata distribution device 41 is configured to acquire the metadata 32a of moving image data 21 from the metadata DB 32 and provide it in response to access from the viewer terminal 42. The metadata distribution device 41 may also be configured to grant access rights to a predetermined viewer and provide the metadata 32a of moving image data 21 in response to a request from that viewer's viewer terminal 42.

The viewer terminal 42 may be, for example, a mobile terminal such as a smartphone, mobile phone, or tablet, or a stationary terminal such as a personal computer or television, that can connect to the network 5 and play back moving image data 21 distributed via the network 5. Alternatively, the viewer terminal 42 may be, for example, a playback device that can connect to the network 5 and can read and play back moving image data 21 stored on a storage medium such as a DVD.

For example, the viewer terminal 42 is connected via the network 5 to the moving image DB 6, which distributes moving image data 21 by download or by streaming in response to access from the viewer terminal 42, and plays back the moving image data 21 distributed from the moving image DB 6. The moving image DB 6 may be configured to grant access rights to a predetermined viewer and distribute moving image data 21 in response to a request from that viewer's viewer terminal 42.

In the present embodiment, as described above, the object extraction device 2 performs feature point extraction processing on a processing frame 22 targeted for object extraction, among the plurality of frames of two-dimensional images constituting the moving image data 21, to extract a plurality of feature points 23 of the processing frame 22 and detect the first feature amount of each feature point 23 on the two-dimensional image; performs depth detection processing on the processing frame 22 to detect, for each feature point 23, its relative depth from the surrounding feature points 23; performs three-dimensional space estimation processing on the processing frame 22 to estimate the real three-dimensional space 26 within the processing frame 22 based on at least the first feature amount and depth of each of the plurality of feature points 23, and to detect the second feature amounts of the feature points 23 in the real three-dimensional space 26; and performs object extraction processing based on at least the second feature amount and color distribution of each of the plurality of feature points 23 to detect a feature point group 24 consisting of a set of two or more feature points 23 of the processing frame 22, and extracts this feature point group 24, which has feature amounts in the real three-dimensional space 26, as a candidate object 25 of the processing frame 22.

With this configuration, the feature point group 24 consisting of a set of two or more feature points 23 is determined based on the second feature amount of each feature point 23 of the processing frame 22 in the real three-dimensional space 26 and on the color distribution, so objects can be extracted with higher accuracy. In addition, a candidate object 25 having feature amounts in the real three-dimensional space 26 can be extracted from the processing frame 22 without using moving image data generated by an imaging device that captures three-dimensional images. Furthermore, because this candidate object 25 has feature amounts in the real three-dimensional space 26, the features of a person, thing, or the like can be identified independently of the shooting angle, and the object can therefore be recognized with high accuracy. This makes the information on recognized objects more convenient to use and helps promote the use and spread of moving images.

Further, according to the present embodiment, when the plurality of processing frames 22 constituting the moving image data 21 include two or more common processing frames 22 that are consecutive on the time axis and constitute the same scene, the object extraction device 2 performs the depth detection processing, the three-dimensional space estimation processing, and the object extraction processing on each of the two or more common processing frames 22, and extracts, as a candidate object 25 of the same scene, a feature point group 24 that has feature amounts in the real three-dimensional space 26 and is detected in common across the two or more common processing frames 22.

With this configuration, using feature amounts in the real three-dimensional space 26 makes it possible to recognize the same object appearing in the same scene with high accuracy regardless of the shooting angle.
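Selecting the feature point groups that are detected in common across the frames of a scene can be sketched as a set intersection. This assumes that each group has already been reduced to a pose-independent signature that can be compared for equality across frames; that signature, the function name `common_groups`, and the sample data are hypothetical simplifications.

```python
def common_groups(frames):
    """Keep only feature point groups detected in every frame of a scene.

    `frames` is a non-empty list of dicts, one per common processing frame,
    mapping a group signature to that group's 3D feature points.
    """
    shared = set(frames[0])
    for frame in frames[1:]:
        shared &= set(frame)  # drop groups missing from this frame
    return sorted(shared)


scene = [
    {"g1": [(0, 0, 0), (1, 0, 0)], "g2": [(3, 1, 0), (3, 2, 0)]},
    {"g1": [(0, 0, 1), (1, 0, 1)], "g3": [(7, 7, 7), (8, 7, 7)]},
    {"g1": [(0, 0, 2), (1, 0, 2)], "g2": [(3, 1, 2), (3, 2, 2)]},
]
print(common_groups(scene))  # ['g1']
```

In practice the matching would be approximate rather than exact, since feature amounts estimated from different frames will not agree bit for bit.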

Further, according to the present embodiment, when performing feature point extraction processing on each of the two or more common processing frames 22, the object extraction device 2 uses the differences between the plurality of feature points 23 extracted from one common processing frame 22 and the plurality of feature points 23 extracted from another common processing frame 22 to increase the feature points 23 of the one common processing frame 22.

With this configuration, a candidate object 25 having more feature points 23 can be extracted, and because the object recognition processing then uses more feature points 23, the recognition accuracy for the candidate object 25 can be improved.

Further, according to the present embodiment, in the object extraction device 2, a candidate object 25 of the same scene has, in addition to its feature amounts in the real three-dimensional space 26, the amount of temporal displacement of those feature amounts over the same scene.

With this configuration, the feature amounts of the motion of a candidate object 25 of the same scene in the real three-dimensional space 26 can be extracted. By storing in the object DB 3 reference objects 3a that record the features of various motions of objects, and having the object recognition device 4 compare the motion of the candidate object 25 with the motion of a reference object 3a, it is also possible to determine what kind of motion the candidate object 25 is performing. Note that the motion feature amounts in this case need not go so far as to identify the type of the candidate object 25; it is sufficient that they can distinguish appearance in and exit from the processing frame 22, the direction and amount of movement in the real three-dimensional space 26, rotation, and the like.
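One simple way to picture a temporal displacement amount is the frame-to-frame movement of a feature point group's centroid in three-dimensional space. The sketch below is an assumption made for illustration (the specification does not define the displacement this way); `centroid`, `displacements`, and the sample track are hypothetical.

```python
def centroid(points):
    """Mean position of a group of 3D feature points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def displacements(track):
    """Per-step displacement of a group's centroid in 3D space.

    `track` lists the same feature point group as observed in consecutive
    processing frames of one scene.
    """
    centers = [centroid(points) for points in track]
    return [tuple(b[i] - a[i] for i in range(3))
            for a, b in zip(centers, centers[1:])]


track = [
    [(0, 0, 0), (2, 0, 0)],  # frame 1
    [(1, 0, 0), (3, 0, 0)],  # frame 2: moved along x
    [(1, 0, 2), (3, 0, 2)],  # frame 3: moved along z (depth)
]
print(displacements(track))  # [(1.0, 0.0, 0.0), (0.0, 0.0, 2.0)]
```

Centroid motion captures direction and amount of movement only; distinguishing rotation would require tracking individual feature points as well.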

Alternatively, according to the present embodiment, the object extraction device 2 applies image quality enhancement processing to the moving image data 21, thereby increasing the feature points 23 of the feature point group 24 extracted as a candidate object 25 in the processing frame 22.

For example, in the image quality enhancement processing, learning data for various color distributions is stored in advance, obtained by sampling the differences between high-quality data and low-quality data for each of a plurality of template images that have various color distributions in each minute region, and the image quality of the processing frame 22 is enhanced using the learning data that best fits each minute region of the processing frame 22.

With these configurations, a candidate object 25 having more feature points 23 can be extracted, and because the object recognition processing then uses more feature points 23, the recognition accuracy for the candidate object 25 can be improved.

Furthermore, in the present embodiment, as described above, the object recognition system 1 includes: the object extraction device 2 described above; the object DB 3, which is a database that stores a plurality of reference objects 3a for recognizing candidate objects 25 together with the original image of each reference object 3a and attached information related to each reference object 3a, each reference object 3a having been extracted as a feature point group having feature amounts in the real three-dimensional space of its original image by applying feature point extraction processing, depth detection processing, three-dimensional space estimation processing, and object extraction processing to that original image; and the object recognition device 4, which performs object recognition processing to determine which of the plurality of reference objects 3a stored in the object DB 3 the candidate object 25 extracted by the object extraction device 2 corresponds to. When the object recognition device 4 determines that the candidate object 25 corresponds to one reference object 3a among the plurality of reference objects 3a, it adds object information generated based on the attached information of that reference object 3a to the candidate object 25.

With such a configuration, the candidate object 25 extracted with high accuracy is compared with the reference objects 3a, which were likewise extracted with high accuracy, so the candidate object 25 can be recognized with high accuracy. In addition, because refined object information is generated from the attached information of the reference object 3a, object information that more appropriately identifies the candidate object 25 is added to it, which improves the convenience of the object information.

Further, according to the present embodiment, in the object recognition system 1, the object recognition process is performed by comparing the feature point group 24 of the candidate object 25 and the color distribution in the processing frame 22 with the feature point group of the reference object 3a and the color distribution in its original image.

With such a configuration, a highly accurate identification quantity that identifies the candidate object 25 is compared with a highly accurate identification quantity that identifies the reference object 3a, which realizes highly accurate recognition of the candidate object 25.
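
The comparison described above can be illustrated with a minimal Python sketch. It is not the method disclosed in this publication: the `ObjectSignature` container, the nearest-neighbor point similarity, the histogram-intersection color score, the equal weighting, and the `threshold` value are all assumptions chosen only to show how a feature point group and a color distribution might be compared jointly.

```python
from dataclasses import dataclass
from math import dist


@dataclass
class ObjectSignature:
    """Hypothetical container for one object (candidate or reference)."""
    points: list[tuple[float, float, float]]  # feature points in the estimated 3D space
    color_hist: list[float]                   # normalized color histogram (sums to 1)


def point_group_similarity(p, q):
    """Map the mean nearest-neighbor distance from p to q into (0, 1]."""
    if not p or not q:
        return 0.0
    mean = sum(min(dist(a, b) for b in q) for a in p) / len(p)
    return 1.0 / (1.0 + mean)


def histogram_intersection(a, b):
    """Overlap of two normalized histograms; 1.0 means identical."""
    return sum(min(x, y) for x, y in zip(a, b))


def recognize(candidate, references, threshold=0.8):
    """Return the name of the best-matching reference, or None if no score reaches threshold.

    The 50/50 weighting and the threshold are illustrative assumptions.
    """
    best_name, best_score = None, 0.0
    for name, ref in references.items():
        score = (0.5 * point_group_similarity(candidate.points, ref.points)
                 + 0.5 * histogram_intersection(candidate.color_hist, ref.color_hist))
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None
```

In this sketch a candidate that matches no reference yields `None`, which corresponds to the case where the candidate object 25 is determined not to correspond to any reference object 3a.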

Further, according to the present embodiment, in the object recognition system 1, the object DB 3 classifies the plurality of reference objects 3a based on their respective attached information. Two or more reference objects 3a that share common attached information are classified and stored in a common category that uses that common attached information as its classification information.

With such a configuration, the object DB 3 allows the reference objects 3a to be searched easily based on the attached information, and also based on the category.
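
As a rough illustration of this classification, the following Python sketch groups reference objects by shared attached information. Representing attached information as a set of string tags, and the function name `build_categories`, are assumptions made for the example; the publication does not specify a data format.

```python
from collections import defaultdict


def build_categories(references: dict[str, set[str]]) -> dict[str, set[str]]:
    """Group reference object ids under each attached-information tag.

    A tag becomes a category only when two or more reference objects share it,
    mirroring the "common attached information" condition in the text.
    """
    by_tag = defaultdict(set)
    for object_id, tags in references.items():
        for tag in tags:
            by_tag[tag].add(object_id)
    return {tag: ids for tag, ids in by_tag.items() if len(ids) >= 2}
```

Looking up a key of the returned dictionary then corresponds to a category-based search, while scanning the original `references` mapping corresponds to a search by attached information.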

Further, according to the present embodiment, the object recognition system 1 stores a candidate object 25 determined to correspond to one reference object 3a among the plurality of reference objects 3a in the object DB 3 as a new reference object 3a of the category into which that reference object 3a is classified.

Note that the object recognition system 1 stores a candidate object 25 determined not to correspond to any of the plurality of reference objects 3a in the object DB 3 as a reference object 3a of a new category into which that candidate object 25 is classified.

With these configurations, the candidate object 25 resulting from the object recognition process can be used as learning data for the reference objects 3a. Moreover, as object recognition is performed on a variety of moving image data, the amount of learning data based on candidate objects 25 with highly accurate recognition results can be increased. As a result, the machine learning of the object recognition system 1 improves, and the accuracy and efficiency of the object recognition process can be improved.
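
The two storage rules above (add to the matched category, or open a new category) can be sketched as follows. The `ObjectDB` class here is a hypothetical in-memory stand-in for the object DB 3, and the `category_N` naming for new categories is an assumption; the publication does not say how new categories are named.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ObjectDB:
    """Hypothetical in-memory stand-in: category name -> stored reference object ids."""
    categories: dict[str, list[str]] = field(default_factory=dict)

    def register(self, candidate_id: str, matched_category: Optional[str] = None) -> str:
        """Store a recognized candidate as a new reference object.

        If the candidate matched a reference object, it joins that object's
        category; otherwise a new category is created for it.
        """
        if matched_category is None:
            # Assumed naming scheme for new categories; skip names already in use.
            n = len(self.categories) + 1
            while f"category_{n}" in self.categories:
                n += 1
            matched_category = f"category_{n}"
        self.categories.setdefault(matched_category, []).append(candidate_id)
        return matched_category
```

Each call grows the set of stored reference objects, which is the sense in which recognition results accumulate as learning data.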

Furthermore, in the present embodiment, as described above, the metadata creation system includes the object recognition system 1. It aggregates the moving image information of given moving image data 21, the frame information of the plurality of processing frames 22 constituting that moving image data 21, and the object information of the candidate objects 25 extracted and recognized from each of the plurality of processing frames 22, and creates metadata 32a relating to the moving image data 21 based on the aggregation result.

With such a configuration, the metadata 32a is created using frame information for which the candidate object 25 has been recognized with high accuracy and object information that more appropriately identifies the candidate object 25. The metadata 32a therefore properly describes the processing frames 22 in which the candidate object 25 appears, as well as the candidate object 25 itself, so the content of the moving image data 21 is properly reflected. This increases the utility value of the metadata 32a and, in turn, promotes the use and spread of the moving image data 21 corresponding to the metadata 32a.
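
A minimal Python sketch of this aggregation step is given below. The dictionary layout of the inputs and of the resulting metadata, and the function name `create_metadata`, are assumptions for illustration only; the publication does not define a metadata schema.

```python
def create_metadata(video_info: dict, frames: list[dict]) -> dict:
    """Aggregate video, frame, and object information into one metadata record.

    Each frame is assumed to look like {"index": int, "objects": [label, ...]},
    where the labels come from the object information added during recognition.
    """
    objects: dict[str, dict] = {}
    for frame in frames:
        for label in frame["objects"]:
            entry = objects.setdefault(label, {"frames": []})
            # Record each frame index once per object, in order of appearance.
            if not entry["frames"] or entry["frames"][-1] != frame["index"]:
                entry["frames"].append(frame["index"])
    return {"video": video_info, "frame_count": len(frames), "objects": objects}
```

The per-object list of frame indices is one simple way to express in which processing frames 22 each candidate object 25 appears.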

In the present embodiment, a configuration in which the object extraction device 2 is applied to the object recognition system 1 has been described, but the invention is not limited to this configuration. For example, in another embodiment, the object extraction device 2 can also be applied to a stereoscopic display system or the like that extracts a candidate object 25 of a subject from an image of the subject and displays a stereoscopic image on a stereoscopic display screen based on the feature amounts of the candidate object 25 in three-dimensional space.

Description of symbols

1 Object recognition system
2 Object extraction device
3 Object database (DB)
3a Reference object
4 Object recognition device
5 Network
6 Moving image database (DB)
7 Image collection device
8 Extraction processing database (DB)
10 Control unit
11 Storage unit
12 Communication unit
13 Moving image input unit
14 Frame acquisition unit
15 Image quality enhancement unit
16 Frame adjustment unit
17 Feature point extraction unit
18 Depth detection unit
19 Three-dimensional space estimation unit
20 Object extraction unit
21 Moving image data
22 Processing frame
23 Feature point
24 Feature point group
25 Candidate object
26 Real three-dimensional space
30 Metadata creation system
31 Metadata creation device
32 Metadata database (DB)
32a Metadata
40 Metadata distribution system
41 Metadata distribution device
42 Viewer terminal

Claims

1. An object extraction device that performs a feature point extraction process on a processing frame targeted for extraction, among a plurality of frames of two-dimensional images constituting moving image data, to extract a plurality of feature points of the processing frame and detect a first feature amount on the two-dimensional image for each of the feature points;
performs a depth detection process on the processing frame to detect, for each feature point of the processing frame, a relative depth from surrounding feature points;
performs a three-dimensional space estimation process on the processing frame to estimate a real three-dimensional space of the processing frame based on at least the first feature amount and the depth of each of the plurality of feature points, and to detect a second feature amount in the real three-dimensional space for each of the plurality of feature points; and
performs an object extraction process based on at least the second feature amount and a color distribution of each of the plurality of feature points to detect a feature point group consisting of a set of two or more feature points of the processing frame, and extracts the feature point group having feature amounts in the real three-dimensional space as a candidate object of the processing frame.

2. The object extraction device according to claim 1, wherein, when the plurality of processing frames constituting the moving image data include two or more common processing frames that constitute the same scene on the time axis, and the depth detection process, the three-dimensional space estimation process, and the object extraction process are performed on each of the two or more common processing frames, a feature point group that has feature amounts in the real three-dimensional space and is detected in common in the two or more common processing frames is extracted as a candidate object of the same scene.

3. The object extraction device according to claim 2, wherein, when the feature point extraction process is performed on each of the two or more common processing frames, the feature points of one common processing frame are increased by using the difference between the plurality of feature points extracted from the one common processing frame and the plurality of feature points extracted from another common processing frame.

4. The object extraction device according to claim 3, wherein the candidate object of the same scene has, in addition to the feature amounts in the real three-dimensional space, a temporal displacement amount of those feature amounts over the same scene.

5. The object extraction device according to any one of claims 1 to 4, wherein the feature points of the feature point group extracted as the candidate object in the processing frame are increased by performing image quality enhancement processing on the moving image data.

6. The object extraction device according to claim 5, wherein, in the image quality enhancement processing, learning data for various color distributions, obtained by sampling, for each minute region, the difference between high-quality data and low-quality data of each of a plurality of template images having various color distributions, is stored in advance, and the image quality of the processing frame is enhanced by using, for each minute region of the processing frame, the learning data best suited to that region.

7. An object recognition system comprising: the object extraction device according to any one of claims 1 to 6;
an object database that stores a plurality of reference objects for recognizing the candidate object, together with an original image of each reference object and attached information related to each reference object, each reference object having been extracted, by the feature point extraction process, the depth detection process, the three-dimensional space estimation process, and the object extraction process applied to its original image, as a feature point group having feature amounts in the real three-dimensional space of that original image; and
an object recognition device that performs an object recognition process of determining which of the plurality of reference objects stored in the object database the candidate object extracted by the object extraction device corresponds to,
wherein, when the object recognition device determines that the candidate object corresponds to one reference object among the plurality of reference objects, the object recognition device adds object information generated based on the attached information of the one reference object to the candidate object.

8. The object recognition system according to claim 7, wherein the object recognition process is performed by comparing the feature point group of the candidate object and the color distribution in the processing frame with the feature point group of the reference object and the color distribution in its original image.

9. The object recognition system according to claim 7, wherein the object database classifies the plurality of reference objects based on their respective attached information, and two or more reference objects having common attached information are classified and stored in a common category that uses the common attached information as classification information.

10. The object recognition system according to any one of claims 7 to 9, wherein the candidate object determined to correspond to one reference object among the plurality of reference objects is stored in the object database as a new reference object of the category into which the one reference object is classified.

11. The object recognition system, wherein the candidate object determined not to correspond to any of the plurality of reference objects is stored in the object database as a reference object of a new category into which the candidate object is classified.

12. A metadata creation system comprising the object recognition system according to any one of claims 7 to 11,
wherein the metadata creation system aggregates moving image information of given moving image data, frame information of the plurality of processing frames constituting the given moving image data, and the object information of the candidate objects extracted and recognized from each of the plurality of processing frames, and creates metadata relating to the moving image data based on the aggregation result.