JP2015062090A

JP2015062090A - Video processing system, video processing method, video processing device for portable terminal or server, and control method and control program of the same

Info

Publication number: JP2015062090A
Application number: JP2011273940A
Authority: JP
Inventors: Toshiyuki Nomura (野村 俊之); Akio Yamada (山田 昭雄); Kota Iwamoto (岩元 浩太); Ryota Mase (間瀬 亮太)
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2011-12-15
Filing date: 2011-12-15
Publication date: 2015-04-02
Also published as: WO2013089041A1

Abstract

PROBLEM TO BE SOLVED: To notify a user in real time of the results of recognizing a plurality of recognition objects in a video while maintaining recognition accuracy. SOLUTION: A video processing device: stores a plurality of recognition objects in association with m1 to mk first local features, which consist of feature vectors from the first dimension to the i1-th through ik-th dimension, respectively; extracts n feature points from an image in a video and generates n second local features, each consisting of feature vectors from the first to the j-th dimension; selects the smaller number of dimensions from among the dimension counts i1 to ik and the dimension count j; recognizes that a plurality of recognition objects are present in the image in the video when it determines that at least a predetermined ratio of each of the m1 to mk first local features, taken up to the selected number of dimensions, corresponds to the n second local features, taken up to the same number of dimensions; and displays information indicating those recognition objects on the image in the video in which they are present.

Description

The present invention relates to a technique for identifying, in real time, a plurality of objects present in a video.

In the above technical field, Patent Document 1 describes a technique for creating a shelf-allocation model of products quickly and accurately by applying color correction, size correction, and distortion correction to a photograph of a sales floor and matching the result against the product master of each individual product.

Patent Document 1: JP 2009-187482 A

However, the technique described in the above document requires color correction, size correction, and distortion correction of the sales-floor photograph, so it cannot identify a plurality of products and report them in real time.

An object of the present invention is to provide a technique that solves the above problem.

To achieve the above object, an apparatus according to the present invention comprises:
first local feature storage means for storing a plurality of recognition objects in association with m1, m2, ..., mk first local features, the first local features consisting of feature vectors from the first dimension to the i1-th, i2-th, ..., ik-th dimension, respectively, and being generated for each of m1, m2, ..., mk local regions that respectively contain m1, m2, ..., mk feature points in images of the plurality of recognition objects;
second local feature generation means for extracting n feature points from an image in a video and generating, for n local regions each containing one of the n feature points, n second local features each consisting of feature vectors from the first to the j-th dimension;
recognition means for selecting the smaller number of dimensions from among the dimension counts i1, i2, ..., ik of the first local features and the dimension count j of the second local features, and for recognizing, when it determines that at least a predetermined ratio of each of the m1, m2, ..., mk first local features consisting of feature vectors up to the selected number of dimensions corresponds to the n second local features consisting of feature vectors up to the selected number of dimensions, that the plurality of recognition objects for which at least the predetermined ratio corresponds are present in the image in the video; and
display means for displaying information indicating the plurality of recognition objects recognized by the recognition means on the image in the video in which the plurality of recognition objects are present.

To achieve the above object, a method according to the present invention is a control method for a video processing apparatus that comprises first local feature storage means for storing a plurality of recognition objects in association with m1, m2, ..., mk first local features, the first local features consisting of feature vectors from the first dimension to the i1-th, i2-th, ..., ik-th dimension, respectively, and being generated for each of m1, m2, ..., mk local regions that respectively contain m1, m2, ..., mk feature points in images of the plurality of recognition objects, the method comprising:
a second local feature generation step of extracting n feature points from an image in a video and generating, for n local regions each containing one of the n feature points, n second local features each consisting of feature vectors from the first to the j-th dimension;
a recognition step of selecting the smaller number of dimensions from among the dimension counts i1, i2, ..., ik of the first local features and the dimension count j of the second local features, and of recognizing, when it is determined that at least a predetermined ratio of each of the m1, m2, ..., mk first local features consisting of feature vectors up to the selected number of dimensions corresponds to the n second local features consisting of feature vectors up to the selected number of dimensions, that the plurality of recognition objects for which at least the predetermined ratio corresponds are present in the image in the video; and
a display step of displaying information indicating the plurality of recognition objects recognized in the recognition step on the image in the video in which the plurality of recognition objects are present.

To achieve the above object, a program according to the present invention is a control program for a video processing apparatus that comprises first local feature storage means for storing a plurality of recognition objects in association with m1, m2, ..., mk first local features, the first local features consisting of feature vectors from the first dimension to the i1-th, i2-th, ..., ik-th dimension, respectively, and being generated for each of m1, m2, ..., mk local regions that respectively contain m1, m2, ..., mk feature points in images of the plurality of recognition objects, the program causing a computer to execute:
a second local feature generation step of extracting n feature points from an image in a video and generating, for n local regions each containing one of the n feature points, n second local features each consisting of feature vectors from the first to the j-th dimension;
a recognition step of selecting the smaller number of dimensions from among the dimension counts i1, i2, ..., ik of the first local features and the dimension count j of the second local features, and of recognizing, when it is determined that at least a predetermined ratio of each of the m1, m2, ..., mk first local features consisting of feature vectors up to the selected number of dimensions corresponds to the n second local features consisting of feature vectors up to the selected number of dimensions, that the plurality of recognition objects for which at least the predetermined ratio corresponds are present in the image in the video; and
a display step of displaying information indicating the plurality of recognition objects recognized in the recognition step on the image in the video in which the plurality of recognition objects are present.

To achieve the above object, a system according to the present invention is a video processing system having a video processing apparatus for a mobile terminal and a video processing apparatus for a server that are connected via a network, the system comprising:
first local feature storage means for storing a plurality of recognition objects in association with m1, m2, ..., mk first local features, the first local features consisting of feature vectors from the first dimension to the i1-th, i2-th, ..., ik-th dimension, respectively, and being generated for each of m1, m2, ..., mk local regions that respectively contain m1, m2, ..., mk feature points in images of the plurality of recognition objects;
second local feature generation means for extracting n feature points from an image in a video and generating, for n local regions each containing one of the n feature points, n second local features each consisting of feature vectors from the first to the j-th dimension;
recognition means for selecting the smaller number of dimensions from among the dimension counts i1, i2, ..., ik of the first local features and the dimension count j of the second local features, and for recognizing, when it determines that at least a predetermined ratio of each of the m1, m2, ..., mk first local features consisting of feature vectors up to the selected number of dimensions corresponds to the n second local features consisting of feature vectors up to the selected number of dimensions, that the plurality of recognition objects for which at least the predetermined ratio corresponds are present in the image in the video; and
display means for displaying information indicating the plurality of recognition objects recognized by the recognition means on the image in the video in which the plurality of recognition objects are present.

To achieve the above object, a method according to the present invention is a video processing method in a video processing system that has a video processing apparatus for a mobile terminal and a video processing apparatus for a server connected via a network, and that comprises first local feature storage means for storing a plurality of recognition objects in association with m1, m2, ..., mk first local features, the first local features consisting of feature vectors from the first dimension to the i1-th, i2-th, ..., ik-th dimension, respectively, and being generated for each of m1, m2, ..., mk local regions that respectively contain m1, m2, ..., mk feature points in images of the plurality of recognition objects, the method comprising:
a second local feature generation step of extracting n feature points from an image in a video and generating, for n local regions each containing one of the n feature points, n second local features each consisting of feature vectors from the first to the j-th dimension;
a recognition step of selecting the smaller number of dimensions from among the dimension counts i1, i2, ..., ik of the first local features and the dimension count j of the second local features, and of recognizing, when it is determined that at least a predetermined ratio of each of the m1, m2, ..., mk first local features consisting of feature vectors up to the selected number of dimensions corresponds to the n second local features consisting of feature vectors up to the selected number of dimensions, that the plurality of recognition objects for which at least the predetermined ratio corresponds are present in the image in the video; and
a display step of displaying information indicating the plurality of recognition objects recognized in the recognition step on the image in the video in which the plurality of recognition objects are present.

According to the present invention, recognition results for a plurality of recognition objects in a video can be reported in real time while maintaining recognition accuracy.

A block diagram showing the configuration of the video processing apparatus according to the first embodiment of the present invention.
A diagram explaining video processing by the video processing apparatus according to the second embodiment of the present invention.
A block diagram showing the functional configuration of the video processing apparatus according to the second embodiment of the present invention.
A block diagram showing the configuration of the local feature generation unit according to the second embodiment of the present invention.
A diagram showing processing of the local feature generation unit according to the second embodiment of the present invention.
A diagram showing processing of the local feature generation unit according to the second embodiment of the present invention.
A diagram showing processing of the local feature generation unit according to the second embodiment of the present invention.
A diagram showing processing of the local feature generation unit according to the second embodiment of the present invention.
A diagram showing processing of the local feature generation unit according to the second embodiment of the present invention.
A diagram showing processing of the collation unit according to the second embodiment of the present invention.
A diagram showing the structure of the local feature generation data according to the second embodiment of the present invention.
A diagram showing the structure of the local feature DB according to the second embodiment of the present invention.
A block diagram showing the hardware configuration of the video processing apparatus according to the second embodiment of the present invention.
A flowchart showing the processing procedure of the video processing apparatus according to the second embodiment of the present invention.
A flowchart showing the processing procedure of the local feature generation processing according to the second embodiment of the present invention.
A flowchart showing the processing procedure of the collation processing according to the second embodiment of the present invention.
A diagram explaining video processing by the video processing apparatus according to the third embodiment of the present invention.
A block diagram showing the functional configuration of the video processing apparatus according to the third embodiment of the present invention.
A diagram showing the structure of the multiple collation result holding unit according to the third embodiment of the present invention.
A diagram showing the structure of the combination identification DB according to the third embodiment of the present invention.
A block diagram showing the hardware configuration of the video processing apparatus according to the third embodiment of the present invention.
A flowchart showing the processing procedure of the video processing apparatus according to the third embodiment of the present invention.
A flowchart showing the processing procedure of the video collation processing according to the third embodiment of the present invention.
A diagram explaining video processing by the video processing system according to the fourth embodiment of the present invention.
A sequence diagram showing the video processing procedure of the video processing system according to the fourth embodiment of the present invention.
A block diagram showing the functional configuration of the video processing apparatus for a mobile terminal according to the fourth embodiment of the present invention.
A block diagram showing the configuration of the encoding unit according to the fourth embodiment of the present invention.
A block diagram showing the functional configuration of the video processing apparatus for a server according to the fourth embodiment of the present invention.
A diagram showing the structure of the link information DB according to the fourth embodiment of the present invention.
A block diagram showing the hardware configuration of the video processing apparatus for a mobile terminal according to the fourth embodiment of the present invention.
A flowchart showing the processing procedure of the video processing apparatus for a mobile terminal according to the fourth embodiment of the present invention.
A flowchart showing the processing procedure of encoding according to the fourth embodiment of the present invention.
A flowchart showing the processing procedure of difference value encoding according to the fourth embodiment of the present invention.
A block diagram showing the hardware configuration of the video processing apparatus for a server according to the fourth embodiment of the present invention.
A flowchart showing the processing procedure of the video processing apparatus for a server according to the fourth embodiment of the present invention.
A flowchart showing the processing procedure of the local feature DB generation processing according to the fourth embodiment of the present invention.
A flowchart showing the processing procedure of the recognition object / link information acquisition processing according to the fourth embodiment of the present invention.

Embodiments of the present invention are described below in detail, by way of example, with reference to the drawings. The constituent elements described in the following embodiments are merely examples, and the technical scope of the present invention is not limited to them.

[First Embodiment]
A video processing apparatus 100 as a first embodiment of the present invention is described with reference to FIG. 1. The video processing apparatus 100 recognizes a plurality of recognition objects in an image in a video in real time while maintaining recognition accuracy.

As shown in FIG. 1, the video processing apparatus 100 includes a first local feature storage unit 110, a second local feature generation unit 120, a recognition unit 130, and a display unit 140. The first local feature storage unit 110 stores a plurality of recognition objects 111-1 to 111-k in association with m1, m2, ..., mk first local features 112-1 to 112-k, which consist of feature vectors from the first dimension to the i1-th, i2-th, ..., ik-th dimension, respectively, and are generated for each of the m1, m2, ..., mk local regions that respectively contain the m1, m2, ..., mk feature points in the images of the recognition objects. The second local feature generation unit 120 extracts n feature points 121 from an image 101 in a video and generates, for n local regions 122 each containing one of the n feature points, n second local features 123 each consisting of feature vectors from the first to the j-th dimension. The recognition unit 130 selects the smaller number of dimensions from among the dimension counts i1, i2, ..., ik of the first local features and the dimension count j of the second local features. The recognition unit 130 then determines whether at least a predetermined ratio of each of the m1, m2, ..., mk first local features, consisting of feature vectors up to the selected number of dimensions, corresponds to the n second local features, consisting of feature vectors up to the selected number of dimensions. When it determines that they correspond, the recognition unit 130 recognizes that the plurality of recognition objects for which at least the predetermined ratio corresponds are present in the image 101 in the video. The display unit 140 displays information 141 indicating the recognition objects recognized by the recognition unit 130 on the image in the video in which those recognition objects are present.
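The recognition rule of the recognition unit 130 can be illustrated with a minimal Python sketch. The text above specifies only that the smaller dimension count is used and that a predetermined ratio of an object's stored features must correspond; how a single correspondence is decided is not stated here, so the Euclidean distance test, the `max_distance` threshold, the default `ratio`, and all function and parameter names below are assumptions made for illustration.

```python
from math import dist


def recognize(db, frame_features, ratio=0.5, max_distance=0.5):
    """Return the names of stored objects judged present in the frame.

    db: {object name: list of stored (first) feature vectors of length i_k}
    frame_features: list of n (second) feature vectors of length j
    ratio, max_distance: hypothetical thresholds, not taken from the source.
    """
    if not frame_features:
        return []
    j = len(frame_features[0])
    found = []
    for name, stored in db.items():
        if not stored:
            continue
        # Use the smaller of the two dimension counts for this object.
        dims = min(len(stored[0]), j)
        truncated_frame = [v[:dims] for v in frame_features]
        matched = 0
        for s in stored:
            s_t = s[:dims]
            # Assumed correspondence test: some frame feature lies close enough.
            if any(dist(s_t, q) <= max_distance for q in truncated_frame):
                matched += 1
        # The object is recognized when enough of its stored features matched.
        if matched / len(stored) >= ratio:
            found.append(name)
    return found


# Hypothetical data: "tower" is stored with 4 dimensions, the frame has 3.
db = {"tower": [[0, 0, 1, 9], [1, 0, 0, 9]], "bridge": [[5, 5], [6, 6]]}
frame = [[0, 0, 1], [1, 0, 0], [9, 9, 9]]
print(recognize(db, frame))
```

Because each object is compared at its own `min(i_k, j)`, objects stored with different dimension counts can be checked against the same frame features without regenerating them.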

According to this embodiment, recognition results for a plurality of recognition objects in a video can be reported in real time while maintaining recognition accuracy.

[Second Embodiment]
Next, a video processing apparatus according to a second embodiment of the present invention is described. This embodiment describes video processing in which a video processing apparatus serving as a mobile terminal recognizes a plurality of recognition objects in the video being captured and displays the recognition results for those objects on the video in real time. The example described here displays the names of the recognition objects in real time. Although this embodiment describes processing of video captured by the mobile terminal, the same processing applies equally to playback of video content and to viewing of broadcast programs.

According to this embodiment, while a user is viewing a video, the recognition results for a plurality of recognition objects in that video can be presented to the user on the video in real time while maintaining recognition accuracy.

《Description of video processing according to this embodiment》
FIG. 2 is a diagram explaining video processing by the video processing apparatus 200 according to this embodiment. FIG. 2 shows one processing example, but the processing is not limited to it.

FIG. 2 shows an example in which a user such as a tourist captures video containing recognition objects with a mobile terminal because the user wants to identify structures in view, such as towers and buildings, or wants to know a destination or his or her current location. As shown on the left of FIG. 2, the display screen 210 of the video processing apparatus 200 serving as a mobile terminal shows a video display area 211 for the video being captured and an instruction button display area 212 of the touch panel. Note that the buildings shown in the video display area 211 are the live video being captured, displayed as is, and not a still image (photograph). In this embodiment, real-time recognition processing is performed on the displayed video, and the names 222 to 224 of the individual buildings are displayed in the video display area 221 of the display screen 220 on the right of FIG. 2. From the video display area 221, the user can learn the destination or his or her current location without, for example, a tourist guide.

《Functional configuration of the video processing apparatus》
FIG. 3 is a block diagram showing the functional configuration of the video processing apparatus 200 according to this embodiment.

The video processing apparatus 200 has an imaging unit 310 that acquires video. The captured video is displayed on the display unit 360 and is also input to the local feature generation unit 320. The local feature generation unit 320 generates local features from the captured video (see FIG. 4A for details). The local feature DB 330 stores, in association with each recognition object, local features generated in advance from individual recognition objects (single objects such as the Skytree or a building in FIG. 2) by the same algorithm as the local feature generation unit 320. The contents of the local feature DB 330 may be received by the local feature reception unit 380 via the communication control unit 390.

The collation unit 340 checks whether the local features generated from the captured video by the local feature generation unit 320 contain data corresponding to the local features stored in the local feature DB 330. If corresponding data exists, the collation unit 340 determines that a recognition object is present in the captured video. Note that local features "corresponding" does not only mean that the same local features exist; it may also include determining whether their order and arrangement could have been obtained from the same object (see FIG. 4G).

The collation result generation unit 350 generates data to be displayed on the display unit 360 from the collation result of the collation unit 340. This data includes the names of recognition objects as well as data such as recognition errors. The display unit 360 displays the collation result superimposed on the video captured by the imaging unit 310. The data generated by the collation result generation unit 350 may also be transmitted to an external destination via the communication control unit 390. The operation unit 370 includes the keys and touch panel of the video processing apparatus 200 (such as the instruction buttons in FIG. 2) and is used to operate the video processing apparatus 200, including the imaging unit 310.

The video processing apparatus 200 of this embodiment is not limited to video being captured; it is also applicable to video being played back or broadcast. In that case, the imaging unit 310 is replaced with a video playback unit or a video reception unit.

《Local feature generation unit》
FIG. 4A is a block diagram showing the configuration of the local feature generation unit 320 according to this embodiment.

The local feature generation unit 320 includes a feature point detection unit 411, a local region acquisition unit 412, a sub-region division unit 413, a sub-region feature vector generation unit 414, and a dimension selection unit 415.

特徴点検出部４１１は、画像データから特徴的な点（特徴点）を多数検出し、各特徴点の座標位置、スケール（大きさ）、および角度を出力する。 The feature point detection unit 411 detects a large number of characteristic points (feature points) from the image data, and outputs the coordinate position, scale (size), and angle of each feature point.

局所領域取得部４１２は、検出された各特徴点の座標値、スケール、および角度から、特徴量抽出を行う局所領域を取得する。 The local region acquisition unit 412 acquires a local region where feature amount extraction is performed from the coordinate value, scale, and angle of each detected feature point.

サブ領域分割部４１３は、局所領域をサブ領域に分割する。例えば、サブ領域分割部４１３は、局所領域を１６ブロック（４×４ブロック）に分割することも、局所領域を２５ブロック（５×５ブロック）に分割することもできる。なお、分割数は限定されない。本実施形態においては、以下、局所領域を２５ブロック（５×５ブロック）に分割する場合を代表して説明する。 The sub area dividing unit 413 divides the local area into sub areas. For example, the sub-region dividing unit 413 can divide the local region into 16 blocks (4 × 4 blocks) or divide the local region into 25 blocks (5 × 5 blocks). The number of divisions is not limited. In the present embodiment, the case where the local area is divided into 25 blocks (5 × 5 blocks) will be described below as a representative.

サブ領域特徴ベクトル生成部４１４は、局所領域のサブ領域ごとに特徴ベクトルを生成する。サブ領域の特徴ベクトルとしては、例えば、勾配方向ヒストグラムを用いることができる。 The sub-region feature vector generation unit 414 generates a feature vector for each sub-region of the local region. As the feature vector of the sub-region, for example, a gradient direction histogram can be used.

次元選定部４１５は、サブ領域の位置関係に基づいて、近接するサブ領域の特徴ベクトル間の相関が低くなるように、局所特徴量として出力する次元を選定する（例えば、次元の削除や間引きをする）。また、次元選定部４１５は、単に次元を選定するだけではなく、選定の優先順位を決定することができる。すなわち、次元選定部４１５は、例えば、隣接するサブ領域間では同一の勾配方向の次元が選定されないように、優先順位をつけて次元を選定することができる。そして、次元選定部４１５は、選定した次元から構成される特徴ベクトルを、局所特徴量として出力する。なお、次元選定部４１５は、優先順位に基づいて次元を並び替えた状態で、局所特徴量を出力することができる。 The dimension selection unit 415 selects the dimensions to be output as the local feature amount based on the positional relationship between the sub-regions, so that the correlation between the feature vectors of neighboring sub-regions is low (for example, by deleting or thinning out dimensions). The dimension selection unit 415 can also determine a selection priority, rather than merely selecting dimensions. That is, the dimension selection unit 415 can, for example, select dimensions with priorities such that dimensions of the same gradient direction are not selected in adjacent sub-regions. The dimension selection unit 415 then outputs a feature vector composed of the selected dimensions as the local feature amount. Note that the dimension selection unit 415 can output the local feature amount with the dimensions rearranged according to the priority order.

《局所特徴量生成部の処理》 << Processing of local feature generator >>
図４Ｂ〜図４Ｆは、本実施形態に係る局所特徴量生成部３２０の処理を示す図である。 FIGS. 4B to 4F are diagrams illustrating processing of the local feature value generation unit 320 according to the present embodiment.

まず、図４Ｂは、局所特徴量生成部３２０における、特徴点検出／局所領域取得／サブ領域分割／特徴ベクトル生成の一連の処理を示す図である。かかる一連の処理については、米国特許第６７１１２９３号明細書や、David G. Lowe著、「Distinctive image features from scale-invariant key points」、（米国）、International Journal of Computer Vision、60(2)、2004年、p. 91-110を参照されたい。 First, FIG. 4B is a diagram illustrating the series of processes in the local feature amount generation unit 320: feature point detection, local region acquisition, sub-region division, and feature vector generation. For this series of processes, see US Pat. No. 6,711,293 and David G. Lowe, “Distinctive image features from scale-invariant keypoints” (US), International Journal of Computer Vision, 60(2), 2004, pp. 91-110.

（特徴点検出部） (Feature point detector)
図４Ｂの４２１は、図４Ａの特徴点検出部４１１において、映像中の画像から特徴点を検出した状態を示す図である。以下、１つの特徴点４２１ａを代表させて局所特徴量の生成を説明する。特徴点４２１ａの矢印の起点が特徴点の座標位置を示し、矢印の長さがスケール（大きさ）を示し、矢印の方向が角度を示す。ここで、スケール（大きさ）や方向は、対象映像に従って輝度や彩度、色相などを選択できる。また、図４Ｂの例では、６０度間隔で６方向の場合を説明するが、これに限定されない。 421 in FIG. 4B is a diagram illustrating a state in which feature points are detected from an image in the video in the feature point detection unit 411 in FIG. 4A. Hereinafter, generation of a local feature amount will be described by using one feature point 421a as a representative. The starting point of the arrow of the feature point 421a indicates the coordinate position of the feature point, the length of the arrow indicates the scale (size), and the direction of the arrow indicates the angle. Here, as the scale (size) and direction, brightness, saturation, hue, and the like can be selected according to the target image. In the example of FIG. 4B, the case of six directions at intervals of 60 degrees is described, but the present invention is not limited to this.

（局所領域取得部） (Local area acquisition unit)
図４Ａの局所領域取得部４１２は、例えば、特徴点４２１ａの起点を中心にガウス窓４２２ａを生成し、このガウス窓４２２ａを略含む局所領域４２２を生成する。図４Ｂの例では、局所領域取得部４１２は正方形の局所領域４２２を生成したが、局所領域は円形であっても他の形状であってもよい。この局所領域を各特徴点について取得する。 For example, the local region acquisition unit 412 in FIG. 4A generates a Gaussian window 422a around the starting point of the feature point 421a, and generates a local region 422 that substantially includes the Gaussian window 422a. In the example of FIG. 4B, the local region acquisition unit 412 generates the square local region 422, but the local region may be circular or have another shape. This local region is acquired for each feature point.

（サブ領域分割部） (Sub-region division unit)
次に、サブ領域分割部４１３において、上記特徴点４２１ａの局所領域４２２に含まれる各画素のスケールおよび角度をサブ領域４２３に分割した状態が示されている。なお、図４Ｂでは４×４＝１６画素をサブ領域とする５×５＝２５のサブ領域に分割した例を示す。しかし、サブ領域は、４×４＝１６や他の形状、分割数であってもよい。 Next, the figure shows the state in which the sub-region division unit 413 has divided the scale and angle of each pixel included in the local region 422 of the feature point 421a into sub-regions 423. FIG. 4B shows an example of division into 5 × 5 = 25 sub-regions, each consisting of 4 × 4 = 16 pixels. However, the sub-regions may be 4 × 4 = 16 in number, or may have another shape or number of divisions.
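As a rough illustration of the division described above, the following Python sketch splits a square local region into 5 × 5 sub-regions in raster-scan order. The function name `split_into_subregions`, the nested-list representation, and the 20 × 20 pixel region size (5 blocks of 4 pixels per side) are assumptions made for this example and are not taken from the specification.

```python
def split_into_subregions(region, blocks_per_side=5):
    """Split a square 2-D list into blocks_per_side**2 sub-regions (raster-scan order)."""
    size = len(region)
    if any(len(row) != size for row in region):
        raise ValueError("region must be square")
    if size % blocks_per_side != 0:
        raise ValueError("region size must be a multiple of blocks_per_side")
    step = size // blocks_per_side  # pixels per sub-region side
    subregions = []
    for block_row in range(blocks_per_side):
        for block_col in range(blocks_per_side):
            rows = region[block_row * step:(block_row + 1) * step]
            subregions.append(
                [row[block_col * step:(block_col + 1) * step] for row in rows]
            )
    return subregions


# Hypothetical 20 x 20 region whose "pixels" are simply their (y, x) coordinates.
region = [[(y, x) for x in range(20)] for y in range(20)]
subregions = split_into_subregions(region)
print(len(subregions), len(subregions[0]), len(subregions[0][0]))  # prints: 25 4 4
```

Passing `blocks_per_side=4` to the same function gives the 4 × 4 = 16 division that the text mentions as an alternative.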

（サブ領域特徴ベクトル生成部） (Sub-region feature vector generator)
サブ領域特徴ベクトル生成部４１４は、サブ領域内の各画素のスケールを８方向の角度単位にヒストグラムを生成して量子化し、サブ領域の特徴ベクトル４２４とする。すなわち、特徴点検出部４１１が出力する角度に対して正規化された方向である。そして、サブ領域特徴ベクトル生成部４１４は、サブ領域ごとに量子化された８方向の頻度を集計し、ヒストグラムを生成する。この場合、サブ領域特徴ベクトル生成部４１４は、各特徴点に対して生成される２５サブ領域ブロック×６方向＝１５０次元のヒストグラムにより構成される特徴ベクトルを出力する。また、勾配方向を８方向に量子化するだけに限らず、４方向、８方向、１０方向など任意の量子化数に量子化してよい。勾配方向をＤ方向に量子化する場合、量子化前の勾配方向をＧ（０〜２πラジアン）とすると、勾配方向の量子化値Ｑq（q＝０，…，Ｄ−１）は、例えば式（１）や式（２）などで求めることができるが、これに限られない。 The sub-region feature vector generation unit 414 quantizes the scale of each pixel in the sub-region into a histogram in units of eight angular directions, and uses the result as the sub-region feature vector 424. The directions are normalized with respect to the angle output by the feature point detection unit 411. The sub-region feature vector generation unit 414 then totals the frequencies of the eight quantized directions for each sub-region to generate a histogram. In this case, the sub-region feature vector generation unit 414 outputs, for each feature point, a feature vector consisting of a histogram of 25 sub-region blocks × 6 directions = 150 dimensions. The gradient direction is not limited to quantization into 8 directions, and may be quantized into any number of directions, such as 4, 8, or 10. When the gradient direction is quantized into D directions and the gradient direction before quantization is G (0 to 2π radians), the quantized value Qq (q = 0, ..., D−1) of the gradient direction can be obtained by, for example, Equation (1) or Equation (2), but is not limited to these.

Ｑq＝floor(Ｇ×Ｄ／２π） …（１） Qq = floor(G × D / 2π) … (1)
Ｑq＝round(Ｇ×Ｄ／２π）modＤ …（２） Qq = round(G × D / 2π) mod D … (2)
ここで、floor()は小数点以下を切り捨てる関数、round()は四捨五入を行う関数、modは剰余を求める演算である。また、サブ領域特徴ベクトル生成部４１４は勾配ヒストグラムを生成するときに、単純な頻度を集計するのではなく、勾配の大きさを加算して集計してもよい。また、サブ領域特徴ベクトル生成部４１４は勾配ヒストグラムを集計するときに、画素が属するサブ領域だけではなく、サブ領域間の距離に応じて近接するサブ領域（隣接するブロックなど）にも重み値を加算するようにしてもよい。また、サブ領域特徴ベクトル生成部４１４は量子化された勾配方向の前後の勾配方向にも重み値を加算するようにしてもよい。なお、サブ領域の特徴ベクトルは勾配方向ヒストグラムに限られず、色情報など、複数の次元（要素）を有するものであればよい。本実施形態においては、サブ領域の特徴ベクトルとして、勾配方向ヒストグラムを用いることとして説明する。 Here, floor() is a function that truncates the fractional part, round() is a function that rounds to the nearest integer (half up), and mod is the remainder operation. When generating the gradient histogram, the sub-region feature vector generation unit 414 may accumulate gradient magnitudes instead of simple frequencies. When totaling the gradient histogram, the sub-region feature vector generation unit 414 may also add weight values not only to the sub-region to which a pixel belongs but also to neighboring sub-regions (such as adjacent blocks), according to the distance between the sub-regions. The sub-region feature vector generation unit 414 may further add weight values to the gradient directions immediately before and after the quantized gradient direction. Note that the feature vector of a sub-region is not limited to a gradient direction histogram, and may be anything that has a plurality of dimensions (elements), such as color information. In the present embodiment, a gradient direction histogram is used as the feature vector of the sub-region.
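The two quantization formulas and the histogram accumulation can be sketched in Python as follows. This is only an illustrative sketch: the function names and the default D = 6 are assumptions for the example, and the neighbor-weighting variants mentioned in the text are not implemented. Because Python's built-in `round()` rounds halves to the nearest even integer, the sketch uses `floor(x + 0.5)` to obtain the half-up rounding described for Equation (2).

```python
import math


def quantize_floor(g, d):
    """Equation (1): Q = floor(G * D / 2*pi), for G in [0, 2*pi)."""
    return math.floor(g * d / (2 * math.pi))


def quantize_round(g, d):
    """Equation (2): Q = round(G * D / 2*pi) mod D, rounding half up."""
    # floor(x + 0.5) is used because Python's round() rounds halves to even.
    return math.floor(g * d / (2 * math.pi) + 0.5) % d


def gradient_histogram(directions, magnitudes=None, d=6):
    """Accumulate a D-bin histogram; add magnitudes instead of counts if given."""
    histogram = [0.0] * d
    for i, g in enumerate(directions):
        weight = 1.0 if magnitudes is None else magnitudes[i]
        histogram[quantize_round(g, d)] += weight
    return histogram


print(quantize_floor(2.0, 6), quantize_round(2.0, 6))  # prints: 1 2
```

Equation (1) assigns each direction to the bin whose lower edge it has passed, while Equation (2) assigns it to the nearest bin center and wraps directions just below 2π back to bin 0.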

（次元選定部） (Dimension selection unit)
次に、図４Ｃ〜図４Ｆに従って、局所特徴量生成部３２０における、次元選定部４１５に処理を説明する。 Next, the processing of the dimension selection unit 415 in the local feature amount generation unit 320 will be described with reference to FIGS. 4C to 4F.

次元選定部４１５は、サブ領域の位置関係に基づいて、近接するサブ領域の特徴ベクトル間の相関が低くなるように、局所特徴量として出力する次元（要素）を選定する（間引きする）。より具体的には、次元選定部４１５は、例えば、隣接するサブ領域間では少なくとも１つの勾配方向が異なるように次元を選定する。なお、本実施形態では、次元選定部４１５は近接するサブ領域として主に隣接するサブ領域を用いることとするが、近接するサブ領域は隣接するサブ領域に限られず、例えば、対象のサブ領域から所定距離内にあるサブ領域を近接するサブ領域とすることもできる。 The dimension selection unit 415 selects (decimates) a dimension (element) to be output as a local feature amount based on the positional relationship between the sub-regions so that the correlation between feature vectors of adjacent sub-regions becomes low. More specifically, for example, the dimension selection unit 415 selects dimensions so that at least one gradient direction differs between adjacent sub-regions. In the present embodiment, the dimension selection unit 415 mainly uses adjacent sub-regions as adjacent sub-regions. However, the adjacent sub-regions are not limited to adjacent sub-regions, for example, from the target sub-region. A sub-region within a predetermined distance may be a nearby sub-region.

図４Ｃは、局所領域を５×５ブロックのサブ領域に分割し、勾配方向を６方向４３１ａに量子化して生成された１５０次元の勾配ヒストグラムの特徴ベクトル４３１から次元を選定する場合の一例を示す図である。図４Ｃの例では、１５０次元（５×５＝２５サブ領域ブロック×６方向）の特徴ベクトルから次元の選定が行われている。 FIG. 4C is a diagram showing an example of selecting dimensions from the feature vector 431 of a 150-dimensional gradient histogram, generated by dividing the local region into 5 × 5 sub-region blocks and quantizing the gradient direction into six directions 431a. In the example of FIG. 4C, dimensions are selected from a 150-dimensional feature vector (5 × 5 = 25 sub-region blocks × 6 directions).

（局所領域の次元選定） (Dimension selection of local area)
図４Ｃは、局所特徴量生成部３２０における、特徴ベクトルの次元数の選定処理の様子を示す図である。 FIG. 4C is a diagram illustrating a state of a feature vector dimension number selection process in the local feature value generation unit 320.

図４Ｃに示すように、次元選定部４１５は、１５０次元の勾配ヒストグラムの特徴ベクトル４３１から半分の７５次元の勾配ヒストグラムの特徴ベクトル４３２を選定する場合に、隣接する左右、上下のサブ領域ブロックでは、同一の勾配方向の次元が選定されないように、次元を選定することができる。 As shown in FIG. 4C, when selecting the 75-dimensional gradient histogram feature vector 432 (half the dimensions) from the 150-dimensional gradient histogram feature vector 431, the dimension selection unit 415 can select dimensions so that dimensions of the same gradient direction are not selected in sub-region blocks that are adjacent horizontally or vertically.

この例では、勾配方向ヒストグラムにおける量子化された勾配方向をｑ（ｑ＝０，１，２，３，４，５）とした場合に、ｑ＝０，２，４の要素を選定するブロックと、ｑ＝１，３，５の要素を選定するサブ領域ブロックとが交互に並んでいる。そして、図４Ｃの例では、隣接するサブ領域ブロックで選定された勾配方向を合わせると、全６方向となっている。 In this example, where the quantized gradient direction in the gradient direction histogram is q (q = 0, 1, 2, 3, 4, 5), blocks in which the elements q = 0, 2, 4 are selected alternate with sub-region blocks in which the elements q = 1, 3, 5 are selected. In the example of FIG. 4C, the gradient directions selected in adjacent sub-region blocks together cover all six directions.
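The alternating pattern for reducing 150 dimensions to 75 can be sketched as a checkerboard over the 5 × 5 blocks. The Python below is a minimal sketch under stated assumptions: the function name is hypothetical, the vector layout uses the element number 6 × p + q given in the description of FIG. 4E, and which block parity receives q = 0, 2, 4 is chosen arbitrarily because the text does not state it.

```python
def select_75_dimensions(feature_150, blocks_per_side=5, directions=6):
    """Keep q = 0, 2, 4 in one checkerboard color and q = 1, 3, 5 in the other."""
    selected = []
    for p in range(blocks_per_side ** 2):
        row, col = divmod(p, blocks_per_side)
        start = (row + col) % 2  # assumption: even-parity blocks keep q = 0, 2, 4
        for q in range(start, directions, 2):
            selected.append(feature_150[directions * p + q])
    return selected


# Use the element numbers themselves as values so the selection is easy to read.
feature_150 = list(range(150))
selected = select_75_dimensions(feature_150)
print(len(selected), selected[:6])  # prints: 75 [0, 2, 4, 7, 9, 11]
```

With this layout, any two horizontally or vertically adjacent blocks keep disjoint sets of gradient directions, and together they cover all six.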

また、次元選定部４１５は、７５次元の勾配ヒストグラムの特徴ベクトル４３２から５０次元の勾配ヒストグラムの特徴ベクトル４３３を選定する場合は、斜め４５度に位置するサブ領域ブロック間で、１つの方向のみが同一になる（残り１つの方向は異なる）ように次元を選定することができる。 When selecting the 50-dimensional gradient histogram feature vector 433 from the 75-dimensional gradient histogram feature vector 432, the dimension selection unit 415 can select dimensions so that only one direction is the same (and the remaining one direction differs) between sub-region blocks located diagonally at 45 degrees.

また、次元選定部４１５は、５０次元の勾配ヒストグラムの特徴ベクトル４３３から２５次元の勾配ヒストグラムの特徴ベクトル４３４を選定する場合は、斜め４５度に位置するサブ領域ブロック間で、選定される勾配方向が一致しないように次元を選定することができる。図４Ｃに示す例では、次元選定部４１５は、１次元から２５次元までは各サブ領域から１つの勾配方向を選定し、２６次元から５０次元までは２つの勾配方向を選定し、５１次元から７５次元までは３つの勾配方向を選定している。 When selecting the 25-dimensional gradient histogram feature vector 434 from the 50-dimensional gradient histogram feature vector 433, the dimension selection unit 415 can select dimensions so that the selected gradient directions do not coincide between sub-region blocks located diagonally at 45 degrees. In the example shown in FIG. 4C, the dimension selection unit 415 selects one gradient direction from each sub-region for dimensions 1 to 25, two gradient directions for dimensions 26 to 50, and three gradient directions for dimensions 51 to 75.

このように、隣接するサブ領域ブロック間で勾配方向が重ならないように、また全勾配方向が均等に選定されることが望ましい。また同時に、図４Ｃに示す例のように、局所領域の全体から均等に次元が選定されることが望ましい。なお、図４Ｃに示した次元選定方法は一例であり、この選定方法に限らない。 In this way, it is desirable that gradient directions do not overlap between adjacent sub-region blocks and that all gradient directions are selected evenly. At the same time, as in the example shown in FIG. 4C, it is desirable that dimensions are selected evenly from the entire local region. Note that the dimension selection method shown in FIG. 4C is an example, and the selection method is not limited to it.

（局所領域の優先順位） (Local area priority)
図４Ｄは、局所特徴量生成部３２０における、サブ領域からの特徴ベクトルの選定順位の一例を示す図である。 FIG. 4D is a diagram illustrating an example of the selection order of feature vectors from sub-regions in the local feature value generation unit 320.

次元選定部４１５は、単に次元を選定するだけではなく、特徴点の特徴に寄与する次元から順に選定するように、選定の優先順位を決定することができる。すなわち、次元選定部４１５は、例えば、隣接するサブ領域ブロック間では同一の勾配方向の次元が選定されないように、優先順位をつけて次元を選定することができる。そして、次元選定部４１５は、選定した次元から構成される特徴ベクトルを、局所特徴量として出力する。なお、次元選定部４１５は、優先順位に基づいて次元を並び替えた状態で、局所特徴量を出力することができる。 The dimension selection unit 415 can not only select dimensions but also determine a selection priority, so that dimensions are selected in order starting from those that contribute most to the characteristics of the feature point. That is, the dimension selection unit 415 can, for example, select dimensions with priorities such that dimensions of the same gradient direction are not selected in adjacent sub-region blocks. The dimension selection unit 415 then outputs a feature vector composed of the selected dimensions as the local feature amount. Note that the dimension selection unit 415 can output the local feature amount with the dimensions rearranged according to the priority order.

すなわち、次元選定部４１５は、１〜２５次元、２６次元〜５０次元、５１次元〜７５次元の間は、例えば図４Ｄの４４１に示すようなサブ領域ブロックの順番で次元を追加するように選定していってもよい。図４Ｄの４４１に示す優先順位を用いる場合、次元選定部４１５は、中心に近いサブ領域ブロックの優先順位を高くして、勾配方向を選定していくことができる。 That is, within dimensions 1 to 25, 26 to 50, and 51 to 75, the dimension selection unit 415 may add dimensions in the order of sub-region blocks shown, for example, in 441 of FIG. 4D. When the priority order shown in 441 of FIG. 4D is used, the dimension selection unit 415 can select gradient directions while giving higher priority to sub-region blocks closer to the center.
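One way to obtain a center-first block order is to sort the raster-scan block numbers by their distance from the center of the 5 × 5 grid. The sketch below is hypothetical: the exact order drawn in 441 of FIG. 4D is not reproduced in the text, so the squared Euclidean distance and the tie-breaking by block number are assumptions chosen only to illustrate the idea.

```python
def center_first_block_order(blocks_per_side=5):
    """Return raster-scan block numbers sorted from the center outwards."""
    center = (blocks_per_side - 1) / 2

    def squared_distance(p):
        row, col = divmod(p, blocks_per_side)
        return (row - center) ** 2 + (col - center) ** 2

    # Assumption: ties are broken by the smaller block number.
    return sorted(range(blocks_per_side ** 2), key=lambda p: (squared_distance(p), p))


order = center_first_block_order()
print(order[:5])  # prints: [12, 7, 11, 13, 17]
```

The center block (number 12 when blocks are numbered from 0) comes first, followed by its four direct neighbors, and the corner blocks come last.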

図４Ｅの４５１は、図４Ｄの選定順位に従って、１５０次元の特徴ベクトルの要素の番号の一例を示す図である。この例では、５×５＝２５ブロックをラスタスキャン順に番号ｐ（ｐ＝０，１，…，２５）で表し、量子化された勾配方向をｑ（ｑ＝０，１，２，３，４，５）とした場合に、特徴ベクトルの要素の番号を６×ｐ＋ｑとしている。 451 in FIG. 4E is a diagram illustrating an example of element numbers of 150-dimensional feature vectors in accordance with the selection order in FIG. 4D. In this example, 5 × 5 = 25 blocks are represented by numbers p (p = 0, 1,..., 25) in raster scan order, and the quantized gradient direction is represented by q (q = 0, 1, 2, 3, 4). , 5), the element number of the feature vector is 6 × p + q.
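The element numbering 6 × p + q and its inverse can be sketched as follows. The sketch assumes that p runs from 0 to 24, since 25 blocks × 6 directions gives exactly 150 elements numbered 0 to 149. The function names are hypothetical, and reading the value 76 (which the description of FIG. 4F lists as the first output element) as a zero-based element number is likewise an assumption.

```python
DIRECTIONS = 6
BLOCKS_PER_SIDE = 5


def element_number(p, q):
    """Element number 6 * p + q for block p (raster order) and direction q."""
    if not (0 <= p < BLOCKS_PER_SIDE ** 2 and 0 <= q < DIRECTIONS):
        raise ValueError("p or q out of range")
    return DIRECTIONS * p + q


def decode_element(n):
    """Return (block number, direction, block row, block column) for element n."""
    p, q = divmod(n, DIRECTIONS)
    row, col = divmod(p, BLOCKS_PER_SIDE)
    return p, q, row, col


print(element_number(24, 5))  # prints: 149
print(decode_element(76))     # prints: (12, 4, 2, 2)
```

Under these assumptions, element 76 belongs to block 12, the center block of the 5 × 5 grid, which is consistent with giving higher priority to blocks near the center.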

図４Ｆの４６１は、図４Ｅの選定順位による１５０次元の順位が、２５次元単位に階層化されていることを示す図である。すなわち、図４Ｆの４６１は、図４Ｄの４４１に示した優先順位に従って図４Ｅに示した要素を選定していくことにより得られる局所特徴量の構成例を示す図である。次元選定部４１５は、図４Ｆに示す順序で次元要素を出力することができる。具体的には、次元選定部４１５は、例えば１５０次元の局所特徴量を出力する場合、図４Ｆに示す順序で全１５０次元の要素を出力することができる。また、次元選定部４１５は、例えば２５次元の局所特徴量を出力する場合、図４Ｆに示す１行目（７６番目、４５番目、８３番目、…、１２０番目）の要素４７１を図４Ｆに示す順（左から右）に出力することができる。また、次元選定部４１５は、例えば５０次元の局所特徴量を出力する場合、図４Ｆに示す１行目に加えて、図４Ｆに示す２行目の要素４７２を図４Ｆに示す順（左から右）に出力することができる。 461 in FIG. 4F is a diagram showing that the order of the 150 dimensions according to the selection order of FIG. 4E is layered in units of 25 dimensions. That is, 461 in FIG. 4F shows a configuration example of the local feature amount obtained by selecting the elements shown in FIG. 4E according to the priority order shown in 441 of FIG. 4D. The dimension selection unit 415 can output the dimension elements in the order shown in FIG. 4F. Specifically, when outputting a 150-dimensional local feature amount, for example, the dimension selection unit 415 can output all 150 elements in the order shown in FIG. 4F. When outputting a 25-dimensional local feature amount, for example, the dimension selection unit 415 can output the elements 471 in the first row of FIG. 4F (76th, 45th, 83rd, ..., 120th) in the order shown in FIG. 4F (from left to right). When outputting a 50-dimensional local feature amount, for example, the dimension selection unit 415 can output, in addition to the first row of FIG. 4F, the elements 472 in the second row of FIG. 4F in the order shown in FIG. 4F (from left to right).

ところで、図４Ｆに示す例では、局所特徴量は階層的な構造となっている。すなわち、例えば、２５次元の局所特徴量と１５０次元の局所特徴量とにおいて、先頭の２５次元分の局所特徴量における要素４７１〜４７６の並びは同一となっている。このように、次元選定部４１５は、階層的（プログレッシブ）に次元を選定することにより、アプリケーションや通信容量、端末スペックなどに応じて、任意の次元数の局所特徴量、すなわち任意のサイズの局所特徴量を抽出して出力することができる。また、次元選定部４１５が、階層的に次元を選定し、優先順位に基づいて次元を並び替えて出力することにより、異なる次元数の局所特徴量を用いて、画像の照合を行うことができる。例えば、７５次元の局所特徴量と５０次元の局所特徴量を用いて画像の照合が行われる場合、先頭の５０次元だけを用いることにより、局所特徴量間の距離計算を行うことができる。 In the example shown in FIG. 4F, the local feature amount has a hierarchical structure. That is, for example, the arrangement of the elements 471 to 476 in the leading 25 dimensions is identical in the 25-dimensional local feature amount and the 150-dimensional local feature amount. By selecting dimensions hierarchically (progressively) in this way, the dimension selection unit 415 can extract and output a local feature amount with any number of dimensions, that is, of any size, according to the application, communication capacity, terminal specifications, and so on. In addition, because the dimension selection unit 415 selects dimensions hierarchically and outputs them rearranged according to the priority order, images can be collated using local feature amounts with different numbers of dimensions. For example, when images are collated using a 75-dimensional local feature amount and a 50-dimensional local feature amount, the distance between the local feature amounts can be calculated using only the leading 50 dimensions.
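Because the dimensions are output in priority order, two local feature amounts of different lengths can be compared on their shared leading dimensions. The following Python sketch illustrates this with a Euclidean distance; the function name is hypothetical and the choice of Euclidean distance is an assumption, since the text does not specify the distance measure.

```python
import math


def prefix_distance(feature_a, feature_b):
    """Distance over the leading dimensions that both features have in common."""
    dims = min(len(feature_a), len(feature_b))
    return math.sqrt(
        sum((a - b) ** 2 for a, b in zip(feature_a[:dims], feature_b[:dims]))
    )


# Hypothetical 75-dimensional and 50-dimensional local feature amounts.
feature_75 = [3.0] + [0.0] * 74
feature_50 = [0.0] * 49 + [4.0]
print(prefix_distance(feature_75, feature_50))  # prints: 5.0
```

Only the first 50 elements of the 75-dimensional vector take part in the calculation, which is what allows a terminal and a database to use different sizes without regenerating either feature.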

なお、図４Ｄの４４１から図４Ｆに示す優先順位は一例であり、次元を選定する際の順序はこれに限られない。例えば、ブロックの順番に関しては、図４Ｄの４４１の例の他に、図４Ｄの４４２や図４Ｄの４４３に示すような順番でもよい。また、例えば、すべてのサブ領域からまんべんなく次元が選定されるように優先順位が定められることとしてもよい。また、局所領域の中央付近が重要として、中央付近のサブ領域の選定頻度が高くなるように優先順位が定められることとしてもよい。また、次元の選定順序を示す情報は、例えば、プログラムにおいて規定されていてもよいし、プログラムの実行時に参照されるテーブル等（選定順序記憶部）に記憶されていてもよい。 Note that the priority orders shown in 441 of FIG. 4D through FIG. 4F are examples, and the order in which dimensions are selected is not limited to these. For example, the order of the blocks may be as shown in 442 or 443 of FIG. 4D instead of the example in 441 of FIG. 4D. The priority order may also be set so that dimensions are selected evenly from all sub-regions. Alternatively, treating the vicinity of the center of the local region as important, the priority order may be set so that sub-regions near the center are selected more frequently. Information indicating the dimension selection order may, for example, be defined in the program, or may be stored in a table or the like (selection order storage unit) that is referenced when the program is executed.

また、次元選定部４１５は、サブ領域ブロックを１つ飛びに選択して、次元の選定を行ってもよい。すなわち、あるサブ領域では６次元が選定され、当該サブ領域に近接する他のサブ領域では０次元が選定される。このような場合においても、近接するサブ領域間の相関が低くなるようにサブ領域ごとに次元が選定されていると言うことができる。 The dimension selection unit 415 may also select dimensions by choosing every other sub-region block. That is, six dimensions are selected in one sub-region and zero dimensions are selected in the other sub-regions neighboring it. Even in this case, it can be said that dimensions are selected for each sub-region so that the correlation between neighboring sub-regions is low.

また、局所領域やサブ領域の形状は、正方形に限られず、任意の形状とすることができる。例えば、局所領域取得部４１２が、円状の局所領域を取得することとしてもよい。この場合、サブ領域分割部４１３は、円状の局所領域を例えば複数の局所領域を有する同心円に９分割や１７分割のサブ領域に分割することができる。この場合においても、次元選定部４１５は、各サブ領域において、次元を選定することができる。 The shapes of the local region and the sub-regions are not limited to squares and may be arbitrary. For example, the local region acquisition unit 412 may acquire a circular local region. In this case, the sub-region division unit 413 can divide the circular local region into concentric circles containing multiple regions, giving, for example, 9 or 17 sub-regions. Even in this case, the dimension selection unit 415 can select dimensions in each sub-region.

以上、図４Ｂ〜図４Ｆに示したように、本実施形態の局所特徴量生成部３２０によれば、局所特徴量の情報量を維持しながら生成された特徴ベクトルの次元が階層的に選定される。この処理により、認識精度を維持しながらリアルタイムでの対象物認識と認識結果の表示が可能となる。なお、局所特徴量生成部３２０の構成および処理は本例に限定されない。認識精度を維持しながらリアルタイムでの対象物認識と認識結果の表示が可能となる他の処理が当然に適用できる。 As described above and shown in FIGS. 4B to 4F, the local feature amount generation unit 320 of the present embodiment hierarchically selects the dimensions of the generated feature vector while maintaining the information content of the local feature amount. This processing enables real-time object recognition and display of the recognition result while maintaining recognition accuracy. Note that the configuration and processing of the local feature amount generation unit 320 are not limited to this example. Other processing that enables real-time object recognition and display of the recognition result while maintaining recognition accuracy can naturally be applied.

《照合部》 << Collation unit >>
図４Ｇは、本実施形態に係る照合部３４０の処理を示す図である。 FIG. 4G is a diagram illustrating processing of the collation unit 340 according to the present embodiment.

図４Ｇは、図２の映像中の建築物を認識する照合例を示す図である。あらかじめ認識対象物（本例では、スカイツリー、○○ビル、××鉄道、△△体育館を含む建物）から本実施形態に従い生成された局所特徴量は、局所特徴量ＤＢ３３０に格納されている。一方、左図の携帯端末としての映像処理装置２００の表示画面２３０中の映像表示領域２３１からは、本実施形態に従い局所特徴量が生成される。そして、局所特徴量ＤＢ３３０に格納された局所特徴量のそれぞれが、映像表示領域２３１から生成された局所特徴量中にあるか否かが照合される。 FIG. 4G is a diagram showing a collation example for recognizing the buildings in the video of FIG. 2. Local feature amounts generated in advance according to the present embodiment from the recognition objects (in this example, buildings including the Skytree, the ○○ Building, the ×× Railway, and the △△ Gymnasium) are stored in the local feature amount DB 330. Meanwhile, local feature amounts are generated according to the present embodiment from the video display area 231 in the display screen 230 of the video processing device 200, shown as the mobile terminal in the left diagram. Each of the local feature amounts stored in the local feature amount DB 330 is then checked to determine whether it is present among the local feature amounts generated from the video display area 231.

図４Ｇに示すように、照合部３４０は、局所特徴量ＤＢ３３０に格納されている局所特徴量４８１〜４８４と局所特徴量が合致する各特徴点を細線にように関連付ける。なお、照合部３４０は、局所特徴量の所定割合以上が一致する場合を特徴点の合致とする。そして、照合部３４０は、関連付けられた特徴点の位置関係が線形関係であれば、認識対象物であると認識する。このような対応する特徴点の集合による認識を行なえば、サイズの大小や向きの違い（視点の違い）、あるいは反転などによっても認識が可能である。また、所定数以上の関連付けられた特徴点があれば認識精度が得られるので、一部が視界から隠れていても認識対象物の認識が可能である。ここで、認識のための局所特徴量の合致条件や特徴点数の条件は、同じであってもよいし、認識対象が異なる場合は異なる条件を設定してもよい。 As shown in FIG. 4G, the collation unit 340 associates each feature point whose local feature amount matches the local feature amounts 481 to 484 stored in the local feature amount DB 330, as indicated by the thin lines. Note that the collation unit 340 treats feature points as matching when a predetermined ratio or more of the local feature amounts agree. The collation unit 340 then recognizes the recognition object if the positional relationship of the associated feature points is linear. Recognition based on such a set of corresponding feature points remains possible despite differences in size or orientation (differences in viewpoint), or inversion. In addition, since recognition accuracy is obtained as long as there are a predetermined number or more of associated feature points, the recognition object can be recognized even if part of it is hidden from view. Here, the matching condition for local feature amounts and the condition on the number of feature points used for recognition may be the same for all objects, or different conditions may be set for different recognition targets.
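The ratio-based part of this collation can be sketched as follows. This is a simplified illustration, not the collation unit itself: it compares stored and query features on the smaller of the two dimensionalities, counts a stored feature as matched when some query feature lies within a distance threshold, and reports presence when the matched fraction reaches a minimum ratio. The function names, the Euclidean distance, and the threshold values are all assumptions, and the check on the positional (linear) relationship of the associated feature points is omitted.

```python
import math


def distance(a, b, dims):
    """Euclidean distance over the first `dims` elements (an assumed measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a[:dims], b[:dims])))


def matched_ratio(stored, query, max_distance):
    """Fraction of stored features that have a query feature within max_distance."""
    if not stored or not query:
        return 0.0
    dims = min(len(stored[0]), len(query[0]))  # use the smaller dimensionality
    matched = sum(
        1 for s in stored
        if min(distance(s, q, dims) for q in query) <= max_distance
    )
    return matched / len(stored)


def is_present(stored, query, max_distance=0.5, min_ratio=0.5):
    """Hypothetical decision rule: enough stored features found in the query."""
    return matched_ratio(stored, query, max_distance) >= min_ratio


# Hypothetical data: 3-dimensional stored features, 2-dimensional query features.
stored = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [5.0, 5.0, 5.0], [9.0, 9.0, 9.0]]
query = [[0.0, 0.0], [1.0, 1.2], [5.1, 5.0]]
print(matched_ratio(stored, query, 0.5))  # prints: 0.75
```

A fuller implementation would additionally verify that the matched feature point coordinates are related by a consistent geometric transformation before declaring the object recognized.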

なお、本実施形態の照合部３４０の照合処理では、特徴点座標と局所特徴量とに基づいて照合が行なわれるが、合致する認識対象物から生成された局所特徴量と映像中の画像から生成された局所特徴量との配列順序の線形関係のみによっても、認識が可能である。一方、本実施形態では、２次元画像によって説明されているが、３次元の特徴点座標を使用しても、同様の処理が可能である。 In the collation processing of the collation unit 340 of the present embodiment, collation is performed based on the feature point coordinates and the local feature amounts. However, recognition is also possible based solely on a linear relationship in the arrangement order between the local feature amounts generated from the matching recognition object and the local feature amounts generated from the image in the video. Although the present embodiment is described using two-dimensional images, the same processing is possible using three-dimensional feature point coordinates.

（局所特徴量生成データ） (Local feature generation data)
図５は、本実施形態に係る局所特徴量生成データ５００の構成を示す図である。これらのデータは、図７のＲＡＭ７４０に記憶保持される。 FIG. 5 is a diagram showing a configuration of local feature value generation data 500 according to the present embodiment. These data are stored and held in the RAM 740 of FIG. 7.

局所特徴量生成データ５００には、入力画像ＩＤ５０１に対応付けて、複数の検出された検出特徴点５０２，特徴点座標５０３および特徴点に対応する局所領域情報５０４が記憶される。そして、各検出特徴点５０２，特徴点座標５０３および局所領域情報５０４に対応付けて、複数のサブ領域ＩＤ５０５，サブ領域情報５０６，各サブ領域に対応する特徴ベクトル５０７および優先順位を含む選定次元５０８が記憶される。 The local feature amount generation data 500 stores, in association with an input image ID 501, a plurality of detected feature points 502, feature point coordinates 503, and local region information 504 corresponding to the feature points. In association with each detected feature point 502, its feature point coordinates 503, and its local region information 504, it also stores a plurality of sub-region IDs 505, sub-region information 506, a feature vector 507 corresponding to each sub-region, and selected dimensions 508 including the priority order.

以上のデータから各検出特徴点５０２に対して生成された局所特徴量５０９が記憶される。 A local feature quantity 509 generated for each detected feature point 502 from the above data is stored.
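The nesting of the local feature amount generation data 500 can be pictured with a few Python dataclasses. The class and field names below are hypothetical, as are the string representations of the region information and the interpretation of the selected dimensions as indices into a sub-region's feature vector; only the reference numerals in the comments come from the text.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SubRegionRecord:
    sub_region_id: int                # 505
    sub_region_info: str              # 506 (contents are not specified in the text)
    feature_vector: List[float]       # 507
    selected_dimensions: List[int]    # 508, assumed to be indices in priority order


@dataclass
class FeaturePointRecord:
    feature_point_id: int             # 502
    coordinates: Tuple[float, float]  # 503
    local_region_info: str            # 504
    sub_regions: List[SubRegionRecord] = field(default_factory=list)
    local_feature: List[float] = field(default_factory=list)  # 509


@dataclass
class LocalFeatureGenerationData:
    input_image_id: str               # 501
    feature_points: List[FeaturePointRecord] = field(default_factory=list)


# Hypothetical example with a single feature point and a single sub-region.
sub = SubRegionRecord(0, "4x4 pixels", [0.1, 0.0, 0.3, 0.0, 0.2, 0.4], [0, 2, 4])
point = FeaturePointRecord(1, (120.0, 64.0), "square, 20x20 pixels", [sub])
point.local_feature = [sub.feature_vector[q] for q in sub.selected_dimensions]
data = LocalFeatureGenerationData("image-0001", [point])
print(data.feature_points[0].local_feature)  # prints: [0.1, 0.3, 0.2]
```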

（局所特徴量ＤＢ） (Local feature DB)
図６は、本実施形態に係る局所特徴量ＤＢ３３０の構成を示す図である。 FIG. 6 is a diagram showing a configuration of the local feature DB 330 according to the present embodiment.

局所特徴量ＤＢ３３０において、それぞれの認識対象物の局所特徴量は、認識対象物の認識に適切な、異なる特徴点数および／または異なる特徴ベクトルの次元数で格納されている。 In the local feature amount DB 330, the local feature amounts of each recognition object are stored with a number of feature points and/or a number of feature vector dimensions that is appropriate for recognizing that object, and these numbers may differ between objects.

例えば、局所特徴量ＤＢ３３１は、他の認識対象物との相関が低いので特徴点数２５／次元数２５により局所特徴量が格納されている。また、局所特徴量ＤＢ３３２は、他の認識対象物との相関が少しあるので特徴点数５０／次元数２５により局所特徴量が格納されている。また、局所特徴量ＤＢ３３３は、他の認識対象物との相関があるで特徴点数１００／次元数５０により局所特徴量が格納されている。また、局所特徴量ＤＢ３３４は、他の認識対象物との相関が高いので特徴点数１５０／次元数５０により局所特徴量が格納されている。 For example, the local feature amount DB 331 stores local feature amounts with 25 feature points and 25 dimensions because its correlation with other recognition objects is low. The local feature amount DB 332 stores local feature amounts with 50 feature points and 25 dimensions because it has a slight correlation with other recognition objects. The local feature amount DB 333 stores local feature amounts with 100 feature points and 50 dimensions because it has some correlation with other recognition objects. The local feature amount DB 334 stores local feature amounts with 150 feature points and 50 dimensions because its correlation with other recognition objects is high.

各局所特徴量ＤＢ３３１〜３３４には、認識対象物ＩＤと認識対象物名に対応付けて、第１番局所特徴量から第ｍ番局所特徴量を記憶する。なお、ｍは正の整数であり、認識対象物に対応して異なる数でよい。また、本実施形態においては、それぞれの局所特徴量と共に照合処理に使用される特徴点座標が記憶される。 Each of the local feature amount DBs 331 to 334 stores the first through m-th local feature amounts in association with a recognition object ID and a recognition object name. Here, m is a positive integer and may differ for each recognition object. In the present embodiment, the feature point coordinates used in the collation processing are stored together with each local feature amount.
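A database entry of this kind, with a per-object number of dimensions and a variable number of stored features, might be modeled as below. This is only a sketch: the class names, the dictionary keyed by object ID, and the object ID and name used in the example are hypothetical placeholders and do not appear in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class StoredFeature:
    coordinates: Tuple[float, float]  # feature point coordinates used for collation
    vector: List[float]               # local feature amount


@dataclass
class RecognitionObjectEntry:
    object_id: str
    name: str
    dimensions: int                   # number of dimensions chosen for this object
    features: List[StoredFeature] = field(default_factory=list)

    def add_feature(self, coordinates, vector):
        """Store one local feature amount, enforcing the per-object dimensionality."""
        if len(vector) != self.dimensions:
            raise ValueError("vector length does not match this entry's dimensions")
        self.features.append(StoredFeature(coordinates, list(vector)))


# Hypothetical database keyed by recognition object ID.
local_feature_db: Dict[str, RecognitionObjectEntry] = {}
entry = RecognitionObjectEntry("obj-001", "Example Tower", dimensions=25)
entry.add_feature((10.0, 20.0), [0.0] * 25)
local_feature_db[entry.object_id] = entry
print(len(local_feature_db["obj-001"].features))  # prints: 1
```

Keeping the dimensionality on the entry rather than on the database as a whole reflects the point that objects that are easy to tell apart can be stored with fewer feature points and fewer dimensions.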

《映像処理装置のハードウェア構成》 << Hardware configuration of video processing device >>
図７は、本実施形態に係る映像処理装置２００のハードウェア構成を示すブロック図である。 FIG. 7 is a block diagram showing a hardware configuration of the video processing apparatus 200 according to the present embodiment.

図７で、ＣＰＵ７１０は演算制御用のプロセッサであり、プログラムを実行することで携帯端末である映像処理装置２００の各機能構成部を実現する。ＲＯＭ７２０は、初期データおよびプログラムなどの固定データおよびプログラムを記憶する。また、通信制御部３９０は通信制御部であり、本実施形態においては、ネットワークを介して他の装置と通信する。なお、ＣＰＵ７１０は１つに限定されず、複数のＣＰＵであっても、あるいは画像処理用のＧＰＵ（Graphics Processing Unit）を含んでもよい。 In FIG. 7, a CPU 710 is a processor for arithmetic control, and implements each functional component of the video processing device 200, which is a mobile terminal, by executing programs. The ROM 720 stores fixed data and programs, such as initial data and programs. The communication control unit 390 communicates with other devices via a network in the present embodiment. Note that the CPU 710 is not limited to a single CPU; there may be a plurality of CPUs, and a GPU (Graphics Processing Unit) for image processing may be included.

ＲＡＭ７４０は、ＣＰＵ７１０が一時記憶のワークエリアとして使用するランダムアクセスメモリである。ＲＡＭ７４０には、本実施形態の実現に必要なデータを記憶する領域が確保されている。入力映像７４１は、撮像部３１０が撮像して入力された入力映像を記憶する領域である。特徴点データ７４２は、入力映像７４１から検出した特徴点座標、スケール、角度を含む特徴点データを記憶する領域である。局所特徴量生成テーブル５００は、図５で示した局所特徴量生成テーブルを記憶する領域である。照合結果７４３は、入力映像から生成された局所特徴量と局所特徴量ＤＢ３３０に格納された局所特徴量との照合から認識された、複数の認識対象物を含む照合結果を記憶する領域である。照合結果表示データ７４４は、照合結果７４３をユーザに報知するための照合結果表示データを記憶する領域である。なお、音声出力をする場合には、照合結果音声データが含まれてもよい。入力映像／照合結果重畳データ７４５は、入力映像７４１に照合結果７４３を重畳した表示部３６０に表示される入力映像／照合結果重畳データを記憶する領域である。入出力データ７４６は、入出力インタフェース７６０を介して入出力される入出力データを記憶する領域である。送受信データ７４７は、通信制御部３９０を介して送受信される送受信データを記憶する領域である。 The RAM 740 is a random access memory that the CPU 710 uses as a work area for temporary storage. In the RAM 740, an area for storing data necessary for realizing the present embodiment is secured. The input video 741 is an area for storing the input video captured and input by the imaging unit 310. The feature point data 742 is an area for storing feature point data including feature point coordinates, scales, and angles detected from the input video 741. The local feature quantity generation table 500 is an area for storing the local feature quantity generation table shown in FIG. 5. The collation result 743 is an area for storing a collation result including a plurality of recognition objects recognized by collating the local feature amounts generated from the input video with the local feature amounts stored in the local feature amount DB 330. The collation result display data 744 is an area for storing collation result display data for notifying the user of the collation result 743. When audio output is used, collation result audio data may also be included. The input video / collation result superimposition data 745 is an area for storing the data, displayed on the display unit 360, in which the collation result 743 is superimposed on the input video 741. The input / output data 746 is an area for storing input / output data input / output via the input / output interface 760. Transmission / reception data 747 is an area for storing transmission / reception data transmitted / received via the communication control unit 390.

ストレージ７５０には、データベースや各種のパラメータ、あるいは本実施形態の実現に必要な以下のデータまたはプログラムが記憶されている。局所特徴量ＤＢ３３０は、図６に示した局所特徴量ＤＢが格納される領域である。照合結果表示フォーマット７５１は、照合結果を表示するフォーマットを生成するために使用される照合結果表示フォーマットが格納される領域である。 The storage 750 stores a database, various parameters, or the following data or programs necessary for realizing the present embodiment. The local feature DB 330 is an area in which the local feature DB shown in FIG. 6 is stored. The collation result display format 751 is an area in which a collation result display format used for generating a format for displaying the collation result is stored.

ストレージ７５０には、以下のプログラムが格納される。携帯端末制御プログラム７５２は、本映像処理装置２００の全体を制御する携帯端末制御プログラムが格納される領域である。局所特徴量生成モジュール７５３は、携帯端末制御プログラム７５２において、入力映像から図４Ｂ〜図４Ｆに従って局所特徴量を生成する局所特徴量生成モジュールが格納される領域である。照合制御モジュール７５４は、携帯端末制御プログラム７５２において、入力映像から生成された局所特徴量と局所特徴量ＤＢ３３０に格納された局所特徴量とを照合する照合制御モジュールが格納される領域である。照合結果報知モジュール７５５は、照合結果を表示または音声によりユーザに報知するための照合結果報知モジュールが格納される領域である。 The storage 750 stores the following programs. The portable terminal control program 752 is an area in which a portable terminal control program that controls the entire video processing apparatus 200 is stored. The local feature value generation module 753 is an area in which a local feature value generation module that generates a local feature value from an input video according to FIGS. 4B to 4F in the mobile terminal control program 752 is stored. The collation control module 754 is an area in which the collation control module that collates the local feature amount generated from the input video and the local feature amount stored in the local feature amount DB 330 in the portable terminal control program 752 is stored. The verification result notification module 755 is an area in which a verification result notification module for displaying the verification result to the user by display or voice is stored.

入出力インタフェース７６０は、入出力機器との入出力データをインタフェースする。入出力インタフェース７６０には、表示部３６０、操作部３７０であるタッチパネルやキーボード、スピーカ７６４、マイク７６５、撮像部３１０が接続される。入出力機器は上記例に限定されない。また、ＧＰＳ(Global Positioning System)位置生成部７６６は、ＧＰＳ衛星からの信号に基づいて現在位置を取得する。 The input / output interface 760 interfaces input / output data with input / output devices. The input / output interface 760 is connected with a display unit 360, a touch panel and keyboard as the operation unit 370, a speaker 764, a microphone 765, and an imaging unit 310. The input / output device is not limited to the above example. In addition, a GPS (Global Positioning System) position generation unit 766 acquires a current position based on a signal from a GPS satellite.

なお、図７には、本実施形態に必須なデータやプログラムのみが示されており、本実施形態に関連しないデータやプログラムは図示されていない。 FIG. 7 shows only data and programs essential to the present embodiment, and does not illustrate data and programs not related to the present embodiment.

《映像処理装置の処理手順》
図８は、本実施形態に係る映像処理装置２００の処理手順を示すフローチャートである。このフローチャートは、図７のＣＰＵ７１０によってＲＡＭ７４０を用いて実行され、図３の各機能構成部を実現する。
《Processing procedure of video processing device》
FIG. 8 is a flowchart showing the processing procedure of the video processing apparatus 200 according to the present embodiment. This flowchart is executed by the CPU 710 in FIG. 7 using the RAM 740, and implements each functional component in FIG. 3.

まず、ステップＳ８１１において、対象物認識を行なうための映像入力があったか否かを判定する。また、携帯端末の機能として、ステップＳ８２１においては受信を判定し、ステップＳ８３１においては送信を判定する。いずれでもなければ、ステップＳ８４１において他の処理を行なう。 First, in step S811, it is determined whether or not there is a video input for performing object recognition. As a function of the mobile terminal, reception is determined in step S821, and transmission is determined in step S831. Otherwise, other processing is performed in step S841.

映像入力があればステップＳ８１３に進んで、入力映像から局所特徴量生成処理を実行する（図９Ａ参照）。次に、ステップＳ８１５において、照合処理を実行する（図９Ｂ参照）。ステップＳ８１７においては、照合処理の結果を入力映像に重畳して映像／照合結果重畳表示処理を実行する。ステップＳ８１９において、対象物認識を行なう処理を終了するかを判定する。終了は、例えば図２の指示ボタン表示領域２１２にあるリセットボタンで行なわれる。終了でなければステップＳ８１３に戻って、映像入力の対象物認識を繰り返す。 If there is video input, the process proceeds to step S813, and local feature generation processing is executed from the input video (see FIG. 9A). Next, in step S815, collation processing is executed (see FIG. 9B). In step S817, the result of the collation process is superimposed on the input video, and the video / collation result superimposed display process is executed. In step S819, it is determined whether or not to finish the object recognition process. The end is performed, for example, by a reset button in the instruction button display area 212 of FIG. If not completed, the process returns to step S813 to repeat the object recognition for video input.

受信であり、局所特徴量ＤＢ用のデータをダウンロードする場合は、ステップＳ８２３において局所特徴量ＤＢ用データを受信して、ステップＳ８２５において局所特徴量ＤＢに記憶する。一方、その他の携帯端末としてのデータ受信であれば、ステップＳ８２７において受信処理を行なう。また、送信であり、局所特徴量ＤＢ用のデータをアップロードする場合は、ステップＳ８３３において入力映像から生成した局所特徴量を局所特徴量ＤＢ用データとして送信する。一方、その他の携帯端末としてのデータ送信であれば、ステップＳ８３５において送信処理を行なう。携帯端末としてのデータ送受信処理については、本実施形態の特徴ではないので詳細な説明は省略する。 When receiving and downloading local feature DB data, the local feature DB data is received in step S823 and stored in the local feature DB in step S825. On the other hand, if it is data reception as another portable terminal, reception processing is performed in step S827. In addition, when uploading local feature DB data, the local feature generated from the input video is transmitted as local feature DB data in step S833. On the other hand, if it is data transmission as another portable terminal, transmission processing is performed in step S835. The data transmission / reception processing as a portable terminal is not a feature of the present embodiment, and thus detailed description thereof is omitted.

（局所特徴量生成処理）
図９Ａは、本実施形態に係る局所特徴量生成処理Ｓ８１３の処理手順を示すフローチャートである。
(Local feature generation processing)
FIG. 9A is a flowchart illustrating the processing procedure of the local feature generation processing S813 according to the present embodiment.

まず、ステップＳ９１１において、入力映像から特徴点の位置座標、スケール、角度を検出する。ステップＳ９１３において、ステップＳ９１１で検出された特徴点の１つに対して局所領域を取得する。次に、ステップＳ９１５において、局所領域をサブ領域に分割する。ステップＳ９１７においては、各サブ領域の特徴ベクトルを生成して局所領域の特徴ベクトルを生成する。ステップＳ９１１からＳ９１７の処理は図４Ｂに図示されている。 First, in step S911, the position coordinates, scale, and angle of the feature points are detected from the input video. In step S913, a local region is acquired for one of the feature points detected in step S911. Next, in step S915, the local area is divided into sub-areas. In step S917, a feature vector for each sub-region is generated to generate a feature vector for the local region. The processing from step S911 to S917 is illustrated in FIG. 4B.

次に、ステップＳ９１９において、ステップＳ９１７において生成された局所領域の特徴ベクトルに対して次元選定を実行する。次元選定については、図４Ｄ〜図４Ｆに図示されている。 Next, in step S919, dimension selection is performed on the feature vector of the local region generated in step S917. The dimension selection is illustrated in FIGS. 4D to 4F.

ステップＳ９２１においては、ステップＳ９１１で検出した全特徴点について局所特徴量の生成と次元選定とが終了したかを判定する。終了していない場合はステップＳ９１３に戻って、次の１つの特徴点について処理を繰り返す。 In step S921, it is determined whether the generation of local feature values and dimension selection have been completed for all feature points detected in step S911. If not completed, the process returns to step S913, and the process is repeated for the next one feature point.
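
The loop of steps S913 to S921 can be sketched as follows. This is only an illustrative sketch, not the method of FIGS. 4B to 4F: the patch format (a square grid of gradient magnitude and angle pairs), the histogram layout, and the names `subregion_histograms`, `select_dimensions`, and `generate_local_features` are all assumptions, and plain truncation to the first j dimensions stands in for the dimension selection order defined in FIGS. 4D to 4F.

```python
import math


def subregion_histograms(patch, grid=2, bins=4):
    """Build one orientation histogram per sub-region and concatenate them.

    patch: square list of rows; each cell is (magnitude, angle_in_radians).
    This patch format is an assumption made for illustration.
    """
    size = len(patch)
    step = size // grid
    vector = []
    for gy in range(grid):              # S915: split the local region into sub-regions
        for gx in range(grid):
            hist = [0.0] * bins
            for y in range(gy * step, (gy + 1) * step):
                for x in range(gx * step, (gx + 1) * step):
                    magnitude, angle = patch[y][x]
                    b = int((angle % (2 * math.pi)) / (2 * math.pi) * bins) % bins
                    hist[b] += magnitude
            vector.extend(hist)         # S917: sub-region vectors form the local vector
    return vector


def select_dimensions(vector, j):
    """S919 stand-in: keep the first j dimensions of the feature vector."""
    return vector[:j]


def generate_local_features(patches, j):
    """Repeat S913 to S919 for every detected feature point (S921)."""
    return [select_dimensions(subregion_histograms(p), j) for p in patches]
```

Because the selection keeps a prefix of the vector, a shorter feature is always directly comparable with the leading dimensions of a longer one, which is what the collation step relies on.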

（照合処理）
図９Ｂは、本実施形態に係る照合処理Ｓ８１５の処理手順を示すフローチャートである。
(Collation processing)
FIG. 9B is a flowchart showing the processing procedure of the collation processing S815 according to the present embodiment.

まず、ステップＳ９３１において、初期化として、パラメータｐ＝１，ｑ＝０を設定する。次に、ステップＳ９３３において、ステップＳ８１３において生成した局所特徴量の次元数ｊを取得する。 First, in step S931, parameters p = 1 and q = 0 are set as initialization. Next, in step S933, the dimension number j of the local feature amount generated in step S813 is acquired.

ステップＳ９３５〜Ｓ９４５のループにおいて、ｐ＞ｍ（ｍ＝認識対象物の特徴点数）となるまで各局所特徴量の照合を繰り返す。まず、ステップＳ９３５において、局所特徴量ＤＢ３３０に格納された認識対象物の第ｐ番局所特徴量の次元数ｊのデータを取得する。すなわち、最初の１次元からｊ次元を取得する。次に、ステップＳ９３７において、ステップＳ９３５において取得した第ｐ番局所特徴量と入力映像から生成した全特徴点の局所特徴量を順に照合して、類似か否かを判定する。ステップＳ９３９においては、局所特徴量間の照合の結果から類似度が閾値αを超えるか否かを判断し、超える場合はステップＳ９４１において、局所特徴量と、入力映像と認識対象物とにおける合致した特徴点の位置関係との組みを記憶する。そして、合致した特徴点数のパラメータであるｑを１つカウントアップする。ステップＳ９４３においては、認識対象物の特徴点を次の特徴点に進め（ｐ←ｐ＋１）、認識対象物の全特徴点の照合が終わってない場合には（ｐ≦ｍ）、ステップＳ９３５に戻って合致する局所特徴量の照合を繰り返す。なお、閾値αは、認識対象物によって求められる認識精度に対応して変更可能である。ここで、他の認識対象物との相関が低い認識対象物であれば認識精度を低くしても、正確な認識が可能である。 In the loop of steps S935 to S945, the collation of each local feature amount is repeated until p > m (m = the number of feature points of the recognition object). First, in step S935, the data for j dimensions of the p-th local feature amount of the recognition object stored in the local feature DB 330 is acquired; that is, the first through j-th dimensions are acquired. Next, in step S937, the p-th local feature amount acquired in step S935 is collated in turn with the local feature amounts of all feature points generated from the input video to determine whether they are similar. In step S939, it is determined from the collation result whether the similarity exceeds a threshold α. If it does, in step S941 the local feature amount is stored together with the positional relationship of the matched feature points in the input video and the recognition object, and q, the parameter that counts matched feature points, is incremented by one. In step S943, the process advances to the next feature point of the recognition object (p ← p + 1); if not all feature points of the recognition object have been collated (p ≦ m), the process returns to step S935 and the collation is repeated.
Note that the threshold α can be changed according to the recognition accuracy required for each recognition object. For a recognition object that has a low correlation with other recognition objects, accurate recognition is possible even if the recognition accuracy is lowered.
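
The loop of steps S931 to S945 can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the patented implementation: the similarity measure (an inverse Euclidean distance), the choice to keep only the best-scoring input feature per stored feature, and the names `similarity` and `match_object` are all hypothetical, since the text does not specify them.

```python
import math


def similarity(a, b):
    """Illustrative similarity in (0, 1]; 1.0 means identical vectors."""
    return 1.0 / (1.0 + math.dist(a, b))


def match_object(db_features, db_points, in_features, in_points, alpha):
    """Collate one recognition object against the input frame (S931 to S945).

    Returns q (the number of matched feature points) and the stored
    (object point, input point) pairs.
    """
    j = len(in_features[0]) if in_features else 0    # S933: dimension count j
    q = 0                                            # S931: q = 0
    pairs = []
    for p, db_vec in enumerate(db_features):         # p runs over the m stored features
        head = db_vec[:j]                            # S935: first j dimensions only
        best_score, best_idx = 0.0, None
        for idx, in_vec in enumerate(in_features):   # S937: compare with every input feature
            score = similarity(head, in_vec[:len(head)])
            if score > best_score:
                best_score, best_idx = score, idx
        if best_idx is not None and best_score > alpha:        # S939: threshold alpha
            pairs.append((db_points[p], in_points[best_idx]))  # S941: store positions
            q += 1
    return q, pairs
```

Truncating both vectors to their common leading dimensions is what lets a stored feature with more dimensions be compared against a shorter feature generated on the terminal.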

認識対象物の全特徴点との照合が終了すると、ステップＳ９４５からＳ９４７に進んで、ステップＳ９４７〜Ｓ９５３において、認識対象物が入力映像に存在するか否かが判定される。まず、ステップＳ９４７において、認識対象物の特徴点数ｐの内で入力映像の特徴点の局所特徴量と合致した特徴点数ｑの割合が、閾値βを超えたか否かを判定する。超えていればステップＳ９４９に進んで、認識対象物候補として、さらに、入力映像の特徴点と認識対象物の特徴点との位置関係が、線形変換が可能な関係であるか否かを判定する。すなわち、ステップＳ９４１において局所特徴量が合致したとして記憶した、入力映像の特徴点と認識対象物の特徴点との位置関係が、回転や反転、視点の位置変更などの変化によっても可能な位置関係なのか、不可能な位置関係なのかを判定する。かかる判定方法は幾何学的に既知であるので、詳細な説明は省略する。ステップＳ９５１において、線形変換可能か否かの判定結果により、線形変換可能であればステップＳ９５３に進んで、照合した認識対象物が入力映像に存在すると判定する。なお、閾値βは、認識対象物によって求められる認識精度に対応して変更可能である。ここで、他の認識対象物との相関が低い、あるいは一部分からでも特徴が判断可能な認識対象物であれば合致した特徴点が少なくても、正確な認識が可能である。すなわち、一部分が隠れて見えなくても、あるいは特徴的な一部分が見えてさえいれば、対象物の認識が可能である。 When collation with all feature points of the recognition object is complete, the process proceeds from step S945 to S947, and steps S947 to S953 determine whether the recognition object exists in the input video. First, in step S947, it is determined whether the ratio of the number q of feature points that matched local feature amounts of feature points in the input video to the number p of feature points of the recognition object exceeds a threshold β. If it does, the process proceeds to step S949, where the object is treated as a recognition object candidate and it is further determined whether the positional relationship between the feature points of the input video and those of the recognition object can be explained by a linear transformation. That is, it is determined whether the positional relationship stored in step S941 for the matched local feature amounts is one that could result from changes such as rotation, inversion, or a change of viewpoint, or one that could not. Since such determination methods are known in geometry, a detailed description is omitted.
In step S951, based on the result of determining whether a linear transformation is possible, the process proceeds to step S953 if it is, and it is determined that the collated recognition object exists in the input video. The threshold β can be changed according to the recognition accuracy required for each recognition object. For a recognition object that has a low correlation with other recognition objects, or whose features can be identified even from a part of it, accurate recognition is possible even with few matched feature points. That is, the object can be recognized even if part of it is hidden, as long as a characteristic part is visible.
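
The two-stage decision of steps S947 to S953 might look like the sketch below. The text leaves the geometric test unspecified, so a least-squares similarity transform (rotation, uniform scale, and translation only) is swapped in here as an assumption; it does not cover inversion or general viewpoint changes, which the text also allows. The names `fit_similarity_transform` and `object_present` and the pixel `tolerance` are hypothetical, and m denotes the number of feature points of the recognition object.

```python
def fit_similarity_transform(pairs):
    """Least-squares fit of w = a*z + b, treating 2-D points as complex numbers."""
    zs = [complex(*src) for src, _ in pairs]
    ws = [complex(*dst) for _, dst in pairs]
    z_mean = sum(zs) / len(zs)
    w_mean = sum(ws) / len(ws)
    denom = sum(abs(z - z_mean) ** 2 for z in zs)
    if denom == 0:
        return None  # all source points coincide; no transform can be fitted
    a = sum((w - w_mean) * (z - z_mean).conjugate() for z, w in zip(zs, ws)) / denom
    b = w_mean - a * z_mean
    return a, b


def object_present(m, pairs, beta, tolerance=2.0):
    """Decide whether the recognition object is in the frame (S947 to S953).

    m: number of feature points of the recognition object.
    pairs: (object point, input point) pairs stored in S941.
    """
    q = len(pairs)
    if m == 0 or q / m <= beta:              # S947: matched ratio must exceed beta
        return False
    fit = fit_similarity_transform(pairs)    # S949: try to explain the layout
    if fit is None:
        return False
    a, b = fit
    worst = max(abs(a * complex(*src) + b - complex(*dst)) for src, dst in pairs)
    return worst <= tolerance                # S951 / S953: consistent layout found
```

A production system would more likely fit an affine or projective model with outlier rejection; the structure of the decision (ratio test first, geometry second) is the part taken from the text.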

ステップＳ９５５においては、局所特徴量ＤＢ３３０に未照合の認識対象物が残っているか否かを判定する。まだ認識対象物が残っていれば、ステップＳ９５７において次の認識対象物を設定して、パラメータｐ＝１，ｑ＝０に初期化し、ステップＳ９３５に戻って照合を繰り返す。 In step S955, it is determined whether or not an unmatched recognition target remains in the local feature DB 330. If there is still a recognition object, the next recognition object is set in step S957, initialized to parameters p = 1 and q = 0, and the process returns to step S935 to repeat the collation.

なお、かかる照合処理の説明からも明らかなように、あらゆる分野の認識対象物を局所特徴量ＤＢ３３０に記憶して、全認識対象物を携帯端末で照合する処理は、負荷が非常に大きくなる。したがって、例えば、入力映像からの対象物認識の前にユーザが対象物の分野をメニューから選択して、その分野を局所特徴量ＤＢ３３０から検索して照合することが考えられる。また、局所特徴量ＤＢ３３０にユーザが使用する分野（例えば、図２の例であれば建築物など）の局所特徴量のみをダウンロードすることによっても、負荷を軽減できる。 As is clear from the description of the collation processing, the processing for storing recognition objects in all fields in the local feature DB 330 and collating all the recognition objects with the mobile terminal is very heavy. Therefore, for example, before the object recognition from the input video, it is conceivable that the user selects the field of the object from the menu, searches the field from the local feature DB 330, and collates the field. The load can also be reduced by downloading only the local feature amount of the field used by the user (for example, a building in the example of FIG. 2) to the local feature amount DB 330.

［第３実施形態］
次に、本発明の第３実施形態に係る映像処理装置について説明する。本実施形態に係る映像処理装置は、上記第２実施形態と比べると、映像中から認識した複数の認識対象物を組み合わせることによって、映像全体が何を表わすかを認識する点で異なる。例えば、複数の建築物の認識から現在地や行き先を認識する。あるいは、複数の部品の認識から製品を認識する。その他の構成および動作は、第２実施形態と同様であるため、同じ構成および動作については説明を省略する。
[Third Embodiment]
Next, a video processing apparatus according to the third embodiment of the present invention will be described. The video processing apparatus according to the present embodiment differs from the second embodiment in that it recognizes what the video as a whole represents by combining a plurality of recognition objects recognized in the video. For example, the current location or a destination is recognized from the recognition of a plurality of buildings, or a product is recognized from the recognition of a plurality of parts. Since the other configurations and operations are the same as in the second embodiment, their description is omitted.

本実施形態によれば、ユーザが映像を視聴中、その映像全体が表わす認識対象物についての認識結果を、認識精度を維持しながら映像上でリアルタイムにユーザに知らせることができる。 According to the present embodiment, while the user is viewing the video, the recognition result of the recognition target object represented by the entire video can be notified to the user in real time on the video while maintaining the recognition accuracy.

《本実施形態に係る映像処理》
図１０は、本実施形態に係る映像処理装置１０００による映像処理を説明する図である。
《Video processing according to this embodiment》
FIG. 10 is a diagram for explaining video processing by the video processing apparatus 1000 according to the present embodiment.

まず、図１０の上段は、観光客などユーザが、視界内の塔やビルなどの建築物を知りたい場合や、行き先や自分の現在地を知りたい場合に、携帯端末で認識対象物を含む映像を撮像した例を示すものである。 First, the upper part of FIG. 10 shows an example in which a user such as a tourist captures video containing recognition objects with a portable terminal in order to identify structures in view, such as towers and buildings, or to learn a destination or the user's current location.

携帯端末としての映像処理装置１０００の表示画面１０１０には、図１０上段の左図に示すように、撮像中の映像表示領域１０１１とタッチパネルの指示ボタン表示領域１０１２とが表示されている。なお、映像表示領域１０１１に表示されている建築物は、撮像中の映像がそのまま表示されたものであり、静止画（写真）ではない。 On the display screen 1010 of the video processing apparatus 1000 as a portable terminal, a video display area 1011 during imaging and an instruction button display area 1012 on the touch panel are displayed as shown in the upper left diagram of FIG. Note that the building displayed in the video display area 1011 is the image being captured as it is, and is not a still image (photo).

本実施形態においては、その表示映像に対してリアルタイムの認識処理が行なわれ、図１０上段の中央の表示画面１０２０の映像表示領域１０２１には、それぞれの建築物の名前１０２２が表示される。 In this embodiment, real-time recognition processing is performed on the display image, and the name 1022 of each building is displayed in the image display area 1021 of the center display screen 1020 in the upper part of FIG.

さらに、本実施形態においては、認識された複数の建築物の映像内の配置や距離、角度などから、映像処理装置１０００で撮像しているユーザの現在地を認識して、図１０上段の右の表示画面１０３０の映像表示領域１０３１には、現在地周囲の地図上の現在地マーク１０３３と現在地の住所１０３２とがリアルタイムに表示されている。あるいは、スピーカから音声出力されてもよい。ユーザは、かかる映像表示領域１０３１から、例えば、観光案内などが無くても、目的地や自分の現在地を知ることができる。 Furthermore, in the present embodiment, the current location of the user who is capturing video with the video processing apparatus 1000 is recognized from the arrangement, distances, angles, and the like of the recognized buildings in the video. The video display area 1031 of the right display screen 1030 in the upper part of FIG. 10 then displays, in real time, a current location mark 1033 on a map of the surrounding area and the address 1032 of the current location. Alternatively, the information may be output as audio from a speaker. From the video display area 1031, the user can learn the destination or the current location even without, for example, a tourist guide.

図１０の下段は、製品を撮像すると、映像中の各部品を認識すると共に、それら複数部品を有する製品を認識して表示する例を示したものである。 The lower part of FIG. 10 shows an example in which, when a product is imaged, each part in the video is recognized and a product having these parts is recognized and displayed.

携帯端末としての映像処理装置１０００の表示画面１０４０には、図１０下段の左図に示すように、撮像中の映像表示領域１０４１とタッチパネルの指示ボタン表示領域１０４２とが表示されている。なお、映像表示領域１０４１に表示されている製品は、撮像中の映像がそのまま表示されたものであり、静止画（写真）ではない。 On the display screen 1040 of the video processing apparatus 1000 as a portable terminal, a video display area 1041 showing the video being captured and an instruction button display area 1042 on the touch panel are displayed, as shown in the left diagram in the lower part of FIG. 10. Note that the product displayed in the video display area 1041 is the video being captured, displayed as is, and is not a still image (photograph).

本実施形態においては、その表示映像に対してリアルタイムの認識処理が行なわれ、図１０下段の中央の表示画面１０５０の映像表示領域１０５１には、それぞれの部品の名前１０５２が表示される。 In the present embodiment, real-time recognition processing is performed on the display image, and the name 1052 of each component is displayed in the image display area 1051 of the center display screen 1050 in the lower part of FIG.

さらに、本実施形態においては、認識された複数の部品の映像内の配置や距離、角度などから、映像処理装置１０００で撮像している製品を認識して、図１０下段の右の表示画面１０６０の映像表示領域１０６１には、製品名１０６２がリアルタイムに表示されている。あるいは、スピーカから音声出力されてもよい。かかる映像表示領域１０６１から、例えば、製品を知ることができる。 Furthermore, in the present embodiment, the product being imaged by the video processing apparatus 1000 is recognized from the arrangement, distances, angles, and the like of the recognized parts in the video, and the product name 1062 is displayed in real time in the video display area 1061 of the right display screen 1060 in the lower part of FIG. 10. Alternatively, the information may be output as audio from a speaker. From the video display area 1061, the user can, for example, learn what the product is.

なお、認識対象は上記例に限定されない。全体映像を認識対象物の配置から認識できるものであれば、いずれにも適用できる。 The recognition target is not limited to the above examples. The technique can be applied to anything for which the overall video can be recognized from the arrangement of the recognition objects.

《映像処理装置の機能構成》
図１１は、本実施形態に係る映像処理装置１０００の機能構成を示すブロック図である。なお、映像処理装置１０００の機能構成は、第２実施形態の映像処理装置２００に関連情報を表示する構成が追加された構成であるので、同じ構成要素には同じ参照番号を付し、説明は省略する。
《Functional configuration of video processing device》
FIG. 11 is a block diagram showing the functional configuration of the video processing apparatus 1000 according to the present embodiment. The functional configuration of the video processing apparatus 1000 is that of the video processing device 200 of the second embodiment with a configuration for displaying related information added; the same components are therefore given the same reference numerals, and their description is omitted.

映像処理装置１０００は、照合部３４０による映像中の複数の照合結果を保持する複数照合結果保持部１１１０を有する（図１２参照）。また、複数の照合結果の組み合わせから、映像を認識するための組み合わせ識別ＤＢ１１２０を有する（図１３参照）。 The video processing apparatus 1000 includes a multiple collation result holding unit 1110 that holds the plurality of collation results produced by the collation unit 340 for a video (see FIG. 12). It also includes a combination identification DB 1120 for recognizing the video from a combination of collation results (see FIG. 13).

なお、音声出力があればスピーカから出力される。また、組み合わせ識別ＤＢ１１２０に格納される組み合わせ情報は、通信制御部３９０（図１１には不図示）を介してダウンロードされる構成であってもよい。 If there is an audio output, it is output from the speaker. Further, the combination information stored in the combination identification DB 1120 may be downloaded via the communication control unit 390 (not shown in FIG. 11).

（複数照合結果保持部）
図１２は、本実施形態に係る複数照合結果保持部１１１０の構成を示す図である。
(Multiple collation result holding unit)
FIG. 12 is a diagram illustrating the configuration of the multiple collation result holding unit 1110 according to the present embodiment.

複数照合結果保持部１１１０は、照合結果の認識対象物ＩＤ１２０１および認識対象物名１２０２に対応付けて、映像中の位置１２０３、向き（角度）１２０４、サイズ（距離）１２０５を記憶する。 The multiple collation result holding unit 1110 stores a position 1203 in the video, an orientation (angle) 1204, and a size (distance) 1205 in association with the recognition object ID 1201 and the recognition object name 1202 of each collation result.

（組み合わせ識別ＤＢ１１２０）
図１３は、本実施形態に係る組み合わせ識別ＤＢ１１２０の構成を示す図である。
(Combination identification DB 1120)
FIG. 13 is a diagram showing the configuration of the combination identification DB 1120 according to the present embodiment.

組み合わせ識別ＤＢ１１２０には、２つの種類がある。１つは組み合わせ識別ＤＢ１３１０であり、複数の認識対象物の位置関係が分からなくても、映像が認識可能なものが格納される。組み合わせ識別ＤＢ１３１０は、複数の認識対象物１３１１の組み合わせに対応付けて、照合映像ＩＤ１３１２と照合映像名１３１３とが記憶される。 There are two types of combination identification DB 1120. One is the combination identification DB 1310, which stores entries for videos that can be recognized even when the positional relationship among the recognition objects is unknown. The combination identification DB 1310 stores a collation video ID 1312 and a collation video name 1313 in association with each combination of a plurality of recognition objects 1311.

もう１つは組み合わせ識別ＤＢ１３２０であり、複数の認識対象物の位置関係を考慮して、映像を認識するものが格納される。組み合わせ識別ＤＢ１３２０は、複数の認識対象物と位置１３２１の組み合わせに対応付けて、照合映像ＩＤ１３２２と照合映像名１３２３とが記憶される。 The other is the combination identification DB 1320, which stores entries for videos that are recognized by taking the positional relationship among the recognition objects into account. The combination identification DB 1320 stores a collation video ID 1322 and a collation video name 1323 in association with each combination of a plurality of recognition objects and their positions 1321.
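
As a rough illustration of how the two DB types could be consulted, the sketch below models a held collation result (items 1201 to 1205) as a record and looks up a combination first with positions and then without. Everything concrete here is an assumption: the record fields' types, the sample entries, the use of left-to-right order as the "positional relationship", and the names `CollationResult`, `ORDERED_DB`, `UNORDERED_DB`, and `identify_video` are hypothetical and not taken from FIGS. 12 and 13.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CollationResult:
    object_id: str    # recognition object ID (1201)
    name: str         # recognition object name (1202)
    position: tuple   # (x, y) position in the video (1203)
    angle: float      # orientation (1204)
    size: float       # size / distance (1205)


# Stand-in for DB 1310: the combination alone identifies the video.
UNORDERED_DB = {
    frozenset({"tower", "station"}): ("V001", "Station square"),
}

# Stand-in for DB 1320: the left-to-right order of the objects also matters.
ORDERED_DB = {
    ("station", "tower"): ("V002", "Station square, north side"),
}


def identify_video(results):
    """Return (collation video ID, name), or None when nothing matches."""
    ordered = tuple(r.name for r in sorted(results, key=lambda r: r.position[0]))
    hit = ORDERED_DB.get(ordered)                    # position-aware lookup first
    if hit is None:
        hit = UNORDERED_DB.get(frozenset(ordered))   # fall back to combination only
    return hit
```

Trying the position-aware table first means a more specific answer wins when the layout is known, while the unordered table still gives a coarse answer otherwise.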

《映像処理装置のハードウェア構成》
図１４は、本実施形態に係る映像処理装置１０００のハードウェア構成を示すブロック図である。なお、映像処理装置１０００の構成の、第２実施形態の図７の構成との相違点は、複数の認識対象物の組み合わせ照合の構成である。他の構成は同様であるので、図７と同じ参照番号を付し説明は省略する。
《Hardware configuration of video processing device》
FIG. 14 is a block diagram showing the hardware configuration of the video processing apparatus 1000 according to the present embodiment. The configuration of the video processing apparatus 1000 differs from that of FIG. 7 of the second embodiment in the components for combination collation of a plurality of recognition objects. Since the other components are the same, they are given the same reference numerals as in FIG. 7, and their description is omitted.

ＲＡＭ１４４０の１１１０には、図１２に示した複数照合結果が保持される。また、組み合わせ照合結果１４４１は、組み合わせ照合結果を記憶する領域である。そして、ストレージ１４５０の組み合わせ識別ＤＢ１１２０は、図１３に示した組み合わせ識別ＤＢが格納される領域である。携帯端末制御プログラム１４５２は、本実施形態の携帯端末制御プログラムが格納される領域である（図１５参照）。映像照合制御モジュール１４５６は、携帯端末制御プログラム１４５２において、複数の認識対象物から映像照合を行なう映像照合制御モジュールが格納される領域である（図１６参照）。 Area 1110 of the RAM 1440 holds the multiple collation results shown in FIG. 12. The combination collation result 1441 is an area for storing the combination collation result. The combination identification DB 1120 in the storage 1450 is an area in which the combination identification DB shown in FIG. 13 is stored. The portable terminal control program 1452 is an area in which the portable terminal control program of the present embodiment is stored (see FIG. 15). The video collation control module 1456 is an area in which the video collation control module, which performs video collation from a plurality of recognition objects within the portable terminal control program 1452, is stored (see FIG. 16).

《映像処理装置の処理手順》
図１５は、本実施形態に係る映像処理装置１０００の処理手順を示すフローチャートである。このフローチャートは、図１４のＣＰＵ７１０によってＲＡＭ７４０を用いて実行され、図１１の各機能構成部を実現する。
《Processing procedure of video processing device》
FIG. 15 is a flowchart showing the processing procedure of the video processing apparatus 1000 according to the present embodiment. This flowchart is executed by the CPU 710 in FIG. 14 using the RAM 740, and implements each functional component in FIG. 11.

なお、図１５において、第２実施形態の図８との相違は、複数認識対象物による映像照合処理の追加である。他の処理は図８と同様であるので、同じステップには同じステップ番号を付し、説明は省略する。 In FIG. 15, the difference from the second embodiment shown in FIG. 8 is the addition of video collation processing using a plurality of recognition objects. Since the other processes are the same as those in FIG. 8, the same steps are denoted by the same step numbers and description thereof is omitted.

ステップＳ８１５における照合処理の後に、ステップＳ１５０１において、複数認識対象物による映像照合処理が行なわれる（図１６参照）。 After the matching process in step S815, in step S1501, a video matching process using a plurality of recognition objects is performed (see FIG. 16).

（映像照合処理）
図１６は、本実施形態に係る映像照合処理Ｓ１５０１の処理手順を示すフローチャートである。
(Video collation processing)
FIG. 16 is a flowchart showing the processing procedure of the video collation processing S1501 according to the present embodiment.

まず、ステップＳ１６０１において、映像中から認識した複数の認識対象物（また位置）を取得する。そして、ステップＳ１６０３において、組み合わせ識別ＤＢ１１２０を参照して、複数の認識対象物（位置）の組み合わせから映像の照合を行なう。 First, in step S1601, the plurality of recognition objects recognized in the video (and their positions) are acquired. Then, in step S1603, the video is collated against the combination identification DB 1120 using the combination of the recognition objects (and positions).

ステップＳ１６０５において、照合する映像があるか否かを判定する。照合映像があればステップＳ１６０７に進んで、照合映像に関連する情報を取得してリターンする。 In step S1605, it is determined whether there is a matching video. If there is, the process proceeds to step S1607, information related to the matched video is acquired, and the process returns.

［第４実施形態］
次に、本発明の第４実施形態に係る映像処理システムについて説明する。本実施形態に係る映像処理システムは、上記第２実施形態および第３実施形態と比べると、携帯端末は、映像の局所特徴量を生成してサーバに送信し、サーバによって認識された認識対象物や映像全体の認識結果を受信する点で異なる。本実施形態の携帯端末とサーバ間の通信において、本実施形態の局所特徴量生成における容量削減がリアルタイム処理に有効となる。
[Fourth Embodiment]
Next, a video processing system according to the fourth embodiment of the present invention will be described. The video processing system according to the present embodiment differs from the second and third embodiments in that the portable terminal generates local feature amounts of the video, transmits them to a server, and receives the recognition results for the recognition objects and for the video as a whole from the server. In the communication between the portable terminal and the server, the size reduction achieved by the local feature generation of the present embodiment is effective for real-time processing.

本実施形態によれば、携帯端末の負荷を軽減して、ユーザが映像を視聴中、その映像内の認識対象物および映像についての認識結果あるいはさらにリンク情報を、認識精度を維持しながら映像上でリアルタイムにユーザに知らせることができる。 According to the present embodiment, the load on the portable terminal is reduced, and while the user is viewing a video, the recognition results for the recognition objects in the video and for the video itself, and additionally link information, can be presented to the user on the video in real time while maintaining recognition accuracy.

《本実施形態に係る映像処理》
図１７は、本実施形態に係る映像処理システム１７００による映像処理を説明する図である。
《Video processing according to this embodiment》
FIG. 17 is a diagram for explaining video processing by the video processing system 1700 according to the present embodiment.

映像処理システム１７００は、第２実施形態および第３実施形態において示した携帯端末である映像処理装置のように、映像入力と局所特徴量生成処理と照合処理とを自己完結的に行なわない。すなわち、携帯端末である映像処理装置１７１０は映像入力と局所特徴量生成処理とを行ない、ネットワーク１７７０で接続された照合サーバである映像処理装置１７２０が照合などの負荷の大きな処理を行なう。かかる処理においては、ネットワーク１７７０上を転送する局所特徴量の容量の大小が照合速度や通信のトラフィックに影響する。本実施形態の図４Ａ〜図４Ｆに従って生成した精度を保った局所特徴量の容量削減により、入力映像内の対象物認識や映像認識と認識結果の入力映像への重畳表示がリアルタイムに可能となる。なお、本実施形態においては、さらに、符号化処理により、入力映像内の対象物認識と認識結果の入力映像への重畳表示がより高速に可能とする。 Unlike the video processing devices of the second and third embodiments, which are portable terminals, the video processing system 1700 does not perform video input, local feature generation processing, and collation processing in a self-contained manner. Instead, the video processing device 1710, a portable terminal, performs video input and local feature generation processing, and the video processing device 1720, a collation server connected via the network 1770, performs heavy processing such as collation. In this arrangement, the size of the local feature amounts transferred over the network 1770 affects collation speed and communication traffic. Because the local feature amounts generated according to FIGS. 4A to 4F are reduced in size while preserving accuracy, object recognition and video recognition in the input video, and superimposed display of the recognition results on the input video, become possible in real time. In the present embodiment, an encoding process additionally makes object recognition in the input video and superimposed display of the recognition results even faster.

図１７の映像処理システム１７００は、端末装置である映像処理装置１７１０と、映像処理装置１７１０とネットワーク１７７０を介して接続する照合サーバである映像処理装置１７２０とを有する。また、ネットワーク１７７０には、サービス情報を保持してリンク情報によりユーザに提供するサービス提供サーバ１７３０も接続されている。 A video processing system 1700 in FIG. 17 includes a video processing device 1710 that is a terminal device, and a video processing device 1720 that is a collation server connected to the video processing device 1710 via a network 1770. Also connected to the network 1770 is a service providing server 1730 that holds service information and provides the user with link information.

照合サーバである映像処理装置１７２０は、認識対象物の照合処理に使用する局所特徴量ＤＢ１７２１と、映像の照合処理に使用する組み合わせ識別ＤＢ１７２２と、リンク情報の提供に使用するリンク情報ＤＢ１７２３とを有する。 The video processing device 1720, which is a collation server, includes a local feature DB 1721 used for collation processing of recognition objects, a combination identification DB 1722 used for video collation processing, and a link information DB 1723 used for providing link information. .

図１７には、端末装置である映像処理装置１７１０の１つの処理例が示されている。図１０にも示した、製品を撮像すると、映像中の各部品を認識すると共に、それら複数部品を有する製品を認識して表示する例を示したものである。 FIG. 17 shows one processing example of the video processing device 1710, which is a terminal device. As in FIG. 10, it shows an example in which, when a product is imaged, each part in the video is recognized, and the product comprising those parts is also recognized and displayed.

携帯端末としての映像処理装置１７１０の表示画面１７４０には、左図に示すように、撮像中の映像表示領域１７４１とタッチパネルの指示ボタン表示領域１７４２とが表示されている。なお、映像表示領域１７４１に表示されている製品は、撮像中の映像をそのまま表示したものであり、静止画（写真）ではない。 On the display screen 1740 of the video processing apparatus 1710 as a portable terminal, as shown in the left figure, a video display area 1741 during imaging and an instruction button display area 1742 on the touch panel are displayed. Note that the product displayed in the video display area 1741 displays the video being captured as it is, not a still image (photograph).

本実施形態においては、その表示映像に対してリアルタイムの認識処理が行なわれ、中央の表示画面１７５０の映像表示領域１７５１には、それぞれの部品の名前１７５２が表示される。 In this embodiment, real-time recognition processing is performed on the display video, and the name 1752 of each component is displayed in the video display area 1751 of the central display screen 1750.

さらに、本実施形態においては、認識された複数の部品の映像内の配置や距離、角度などから、映像処理装置１７１０で撮像している製品を認識して、右の表示画面１７６０の映像表示領域１７６１には、製品名１７６２およびその関連情報へのリンク情報１７６３がリアルタイムに表示されている。あるいは、スピーカから音声出力されてもよい。かかる映像表示領域１７６１から、例えば、製品の情報を知ることができる。 Furthermore, in the present embodiment, the product being imaged by the video processing device 1710 is recognized from the arrangement, distances, angles, and the like of the recognized parts in the video, and the product name 1762 and link information 1763 to related information are displayed in real time in the video display area 1761 of the right display screen 1760. Alternatively, the information may be output as audio from a speaker. From the video display area 1761, the user can, for example, obtain information about the product.

《Video processing procedure of the video processing system》
FIG. 18 is a sequence diagram showing the video processing procedure of the video processing system 1700 according to the present embodiment. In step S1800, the application according to the present embodiment is downloaded from the collation server if necessary.

First, in step S1801, the applications of the portable terminal and the collation server are activated and initialized. In step S1803, the portable terminal captures video with the imaging unit 310. Next, in step S1805, the portable terminal generates local feature amounts. Then, in step S1807, the portable terminal encodes the generated local feature amounts and the position coordinates of the feature points, and in step S1809 transmits them to the collation server via the network.

In step S1811, the collation server recognizes objects in the video by collating the local feature amounts of the recognition objects in the local feature amount DB 1721 with the received local feature amounts. In step S1813, the collation is repeated until collation of the plurality of recognition objects is complete. Next, in step S1815, the collation server refers to the combination identification DB 1722 and collates the video on the basis of the plurality of recognition objects obtained as collation results. Then, in step S1817, the collation server refers to the link information DB 1723 and acquires the link information corresponding to the recognition objects. In step S1819, the recognition objects, the video collation result, and the link information are returned to the portable terminal.

In step S1821, the portable terminal superimposes the received recognition objects and link information on the input video and displays them (corresponding to the product name 1762 and the link information 1763 in FIG. 17). If a link destination is designated in step S1823, the process advances to step S1825, and a link destination server such as the service providing server 1730 in FIG. 17 is accessed on the basis of the collation video ID.

In step S1827, the service providing server reads the related information (a document, sound, or image) of the collated video from the related information DB. Then, in step S1829, the related information of the collated video is downloaded to the portable terminal. The portable terminal notifies the user of the object in the input video by superimposing the received related information on the input video or by playing it back as sound.

In the present embodiment, this series of processes is performed in real time, and the user can see the recognition object names, the collated video name, and the related information displayed on the input video.

《Functional configuration of the video processing device for a portable terminal》
FIG. 19A is a block diagram showing the functional configuration of the video processing device 1710 for a portable terminal according to the present embodiment. The functional configuration of the video processing device 1710 is that of the video processing device 200 of the second embodiment with the components related to collation processing removed and components for transmitting local feature amounts and receiving collation results added in their place. Components identical to those in FIG. 3 are therefore denoted by the same reference numerals, and their description is omitted.

The video processing device 1710 includes an encoding unit 1930 that encodes the local feature amounts and feature point coordinates generated by the local feature amount generation unit 320 so that they can be transmitted via the communication control unit 390 (see FIG. 19B).

Meanwhile, the recognition object names, the collated video, and the link information recognized by the collation server are acquired from the data received by the collation result/link information receiving unit 1950 via the communication control unit 390. A display screen in which the acquired data is superimposed on the input video is generated and displayed on the display unit 360. If there is audio data, it is output from the speaker.

(Encoding unit)
FIG. 19B is a block diagram showing the encoding unit 1930 according to the present embodiment. The encoding unit is not limited to this example, and other encoding processes are also applicable.

The encoding unit 1930 includes a coordinate value scanning unit 1931 that receives the coordinates of the feature points from the feature point detection unit 411 of the local feature amount generation unit 320 and scans the coordinate values. The coordinate value scanning unit 1931 scans the image according to a specific scanning method and converts the two-dimensional coordinate values (X and Y coordinate values) of each feature point into a one-dimensional index value. This index value is the scanning distance from the origin along the scan. There is no restriction on the scanning direction.
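
As a non-limiting illustration of the conversion performed by the coordinate value scanning unit 1931, the following sketch assumes a raster scan (left to right, top to bottom) with the origin at the top-left corner. Since the scanning direction is not restricted, the scan order, the function names, and the `width` parameter are assumptions made only for this example.

```python
def scan_index(x: int, y: int, width: int) -> int:
    """Convert a 2D feature point coordinate into a 1D raster-scan index.

    The index is the scanning distance from the origin (0, 0) when the
    image is scanned row by row (an assumed scanning method).
    """
    if not (0 <= x < width) or y < 0:
        raise ValueError("coordinate outside the scanned image")
    return y * width + x


def unscan_index(index: int, width: int) -> tuple[int, int]:
    """Inverse conversion, as a decoder would need: index -> (x, y)."""
    y, x = divmod(index, width)
    return x, y
```

With an image 10 pixels wide, the feature point at (3, 2) maps to the index 23, and the index 23 maps back to (3, 2).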

The encoding unit 1930 also includes a sorting unit 1932 that sorts the index values of the feature points and outputs information on the permutation after sorting. The sorting unit 1932 sorts, for example, in ascending order; it may instead sort in descending order.

The encoding unit 1930 also includes a difference calculation unit 1933 that calculates the difference between each pair of adjacent index values in the sorted sequence and outputs the series of difference values.

The encoding unit 1930 further includes a difference encoding unit 1934 that encodes the series of difference values in series order. The series of difference values may be encoded with, for example, a fixed bit length. When encoding with a fixed bit length, the bit length may be defined in advance, but this requires the number of bits needed to represent the largest conceivable difference value, so the encoded size does not become small. Therefore, when encoding with a fixed bit length, the difference encoding unit 1934 can determine the bit length on the basis of the input series of difference values. Specifically, for example, the difference encoding unit 1934 can obtain the maximum difference value from the input series, obtain the number of bits needed to represent that maximum value (the number of representation bits), and encode the series of difference values with that number of bits.
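
The sorting, difference calculation, and adaptive fixed-bit-length encoding described above can be sketched as follows. This is only an illustrative sketch: the function names are hypothetical, ascending order is assumed, the encoded result is represented as a string of '0' and '1' characters for readability, and the first difference is taken from the origin (index 0) so that a decoder could restore the absolute index values, which is an assumption not stated in the text.

```python
def sort_and_diff(indices: list[int]) -> tuple[list[int], list[int]]:
    """Sort index values in ascending order and take adjacent differences.

    Returns (order, diffs), where `order` is the permutation after sorting
    and diffs[0] is the first sorted index measured from the origin.
    """
    order = sorted(range(len(indices)), key=lambda k: indices[k])
    sorted_idx = [indices[k] for k in order]
    diffs = [b - a for a, b in zip([0] + sorted_idx, sorted_idx)]
    return order, diffs


def encode_fixed_length(diffs: list[int]) -> tuple[int, str]:
    """Encode the series with the fewest bits that can represent its maximum."""
    bits = max(1, max(diffs, default=0).bit_length())
    bitstring = "".join(format(d, f"0{bits}b") for d in diffs)
    return bits, bitstring
```

For the index values 23, 5, and 40, the sorted sequence is 5, 23, 40, the differences are 5, 18, 17, and the largest difference (18) needs 5 bits, so the three differences occupy 15 bits in total instead of three values of a predefined, larger bit length.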

The encoding unit 1930 also includes a local feature amount encoding unit 1935 that encodes the local feature amounts of the corresponding feature points in the same permutation as the sorted index values of the feature points. Encoding in the same permutation as the sorted index values makes it possible to associate the coordinate values encoded by the difference encoding unit 1934 one-to-one with their corresponding local feature amounts. In the present embodiment, the local feature amount encoding unit 1935 can encode the local feature amount obtained by dimension selection from the 150-dimensional local feature amount of one feature point with, for example, one byte per dimension, that is, in as many bytes as there are dimensions.
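
As a hedged sketch of the local feature amount encoding unit 1935, the following writes each selected dimension as one byte, in the same permutation as the sorted index values. The function name and parameters are hypothetical, and it is assumed that each element of a feature vector has already been quantized to an integer from 0 to 255 and that dimension selection simply keeps the leading `num_dims` elements.

```python
def encode_features(features: list[list[int]], order: list[int], num_dims: int) -> bytes:
    """Encode feature vectors, one byte per dimension, in sorted order.

    `order` is the permutation produced by sorting the index values, so the
    k-th encoded vector belongs to the k-th encoded coordinate.
    """
    out = bytearray()
    for k in order:
        # bytearray.extend raises ValueError for values outside 0..255
        out.extend(features[k][:num_dims])
    return bytes(out)
```

Each feature point therefore occupies exactly `num_dims` bytes, and the decoder can pair the k-th block of bytes with the k-th decoded coordinate without sending any additional identifiers.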

《Functional configuration of the video processing device for a server》
FIG. 20 is a block diagram showing the functional configuration of the video processing device 1720 for a server according to the present embodiment.

The video processing device 1720 for a server includes a communication control unit 2010. The decoding unit 2020 decodes the encoded local feature amounts and feature point coordinates received from the portable terminal via the communication control unit 2010. The collation unit 2030 then collates them with the local feature amounts of the recognition objects in the local feature amount DB 1721. The multiple collation result holding unit 2040 holds the plurality of recognition objects obtained as collation results of the collation unit 2030, together with their positions in the video and the like. The video collation unit 2050 refers to the combination identification DB 1722 and collates the video on the basis of the plurality of recognition objects. The link information acquisition unit 2060 acquires, from the link information DB 1723, the link information corresponding to the collated video. The transmission data generation unit 2070 generates transmission data from the recognition objects of the collation unit 2030, the collated video of the video collation unit 2050, and the link information of the link information acquisition unit 2060. The transmission data is returned from the transmission unit 2080 to the portable terminal via the communication control unit 2010.
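
The decoding procedure of the decoding unit 2020 is not detailed here. As one possible sketch for the feature point coordinates, the following assumes that the terminal sent the bit length together with a string of fixed-length difference values in ascending order, that the first difference is measured from the origin, and that the index values come from a raster scan of an image of known `width`. All of these, and the function name, are assumptions for illustration.

```python
def decode_coordinates(bits: int, bitstring: str, width: int) -> list[tuple[int, int]]:
    """Restore (x, y) coordinates from fixed-length encoded index differences."""
    coords = []
    index = 0
    for pos in range(0, len(bitstring), bits):
        # Accumulating the differences restores the sorted index values.
        index += int(bitstring[pos:pos + bits], 2)
        y, x = divmod(index, width)
        coords.append((x, y))
    return coords
```

Because the differences are accumulated in the sorted order, the k-th restored coordinate corresponds to the k-th local feature amount in the received data.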

(Link information DB)
FIG. 21 is a diagram showing the configuration of the link information DB 1723 according to the present embodiment.

In the link information DB 1723, which is a link information storage unit, link information consisting of a link destination address 2102 and a link destination display image 2103 is stored in association with a collation video ID 2101. The link information DB 1723 may be prepared integrally with the combination identification DB 1722 and the local feature amount DB 1721.
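
The structure of the link information DB 1723 can be pictured, purely as an illustration, as a table keyed by the collation video ID. In the sketch below, the class name, field names, the sample ID, the sample address, and the sample image file name are all hypothetical; only the three stored items (2101, 2102, 2103) come from the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LinkInfo:
    address: str        # link destination address (2102)
    display_image: str  # link destination display image (2103), here a file name


# Keyed by collation video ID (2101); the entry is sample data only.
LINK_INFO_DB: dict[str, LinkInfo] = {
    "video-001": LinkInfo("https://example.com/product", "product.png"),
}


def lookup_link(collation_video_id: str) -> Optional[LinkInfo]:
    """Return the link information for a collated video, or None if absent."""
    return LINK_INFO_DB.get(collation_video_id)
```

Returning `None` for an unknown ID lets the caller send the recognition result without link information rather than fail.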

《Hardware configuration of the video processing device for a portable terminal》
FIG. 22 is a block diagram showing the hardware configuration of the video processing device 1710 for a portable terminal according to the present embodiment. The hardware configuration of the video processing device 1710 for a portable terminal is that of the video processing device 200 of the second embodiment with the components related to collation processing removed and components for transmitting local feature amounts and receiving collation results added in their place. Components identical to those in FIG. 7 are therefore denoted by the same reference numerals, and their description is omitted.

The RAM 2240 is a random access memory that the CPU 710 uses as a work area for temporary storage. The RAM 2240 has areas reserved for storing the data necessary to realize the present embodiment. The combination collation result 2241 is an area that stores the combination collation result received from the collation server. The link information 2242 is an area that stores the link information received from the collation server. The input video/collation result/link information superimposition data 2245 is an area that stores data in which the input video, the collation result, and the link information are superimposed.

The storage 2250 stores databases, various parameters, and the following data and programs necessary to realize the present embodiment. The display format 2251 is an area that stores the display format used to generate the format for displaying the recognition objects, the video collation result, and the link information. The storage 2250 also stores the following programs. The portable terminal control program 2252 is an area that stores the portable terminal control program that controls the entire video processing device 1710. The local feature amount transmission module 2253 is an area that stores the module of the portable terminal control program 2252 that encodes the generated local feature amounts and feature point coordinates and transmits them to the collation server. The collation result receiving module 2254 is an area that stores the module that receives the collation result and link information and notifies the user by display or sound. The link destination access module 2255 is an area that stores the module that, when displayed link information is designated, accesses the link destination and downloads the related information.

Note that FIG. 22 shows only the data and programs essential to the present embodiment; data and programs not related to the present embodiment are not shown.

《Processing procedure of the video processing device for a portable terminal》
FIG. 23 is a flowchart showing the processing procedure of the video processing device 1710 for a portable terminal according to the present embodiment. This flowchart is executed by the CPU 710 of FIG. 22 using the RAM 2240 and realizes each functional component of FIG. 19A.

First, in step S2311, it is determined whether there has been video input for object recognition. In step S2321, as a function of the portable terminal, it is determined whether data has been received. If neither is the case, other processing is performed in step S2331. Description of normal transmission processing is omitted.

If there is video input, the process proceeds to step S2313, and local feature amount generation processing is executed on the input video (see FIG. 9A). Next, in step S2315, the local feature amounts and the feature point coordinates are encoded (see FIGS. 24A and 24B). In step S2317, the encoded data is transmitted to the collation server.

If data has been received, the process proceeds to step S2323, and it is determined whether the data is a recognition result from the collation server or related information from the link destination server. If it is a recognition result, the process proceeds to step S2325, and the received recognition result and link destination information are superimposed on the input video and displayed. If it is related information, the process proceeds to step S2327, and the related information from the link destination server is displayed or output as sound.

(Encoding process)
FIG. 24A is a flowchart showing the processing procedure of the encoding process S2315 according to the present embodiment.

First, in step S2411, the coordinate values of the feature points are scanned in the desired order. Next, in step S2413, the scanned coordinate values are sorted. In step S2415, the differences between the coordinate values are calculated in the sorted order. In step S2417, the difference values are encoded (see FIG. 24B). Then, in step S2419, the local feature amounts are encoded in the sorted order of the coordinate values. The encoding of the difference values and the encoding of the local feature amounts may be performed in parallel.

(Difference value encoding process)
FIG. 24B is a flowchart showing the processing procedure of the difference value encoding process S2417 according to the present embodiment.

First, in step S2421, it is determined whether the difference value is within the encodable range. If it is within the encodable range, the process proceeds to step S2427, and the difference value is encoded. The process then moves to step S2429. If it is not within the encodable range (out of range), the process proceeds to step S2423, and an escape code is encoded. Then, in step S2425, the difference value is encoded by an encoding method different from that of step S2427. The process then moves to step S2429. In step S2429, it is determined whether the processed difference value is the last element of the series of difference values. If it is the last, the process ends. If it is not the last, the process returns to step S2421, and the next difference value in the series is processed.
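
The loop of steps S2421 to S2429 can be sketched as below. The concrete code format is not specified in the text, so this example makes several assumptions: in-range values are written with `bits` bits, the all-ones pattern of that width is reserved as the escape code, and an out-of-range value is written after the escape code with a wider fixed length `wide_bits`. The function names and default widths are hypothetical, and the output is a string of '0' and '1' characters for readability.

```python
def encode_with_escape(diffs: list[int], bits: int = 4, wide_bits: int = 16) -> str:
    """Encode difference values, switching methods for out-of-range values."""
    escape = (1 << bits) - 1  # all-ones pattern reserved as the escape code
    out = []
    for d in diffs:  # S2429 corresponds to reaching the end of this loop
        if d < 0 or d >= (1 << wide_bits):
            raise ValueError("difference value cannot be represented")
        if d < escape:                                # S2421: within range
            out.append(format(d, f"0{bits}b"))        # S2427
        else:
            out.append(format(escape, f"0{bits}b"))   # S2423: escape code
            out.append(format(d, f"0{wide_bits}b"))   # S2425: other method
    return "".join(out)


def decode_with_escape(bitstring: str, bits: int = 4, wide_bits: int = 16) -> list[int]:
    """Inverse of encode_with_escape under the same assumed format."""
    escape = (1 << bits) - 1
    diffs, pos = [], 0
    while pos < len(bitstring):
        value = int(bitstring[pos:pos + bits], 2)
        pos += bits
        if value == escape:
            value = int(bitstring[pos:pos + wide_bits], 2)
            pos += wide_bits
        diffs.append(value)
    return diffs
```

With `bits=4` and `wide_bits=8`, the series 3, 20 is encoded as `0011` followed by the escape code `1111` and the 8-bit value `00010100`. Small differences, which dominate when feature points are dense, stay short, and only rare large gaps pay for the escape code.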

《Hardware configuration of the video processing device for a server》
FIG. 25 is a block diagram showing the hardware configuration of the video processing device 1720 for a server according to the present embodiment.

In FIG. 25, the CPU 2510 is a processor for arithmetic control and realizes each functional component of the video processing device 1720, which is a collation server, by executing programs. The ROM 2520 stores fixed data and programs such as initial data and programs. The communication control unit 2010 communicates, in the present embodiment, with other devices via a network. The CPU 2510 is not limited to a single CPU; it may be a plurality of CPUs or may include a GPU for image processing.

The RAM 2540 is a random access memory that the CPU 2510 uses as a work area for temporary storage. The RAM 2540 has areas reserved for storing the data necessary to realize the present embodiment. The received local feature amount 2541 is an area that stores the local feature amounts, including feature point coordinates, received from the portable terminal. The read local feature amount 2542 is an area that stores the local feature amounts, including feature point coordinates, read from the local feature amount DB 1721. The plurality of recognition object collation results 2543 is an area that stores the collation results for the plurality of recognition objects recognized by collating the received local feature amounts with the local feature amounts stored in the local feature amount DB 1721. The video recognition result 2544 is an area that stores the result of recognizing the video from the combination of the plurality of recognition objects with reference to the combination identification DB 1722. The link information 2545 is an area that stores the link information retrieved from the link information DB 1723 for the recognition objects. The transmission/reception data 2546 is an area that stores the data transmitted and received via the communication control unit 2010.

The storage 2550 stores databases, various parameters, and the following data and programs necessary to realize the present embodiment. The local feature amount DB 1721 is an area that stores a local feature amount DB similar to that shown in FIG. 6. Since the collation server has ample processing capacity and storage capacity, local feature amounts for all fields may be stored. The combination identification DB 1722 is an area that stores a combination identification DB similar to that shown in FIG. 13. The link information DB 1723 is an area that stores the link information DB shown in FIG. 21.

The storage 2550 also stores the following programs. The collation server control program 2551 is an area that stores the collation server control program that controls the entire video processing device 1720. The local feature amount module 2552 is an area that stores the module of the collation server control program 2551 that generates local feature amounts from images of recognition objects. The object recognition control module 2553 is an area that stores the module of the collation server control program 2551 that recognizes objects by collating the received local feature amounts with the local feature amounts stored in the local feature amount DB 1721. The video recognition control module 2554 is an area that stores the module that recognizes the video from the combination of the plurality of recognition objects with reference to the combination identification DB 1722. The link information acquisition module 2555 is an area that stores the module that acquires, from the link information DB 1723, the link information corresponding to the recognized video.

Note that FIG. 25 shows only the data and programs essential to the present embodiment; data and programs not related to the present embodiment are not shown.

《Processing procedure of the video processing device for a server》
FIG. 26 is a flowchart showing the processing procedure of the video processing device 1720 for a server according to the present embodiment. This flowchart is executed by the CPU 2510 of FIG. 25 using the RAM 2540 and realizes each functional component of FIG. 20.

First, in step S2611, it is determined whether a local feature amount DB is to be generated. In step S2621, it is determined whether local feature amounts have been received from the portable terminal. If neither is the case, other processing is performed in step S2631.

If a local feature amount DB is to be generated, the process proceeds to step S2613, and local feature amount DB generation processing is executed (see FIG. 27A). If local feature amounts have been received, the process proceeds to step S2623, and object recognition/video recognition/link information acquisition processing is executed (see FIG. 27B). Then, in step S2625, the recognition objects and the link information are transmitted to the portable terminal.

(Local feature amount DB generation process)
FIG. 27A is a flowchart showing the processing procedure of the local feature amount DB generation process S2613 according to the present embodiment.

First, in step S2711, an image of a recognition object is acquired. In step S2713, the position coordinates, scale, and angle of the feature points are detected. In step S2715, a local region is acquired for one of the feature points detected in step S2713. Next, in step S2717, the local region is divided into sub-regions. In step S2719, a feature vector is generated for each sub-region to generate the feature vector of the local region. The processing of steps S2713 to S2719 is illustrated in FIG. 4B.

Next, in step S2721, dimension selection is performed on the feature vector of the local region generated in step S2719. Dimension selection is illustrated in FIGS. 4D to 4F. In generating the local feature amount DB 1721, however, although the hierarchization in dimension selection is performed, it is desirable to store all of the generated feature vectors.

In step S2723, it is determined whether generation of local feature amounts and dimension selection have been completed for all feature points detected in step S2713. If not, the process returns to step S2713, and the processing is repeated for the next feature point. If all feature points have been processed, the process proceeds to step S2725, and the local feature amounts and feature point coordinates are registered in the local feature amount DB 1721 in association with the recognition object.

In step S2727, it is determined whether there is another recognition object. If there is, the process returns to step S2711, an image of that recognition object is acquired, and the processing is repeated.

(Recognition object/link information acquisition process)
FIG. 27B is a flowchart showing the processing procedure of the recognition object/link information acquisition process S2623 according to the present embodiment.

First, in step S2731, the local feature amounts of one recognition object are acquired from the local feature amount DB 1721. Then, in step S2733, the local feature amounts of the recognition object are collated with the local feature amounts received from the portable terminal. The collation processing of step S2733 is basically the same as the collation processing of FIG. 9B performed by the portable terminal, and its detailed description is omitted.

In step S2735, it is determined whether they match. If they match, the process proceeds to step S2737, and the matched recognition object is stored. Then, in step S2739, it is determined whether all recognition objects in the video have been collated. If not, the process returns to step S2731, and the recognition of objects in the video is repeated.

If it is determined that all recognition objects have been collated, the process proceeds to step S2741, where the combination identification DB 1722 is referred to and the video is recognized from the combination of the plurality of recognition objects. Next, the link information corresponding to the recognized video is acquired from the link information DB 1723.
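
The flow of steps S2731 to S2741 can be sketched roughly as follows. The sketch follows the criterion given in the abstract, namely that a recognition object is taken to be present when a predetermined ratio or more of its stored local feature amounts correspond to received ones, compared over the smaller number of dimensions. The correspondence test (a Euclidean distance threshold), the parameter values, the function names, and the representation of the combination identification DB as a table keyed by the set of object names are all simplifying assumptions; in particular, the arrangement, distances, and angles of the objects are ignored here.

```python
def corresponds(stored: list[float], received: list[float], max_dist: float) -> bool:
    """Compare two feature vectors over the smaller number of dimensions."""
    dims = min(len(stored), len(received))
    dist_sq = sum((s - r) ** 2 for s, r in zip(stored[:dims], received[:dims]))
    return dist_sq <= max_dist ** 2


def recognize_objects(feature_db, received, ratio=0.5, max_dist=1.0):
    """Steps S2731 to S2739: collate every registered recognition object."""
    recognized = []
    for name, stored_features in feature_db.items():
        # Count stored features that correspond to at least one received one.
        hits = sum(
            any(corresponds(s, r, max_dist) for r in received)
            for s in stored_features
        )
        if stored_features and hits >= ratio * len(stored_features):
            recognized.append(name)  # S2737: store the matched object
    return recognized


def identify_video(combination_db, recognized):
    """Step S2741: look up the combination of recognized objects."""
    return combination_db.get(frozenset(recognized))
```

For example, with stored objects "wheel", "handle", and "wing" and received feature amounts that match only the first two, `recognize_objects` returns the wheel and the handle, and a combination table that maps that pair to "bicycle" yields "bicycle" as the recognized video.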

[Other Embodiments]
Although the present invention has been described above with reference to the embodiments, the present invention is not limited to those embodiments. Various changes that those skilled in the art can understand may be made to the configuration and details of the present invention within its scope. A system or device that combines, in any way, the separate features included in the respective embodiments is also included in the scope of the present invention.

The present invention may be applied to a system composed of a plurality of devices, or to a single apparatus. Furthermore, the present invention is also applicable to a case where a control program that realizes the functions of the embodiments is supplied to a system or an apparatus directly or remotely. Therefore, a control program installed in a computer in order to realize the functions of the present invention on the computer, a medium storing the control program, and a WWW (World Wide Web) server from which the control program is downloaded are also included in the scope of the present invention.

Claims

A video processing apparatus comprising:
first local feature amount storage means for storing a plurality of recognition objects in association with m1, m2, ..., mk first local feature amounts, each consisting of feature vectors from the first dimension to the i1-th, i2-th, ..., ik-th dimension, generated for each of m1, m2, ..., mk local regions each including one of m1, m2, ..., mk feature points in the images of the plurality of recognition objects;
second local feature amount generating means for extracting n feature points from an image in a video and generating n second local feature amounts, each consisting of feature vectors from the first dimension to the j-th dimension, for n local regions each including one of the n feature points;
recognition means for selecting the smaller number of dimensions from among the numbers of dimensions i1, i2, ..., ik of the first local feature amounts and the number of dimensions j of the second local feature amounts, and, when it is determined that a predetermined ratio or more of each of the m1, m2, ..., mk first local feature amounts consisting of feature vectors up to the selected number of dimensions corresponds to the n second local feature amounts consisting of feature vectors up to the selected number of dimensions, recognizing that the plurality of recognition objects corresponding at the predetermined ratio or more exist in the image in the video; and
display means for displaying information indicating the plurality of recognition objects recognized by the recognition means on the image in the video in which the plurality of recognition objects exist.
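The dimension selection and ratio test performed by the recognition means can be illustrated with a minimal sketch. It rests on several assumptions that the claim does not state: correspondence between two feature vectors is decided by Euclidean distance against a hypothetical threshold `max_dist`, the predetermined ratio is passed as the hypothetical parameter `ratio`, and the function name `object_present` is introduced here for illustration only.

```python
import math
from typing import List, Sequence


def object_present(
    stored: List[Sequence[float]],  # first local feature amounts of one object
    query: List[Sequence[float]],   # n second local feature amounts from the image
    ratio: float = 0.5,             # assumed "predetermined ratio"
    max_dist: float = 0.3,          # assumed correspondence threshold
) -> bool:
    """Return True when enough stored features correspond to query features."""
    if not stored or not query:
        return False
    # Select the smaller number of dimensions (i for this object versus j).
    dims = min(len(stored[0]), len(query[0]))
    hits = 0
    for s in stored:
        # Compare only the leading `dims` dimensions of both vectors.
        best = min(math.dist(s[:dims], q[:dims]) for q in query)
        if best <= max_dist:
            hits += 1
    # The object is recognized when the corresponding fraction reaches the ratio.
    return hits / len(stored) >= ratio
```

Applying this function to each stored recognition object in turn gives the set of objects recognized in one image. Truncating to the leading dimensions is meaningful only if the dimensions are ordered so that the leading ones carry the most information.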

The video processing apparatus according to claim 1, further comprising identification means for identifying the video based on the image in the video in which the plurality of recognition objects exist.

The video processing apparatus according to claim 1, further comprising link information storage means for storing link information for accessing related information on the video in association with the video identified by the recognition means,
wherein the display means further displays the link information superimposed on the image in the video.

The video processing apparatus according to claim 1, further comprising:
link information storage means for storing link information for accessing related information on the video in association with the video identified by the recognition means; and
download means for accessing the related information according to the link information,
wherein the display means further displays the related information superimposed on the image in the video.

The video processing apparatus according to claim 1, wherein
the first local feature amount storage means stores sets of the m1, m2, ..., mk first local feature amounts and the position coordinates of the m1, m2, ..., mk feature points in the respective images of the plurality of recognition objects,
the second local feature amount generating means holds sets of the n second local feature amounts and the position coordinates of the n feature points in the image in the video, and
the recognition means recognizes that the plurality of recognition objects exist in the image in the video when it is determined that a predetermined ratio or more of the sets of the n second local feature amounts and their position coordinates and the sets of the m1, m2, ..., mk first local feature amounts and their position coordinates are in a linear transformation relationship.
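One way to test whether matched position coordinates are in a linear transformation relationship is sketched below. This is an assumption-laden illustration rather than the claimed procedure: it fits an affine map (a linear map plus a translation) from the first three coordinate pairs only, then reports the fraction of all pairs that the map reproduces within a hypothetical pixel tolerance `tol`. The names `solve3` and `inlier_ratio` are introduced here, and a practical system would more likely use a robust estimator over many samples.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]


def solve3(a: List[List[float]], b: Sequence[float]) -> Optional[List[float]]:
    """Solve the 3x3 system a x = b by Cramer's rule; None if a is singular."""
    def det(m: List[List[float]]) -> float:
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))
    d = det(a)
    if abs(d) < 1e-12:
        return None
    solution = []
    for col in range(3):
        m = [row[:] for row in a]
        for r in range(3):
            m[r][col] = b[r]
        solution.append(det(m) / d)
    return solution


def inlier_ratio(pairs: List[Tuple[Point, Point]], tol: float = 2.0) -> float:
    """Fraction of (stored point, image point) pairs consistent with one affine map."""
    if len(pairs) < 3:
        return 0.0
    sample = pairs[:3]  # assumption: the first three pairs are trustworthy
    a = [[p[0][0], p[0][1], 1.0] for p in sample]
    cu = solve3(a, [p[1][0] for p in sample])  # coefficients for image x
    cv = solve3(a, [p[1][1] for p in sample])  # coefficients for image y
    if cu is None or cv is None:
        return 0.0
    inliers = 0
    for (x, y), (u, v) in pairs:
        pu = cu[0] * x + cu[1] * y + cu[2]
        pv = cv[0] * x + cv[1] * y + cv[2]
        if math.hypot(pu - u, pv - v) <= tol:
            inliers += 1
    return inliers / len(pairs)
```

Comparing the returned fraction with the predetermined ratio then gives the recognition decision for one recognition object.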

The video processing apparatus according to claim 1, wherein the first local feature amounts and the second local feature amounts are generated by dividing a local region including a feature point extracted from an image into a plurality of sub-regions and generating a multi-dimensional feature vector consisting of histograms of gradient directions in the plurality of sub-regions.
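A feature vector built from gradient-direction histograms over sub-regions can be sketched as follows. The grid size, the number of direction bins, the central-difference gradient and the magnitude weighting are all assumptions chosen for brevity, and the function name `gradient_histogram_descriptor` is hypothetical; the claim itself fixes none of these details.

```python
import math
from typing import List, Sequence


def gradient_histogram_descriptor(
    patch: Sequence[Sequence[float]],  # square local region of pixel intensities
    grid: int = 2,                     # assumed: grid x grid sub-regions
    bins: int = 4,                     # assumed: gradient-direction bins per sub-region
) -> List[float]:
    """Concatenate one gradient-direction histogram per sub-region."""
    size = len(patch)
    cell = size // grid
    desc = [0.0] * (grid * grid * bins)
    # Border pixels are skipped because central differences need both neighbours.
    for y in range(1, size - 1):
        for x in range(1, size - 1):
            gx = patch[y][x + 1] - patch[y][x - 1]
            gy = patch[y + 1][x] - patch[y - 1][x]
            mag = math.hypot(gx, gy)
            if mag == 0.0:
                continue
            angle = math.atan2(gy, gx) % (2.0 * math.pi)
            b = min(int(angle / (2.0 * math.pi) * bins), bins - 1)
            cy = min(y // cell, grid - 1)
            cx = min(x // cell, grid - 1)
            # Accumulate the gradient magnitude into the direction bin.
            desc[(cy * grid + cx) * bins + b] += mag
    return desc
```

With `grid=2` and `bins=4` the descriptor has 16 dimensions, one block of 4 per sub-region.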

The video processing apparatus according to claim 6, wherein the first local feature amounts and the second local feature amounts are generated by deleting, from the generated multi-dimensional feature vector, dimensions having a larger correlation between adjacent sub-regions.

The video processing apparatus according to claim 6, wherein the plurality of dimensions of the feature vector are arranged so as to go around the local region once for every predetermined number of dimensions, so that dimensions can be selected in order from the first dimension, starting with the dimensions that contribute to the feature of the feature point, according to the accuracy required of the local feature amount.

A control method for a video processing apparatus provided with first local feature amount storage means for storing a plurality of recognition objects in association with m1, m2, ..., mk first local feature amounts, each consisting of feature vectors from the first dimension to the i1-th, i2-th, ..., ik-th dimension, generated for each of m1, m2, ..., mk local regions each including one of m1, m2, ..., mk feature points in the images of the plurality of recognition objects, the method comprising:
a second local feature amount generating step of extracting n feature points from an image in a video and generating n second local feature amounts, each consisting of feature vectors from the first dimension to the j-th dimension, for n local regions each including one of the n feature points;
a recognition step of selecting the smaller number of dimensions from among the numbers of dimensions i1, i2, ..., ik of the first local feature amounts and the number of dimensions j of the second local feature amounts, and, when it is determined that a predetermined ratio or more of each of the m1, m2, ..., mk first local feature amounts consisting of feature vectors up to the selected number of dimensions corresponds to the n second local feature amounts consisting of feature vectors up to the selected number of dimensions, recognizing that the plurality of recognition objects corresponding at the predetermined ratio or more exist in the image in the video; and
a display step of displaying information indicating the plurality of recognition objects recognized in the recognition step on the image in the video in which the plurality of recognition objects exist.

A control program for a video processing apparatus provided with first local feature amount storage means for storing a plurality of recognition objects in association with m1, m2, ..., mk first local feature amounts, each consisting of feature vectors from the first dimension to the i1-th, i2-th, ..., ik-th dimension, generated for each of m1, m2, ..., mk local regions each including one of m1, m2, ..., mk feature points in the images of the plurality of recognition objects, the program causing a computer to execute:
a second local feature amount generating step of extracting n feature points from an image in a video and generating n second local feature amounts, each consisting of feature vectors from the first dimension to the j-th dimension, for n local regions each including one of the n feature points;
a recognition step of selecting the smaller number of dimensions from among the numbers of dimensions i1, i2, ..., ik of the first local feature amounts and the number of dimensions j of the second local feature amounts, and, when it is determined that a predetermined ratio or more of each of the m1, m2, ..., mk first local feature amounts consisting of feature vectors up to the selected number of dimensions corresponds to the n second local feature amounts consisting of feature vectors up to the selected number of dimensions, recognizing that the plurality of recognition objects corresponding at the predetermined ratio or more exist in the image in the video; and
a display step of displaying information indicating the plurality of recognition objects recognized in the recognition step on the image in the video in which the plurality of recognition objects exist.

A video processing system having a video processing apparatus for a mobile terminal and a video processing apparatus for a server connected via a network, the system comprising:
first local feature amount storage means for storing a plurality of recognition objects in association with m1, m2, ..., mk first local feature amounts, each consisting of feature vectors from the first dimension to the i1-th, i2-th, ..., ik-th dimension, generated for each of m1, m2, ..., mk local regions each including one of m1, m2, ..., mk feature points in the images of the plurality of recognition objects;
second local feature amount generating means for extracting n feature points from an image in a video and generating n second local feature amounts, each consisting of feature vectors from the first dimension to the j-th dimension, for n local regions each including one of the n feature points;
recognition means for selecting the smaller number of dimensions from among the numbers of dimensions i1, i2, ..., ik of the first local feature amounts and the number of dimensions j of the second local feature amounts, and, when it is determined that a predetermined ratio or more of each of the m1, m2, ..., mk first local feature amounts consisting of feature vectors up to the selected number of dimensions corresponds to the n second local feature amounts consisting of feature vectors up to the selected number of dimensions, recognizing that the plurality of recognition objects corresponding at the predetermined ratio or more exist in the image in the video; and
display means for displaying information indicating the plurality of recognition objects recognized by the recognition means on the image in the video in which the plurality of recognition objects exist.

The video processing system according to claim 12, wherein
the video processing apparatus for the mobile terminal comprises:
the second local feature amount generating means;
first transmitting means for encoding the n second local feature amounts and transmitting them to the video processing apparatus for the server via the network; and
first receiving means for receiving, from the video processing apparatus for the server, information indicating the recognition objects recognized by the video processing apparatus for the server, and
the video processing apparatus for the server comprises:
the first local feature amount storage means;
second receiving means for receiving the encoded n second local feature amounts from the video processing apparatus for the mobile terminal and decoding them;
the recognition means; and
second transmitting means for transmitting information indicating the recognition objects recognized by the recognition means to the video processing apparatus for the mobile terminal via the network.

A video processing apparatus for a mobile terminal in the video processing system according to claim 11 or 12, comprising:
second local feature amount generating means for extracting n feature points from an image in a video and generating n second local feature amounts, each consisting of feature vectors from the first dimension to the j-th dimension, for n local regions each including one of the n feature points;
first transmitting means for encoding the n second local feature amounts and transmitting them to the video processing apparatus for the server via the network;
first receiving means for receiving, from the video processing apparatus for the server, information indicating a plurality of recognition objects recognized by the video processing apparatus for the server; and
display means for displaying the received information indicating the plurality of recognition objects on the image in the video in which the recognition objects exist.

A method for controlling a video processing apparatus for a mobile terminal in the video processing system according to claim 11 or 12, the method comprising:
a second local feature amount generating step of extracting n feature points from an image in a video and generating n second local feature amounts, each consisting of feature vectors from the first dimension to the j-th dimension, for n local regions each including one of the n feature points;
a first transmitting step of encoding the n second local feature amounts and transmitting them to the video processing apparatus for the server via the network;
a first receiving step of receiving, from the video processing apparatus for the server, information indicating a plurality of recognition objects recognized by the video processing apparatus for the server; and
a display step of displaying the received information indicating the plurality of recognition objects on the image in the video in which the recognition objects exist.

A control program for a video processing apparatus for a mobile terminal in the video processing system according to claim 11 or 12, the program causing a computer to execute:
a second local feature amount generating step of extracting n feature points from an image in a video and generating n second local feature amounts, each consisting of feature vectors from the first dimension to the j-th dimension, for n local regions each including one of the n feature points;
a first transmitting step of encoding the n second local feature amounts and transmitting them to the video processing apparatus for the server via the network;
a first receiving step of receiving, from the video processing apparatus for the server, information indicating a plurality of recognition objects recognized by the video processing apparatus for the server; and
a display step of displaying the received information indicating the plurality of recognition objects on the image in the video in which the recognition objects exist.

A video processing apparatus for a server in the video processing system according to claim 11 or 12, comprising:
first local feature amount storage means for storing a plurality of recognition objects in association with m1, m2, ..., mk first local feature amounts, each consisting of feature vectors from the first dimension to the i1-th, i2-th, ..., ik-th dimension, generated for each of m1, m2, ..., mk local regions each including one of m1, m2, ..., mk feature points in the images of the plurality of recognition objects;
second receiving means for receiving the encoded n second local feature amounts from the video processing apparatus for the mobile terminal and decoding them;
recognition means for selecting the smaller number of dimensions from among the numbers of dimensions i1, i2, ..., ik of the first local feature amounts and the number of dimensions j of the second local feature amounts, and, when it is determined that a predetermined ratio or more of each of the m1, m2, ..., mk first local feature amounts consisting of feature vectors up to the selected number of dimensions corresponds to the n second local feature amounts consisting of feature vectors up to the selected number of dimensions, recognizing that the plurality of recognition objects corresponding at the predetermined ratio or more exist in the image in the video; and
second transmitting means for transmitting information indicating the plurality of recognition objects recognized by the recognition means to the video processing apparatus for the mobile terminal via the network.

A method for controlling a video processing apparatus for a server in the video processing system according to claim 11 or 12, the apparatus being provided with first local feature amount storage means for storing a plurality of recognition objects in association with m1, m2, ..., mk first local feature amounts, each consisting of feature vectors from the first dimension to the i1-th, i2-th, ..., ik-th dimension, generated for each of m1, m2, ..., mk local regions each including one of m1, m2, ..., mk feature points in the images of the plurality of recognition objects, the method comprising:
a second receiving step of receiving the encoded n second local feature amounts from the video processing apparatus for the mobile terminal and decoding them;
a recognition step of selecting the smaller number of dimensions from among the numbers of dimensions i1, i2, ..., ik of the first local feature amounts and the number of dimensions j of the second local feature amounts, and, when it is determined that a predetermined ratio or more of each of the m1, m2, ..., mk first local feature amounts consisting of feature vectors up to the selected number of dimensions corresponds to the n second local feature amounts consisting of feature vectors up to the selected number of dimensions, recognizing that the plurality of recognition objects corresponding at the predetermined ratio or more exist in the image in the video; and
a second transmitting step of transmitting information indicating the plurality of recognition objects recognized in the recognition step to the video processing apparatus for the mobile terminal via the network.

A control program for a video processing apparatus for a server in the video processing system according to claim 11 or 12, the apparatus being provided with first local feature amount storage means for storing a plurality of recognition objects in association with m1, m2, ..., mk first local feature amounts, each consisting of feature vectors from the first dimension to the i1-th, i2-th, ..., ik-th dimension, generated for each of m1, m2, ..., mk local regions each including one of m1, m2, ..., mk feature points in the images of the plurality of recognition objects, the program causing a computer to execute:
a second receiving step of receiving the encoded n second local feature amounts from the video processing apparatus for the mobile terminal and decoding them;
a recognition step of selecting the smaller number of dimensions from among the numbers of dimensions i1, i2, ..., ik of the first local feature amounts and the number of dimensions j of the second local feature amounts, and, when it is determined that a predetermined ratio or more of each of the m1, m2, ..., mk first local feature amounts consisting of feature vectors up to the selected number of dimensions corresponds to the n second local feature amounts consisting of feature vectors up to the selected number of dimensions, recognizing that the plurality of recognition objects corresponding at the predetermined ratio or more exist in the image in the video; and
a second transmitting step of transmitting information indicating the plurality of recognition objects recognized in the recognition step to the video processing apparatus for the mobile terminal via the network.

A video processing method in a video processing system that has a video processing apparatus for a mobile terminal and a video processing apparatus for a server connected via a network, and that is provided with first local feature amount storage means for storing a plurality of recognition objects in association with m1, m2, ..., mk first local feature amounts, each consisting of feature vectors from the first dimension to the i1-th, i2-th, ..., ik-th dimension, generated for each of m1, m2, ..., mk local regions each including one of m1, m2, ..., mk feature points in the images of the plurality of recognition objects, the method comprising:
a second local feature amount generating step of extracting n feature points from an image in a video and generating n second local feature amounts, each consisting of feature vectors from the first dimension to the j-th dimension, for n local regions each including one of the n feature points;
a recognition step of selecting the smaller number of dimensions from among the numbers of dimensions i1, i2, ..., ik of the first local feature amounts and the number of dimensions j of the second local feature amounts, and, when it is determined that a predetermined ratio or more of each of the m1, m2, ..., mk first local feature amounts consisting of feature vectors up to the selected number of dimensions corresponds to the n second local feature amounts consisting of feature vectors up to the selected number of dimensions, recognizing that the plurality of recognition objects corresponding at the predetermined ratio or more exist in the image in the video; and
a display step of displaying information indicating the plurality of recognition objects recognized in the recognition step on the image in the video in which the plurality of recognition objects exist.