JP2017045423A

JP2017045423A - Image processing apparatus, image processing method, image processing system, and program

Info

Publication number: JP2017045423A
Application number: JP2015169729A
Authority: JP
Inventors: Miki Ito (伊藤 幹)
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2015-08-28
Filing date: 2015-08-28
Publication date: 2017-03-02

Abstract

PROBLEM TO BE SOLVED: To provide an image processing apparatus, method, system, and program that can more accurately identify objects in which customers are interested. SOLUTION: An image processing apparatus comprises a person position detector 203 that detects the position of a person in an image captured by an image pickup device, a head position detector 206 that detects the height of the person's head, and an information generator 207 that generates information on an object at which the person is gazing on the basis of the position of the person and a change in the height of the person's head. The apparatus further has a person attribute estimator 205 that estimates attributes of the person on the basis of the image, and the information generator 207 adds the person's attributes to the information on the object at which the person is gazing. SELECTED DRAWING: Figure 1

Description

The present invention relates to a video processing apparatus, a video processing method, a video processing system, and a program for analyzing video captured by a plurality of cameras.

In recent years, as video analysis technology has improved, systems have been devised that analyze the purchasing behavior of customers visiting a store, such as a supermarket or convenience store, based on video captured by cameras placed in the store. For example, Patent Document 1 proposes a system that identifies a product in which a customer is interested and outputs information about that product in association with information about the customer. Patent Document 2 proposes an analysis device that detects the action of a person picking up an article even when the movement of the person's hand or arm cannot be detected.

Patent Document 1: Japanese Patent Laid-Open No. 2009-3701
Patent Document 2: Japanese Patent Laid-Open No. 2015-11649

In an actual store, products are placed on many display shelves, and a customer may face various directions to look at products while standing in the same position. When a desired product is far from the customer's eye level, the customer may also squat down or step onto a stepladder or the like to look at it. However, the conventional techniques described above do not take into account the orientation of the customer's face or changes in the customer's vertical movement relative to the floor when determining which products the customer is interested in.

The present invention has been made in view of the above problem, and aims to more accurately identify the objects in which a customer is interested.

As one means for achieving the above object, the video processing apparatus of the present invention has the following configuration. That is, it includes person position detecting means for detecting the position of a person in video captured by an imaging device, head position detecting means for detecting the height of the person's head, and generating means for generating information on an object at which the person is gazing, based on the position of the person and a change in the height of the person's head.

According to the present invention, the objects in which a customer is interested can be identified more accurately.

FIG. 1 is a diagram showing an example of the functional block configuration of the video processing system in the first embodiment.
FIG. 2 is a diagram showing the operating environment of the video processing system in the first embodiment.
FIG. 3 is a diagram showing an example of the hardware configuration of the server device in the first embodiment.
FIG. 4 is a flowchart showing the operation of the video processing system in the first embodiment.
FIG. 5 is a flowchart showing the person position detection operation in the first embodiment.
FIG. 6 is a diagram explaining the calibration process in the first embodiment.
FIG. 7 is a diagram explaining the person correspondence search and the detection of a person's three-dimensional position coordinates in the first embodiment.
FIG. 8 is a diagram explaining the process of complementing missed detections in the first embodiment.
FIG. 9 is a diagram showing an example of a screen displayed on the display device in the first embodiment.
FIG. 10 is a diagram showing an example of a screen displayed on the display device in the first embodiment.
FIG. 11 is a diagram showing an example of a screen displayed on the display device in the first embodiment.

Hereinafter, the present invention will be described in detail based on preferred embodiments with reference to the accompanying drawings. The configurations shown in the following embodiments are merely examples, and the present invention is not limited to the illustrated configurations.

<First Embodiment>
FIG. 2 is a diagram showing the operating environment of the video processing system in the first embodiment. Cameras 100, a server device 200, a storage device 300, and a display device 400 are connected by a LAN (Local Area Network) 500, which is a network line. The LAN is only an example, and another type of network may be used.

The camera 100 is an imaging device that can be connected to a network. The server device 200 collects (for example, receives) video data captured by the plurality of cameras 100 connected via the LAN 500 and performs video analysis processing. That is, the server device 200 functions as a video processing apparatus. The video analysis processing includes, for example, moving object detection, moving object tracking, human body detection, face recognition, and object detection. In addition to the analysis data produced by the video analysis processing, the server device 200 collects past video data and analysis data recorded in the storage device 300. By using the collected data, the server device 200 manages video information across the entire store. The video data captured by the cameras 100 and the analysis data produced by the video analysis processing in the server device 200 are recorded in the storage device 300 via the LAN 500.

The display device 400 displays images that combine the video data and the analysis data recorded in the storage device 300. The display device 400 can also display information managed by the server device 200, such as video information and time. The display device 400 can also accept operations for video search via a user I/F (interface), not shown. In this case, the server device 200 receives information on the search target from the display device 400, searches the data recorded in the storage device 300 for the specific event scene to be searched for, and provides the search result to the display device 400. The display device 400 displays the provided information. Note that the display device 400 may be incorporated as part of the server device 200. The server device 200 may also have a display control function for causing the display device 400 to display information.

As shown in FIG. 2, the display device 400 is assumed to be, for example, a combination of a PC and a monitor. However, because the physical connection of the display device 400 to the LAN 500 may be wireless as well as wired, the display device 400 may be a wireless terminal such as a tablet. When the server device 200 has a display control function for causing the display device 400 to display information, the display device 400 may be a simple monitor. Thus, the form of the display device 400 is not limited. The number of cameras 100 constituting the video processing system is four in FIG. 2, but any number may be used. Furthermore, the number of server devices 200, storage devices 300, and display devices 400 connected to the LAN 500 is not limited to one each as shown in FIG. 1, and there may be a plurality of each as long as they can be distinguished by address or the like. It is assumed that the position at which each camera 100 is installed is known.

FIG. 1 is a diagram showing an example of the functional block configuration of the video processing system in the present embodiment. The camera 100 includes an imaging sensor unit 101, a video processing unit 102, a video encoding unit 103, and a communication control unit 104.
The imaging sensor unit 101 is an imaging element such as a CMOS sensor, and converts the optical image formed on its imaging surface into a digital electric signal by photoelectric conversion.
The video processing unit 102 performs predetermined pixel interpolation and color conversion processing on the digital electric signal obtained by photoelectric conversion in the imaging sensor unit 101. Through this processing, the video processing unit 102 generates digital video in a format such as RGB or YUV. The video processing unit 102 can also perform predetermined arithmetic processing on the generated digital video and, based on the result, perform video processing such as white balance adjustment, sharpness adjustment, contrast adjustment, and color conversion.

The video encoding unit 103 encodes the digital video signal input from the video processing unit 102. For example, the video encoding unit 103 compresses the input digital video signal for distribution. The compression method is based on a standard such as MPEG4, H.264, MJPEG, or JPEG. The video encoding unit 103 also packages the video data into a file in a format such as mp4 or mov. Note that compression is not mandatory.
The communication control unit 104 performs communication control for communicating with the server device 200. For example, the communication control unit 104 performs communication control for communication with the server device 200 conforming to the 802.11 series. The communication control unit 104 can also record video data by cooperating with a communication control unit (not shown) of the storage device 300 to construct a network file system such as NFS (Network File System) or CIFS (Common Internet File System).

The server device 200 includes a communication control unit 201, a video decoding unit 202, a person position detection unit 203, a face detection unit 204, a person attribute estimation unit 205, a head position detection unit 206, and an information generation unit 207. The communication control unit 201 has functions equivalent to those of the communication control unit 104 of the camera 100 described above.
The video decoding unit 202 decompresses and decodes the video data distributed from the camera 100. The video decoding unit 202 can also decompress and decode video data acquired from the storage device 300.
The person position detection unit 203 performs position detection processing that detects the position of a person by triangulation while tracking the person being captured, using the video data from the plurality of cameras 100.
The face detection unit 204 detects the face of the person being captured from the decoded video data. The face detection unit 204 then detects the orientation of the face by recognizing parts of the detected face such as the eyes, nose, and mouth.
The person attribute estimation unit 205 estimates the age and gender of the person based on the face detected by the face detection unit 204. The head position detection unit 206 detects the position of the head of the person being captured from the decoded video data, and estimates the person's height based on the detected head position. The height may be estimated from the distance between the lower end of the entrance door frame and the head position when the person passes through the frame, or by detecting the person's feet by pattern matching and using the distance from the feet to the head. Alternatively, the floor surface may be recognized and the height estimated from the distance between the floor and the head. The method of detecting the person's feet allows the person's height to be estimated at any location.
The information generation unit 207 generates information on the object at which the person is gazing. It is assumed that the information generation unit 207 holds information in advance about the positions of products and product shelves in the store.

FIG. 3 is a diagram showing an example of the hardware configuration of the server device 200 and the display device 400 in the present embodiment. It shows, as one example, a configuration in which the display device 400 is incorporated in the server device 200. The control unit 31 is, for example, a CPU (Central Processing Unit), and controls the operation of each component. The ROM (Read Only Memory) 32 stores control instructions, that is, programs. The RAM (Random Access Memory) 33 is used as work memory when executing programs and for temporary storage of data. The communication unit 34 performs control for physically communicating with external devices. The display unit 35 performs various kinds of display. The user I/F 36 accepts user operations.

Next, the operation of the person position detection unit 203 will be described with reference to FIG. 5. FIG. 5 is a flowchart showing the operation of the person position detection unit 203. The person position detection unit 203 performs calibration in advance, associating the on-video coordinates of an object (for example, a person) included in the video based on the video data generated by the plurality of cameras 100 with the object's actual coordinates in three dimensions (three-dimensional position coordinates) (S11).

The process of S11 will be described with reference to FIG. 6. FIG. 6 is a diagram explaining the calibration process. In FIG. 6, cameras 1 to 3 correspond to the cameras 100 in FIGS. 1 and 2. In this process, as shown in FIG. 6, the person position detection unit 203 associates, for example, the on-video coordinates (X₁, Y₁, Z₁) of the person 602 included in the video data generated by camera 1 with the actual three-dimensional coordinates (X_a, Y_a, Z_a) of the person 601, treating them as the same. The person position detection unit 203 performs the same processing on the video data captured by camera 2 and camera 3. As a result, the coordinates (X₁, Y₁, Z₁), (X₂, Y₂, Z₂), and (X₃, Y₃, Z₃) of the persons 602 to 604 included in the videos generated by cameras 1 to 3, respectively, are each treated as the same as the actual three-dimensional coordinates (X_a, Y_a, Z_a) of the person 601, and the following processing is started. As described in detail later, the actual three-dimensional coordinates (X_a, Y_a, Z_a) of the person 601 are calculated, for example, at predetermined time intervals. At the calibration stage, the actual three-dimensional coordinates (X_a, Y_a, Z_a) may be set to arbitrary initial values.

After the calibration in S11 is completed, the person position detection unit 203 performs tracking processing to track the persons included in the video generated by each camera 100 (S12). For example, tracking can be performed by detecting a person using pattern matching, recording the image features of the detected person, and detecting the region most similar to those image features in the next frame. Various other methods can also be used for person tracking.
Next, for each person being tracked, the person position detection unit 203 searches for the corresponding person across cameras using multi-view geometry analysis (S13). The person position detection unit 203 then calculates the three-dimensional position coordinates of the person being tracked (S14).
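As a rough illustration of the feature-based tracking step (S12), the following minimal Python sketch picks, among candidate regions in the next frame, the one whose feature vector is most similar to the recorded feature of the tracked person. The function names, the use of plain lists as feature vectors, and the choice of cosine similarity are assumptions made for illustration; the source only states that the most similar region is detected.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero feature as dissimilar to everything
    return dot / (norm_a * norm_b)


def most_similar_region(reference_feature, candidate_features):
    """Return (index, similarity) of the best candidate, or None if empty.

    reference_feature: feature recorded for the tracked person (hypothetical).
    candidate_features: one feature vector per candidate region in the
    next frame (hypothetical).
    """
    if not candidate_features:
        return None
    scores = [cosine_similarity(reference_feature, c) for c in candidate_features]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

In practice a minimum similarity would probably be required before accepting a match, so that a person who has left the frame is not matched to an unrelated region.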

The processes of S13 and S14 will be described with reference to FIG. 7. FIG. 7 is a diagram explaining the person correspondence search and the calculation of a person's three-dimensional position coordinates. Cameras 1 and 2 in FIG. 7 correspond to the cameras 100 in FIGS. 1 and 2. As shown in FIG. 7, the straight line 702 connecting camera 1 and the person 701 (the head in FIG. 7) included in the video captured by camera 1 appears as the straight line 703 in the video captured by camera 2. The person position detection unit 203 calculates the position of the straight line 703 in the video captured by camera 2 using information indicating the position of each camera. The information indicating the position of each camera need only indicate at least the relative positions of the cameras, but it may also express the space in which the cameras are installed in three-dimensional coordinates.
The person position detection unit 203 identifies the person 704 who intersects the straight line 703 in the video captured by camera 2 (in FIG. 7, the person whose head intersects the straight line 703) as the person 701 included in the video captured by camera 1. That is, the person 701 included in the video generated by camera 1 and the person 704 included in the video generated by camera 2 are determined to be the same person 705 and are associated with each other. Next, the person position detection unit 203 detects the three-dimensional position coordinates of the person 705 by triangulation, using the known positions of camera 1 and camera 2 (S14). Specifically, from the information indicating the position of each camera and the information indicating each camera's shooting direction, the person position detection unit 203 calculates the straight line 702 connecting camera 1 and the person 701 included in the video captured by camera 1. It then calculates the straight line connecting camera 2 and the person 701 included in the video captured by camera 2. The intersection of these two straight lines is calculated as the position of the person 701 (the head in FIG. 7). In this way, the actual position of the person 701 relative to the camera installation positions can be calculated.
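The triangulation in S14 can be sketched as follows. This is a minimal illustration under assumptions, not the method claimed in the source: each camera is reduced to a known position and a viewing direction toward the person's head, and because two measured lines rarely intersect exactly, the midpoint of the shortest segment between them is used in place of the exact intersection. The function name and argument layout are hypothetical.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _unit(v):
    n = math.sqrt(_dot(v, v))
    return [x / n for x in v]


def triangulate_from_rays(c1, d1, c2, d2):
    """Estimate a 3D point from two viewing rays.

    c1, c2: camera positions; d1, d2: directions from each camera toward
    the person's head. Returns the midpoint of the shortest segment
    between the two lines (an assumed substitute for the exact
    intersection, which noisy measurements rarely produce).
    """
    d1, d2 = _unit(d1), _unit(d2)
    w = [a - b for a, b in zip(c1, c2)]
    b = _dot(d1, d2)
    denom = 1.0 - b * b
    if denom < 1e-12:
        raise ValueError("rays are parallel; the position cannot be triangulated")
    d, e = _dot(d1, w), _dot(d2, w)
    t1 = (b * e - d) / denom  # parameter of the closest point on line 1
    t2 = (e - b * d) / denom  # parameter of the closest point on line 2
    p1 = [c + t1 * u for c, u in zip(c1, d1)]
    p2 = [c + t2 * u for c, u in zip(c2, d2)]
    return [(a + q) / 2.0 for a, q in zip(p1, p2)]
```

For example, two cameras mounted at a height of 3 m that both look toward a head at (2, 2, 1.6) would return approximately that point.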

Furthermore, the person position detection unit 203 complements missed detections for any camera for which the person could not be associated in S13 (S15). The process of S15 will be described with reference to FIG. 8. FIG. 8 is a schematic diagram explaining the process of complementing missed detections. Cameras 1 to 3 in FIG. 8 correspond to the cameras 100 in FIGS. 1 and 2. In FIG. 8, as described with reference to FIG. 7, camera 1 and camera 2 have been associated with the person 705, but camera 3 has not. In this case, the person position detection unit 203 projects the three-dimensional position coordinates of the person 705 detected in S14 onto the video based on the video data generated by camera 3, and detects the resulting position coordinates on that video. At that position, it displays an image from which the position of the person 705, calculated from the detection results of the other cameras, can be recognized. As a result, the person 705 appears in the video data generated by camera 3, and the missed detection in camera 3 is complemented (S15). The image displayed here may be an image of the person 705 captured by another camera, or may be a figure or the like representing a person.
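The projection used in S15 can be illustrated with a standard pinhole camera model. The sketch below is an assumption about how such a projection might be computed; the source does not specify the camera model. The rotation `R`, translation `t`, focal lengths `fx`, `fy`, and principal point `cx`, `cy` are hypothetical calibration parameters for camera 3.

```python
def project_to_image(point_world, R, t, fx, fy, cx, cy):
    """Project a 3D world point into pixel coordinates (pinhole model).

    R: 3x3 world-to-camera rotation (list of rows); t: translation.
    Returns (u, v), or None when the point lies behind the camera.
    """
    # Transform the world point into the camera coordinate system.
    xc, yc, zc = (
        sum(R[i][j] * point_world[j] for j in range(3)) + t[i] for i in range(3)
    )
    if zc <= 0:
        return None  # the point is not in front of the camera
    # Perspective division followed by the intrinsic mapping to pixels.
    return (fx * xc / zc + cx, fy * yc / zc + cy)
```

The returned pixel position is where a marker or a cropped image of the person could be overlaid on the video of the camera that missed the detection.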

The person position detection unit 203 carries out person position detection by repeating the processes of S12 to S15 described above. Note that the position detection method described above is only an example, and the person position detection unit 203 can also detect the position of a specific person using a separate position sensor or the like. For example, detection by radar or the like is also possible.

Next, the operation of the video processing system of the present embodiment, configured as described above, will be described in detail. FIG. 4 is a flowchart showing the operation of the video processing system. As an example, FIG. 4 shows the operation of the system when a person who is a customer visits a store and moves around inside it. First, a person who is a customer enters the store. The cameras 100 installed in the store capture the person, generate video data, and distribute it to the server device 200. The distributed video data is decoded by the video decoding unit 202 of the server device 200 and input to the person position detection unit 203. The person position detection unit 203 starts the process of detecting the person's position in the store while tracking the person using the video data (S1). At that time, the person position detection unit 203 assigns a unique tracking ID to the person to be tracked in the video. The person's position is associated with the tracking ID and managed in the RAM 33 or the like (S1).

Then, while the person position detection unit 203 is performing position detection and tracking of the person, the face detection unit 204 detects the person's face from the video data and detects the orientation of the face (S2). Further, for the face detected by the face detection unit 204, the person attribute estimation unit 205 estimates the person's age and gender based on the image features of the face and the like (S3). The attribute data resulting from the estimation, such as age and gender, is associated with the same tracking ID as that assigned in S1 and managed in the RAM 33 or the like. Next, the head position detection unit 206 detects the position of the person's head from the video data and estimates the person's height from the detection result (S4). The head position can be detected using pattern matching or the like. The estimated height data is associated with the same tracking ID as that assigned in S1 and managed in the RAM 33 or the like. If the person attribute estimation unit 205 could not estimate the age in step S3, the head position detection unit 206 may roughly classify the person as an adult or a child from the estimated height data, and estimate the age and gender based on that classification. In this case as well, the attribute data including the estimated age and gender is associated with the same tracking ID as that assigned in S1 and managed in the RAM 33 or the like.
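The bookkeeping described above, in which position, attributes, and height are all keyed by the tracking ID, could be organized as in the following sketch. The class and field names are hypothetical, and the 1.4 m boundary used for the adult/child fallback is an illustrative assumption; the source does not give a value.

```python
from dataclasses import dataclass
from itertools import count
from typing import Dict, Optional, Tuple


@dataclass
class TrackRecord:
    """Data managed per tracked person (field names are illustrative)."""
    tracking_id: int
    position: Optional[Tuple[float, float, float]] = None
    age: Optional[int] = None
    gender: Optional[str] = None
    height_m: Optional[float] = None
    age_group: Optional[str] = None


class TrackStore:
    """In-memory store keyed by a unique tracking ID (assigned in S1)."""

    def __init__(self) -> None:
        self._ids = count(1)
        self._records: Dict[int, TrackRecord] = {}

    def new_track(self, position: Tuple[float, float, float]) -> int:
        tracking_id = next(self._ids)
        self._records[tracking_id] = TrackRecord(tracking_id, position=position)
        return tracking_id

    def update(self, tracking_id: int, **attrs) -> None:
        record = self._records[tracking_id]
        for name, value in attrs.items():
            if not hasattr(record, name):
                raise AttributeError(f"unknown field: {name}")
            setattr(record, name, value)

    def get(self, tracking_id: int) -> TrackRecord:
        return self._records[tracking_id]


def classify_age_group(height_m: float, adult_min_height_m: float = 1.4) -> str:
    """Fallback when face-based age estimation fails (threshold is assumed)."""
    return "adult" if height_m >= adult_min_height_m else "child"
```

A caller would create a record when tracking starts, then attach the results of S3 and S4 with `update` as they become available.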

Through the above processing, the server device 200 continues tracking while the person walks around the store. The video data generated by the cameras 100 and the information managed by the server device 200 as described above are displayed on the display device 400. FIG. 9 shows an example 900 of a screen displayed on the display device 400. Cameras 1 to 4 correspond to the cameras 100 in FIGS. 1 and 2. Screens 901 to 904 show the video data generated by cameras 1 to 4, respectively. Screen 905 shows the positional relationship between cameras 1 to 4 and the person. In the video data generated by cameras 1 and 2, the person's face has been detected, and the person attribute estimation unit 205 estimates the age and gender from that video data. As described above, the head position detection unit 206 may estimate the person's age. Furthermore, as illustrated in screens 901 to 904, the head position detection unit 206 detects the person's head from the video data generated by cameras 1 to 4.

Next, the person position detection unit 203 determines, from the position detection result, whether the person is standing still in front of a product shelf (S5). If it is determined that the person is not standing still (No in S5), the process returns to S2 and the person position detection unit 203 continues to track the person. If it is determined that the person is standing still (Yes in S5), the head position detection unit 206 detects the person's head, estimates the person's height, and determines whether the height has changed by a threshold or more (S6). A change in height can be detected as long as at least the head position can be detected, but it may also be detected from the positions of the feet and the head.
If it is determined that the height has not changed (the height of the head has not changed by the threshold or more) (No in S6), the process proceeds to S7. If it is determined that the height has changed (the height of the head has changed) (Yes in S6), the face detection unit 204 detects the orientation of the person's face. This detection result is associated with the person's tracking ID and passed to the information generation unit 207.
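The S6 determination can be illustrated with a short sketch that compares head-height samples taken while the person is standing still against a baseline height. The function names and the 20 cm threshold are assumptions made for illustration. The description only says that the change is compared with a threshold.

```python
from typing import Sequence

# Hypothetical threshold; the description does not give a value.
HEIGHT_CHANGE_THRESHOLD_CM = 20.0


def head_height_changed(
    baseline_cm: float,
    samples_cm: Sequence[float],
    threshold_cm: float = HEIGHT_CHANGE_THRESHOLD_CM,
) -> bool:
    """Return True if any sample differs from the baseline by the threshold or more."""
    return any(abs(h - baseline_cm) >= threshold_cm for h in samples_cm)


def change_direction(
    baseline_cm: float,
    current_cm: float,
    threshold_cm: float = HEIGHT_CHANGE_THRESHOLD_CM,
) -> str:
    """Classify the change as 'lowered' (e.g. crouching), 'raised' (e.g. on a stepladder), or 'unchanged'."""
    delta = current_cm - baseline_cm
    if delta <= -threshold_cm:
        return "lowered"
    if delta >= threshold_cm:
        return "raised"
    return "unchanged"
```

Using an absolute difference treats crouching and climbing alike, which matches the description's examples of a person crouching down or standing on a stepladder.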

The information generation unit 207 detects the position of the product or product shelf located in the direction of the person's face as detected by the face detection unit 204, and identifies the detected position as the in-store position of the product in which the person is interested, that is, the position where the object the person is watching is placed. The information generation unit 207 then generates information on this identified position as information on the object watched by the person (S8). This is because a person who crouches down or climbs onto a stepladder or the like to look at a product can be judged to be watching that product closely. When the height of the head has changed, the information may be generated with a different object as the watched object than when it has not changed. For example, when the height of the person's head has become lower, a product on the lower part of the shelf may be treated as the object watched by the person. If the server device 200 has a timer (not shown), the information generation unit 207 may acquire from the timer the time at which the face detection unit 204 detected the face orientation and include it in the information on the object watched by the person. Based on the tracking ID, the information generation unit 207 can acquire the above-described information managed in the RAM 33 or the like and include it in the information on the object watched by the person. The information generated by the information generation unit 207 is then recorded in a database in the storage device 300 via the LAN 500. The information generated by the information generation unit 207 may also be sent to the display device 400 via the LAN 500.
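One way to picture the S8 step is a 2D floor-plan search for the shelf that lies in the direction the face is pointing. The sketch below is an assumption-laden illustration: shelves are reduced to center points, and the names `Shelf`, `shelf_in_face_direction`, and `build_gaze_record`, the 30 degree angular tolerance, and the 3 m range are all hypothetical. The description does not specify how the position in the face direction is computed.

```python
import math
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class Shelf:
    shelf_id: str
    x: float  # floor-plan coordinates in meters (assumed)
    y: float


def shelf_in_face_direction(
    person_xy: Tuple[float, float],
    face_angle_rad: float,
    shelves: Iterable[Shelf],
    max_offset_rad: float = math.radians(30),  # hypothetical tolerance
    max_range_m: float = 3.0,                  # hypothetical range
) -> Optional[Shelf]:
    """Return the nearest shelf whose bearing is within the tolerance of the face direction."""
    px, py = person_xy
    best, best_dist = None, float("inf")
    for shelf in shelves:
        dx, dy = shelf.x - px, shelf.y - py
        dist = math.hypot(dx, dy)
        if dist == 0 or dist > max_range_m:
            continue
        bearing = math.atan2(dy, dx)
        diff = bearing - face_angle_rad
        # Wrap the angular difference into [-pi, pi] before comparing.
        offset = abs(math.atan2(math.sin(diff), math.cos(diff)))
        if offset <= max_offset_rad and dist < best_dist:
            best, best_dist = shelf, dist
    return best


def build_gaze_record(
    tracking_id: int,
    shelf: Shelf,
    attributes: Dict[str, object],
    detected_at: Optional[float] = None,
) -> Dict[str, object]:
    """Assemble the information on the watched object, including attributes looked up by tracking ID."""
    record: Dict[str, object] = {
        "tracking_id": tracking_id,
        "shelf_id": shelf.shelf_id,
        "position": (shelf.x, shelf.y),
        "attributes": dict(attributes),
    }
    if detected_at is not None:
        # The time is optional, mirroring the optional timer in the description.
        record["detected_at"] = detected_at
    return record
```

A record built this way could then be written to a database or sent to a display, as the description outlines. The storage and transport layers are outside the scope of this sketch.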

FIG. 10 shows an example screen 1000 displayed on the display device 400. FIG. 10 shows a situation in which a person is looking at a product in front of a product shelf in the store. Cameras 1 to 4 correspond to the cameras 100 in FIGS. 1 and 2. Screen 1001 shows the video data generated by one of cameras 1 to 4. Screen 1002 shows the positional relationship between cameras 1 to 4 and the person.

Returning to FIG. 4, if it is determined in S6 that the height has not changed (No in S6), the head position detection unit 206 continues, for a predetermined time, to determine whether the position of the person changes by a threshold or more (that is, whether the person moves a predetermined distance or more) (No in S7). If the position of the person does not change for the predetermined time (Yes in S7), the head position detection unit 206 notifies the face detection unit 204 that the predetermined time has elapsed. Processing similar to S8 described above is then performed.
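The S7 dwell check can be sketched as a scan over timestamped positions that succeeds once the person has stayed within a movement threshold for the predetermined time. The function name `stayed_in_place`, the 5 second dwell time, and the 0.5 m movement threshold are assumptions. The description leaves both values unspecified.

```python
import math
from typing import Sequence, Tuple

# (timestamp in seconds, x in meters, y in meters), in chronological order (assumed layout)
Sample = Tuple[float, float, float]


def stayed_in_place(
    samples: Sequence[Sample],
    dwell_s: float = 5.0,           # hypothetical predetermined time
    move_threshold_m: float = 0.5,  # hypothetical predetermined distance
) -> bool:
    """Return True if the person stays within the threshold of the first position for dwell_s seconds."""
    if not samples:
        return False
    t0, x0, y0 = samples[0]
    for t, x, y in samples:
        if math.hypot(x - x0, y - y0) >= move_threshold_m:
            return False  # moved the predetermined distance or more
        if t - t0 >= dwell_s:
            return True   # predetermined time elapsed without such movement
    return False          # not enough observation time yet
```

Returning `False` when the window is still too short corresponds to remaining in the "No in S7" loop until either the time elapses or the person moves.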

Thereafter, as long as the server device 200 can track the person, it determines that the person is still in the store and continues the above processing (No in S9). When the server device 200 can no longer track the person, it determines that the person has left the store and ends the process (Yes in S9).

FIG. 11 shows an example screen 1101, displayed on the display device 400, in which the video data is combined with the information on the object watched by the person. As an example, the screen 1101 displays the person's face data, gender, the time, and the position and identification information of the product shelf in which the person is interested.

As described above, according to the present embodiment, a place where a customer stands still for a predetermined time, and a place where the customer's height changes while the customer is standing still, are identified as the in-store position of a product or product shelf in which that person is interested. This makes it possible to grasp more accurately the products in which customers are interested. In the above embodiment, the information generation unit 207 uses the orientation of the person's face detected by the face detection unit 204 to generate the information on the object watched by the person. However, if that information can be generated from the position of the stationary person, the orientation of the face need not be used. In that case, the information generation unit 207 can generate the information on the object watched by the person based on the position of the stationary person and the determination by the head position detection unit 206 that the person's height has changed.
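The variant that does not use the face orientation can be sketched by taking the shelf nearest to the stationary position and using the head-height change to pick a shelf tier. Everything here is an illustrative assumption: the function name `target_without_face`, the 1.5 m range, and the 20 cm threshold are hypothetical, and only the "lower" case (a lowered head pointing at the lower part of the shelf) is stated in the description. The "upper" and "middle" tiers are extrapolations added for completeness.

```python
import math
from typing import Dict, Optional, Tuple


def target_without_face(
    person_xy: Tuple[float, float],
    shelves: Dict[str, Tuple[float, float]],  # shelf ID -> (x, y) in meters
    head_delta_cm: float,                     # current head height minus baseline
    threshold_cm: float = 20.0,               # hypothetical threshold
    max_range_m: float = 1.5,                 # hypothetical range
) -> Optional[Tuple[str, str]]:
    """Return (shelf_id, tier) for the nearest shelf in range, or None if no shelf is close enough."""
    px, py = person_xy
    nearest_id, nearest_dist = None, max_range_m
    for shelf_id, (sx, sy) in shelves.items():
        dist = math.hypot(sx - px, sy - py)
        if dist <= nearest_dist:
            nearest_id, nearest_dist = shelf_id, dist
    if nearest_id is None:
        return None
    if head_delta_cm <= -threshold_cm:
        tier = "lower"   # stated in the description: lowered head, lower part of the shelf
    elif head_delta_cm >= threshold_cm:
        tier = "upper"   # assumption: raised head (e.g. on a stepladder), upper part
    else:
        tier = "middle"  # assumption: no significant change
    return nearest_id, tier
```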

<Other embodiments>
The present invention can also be realized by processing in which a program that implements one or more functions of the above-described embodiment is supplied to a system or apparatus via a network or a storage medium, and one or more processors in a computer of the system or apparatus read and execute the program. It can also be realized by a circuit (for example, an ASIC) that implements one or more functions.

DESCRIPTION OF SYMBOLS: 100 camera, 200 server device, 300 storage device, 400 display device, 201 communication control unit, 202 video decoding unit, 203 person position detection unit, 204 face detection unit, 205 person attribute estimation unit, 206 head position detection unit, 207 information generation unit

Claims

Person position detecting means for detecting the position of the person in the video imaged by the imaging device;
A head position detecting means for detecting the height of the head of the person;
Generating means for generating information on an object to be watched by the person based on the position of the person and a change in the height of the head of the person;
A video processing apparatus comprising:

Further comprising attribute estimation means for estimating an attribute of the person based on the video,
The video processing apparatus according to claim 1, wherein the generating means includes information on the attribute of the person in the information on the object to be watched by the person.

Further comprising face detection means for detecting the face of the person based on the video,
The video processing apparatus according to claim 2, wherein the attribute estimation means estimates the attribute of the person based on the face detected by the face detection means.

The face detection means detects the orientation of the face based on the detected face,
The video processing apparatus according to claim 3, wherein the generating means generates the information on the object to be watched by the person based on the orientation of the face when the position of the person does not change and on the change in the height of the head of the person.

The video processing apparatus according to claim 4, wherein, when the position of the person does not change for a predetermined time, the generating means includes, in the information on the object to be watched by the person, the position of an object located in the direction of the face as seen from the position of the person.

The video processing apparatus according to claim 4, wherein, when the height of the head of the person changes, the generating means includes, in the information on the object to be watched by the person, the position of an object located in the direction of the face as seen from the position of the person.

The video processing apparatus according to any one of claims 2 to 6, wherein the head position detecting means estimates the attribute of the person based on the height of the head of the person when the attribute of the person is not estimated by the attribute estimation means.

The video processing apparatus according to claim 2, wherein the attribute of the person includes at least one of a sex and an age of the person.

The video processing apparatus according to claim 1, further comprising display means for displaying the video and information related to an object to be watched by the person.

A person position detecting step of detecting the position of the person in the video imaged by the imaging device;
A head position detecting step for detecting the height of the head of the person;
Based on the position of the person and the change in the height of the head of the person, a generating step for generating information on a target to be watched by the person;
A video processing method characterized by comprising:

Imaging means for photographing a person;
Person position detecting means for detecting the position of the person in the video imaged by the imaging means;
A head position detecting means for detecting the height of the head of the person;
Generating means for generating information on an object to be watched by the person based on the position of the person and a change in the height of the head of the person;
Display means for displaying the video and information of a target to be watched by the person;
A video processing system comprising:

A program for causing a computer to function as each means of the video processing apparatus according to any one of claims 1 to 9.