JP7338182B2

JP7338182B2 - Action recognition device, action recognition method and program

Info

Publication number: JP7338182B2
Application number: JP2019051167A
Authority: JP
Inventors: 海克関
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2019-03-19
Filing date: 2019-03-19
Publication date: 2023-09-05
Anticipated expiration: 2039-03-19
Also published as: JP2020154552A

Description

The present invention relates to an action recognition device, an action recognition method, and a program.

In workplaces such as offices and factories, improving production efficiency by visualizing worker behavior and analyzing working time and the like is an important issue. An effective means to this end is to capture video of the workplace with a camera and analyze the resulting video to recognize and analyze specific standard tasks performed by workers (hereinafter referred to as "standard work").

However, visually analyzing workplace video captured by a camera, extracting standard work actions performed in a fixed procedure, measuring the time taken by each action, and visualizing the results requires an enormous amount of analysis time and labor. For this reason, methods for automatically recognizing human behavior have been proposed in which a person is recognized in a captured video, the person's movement trajectory is obtained from the center of gravity of the recognized person, and a specific action is recognized from that trajectory.

When recognizing worker behavior, it is desirable to capture as wide a field of view as possible with a single camera in order to make processing efficient. It is therefore desirable to shoot with a camera equipped with a wide-angle lens that has a wide angle of view. However, images captured by a camera with a wide-angle lens are distorted. When an image is distorted, the shape of a person appearing in it is distorted, so the accuracy of person recognition deteriorates. Recognizing standard work requires tracing the movement of the same person over time, so a drop in person recognition accuracy leads to a drop in standard work recognition accuracy. To prevent this loss of accuracy, it is desirable to correct the image distortion before recognizing the standard work. However, distortion correction takes time and effort, so it has been difficult to recognize standard work both accurately and quickly.

The present invention has been made in view of the above, and its object is to provide an action recognition device, an action recognition method, and a program capable of recognizing a worker's standard work with high accuracy and at high speed.

To solve the above problems and achieve the object, the action recognition device of the present invention is an action recognition device that recognizes, from a captured moving image to be recognized, a specific action of a subject appearing in the moving image, and comprises: a first moving image input unit that inputs moving images in which a plurality of photographing means, each equipped with a wide-angle lens and photographing the same area from different directions, have each captured a subject performing the specific action at a plurality of positions with different distortion within the observation range of the photographing means; a second moving image input unit that inputs the moving images to be recognized, captured by the plurality of photographing means; a region dividing unit that divides the images included in the moving images input by the first moving image input unit and the second moving image input unit into a plurality of regions with different distortion; a dictionary creation unit that creates, from the moving images input by the first moving image input unit, a recognition dictionary for recognizing the specific action of the subject for each photographing means and for each region; a dictionary selection unit that selects, according to the positions of the same subject detected in the regions of the images included in the moving images to be recognized that are input from the different photographing means, the recognition dictionary with the least distortion from among the plurality of recognition dictionaries created by the dictionary creation unit for each photographing means and for each region; and an action recognition unit that recognizes the specific action of the subject based on the recognition dictionary selected by the dictionary selection unit.

According to the present invention, a worker's standard work can be recognized with high accuracy and at high speed.

FIG. 1 is a hardware block diagram showing an example of the hardware configuration of the action recognition system according to the first embodiment.
FIG. 2 is a diagram showing an example of a scene in which the action recognition system according to the first embodiment is used.
FIG. 3 is a hardware block diagram showing an example of the hardware configuration of the fisheye camera.
FIG. 4 is a hardware block diagram showing an example of the hardware configuration of the action recognition device.
FIG. 5 is a diagram explaining the distortion of an image observed with a fisheye lens.
FIG. 6 is a diagram explaining how distortion differs with position in an image observed with a fisheye lens.
FIG. 7 is a diagram showing an example of an image observed by the action recognition system of the first embodiment.
FIG. 8 is a diagram explaining the "walking" action among the specific actions recognized by the action recognition system.
FIG. 9 is an enlarged view of the person in the image of FIG. 7.
FIG. 10 is a diagram explaining the "shelving" action of placing a product on a shelf, among the specific actions recognized by the action recognition system.
FIG. 11 is a diagram showing an example of an enlarged view of a person performing the shelving action.
FIG. 12 is a functional block diagram showing an example of the functional configuration of the action recognition processing unit.
FIG. 13 is a functional block diagram showing an example of the functional configuration of the dictionary creation unit.
FIG. 14 is a functional block diagram showing an example of the functional configuration of the action recognition unit.
FIG. 15 is a diagram showing an example of a moving image input to the moving image input unit.
FIG. 16 is a diagram explaining the feature point detection method.
FIG. 17A is a first diagram showing an example of extracted feature points.
FIG. 17B is a second diagram showing an example of extracted feature points.
FIG. 18 is a diagram explaining the measurement of the duration of a specific action.
FIG. 19 is a flowchart showing an example of the flow of creating a recognition dictionary.
FIG. 20 is a flowchart showing an example of the flow of specific action recognition processing.
FIG. 21 is a flowchart showing an example of the flow of processing for recognizing a plurality of specific actions.
FIG. 22 is a hardware block diagram showing an example of the hardware configuration of the action recognition system according to the second embodiment.
FIG. 23 is a diagram showing an example of a scene in which the action recognition system according to the second embodiment is used.
FIG. 24 is a functional block diagram showing an example of the functional configuration of the action recognition processing unit in the second embodiment.
FIG. 25 is a flowchart showing an example of the flow of specific action recognition processing in the second embodiment.

(First embodiment)
A first embodiment of an action recognition device, an action recognition method, and a program will be described in detail below with reference to the accompanying drawings.

(Description of hardware configuration of action recognition device)
FIG. 1 is a hardware block diagram showing an example of the hardware configuration of an action recognition system 100 according to this embodiment. As shown in FIG. 1, the action recognition system 100 includes a fisheye camera 200 and an action recognition device 300.

The action recognition system 100 recognizes a specific action of a subject photographed by the fisheye camera 200. A specific action is, for example, standard work that is performed repeatedly in a workplace environment, such as "walking" or "shelving a package".

The fisheye camera 200 is a video camera equipped with a fisheye lens capable of observing the full 360° surroundings. The fisheye lens is only an example; the fisheye camera 200 may instead be equipped with a wide-angle lens. The fisheye camera 200 is an example of a photographing means.

The action recognition device 300 analyzes a moving image captured by the fisheye camera 200 to recognize a specific action of a person (subject) appearing in the moving image. Recognizing a specific action of a subject requires a certain number of frames (a sequence of images, i.e. video). As the number of frames increases, the processing load of correcting the distortion of the fisheye camera 200 increases. This embodiment is characterized in that the moving image is analyzed without correcting the distortion.

The action recognition device 300 includes an action recognition processing unit 321 and an interface unit 322 that connects the action recognition processing unit 321 to the fisheye camera 200.

The action recognition processing unit 321 recognizes a specific action of a person (subject). The interface unit 322 converts the moving image captured by the fisheye camera 200 into a data format that the action recognition processing unit 321 can process, and passes it to the action recognition processing unit 321.

Next, a typical scene in which the action recognition system 100 is used will be described with reference to FIG. 2. FIG. 2 is a diagram showing an example of a scene in which the action recognition system 100 according to the first embodiment is used.

As shown in FIG. 2, the action recognition system 100 is installed in a work environment in a workplace such as an office or a factory. The fisheye camera 200 captures a moving image that includes a plurality of people H1 and H2 working in the work environment. Because it is efficient to capture the work environment with a single camera, the fisheye camera 200 is preferably equipped with a wide-angle lens that has a wide angle of view. In this embodiment, the fisheye camera 200 is assumed to be equipped with a fisheye lens having a diagonal angle of view of 180°. The people H1 and H2 are examples of subjects.

(Description of the hardware configuration of the fisheye camera)
First, the hardware configuration of the fisheye camera 200 will be described.

FIG. 3 is a hardware block diagram showing an example of the hardware configuration of the fisheye camera 200. As shown in FIG. 3, the fisheye camera 200 includes a fisheye lens 217 having a diagonal angle of view of 180 degrees or more and a CCD (Charge Coupled Device) 203. The fisheye camera 200a is an example of a photographing means. In the fisheye camera 200, subject light enters the CCD 203 through the fisheye lens 217. The fisheye camera 200 also has a mechanical shutter 202 between the fisheye lens 217 and the CCD 203. The mechanical shutter 202 blocks light incident on the CCD 203. Opening and closing of the mechanical shutter 202 is controlled by a motor driver 206. The lens position of the fisheye lens 217 is also controlled by the motor driver 206, which implements an autofocus function.

The CCD 203 converts the optical image formed on its imaging surface into an electrical signal and outputs analog image data. A CDS (Correlated Double Sampling) circuit 204 removes noise components from the image data output by the CCD 203, and an A/D converter 205 converts it into digital image data (hereinafter simply referred to as image data), which is then output to an image processing circuit 208.

The image processing circuit 208 uses an SDRAM (Synchronous DRAM) 212, which temporarily stores image data, to perform various kinds of image processing such as YCrCb conversion, white balance control, contrast correction, edge enhancement, and color conversion. White balance processing adjusts the color density of the image data, and contrast correction adjusts its contrast. Edge enhancement adjusts the sharpness of the image data, and color conversion adjusts its hue. The image processing circuit 208 also displays the image data that has undergone signal processing and image processing on an LCD (liquid crystal display) 216.

The image data that has undergone signal processing and image processing in the image processing circuit 208 is recorded on a memory card 214 via a compression/decompression circuit 213. In response to instructions obtained from an operation unit 215, the compression/decompression circuit 213 compresses the image data output from the image processing circuit 208 and outputs it to the memory card 214, and decompresses image data read from the memory card 214 and outputs it to the image processing circuit 208.

The fisheye camera 200a includes a CPU (Central Processing Unit) 209 that performs various arithmetic operations according to a program. The CPU 209 is interconnected by a bus line with a ROM (Read Only Memory) 211, a read-only memory that stores programs and the like, and a RAM (Random Access Memory) 210, a readable and writable memory that provides a work area used in various processes, storage areas for various data, and so on.

The timing of the CCD 203, the CDS circuit 204, and the A/D converter 205 is controlled by the CPU 209 via a timing signal generator 207 that generates timing signals. The image processing circuit 208, the compression/decompression circuit 213, and the memory card 214 are also controlled by the CPU 209.

The output of the fisheye camera 200 is input to the interface unit 322, which is the signal processing board of the action recognition device 300 shown in FIG. 1.

(Description of hardware configuration of action recognition device)
Next, the hardware configuration of the action recognition device 300 will be described.

FIG. 4 is a hardware block diagram showing an example of the hardware configuration of the action recognition device 300. As shown in FIG. 4, the action recognition device 300 has a CPU (Central Processing Unit) 301 that controls the overall operation of the action recognition device 300, a ROM (Read Only Memory) 302 that stores the program used to drive the CPU 301, and a RAM (Random Access Memory) 303 used as a work area for the CPU 301. It also has an HD (Hard Disk) 304 that stores various data such as programs, and an HDD (Hard Disk Drive) 305 that controls reading and writing of various data from and to the HD 304 under the control of the CPU 301.

The action recognition device 300 also has a media I/F 307, a display 308, and a network I/F 309. The media I/F 307 controls reading and writing (storage) of data from and to a medium 306 such as a flash memory. The display 308 displays various information such as a cursor, menus, windows, characters, and images. The network I/F 309 performs data communication over a communication network.

The action recognition device 300 further has a keyboard 311, a mouse 312, a CD-ROM (Compact Disk Read Only Memory) drive 314, and a bus line 310. The keyboard 311 has a plurality of keys for entering characters, numerical values, various instructions, and the like. The mouse 312 is used to select and execute various instructions, select an object to be processed, move the cursor, and so on. The CD-ROM drive 314 controls reading and writing of various data from and to a CD-ROM 313, which is an example of a removable recording medium. The bus line 310 is an address bus, a data bus, or the like that electrically connects the above components.

The hardware of the illustrated action recognition device 300 need not be housed in a single enclosure or provided as a single unit. To support cloud computing, the physical configuration of the action recognition device 300 of this embodiment need not be fixed; it may be configured by dynamically connecting and disconnecting hardware resources according to the load.

The program is distributed either stored on a storage medium such as the medium 306 or the CD-ROM 313 in an executable or compressed format, or from a server that distributes the program.

The program executed by the action recognition device 300 of this embodiment has a modular structure that includes the functions described below. When the CPU 301 of the action recognition device 300 reads the program from a storage medium such as the ROM 302 or the HD 304 and executes it, each module is loaded onto the RAM 303 and performs its function.

(Description of distortion that occurs with a fisheye camera)
Next, the distortion that occurs in images captured by the fisheye camera 200 will be described with reference to FIGS. 5 and 6. FIG. 5 is a diagram explaining the distortion of an image observed with a fisheye lens. FIG. 6 is a diagram explaining how distortion differs with position in an image observed with a fisheye lens.

Image I in FIG. 5 is an example of the image observed when a camera fitted with a standard or telephoto lens photographs a target on which a grid of regular vertical and horizontal straight lines is drawn. As shown in FIG. 5, a grid of straight vertical and horizontal lines is observed in image I. The ratio of the vertical to horizontal line lengths of each square is almost the same regardless of position in image I. In other words, very little distortion occurs in image I.

In contrast, image J is an example of the image observed when the fisheye camera 200 of this embodiment photographs the same target. The fisheye camera 200 generates images by the so-called equidistant projection method, in which the distance from the image center is proportional to the direction (angle) of the observed object. The magnitude of the distortion therefore differs between the central part of image J and its periphery.
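The equidistant projection mentioned above can be illustrated with a short sketch. It assumes an ideal lens with a single scale factor `focal_px` and an image center `(cx, cy)`; these names and the camera coordinate convention (Z along the optical axis) are illustrative assumptions, not values taken from this document.

```python
import math

def equidistant_radius(theta_rad, focal_px):
    """Distance from the image center for a ray at angle theta from the
    optical axis under ideal equidistant projection (r = f * theta)."""
    return focal_px * theta_rad

def project_equidistant(point, focal_px, cx, cy):
    """Project a 3D point (X, Y, Z), with Z along the optical axis, to pixel
    coordinates. focal_px, cx and cy are hypothetical calibration values."""
    x, y, z = point
    theta = math.atan2(math.hypot(x, y), z)  # angle from the optical axis
    phi = math.atan2(y, x)                   # direction around the axis
    r = equidistant_radius(theta, focal_px)
    return cx + r * math.cos(phi), cy + r * math.sin(phi)
```

Because the radius grows linearly with the angle rather than with its tangent, straight lines away from the center are drawn as curves, which is the position-dependent distortion discussed in this section.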

Specifically, in image J observed when the target is photographed, the vertical lines are observed as arcs passing through points C1 and C2 on the circumference of the circle circumscribing image J. The horizontal lines of the target are observed as arcs passing through points C3 and C4 on the circumference of that circle.

That is, near the center of image J, the vertical and horizontal lines of the target are observed as nearly straight, and the ratio of the vertical to horizontal line lengths of each square is almost equal. In other words, little distortion occurs there. In the periphery of image J, on the other hand, both the vertical and horizontal lines of the target are observed as curves, and the ratio of vertical to horizontal line length varies from square to square. Thus, in image J, the greater the distance from the image center, the greater the distortion. The direction of the distortion is point-symmetric about the center of image J.

Consequently, in image J, the magnitude and direction of the motion observed when a person performs the same action differ depending on where the person is observed. To recognize a person's action, it is therefore sufficient to prepare a recognition dictionary for each location in image J and perform action recognition using the recognition dictionary that corresponds to the position where the person is observed.

Specifically, as shown in FIG. 6, a plurality of regions R1, R2, R3, and R4 are set from the center of image J toward its periphery, and a recognition dictionary is created for each of the regions R1, R2, R3, and R4. In this case, image distortion is smallest in region R1 and largest in region R4. The regions R1, R2, R3, and R4 are only an example of the regions to be set, and the number of regions is not limited to four. In this way, the action recognition system 100 of this embodiment sets similar regions at a plurality of different positions in image J and creates a recognition dictionary in each region. The action recognition system 100 then performs action recognition using the recognition dictionary created at the position closest to the position of the person detected in the captured moving image.
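As a minimal sketch of position-dependent dictionary selection, the following assumes that the regions R1 to R4 are concentric rings of equal width around the image center. That geometry, the `max_radius` parameter, and the dictionary placeholders are assumptions made for illustration; the document itself only states that the regions run from the center toward the periphery.

```python
import math

def region_index(x, y, cx, cy, max_radius, num_regions=4):
    """Return 1 for the innermost region (R1) up to num_regions for the
    outermost, based on the distance of (x, y) from the center (cx, cy)."""
    r = math.hypot(x - cx, y - cy)
    idx = int(r / max_radius * num_regions)
    # Clamp positions on or beyond max_radius into the outermost region.
    return min(idx, num_regions - 1) + 1

def select_dictionary(dictionaries, x, y, cx, cy, max_radius):
    """Pick the recognition dictionary for the region containing the
    detected person. dictionaries maps region index -> dictionary object."""
    return dictionaries[region_index(x, y, cx, cy, max_radius, len(dictionaries))]
```

For example, with `dictionaries = {1: "dict_R1", 2: "dict_R2", 3: "dict_R3", 4: "dict_R4"}`, a person detected 150 pixels from the center of an image with `max_radius=400` would be matched against `"dict_R2"`.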

Even when a camera equipped with a wide-angle or ultra-wide-angle lens is used instead of the fisheye camera 200, larger distortion occurs at the periphery of the image than at its center, just as with a fisheye lens. The method of performing action recognition using a recognition dictionary that depends on the position in the image is therefore effective in that case as well.

(Description of the image actually observed)
Examples of images observed by the action recognition system 100 will be described with reference to FIGS. 7 to 11. FIG. 7 is a diagram showing an example of an image observed by the action recognition system 100 of the first embodiment.

FIG. 7 shows an example of a specific action of a worker in a workplace. In particular, image J1 in FIG. 7 shows an example of the specific action "walking". The "walking" action occurs when a worker who performs multiple specific actions moves from one specific action to another. In general, the more time the "walking" action takes, the lower the work efficiency. The action recognition system 100 recognizes walking as one of the specific actions.

FIG. 8 is a diagram explaining the "walking" action among the specific actions recognized by the action recognition system 100. As shown in FIG. 8, the fisheye camera 200 photographs a walking person H1 in time series. This yields a moving image (image sequence) of the walking person H1.

FIG. 9 is an enlarged view j1 of the person in image J1 of FIG. 7. The action recognition system 100 recognizes a specific action by observing how the region shown in FIG. 9 changes over time.

FIG. 10 is a diagram explaining the "shelving" action of placing a product on a shelf, among the specific actions recognized by the action recognition system 100. As shown in FIG. 10, the fisheye camera 200 photographs a person H1 performing shelving in time series. This yields a moving image (image sequence) of the person H1 performing shelving.

FIG. 11 is a diagram showing an example of an enlarged view j2 of a person performing the shelving action. The action recognition system 100 recognizes a specific action by observing how the region shown in FIG. 11 changes over time.

(Description of the functional configuration of the action recognition processing unit)
Next, the functional configuration of the action recognition processing unit 321 will be described with reference to FIG. 12. FIG. 12 is a functional block diagram showing an example of the action recognition processing unit 321 according to this embodiment. As shown in FIG. 12, the action recognition processing unit 321 includes a moving image input unit 331, a region dividing unit 332, a dictionary creation unit 333, a dictionary selection unit 334, an action recognition unit 335, and a duration measurement unit 336.

The moving image input unit 331 inputs the moving image captured by the fisheye camera 200 via the interface unit 322 (FIG. 1).

The region dividing unit 332 divides the images included in the moving image captured by the fisheye camera 200 into a plurality of regions with different distortion.

The dictionary creation unit 333 creates, for each of the divided regions, a different recognition dictionary for recognizing a specific action of a person.

The dictionary selection unit 334 selects, from among the different recognition dictionaries created by the dictionary creation unit 333, the recognition dictionary to be used for recognizing the specific action of a person detected in the moving image.

The action recognition unit 335 recognizes the specific action of the person based on the recognition dictionary selected by the dictionary selection unit 334.

The duration measurement unit 336 measures the duration of the specific action based on the recognition result for that action.

（辞書作成部の機能構成の説明） (Description of the functional configuration of the dictionary creation unit)
次に、図１３を用いて、辞書作成部３３３の機能構成を説明する。図１３は、本実施形態に係る辞書作成部３３３の概略構成の一例を示す機能ブロック図である。図１３に示すように、辞書作成部３３３は、特徴点抽出部３３３ａと、特徴点分類部３３３ｂと、特徴ベクトル算出部３３３ｃと、ヒストグラム作成部３３３ｄと、認識辞書作成部３３３ｅとを備える。 Next, the functional configuration of the dictionary creation unit 333 will be described with reference to FIG. 13. FIG. 13 is a functional block diagram showing an example of a schematic configuration of the dictionary creation unit 333 according to this embodiment. As shown in FIG. 13, the dictionary creation unit 333 includes a feature point extraction unit 333a, a feature point classification unit 333b, a feature vector calculation unit 333c, a histogram creation unit 333d, and a recognition dictionary creation unit 333e.

なお、魚眼カメラ２００で撮影した動画は歪を有しているが、歪の補正は行わず、辞書作成部３３３は、撮影された動画が含む画像の複数の位置に対応する認識辞書を作成する。すなわち、辞書作成部３３３は、歪が大きい領域では歪が大きい状態で特定行動（標準作業）を認識する認識辞書を作成する。また、辞書作成部３３３は、歪が小さい領域では歪が小さい状態で特定行動を認識する認識辞書を作成する。したがって、認識辞書を作成する際には、被験者は、画像の中の様々な位置で標準作業を行う。 Although the moving image captured by the fisheye camera 200 is distorted, the distortion is not corrected. Instead, the dictionary creation unit 333 creates recognition dictionaries corresponding to a plurality of positions in the images contained in the captured moving image. That is, for a region with large distortion, the dictionary creation unit 333 creates a recognition dictionary that recognizes the specific action (standard work) in its highly distorted state. Likewise, for a region with small distortion, it creates a recognition dictionary that recognizes the specific action in its slightly distorted state. Therefore, when the recognition dictionaries are created, subjects perform the standard work at various positions in the image.

特徴点抽出部３３３ａは、魚眼カメラ２００で撮影された動画に含まれる複数の画像の中から、特定行動（標準作業）に伴って発生する特徴点を抽出する。より具体的には、特徴点抽出部３３３ａは、入力された動画から画像フレームをＴ枚ずつ切り出し、切り出されたＴ枚の画像フレームに対して、時空間における特徴点（時空間特徴点ともいう）を抽出する。特徴点とは、入力された動画を、空間方向２軸と時間方向１軸とからなる３次元の所定サイズのブロックに分割した際に、当該ブロック内における画像の平均的な明るさが所定値を超えるブロックである。なお、特徴点抽出部３３３ａは、精度の高い学習データを生成するために、複数の動画から特徴点の抽出を行う。 The feature point extraction unit 333a extracts feature points that arise from a specific action (standard work) from among a plurality of images included in the moving image captured by the fisheye camera 200. More specifically, the feature point extraction unit 333a cuts out image frames from the input moving image T frames at a time and extracts feature points in space-time (also called spatio-temporal feature points) from each set of T frames. A feature point is a block in which the average brightness of the image exceeds a predetermined value when the input moving image is divided into three-dimensional blocks of a predetermined size spanning two spatial axes and one temporal axis. Note that the feature point extraction unit 333a extracts feature points from a plurality of moving images in order to generate highly accurate learning data.

なお、特徴点抽出部３３３ａは、魚眼カメラ２００で撮影された動画に含まれる画像の中から、公知の人検出アルゴリズムを用いて人を検出して、検出された人の領域のみに対して、前記した特徴点抽出を行うようにしてもよい。これによると、特徴点を抽出する領域を限定することができるため、処理をより一層効率的に行うことができる。 Note that the feature point extraction unit 333a may detect a person in the images included in the moving image captured by the fisheye camera 200 using a known person detection algorithm and perform the feature point extraction described above only on the region of the detected person. This limits the region from which feature points are extracted, so the processing can be performed even more efficiently.

特徴点分類部３３３ｂは、特徴点抽出部３３３ａが抽出した特徴点を表すＭ×Ｎ×Ｔ×３次元のベクトルを、例えば、公知のＫ平均法（Ｋ－ｍｅａｎｓ法）で分類（クラスタリング）する。分類するクラスの数をＫ種類とすると、特徴点分類部３３３ｂは、学習用の動画から抽出した特徴点をＫ種類に分類する。 The feature point classification unit 333b classifies (clusters) the M×N×T×3-dimensional vector representing the feature points extracted by the feature point extraction unit 333a, for example, by a known K-means method. . Assuming that the number of classes to be classified is K types, the feature point classification unit 333b classifies the feature points extracted from the learning video into K types.

特徴ベクトル算出部３３３ｃは、特徴点分類部３３３ｂが分類したＫ種類の特徴点のうち、同じ種類の特徴点におけるＭ×Ｎ×Ｔ×３次元のベクトルを平均化して、Ｋ個の平均ベクトルＶｋを求める。特徴ベクトル算出部３３３ｃが算出した平均ベクトルＶｋは、それぞれ、Ｋ種類の特徴点を代表するベクトルである。なお、平均ベクトルＶｋは、学習ベクトルの一例である。 The feature vector calculation unit 333c averages the M×N×T×3-dimensional vectors of the feature points of the same type, among the K types of feature points classified by the feature point classification unit 333b, to obtain K average vectors Vk. Each average vector Vk calculated by the feature vector calculation unit 333c represents one of the K types of feature points. Note that the average vector Vk is an example of a learning vector.
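
The classification and averaging steps can be illustrated with a minimal Python sketch. It assumes each feature point is already a flat list of numbers (M×N×T×3 values in the patent, two values in a toy example). The function names, the fixed iteration count, and the choice of initial centers at evenly spaced indices are illustrative assumptions, not details taken from the patent.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


def kmeans(descriptors, k, iterations=20):
    """Classify descriptors into k classes; return (labels, mean vectors Vk).

    Initial centers are picked at evenly spaced indices, which is an
    arbitrary choice made for this sketch.
    """
    n = len(descriptors)
    centers = [list(descriptors[i * n // k]) for i in range(k)]
    labels = [0] * n
    for _ in range(iterations):
        # Assignment step: each descriptor joins its nearest center.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centers[c]))
            for v in descriptors
        ]
        # Update step: each center becomes the mean vector of its class.
        for c in range(k):
            members = [v for v, label in zip(descriptors, labels) if label == c]
            if members:
                centers[c] = mean_vector(members)
    return labels, centers
```

With this sketch, `centers[c]` plays the role of the average vector Vk for class c, and `labels` records which of the K types each learning feature point was assigned to.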

特定行動を観測した動画から得られる特徴ベクトルは、同じ特定行動の学習データで得られた平均ベクトルＶｋの近く分布する。行動認識部３３５は、この特性を利用して、魚眼カメラ２００で撮影した歪を有する動画から、歪を補正しない状態でも高精度な行動認識を行うことができる。すなわち、作業者が直線的に移動した際に、撮影された動画の歪が大きい領域では、人の動きが曲線的になる。しかし、辞書作成部３３３が作成した認識辞書は、人の動きが曲線状になるものとして学習されるため、歪を補正することなく、特定行動を認識することができる。同様に、歪の小さい領域では、作業者の直線的な動きが、直線的な動きとして学習されるため、歪を補正することなく、特定行動を認識することができる。 Feature vectors obtained from videos in which a specific action is observed are distributed near the average vector Vk obtained from learning data of the same specific action. Using this characteristic, the action recognition unit 335 can perform highly accurate action recognition from a distorted moving image captured by the fisheye camera 200 even without correcting the distortion. That is, when the worker moves linearly, the movement of the person becomes curvilinear in a region where the distortion of the captured moving image is large. However, since the recognition dictionary created by the dictionary creating unit 333 is learned assuming that human movements are curvilinear, it is possible to recognize the specific action without correcting the distortion. Similarly, in areas where distortion is small, the linear movements of the worker are learned as linear movements, so specific actions can be recognized without correcting the distortion.

ヒストグラム作成部３３３ｄは、平均ベクトルＶｋの出現頻度を表す学習ヒストグラムＨ（ｋ）を作成する。具体的には、Ｋ種類の特徴点について、各特徴点グループのブロック合計数を計算し、学習ヒストグラムＨ（ｋ）を作成する。学習ヒストグラムＨ（ｋ）は、特徴点ｋグループの頻度を示す。なお、ヒストグラム作成部３３３ｄは、学習ヒストグラム作成部の一例である。 The histogram creation unit 333d creates a learning histogram H(k) representing the appearance frequency of the average vectors Vk. Specifically, for the K types of feature points, it counts the total number of blocks in each feature point group to create the learning histogram H(k). The learning histogram H(k) indicates the frequency of feature point group k. Note that the histogram creation unit 333d is an example of a learning histogram creation unit.

認識辞書作成部３３３ｅは、Ｎ個の行動認識対象領域において、各領域の学習データから求めた学習ヒストグラムＨ（ｋ）により各領域の特定行動を認識する辞書を作成する。認識辞書作成部３３３ｅは、ＳＶＭ（Support Vector Machine）の機械学習方法で、認識辞書を作成する。なお、認識辞書作成部３３３ｅは、ＳＶＭの機械学習方法で認識辞書を作成する際に、認識対象となる特定行動を含む正の学習データ（プラス学習データ）と、前記認識対象となる特定行動を含まない負の学習データ（マイナス学習データ）とを用意して認識辞書を作成してもよい。すなわち、認識辞書作成部３３３ｅは、正の学習データを正しいデータであるとして受け入れて、負の学習データを異なるデータであるとして除外する認識辞書を作成する。これによって、特定行動と間違いやすい行動を負の学習データとして学習させることができるため、特定行動の認識率を向上させることができる。なお、認識辞書を作成するとき、ＳＶＭ機械学習方法以外に他の機械学習方法を使ってもよい。例えば、ＫＮＮ（K Nearest Neighbor）や、ＭＬＰ（Multilayer perceptron）などの機械学習方法を使ってもよい。 The recognition dictionary creation unit 333e creates, for the N action recognition target regions, a dictionary that recognizes the specific action in each region from the learning histogram H(k) obtained from that region's learning data. The recognition dictionary creation unit 333e creates the recognition dictionary using the SVM (Support Vector Machine) machine learning method. When doing so, it may prepare positive learning data (plus learning data), which contains the specific action to be recognized, and negative learning data (minus learning data), which does not contain it. That is, the recognition dictionary creation unit 333e creates a recognition dictionary that accepts positive learning data as correct data and rejects negative learning data as different data. Because actions that are easily mistaken for the specific action can thus be learned as negative learning data, the recognition rate of the specific action can be improved. Machine learning methods other than SVM may also be used to create the recognition dictionary, for example KNN (K Nearest Neighbor) or MLP (Multilayer Perceptron).
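
As a rough illustration of how positive and negative learning histograms could form a recognition dictionary, the following sketch uses the KNN alternative mentioned above rather than SVM, because it needs no external library. The L1 distance between histograms, the majority vote, and all names are assumptions made for this example. The patent does not specify them.

```python
def l1_distance(h1, h2):
    """Sum of absolute bin differences between two histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def knn_classify(histogram, training_set, k=3):
    """Return True if the histogram is judged to show the target action.

    training_set is a list of (learning_histogram, is_target_action) pairs:
    positive learning data carries True, negative learning data carries False.
    """
    # Sort the learning histograms by distance and keep the k nearest.
    neighbours = sorted(
        training_set, key=lambda item: l1_distance(histogram, item[0])
    )[:k]
    votes = sum(1 for _, is_target in neighbours if is_target)
    # Majority vote among the nearest learning histograms.
    return votes * 2 > len(neighbours)
```

Here the "dictionary" for one region is simply the list of labelled learning histograms for that region. An SVM would instead store a learned decision boundary.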

（行動認識部の機能構成の説明） (Description of the functional configuration of the action recognition unit)
次に、図１４を用いて、行動認識部３３５の機能構成を説明する。図１４は、本実施形態に係る行動認識部３３５の概略構成の一例を示す機能ブロック図である。図１４に示すように、行動認識部３３５は、特徴点抽出部３３５ａと、特徴ベクトル算出部３３５ｂと、ヒストグラム作成部３３５ｃと、行動認識部３３５ｄとを備える。特徴点抽出部３３５ａは、辞書作成部３３３が備える特徴点抽出部３３３ａと同じ機能を備える。 Next, the functional configuration of the action recognition unit 335 will be described with reference to FIG. 14. FIG. 14 is a functional block diagram showing an example of a schematic configuration of the action recognition unit 335 according to this embodiment. As shown in FIG. 14, the action recognition unit 335 includes a feature point extraction unit 335a, a feature vector calculation unit 335b, a histogram creation unit 335c, and an action recognition unit 335d. The feature point extraction unit 335a has the same function as the feature point extraction unit 333a included in the dictionary creation unit 333.

特徴ベクトル算出部３３５ｂは、特徴点抽出部３３５ａが抽出した特徴点における時空間エッジ情報（微分ベクトル）を求める。時空間エッジ情報について、詳しくは後述する。 The feature vector calculation unit 335b obtains spatio-temporal edge information (differential vector) at the feature points extracted by the feature point extraction unit 335a. Details of the spatio-temporal edge information will be described later.

ヒストグラム作成部３３５ｃは、時空間エッジ情報の出現頻度を表す特定行動ヒストグラムＴ（ｋ）を作成する。 The histogram creation unit 335c creates a specific action histogram T(k) representing the appearance frequency of spatio-temporal edge information.

行動認識部３３５ｄは、動画から得られる微分ベクトルに基づいてヒストグラム作成部３３５ｃが作成した特定行動ヒストグラムＴ（ｋ）と、認識辞書が記憶している学習ヒストグラムＨ（ｋ）とを比較することによって、特定行動を認識する。認識対象となる特徴点の分布は、認識辞書における特徴点の分布と近い。すなわち、特定行動を行っている認識対象の画像から得た特定行動ヒストグラムＴ（ｋ）と、同じ特定行動の学習ヒストグラムＨ（ｋ）とは類似しているため、画像の歪み補正を行うことなく、特定行動を認識することが可能である。 The action recognition unit 335d recognizes the specific action by comparing the specific action histogram T(k), which the histogram creation unit 335c creates from the differential vectors obtained from the moving image, with the learning histogram H(k) stored in the recognition dictionary. The distribution of the feature points to be recognized is close to the distribution of feature points in the recognition dictionary. That is, the specific action histogram T(k) obtained from images of a subject performing a specific action is similar to the learning histogram H(k) of the same specific action, so the specific action can be recognized without correcting the image distortion.

（行動認識システムが観測する画像の説明） (Description of images observed by the action recognition system)
次に、図１５，図１６を用いて、行動認識システム１００が観測する画像の例を説明する。図１５は、動画入力部３３１に入力される動画（画像列）の一例を示す図である。図１５に示す各画像（フレーム）は、魚眼カメラ２００で撮影した画像であり、歪を補正していない画像である。撮影された画像の横軸ｘ、縦軸ｙは空間座標である。そして、画像フレームＦ１，Ｆ２の時間軸はｔで示す。つまり、入力された画像は、座標（ｘ，ｙ，ｔ）における時空間データになる。時空間の一つの座標における画素値は、空間座標（ｘ，ｙ）と時刻ｔの関数である。前述した職場における特定行動を認識する際に、人が移動すると、図１５に示す時空間データに変化点が発生する。行動認識システム１００は、この変化点、すなわち時空間の特徴点を見つけることで、特定行動を認識する。 Next, examples of images observed by the action recognition system 100 will be described with reference to FIGS. 15 and 16. FIG. 15 is a diagram showing an example of a moving image (image sequence) input to the moving image input unit 331. Each image (frame) shown in FIG. 15 is captured by the fisheye camera 200 and has not been corrected for distortion. The horizontal axis x and vertical axis y of a captured image are spatial coordinates, and the time axis of the image frames F1 and F2 is denoted t. That is, the input images form spatio-temporal data at coordinates (x, y, t), and the pixel value at one spatio-temporal coordinate is a function of the spatial coordinates (x, y) and the time t. In recognizing the specific actions in the workplace described above, when a person moves, change points occur in the spatio-temporal data shown in FIG. 15. The action recognition system 100 recognizes the specific action by finding these change points, that is, the spatio-temporal feature points.

次に、本実施形態における特徴点の抽出方法を説明する。図１６に示すように、時空間画像データをブロックに分割する。図１６の大きい立方体は時空間画像データを示す。横軸ｘと縦軸ｙとは空間座標を表す。それぞれの単位は画素である。また時間軸をｔで示す。例えば、動画を３０フレーム／秒のビデオレートで入力し、時系列画像を入力する。このビデオレートで換算することによって、画像が撮影された実際の時間を求めることができる。図１６の時空間画像データを、サイズ（Ｍ，Ｎ，Ｔ)のブロックで分割する。１ブロックのサイズは横Ｍ画素、縦Ｎ画素、Ｔフレームになる。図１６の１つのマス目が１つのブロックを示す。人がある行動を行ったとき、時空間データにおいて動きが発生したブロックでは、当該ブロックの特徴量が大きくなる。すなわち、時空間に大きな変化量が発生する。 Next, the feature point extraction method of this embodiment will be described. As shown in FIG. 16, the spatio-temporal image data is divided into blocks. The large cube in FIG. 16 represents the spatio-temporal image data. The horizontal axis x and the vertical axis y represent spatial coordinates, each in units of pixels, and the time axis is denoted t. For example, a moving image is input as a time series of images at a video rate of 30 frames/second. Converting with this video rate gives the actual time at which each image was captured. The spatio-temporal image data of FIG. 16 is divided into blocks of size (M, N, T), so that one block is M pixels wide, N pixels high, and T frames long. Each cell in FIG. 16 represents one block. When a person performs an action, the feature amount of a block in which motion occurs in the spatio-temporal data becomes large. That is, a large amount of change occurs in space-time.

次に、変化量の大きいブロックを特徴点として抽出する方法を説明する。時空間の画像データから特徴点を抽出するため、まず、空間方向、すなわち（ｘ，ｙ）方向でノイズを除去するために平滑化処理を行う。平滑化処理は、式（１）で行われる。 Next, a method of extracting a block with a large amount of change as a feature point will be described. In order to extract feature points from spatio-temporal image data, first, smoothing processing is performed to remove noise in the spatial direction, that is, the (x, y) direction. The smoothing process is performed by Equation (1).

ここで、Ｉ（ｘ，ｙ，ｔ）は、時刻ｔのフレームにおける（ｘ，ｙ）座標の画素値である。また、ｇ（ｘ，ｙ）は、平滑化処理のためのカーネルである。また、＊は畳み込み処理を示す演算子である。平滑化処理は、単純に画素値の平均化処理としてもよいし、既存のGaussian平滑化フィルタ処理を行ってもよい。 Here, I(x, y, t) is the pixel value of the (x, y) coordinates in the frame at time t. Also, g(x, y) is a kernel for smoothing processing. Also, * is an operator indicating convolution processing. The smoothing process may simply be an averaging process of pixel values, or an existing Gaussian smoothing filter process may be performed.
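
Equation (1) itself is not reproduced in this text. Based on the definitions above, it presumably takes the following form, where L(x, y, t) is a symbol assumed here for the smoothed image:

```latex
L(x, y, t) = I(x, y, t) * g(x, y) \tag{1}
```

The kernel g depends only on (x, y), so each frame is smoothed independently of its neighbours in time.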

次に時間軸でフィルタリング処理を行う。ここでは、式（２）に示すＧａｂｏｒフィルタリング処理を行う。Ｇａｂｏｒフィルタは指向性フィルタであり、フィルタを作用させる領域に存在する平行で等間隔な線を強調して、線の間に存在するノイズを除去する作用を有する。式（２）におけるｇ_ｅｖとｇ_ｏｄとは、それぞれ、式（３）と式（４）が示すＧａｂｏｒフィルタのカーネルである。また、＊は畳み込み処理を示す演算子である。さらに、τとωは、Ｇａｂｏｒフィルタにおけるカーネルのパラメータである。 Next, filtering is performed along the time axis. Here, the Gabor filtering shown in Equation (2) is performed. The Gabor filter is a directional filter that emphasizes parallel, equally spaced lines in the region to which it is applied and removes the noise between those lines. In Equation (2), g_ev and g_od are the Gabor filter kernels given by Equations (3) and (4), respectively, and * is the convolution operator. Further, τ and ω are parameters of the Gabor filter kernels.
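
Equations (2) to (4) are not reproduced in this text. A commonly used form of temporal Gabor filtering with an even and an odd kernel, consistent with the symbols named above, is sketched below. It is an assumption that the patent's equations match this form exactly. L(x, y, t) denotes the spatially smoothed image produced by Equation (1), a symbol introduced here for illustration.

```latex
\begin{align}
R(x, y, t) &= \bigl(L(x, y, t) * g_{ev}\bigr)^{2} + \bigl(L(x, y, t) * g_{od}\bigr)^{2} \tag{2}\\
g_{ev}(t; \tau, \omega) &= -\cos(2\pi t \omega)\, e^{-t^{2}/\tau^{2}} \tag{3}\\
g_{od}(t; \tau, \omega) &= -\sin(2\pi t \omega)\, e^{-t^{2}/\tau^{2}} \tag{4}
\end{align}
```

In this form the convolutions run along t only, and the sum of squares makes the response R large wherever the pixel value varies over time at a rate near ω, regardless of phase.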

図１５に示す時空間画像の全画素に対して、上記式（２）に示すフィルタリング処理を行った後、図１６に示す分割ブロック内のＲ（ｘ，ｙ，ｔ）の平均値を求める。式（５）で、時空間座標（ｘ，ｙ，ｔ）のブロックの平均値を求める。 After performing the filtering process shown in the above formula (2) on all pixels of the spatio-temporal image shown in FIG. 15, the average value of R(x, y, t) in the divided blocks shown in FIG. 16 is obtained. The average value of the block of the spatio-temporal coordinates (x, y, t) is obtained by Equation (5).

式（６）に示すように、ブロック内の平均値Ｍ（ｘ，ｙ，ｔ）が所定の閾値Ｔｈｒｅ＿Ｍより大きい場合、このブロックを特徴点とする。 As shown in Equation (6), if the average value M(x, y, t) in the block is greater than a predetermined threshold Thre_M, this block is taken as a feature point.
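
The block averaging of Equation (5) and the threshold test of Equation (6) can be sketched in Python as follows. The sketch assumes the filter response R is available as a nested list indexed `response[t][y][x]` and that blocks do not overlap. The function names and the block index convention `(bx, by, bt)` are illustrative assumptions.

```python
def block_means(response, m, n, t):
    """Mean of the filter response R inside each M x N x T block.

    Returns a dict mapping a block index (bx, by, bt) to the block mean.
    Pixels that do not fill a whole block at the borders are ignored.
    """
    frames = len(response)
    height = len(response[0])
    width = len(response[0][0])
    means = {}
    for bt in range(frames // t):
        for by in range(height // n):
            for bx in range(width // m):
                total = 0.0
                for dt in range(t):
                    for dy in range(n):
                        for dx in range(m):
                            total += response[bt * t + dt][by * n + dy][bx * m + dx]
                means[(bx, by, bt)] = total / (m * n * t)
    return means


def extract_feature_points(response, m, n, t, threshold):
    """Blocks whose mean response exceeds the threshold (Thre_M in the text)."""
    means = block_means(response, m, n, t)
    return sorted(index for index, mean in means.items() if mean > threshold)
```

For example, a response volume that is zero on the left half and one on the right half, split into two blocks, yields only the right-hand block as a feature point for a threshold of 0.5.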

（特徴点の記述方法の説明） (Description of how feature points are described)
次に、図１７Ａ，図１７Ｂを用いて、特徴点の記述方法を説明する。図１７Ａは、動画から抽出した特徴点の一例を示す第１の図である。図１７Ｂは、動画から抽出した特徴点の一例を示す第２の図である。すなわち、図１７Ａは、図１１に示した棚入れを行っている人の画像から抽出した、時刻ｔにおける特徴点の一例を示す画像ｋ１である。図１７Ａに示すように、動きのある部分に特徴点が抽出される。図１７Ｂは、同様に時刻ｔ＋Δｔにおいて抽出された特徴点の一例を示す画像ｋ２である。 Next, the method of describing feature points will be explained with reference to FIGS. 17A and 17B. FIG. 17A is a first diagram showing an example of feature points extracted from a moving image. FIG. 17B is a second diagram showing an example of feature points extracted from a moving image. Specifically, FIG. 17A is an image k1 showing an example of the feature points at time t extracted from the image of the person who is shelving shown in FIG. 11. As shown in FIG. 17A, feature points are extracted in the parts where there is motion. FIG. 17B is an image k2 showing an example of the feature points extracted in the same way at time t+Δt.

図１７Ａに示す特徴点が抽出されたら、当該特徴点が属するブロック内の画素の時空間エッジ情報を求める。すなわち、式（７）に示す微分演算を行うことによって、画素のエッジ情報Ｅ（ｘ，ｙ，ｔ）（微分ベクトル）を求める。 After the feature points shown in FIG. 17A are extracted, the spatio-temporal edge information of the pixels in the block to which the feature points belong is obtained. That is, the edge information E(x, y, t) (differential vector) of the pixel is obtained by performing the differential operation shown in Equation (7).

１ブロックはＭ×Ｎ×Ｔの画素を含むため、式（７）によってＭ×Ｎ×Ｔ×３の微分値が得られる。すなわち、特徴点を含むブロックを、Ｍ×Ｎ×Ｔ×３個の微分値のベクトルで記述することができる。つまり、特徴点をＭ×Ｎ×Ｔ×３次元のベクトルで記述することができる。そして、図１７Ｂの画像ｋ２についても、同様にしてエッジ情報Ｅ（ｘ，ｙ，ｔ）を求める。 Since one block contains M×N×T pixels, Equation (7) yields M×N×T×3 differential values. That is, a block containing a feature point can be described by a vector of M×N×T×3 differential values. In other words, a feature point can be described by an M×N×T×3-dimensional vector. Edge information E(x, y, t) is obtained in the same way for image k2 in FIG. 17B.
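
To make the dimension count concrete, the following sketch builds the descriptor of one block from finite differences along x, y, and t. Equation (7) is not reproduced in this text, so the use of central differences, the clamping of coordinates at the volume border, and the name `block_descriptor` are all assumptions made for illustration.

```python
def block_descriptor(volume, x0, y0, t0, m, n, t):
    """Flat list of M*N*T*3 gradient values for the block starting at (x0, y0, t0).

    volume is indexed volume[frame][y][x]. Each pixel contributes its
    differences along x, y, and t, in that order.
    """
    frames = len(volume)
    height = len(volume[0])
    width = len(volume[0][0])

    def value(x, y, f):
        # Clamp coordinates so that border pixels reuse the nearest valid pixel.
        x = min(max(x, 0), width - 1)
        y = min(max(y, 0), height - 1)
        f = min(max(f, 0), frames - 1)
        return volume[f][y][x]

    descriptor = []
    for f in range(t0, t0 + t):
        for y in range(y0, y0 + n):
            for x in range(x0, x0 + m):
                dx = (value(x + 1, y, f) - value(x - 1, y, f)) / 2.0
                dy = (value(x, y + 1, f) - value(x, y - 1, f)) / 2.0
                dt = (value(x, y, f + 1) - value(x, y, f - 1)) / 2.0
                descriptor.extend((dx, dy, dt))
    return descriptor
```

A 2×2×2 block therefore produces 2·2·2·3 = 24 values, which matches the M×N×T×3 count stated above.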

なお、辞書作成部３３３は、学習により、特定行動を認識する認識辞書を作成するとき、画像の中の歪が異なる複数の異なる位置にそれぞれ対応する認識辞書を作成する。 When creating a recognition dictionary for recognizing a specific action through learning, the dictionary creation unit 333 creates recognition dictionaries corresponding to a plurality of different positions with different distortions in an image.

ここで、画像ｋ１から抽出された複数の特徴点のうち、近接した特徴点は、一人の人の行動に伴って発生する特徴点であると考えられる。すなわち、図１７Ａに示す領域ｍ１を、人の存在領域であるとして、辞書作成部３３３が作成した認識辞書を、領域ｍ１の代表点（例えば重心位置）と関連付けて記憶する。 Here, among the plurality of feature points extracted from image k1, feature points that lie close to one another are considered to arise from the action of a single person. Accordingly, the region m1 shown in FIG. 17A is regarded as the region where a person is present, and the recognition dictionary created by the dictionary creation unit 333 is stored in association with a representative point of the region m1 (for example, its centroid).

図１７Ｂの画像ｋ２から抽出された特徴点が形成する領域ｍ２についても同様である。このように、辞書作成部３３３は、特定行動を含むＮフレームの画像を１つの学習データとして、認識辞書を作成する。 The same applies to the area m2 formed by the feature points extracted from the image k2 in FIG. 17B. In this way, the dictionary creation unit 333 creates a recognition dictionary using N frames of images including a specific action as one set of learning data.

（特定行動の持続時間の説明） (Description of the duration of a specific action)
次に、図１８を用いて、特定行動の持続時間について説明する。図１８は、特定行動の持続時間の測定について説明する図である。 Next, the duration of a specific action will be described with reference to FIG. 18. FIG. 18 is a diagram explaining the measurement of the duration of specific actions.

持続時間測定部３３６は、特定行動の認識結果により特定行動の持続時間を測定する。図１８は、時刻ｔ０から時刻ｔ１の間は、「歩く」行動を行ったと認識されて、時刻ｔ２から時刻ｔ３の間は、「棚入れ」行動を行ったと認識された例を示す。 The duration measurement unit 336 measures the duration of a specific action based on the recognition result for that action. FIG. 18 shows an example in which a "walking" action is recognized from time t0 to time t1 and a "shelving" action is recognized from time t2 to time t3.

持続時間測定部３３６は、図１８において、「歩く」行動の持続時間は（ｔ１－ｔ０）であるとし、「棚入れ」行動の持続時間は（ｔ３－ｔ２）であると判断する。なお、認識する特定行動の数が増えた場合も、同様に、各特定行動の認識処理を行い、行動の持続時間が測定される。 In FIG. 18, the duration measurement unit 336 determines that the duration of the "walking" action is (t1-t0) and that the duration of the "shelving" action is (t3-t2). When the number of specific actions to be recognized increases, recognition processing is likewise performed for each specific action and the duration of each action is measured.
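
The duration measurement can be sketched as grouping consecutive recognition results with the same label. The sketch assumes one recognition result per frame, with `None` where no specific action was recognized, and uses the 30 frames/second video rate given earlier to convert frame counts into seconds. The per-frame labelling and the function name are assumptions for illustration.

```python
def measure_durations(frame_labels, fps=30.0):
    """Return (label, start_s, end_s, duration_s) for each run of equal labels.

    Frames labelled None (no specific action recognized) are skipped.
    """
    segments = []
    start = 0
    for i in range(1, len(frame_labels) + 1):
        # A run ends at the end of the sequence or when the label changes.
        if i == len(frame_labels) or frame_labels[i] != frame_labels[start]:
            label = frame_labels[start]
            if label is not None:
                segments.append((label, start / fps, i / fps, (i - start) / fps))
            start = i
    return segments
```

In the terms of FIG. 18, the start and end times of the first segment correspond to t0 and t1, and those of the second segment to t2 and t3.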

（認識辞書作成処理の流れの説明） (Description of the flow of recognition dictionary creation processing)
次に、図１９を用いて、辞書作成部３３３が行う認識辞書作成処理の流れを説明する。なお、図１９は、認識辞書の作成の流れの一例を示すフローチャートである。 Next, the flow of the recognition dictionary creation processing performed by the dictionary creation unit 333 will be described with reference to FIG. 19. Note that FIG. 19 is a flowchart showing an example of the flow of creating a recognition dictionary.

動画入力部３３１は、魚眼カメラ２００が撮影した動画を入力する（ステップＳ１１）。 The moving image input unit 331 inputs a moving image captured by the fisheye camera 200 (step S11).

特徴点抽出部３３３ａは、入力された動画の中から特徴点を抽出する（ステップＳ１２）。 The feature point extraction unit 333a extracts feature points from the input moving image (step S12).

特徴点分類部３３３ｂは、抽出された特徴点をクラスタリングする（ステップＳ１３）。 The feature point classification unit 333b clusters the extracted feature points (step S13).

特徴ベクトル算出部３３３ｃは、平均ベクトルＶｋを算出する（ステップＳ１４）。 The feature vector calculator 333c calculates the average vector Vk (step S14).

ヒストグラム作成部３３３ｄは、学習ヒストグラムＨ（ｋ）を作成する（ステップＳ１５）。 The histogram creation unit 333d creates a learning histogram H(k) (step S15).

認識辞書作成部３３３ｅは、認識辞書を作成する（ステップＳ１６）。その後、辞書作成部３３３は、図１９の処理を終了する。なお、前記したように、認識辞書は、画像の異なる位置（歪が異なる位置）において複数作成する必要があるため、図１９の処理は繰り返し実行される。 The recognition dictionary creation unit 333e creates a recognition dictionary (step S16). After that, the dictionary creating unit 333 terminates the processing of FIG. 19 . As described above, it is necessary to create a plurality of recognition dictionaries at different positions of the image (positions with different distortions), so the process of FIG. 19 is repeatedly executed.

（行動認識処理の流れの説明） (Description of the flow of action recognition processing)
次に、図２０を用いて、行動認識処理部３２１が行う行動認識処理の流れを説明する。なお、図２０は、特定行動の認識処理の流れの一例を示すフローチャートである。 Next, the flow of the action recognition processing performed by the action recognition processing unit 321 will be described with reference to FIG. 20. Note that FIG. 20 is a flowchart showing an example of the flow of recognition processing for a specific action.

動画入力部３３１は、魚眼カメラ２００が撮影した動画を入力する（ステップＳ２１）。 The moving image input unit 331 inputs a moving image captured by the fisheye camera 200 (step S21).

特徴点抽出部３３５ａは、入力された動画の中から特徴点を抽出する（ステップＳ２２）。 The feature point extraction unit 335a extracts feature points from the input moving image (step S22).

特徴ベクトル算出部３３５ｂは、平均ベクトルＶｋを算出する（ステップＳ２３）。 The feature vector calculator 335b calculates the average vector Vk (step S23).

ヒストグラム作成部３３５ｃは、特定行動ヒストグラムＴ（ｋ）を作成する（ステップＳ２４）。 The histogram creation unit 335c creates a specific action histogram T(k) (step S24).

辞書選択部３３４は、認識辞書を選択する（ステップＳ２５）。具体的には、辞書選択部３３４は、特徴点抽出部３３５ａが抽出した特徴点の位置の近傍で作成された認識辞書を選択する。すなわち、辞書選択部３３４は、歪の大きさが近い位置で作成された認識辞書を選択する。 The dictionary selection unit 334 selects a recognition dictionary (step S25). Specifically, the dictionary selection unit 334 selects the recognition dictionary created near the position of the feature points extracted by the feature point extraction unit 335a. That is, the dictionary selection unit 334 selects the recognition dictionary created at a position where the magnitude of distortion is similar.
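
One simple way to realize this selection is a nearest-neighbour lookup over the representative points stored with each dictionary. The sketch below assumes each dictionary is stored as a pair of a representative point (x, y) and an arbitrary dictionary object, and that plain Euclidean distance in image coordinates is an adequate proxy for similarity of distortion. Both are assumptions, not details from the patent.

```python
import math


def select_dictionary(person_position, dictionaries):
    """Return the dictionary whose representative point is nearest.

    dictionaries is a non-empty list of (representative_point, dictionary)
    pairs, where representative_point is an (x, y) tuple in image coordinates.
    """
    nearest = min(
        dictionaries,
        key=lambda entry: math.dist(person_position, entry[0]),
    )
    return nearest[1]
```

Because fisheye distortion mainly grows with distance from the image center, a variant that compares radial distances from the center instead of positions would also be conceivable.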

行動認識部３３５は、特定行動を認識する（ステップＳ２６）。なお、特定行動の認識処理の流れは後述する（図２１）。 The action recognition unit 335 recognizes the specific action (step S26). Note that the flow of recognition processing for specific actions will be described later (FIG. 21).

持続時間測定部３３６は、特定行動の持続時間を測定する（ステップＳ２７）。 The duration measurement unit 336 measures the duration of the specific action (step S27).

さらに、持続時間測定部３３６は、特定行動の種類と特定行動の測定結果とを出力する（ステップＳ２８）。その後、行動認識部３３５は、図２０の処理を終了する。 Further, the duration measurement unit 336 outputs the type of specific action and the measurement result of the specific action (step S28). After that, the action recognition unit 335 ends the processing of FIG. 20 .

（特定行動の認識処理の流れの説明） (Description of the flow of specific action recognition processing)
次に、図２１を用いて、行動認識部３３５が行う特定行動の認識処理の流れを説明する。なお、図２１は、複数の特定行動を認識する処理の流れの一例を示すフローチャートである。特に図２１は、特定行動のうち、「歩く」行動を行った後で「棚入れ」行動を行ったことを認識する処理の流れを示す。 Next, the flow of the specific action recognition processing performed by the action recognition unit 335 will be described with reference to FIG. 21. Note that FIG. 21 is a flowchart showing an example of the flow of processing for recognizing a plurality of specific actions. In particular, FIG. 21 shows the flow of processing for recognizing that, among the specific actions, the "shelving" action was performed after the "walking" action.

行動認識部３３５は、「歩く」行動を認識する（ステップＳ３１）。 The action recognition unit 335 recognizes a "walking" action (step S31).

次に、行動認識部３３５は、「歩く」行動を認識したかを判定する（ステップＳ３２）。「歩く」行動を認識したと判定される（ステップＳ３２：Ｙｅｓ）とステップＳ３１に進む。一方、「歩く」行動を認識したと判定されない（ステップＳ３２：Ｎｏ）とステップＳ３３に進む。 Next, the action recognition unit 335 determines whether or not a "walking" action has been recognized (step S32). If it is determined that the "walking" action has been recognized (step S32: Yes), the process proceeds to step S31. On the other hand, if it is not determined that the "walking" action has been recognized (step S32: No), the process proceeds to step S33.

ステップＳ３２でＮｏと判定されると、行動認識部３３５は、「棚入れ」行動を認識する（ステップＳ３３）。 If determined as No in step S32, the action recognition unit 335 recognizes the action of "shelving" (step S33).

次に、行動認識部３３５は、「棚入れ」行動を認識したかを判定する（ステップＳ３４）。「棚入れ」行動を認識したと判定される（ステップＳ３４：Ｙｅｓ）と図２１の処理を終了して、図２０のステップＳ２７に進む。一方、「棚入れ」行動を認識したと判定されない（ステップＳ３４：Ｎｏ）とステップＳ３１に戻る。 Next, the action recognition unit 335 determines whether the "shelving" action has been recognized (step S34). If it is determined that the "shelving" action has been recognized (step S34: Yes), the processing of FIG. 21 ends and the process proceeds to step S27 of FIG. 20. On the other hand, if it is not determined that the "shelving" action has been recognized (step S34: No), the process returns to step S31.

なお、図２１に示すフローチャートは一例であって、行動認識部３３５は、認識する特定行動の種類や順序に応じた処理を行う。 Note that the flowchart shown in FIG. 21 is an example, and the action recognition unit 335 performs processing according to the type and order of specific actions to be recognized.
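
The loop of steps S31 to S34 can be sketched as follows. The sketch assumes the moving image has already been cut into successive observation windows and that each pass through the flowchart consumes the next window. The two predicate arguments stand in for the recognition steps and are hypothetical.

```python
def recognize_walk_then_shelving(windows, is_walking, is_shelving):
    """Follow steps S31 to S34 over successive observation windows.

    Returns the index of the window in which shelving is recognized,
    or None if the input ends first.
    """
    for index, window in enumerate(windows):
        if is_walking(window):
            # S32: Yes -> back to S31 with the next window.
            continue
        if is_shelving(window):
            # S34: Yes -> leave this flow (step S27 of FIG. 20 follows).
            return index
        # S34: No -> back to S31 with the next window.
    return None
```

As written, the flow mirrors the flowchart literally: it does not itself require that a walking window was seen before shelving is accepted.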

以上説明したように、第１の実施形態の行動認識装置３００によれば、動画入力部３３１は、魚眼カメラ２００（撮影手段）で撮影された動画を入力して、領域分割部３３２は、動画に含まれる画像を、歪の異なる複数の領域に分割する。辞書作成部３３３は、分割された領域毎に、人（被写体）の特定行動を認識するための認識辞書を作成する。辞書選択部３３４は、辞書作成部３３３が作成した複数の認識辞書の中から、動画から検出した人の特定行動を認識するために使用する認識辞書を選択する。そして、行動認識部３３５は、辞書選択部３３４が選択した認識辞書に基づいて、人の特定行動を認識する。したがって、画像の領域毎に認識辞書を作成するため、撮影した画像の歪を補正することなく、人の特定行動（標準作業）を認識することができる。 As described above, according to the action recognition device 300 of the first embodiment, the moving image input unit 331 inputs a moving image captured by the fisheye camera 200 (capturing means), and the area dividing unit 332 An image included in a moving image is divided into a plurality of regions with different distortions. The dictionary creating unit 333 creates a recognition dictionary for recognizing a specific action of a person (subject) for each divided area. A dictionary selection unit 334 selects a recognition dictionary to be used for recognizing a specific behavior of a person detected from a moving image from among the plurality of recognition dictionaries created by the dictionary creation unit 333 . Based on the recognition dictionary selected by the dictionary selection unit 334, the action recognition unit 335 recognizes the specific human action. Therefore, since a recognition dictionary is created for each area of an image, it is possible to recognize a specific human action (standard work) without correcting the distortion of the photographed image.

また、第１の実施形態の行動認識装置３００によれば、辞書選択部３３４は、魚眼カメラ２００（撮影手段）が撮影した動画に含まれる画像から検出した人（被写体）の位置に応じた認識辞書を選択する。したがって、撮影した画像の歪を補正することなく、人の特定行動（標準作業）を認識することができる。 Further, according to the action recognition device 300 of the first embodiment, the dictionary selection unit 334 detects the position of the person (subject) from the image included in the moving image captured by the fisheye camera 200 (capturing means). Select a recognition dictionary. Therefore, it is possible to recognize specific human behavior (standard work) without correcting the distortion of the photographed image.

また、本実施形態の行動認識装置３００によれば、持続時間測定部３３６は、特定行動の認識結果に基づいて、当該特定行動の持続時間を測定する。したがって、特定行動（標準作業）の持続時間を容易かつ正確に測定することができる。 Also, according to the action recognition device 300 of the present embodiment, the duration measurement unit 336 measures the duration of the specific action based on the recognition result of the specific action. Therefore, the duration of specific behavior (standard work) can be measured easily and accurately.

また、第１の実施形態の行動認識装置３００によれば、特徴点抽出部３３３ａは、魚眼カメラ２００（撮影手段）で撮影された動画に含まれる複数の画像の中から特徴点を抽出する。特徴点分類部３３３ｂは、抽出された特徴点をＫ種類に分類する。特徴ベクトル算出部３３３ｃは、分類されたＫ種類の特徴点グループに対して、それぞれのＫ個の平均ベクトルＶｋ（学習ベクトル）を求める。したがって、人（被写体）の特定行動を容易に学習することができる。 Further, according to the action recognition device 300 of the first embodiment, the feature point extraction unit 333a extracts feature points from a plurality of images included in a moving image captured by the fisheye camera 200 (capturing means). . The feature point classification unit 333b classifies the extracted feature points into K types. The feature vector calculator 333c obtains K average vectors Vk (learning vectors) for each of the K kinds of classified feature point groups. Therefore, it is possible to easily learn the specific behavior of a person (subject).

また、第１の実施形態の行動認識装置３００によれば、辞書作成部３３３は、動画入力部３３１によって入力された、特定行動を行っている人の動画（プラス学習データ）と、特定行動を行っていない人の動画（マイナス学習データ）とから、各データの特徴点が有する特徴量による平均ベクトルＶｋ（学習ベクトル）を用いて、それぞれ学習ヒストグラムＨ（ｋ）を作成して、プラス学習データから生成した学習ヒストグラムＨ（ｋ）と、マイナス学習データから生成した学習ヒストグラムＨ（ｋ）とに基づいて、認識辞書を作成する。したがって、認識辞書の精度を向上させることができる。 Further, according to the action recognition device 300 of the first embodiment, the dictionary creation unit 333 combines the video (plus learning data) of the person performing the specific action, input by the video input unit 331, with the specific action. Using the average vector Vk (learning vector) by the feature amount of the feature points of each data, the learning histogram H(k) is created from the video of the person who has not performed (negative learning data), and the plus learning data. and the learning histogram H(k) generated from the negative learning data, a recognition dictionary is created. Therefore, it is possible to improve the accuracy of the recognition dictionary.

また、第１の実施形態の行動認識装置３００によれば、特徴点抽出部３３５ａは、魚眼カメラ２００（撮影手段）で撮影された動画に含まれる複数の画像の中から特徴点を抽出する。特徴ベクトル算出部３３５ｂは、抽出された特徴点における時空間エッジの大きさと方向を示す特徴ベクトルを算出する。ヒストグラム作成部３３５ｃは、抽出された特徴点の特徴ベクトルに基づいて、特定行動ヒストグラムＴ（ｋ）を作成する。そして、行動認識部３３５ｄは、特定行動ヒストグラムＴ（ｋ）と認識辞書が有する学習ヒストグラムＨ（ｋ）とに基づいて、人の特定行動を認識する。したがって、人（被写体）の特定行動を容易かつ正確に認識することができる。 Further, according to the action recognition device 300 of the first embodiment, the feature point extraction unit 335a extracts feature points from a plurality of images included in a moving image captured by the fisheye camera 200 (capturing means). . The feature vector calculator 335b calculates a feature vector indicating the magnitude and direction of the spatio-temporal edge at the extracted feature point. The histogram creating unit 335c creates a specific action histogram T(k) based on the feature vectors of the extracted feature points. Then, the action recognition unit 335d recognizes a person's specific action based on the specific action histogram T(k) and the learning histogram H(k) of the recognition dictionary. Therefore, the specific behavior of a person (subject) can be easily and accurately recognized.

また、第１の実施形態の行動認識装置３００によれば、特徴ベクトル算出部３３５ｂは、入力された複数の画像をＭ×Ｎ×Ｔサイズのブロックに分割し、各ブロックを微分処理することで、Ｍ×Ｎ×Ｔ×３次元のエッジ情報Ｅ（ｘ，ｙ，ｔ）（微分ベクトル）を計算する。そして、特徴ベクトル算出部３３５ｂは、計算したエッジ情報Ｅ（ｘ，ｙ，ｔ）と事前に学習したＫ種類の平均ベクトルＶｋ（学習ベクトル）とを比較し、当該比較の結果に基づいて、エッジ情報Ｅ（ｘ，ｙ，ｔ）を最も近い平均ベクトルＶｋと同じ種類の特徴点に分類する。ヒストグラム作成部３３５ｃは、分類の結果に基づいて特定行動ヒストグラムＴ（ｋ）を作成する。したがって、撮影された動画から、特定行動の認識に使用する特定行動ヒストグラムＴ（ｋ）を容易に作成することができる。 Further, according to the action recognition device 300 of the first embodiment, the feature vector calculation unit 335b divides the plurality of input images into blocks of size M×N×T and differentiates each block to calculate M×N×T×3-dimensional edge information E(x, y, t) (differential vector). The feature vector calculation unit 335b then compares the calculated edge information E(x, y, t) with the K types of average vectors Vk (learning vectors) learned in advance and, based on the result of the comparison, classifies the edge information E(x, y, t) as a feature point of the same type as the nearest average vector Vk. The histogram creation unit 335c creates the specific action histogram T(k) based on the classification result. Therefore, the specific action histogram T(k) used for recognizing the specific action can be easily created from the captured moving image.
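The three steps described above (differentiating a block, assigning it to the nearest average vector Vk, and counting the assignments) can be sketched in plain Python. The forward-difference scheme, the squared Euclidean distance, and the function names are assumptions chosen for illustration; the text only states that each block is differentiated and compared with the K average vectors.

```python
def differential_vector(block):
    """Flatten the (dx, dy, dt) differences of a T x N x M block into one vector.

    block[t][y][x] holds a pixel intensity; the result has M*N*T*3 elements.
    """
    T, N, M = len(block), len(block[0]), len(block[0][0])
    vec = []
    for t in range(T):
        for y in range(N):
            for x in range(M):
                v = block[t][y][x]
                # Forward differences; the last sample along each axis
                # is compared with itself, giving a difference of 0.
                dx = block[t][y][min(x + 1, M - 1)] - v
                dy = block[t][min(y + 1, N - 1)][x] - v
                dt = block[min(t + 1, T - 1)][y][x] - v
                vec.extend((dx, dy, dt))
    return vec


def nearest_type(vec, mean_vectors):
    """Index k of the average vector Vk with the smallest squared Euclidean distance to vec."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(range(len(mean_vectors)), key=lambda k: sq_dist(vec, mean_vectors[k]))


def action_histogram(blocks, mean_vectors):
    """Count how many blocks fall into each of the K types, giving T(k)."""
    hist = [0] * len(mean_vectors)
    for block in blocks:
        hist[nearest_type(differential_vector(block), mean_vectors)] += 1
    return hist
```

With 2×2×2 blocks, each differential vector has 2×2×2×3 = 24 elements, matching the M×N×T×3 dimensionality stated in the text.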

また、第１の実施形態の行動認識装置３００によれば、辞書作成部３３３及び行動認識部３３５は、入力された動画に対して、時間軸でのフィルタリング処理を行う。そして、特徴点抽出部３３３ａ，３３５ａは、フィルタリング処理を行った結果、Ｍ×Ｎ×Ｔのブロック内における平均値が所定の閾値より大きい場合に、当該ブロックを特徴点として抽出する。したがって、特徴点の抽出を容易に行うことができる。 Further, according to the action recognition device 300 of the first embodiment, the dictionary creation unit 333 and the action recognition unit 335 perform filtering processing on the input moving image on the time axis. Then, when the average value in the M×N×T block is larger than a predetermined threshold as a result of the filtering processing, the feature point extraction units 333a and 335a extract the block as a feature point. Therefore, feature points can be easily extracted.

また、本実施形態の行動認識装置３００によれば、フィルタリング処理は、式（２），式（３），式（４）に示したＧａｂｏｒフィルタリング処理によって行う。したがって、撮影された動画のノイズが除去されることによって、特定行動の認識を行いやすい画像を得ることができる。 Further, according to the action recognition device 300 of the present embodiment, the filtering process is performed by the Gabor filtering process shown in Equations (2), (3), and (4). Therefore, by removing noise from the captured moving image, it is possible to obtain an image that facilitates recognition of the specific action.
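The temporal filtering and the block threshold test can be sketched as follows. Equations (2) to (4) are not reproduced in this section, so the kernel form used here, an even and an odd Gabor kernel with parameters τ and ω and a response equal to the sum of the squared convolutions, is an assumption based on a commonly used periodic-motion detector and may differ from the patent's equations. The function names, default parameter values, and zero padding at the clip boundaries are likewise hypothetical.

```python
import math


def gabor_kernels(tau, omega, half_width):
    """Even and odd 1-D temporal Gabor kernels sampled at t = -half_width..half_width."""
    ts = range(-half_width, half_width + 1)
    g_ev = [-math.cos(2 * math.pi * t * omega) * math.exp(-t * t / tau ** 2) for t in ts]
    g_od = [-math.sin(2 * math.pi * t * omega) * math.exp(-t * t / tau ** 2) for t in ts]
    return g_ev, g_od


def temporal_response(series, tau=2.0, omega=0.25, half_width=4):
    """Response R(t) = (I*g_ev)^2 + (I*g_od)^2 for one pixel's intensity over time."""
    g_ev, g_od = gabor_kernels(tau, omega, half_width)
    n = len(series)
    out = []
    for t in range(n):
        ev = od = 0.0
        for j, (ke, ko) in enumerate(zip(g_ev, g_od)):
            # Convolution: the kernel tap at offset (j - half_width)
            # multiplies the sample at time t - offset.
            idx = t - (j - half_width)
            if 0 <= idx < n:  # samples outside the clip are treated as zero
                ev += series[idx] * ke
                od += series[idx] * ko
        out.append(ev * ev + od * od)
    return out


def is_feature_block(responses, threshold):
    """A block is kept as a feature point when its mean response exceeds the threshold."""
    return sum(responses) / len(responses) > threshold
```

In this sketch a pixel whose intensity oscillates with a period of four frames produces a strong response at ω = 0.25, while a pixel that stays at zero produces none, so only the former would pass the threshold test.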

また、第１の実施形態の行動認識装置３００によれば、特徴点抽出部３３３ａ，３３５ａは、時間軸でのフィルタリング処理を行う前に、各画像に対して平滑化処理を行う。したがって、時間軸方向に発生するノイズが除去されるため、人（被写体）の特定行動を、より一層高精度に認識することができる。 Further, according to the action recognition device 300 of the first embodiment, the feature point extraction units 333a and 335a perform smoothing processing on each image before performing filtering processing on the time axis. Therefore, since noise occurring in the direction of the time axis is removed, the specific action of a person (subject) can be recognized with even higher accuracy.

また、第１の実施形態の行動認識装置３００によれば、行動認識部３３５は、人の特定行動を認識する場合に、所定の順序で特定行動を認識し、特定行動が認識された場合は認識結果を出力して、特定行動が認識されない場合は、次の特定行動を認識する。したがって、複数の特定行動が連続して発生する場合であっても、確実に認識することができる。 Further, according to the action recognition device 300 of the first embodiment, when recognizing a person's specific action, the action recognition unit 335 attempts to recognize the specific actions in a predetermined order. It outputs the recognition result when a specific action is recognized, and moves on to the next specific action when it is not. Therefore, even if a plurality of specific actions occur consecutively, they can be reliably recognized.
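The ordered recognition described above amounts to trying one recognizer after another and stopping at the first success. A minimal sketch follows; the representation of each recognizer as a `(name, predicate)` pair and the function name `recognize_in_order` are assumptions for illustration.

```python
def recognize_in_order(observation, recognizers):
    """Try each (name, predicate) pair in the given order.

    Returns the name of the first specific action whose predicate accepts the
    observation, or None when no specific action is recognized.
    """
    for name, is_match in recognizers:
        if is_match(observation):
            return name  # recognized: output the result and stop
        # not recognized: fall through to the next specific action
    return None
```

For instance, with recognizers for hypothetical actions "walk" and "place" listed in that order, an observation that only the second predicate accepts is reported as "place".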

また、第１の実施形態の行動認識装置３００によれば、広角レンズは、魚眼レンズである。したがって、１台のカメラでより一層広範囲を観測することができる。 Further, according to the action recognition device 300 of the first embodiment, the wide-angle lens is a fisheye lens. Therefore, a wider range can be observed with a single camera.

（第２の実施形態） (Second embodiment)
次に、添付図面を参照して、行動認識装置、行動認識方法及びプログラムの第２の実施形態を詳細に説明する。 Next, a second embodiment of the action recognition device, action recognition method, and program will be described in detail with reference to the accompanying drawings.

（行動認識装置のハードウェア構成の説明） (Description of the hardware configuration of the action recognition device)
図２２は、本実施形態に係る行動認識システム１００ａのハードウェア構成の一例を示すハードウェアブロック図である。図２２に示すように、行動認識システム１００ａは、魚眼カメラ２００，２０１と、行動認識装置３００ａとを備える。 FIG. 22 is a hardware block diagram showing an example of the hardware configuration of the action recognition system 100a according to this embodiment. As shown in FIG. 22, the action recognition system 100a includes fisheye cameras 200 and 201 and an action recognition device 300a.

行動認識システム１００ａは、第１の実施形態で説明した行動認識システム１００と同様の機能を有し、魚眼カメラ２００，２０１で撮影した人（被写体）の特定行動を認識する。行動認識システム１００との違いは、２台の魚眼カメラ２００，２０１で撮影した動画を入力可能な点である。 The action recognition system 100a has the same function as the action recognition system 100 described in the first embodiment, and recognizes a specific action of a person (subject) photographed by the fisheye cameras 200,201. The difference from the action recognition system 100 is that moving images captured by two fisheye cameras 200 and 201 can be input.

なお、行動認識装置３００ａは、行動認識処理部３２１ａと、行動認識処理部３２１ａと魚眼カメラ２００，２０１とを接続するインタフェース部３２２ａと、を備える。 The action recognition device 300a includes an action recognition processing section 321a and an interface section 322a that connects the action recognition processing section 321a and the fisheye cameras 200 and 201 to each other.

行動認識処理部３２１ａは、人（被写体）の特定行動を認識する。インタフェース部３２２ａは、魚眼カメラ２００，２０１が撮影した動画を、行動認識処理部３２１ａが認識可能なデータ形式に変換して、行動認識処理部３２１ａに受け渡す。 The action recognition processing unit 321a recognizes a specific action of a person (subject). The interface unit 322a converts the moving images captured by the fisheye cameras 200 and 201 into a data format recognizable by the action recognition processing unit 321a, and transfers the converted data to the action recognition processing unit 321a.

次に、図２３を用いて、行動認識システム１００ａが使われる代表的な場面を説明する。図２３は、第２の実施形態に係る行動認識システム１００ａが使用されている場面の一例を示す図である。 Next, a typical scene in which the action recognition system 100a is used will be described with reference to FIG. 23. FIG. 23 is a diagram showing an example of a scene in which the action recognition system 100a according to the second embodiment is used.

図２３に示すように、行動認識システム１００ａは、オフィスや工場などの職場における作業環境に設置される。魚眼カメラ２００，２０１は、作業環境において作業を行っている複数の人Ｈ１，Ｈ２を含む動画を撮影する。本実施形態では、魚眼カメラ２００，２０１は、いずれも対角線画角１８０°を有する魚眼レンズを備えるものとする。そして、２台の魚眼カメラ２００，２０１は、異なる方向から同じ作業環境を撮影する。なお、人Ｈ１，Ｈ２は、被写体の一例である。 As shown in FIG. 23, the action recognition system 100a is installed in a work environment such as an office or a factory. The fisheye cameras 200 and 201 capture moving images including a plurality of people H1 and H2 working in the work environment. In this embodiment, the fisheye cameras 200 and 201 are both equipped with a fisheye lens having a diagonal angle of view of 180°. The two fisheye cameras 200 and 201 photograph the same work environment from different directions. Note that the people H1 and H2 are examples of subjects.

第１の実施形態で説明した行動認識システム１００は、複数の作業員が作業している環境において、複数の人の所定行動を認識することが可能であるが、別の作業者の死角に入っている作業者は可視化することができないため、行動認識を行うことができなかった。これに対して、行動認識システム１００ａは、作業環境を異なる方向から観測するため、死角が少なくなり、複数の人の所定行動を、より確実に認識することができる。さらに、行動認識システム１００ａは、１人の作業者を２台の魚眼カメラ２００，２０１で撮影することができるため、より小さい歪で撮影された画像を用いて行動認識を行うことができる。 The action recognition system 100 described in the first embodiment can recognize predetermined actions of a plurality of people in an environment where a plurality of workers are working, but a worker who is in the blind spot of another worker cannot be seen, so action recognition could not be performed for that worker. In contrast, since the action recognition system 100a observes the work environment from different directions, blind spots are reduced and the predetermined actions of a plurality of people can be recognized more reliably. Furthermore, since the action recognition system 100a can photograph one worker with the two fisheye cameras 200 and 201, action recognition can be performed using the image captured with less distortion.

なお、行動認識システム１００ａのハードウェア構成は、魚眼カメラの台数が増える以外は、行動認識システム１００のハードウェア構成と同じであるため、説明は省略する。 Note that the hardware configuration of the action recognition system 100a is the same as the hardware configuration of the action recognition system 100 except that the number of fisheye cameras is increased, so the description is omitted.

（行動認識処理部の機能構成の説明） (Description of the functional configuration of the action recognition processing unit)
次に、図２４を用いて、行動認識処理部３２１ａの機能構成を説明する。図２４は、第２の実施形態における行動認識処理部３２１ａの機能構成の一例を示す機能ブロック図である。図２４に示すように、行動認識処理部３２１ａは、第１の実施形態で説明した行動認識処理部３２１の機能構成（図１２）に加えて、同一人物判定部３３７を備える。また行動認識処理部３２１ａは、辞書選択部３３４の代わりに、機能が変更された辞書選択部３３４ａを備える。 Next, the functional configuration of the action recognition processing unit 321a will be described with reference to FIG. 24. FIG. 24 is a functional block diagram showing an example of the functional configuration of the action recognition processing unit 321a in the second embodiment. As shown in FIG. 24, the action recognition processing unit 321a includes a same person determination unit 337 in addition to the functional configuration (FIG. 12) of the action recognition processing unit 321 described in the first embodiment. In place of the dictionary selection unit 334, the action recognition processing unit 321a also includes a dictionary selection unit 334a with a modified function.

辞書選択部３３４ａは、魚眼カメラ２００，２０１が、それぞれ同じ人を撮影した際に、行動認識を行うために使用する画像に応じた認識辞書を選択する。具体的には、辞書選択部３３４ａは、魚眼カメラ２００，２０１が撮影した画像における同一人物の位置を比較して、より画像の中央に近い位置に写っている人の特定行動を認識するための認識辞書、すなわち、より歪の小さい位置で作成された認識辞書を選択する。なお、魚眼カメラ２００，２０１が撮影した画像に同一人物が写っているかは、後述する同一人物判定部３３７が判定する。 When the fisheye cameras 200 and 201 each photograph the same person, the dictionary selection unit 334a selects the recognition dictionary corresponding to the image to be used for action recognition. Specifically, the dictionary selection unit 334a compares the positions of the same person in the images captured by the fisheye cameras 200 and 201, and selects the recognition dictionary for recognizing the specific action of the person appearing closer to the center of the image, that is, the recognition dictionary created at a position with less distortion. Whether the same person appears in the images captured by the fisheye cameras 200 and 201 is determined by the same person determination unit 337, which will be described later.

同一人物判定部３３７は、魚眼カメラ２００，２０１がそれぞれ撮影した画像の中に同一人物が写っているかを判定する。具体的には、同一人物判定部３３７は、魚眼カメラ２００，２０１がそれぞれ撮影した画像から抽出された特徴点に基づく特徴ベクトルを比較することによって、特徴ベクトルの種類と特徴ベクトルの向きが類似している場合に、同一人物が写っていると判定する。 The same person determination unit 337 determines whether the images captured by the fisheye cameras 200 and 201 include the same person. Specifically, the same person determination unit 337 compares feature vectors based on feature points extracted from images captured by the fisheye cameras 200 and 201, respectively, to determine whether the types of feature vectors and the orientations of the feature vectors are similar. If so, it is determined that the same person is photographed.
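One way to picture the same-person test is to check whether most feature vectors from one camera have a counterpart of the same type with a similar direction in the other camera. The text does not specify how similarity is measured, so the cosine similarity, both thresholds, and the function names below are assumptions for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def same_person(features_a, features_b, min_cos=0.9, min_ratio=0.8):
    """Decide whether two sets of features are likely to show the same person.

    Each feature is a (type_k, direction_vector) pair. A feature from camera A
    counts as matched when camera B has a feature of the same type whose
    direction is similar (cosine >= min_cos). Both thresholds are hypothetical.
    """
    if not features_a or not features_b:
        return False
    matched = sum(
        1
        for k, d in features_a
        if any(k == k2 and cosine(d, d2) >= min_cos for k2, d2 in features_b)
    )
    return matched / len(features_a) >= min_ratio
```

Because the two cameras view the scene from different directions, a practical system would probably need to bring the directions into a common frame before comparing them; this sketch omits that step.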

（行動認識処理の流れの説明） (Description of the flow of action recognition processing)
次に、図２５を用いて、行動認識処理部３２１ａが行う行動認識処理の流れを説明する。なお、図２５は、第２の実施形態における特定行動の認識処理の流れの一例を示すフローチャートである。 Next, the flow of the action recognition processing performed by the action recognition processing unit 321a will be described with reference to FIG. 25. FIG. 25 is a flowchart showing an example of the flow of the specific action recognition processing in the second embodiment.

ステップＳ４１からステップＳ４４は、第１の実施形態で説明したステップＳ２１からステップＳ２４（図２０）と同じ処理である。 Steps S41 to S44 are the same processes as steps S21 to S24 (FIG. 20) described in the first embodiment.

次に、同一人物判定部３３７は、魚眼カメラ２００，２０１がそれぞれ撮影した画像の中から、同一人物を表す領域を特定する（ステップＳ４５）。 Next, the same person determination unit 337 identifies areas representing the same person from the images captured by the fisheye cameras 200 and 201 (step S45).

続いて、辞書選択部３３４ａは、ステップＳ４５で特定された同一人物を表す領域のうち、最も画像の中央に近い位置にある領域を撮影した魚眼カメラを特定して、当該位置に対応する認識辞書を選択する（ステップＳ４６）。なお、ステップＳ４５において、同一人物を表す領域が特定できなかった場合は、辞書選択部３３４ａは、検出された各領域にそれぞれ対応する認識辞書を選択する。 Next, from among the areas representing the same person identified in step S45, the dictionary selection unit 334a identifies the fisheye camera that captured the area closest to the center of its image, and selects the recognition dictionary corresponding to that position (step S46). If no areas representing the same person could be identified in step S45, the dictionary selection unit 334a selects the recognition dictionary corresponding to each detected area.
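Step S46 can be sketched as picking, among the detections of the same person, the one nearest to its own image center and looking up the dictionary for that camera and area. The detection record layout, the normalization by the half-diagonal, and the function names are assumptions introduced for this example.

```python
import math


def center_distance(position, image_size):
    """Distance of a detected position from the image center, normalized by the half-diagonal."""
    (x, y), (w, h) = position, image_size
    return math.hypot(x - w / 2, y - h / 2) / math.hypot(w / 2, h / 2)


def select_dictionary(detections, dictionaries):
    """Select the recognition dictionary for the least distorted view of one person.

    detections: one dict per camera that sees the person, with the hypothetical keys
        "camera", "region", "position" (x, y) and "image_size" (width, height).
    dictionaries: mapping from (camera, region) to a recognition dictionary.
    Returns (camera, dictionary) for the detection closest to its image center.
    """
    best = min(detections, key=lambda d: center_distance(d["position"], d["image_size"]))
    return best["camera"], dictionaries[(best["camera"], best["region"])]
```

Normalizing by the half-diagonal lets cameras with different resolutions be compared on the same scale; whether the patent normalizes distances this way is not stated.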

続くステップＳ４７からステップＳ４９で行う処理は、第１の実施形態で説明したステップＳ２６からステップＳ２９（図２０）と同じ処理である。 The processing from step S47 to step S49 that follows is the same as the processing from step S26 to step S29 (FIG. 20) described in the first embodiment.

以上説明したように、第２の実施形態の行動認識装置３００ａは、複数の魚眼カメラ２００，２０１（撮影手段）が、同じ領域を異なる方向から撮影する。したがって、観測範囲の死角が減少する。また、同じ人（被写体）を異なる方向から撮影することができるため、行動認識の認識精度を向上させることができる。 As described above, in the action recognition device 300a of the second embodiment, the plurality of fisheye cameras 200 and 201 (photographing means) photograph the same area from different directions. Therefore, blind spots in the observation range are reduced. In addition, since the same person (subject) can be photographed from different directions, the recognition accuracy of action recognition can be improved.

また、第２の実施形態の行動認識装置３００ａによれば、辞書選択部３３４ａは、複数の魚眼カメラ２００，２０１（撮影手段）が撮影した動画に含まれる画像からそれぞれ検出した同じ人（被写体）の位置に応じた認識辞書のうち、最も歪の小さい認識辞書を選択する。したがって、特定行動の認識精度を向上させることができる。 Further, according to the action recognition device 300a of the second embodiment, the dictionary selection unit 334a selects the recognition dictionary with the least distortion from among the recognition dictionaries corresponding to the positions of the same person (subject) detected in the images included in the moving images captured by the plurality of fisheye cameras 200 and 201 (photographing means). Therefore, the recognition accuracy of the specific action can be improved.

また、第２の実施形態の行動認識装置３００ａによれば、辞書選択部３３４ａは、複数の魚眼カメラ２００，２０１（撮影手段）が撮影した動画に含まれる画像からそれぞれ検出した同じ人の位置に応じた認識辞書のうち、画像の中央に近い位置に対応する認識辞書を選択する。したがって、歪の小さい位置で作成された認識辞書が選択されるため、特定行動の認識精度を向上させることができる。 Further, according to the action recognition device 300a of the second embodiment, the dictionary selection unit 334a selects, from among the recognition dictionaries corresponding to the positions of the same person detected in the images included in the moving images captured by the plurality of fisheye cameras 200 and 201 (photographing means), the recognition dictionary corresponding to a position near the center of the image. Since a recognition dictionary created at a position with little distortion is thus selected, the recognition accuracy of the specific action can be improved.

以上、本発明の実施の形態について説明したが、上述した実施の形態は、例として提示したものであり、本発明の範囲を限定することは意図していない。この新規な実施の形態は、その他の様々な形態で実施されることが可能である。また、発明の要旨を逸脱しない範囲で、種々の省略、置き換え、変更を行うことができる。また、この実施の形態は、発明の範囲や要旨に含まれるとともに、特許請求の範囲に記載された発明とその均等の範囲に含まれる。 Although the embodiments of the present invention have been described above, the above-described embodiments are presented as examples and are not intended to limit the scope of the present invention. This novel embodiment can be implemented in various other forms. Also, various omissions, replacements, and changes can be made without departing from the scope of the invention. Moreover, this embodiment is included in the scope and gist of the invention, and is included in the scope of the invention described in the claims and its equivalents.

２００，２０１魚眼カメラ（撮影手段）
３００，３００ａ行動認識装置
３２１，３２１ａ行動認識処理部
３３１動画入力部
３３２領域分割部
３３３辞書作成部
３３４辞書選択部
３３５行動認識部
３３６持続時間測定部
３３３ａ，３３５ａ特徴点抽出部
３３３ｂ特徴点分類部
３３３ｃ，３３５ｂ特徴ベクトル算出部
３３３ｄヒストグラム作成部（学習ヒストグラム作成部）
３３５ｃヒストグラム作成部
３３３ｅ認識辞書作成部
３３５ｄ行動認識部
Ｈ１，Ｈ２人（被写体）
Ｈ（ｋ）学習ヒストグラム
Ｔ（ｋ）特定行動ヒストグラム
Ｖｋ平均ベクトル（学習ベクトル）
200, 201 fisheye camera (photographing means)
300, 300a action recognition device
321, 321a action recognition processing unit
331 moving image input unit
332 region division unit
333 dictionary creation unit
334 dictionary selection unit
335 action recognition unit
336 duration measurement unit
333a, 335a feature point extraction unit
333b feature point classification unit
333c, 335b feature vector calculation unit
333d histogram creation unit (learning histogram creation unit)
335c histogram creation unit
333e recognition dictionary creation unit
335d action recognition unit
H1, H2 person (subject)
H(k) learning histogram
T(k) specific action histogram
Vk average vector (learning vector)

特開２０１１－１００１７５号公報 JP 2011-100175 A

Claims

An action recognition device that recognizes, from a captured moving image to be recognized, a specific action of a subject appearing in the moving image, the device comprising:
a first moving image input unit that inputs moving images of a subject performing the specific action at a plurality of positions with different distortions within the observation range of photographing means, the moving images being captured by a plurality of the photographing means, each equipped with a wide-angle lens and photographing the same area from different directions;
a second moving image input unit that inputs moving images to be recognized captured by the plurality of photographing means;
an area dividing unit that divides the images included in the moving images input by the first moving image input unit and the second moving image input unit into a plurality of areas with different distortions;
a dictionary creation unit that creates, from the moving images input by the first moving image input unit, a recognition dictionary for recognizing the specific action of the subject for each of the photographing means and each of the areas;
a dictionary selection unit that selects, from among the plurality of recognition dictionaries created by the dictionary creation unit for each photographing means and each area, the recognition dictionary with the least distortion according to the positions of the same subject detected in the respective areas of the images included in the moving images to be recognized input from the different photographing means; and
an action recognition unit that recognizes the specific action of the subject based on the recognition dictionary selected by the dictionary selection unit.

The dictionary selection unit selects, from among the recognition dictionaries corresponding to the positions of the same subject detected in the images included in the moving image to be recognized input by the second moving image input unit, the recognition dictionary corresponding to a position near the center of the image,
The action recognition device according to claim 1.

Further comprising a duration measurement unit that measures the duration of the specific action based on the recognition result of the specific action,
The action recognition device according to claim 1 or 2 .

The dictionary creation unit comprises:
a feature point extraction unit that extracts feature points from a plurality of images included in the moving image input by the first moving image input unit;
a feature point classification unit that classifies the extracted feature points into K types;
a feature vector calculation unit that obtains K learning vectors, one for each of the K types of classified feature point groups; and
a learning histogram creation unit that creates a learning histogram representing the appearance frequency of the learning vectors,
The action recognition device according to any one of claims 1 to 3.

The learning histogram creation unit
creates a learning histogram from each of the moving image of the subject performing the specific action, which constitutes the positive learning data, and the moving image of the subject not performing the specific action, which constitutes the negative learning data, both input by the first moving image input unit, using the learning vectors based on the feature amounts of the feature points of each set of data,
The dictionary creation unit
creates the recognition dictionary based on the learning histogram generated from the positive learning data and the learning histogram generated from the negative learning data,
The action recognition device according to claim 4.

The action recognition unit comprises:
a feature point extraction unit that extracts feature points from a plurality of images included in the moving image to be recognized input by the second moving image input unit;
a feature vector calculation unit that calculates a feature vector indicating the magnitude and direction of the spatiotemporal edge at each extracted feature point; and
a histogram creation unit that creates a histogram representing the frequency of appearance of the feature vectors at the feature points,
and recognizes the specific action of the subject based on the histogram and the recognition dictionary,
The action recognition device according to claim 4 or 5.

The feature vector calculator,
A plurality of input images are divided into blocks of size M×N×T, and each block is differentiated to calculate an M×N×T×3-dimensional differential vector,
Comparing the calculated differential vector with the learning vector learned in advance, classifying the differential vector into the same type of feature points as the closest learning vector based on the result of the comparison,
The histogram creating unit
creating the histogram based on the results of the classification;
The action recognition device according to claim 6 .

The dictionary creation unit and the action recognition unit are
Filtering the moving image input by the first moving image input unit and the moving image to be recognized input by the second moving image input unit on the time axis,
The feature point extraction unit is
As a result of performing the filtering process, if the average value in the M × N × T block is larger than a predetermined threshold, extracting the block as a feature point;
The action recognition device according to any one of claims 4 to 7 .

The filtering process is a Gabor filtering process using the following equation (3), where g_ev and g_od are the kernels of the Gabor filter shown in the following equations (1) and (2), * denotes convolution, and τ and ω are parameters of the kernels,

The action recognition device according to claim 8 .

The feature point extraction unit is
Performing a smoothing process on each image before performing the filtering process,
The action recognition device according to claim 8 or 9 .

The action recognition unit is
when recognizing the specific action of the subject, recognizing the specific action in a predetermined order, and outputting the recognition result when the specific action is recognized,
If the specific action is not recognized, recognize the next specific action,
The action recognition device according to any one of claims 1 to 10 .

The wide-angle lens is a fisheye lens,
The action recognition device according to any one of claims 1 to 11 .

An action recognition method for recognizing, from a captured moving image to be recognized, a specific action of a subject appearing in the moving image, the method performing:
a first moving image input step of inputting moving images of a subject performing the specific action at a plurality of positions with different distortions within the observation range of photographing means, the moving images being captured by a plurality of the photographing means, each equipped with a wide-angle lens and photographing the same area from different directions;
a second moving image input step of inputting moving images to be recognized captured by the plurality of photographing means;
an area dividing step of dividing the images included in the moving images input in the first moving image input step and the second moving image input step into a plurality of areas with different distortions;
a dictionary creation step of creating, from the moving images input in the first moving image input step, a recognition dictionary for recognizing the specific action of the subject for each of the photographing means and each of the areas;
a dictionary selection step of selecting, from among the plurality of recognition dictionaries created in the dictionary creation step for each photographing means and each area, the recognition dictionary with the least distortion according to the positions of the same subject detected in the respective areas of the images included in the moving images to be recognized input from the different photographing means; and
an action recognition step of recognizing the specific action of the subject based on the recognition dictionary selected in the dictionary selection step.

A program that causes a computer, which controls an action recognition device that recognizes, from a captured moving image to be recognized, a specific action of a subject appearing in the moving image, to function as:
a first moving image input unit that inputs moving images of a subject performing the specific action at a plurality of positions with different distortions within the observation range of photographing means, the moving images being captured by a plurality of the photographing means, each equipped with a wide-angle lens and photographing the same area from different directions;
a second moving image input unit that inputs moving images to be recognized captured by the plurality of photographing means;
an area dividing unit that divides the images included in the moving images input by the first moving image input unit and the second moving image input unit into a plurality of areas with different distortions;
a dictionary creation unit that creates, from the moving images input by the first moving image input unit, a recognition dictionary for recognizing the specific action of the subject for each of the photographing means and each of the areas;
a dictionary selection unit that selects, from among the plurality of recognition dictionaries created by the dictionary creation unit for each photographing means and each area, the recognition dictionary with the least distortion according to the positions of the same subject detected in the respective areas of the images included in the moving images to be recognized input from the different photographing means; and
an action recognition unit that recognizes the specific action of the subject based on the recognition dictionary selected by the dictionary selection unit.