JP2022189456A - Action recognition program, action recognition method, and information processing apparatus

Info

Publication number: JP2022189456A
Application number: JP2021098038A
Authority: JP
Inventors: 大輔内田; Daisuke Uchida; 智史島田; Tomohito Shimada; 有一村瀬; Yuichi Murase
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2021-06-11
Filing date: 2021-06-11
Publication date: 2022-12-22

Abstract

To recognize an action of a target person at a low cost with high accuracy. SOLUTION: An information processing apparatus detects skeleton information of a person from an image obtained by an imaging apparatus. The information processing apparatus acquires region information on a position where the person was imaged, and positions of a target region and an object. The information processing apparatus estimates an action performed by the person with respect to the object, based on the skeleton information of the person and the region information. The information processing apparatus recognizes an action of the person based on a distance between the person and the object, and the estimated action. SELECTED DRAWING: Figure 7

Description

The present invention relates to an action recognition program, an action recognition method, and an information processing apparatus.

With the development of AI (Artificial Intelligence) technology, techniques that recognize people and objects in video and automatically detect a person's motion, posture, state, and behavior from the skeleton information of the recognized person have been developed and put to use. Examples include techniques that automatically detect elderly or physically impaired people and determine whether they are in a dangerous state, and techniques that recognize a worker's posture and work process to determine whether the worker has entered a dangerous area, is working in a strained posture, or is following the prescribed procedure.

Because AI technology makes it possible to analyze human behavior and states automatically in this way, its application is desired in various fields such as purchase analysis, work analysis, on-site monitoring and watching over people, and suspicious-person detection. For example, a technique is known that combines a camera and a sensor to recognize a subject's motion, in particular a reaching motion or a work task.

JP 2019-139321 A
JP 2015-176227 A

With the above techniques, however, it is difficult to recognize the behavior of a target person at low cost and with high accuracy. For example, recognition techniques that use sensors generally rely on special sensors such as laser or radio sensors, which makes the configuration complex and the cost high. Recognition techniques that use only a camera are also known, but they cannot accurately detect the vertical, horizontal, and depth directions.

An object of one aspect is to provide an action recognition program, an action recognition method, and an information processing apparatus capable of recognizing a target person's action at low cost and with high accuracy.

In a first aspect, an action recognition program causes a computer to execute a process of: detecting skeleton information of a person from an image acquired by an imaging device; acquiring region information regarding the position where the person was imaged, a target region, and the position of a target object; estimating, based on the skeleton information of the person and the region information, a motion that the person performs on the object; and recognizing an action of the person based on the distance between the person and the object and on the estimated motion.

According to one embodiment, the behavior of a target person can be recognized at low cost and with high accuracy.

FIG. 1 is a diagram illustrating an information processing apparatus according to a first embodiment.
FIG. 2 is a functional block diagram showing the functional configuration of the information processing apparatus according to the first embodiment.
FIG. 3 is a diagram showing an example of information stored in the target location information DB.
FIG. 4 is a diagram illustrating person detection and skeleton detection.
FIG. 5 is a diagram illustrating an example of skeleton information.
FIG. 6 is a diagram illustrating motion estimation of a person.
FIG. 7 is a diagram illustrating estimation of the reaching range.
FIG. 8 is a diagram illustrating recognition of a person's reaching action and position calculation.
FIG. 9 is a flowchart showing the overall flow of action recognition processing according to the first embodiment.
FIG. 10 is a flowchart showing the flow of person detection processing according to the first embodiment.
FIG. 11 is a flowchart showing the flow of skeleton detection processing according to the first embodiment.
FIG. 12 is a flowchart showing the flow of motion estimation processing according to the first embodiment.
FIG. 13 is a flowchart showing the flow of range estimation processing according to the first embodiment.
FIG. 14 is a flowchart showing the flow of reaching action recognition processing according to the first embodiment.
FIG. 15 is a flowchart showing the flow of reaching position calculation processing according to the first embodiment.
FIG. 16 is a flowchart showing the overall flow of action recognition processing according to a second embodiment.
FIG. 17 is a diagram illustrating an example screen of output results.
FIG. 18 is a diagram illustrating motion estimation processing using motion transitions.
FIG. 19 is a diagram illustrating an example hardware configuration.

Embodiments of the action recognition program, action recognition method, and information processing apparatus disclosed in the present application are described in detail below with reference to the drawings. The invention is not limited by these embodiments. The embodiments may be combined as appropriate to the extent that no contradiction arises.

[Overall configuration]
FIG. 1 is a diagram illustrating the information processing apparatus 10 according to the first embodiment. As shown in FIG. 1, in this system an imaging device such as a camera, installed in a retail store or the like that sells clothing, food, and other goods to consumers, is connected to the information processing apparatus 10 by wire or wirelessly, and the information processing apparatus 10 is used to grasp the purchasing situation in the store. The information processing apparatus 10 is an example of a computer device that acquires and analyzes video data of the store interior, and generates and outputs information that can be used, for example, to identify the consumer segment that purchases a given product or the consumer segment that picks up a given product.

Specifically, the information processing apparatus 10 identifies skeleton information of a person from image data or the like acquired by the imaging device. The information processing apparatus 10 then acquires region information regarding the position where the person was imaged, the target region, and the position of the target object, and estimates the motion that the person performs on the object based on the person's skeleton information and the region information. After that, the information processing apparatus 10 recognizes the person's action based on the distance between the person and the object and on the estimated motion.

For example, the information processing apparatus 10 recognizes skeleton information of a person from image data acquired from a camera. When the skeleton information satisfies the condition "the motions include looking (or having looked) at the shelf, facing the shelf, standing or crouching, reaching out a hand, and so on, and the distance between the shelf and the person is within arm's reach", the apparatus recognizes that the person in the image "is performing the action of reaching for a product on the shelf". When it has recognized that action, the information processing apparatus 10 identifies the position to which the hand is extended from the skeleton information.

In this way, the information processing apparatus 10 can recognize the behavior of the target person with high accuracy using a simple, low-cost technique based on camera video. As a result, the information processing apparatus 10 can accurately analyze a target person's reaching motions and perform work analysis of tasks such as shelf-stocking work.

[Functional configuration]
FIG. 2 is a functional block diagram showing the functional configuration of the information processing apparatus 10 according to the first embodiment. As shown in FIG. 2, the information processing apparatus 10 has a communication unit 11, an output unit 12, a storage unit 13, and a control unit 20.

The communication unit 11 is a processing unit that controls communication with other devices and is implemented by, for example, a communication interface. For example, the communication unit 11 receives data such as video, still images, or moving images from an imaging device such as a camera.

The output unit 12 is a processing unit that displays various types of information and is implemented by, for example, a display or a touch panel. For example, the output unit 12 outputs the action recognition results produced by the control unit 20, which is described later.

The storage unit 13 is a processing unit that stores various data, programs executed by the control unit 20, and the like, and is implemented by, for example, a memory or a hard disk. The storage unit 13 stores a machine learning model DB 14 and a target location information DB 15.

The machine learning model DB 14 is a database that stores trained machine learning models. For example, the machine learning model DB 14 stores a first machine learning model that, given image data such as a frame of video data as input, identifies a person appearing in the image data and outputs region information of that person. The machine learning model DB 14 also stores a second machine learning model that, given the person's region information and the image data as input, outputs skeleton information of the person.

The target location information DB 15 is a database that stores target location information regarding the area imaged by the imaging device. Specifically, for each installed imaging device, the target location information DB 15 stores target location information indicating the regions and positions of the product shelves and products within that device's imaging area.

FIG. 3 is a diagram showing an example of information stored in the target location information DB 15. As shown in FIG. 3, the target location information DB 15 stores a shelf ROI 30, which is the ROI (Region of Interest) of a product shelf, and an aisle ROI 40, which is the ROI of an aisle. In addition to these, the target location information DB 15 stores the coordinates of the product shelf, the coordinates of each product placed on the product shelf (shelf ROI 30), the coordinates of the aisle, and the like.

The target location information may be information in which the positions of the left, right, top, and bottom edges of the shelf region are recorded in advance. Alternatively, shelf regions may be learned in advance and the recognized shelf region (semantic segmentation) may be used as the target location information. As a further alternative, the products on the shelf may be learned in advance as object models, and the group of regions found by object detection using video analysis technology may be used.

The control unit 20 is a processing unit that controls the entire information processing apparatus 10 and is implemented by, for example, a processor. The control unit 20 has a video acquisition unit 21, a person detection unit 22, a skeleton detection unit 23, a motion estimation unit 24, a range estimation unit 25, a reaching action recognition unit 26, and a reaching position calculation unit 27. These units are implemented by electronic circuits of the processor, processes executed by the processor, or the like.

The video acquisition unit 21 is a processing unit that acquires video data. For example, the video acquisition unit 21 acquires video data of the target area from a camera connected via USB (Universal Serial Bus), LAN (Local Area Network), a wireless link, or the like, and outputs it to the person detection unit 22. In addition to real-time video data from a camera, the video acquisition unit 21 can also acquire previously recorded video data.

The person detection unit 22 is a processing unit that detects a person appearing in the video data. Specifically, the person detection unit 22 executes person detection on each frame of the video data acquired by the video acquisition unit 21 and outputs information about the detected person to the skeleton detection unit 23. For example, the person detection unit 22 performs person detection on the acquired video using video analysis technology such as deep learning. A parameter such as a threshold may be used to judge how person-like a detection is, and, depending on the result, processing may move on to acquiring the next frame.

The skeleton detection unit 23 is a processing unit that detects skeleton information of a person from an image acquired by the imaging device. For example, the skeleton detection unit 23 performs skeleton detection on the person region detected by the person detection unit 22, likewise using video analysis technology. A parameter such as a threshold may be used to judge how skeleton-like a detection is, and, depending on the result, processing may move on to acquiring the next frame.

Person detection and skeleton detection are now described concretely. FIG. 4 is a diagram illustrating person detection and skeleton detection. As shown in FIG. 4, the person detection unit 22 inputs one frame of the video data to the first machine learning model and acquires position information (region information) containing the region of the detected person. The person detection unit 22 may also be configured to adopt the output position information when the score (probability) of the output of the first machine learning model is equal to or greater than a threshold.

Next, the skeleton detection unit 23 inputs the image data in which the person was detected, together with the person's position information obtained from the first machine learning model, to the second machine learning model and acquires the skeleton information of the detected person. The skeleton detection unit 23 may also be configured to adopt the output skeleton information when the score (probability) of the output of the second machine learning model is equal to or greater than a threshold.
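The two-stage flow described above can be sketched as follows. This is a minimal illustration only: `PersonDetection`, `SkeletonDetection`, `detect_skeletons`, and the default thresholds of 0.5 are hypothetical names and values, and the two models are passed in as plain callables because the description does not specify a model interface.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed layout
Joint = Tuple[float, float, float]       # (x, y, z)


@dataclass
class PersonDetection:
    box: Box      # region information of the detected person
    score: float  # score (probability) output by the first model


@dataclass
class SkeletonDetection:
    joints: List[Joint]  # joint coordinates output by the second model
    score: float         # score (probability) output by the second model


def detect_skeletons(
    frame: object,
    person_model: Callable[[object], List[PersonDetection]],
    skeleton_model: Callable[[object, Box], SkeletonDetection],
    person_threshold: float = 0.5,    # assumed value
    skeleton_threshold: float = 0.5,  # assumed value
) -> List[SkeletonDetection]:
    """Run person detection, then skeleton detection, keeping only confident results."""
    accepted: List[SkeletonDetection] = []
    for person in person_model(frame):
        if person.score < person_threshold:
            continue  # not person-like enough; skip this detection
        skeleton = skeleton_model(frame, person.box)
        if skeleton.score >= skeleton_threshold:
            accepted.append(skeleton)
    return accepted
```

Passing the models as callables keeps the sketch independent of any particular detector; an empty result corresponds to moving on to the next frame.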

FIG. 5 is a diagram illustrating an example of skeleton information. FIG. 5 shows the skeleton definition information acquired as skeleton information. As shown in FIG. 5, the skeleton information holds 18 items of definition information (numbered 0 to 17), one for each numbered joint identified by a known skeleton model. For example, the right shoulder joint (SHOULDER_RIGHT) is assigned number 7, the left elbow joint (ELBOW_LEFT) number 5, the left knee joint (KNEE_LEFT) number 11, and the right hip joint (HIP_RIGHT) number 15.

Accordingly, the skeleton detection unit 23 acquires the coordinate information of the 18 joints shown in FIG. 5 from the second machine learning model. For example, the skeleton detection unit 23 acquires "X coordinate = X7, Y coordinate = Y7, Z coordinate = Z7" as the position of the right shoulder joint, number 7. The Z axis can be defined, for example, as the distance direction from the imaging device toward the subject, the Y axis as the height direction perpendicular to the Z axis, and the X axis as the horizontal direction.
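One possible in-memory representation of this skeleton information is sketched below. Only the five joint numbers that the description names explicitly are listed; the remaining indices of the 18-joint model are deliberately left out rather than guessed. `JOINT_INDEX` and `joint_position` are hypothetical names.

```python
from typing import Dict, List, Tuple

Joint = Tuple[float, float, float]  # (x, y, z) coordinates of one joint

NUM_JOINTS = 18  # joints numbered 0 to 17

# Joint numbers stated in the description; the other indices are not given there.
JOINT_INDEX: Dict[str, int] = {
    "HEAD": 3,
    "ELBOW_LEFT": 5,
    "SHOULDER_RIGHT": 7,
    "KNEE_LEFT": 11,
    "HIP_RIGHT": 15,
}


def joint_position(skeleton: List[Joint], name: str) -> Joint:
    """Look up the (x, y, z) coordinates of a named joint in an 18-joint skeleton."""
    if len(skeleton) != NUM_JOINTS:
        raise ValueError(f"expected {NUM_JOINTS} joints, got {len(skeleton)}")
    return skeleton[JOINT_INDEX[name]]
```

With this layout, `joint_position(skeleton, "SHOULDER_RIGHT")` returns the triple written as (X7, Y7, Z7) in the text.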

Returning to FIG. 2, the motion estimation unit 24 is a processing unit that acquires region information regarding the region of the object at the position where the person was imaged and estimates, based on the region information and the person's skeleton information, the motion that the person performs on the object. Specifically, the motion estimation unit 24 estimates the motion according to whether the person detected by the person detection unit 22 matches any of the following conditions: reaching a hand toward the shelf, looking at the shelf, or facing the shelf (standing squarely in front of it). It outputs the estimation result to the range estimation unit 25 and the reaching action recognition unit 26.

FIG. 6 is a diagram illustrating motion estimation of a person. In FIG. 6, the person is depicted by connecting the joints included in the skeleton information with lines. As shown in FIG. 6, the motion estimation unit 24 uses the shelf position information stored in the target location information DB 15 and the skeleton information detected by the skeleton detection unit 23 to determine whether the detected person's motion matches a motion learned in advance.

For example, when the person is located within a predetermined range of the shelf and the detected skeleton information is similar to skeleton information prepared in advance that represents motion A, the motion estimation unit 24 takes motion A as the estimation result. Likewise, when the person is located within a predetermined range of the shelf and the transition of the skeleton information detected so far is similar to the transition of skeleton information prepared in advance for motion B, the motion estimation unit 24 takes motion B as the estimation result. Similarity can be judged by a known method, such as whether the sum of the differences between corresponding coordinates is below a threshold.
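The single-frame comparison can be sketched as below, using the sum of absolute coordinate differences mentioned in the text as the similarity measure. The function names, the `templates` dictionary, and the `near_shelf` flag are illustrative assumptions; how the reference skeletons are prepared is not specified in the description.

```python
from typing import Dict, List, Optional, Tuple

Joint = Tuple[float, float, float]


def pose_distance(a: List[Joint], b: List[Joint]) -> float:
    """Sum of absolute coordinate differences between two skeletons."""
    if len(a) != len(b):
        raise ValueError("skeletons must have the same number of joints")
    return sum(abs(p - q) for ja, jb in zip(a, b) for p, q in zip(ja, jb))


def estimate_motion(
    skeleton: List[Joint],
    templates: Dict[str, List[Joint]],  # hypothetical: motion name -> reference skeleton
    threshold: float,
    near_shelf: bool,
) -> Optional[str]:
    """Return the closest template motion whose distance is below the threshold."""
    if not near_shelf:
        return None  # the person must be within the predetermined range of the shelf
    best_name: Optional[str] = None
    best_distance = threshold
    for name, template in templates.items():
        distance = pose_distance(skeleton, template)
        if distance < best_distance:
            best_name, best_distance = name, distance
    return best_name
```

Comparing transitions over several frames, as described for motion B, could reuse `pose_distance` frame by frame and sum the results, although that extension is not shown here.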

When judging motions such as looking at the shelf or facing the shelf, the motion estimation unit 24 uses, for example, whether a normal (perpendicular) vector of the face, shoulders, torso, or the like intersects a preset region of object information such as a shelf or a product.
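One way to test such an intersection is a ray versus axis-aligned rectangle check (the slab method), sketched here in two dimensions. It assumes the facing direction has already been derived from the skeleton and that the shelf region is an axis-aligned rectangle; both are simplifications, and `ray_hits_rect` is a hypothetical helper.

```python
import math
from typing import Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def ray_hits_rect(origin: Point, direction: Point, rect: Rect) -> bool:
    """Return True if the ray origin + t * direction (t >= 0) intersects the rectangle."""
    x_min, y_min, x_max, y_max = rect
    t_near, t_far = 0.0, math.inf
    axes = (
        (origin[0], direction[0], x_min, x_max),
        (origin[1], direction[1], y_min, y_max),
    )
    for o, d, lo, hi in axes:
        if d == 0.0:
            # Ray is parallel to this axis: it must already lie within the slab.
            if o < lo or o > hi:
                return False
            continue
        t1, t2 = (lo - o) / d, (hi - o) / d
        if t1 > t2:
            t1, t2 = t2, t1
        t_near = max(t_near, t1)
        t_far = min(t_far, t2)
        if t_near > t_far:
            return False
    return True
```

For instance, a person at the origin facing in the +x direction hits a shelf region spanning x from 2 to 3, while the same person facing in the -x direction does not.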

For example, as a condition for the reaching motion, the motion estimation unit 24 may additionally use the fact that the arm angle is the largest within a series of motions. Specifically, the motion estimation unit 24 can estimate a reaching motion when the arm angle computed using joints number 5 and number 8 of the skeleton information is equal to or greater than a threshold. In addition to a motion in which the hand is extended and then moved left, right, up, or down, the motion estimation unit 24 can also estimate a reaching motion when the hand moves along a trajectory in a single direction.
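A sketch of an arm-angle check follows. It assumes the angle is measured at the elbow between the shoulder and wrist directions, so that a fully extended arm gives 180 degrees; the description does not state exactly which three points define the angle, and the 150-degree threshold is an arbitrary illustrative value.

```python
import math
from typing import Sequence


def joint_angle(a: Sequence[float], b: Sequence[float], c: Sequence[float]) -> float:
    """Return the angle at point b, in degrees, formed by the points a-b-c."""
    v1 = [p - q for p, q in zip(a, b)]
    v2 = [p - q for p, q in zip(c, b)]
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    if n1 == 0.0 or n2 == 0.0:
        raise ValueError("coincident joints; the angle is undefined")
    cosine = sum(x * y for x, y in zip(v1, v2)) / (n1 * n2)
    cosine = max(-1.0, min(1.0, cosine))  # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(cosine))


def is_arm_extended(
    shoulder: Sequence[float],
    elbow: Sequence[float],
    wrist: Sequence[float],
    threshold_deg: float = 150.0,  # assumed threshold
) -> bool:
    """Treat the arm as extended when the elbow angle reaches the threshold."""
    return joint_angle(shoulder, elbow, wrist) >= threshold_deg
```

Tracking `joint_angle` over consecutive frames and taking its maximum would correspond to the "largest arm angle within a series of motions" condition.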

As the motion of looking at a place, in addition to looking at the pre-specified region or an object region such as a shelf described above, the motion estimation unit 24 may treat looking at the detected hand as a deemed motion of looking at the place. For example, by monitoring the transition of joint number 3 (HEAD) of the skeleton information over several frames, the motion estimation unit 24 estimates a motion of looking at the place (shelf) when it detects a motion of looking at the hand after a reaching motion. The motion estimation unit 24 also estimates a motion of looking at the place (shelf) when it detects a reaching motion after detecting the face moving several times.

As the motion of facing the shelf, the motion estimation unit 24 may use as a condition, in addition to standing, crouching, or sitting, that the body faces the region or object squarely or at an angle. For example, the motion estimation unit 24 estimates a motion of facing the shelf when it detects standing, crouching, or sitting while the person is located within a predetermined range of the shelf and the body faces the object (shelf). Each of the motions described above can be identified by monitoring the skeleton information and its transitions. The motion estimation unit 24 may also estimate a plurality of motions.

Returning to FIG. 2, the range estimation unit 25 is a processing unit that estimates, as the distance between the person and the object (shelf), the range the person's hand can reach from the person's skeleton information and the shelf's region information. For example, using the estimation result from the motion estimation unit 24 and the preset shelf position information (object region), the range estimation unit 25 estimates as the reachable range a fixed distance (for example, equivalent to the shoulder width) from the perpendicular on the shelf nearest to the position where the person is standing, which is identified by, for example, the midpoint of both feet. Alternatively, the range estimation unit 25 may estimate the reachable range using the arm length estimated from skeleton information detected by the skeleton detection unit 23, such as leg length, body height, and shoulder width, together with the hand position estimated from biomechanical constraints such as the arm not bending by 180 degrees or more.

FIG. 7 is a diagram illustrating estimation of the reaching range. As shown in FIG. 7, the range estimation unit 25 first detects the midpoint of both feet, identified from the skeleton information detected by the skeleton detection unit 23, as the standing position (S1). Next, for the shelf (position information) stored in the target location information DB 15, the range estimation unit 25 calculates the shortest distance from the midpoint of both feet, for example as the perpendicular to the bottom edge of the shelf ROI 30 (S2). It then calculates the vertical line of the shelf (position information) in the height direction at that shortest-distance point (S3). After that, using the height from the feet to the shoulders (S4) and the shoulder width (S5) in the skeleton information detected by the skeleton detection unit 23, the range estimation unit 25 estimates, for example, 90% to 110% of that height and ±100% of the shoulder width as the reaching range (the range within reach of the hand) (S6). The range estimation unit 25 then outputs the estimation result to the reaching action recognition unit 26. The numerical values given here are only examples and can be changed freely.
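Steps S1 to S6 can be sketched as follows under simplifying assumptions: the feet and the bottom edge of the shelf ROI are given as 2D points in a common plane, and the shoulder height and shoulder width are supplied as already-measured scalars. `ReachRange`, `closest_point_on_segment`, and `estimate_reach_range` are hypothetical names; the default ratios follow the example values in the text.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Point = Tuple[float, float]


@dataclass
class ReachRange:
    anchor: Point      # foot of the perpendicular on the shelf's bottom edge (S2)
    distance: float    # shortest distance from the standing position to the shelf
    half_width: float  # horizontal extent on each side of the vertical line (S3)
    height_min: float  # lower bound of the reachable height
    height_max: float  # upper bound of the reachable height


def closest_point_on_segment(p: Point, a: Point, b: Point) -> Point:
    """Project p onto segment a-b, clamping to the segment's endpoints."""
    abx, aby = b[0] - a[0], b[1] - a[1]
    denom = abx * abx + aby * aby
    if denom == 0.0:
        return a
    t = ((p[0] - a[0]) * abx + (p[1] - a[1]) * aby) / denom
    t = max(0.0, min(1.0, t))
    return (a[0] + t * abx, a[1] + t * aby)


def estimate_reach_range(
    left_foot: Point,
    right_foot: Point,
    shelf_edge: Tuple[Point, Point],
    shoulder_height: float,                        # S4: height from feet to shoulders
    shoulder_width: float,                         # S5
    height_ratio: Tuple[float, float] = (0.9, 1.1),  # example values from the text
    width_ratio: float = 1.0,                        # +/-100% of the shoulder width
) -> ReachRange:
    # S1: standing position is the midpoint of both feet.
    center = ((left_foot[0] + right_foot[0]) / 2, (left_foot[1] + right_foot[1]) / 2)
    # S2: nearest point on the shelf's bottom edge; S3's vertical line rises from it.
    anchor = closest_point_on_segment(center, shelf_edge[0], shelf_edge[1])
    distance = math.hypot(center[0] - anchor[0], center[1] - anchor[1])
    # S6: reachable band around the vertical line.
    return ReachRange(
        anchor=anchor,
        distance=distance,
        half_width=shoulder_width * width_ratio,
        height_min=shoulder_height * height_ratio[0],
        height_max=shoulder_height * height_ratio[1],
    )
```

Keeping the ratios as parameters reflects the statement that the numerical values are examples and can be changed.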

The range estimation unit 25 is not limited to a standing person. Even when the motion is judged to be sitting, it can estimate the range reachable while seated by applying S1 to S6 above to the seated posture. The range estimation unit 25 can also estimate the reachable range from the shoulder skeleton information and the perpendicular of the shelf, using the arm length identified from the skeleton information of the shoulder, elbow, and hand.

Returning to FIG. 2, the reaching action recognition unit 26 is a processing unit that recognizes the action of the person reaching a hand toward the object, from the motion estimated by the motion estimation unit 24 and the reachable range of the person estimated by the range estimation unit 25. The reaching action recognition unit 26 then outputs the recognition result to the reaching position calculation unit 27.

Specifically, the reaching action recognition unit 26 recognizes a reaching action by applying preset thresholds to the motion estimated by the motion estimation unit 24 and the reachable range of the person estimated by the range estimation unit 25. Alternatively, these judgment items may be quantified as features, a Mahalanobis distance or the like may be calculated, and the judgment may be made using the result of machine learning.
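As an illustration of the Mahalanobis-distance alternative, the sketch below measures how far a feature vector lies from the mean of reference reaching samples. The choice of features, the mean, and the inverse covariance matrix are all assumed to come from previously collected data; none of them are specified in the description, and the caller is assumed to supply the inverse covariance directly.

```python
import math
from typing import Sequence


def mahalanobis(
    x: Sequence[float],
    mean: Sequence[float],
    inv_cov: Sequence[Sequence[float]],  # inverse covariance matrix, supplied by the caller
) -> float:
    """Return sqrt((x - mean)^T * inv_cov * (x - mean))."""
    diff = [xi - mi for xi, mi in zip(x, mean)]
    projected = [sum(row[j] * diff[j] for j in range(len(diff))) for row in inv_cov]
    return math.sqrt(sum(d * p for d, p in zip(diff, projected)))


def is_reaching_by_distance(
    features: Sequence[float],
    mean: Sequence[float],
    inv_cov: Sequence[Sequence[float]],
    threshold: float,  # assumed to be tuned on reference data
) -> bool:
    """Judge a reaching action when the features are close to the reference distribution."""
    return mahalanobis(features, mean, inv_cov) <= threshold
```

With an identity inverse covariance the measure reduces to the ordinary Euclidean distance, which is a convenient sanity check.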

The hand-reaching position calculation unit 27 is a processing unit that calculates position information of the extended hand, using the skeleton information, the shelf ROI 30, and the like, in an image (video frame) that the hand-reaching action recognition unit 26 has recognized as showing a reaching action.

The hand-reaching position calculation unit 27 can also calculate the hand position information at the time the reaching action was recognized and treat it as an access to the corresponding product. For example, the hand-reaching position calculation unit 27 compares the position information of the products in the shelf ROI 30 with the position information of the extended hand to identify the product that the person picked up or tried to pick up. In this way, when a person is identified in the frames of the video data, the hand-reaching position calculation unit 27 identifies and tallies the products that the person reached for.

Action recognition and position calculation are now described in detail. FIG. 8 is a diagram illustrating recognition of a person's reaching action and position calculation. As shown in FIG. 8, the hand-reaching action recognition unit 26 determines whether a reaching action has occurred from a score (for example, a match rate) between predetermined conditions and the estimation results for the reachable range and for the person's motion.

For example, the hand-reaching action recognition unit 26 recognizes a reaching action when at least a predetermined number of conditions are met, or when at least a predetermined proportion of all conditions are met. The conditions can be generated and set in advance using experimental data or the like, and can be defined and evaluated separately as conditions on the reachable-range estimation result and conditions on the motion estimation result.
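The count-based and ratio-based rules can be sketched as follows. The conditions in `CONDITIONS` and the observation keys are hypothetical examples, not conditions taken from the patent.

```python
from typing import Callable, Mapping, Optional, Sequence

Condition = Callable[[Mapping[str, object]], bool]


def count_matches(obs: Mapping[str, object], conditions: Sequence[Condition]) -> int:
    """Number of conditions satisfied by the observation."""
    return sum(1 for cond in conditions if cond(obs))


def recognize_reaching(obs: Mapping[str, object], conditions: Sequence[Condition],
                       min_count: Optional[int] = None,
                       min_ratio: Optional[float] = None) -> bool:
    """Recognize a reaching action if enough conditions match,
    either by absolute count or by proportion of all conditions."""
    matched = count_matches(obs, conditions)
    if min_count is not None and matched >= min_count:
        return True
    if min_ratio is not None and conditions and matched / len(conditions) >= min_ratio:
        return True
    return False


# Hypothetical conditions on the motion and reachable-range estimates.
CONDITIONS = [
    lambda o: o["motion"] == "reach",
    lambda o: bool(o["facing_shelf"]),
    lambda o: bool(o["shelf_in_reach_range"]),
    lambda o: o["distance_to_shelf"] < 1.0,
]
```

Keeping each condition as a separate predicate makes it straightforward to split them into a range group and a motion group and score the groups independently, as the text allows.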

After that, the hand-reaching position calculation unit 27 identifies the image data in which a reaching action was recognized, and calculates the hand position information using the skeleton information of the person identified from that image data, the shelf ROI 30, the product position information, and the like. For example, using the shelf coordinates, the product coordinates, and the like, the hand-reaching position calculation unit 27 calculates the coordinates of the shelf or product that overlaps the hand in the image data as the hand position information. The hand-reaching position calculation unit 27 can also obtain the position information of the person's hand by inputting the shelf coordinates, the product coordinates, the recognition result of the reaching action, and the person's skeleton information into a trained machine learning model.

The hand-reaching position calculation unit 27 then identifies the product accessed by the person, using the product position information and the position information of the person's hand. For example, the hand-reaching position calculation unit 27 identifies as the accessed product a product that coincides with the hand coordinates, a product within a predetermined range of the hand coordinates, or a product on the line extended from the hand coordinates toward the shelf.
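Two of the three matching rules (coincidence with the hand coordinates, and proximity within a predetermined range) can be sketched as below. The sketch assumes that products are axis-aligned bounding boxes in image coordinates. The `Product` class, the `accessed_product` function, and the default `max_distance` are hypothetical, and the extension-line rule is not covered.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Product:
    name: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def contains(self, p) -> bool:
        return self.x_min <= p[0] <= self.x_max and self.y_min <= p[1] <= self.y_max

    def distance_to(self, p) -> float:
        """Distance from point p to the box (0 if p is inside)."""
        dx = max(self.x_min - p[0], 0.0, p[0] - self.x_max)
        dy = max(self.y_min - p[1], 0.0, p[1] - self.y_max)
        return math.hypot(dx, dy)


def accessed_product(hand, products: Sequence[Product],
                     max_distance: float = 20.0) -> Optional[Product]:
    # Rule 1: the hand coordinates fall on a product.
    for product in products:
        if product.contains(hand):
            return product
    # Rule 2: otherwise take the nearest product within the allowed range.
    in_range = [(p.distance_to(hand), p) for p in products]
    in_range = [item for item in in_range if item[0] <= max_distance]
    if not in_range:
        return None
    return min(in_range, key=lambda item: item[0])[1]
```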

[Processing flow]
Next, the flow of the action recognition processing described above is explained. The overall flow is described first, followed by the flow of processing in each processing unit.

(Overall processing flow)
FIG. 9 is a flowchart illustrating the overall flow of the action recognition processing according to the first embodiment. As shown in FIG. 9, when processing starts (S101: Yes), the information processing apparatus 10 acquires video data from the imaging device (S102).

Next, the information processing apparatus 10 performs person detection on each frame (image data) in the video data (S103). If no person is detected (S104: No), it acquires the next video data. If a person is detected (S104: Yes), the information processing apparatus 10 detects the skeleton of the detected person (S105).

The information processing apparatus 10 then estimates the motion of the detected person (S106) and estimates the person's hand-reaching range (S107). If the estimated hand-reaching range shows that the product or shelf is not within reach (S108: No), the information processing apparatus 10 acquires the next image. If the product or shelf is within reach (S108: Yes), the information processing apparatus 10 recognizes the person's action (S109) and calculates the position of the person's hand (S110).
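The control flow of S102 to S110 can be summarized as a per-frame loop with two early exits. In this sketch every stage is passed in as a callable so the loop stays self-contained. The function and parameter names are hypothetical and stand in for the processing units described in the text.

```python
def process_frames(frames, detect_person, detect_skeleton, estimate_motion,
                   estimate_reach, shelf_in_reach, recognize_action, locate_hand):
    """Run the overall flow on an iterable of frames and collect
    (action, hand position) pairs for frames that pass both checks."""
    results = []
    for frame in frames:                                # S102
        person = detect_person(frame)                   # S103
        if person is None:                              # S104: No
            continue
        skeleton = detect_skeleton(frame, person)       # S105
        motion = estimate_motion(skeleton)              # S106
        reach = estimate_reach(skeleton)                # S107
        if not shelf_in_reach(reach):                   # S108: No
            continue
        action = recognize_action(motion, reach)        # S109
        hand = locate_hand(skeleton, action)            # S110
        results.append((action, hand))
    return results
```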

(Flow of person detection processing)
FIG. 10 is a flowchart illustrating the flow of the person detection processing according to the first embodiment. As shown in FIG. 10, the person detection unit 22 acquires video data (S201) and performs person detection by applying the first machine learning model to each frame (image data) in the video data (S202).

If the score (probability) output by the first machine learning model is below the threshold (S203: No), the person detection unit 22 determines that the prediction accuracy of the first machine learning model is low and acquires the next video data. If the score (probability) is equal to or greater than the threshold (S203: Yes), the person detection unit 22 determines that the prediction accuracy of the first machine learning model is high and outputs the position information of the detected person to the storage unit 13 or the like (S204).
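The threshold check in S203 amounts to filtering model outputs by score. The sketch below assumes that the model returns `(bounding_box, score)` pairs. That output format, the function name, and the default threshold of 0.5 are assumptions for illustration.

```python
def accept_detections(detections, threshold=0.5):
    """Keep the bounding boxes whose score is at or above the threshold (S203: Yes)."""
    return [bbox for bbox, score in detections if score >= threshold]
```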

(Flow of skeleton detection processing)
FIG. 11 is a flowchart illustrating the flow of the skeleton detection processing according to the first embodiment. As shown in FIG. 11, the skeleton detection unit 23 acquires video data (S301) and acquires the position information of the person detected by the person detection unit 22 (S302).

The skeleton detection unit 23 then inputs the image data in which the person was detected and the person's position information into the second machine learning model, acquires the skeleton information of the detected person (S303), and outputs it to the storage unit 13 or the like (S304). As in FIG. 10, a score-based determination may also be performed here.

(Flow of motion estimation processing)
FIG. 12 is a flowchart illustrating the flow of the motion estimation processing according to the first embodiment. As shown in FIG. 12, the motion estimation unit 24 acquires, from the target location information DB 15, information on the target location (area information), including information on shelves and the like in the imaging area (S401). The motion estimation unit 24 also acquires the person's skeleton information from the skeleton detection unit 23 (S402).

Next, the motion estimation unit 24 estimates the motion that the person performs on the object, based on the person's skeleton information and the target location information. Specifically, the motion estimation unit 24 estimates a reaching motion (S403), a motion of looking at the location (S404), and a motion of facing the location (S405). The motion estimation unit 24 outputs information about the estimated motions to the storage unit 13 or the like.

(Flow of range estimation processing)
FIG. 13 is a flowchart illustrating the flow of the range estimation processing according to the first embodiment. As shown in FIG. 13, the range estimation unit 25 acquires, from the target location information DB 15, information on the target location (area information), including information on shelves and the like in the imaging area (S501). The range estimation unit 25 also acquires the person's skeleton information from the skeleton detection unit 23 (S502).

Next, the range estimation unit 25 estimates the shortest distance from the person to the target location, based on the person's skeleton information and the information on the target location (shelf) (S503). The range estimation unit 25 then estimates the reachable range based on the estimated shortest distance and the person's skeleton information (S504). The range estimation unit 25 outputs information about the estimated reachable range to the storage unit 13 or the like.

(Flow of reaching action recognition processing)
FIG. 14 is a flowchart illustrating the flow of the reaching action recognition processing according to the first embodiment. As shown in FIG. 14, the hand-reaching action recognition unit 26 acquires the motion estimation information produced by the motion estimation unit 24 (S601) and acquires the reachable-range estimation result produced by the range estimation unit 25 (S602).

Next, the hand-reaching action recognition unit 26 uses the acquired information to calculate a score for the action of the person reaching out to the object (shelf) (S603). If the score is below the threshold (S603: No), the hand-reaching action recognition unit 26 returns to S601 and repeats the subsequent processing. If the score is equal to or greater than the threshold (S603: Yes), the hand-reaching action recognition unit 26 recognizes a reaching action (S604). The hand-reaching action recognition unit 26 outputs the recognition result of the reaching action to the storage unit 13 or the like.

(Flow of hand-reaching position calculation processing)
FIG. 15 is a flowchart illustrating the flow of the hand-reaching position calculation processing according to the first embodiment. As shown in FIG. 15, the hand-reaching position calculation unit 27 acquires the recognition result of the reaching action from the hand-reaching action recognition unit 26 (S701), acquires the person's skeleton information (S702), and calculates the position information of the extended hand using the recognition result, the skeleton information, and the like (S703). The hand-reaching position calculation unit 27 outputs the calculated hand position information to the storage unit 13 or the like.

[Effects]
As described above, the information processing apparatus 10 can calculate, with high accuracy and on the basis of human body characteristics, the position to which a hand was extended, using the skeleton information acquired from the video data and the predefined shelf position information. As a result, the information processing apparatus 10 can generate and output useful information for work state analysis, purchase analysis, and the like. In addition, further analysis of the time from arriving in front of the shelf to reaching out, the reaching speed, the trajectory, the posture, and the number of reaches makes it possible to gauge the psychological state of workers and shoppers.

Because the information processing apparatus 10 uses machine learning models for person detection and skeleton detection, detection accuracy can be improved by periodically retraining the models. The information processing apparatus 10 also prepares skeleton information for motions assumed in advance and estimates a motion by comparing the detected skeleton information against it. The estimation therefore does not depend on the quality of the image data and requires no advanced analysis of the image data, which speeds up the detection processing.

A recognition technique that relies only on sensors uses only the positional relationship obtained by the sensors, so it judges that a hand has reached out whenever the hand position or body orientation overlaps a product. Consequently, when the object is a large shelf such as the product shelf described in the first embodiment, false detections are frequent: even a location that is out of reach is falsely detected if it overlaps in the image. In contrast, the information processing apparatus 10 dynamically changes the reachable range according to the person's position (skeleton information) and the shelf position, so that even when the hand and a product (or the entire shelf) overlap, nothing is detected if the product is outside the reachable range. This improves detection accuracy.

The information processing apparatus 10 can also estimate attributes of the person, which allows it to generate and output even more useful information for work state analysis, purchase analysis, and the like.

FIG. 16 is a flowchart illustrating the overall flow of the action recognition processing according to the second embodiment. As shown in FIG. 16, when processing starts (S801: Yes), the information processing apparatus 10 acquires video data from the imaging device (S802).

Next, the information processing apparatus 10 performs person detection on each frame (image data) in the video data (S803). If no person is detected (S804: No), it acquires the next video data. If a person is detected (S804: Yes), the information processing apparatus 10 detects the skeleton of the detected person (S805).

The information processing apparatus 10 then estimates the motion of the detected person (S806) and estimates the person's hand-reaching range (S807). If the estimated hand-reaching range shows that the product or shelf is not within reach (S808: No), the information processing apparatus 10 acquires the next image. If the product or shelf is within reach (S808: Yes), the information processing apparatus 10 recognizes the person's action (S809).

In parallel with these steps, the information processing apparatus 10 estimates the attributes of the person (S810). For example, an attribute estimation unit (not shown) of the information processing apparatus 10 performs attribute estimation by inputting the skeleton information of the detected person into a machine learning model generated by deep learning or the like. Besides gender and age group, the estimated attributes may include whether the person is a store clerk or a suspicious person. A parameter such as a threshold may also be used to judge the likelihood of an attribute before moving on to acquisition of the next video frame.

After that, the information processing apparatus 10 calculates the hand position information at the time the reaching action was recognized, attaches the estimated attribute information, and outputs which attribute of person accessed the corresponding product (S811).

In this way, by additionally estimating person attributes from video data covering a predetermined period such as one day or one week, the information processing apparatus 10 can tally which kinds of people pick up which products in which time slots and output the result to the user. FIG. 17 is a diagram illustrating an example of the output screen. As shown in FIG. 17, the information processing apparatus 10 can generate and output a conversion analysis result 50 that includes, in addition to counts of the people who passed in front of the shelf and the people who lingered in front of it in each time slot of the store, various information generated using the attribute information. The conversion analysis result 50 includes the acquired video 51, a breakdown 52 by gender and by time slot of people who picked up a product with their left hand, a breakdown 53 by age group and by time slot of people who picked up a product with their left hand, and the like.
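The tallying by attribute, product, and time slot can be sketched with a counter keyed on the hour. The event format `(timestamp, attribute, product)` and the function names are assumptions for illustration; the patent does not specify how the access records are stored.

```python
from collections import Counter
from datetime import datetime


def tally_access(events):
    """Count product accesses per (hour, attribute, product).

    events: iterable of (timestamp: datetime, attribute: str, product: str).
    """
    counts = Counter()
    for timestamp, attribute, product in events:
        counts[(timestamp.hour, attribute, product)] += 1
    return counts


def attribute_share(counts, hour):
    """Share of each attribute among all accesses in the given hour."""
    totals = Counter()
    for (h, attribute, _product), n in counts.items():
        if h == hour:
            totals[attribute] += n
    total = sum(totals.values())
    return {a: n / total for a, n in totals.items()} if total else {}
```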

Embodiments of the present invention have been described above, but the present invention may be implemented in various forms other than the embodiments described above.

[Numerical values, etc.]
The numerical examples, motion examples, screen examples, attribute examples, and the like used in the above embodiments are merely examples and can be changed as desired. A cloud system can also be adopted. For example, results processed on an edge terminal can be uploaded and displayed in a browser via the cloud. Alternatively, camera video can be uploaded and processed in the cloud, and the results displayed in a browser.

The data to be processed for detection is not limited to video data and may be still image data or moving image data. Although an example of recognizing a reaching action and calculating the position of the extended hand has been described, the processing is not limited to this, and various gestures of a person can be detected. For example, the information processing apparatus 10 can detect mischief against a shelf by detecting a motion of extending a leg. The information processing apparatus 10 can also recognize, in a parking lot, an action of activating a car's trunk sensor with a foot, and collect and output information useful for vehicle development.

[Action recognition example]
For example, the information processing apparatus 10 can suppress false detections and improve estimation accuracy by estimating a reaching motion only when a continuous series of motions is estimated from multiple frames (image data) of the video. FIG. 18 is a diagram illustrating motion estimation processing that uses motion transitions. Specifically, as shown in FIG. 18, when the information processing apparatus 10 detects, in succession within a predetermined number of frames, (1) a motion of coming in front of the shelf, (2) a motion of looking at the shelf, standing or crouching in front of the shelf, or putting a hand forward, and (3) a motion of the hand extending, it can estimate this as a specific motion that triggers the transition to recognition of the reaching action.

For example, in a first frame, the information processing apparatus 10 detects a "coming in front of the shelf" motion, in which the person is at least a predetermined distance away from the shelf ROI 30 but is in the aisle ROI 40 in front of the shelf ROI 30. Next, in a second frame acquired within a predetermined number of frames after the first frame, the information processing apparatus 10 detects a "looking at the shelf" motion, in which the person is less than the predetermined distance from the shelf ROI 30. After that, in a third frame acquired within a predetermined number of frames after the second frame, the information processing apparatus 10 detects a "hand extending" motion. In this way, the information processing apparatus 10 can also execute recognition of the reaching action when it detects a continuous series of motions assumed in advance.
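The three-stage transition can be sketched as a small state machine over per-frame motion labels. The labels `"approach"`, `"look"`, and `"reach"`, the function name, and the default gap of 30 frames are hypothetical; the text only requires that each stage follow the previous one within a predetermined number of frames.

```python
def detect_sequence(labels, sequence=("approach", "look", "reach"), max_gap=30):
    """Return the frame index at which the last motion of `sequence` is
    observed, provided each motion follows the previous one within
    `max_gap` frames; return None if the sequence never completes."""
    stage = 0          # index of the next motion we are waiting for
    last_idx = None    # frame index where the previous motion was last seen
    for i, label in enumerate(labels):
        if stage > 0 and i - last_idx > max_gap:
            stage, last_idx = 0, None          # too slow: start over
        if label == sequence[stage]:
            stage += 1
            last_idx = i
            if stage == len(sequence):
                return i
        elif stage > 0 and label == sequence[stage - 1]:
            last_idx = i                       # previous motion still ongoing
    return None
```

Resetting on a timeout rather than on any non-matching label tolerates frames with no estimated motion between stages, which seems closer to the "within a predetermined number of frames" wording than requiring strictly adjacent frames.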

An example in which a continuous series of motions is estimated from multiple frames (image data) of the video has been described here, but the processing is not limited to this. For example, the information processing apparatus 10 can also execute recognition of the reaching action when a plurality of pre-specified motions are estimated in a pre-specified order.

[System]
The processing procedures, control procedures, specific names, and information including various data and parameters shown in the above description and drawings can be changed as desired unless otherwise noted.

The components of each illustrated device are functional and conceptual, and do not necessarily have to be physically configured as illustrated. In other words, the specific form of distribution and integration of the devices is not limited to what is illustrated. All or part of them can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like.

Furthermore, all or any part of the processing functions performed by each device can be implemented by a CPU and a program analyzed and executed by that CPU, or implemented as hardware using wired logic.

[Hardware]
FIG. 19 is a diagram illustrating an example hardware configuration. As shown in FIG. 19, the information processing apparatus 10 includes a communication device 10a, an HDD (Hard Disk Drive) 10b, a memory 10c, and a processor 10d. The units shown in FIG. 2 are interconnected by a bus or the like.

The communication device 10a is a network interface card or the like and communicates with other devices. The HDD 10b stores the programs and DBs that operate the functions shown in FIG. 2.

The processor 10d reads a program that executes the same processing as the processing units shown in FIG. 2 from the HDD 10b or the like and loads it into the memory 10c, thereby running a process that executes the functions described with reference to FIG. 2 and elsewhere. For example, this process executes the same functions as the processing units of the information processing apparatus 10. Specifically, the processor 10d reads from the HDD 10b or the like a program having the same functions as the video acquisition unit 21, the person detection unit 22, the skeleton detection unit 23, the motion estimation unit 24, the range estimation unit 25, the hand-reaching action recognition unit 26, the hand-reaching position calculation unit 27, and the like. The processor 10d then runs a process that executes the same processing as the video acquisition unit 21, the person detection unit 22, the skeleton detection unit 23, the motion estimation unit 24, the range estimation unit 25, the hand-reaching action recognition unit 26, the hand-reaching position calculation unit 27, and the like.

In this way, the information processing apparatus 10 operates as an information processing apparatus that executes the action recognition method by reading and executing the program. The information processing apparatus 10 can also realize the same functions as the embodiments described above by reading the program from a recording medium with a medium reading device and executing the program that was read. The program described in these other embodiments is not limited to execution by the information processing apparatus 10. For example, the present invention can be applied in the same way when another computer or server executes the program, or when they execute the program in cooperation.

This program can be distributed via a network such as the Internet. The program can also be recorded on a computer-readable recording medium such as a hard disk, a flexible disk (FD), a CD-ROM, an MO (Magneto-Optical disk), or a DVD (Digital Versatile Disc), and executed by being read from the recording medium by a computer.

REFERENCE SIGNS LIST
10 information processing apparatus
11 communication unit
12 output unit
13 storage unit
14 machine learning model DB
15 target location information DB
20 control unit
21 video acquisition unit
22 person detection unit
23 skeleton detection unit
24 motion estimation unit
25 range estimation unit
26 hand-reaching action recognition unit
27 hand-reaching position calculation unit

Claims

1. An action recognition program that causes a computer to execute a process comprising:
detecting skeleton information of a person from an image acquired by an imaging device;
acquiring area information about the position where the person is imaged, the target area, and the position of the object;
estimating a motion performed by the person on the object, based on the skeleton information of the person and the area information; and
recognizing an action of the person based on the distance between the person and the object and on the estimated motion.

2. The action recognition program according to claim 1, wherein
the estimating process estimates the action performed by the person on the object according to whether any one of the following conditions is met: a motion of reaching out to the object, a motion of looking at the object, or a motion of facing the object, and
the recognizing process recognizes that the person is reaching out to the object when the distance between the person and the object is less than a threshold and an action that satisfies any one of the conditions is estimated.
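
For illustration only, the recognition rule of claim 2 can be sketched as follows. This is a minimal sketch, not the applicant's implementation: the names `MotionEstimate` and `recognize_reach`, the use of a single scalar distance, and the threshold value in the usage note are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class MotionEstimate:
    """Hypothetical container for the three conditions named in claim 2."""
    reaching: bool = False  # motion of reaching out to the object
    looking: bool = False   # motion of looking at the object
    facing: bool = False    # motion of facing the object


def recognize_reach(distance: float, threshold: float, motion: MotionEstimate) -> bool:
    """Return True when the distance is below the threshold AND at least
    one of the three motion conditions was estimated."""
    any_condition = motion.reaching or motion.looking or motion.facing
    return distance < threshold and any_condition
```

With an assumed threshold of 0.5, `recognize_reach(0.4, 0.5, MotionEstimate(looking=True))` evaluates to `True`, while the same motion at a distance of 0.6 evaluates to `False`.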

3. The action recognition program according to claim 2, wherein
the recognizing process recognizes that the person is reaching out to the object when the distance between the person and the object is less than a threshold and a plurality of pre-specified actions are estimated in a pre-specified order within a predetermined time.
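
For illustration only, the ordered-sequence condition of claim 3 can be sketched as follows. This is a sketch under stated assumptions: the function names, the representation of estimates as time-sorted `(timestamp, label)` pairs, the label strings, and the choice to allow unrelated estimates between the pre-specified ones are hypothetical and are not taken from the specification.

```python
from typing import Sequence, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, estimated action label)


def ordered_within(events: Sequence[Event], pattern: Sequence[str], window: float) -> bool:
    """Return True if the labels in `pattern` occur in that order among
    `events` (sorted by time), all within `window` seconds of the first one."""
    if not pattern:
        return False
    for i, (t0, label) in enumerate(events):
        if label != pattern[0]:
            continue
        matched = 1  # pattern[0] matched at time t0
        for t, lab in events[i + 1:]:
            if t - t0 > window:
                break  # later events fall outside the time window
            if matched < len(pattern) and lab == pattern[matched]:
                matched += 1
        if matched == len(pattern):
            return True
    return False


def recognize_reach_sequence(distance: float, threshold: float,
                             events: Sequence[Event],
                             pattern: Sequence[str], window: float) -> bool:
    """Distance test combined with the ordered-sequence test."""
    return distance < threshold and ordered_within(events, pattern, window)
```

Matching is greedy from each occurrence of the first label, which is sufficient to decide whether the labels appear in order inside the window.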

4. The action recognition program according to any one of claims 1 to 3, further causing the computer to execute a process of estimating, as the distance between the person and the object, a reachable range of the person from the skeleton information of the person and the area information, wherein
the recognizing process recognizes the action of the person reaching out to the object by using the estimated action and the reachable range of the person.

5. The action recognition program according to claim 4, wherein the process of estimating the reachable range of the person includes:
detecting the center point of both feet specified by the skeleton information as the position where the person stands;
calculating the shortest distance from the center point of both feet to the object;
calculating a perpendicular line in the height direction of the object with respect to the shortest distance; and
estimating, as the reachable range of the person, the position on the perpendicular line corresponding to the height from the feet to the shoulders specified by the skeleton information.
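
For illustration only, the geometric steps of claim 5 can be sketched as follows. This is a simplified sketch under assumptions that the claim does not state: keypoints are given as 3D coordinates with z pointing up, the floor under the object is at the same height as the feet, and the side of the object facing the person is modeled as a straight segment on the floor. All function and parameter names are hypothetical.

```python
import math
from typing import Tuple

Vec2 = Tuple[float, float]
Vec3 = Tuple[float, float, float]


def midpoint(a: Vec3, b: Vec3) -> Vec3:
    """Center point of two 3D keypoints."""
    return ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2, (a[2] + b[2]) / 2)


def nearest_on_segment(p: Vec2, a: Vec2, b: Vec2) -> Vec2:
    """Point on floor-plane segment a-b that is closest to p."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    denom = dx * dx + dy * dy
    if denom == 0:
        return a  # degenerate segment
    t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / denom
    t = max(0.0, min(1.0, t))  # clamp to the segment
    return (a[0] + t * dx, a[1] + t * dy)


def reach_point(left_foot: Vec3, right_foot: Vec3,
                left_shoulder: Vec3, right_shoulder: Vec3,
                edge_a: Vec2, edge_b: Vec2) -> Tuple[float, Vec3]:
    """Return (shortest floor distance to the object, estimated reach point)."""
    feet = midpoint(left_foot, right_foot)              # standing position
    shoulders = midpoint(left_shoulder, right_shoulder)
    nx, ny = nearest_on_segment((feet[0], feet[1]), edge_a, edge_b)
    distance = math.hypot(nx - feet[0], ny - feet[1])   # shortest distance
    height = shoulders[2] - feet[2]                     # feet-to-shoulder height
    # Point on the vertical line above the nearest point, at shoulder height.
    return distance, (nx, ny, feet[2] + height)
```

The returned point lies on the vertical line rising from the nearest point of the object, at the person's feet-to-shoulder height, which corresponds to the position estimated in the last step of the claim.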

6. The action recognition program according to claim 5, wherein the process of estimating the reachable range of the person includes:
calculating the shoulder width of the person based on the skeleton information; and
estimating, as the reachable range of the person, the area corresponding to the shoulder width at the position on the perpendicular line corresponding to the height from the feet to the shoulders specified by the skeleton information.

7. The action recognition program according to any one of claims 1 to 6, further causing the computer to execute a process comprising:
calculating the position of the hand based on the skeleton information and the area information when the person is recognized as reaching out to the object; and
specifying the object accessed by the hand based on the positional relationship between the position of the hand and the object.
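
For illustration only, specifying the object accessed by the hand in claim 7 can be sketched as follows. This is a sketch under assumptions not found in the claim: each candidate object is represented by an axis-aligned rectangle in the same coordinate system as the hand position, and the object nearest to the hand within a tolerance `max_distance` is selected. The names `point_box_distance` and `accessed_object` and the object labels in the usage note are hypothetical.

```python
import math
from typing import Dict, Optional, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def point_box_distance(p: Point, box: Box) -> float:
    """Euclidean distance from a point to a rectangle (0 if inside)."""
    x, y = p
    x0, y0, x1, y1 = box
    dx = max(x0 - x, 0.0, x - x1)
    dy = max(y0 - y, 0.0, y - y1)
    return math.hypot(dx, dy)


def accessed_object(hand: Point, objects: Dict[str, Box],
                    max_distance: float = 0.0) -> Optional[str]:
    """Return the name of the object closest to the hand, or None when
    no object lies within `max_distance` of the hand position."""
    best_name: Optional[str] = None
    best_dist = math.inf
    for name, box in objects.items():
        d = point_box_distance(hand, box)
        if d <= max_distance and d < best_dist:
            best_name, best_dist = name, d
    return best_name
```

For example, with `objects = {"shelf_top": (0, 0, 10, 5), "shelf_bottom": (0, 5, 10, 10)}`, a hand at `(3, 2)` is attributed to `"shelf_top"`, and a hand at `(20, 20)` is attributed to no object under the default tolerance.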

8. The action recognition program according to any one of claims 1 to 7, further causing the computer to execute a process of estimating attributes of the person based on the skeleton information, wherein
the recognizing process outputs a result of recognizing that the person reached out toward the object in association with the attributes of the person.

9. An action recognition method in which a computer executes a process comprising:
detecting skeleton information of a person from an image acquired by an imaging device;
acquiring area information about the position where the person is imaged, a target area, and the position of an object;
estimating an action performed by the person on the object based on the skeleton information of the person and the area information; and
recognizing an action of the person based on the distance between the person and the object and the estimated action.

10. An information processing device comprising:
a detection unit that detects skeleton information of a person from an image acquired by an imaging device;
an acquisition unit that acquires area information about the position where the person is imaged, a target area, and the position of an object;
an estimation unit that estimates an action performed by the person on the object based on the skeleton information of the person and the area information; and
a recognition unit that recognizes an action of the person based on the distance between the person and the object and the estimated action.