JP7031048B1

JP7031048B1 - Image processing methods, computer programs and image processing equipment

Info

Publication number: JP7031048B1
Application number: JP2021120339A
Authority: JP
Inventors: アブドゥルラーマンアブドゥルガニ; 勝永安
Original assignee: Exa Wizards Inc
Current assignee: Exa Wizards Inc
Priority date: 2021-07-21
Filing date: 2021-07-21
Publication date: 2022-03-07
Anticipated expiration: 2041-07-21
Also published as: JP2023016188A

Abstract

[Problem] To provide an image processing method, a computer program, and an image processing apparatus that can be expected to perform image processing using a bounding box on a distorted image. [Solution] An image processing method acquires a distorted image, containing distortion, captured by a camera; generates a three-dimensional frame surrounding an object shown in the distorted image; and extracts a partial image containing the object from the distorted image based on the three-dimensional frame. The three-dimensional frame may be a shape in which two planar frames surrounding the object intersect. The method may convert the two-dimensional coordinates of the object in the distorted image into three-dimensional coordinates in a virtual three-dimensional space, generate two intersecting planar frames surrounding the object in the three-dimensional space, inversely convert the three-dimensional coordinates of the generated frames into two-dimensional coordinates in the distorted image, select one of the two planar frames based on its size in the distorted image, and extract a partial image containing the object from the distorted image based on the selected planar frame. [Selected drawing] FIG. 10

Description

The present invention relates to an image processing method, a computer program, and an image processing apparatus that perform processing to detect an object shown in an image captured by a camera.

In recent years, technologies such as machine learning and deep learning have advanced, making it possible to detect a specific object, such as a person or a car, in an image captured by a camera. This technology is used in surveillance cameras, in-vehicle cameras, and the like, and makes it possible to determine, for example, whether a person is present within the shooting range of the camera and what kind of action the person is performing. Surveillance cameras and similar devices should capture as wide an area as possible, so they often shoot through a lens with a wide angle of view (a so-called wide-angle lens). An image captured through a wide-angle lens is distorted, and the distortion grows toward the periphery of the image.

Patent Document 1 proposes an image processing system that corrects distortion in an image generated by photographing an object. This image processing system includes an imaging device having: an illumination unit that illuminates the object; a projection unit that projects a predetermined pattern onto the object; an imaging unit that photographs the object illuminated by the illumination unit without the predetermined pattern to generate first image data, and photographs the object illuminated by the illumination unit together with the predetermined pattern to generate second image data; and a drive unit that moves the illumination unit and the imaging unit relative to the object. The image processing system also includes an image processing device that detects the amount of distortion of the pattern image corresponding to the predetermined pattern in the second image represented by the second image data, and corrects the distortion of the first image represented by the first image data based on the detected amount of distortion.

Japanese Unexamined Patent Publication No. 2019-192949

When an object is detected in an image captured by a camera, a rectangular frame surrounding the detected object, a so-called bounding box, is superimposed on the captured image and displayed in order to present the detection result to the user. However, when a bounding box is displayed on a distorted image captured by a camera, the object itself is distorted, so the bounding box surrounding it becomes larger and things other than the object fall inside it. As a result, the accuracy of the bounding box may decrease. Distortion can occur in the captured image even when the camera shoots through an ordinary lens, and large distortion can occur in particular when the camera shoots through a wide-angle lens.

The present invention has been made in view of these circumstances, and its object is to provide an image processing method, a computer program, and an image processing apparatus that can be expected to perform image processing using a bounding box on a distorted image.

An image processing method according to one embodiment acquires a distorted image, containing distortion, captured by a camera; generates a three-dimensional frame surrounding an object shown in the acquired distorted image; and extracts a partial image containing the object from the distorted image based on the generated three-dimensional frame. The three-dimensional frame is a shape in which two planar frames surrounding the object intersect.

According to one embodiment, image processing using a bounding box can be expected to be performed on a distorted image.

A schematic diagram for explaining the outline of the information processing system according to the present embodiment.
A block diagram showing the configuration of the server device according to the present embodiment.
A block diagram showing the configuration of the terminal device according to the present embodiment.
A schematic diagram for explaining the bounding box generation processing performed by the server device according to the present embodiment.
A schematic diagram for explaining the bounding box adjustment processing performed by the server device according to the present embodiment.
A schematic diagram for explaining the bounding box adjustment processing performed by the server device according to the present embodiment.
A schematic diagram showing an example of an image displayed by the terminal device.
A flowchart showing the procedure of the processing performed by the server device according to the present embodiment.
A schematic diagram for explaining the processing, performed by the server device according to the present embodiment, to generate a bounding box for image extraction.
A schematic diagram for explaining image extraction based on a bounding box.
A flowchart showing the procedure of the image extraction processing performed by the server device according to the present embodiment.

A specific example of the information processing system according to an embodiment of the present invention is described below with reference to the drawings. The present invention is not limited to these examples; it is defined by the claims and is intended to include all modifications within the meaning and scope equivalent to the claims.

<System configuration>
FIG. 1 is a schematic diagram for explaining the outline of the information processing system according to the present embodiment. In this system, a camera 1 installed in, for example, a commercial or public facility photographs its surroundings, a server device 3 detects a person in the captured image, and a bounding box surrounding the detected person is superimposed on the captured image and displayed on a terminal device 5.

The camera 1 according to the present embodiment shoots through a wide-angle lens with a wide angle of view, for example a horizontal angle of view of about 180°. The camera 1 shoots several times to several tens of times per second and transmits the resulting images to the server device 3. The camera 1 communicates with the server device 3 and transmits the captured images to it via a wired or wireless network such as a LAN (Local Area Network), a wireless LAN, a mobile phone network, or the Internet. Although the camera 1 shoots through a wide-angle lens in the present embodiment, this is not a limitation, and the camera 1 may shoot through a lens other than a wide-angle lens.

The server device 3 receives the captured images transmitted from one or more cameras 1 and performs processing to detect a person shown in each received image. For a person detected in a captured image, the server device 3 generates a bounding box 101 surrounding that person and generates a display image in which the bounding box 101 is superimposed on the original captured image. The server device 3 transmits the generated display image to the terminal device 5.

The terminal device 5 is used by a user who performs monitoring or similar tasks using the camera 1, and can be configured using a general-purpose information processing device such as a PC (personal computer), a smartphone, or a tablet terminal. The terminal device 5 receives the image transmitted from the server device 3 and displays it on its display unit. The image transmitted from the server device 3 to the terminal device 5 and displayed on the display unit is the image captured by the camera 1; if a person is present in the image, the bounding box 101 surrounding that person is superimposed on it.

In the present embodiment, the server device 3 performs the processing to detect a person shown in the captured image, the processing to generate the bounding box 101 surrounding the detected person, and so on, but this is not a limitation. For example, the camera 1 may perform the person detection and the generation of the bounding box 101, and transmit the captured image with the bounding box 101 superimposed to the terminal device 5, either via the server device 3 or directly without passing through it. Also, in the present embodiment, the server device 3 transmits the captured image with the bounding box 101 superimposed to the terminal device 5, and the terminal device 5 displays the received image on its display unit, but this is not a limitation either. For example, if the server device 3 has a display unit, the captured image with the bounding box 101 superimposed may be displayed on the display unit of the server device 3. Likewise, if the camera 1 has a display unit, the camera 1 may generate and superimpose the bounding box 101 and display the resulting image on its own display unit.

<Device configuration>
FIG. 2 is a block diagram showing the configuration of the server device 3 according to the present embodiment. The server device 3 includes a processing unit 31, a storage unit (storage) 32, a communication unit (transceiver) 33, and the like. The present embodiment is described on the assumption that a single server device performs the processing, but the processing may be distributed across multiple server devices.

The processing unit 31 is configured using an arithmetic processing unit such as a CPU (Central Processing Unit), an MPU (Micro-Processing Unit), or a GPU (Graphics Processing Unit), together with a ROM (Read Only Memory), a RAM (Random Access Memory), and the like. By reading and executing the server program 32a stored in the storage unit 32, the processing unit 31 performs various kinds of processing, including acquiring the image captured by the camera 1, detecting a person shown in the image, generating a bounding box surrounding the detected person, and superimposing the bounding box on the captured image and transmitting the result to the terminal device 5.

The storage unit 32 is configured using a large-capacity storage device such as a hard disk. The storage unit 32 stores the various programs executed by the processing unit 31 and the various data required for its processing. In the present embodiment, the storage unit 32 stores the server program 32a executed by the processing unit 31, as well as a trained YOLO (You Only Look Once) learning model 32b used to detect a person in the image captured by the camera 1.

In the present embodiment, the server program (program product) 32a is provided recorded on a recording medium 99 such as a memory card or an optical disc, and the server device 3 reads the server program 32a from the recording medium 99 and stores it in the storage unit 32. However, the server program 32a may instead be written to the storage unit 32 at, for example, the manufacturing stage of the server device 3. The server device 3 may also obtain, by communication, a server program 32a distributed by another remote server device or the like. Alternatively, a writing device may read the server program 32a recorded on the recording medium 99 and write it to the storage unit 32 of the server device 3. The server program 32a may be provided by distribution over a network or recorded on the recording medium 99.

The YOLO learning model 32b is a learning model based on a CNN (Convolutional Neural Network). It has been trained by machine learning to detect a specific object in an input image and to output information such as the type of the detected object and a bounding box surrounding it. The YOLO learning model 32b according to the present embodiment has been trained in advance to detect the head of a person shown in an image and to output information on the bounding box surrounding the detected head. Because the YOLO learning model is an existing technology, a detailed description of its structure and algorithm is omitted. Although the present embodiment detects a person's head using a YOLO learning model, this is not a limitation; a learning model such as R-CNN (Regions with Convolutional Neural Networks), Faster R-CNN, or SSD (Single Shot MultiBox Detector) may be used instead. The storage unit 32 stores information on the structure of the learning model that constitutes the YOLO learning model 32b, information on the internal parameters determined by machine learning, and the like.

The communication unit 33 communicates with various devices via a network N that includes a mobile phone network, a wireless LAN, the Internet, and the like. In the present embodiment, the communication unit 33 communicates with the camera 1 and the terminal device 5 via the network N. The communication unit 33 transmits data supplied by the processing unit 31 to other devices and passes data received from other devices to the processing unit 31.

The storage unit 32 may be an external storage device connected to the server device 3. The server device 3 may be a multi-computer made up of multiple computers, or a virtual machine constructed virtually in software. The server device 3 is not limited to the configuration above and may also include, for example, a reading unit that reads information stored on a portable storage medium, an input unit that accepts operation input, or a display unit that displays images.

In the server device 3 according to the present embodiment, the processing unit 31 reads and executes the server program 32a stored in the storage unit 32, whereby an image acquisition unit 31a, a person detection unit 31b, a coordinate conversion unit 31c, a bounding box generation unit 31d (bounding box is abbreviated as BBox in FIG. 2), a coordinate inverse conversion unit 31e, a bounding box adjustment unit 31f, an image superimposition unit 31g, an image transmission unit 31h, and the like are realized in the processing unit 31 as software functional units. FIG. 2 shows only the functional units of the processing unit 31 that process the image captured by the camera 1; functional units related to other processing are omitted.

The image acquisition unit 31a acquires the image captured by the camera 1 (the captured image) by receiving, through the communication unit 33, the image transmitted by the camera 1. In the present embodiment, the camera 1 shoots through a wide-angle lens, so the captured image acquired by the image acquisition unit 31a is an image with distortion in its peripheral portion (a distorted image). The image acquisition unit 31a temporarily stores the acquired captured image in the storage unit 32. Once the subsequent image processing has superimposed a bounding box on the temporarily stored image and the result has been transmitted to the terminal device 5, the image may be deleted from the storage unit 32 at an appropriate time.

The person detection unit 31b performs processing to detect the head of a person shown in the captured image that the image acquisition unit 31a acquired from the camera 1. In the present embodiment, the person detection unit 31b performs this detection using the YOLO learning model 32b stored in the storage unit 32. The person detection unit 31b inputs the distorted captured image acquired by the image acquisition unit 31a into the YOLO learning model 32b. The YOLO learning model 32b detects the head of a person shown in the input image, generates a bounding box surrounding the detected head, and outputs information on the generated bounding box.

In the present embodiment, the bounding box output by the YOLO learning model 32b is a rectangular frame surrounding the detected person's body. This frame is expressed using, for example, the coordinates of two points on the two-dimensional plane of the captured image. For example, the frame is expressed by the coordinates (x1, y1) of its upper-left corner and the coordinates (x2, y2) of its lower-right corner, and the YOLO learning model 32b outputs these coordinates (x1, y1) and (x2, y2) as the bounding box information. When multiple people are shown in the captured image, the YOLO learning model 32b generates a bounding box surrounding the head of each person and outputs the coordinates of each bounding box.

The person detection unit 31b acquires the bounding box information that the YOLO learning model 32b outputs in response to the input captured image. In the present embodiment, the person detection unit 31b also calculates the coordinates of the center of the person's head in the captured image based on the bounding box information acquired from the YOLO learning model 32b. For example, when the YOLO learning model 32b outputs the coordinates (x1, y1) of the upper-left corner and the coordinates (x2, y2) of the lower-right corner of the bounding box, the person detection unit 31b obtains the coordinates of the center of the head by calculating the midpoint of these two points. The person detection unit 31b passes the calculated coordinates of the center of the head to the coordinate conversion unit 31c.
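The midpoint calculation described above can be sketched as follows. This is a minimal illustration, not the patented implementation; the function name `head_center` and the tuple layout `(x1, y1, x2, y2)` are assumptions made for this example.

```python
from typing import List, Tuple

# Assumed layout for one detector output: (x1, y1, x2, y2), where
# (x1, y1) is the upper-left corner and (x2, y2) the lower-right corner.
Box = Tuple[float, float, float, float]


def head_center(box: Box) -> Tuple[float, float]:
    """Return the midpoint of the upper-left and lower-right corners."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def head_centers(boxes: List[Box]) -> List[Tuple[float, float]]:
    """Compute one head center per detected head box."""
    return [head_center(b) for b in boxes]


# Two hypothetical head boxes in pixel coordinates.
centers = head_centers([(100, 40, 140, 90), (300.0, 200.0, 330.0, 236.0)])
```

Each resulting center is the two-dimensional point that would then be handed to the coordinate conversion step.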

The coordinate conversion unit 31c performs a predetermined coordinate conversion on the coordinates of the center of the person's head detected by the person detection unit 31b. The coordinates passed from the person detection unit 31b to the coordinate conversion unit 31c are two-dimensional coordinates in the distorted image captured by the camera 1. The coordinate conversion unit 31c converts these two-dimensional coordinates into three-dimensional coordinates in a three-dimensional virtual space. The details of this coordinate conversion are described later.

The bounding box generation unit 31d generates a three-dimensional bounding box surrounding the person's head, based on the three-dimensional coordinates, in the three-dimensional virtual space, of the center (centroid) of the head as converted by the coordinate conversion unit 31c. In the present embodiment, the three-dimensional bounding box generated by the bounding box generation unit 31d has the shape of, for example, a rectangular cuboid or a cube. The bounding box generation unit 31d also generates a three-dimensional bounding box surrounding the person's body (the part below the neck). In other words, for each person detected in the captured image, the bounding box generation unit 31d generates two three-dimensional bounding boxes, one surrounding the head and one surrounding the rest of the body. When generating the bounding box surrounding the body, the bounding box generation unit 31d determines the height (length) of the bounding box on the assumption that the person's height is a predetermined value (for example, 160 cm or 170 cm).
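As a rough illustration of generating the two three-dimensional boxes, the sketch below builds an axis-aligned cube around the head center and a cuboid for the body directly beneath it. Everything specific here is an assumption for illustration only: the z-up coordinate frame in metres, the 0.25 m head size, the 0.5 m body width, and the names `Box3D` and `make_person_boxes`. The source states only that the boxes are cuboids or cubes and that the body box height follows from an assumed stature such as 160 cm or 170 cm.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Box3D:
    """Axis-aligned 3D box given by its minimum and maximum corners."""
    min_corner: Point3
    max_corner: Point3

    def corners(self) -> List[Point3]:
        """Return the eight corner points of the box."""
        (x0, y0, z0), (x1, y1, z1) = self.min_corner, self.max_corner
        return [(x, y, z) for x in (x0, x1) for y in (y0, y1) for z in (z0, z1)]


def make_person_boxes(
    head_center: Point3,
    assumed_height_m: float = 1.7,   # assumed stature (the source mentions 160 or 170 cm)
    head_size_m: float = 0.25,       # hypothetical head cube edge length
    body_width_m: float = 0.5,       # hypothetical body box width and depth
) -> Tuple[Box3D, Box3D]:
    """Build a head cube and a body cuboid below it (z axis assumed to point up)."""
    cx, cy, cz = head_center
    h = head_size_m / 2.0
    head = Box3D((cx - h, cy - h, cz - h), (cx + h, cy + h, cz + h))

    w = body_width_m / 2.0
    body_top = cz - h  # the body box starts where the head box ends
    body_bottom = body_top - (assumed_height_m - head_size_m)
    body = Box3D((cx - w, cy - w, body_bottom), (cx + w, cy + w, body_top))
    return head, body


head_box, body_box = make_person_boxes((0.0, 2.0, 1.575))
```

With these assumed values, the top of the head box sits at the assumed stature of 1.7 m and the bottom of the body box sits at floor level.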

The coordinate inverse conversion unit 31e converts the three-dimensional coordinates of the bounding box generated by the bounding box generation unit 31d into two-dimensional coordinates. This conversion from three-dimensional to two-dimensional coordinates is the inverse of the conversion from two-dimensional to three-dimensional coordinates performed by the coordinate conversion unit 31c. The bounding box converted by the coordinate inverse conversion unit 31e corresponds to the three-dimensional bounding box projected onto the two-dimensional plane of the image captured by the camera 1. The details of this coordinate conversion are described later.
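The text defers the details of both conversions, so the sketch below is only a stand-in that shows what a forward and inverse conversion pair can look like. It uses the equidistant fisheye model (image radius r = f·θ, where θ is the angle from the optical axis), which is a common approximation for wide-angle lenses and is not confirmed to be the conversion this patent uses. The parameters `cx`, `cy` (principal point) and `f` (pixels per radian) and both function names are assumptions. A single pixel determines only a viewing direction, so the forward function returns a unit vector rather than a point at a known depth.

```python
import math
from typing import Tuple


def pixel_to_direction(u: float, v: float, cx: float, cy: float, f: float) -> Tuple[float, float, float]:
    """Map a pixel to a unit viewing direction under the equidistant model r = f * theta."""
    dx, dy = u - cx, v - cy
    theta = math.hypot(dx, dy) / f   # angle from the optical axis
    phi = math.atan2(dy, dx)         # angle around the optical axis
    s = math.sin(theta)
    return (s * math.cos(phi), s * math.sin(phi), math.cos(theta))


def point_to_pixel(p: Tuple[float, float, float], cx: float, cy: float, f: float) -> Tuple[float, float]:
    """Project a 3D point (camera frame, z along the optical axis) back to a pixel."""
    x, y, z = p
    theta = math.atan2(math.hypot(x, y), z)
    phi = math.atan2(y, x)
    r = f * theta
    return (cx + r * math.cos(phi), cy + r * math.sin(phi))


# Round trip with hypothetical camera parameters.
direction = pixel_to_direction(900.0, 300.0, cx=640.0, cy=480.0, f=400.0)
pixel = point_to_pixel(direction, cx=640.0, cy=480.0, f=400.0)
```

Projecting each of the eight corners of a three-dimensional box with `point_to_pixel` would give the two-dimensional footprint of that box in the distorted image.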

The bounding box adjustment unit 31f adjusts the size of the bounding box based on the result of the coordinate inverse conversion unit 31e converting the three-dimensional bounding box into two-dimensional coordinates. As described above, when the bounding box generation unit 31d generates the bounding box surrounding the body of the person shown in the captured image, it determines the height of the bounding box on the assumption that the person's height is a predetermined value, but the person's actual height may differ from the assumed height. The bounding box adjustment unit 31f therefore adjusts the height of the bounding box. In the present embodiment, the bounding box adjustment unit 31f compares the size of the head bounding box converted into two-dimensional coordinates by the coordinate inverse conversion unit 31e with the size of the head bounding box generated by the person detection unit 31b using the YOLO learning model 32b, and adjusts the height of the bounding box surrounding the body based on the result of this comparison.
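The text says only that the two head boxes are compared and that the body box height is adjusted from the result; it does not give the adjustment rule at this point. The sketch below shows one hypothetical rule, a simple proportional scaling by the ratio of the two head box heights, purely to make the data flow concrete. The function names and the proportional rule itself are assumptions, not the method described in the patent.

```python
from typing import Tuple

Box2D = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def box_height(box: Box2D) -> float:
    """Height of a 2D box in pixels."""
    _, y1, _, y2 = box
    return abs(y2 - y1)


def adjust_body_height(
    assumed_height_m: float,
    projected_head_box: Box2D,   # head box obtained by projecting the 3D head box
    detected_head_box: Box2D,    # head box output by the detector
) -> float:
    """Scale the assumed height by the detected/projected head size ratio (hypothetical rule)."""
    projected = box_height(projected_head_box)
    if projected <= 0:
        raise ValueError("projected head box must have positive height")
    return assumed_height_m * (box_height(detected_head_box) / projected)


# The detected head is 10% smaller than the projected one, so the height shrinks by 10%.
adjusted = adjust_body_height(1.7, (0, 0, 20, 40), (0, 0, 18, 36))
```

When the two head boxes have the same height, this rule leaves the assumed height unchanged.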

The image superimposition unit 31g performs image processing to superimpose the final bounding box, whose height has been adjusted by the bounding box adjustment unit 31f, on the image captured by the camera 1 and acquired by the image acquisition unit 31a. The image superimposition unit 31g passes the image with the bounding box superimposed to the image transmission unit 31h.
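Superimposing a frame on an image amounts to overwriting the pixels along the frame's outline. The sketch below does this for a plain rectangular outline on a grayscale image stored as nested lists. The image representation, the function name `overlay_box`, and the use of a simple axis-aligned rectangle are assumptions for illustration; the frame drawn by the system in the text comes from a projected three-dimensional box and need not be an axis-aligned rectangle.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale image as rows of pixel values


def overlay_box(image: Image, box: Tuple[int, int, int, int], value: int = 255) -> Image:
    """Return a copy of the image with the rectangle outline drawn in `value`."""
    out = [row[:] for row in image]  # leave the input image untouched
    height = len(out)
    width = len(out[0]) if height else 0
    if height == 0 or width == 0:
        return out

    x1, y1, x2, y2 = box
    # Order the corners and clamp them to the image bounds.
    x1, x2 = sorted((max(0, min(width - 1, x1)), max(0, min(width - 1, x2))))
    y1, y2 = sorted((max(0, min(height - 1, y1)), max(0, min(height - 1, y2))))

    for x in range(x1, x2 + 1):      # top and bottom edges
        out[y1][x] = value
        out[y2][x] = value
    for y in range(y1, y2 + 1):      # left and right edges
        out[y][x1] = value
        out[y][x2] = value
    return out


blank = [[0] * 6 for _ in range(5)]
framed = overlay_box(blank, (1, 1, 4, 3))
```

Returning a copy keeps the original captured image available, which matches the text's description of generating a separate display image.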

The image transmitting unit 31h transmits the image received from the image superimposing unit 31g to a predetermined terminal device 5 via the communication unit 33, thereby causing the terminal device 5 to display the image. In the present embodiment, the server device 3 performs the image processing that superimposes the bounding box on the captured image, but the present invention is not limited to this. For example, the server device 3 may transmit the captured image and the bounding box information to the terminal device 5, and the terminal device 5 may superimpose the bounding box on the captured image and display it.

FIG. 3 is a block diagram showing the configuration of the terminal device 5 according to the present embodiment. The terminal device 5 according to the present embodiment includes a processing unit 51, a storage unit (storage) 52, a communication unit (transceiver) 53, a display unit (display) 54, an operation unit 55, and the like. The terminal device 5 is used by a user who performs work such as monitoring based on images captured by the camera 1, and may be configured using an information processing device such as a personal computer, a smartphone, or a tablet terminal device.

The processing unit 51 is configured using an arithmetic processing unit such as a CPU or MPU, a ROM, and the like. By reading and executing the program 52a stored in the storage unit 52, the processing unit 51 performs various processes, such as receiving the captured image of the camera 1 transmitted from the server device 3 and displaying it on the display unit.

The storage unit 52 is configured using a storage device such as a hard disk or a flash memory. The storage unit 52 stores various programs executed by the processing unit 51 and various data required for the processing of the processing unit 51. In the present embodiment, the storage unit 52 stores the program 52a executed by the processing unit 51. In the present embodiment, the program 52a is distributed by a remote server device or the like, acquired by the terminal device 5 through communication, and stored in the storage unit 52. However, the program 52a may instead be written into the storage unit 52 at the manufacturing stage of the terminal device 5, for example. As another example, the terminal device 5 may read the program 52a recorded on a recording medium 98 such as a memory card or an optical disk and store it in the storage unit 52. As yet another example, a writing device may read the program 52a recorded on the recording medium 98 and write it into the storage unit 52 of the terminal device 5. The program 52a may be provided by distribution via a network, or may be provided recorded on the recording medium 98.

The communication unit 53 communicates with various devices via a network N that includes a mobile phone communication network, a wireless LAN, the Internet, and the like. In the present embodiment, the communication unit 53 communicates with the server device 3 via the network N. The communication unit 53 transmits data received from the processing unit 51 to other devices, and passes data received from other devices to the processing unit 51.

The display unit 54 is configured using a liquid crystal display or the like, and displays various images, characters, and the like based on the processing of the processing unit 51. The operation unit 55 accepts user operations and notifies the processing unit 51 of the accepted operations. For example, the operation unit 55 accepts user operations through an input device such as a mechanical button or a touch panel provided on the surface of the display unit 54. The operation unit 55 may also be an input device such as a mouse or a keyboard, and such input devices may be detachable from the terminal device 5.

In the terminal device 5 according to the present embodiment, the processing unit 51 reads and executes the program 52a stored in the storage unit 52, whereby an image receiving unit 51a, a display processing unit 51b, and the like are realized in the processing unit 51 as software functional units. The program 52a may be a program dedicated to the information processing system according to the present embodiment, or may be a general-purpose program such as an Internet browser or a web browser.

The image receiving unit 51a receives, via the communication unit 53, the image transmitted by the server device 3. The image receiving unit 51a temporarily stores the received image in, for example, the storage unit 52. In the present embodiment, the image that the image receiving unit 51a receives from the server device 3 is an image in which the bounding box is superimposed on the captured image of the camera 1.

The display processing unit 51b performs processing for displaying various characters, images, and the like on the display unit 54. In the present embodiment, the display processing unit 51b displays on the display unit 54 the image that the image receiving unit 51a received from the server device 3, that is, the image in which the bounding box is superimposed on the captured image of the camera 1. In the present embodiment, the camera 1 captures images repeatedly at a rate of several tens of times per second; in other words, it captures a moving image. The server device 3 superimposes a bounding box on each captured image and transmits the result to the terminal device 5, and the display processing unit 51b of the terminal device 5 displays the images received from the server device 3 at a rate of several tens of times per second, thereby displaying a moving image on which the bounding box is superimposed.

<Bounding box generation process>
FIG. 4 is a schematic diagram for explaining the bounding box generation process performed by the server device 3 according to the present embodiment. In this figure, the quadrangle labeled with reference numeral 110 represents a captured image taken by the camera 1. The camera 1 according to the present embodiment captures images through a wide-angle lens, and the captured image 110 is an image that includes distortion caused by the lens (a distorted image). Having acquired the captured image 110 from the camera 1, the server device 3 uses the YOLO learning model 32b to detect the head of a person included in the captured image 110. Through this detection process, the server device 3 obtains, as the detection result, information such as the coordinates of a bounding box 111 that surrounds the head of the person included in the captured image 110. Based on the information such as the coordinates of the bounding box 111 obtained from the YOLO learning model 32b, the server device 3 calculates the two-dimensional coordinates of the center point 112 of the person's head (the center point of the bounding box).

Next, the server device 3 converts the coordinates of the center point 112 of the person's head in the captured image 110 into the two-dimensional coordinates of the corresponding point 113 in a two-dimensional plane obtained by removing the distortion from the captured image 110. Image processing that corrects an image containing lens distortion is conventional, so a detailed description is omitted. For example, a point (xd, yd) in the distorted image and a point (x, y) in the distortion-free image can be converted into each other based on the following approximate expression, which uses a lens distortion model.

Based on equation (1) above, the server device 3 converts the two-dimensional coordinates of the point 112, which corresponds to the center of the person's head in the distorted captured image 110, into the two-dimensional coordinates of the point 113 in the two-dimensional plane from which the distortion has been removed. In the present embodiment, the server device 3 does not need to perform this coordinate conversion for every point in the captured image 110; it only needs to convert the coordinates of the point 112 corresponding to the center of the person's head.
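As an illustration of this kind of point-wise conversion, the following is a minimal sketch that assumes a radial polynomial distortion model on normalized image coordinates. The coefficients `k1` and `k2` and the function names are hypothetical, and the model actually used in equation (1) may differ. Only a single point is converted, as in the step described above.

```python
def distort(x, y, k1, k2=0.0):
    """Map a distortion-free normalized point (x, y) to its distorted position.

    Assumption: a radial polynomial model; k1 and k2 are hypothetical coefficients.
    """
    r2 = x * x + y * y
    factor = 1.0 + k1 * r2 + k2 * r2 * r2
    return x * factor, y * factor


def undistort(xd, yd, k1, k2=0.0, iterations=20):
    """Approximately invert distort() by fixed-point iteration.

    The radial model has no simple closed-form inverse, so the estimate is
    refined repeatedly, starting from the distorted point itself.
    """
    x, y = xd, yd
    for _ in range(iterations):
        r2 = x * x + y * y
        factor = 1.0 + k1 * r2 + k2 * r2 * r2
        x, y = xd / factor, yd / factor
    return x, y


# Example: convert only the head-center point, not the whole image.
head_center_distorted = distort(0.3, 0.4, k1=-0.2)
head_center_plane = undistort(*head_center_distorted, k1=-0.2)
```

The fixed-point iteration converges quickly for mild distortion; for strongly distorted lenses a different model or solver may be more appropriate.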

Next, the server device 3 converts the two-dimensional coordinates of the point 113, which corresponds to the center of the person's head, into three-dimensional coordinates in a three-dimensional virtual space. This conversion from two-dimensional to three-dimensional coordinates is the inverse of the conversion generally performed when a virtual camera is placed in a three-dimensional virtual space to generate a two-dimensional image. For this reason, the conversion from three-dimensional to two-dimensional coordinates is described first, followed by the conversion from two-dimensional to three-dimensional coordinates, which applies the same conversion in the opposite direction.

There are two types of three-dimensional coordinates: the world coordinate system and the camera coordinate system. The world coordinate system defines XYZ coordinates with a specific position in the three-dimensional virtual space as the origin. The camera coordinate system defines XYZ coordinates with the camera placed in the three-dimensional virtual space as the origin. Let the coordinates in the world coordinate system be (Xw, Yw, Zw) and the coordinates in the camera coordinate system be (Xc, Yc, Zc). The conversion from the world coordinate system to the camera coordinate system can then be performed by the following equation (2), using constants called the external variables (extrinsic parameters) of the camera.

Further, let the coordinates of a point in the two-dimensional plane be (u, v). The conversion from the camera coordinate system to two-dimensional coordinates can then be performed by the following equation (3), using constants called the internal variables (intrinsic parameters) of the camera.

In equations (2) and (3) above, the internal variables and external variables of the camera are constants determined by the characteristics of the camera, the position of the camera, and the like. The constant c in equation (3) is a numerical value representing a scale, a ratio, or the like, and is determined according to the system configuration and the like. Based on equations (2) and (3), the conversion from three-dimensional coordinates in the world coordinate system to two-dimensional coordinates in the two-dimensional plane can be performed by the following equation (4).
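For orientation, the conventional pinhole camera formulation has the structure described for equations (2) to (4). The block below is a sketch of that conventional form, not necessarily the exact notation of the source: the rotation matrix R and translation vector t (external variables) and the matrix K with focal lengths f_x, f_y and principal point (c_x, c_y) (internal variables) are symbols assumed here for illustration.

```latex
\begin{aligned}
\begin{pmatrix} X_c \\ Y_c \\ Z_c \end{pmatrix}
  &= R \begin{pmatrix} X_w \\ Y_w \\ Z_w \end{pmatrix} + t
  && \text{(world to camera, cf. (2))} \\[4pt]
c \begin{pmatrix} u \\ v \\ 1 \end{pmatrix}
  &= K \begin{pmatrix} X_c \\ Y_c \\ Z_c \end{pmatrix},
  \qquad
  K = \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix}
  && \text{(camera to plane, cf. (3))} \\[4pt]
c \begin{pmatrix} u \\ v \\ 1 \end{pmatrix}
  &= K \,\bigl[\, R \mid t \,\bigr]
  \begin{pmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{pmatrix}
  && \text{(world to plane, cf. (4))}
\end{aligned}
```

In this form the scale c equals the depth Z_c of the point in the camera coordinate system, which is one common way of fixing the scale constant.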

Substituting any point (Xw, Yw, Zw) in the world coordinate system into equation (4) above uniquely identifies the corresponding two-dimensional coordinates (u, v). In principle, performing the inverse of equation (4), that is, multiplying the two-dimensional coordinates by the inverse of the matrix of the camera's internal and external variables, would convert two-dimensional coordinates into three-dimensional coordinates. However, because this conversion increases the number of dimensions, the information is insufficient to identify the three-dimensional coordinates uniquely. The information processing system according to the present embodiment therefore assumes that the height of the person detected in the captured image 110 is a predetermined value (for example, 175 cm). That is, the server device 3 according to the present embodiment assumes that Zw in equation (4) is a predetermined value and calculates the values of Xw and Yw from the two-dimensional coordinates (u, v), thereby converting the two-dimensional coordinates into three-dimensional coordinates.
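The following is a minimal sketch of this idea under a standard pinhole model, which is an assumption here: fixing Zw turns the under-determined inverse into the intersection of a viewing ray with a horizontal plane. The names `R`, `t`, `fx`, `fy`, `cx`, `cy` and both function names are hypothetical and do not come from the source.

```python
def project(point_w, R, t, fx, fy, cx, cy):
    """World point -> plane point (u, v), assuming a pinhole camera."""
    xc, yc, zc = [sum(R[i][j] * point_w[j] for j in range(3)) + t[i]
                  for i in range(3)]
    return fx * xc / zc + cx, fy * yc / zc + cy


def back_project(u, v, z_w, R, t, fx, fy, cx, cy):
    """Plane point (u, v) -> world point, given an assumed height z_w."""
    # Direction of the viewing ray in the camera coordinate system.
    d_c = ((u - cx) / fx, (v - cy) / fy, 1.0)
    # Rotate the ray into the world coordinate system (multiply by R transposed).
    d_w = [sum(R[i][j] * d_c[i] for i in range(3)) for j in range(3)]
    # Camera position in world coordinates: -R^T t.
    cam = [-sum(R[i][j] * t[i] for i in range(3)) for j in range(3)]
    if abs(d_w[2]) < 1e-12:
        raise ValueError("viewing ray is parallel to the plane Zw = const")
    # Scale along the ray at which it reaches the assumed height.
    s = (z_w - cam[2]) / d_w[2]
    return cam[0] + s * d_w[0], cam[1] + s * d_w[1], z_w


# Example: a camera 3 m above the floor looking straight down (assumed setup).
R = [[1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, -1.0]]
t = [0.0, 0.0, 3.0]
u, v = project((0.5, 0.25, 1.75), R, t, 800.0, 800.0, 640.0, 360.0)
xw, yw, zw = back_project(u, v, 1.75, R, t, 800.0, 800.0, 640.0, 360.0)
```

If the assumed height is wrong, the recovered Xw and Yw slide along the viewing ray, which is why a later height adjustment also changes the estimated position.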

Based on equation (4) above, the server device 3 converts the two-dimensional coordinates of the point 113 on the distortion-free two-dimensional plane into the three-dimensional coordinates of the corresponding point 114 in the world coordinate system. FIG. 4 shows the conversion from two-dimensional coordinates to three-dimensional coordinates in the camera coordinate system separately from the conversion from the camera coordinate system to the world coordinate system, but the server device 3 may perform these two stages of coordinate conversion in a single step.

Next, in the three-dimensional space of the world coordinate system, the server device 3 generates a three-dimensional bounding box 115 that surrounds the head of the person detected in the captured image 110 and a three-dimensional bounding box 116 that surrounds the body below the head. First, the server device 3 regards the point 114 obtained by the above coordinate conversion as the center (center of gravity) of the person's head, and generates a rectangular parallelepiped frame of a predetermined size centered on the point 114, thereby generating the three-dimensional bounding box 115 surrounding the head. In this example, the three-dimensional bounding box 115 surrounding the head is a cube of 20 cm × 20 cm × 20 cm.

The server device 3 generates a rectangular parallelepiped frame of a predetermined size below the three-dimensional bounding box 115 surrounding the head, thereby generating the three-dimensional bounding box 116 surrounding the body. In this example, the person's height is taken to be 175 cm, a gap of 5 cm is left below the three-dimensional bounding box 115 surrounding the head, and the three-dimensional bounding box 116 surrounding the body is a rectangular parallelepiped of 50 cm × 50 cm × 150 cm.
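The construction of the two boxes can be sketched as follows, using the example dimensions from the text (a 20 cm head cube, a 5 cm gap, and a 50 cm × 50 cm × 150 cm body box). The function names are hypothetical, and it is an assumption of this sketch that both boxes are axis-aligned in the world coordinate system with Z pointing up and units in meters.

```python
from itertools import product


def box_vertices(center, size):
    """Return the 8 corner points of an axis-aligned box."""
    cx, cy, cz = center
    sx, sy, sz = size
    return [(cx + dx * sx / 2, cy + dy * sy / 2, cz + dz * sz / 2)
            for dx, dy, dz in product((-1, 1), repeat=3)]


def head_and_body_boxes(head_center, head_size=0.20, body_width=0.50,
                        body_height=1.50, gap=0.05):
    """Build the head cube and, below it, the body box (units: meters)."""
    x, y, z = head_center
    head = box_vertices(head_center, (head_size, head_size, head_size))
    # The body box starts one gap below the bottom face of the head cube.
    body_top = z - head_size / 2 - gap
    body_center = (x, y, body_top - body_height / 2)
    body = box_vertices(body_center, (body_width, body_width, body_height))
    return head, body


# Example: head center 1.65 m above the floor, so the head cube tops out
# at 1.75 m and the body box reaches down to the floor.
head_box, body_box = head_and_body_boxes((0.0, 0.0, 1.65))
```

With these example values the three parts add up to 20 cm + 5 cm + 150 cm = 175 cm, matching the assumed height.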

Next, the server device 3 converts the three-dimensional bounding boxes 115 and 116 generated in the three-dimensional space of the world coordinate system into three-dimensional bounding boxes 117 and 118 in the three-dimensional space of the camera coordinate system, and further converts them into bounding boxes 119 and 120 in the two-dimensional plane. At this time, the server device 3 may use equation (4) above to convert the three-dimensional bounding boxes 115 and 116 of the world coordinate system directly into the bounding boxes 119 and 120 in the two-dimensional plane. The server device 3 may apply the coordinate conversion of equation (4) to the vertices of the three-dimensional bounding boxes 115 and 116 and connect the converted vertices on the two-dimensional plane to generate the bounding boxes 119 and 120. Alternatively, it may apply the coordinate conversion to all points of the three-dimensional bounding boxes 115 and 116, including the vertices and the frame lines, to generate the bounding boxes 119 and 120.

Next, the server device 3 converts the bounding boxes 119 and 120 on the distortion-free two-dimensional plane into bounding boxes 121 and 122 on the image that includes the distortion of the wide-angle lens. At this time, the server device 3 converts the coordinates of the bounding boxes based on equation (1) above. The server device 3 may apply the coordinate conversion of equation (1) to the vertices of the bounding boxes 119 and 120 and connect the converted vertices on the distorted image to generate the bounding boxes 121 and 122. Alternatively, it may apply the coordinate conversion to all points of the bounding boxes 119 and 120, including the vertices and the frame lines, to generate the bounding boxes 121 and 122. In the example shown in FIG. 4, the coordinate conversion is applied to all points of the bounding boxes, which yields bounding boxes 121 and 122 that include distorted lines.
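The difference between converting only the vertices and converting every point on the frame lines can be sketched as follows. The radial model and its coefficient `k1` are assumptions made for illustration (the source's equation (1) may differ), and sampling a fixed number of points per edge stands in for converting "all points".

```python
def distort(x, y, k1):
    """Assumed radial distortion of a normalized point (hypothetical k1)."""
    factor = 1.0 + k1 * (x * x + y * y)
    return x * factor, y * factor


def sample_segment(p0, p1, n):
    """Return n + 1 evenly spaced points from p0 to p1 inclusive."""
    return [(p0[0] + (p1[0] - p0[0]) * i / n,
             p0[1] + (p1[1] - p0[1]) * i / n) for i in range(n + 1)]


def distort_edge(p0, p1, k1, n=16):
    """Distort many points along one frame line, giving a curved polyline."""
    return [distort(x, y, k1) for x, y in sample_segment(p0, p1, n)]


def distort_vertices_only(p0, p1, k1):
    """Distort just the two end points; the edge between them stays straight."""
    return [distort(*p0, k1), distort(*p1, k1)]


# A horizontal frame line that does not pass through the image center.
curved = distort_edge((-0.5, 0.5), (0.5, 0.5), k1=-0.2)
straight = distort_vertices_only((-0.5, 0.5), (0.5, 0.5), k1=-0.2)
```

Converting only the vertices is cheaper but draws straight edges, while sampling along the edges reproduces the curved frame lines of the kind shown in FIG. 4.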

The bounding boxes 121 and 122 generated by the above processing have the shape obtained by projecting the three-dimensional bounding boxes 115 and 116, which are solids such as a cube and a rectangular parallelepiped, onto a two-dimensional plane. The server device 3 may reshape these solid-shaped bounding boxes into planar bounding boxes such as squares or rectangles.

<Bounding box adjustment process>
The bounding box generated by the bounding box generation process described above assumes that the person detected in the captured image is standing and that the person's height is a predetermined value (for example, 175 cm). Differences in height between individuals are expected to stay within a few tens of centimeters, but when a person is sitting, the height from the ground to the head may be 50 cm or more lower than the person's height, so the height of a bounding box generated for a sitting person may be inappropriate. The server device 3 according to the present embodiment therefore adjusts the height of the bounding box (the bounding box surrounding the body).

FIGS. 5 and 6 are schematic diagrams for explaining the bounding box adjustment process performed by the server device 3 according to the present embodiment. The server device 3 compares the size (second size) of the bounding box 111, which the YOLO learning model 32b assigned to the head of the person in the captured image 110 taken by the camera 1, with the size (first size) of the bounding box 121, which was generated from the three-dimensional bounding box 115 surrounding the person's head in the world coordinate system. Based on the comparison result, it adjusts the height of the bounding box 122 surrounding the person's body.

In the example shown in FIG. 5, the camera 1 is installed at a position higher than the person to be photographed and captures its surroundings looking down from above. The captured image 110 obtained by the camera 1 can be regarded as an image projected onto a plane perpendicular to the optical axis of the camera 1, and is shown in the figure as a straight line labeled with reference numeral 110. In the illustrated situation, the head of a person standing at a position A close to the camera 1 and the head of a person sitting at a position B far from the camera 1 appear at the same position in the captured image 110. However, the head of the person at position A, close to the camera 1, appears larger in the captured image 110 than the head of the person at position B, far from the camera 1.

The information processing system according to the present embodiment assumes, when generating the bounding box, that the person detected in the captured image 110 is standing and has a predetermined height. The bounding box generated for a person in the captured image 110 is therefore suited to a person standing at position A in FIG. 5, but the person actually shown in the captured image 110 may be sitting at position B.

When the person in the captured image 110 is sitting at position B, the bounding box 111 surrounding the person's head, which the YOLO learning model 32b detected in the captured image 110, is smaller than the bounding box 121 generated from the three-dimensional bounding box 115, as shown in the upper part of FIG. 6. In contrast, when the person in the captured image 110 is standing at position A, the bounding box 111 detected by the YOLO learning model 32b and the bounding box 121 generated from the three-dimensional bounding box 115 are substantially the same size, as shown in the lower part of FIG. 6.

After generating, from the three-dimensional bounding box 115, the two-dimensional bounding box 121 to be superimposed on the captured image 110, the server device 3 according to the present embodiment compares the size (first size) of this bounding box 121 with the size (second size) of the bounding box 111 surrounding the person's head detected by the YOLO learning model 32b. The sizes may be compared by, for example, area, length of the longest side, or length of the diagonal. Because the bounding box 121 has a three-dimensional (solid) shape, the server device 3 may convert it into a planar (rectangular or square) bounding box before comparing sizes. The server device 3 can convert the solid-shaped bounding box 121 into a planar shape by, for example, generating the smallest square or rectangular frame that contains the bounding box 121. When the two bounding boxes 111 and 121 are substantially the same size, the server device 3 determines that the generated bounding boxes 121 and 122 are appropriate and adopts them.
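A minimal sketch of this comparison follows. It flattens a projected box to the smallest axis-aligned rectangle that contains its points and then compares sizes by one of the measures named above. The function names and the relative tolerance `rel_tol` used to decide "substantially the same size" are assumptions of this sketch, not values from the source.

```python
import math


def enclosing_rect(points):
    """Smallest axis-aligned rectangle (x0, y0, x1, y1) containing the points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)


def rect_size(rect, measure="area"):
    """Size of a rectangle by area, longest side, or diagonal length."""
    x0, y0, x1, y1 = rect
    w, h = x1 - x0, y1 - y0
    if measure == "area":
        return w * h
    if measure == "longest_side":
        return max(w, h)
    if measure == "diagonal":
        return math.hypot(w, h)
    raise ValueError(f"unknown measure: {measure}")


def sizes_match(first_size, second_size, rel_tol=0.15):
    """True when the two sizes differ by at most rel_tol of the larger one."""
    return abs(first_size - second_size) <= rel_tol * max(first_size, second_size)


# Example: projected head box (first size) versus detector box (second size).
projected = enclosing_rect([(10, 20), (30, 25), (15, 60)])
detected = (12, 22, 31, 61)
same = sizes_match(rect_size(projected), rect_size(detected))
```

Area grows with the square of the linear scale, so a tolerance tuned for area would need to be tightened if the longest side or the diagonal is used instead.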

When the size of the bounding box 111 from the YOLO learning model 32b is smaller than the size of the bounding box 121 generated from the three-dimensional bounding box 115, the server device 3 determines that the generated bounding boxes 121 and 122 are inappropriate and adjusts them. The server device 3 changes the height of the person in the captured image 110 (the height from the ground to the head) from the predetermined height of a standing person (for example, 170 cm) to a predetermined ground-to-head height of a sitting person (for example, 100 cm), and performs the bounding box generation process described above again, thereby generating bounding boxes 121 and 122 on the captured image 110 that are suited to a sitting person.

Changing the estimated height of the head also changes the values of Xw and Yw of the three-dimensional coordinates in the world coordinate system calculated by equation (4). The process of adjusting the bounding box described above is therefore also a process of adjusting the position of the person in the world coordinate system. Accordingly, the server device 3 may, for example, increase or decrease the estimated height of the head of the person in the captured image 110 and search for the head height at which the size of the bounding box 121 generated from the three-dimensional bounding box 115 matches the size of the bounding box 111 from the YOLO learning model 32b (that is, at which the difference is at or below a threshold). In this way, the server device 3 can accurately estimate both the height of the person (the height from the ground to the head) and the position of the person in the world coordinate system.
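The search described above can be sketched as a simple scan over candidate heights. In this sketch, `projected_size_at` is a hypothetical callable that stands in for rerunning the bounding box generation process at a given head height and measuring the resulting box, and `overhead_head_size` is a toy model (a camera looking straight down, with assumed camera height and focal length) used only to make the example self-contained.

```python
def search_head_height(observed_size, projected_size_at, candidates, tolerance):
    """Return the candidate height whose projected head size is closest to the
    observed size, or None when even the best difference exceeds the tolerance."""
    best_height, best_diff = None, None
    for height in candidates:
        diff = abs(projected_size_at(height) - observed_size)
        if best_diff is None or diff < best_diff:
            best_height, best_diff = height, diff
    if best_diff is None or best_diff > tolerance:
        return None
    return best_height


def overhead_head_size(head_height, camera_height=3.0, focal_px=800.0,
                       head_m=0.20):
    """Toy model (assumption): side length in pixels of a head_m-wide head seen
    by a camera looking straight down from camera_height meters."""
    return focal_px * head_m / (camera_height - head_height)


# Example: the detector reports an 80-pixel head; scan 0.5 m to 2.0 m in 5 cm steps.
candidates = [0.5 + 0.05 * i for i in range(31)]
estimated = search_head_height(80.0, overhead_head_size, candidates, tolerance=1.0)
```

A linear scan is used for clarity. Because the projected size in this toy model increases monotonically with head height, a bisection search would also work and needs fewer evaluations.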

Besides the method of repeatedly increasing and decreasing the height estimate to find the value at which the sizes of the two bounding boxes 121 and 111 match, the height and position of a person may also be estimated using, for example, a learning model trained by machine learning. Such a learning model may be trained in advance to accept as input the sizes of the two bounding boxes 121 and 111, or an image to which the two bounding boxes 121 and 111 have been added, and to output the height of the target person.

In the present embodiment, calculations use numerical values such as 170 cm or 100 cm for the height from the ground to the head, but these values are examples and the embodiment is not limited to them. These values may be determined in advance by, for example, the designer of the information processing system according to the present embodiment and stored in the storage unit 32 of the server device 3, together with the server program 32a, as default values for processing. The server device 3 may also accept these values as user input when it performs the processing. For example, the server device 3 can use the value entered by the user when one is entered, and use the default value when none is entered.
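The default-or-user-input behaviour described above can be sketched as follows. This is a minimal illustration only: the constant names and the function `resolve_head_height` are hypothetical, and the 170 cm and 100 cm defaults are simply the example values given in the text.

```python
from typing import Optional

# Hypothetical stored defaults; the text gives 170 cm and 100 cm only as examples.
DEFAULT_STANDING_HEAD_HEIGHT_CM = 170.0
DEFAULT_SITTING_HEAD_HEIGHT_CM = 100.0


def resolve_head_height(user_value_cm: Optional[float], sitting: bool = False) -> float:
    """Return the user's value when one was entered, otherwise the stored default."""
    if user_value_cm is not None:
        if user_value_cm <= 0:
            raise ValueError("head height must be positive")
        return float(user_value_cm)
    if sitting:
        return DEFAULT_SITTING_HEAD_HEIGHT_CM
    return DEFAULT_STANDING_HEAD_HEIGHT_CM
```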

Having finished adjusting the bounding boxes 121 and 122, the server device 3 generates an image in which the resulting bounding boxes 121 and 122 are superimposed on the image 110 captured by the camera 1. The server device 3 transmits the generated image to a predetermined terminal device 5 and causes the display unit 54 of the terminal device 5 to display it. The user of the terminal device 5 can thus view, on the terminal device 5, an image in which the person captured in the image 110 from the camera 1 is surrounded by the bounding boxes 121 and 122.

FIG. 7 is a schematic diagram showing an example of an image displayed by the terminal device 5. In the illustrated example, two people are in the room captured by the camera 1, and for each person a bounding box surrounding the head and a bounding box surrounding the body are displayed superimposed on that person. The camera 1 captures images repeatedly at a rate of, for example, several tens of frames per second, and the image displayed by the terminal device 5 is updated at a comparable rate. That is, in the present embodiment the camera 1 captures a moving image and the terminal device 5 displays a moving image. When a person moves within the image, the bounding boxes surrounding that person move to follow the person.

In the illustrated example, a square area at the lower right of the screen shown on the display unit 54 of the terminal device 5 contains two points indicating the positional relationship of the two detected people. In this example, the server device 3 estimates each person's position based on the bounding box size comparison described above, and takes the two-dimensional XY coordinates (Xw, Yw) from the three-dimensional coordinates of each person in the world coordinate system obtained by the estimation. By plotting a point for each person in the square area based on that person's two-dimensional coordinates (Xw, Yw), the server device 3 can present to the user the estimated positions of the people detected in the captured image.
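Plotting a person's floor position into the square area amounts to a linear mapping from (Xw, Yw) to panel pixels. The sketch below is one possible way to do this; the function name, the room extent arguments, and the clamping behaviour are assumptions for illustration and are not specified in the text.

```python
def world_to_panel(xw, yw, room_min, room_max, panel_origin, panel_size):
    """Map floor coordinates (Xw, Yw) to a pixel inside a square panel.

    room_min / room_max: assumed (x, y) extent of the floor in world units.
    panel_origin: top-left pixel of the square panel on the screen.
    panel_size: side length of the panel in pixels.
    """
    (x0, y0), (x1, y1) = room_min, room_max
    # Normalise to [0, 1] and clamp so that a point never leaves the panel.
    u = min(max((xw - x0) / (x1 - x0), 0.0), 1.0)
    v = min(max((yw - y0) / (y1 - y0), 0.0), 1.0)
    px = panel_origin[0] + round(u * (panel_size - 1))
    py = panel_origin[1] + round(v * (panel_size - 1))
    return px, py
```

For a 500 cm by 500 cm room and a 101-pixel panel whose top-left corner is at (1000, 600), a person at the room centre (250, 250) maps to the panel centre.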

<Flowchart>
FIG. 8 is a flowchart showing the procedure of the processing performed by the server device 3 according to the present embodiment. The image acquisition unit 31a of the processing unit 31 of the server device 3 communicates with the camera 1 via the communication unit 33 and acquires the image captured by the camera 1 (step S1). The person detection unit 31b of the processing unit 31 uses the YOLO learning model 32b stored in the storage unit 32 to detect a person in the captured image acquired in step S1 (step S2). The person detection unit 31b calculates the coordinates of the center point of the person's head based on information, such as the position of the bounding box surrounding the head, output by the YOLO learning model 32b (step S3).

The coordinate conversion unit 31c of the processing unit 31 applies the distortion-removal calculation using equation (1) above to the coordinates of the head center point in the captured image calculated in step S3, thereby converting them into two-dimensional coordinates on a distortion-free two-dimensional plane (step S4). Assuming that the detected person is standing and that the person's height is a predetermined value (for example, 170 cm), the coordinate conversion unit 31c converts the two-dimensional coordinates obtained in step S4 into three-dimensional coordinates in the world coordinate system based on equation (4) above (step S5).

In the three-dimensional virtual space of the world coordinate system, the bounding box generation unit 31d of the processing unit 31 generates a cubic frame of a predetermined size (for example, 20 cm × 20 cm × 20 cm) centered on the three-dimensional coordinates obtained in step S5, thereby generating a three-dimensional bounding box surrounding the detected person's head (step S6). Below the head bounding box generated in step S6, the bounding box generation unit 31d generates a rectangular cuboid frame of a predetermined size (for example, 50 cm × 50 cm × 150 cm), thereby generating a bounding box surrounding the detected person's body (the part below the head) (step S7).
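Steps S6 and S7 can be sketched as generating the eight corners of each box in world coordinates. The following is a minimal sketch under stated assumptions: the z axis points upward, the body box sits directly beneath the head box and is centered on the same vertical line, and the function names are hypothetical. The sizes are the example values from the text.

```python
from itertools import product


def cuboid_corners(center, size):
    """Return the 8 corners of an axis-aligned cuboid given its center and (sx, sy, sz) size."""
    cx, cy, cz = center
    sx, sy, sz = size
    return [
        (cx + dx * sx / 2, cy + dy * sy / 2, cz + dz * sz / 2)
        for dx, dy, dz in product((-1, 1), repeat=3)
    ]


def head_and_body_boxes(head_center, head_size=(20, 20, 20), body_size=(50, 50, 150)):
    """Step S6: cube around the head center. Step S7: cuboid placed directly below it."""
    head = cuboid_corners(head_center, head_size)
    hx, hy, hz = head_center
    body_top = hz - head_size[2] / 2              # assumed: body box starts at the head box's bottom face
    body_center = (hx, hy, body_top - body_size[2] / 2)
    body = cuboid_corners(body_center, body_size)
    return head, body
```

With a head center at z = 170 cm, the head box spans 160 cm to 180 cm and the body box spans 10 cm to 160 cm under these assumptions.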

The coordinate inverse conversion unit 31e of the processing unit 31 converts the two generated bounding boxes into bounding boxes on a two-dimensional plane based on equation (4) above (step S8). The coordinate inverse conversion unit 31e then applies the distortion-adding calculation using equation (1) above to the bounding boxes on the distortion-free two-dimensional plane obtained in step S8, thereby converting them into bounding boxes on the distorted two-dimensional plane (the captured image) (step S9).

The bounding box adjustment unit 31f of the processing unit 31 determines whether, among the bounding boxes obtained in step S9, the size of the bounding box surrounding the detected person's head is substantially the same as (differs by no more than a threshold from) the size of the head bounding box output by the YOLO learning model 32b in step S2 (step S10). If the sizes of the two bounding boxes differ (the difference exceeds the threshold) (S10: NO), the bounding box adjustment unit 31f changes the detected person's height, that is, the height from the ground to the head, from the predetermined value (for example, 170 cm) to another value (step S11) and returns the process to step S4.

If it can be determined whether the detected person is standing or sitting, and distinguishing only these two cases gives sufficient height accuracy, then in step S11 the bounding box adjustment unit 31f changes the height to a value predetermined as the head height of a seated person (for example, 100 cm). In this case, after changing the height and performing steps S4 to S9, the processing unit 31 may proceed to step S12 without performing the determination of step S10.

In contrast, when the detected person's height, coordinates, and so on are to be estimated in more detail, in step S11 the bounding box adjustment unit 31f increases or decreases the height estimate by a predetermined amount (for example, 1 cm) and repeats steps S4 to S11 until the determination in step S10 finds the two bounding boxes to be substantially the same size. For example, when the size of the bounding box obtained in step S9 is larger than the size of the bounding box from the YOLO learning model 32b, the bounding box adjustment unit 31f decreases the height estimate. Conversely, when the size of the bounding box obtained in step S9 is smaller than the size of the bounding box from the YOLO learning model 32b, the bounding box adjustment unit 31f increases the height estimate.
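The loop over steps S10 and S11 can be sketched as a simple stepwise search. This is only an illustration under stated assumptions: `projected_size` stands in for steps S4 to S9 and is supplied by the caller, the search bounds and iteration cap are additions for safety, and `toy_projected_size` is a made-up model of a downward-looking ceiling camera (not equations (1) or (4)) in which a higher head appears larger.

```python
from typing import Callable, Optional


def search_head_height(
    projected_size: Callable[[float], float],
    detected_size: float,
    start_cm: float = 170.0,
    step_cm: float = 1.0,
    tolerance: float = 0.5,
    min_cm: float = 50.0,
    max_cm: float = 220.0,
    max_iter: int = 200,
) -> Optional[float]:
    """Step the height estimate until the projected head box matches the detected one."""
    height = start_cm
    for _ in range(max_iter):
        if not (min_cm <= height <= max_cm):
            return None                      # assumed safeguard: give up outside a plausible range
        diff = projected_size(height) - detected_size
        if abs(diff) <= tolerance:           # step S10: sizes substantially the same
            return height
        # Step S11: projected box too large -> lower the estimate; too small -> raise it.
        height += -step_cm if diff > 0 else step_cm
    return None                              # assumed safeguard: no match within max_iter steps


def toy_projected_size(height_cm: float) -> float:
    """Toy stand-in for steps S4 to S9: a 20 cm head seen from a camera 300 cm above the floor."""
    return 800.0 * 20.0 / (300.0 - height_cm)
```

A fixed 1 cm step is the simplest reading of the text; a bisection search over the same interval would need fewer projections but is not what the text describes.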

If the two bounding boxes are determined to be substantially the same size (S10: YES), the image superimposing unit 31g of the processing unit 31 generates an image in which the bounding boxes obtained in step S9 are superimposed on the captured image acquired in step S1 (step S12). The image transmission unit 31h of the processing unit 31 transmits the image generated in step S12 to the predetermined terminal device 5 via the communication unit 33 (step S13), and the processing ends.

<Image extraction>
In the information processing system according to the present embodiment, the bounding box that the server device 3 generates from the image captured by the camera 1 can be expected to serve various kinds of processing besides being superimposed on the captured image and displayed on the terminal device 5. For example, the server device 3 can extract an image region containing a person from the captured image based on the generated bounding box.

In the bounding box generation processing described above, the server device 3 generated two bounding boxes: one surrounding the head of the person in the captured image and one surrounding the body. When performing image extraction, the server device 3 according to the present embodiment generates a single bounding box surrounding the person's whole body instead of these two bounding boxes. The generation of this bounding box for image extraction is described first.

FIG. 9 is a schematic diagram for explaining the processing by which the server device 3 according to the present embodiment generates a bounding box for image extraction. The server device 3 detects a person's head in the captured image 110 acquired from the camera 1 using the YOLO learning model 32b. The server device 3 converts the coordinates of the person's head in the distorted captured image, in order, into two-dimensional coordinates on a two-dimensional plane with the distortion removed, three-dimensional coordinates in the three-dimensional space of the camera coordinate system, and three-dimensional coordinates in the three-dimensional space of the world coordinate system. The coordinate conversion up to this point is the same as the processing described with reference to FIG. 4, so a detailed description is omitted.

In the three-dimensional space of the world coordinate system, the server device 3 generates two rectangular planar bounding boxes 141 and 142 so as to surround the point 114 corresponding to the head obtained by the above coordinate conversion. The first bounding box 141 is, for example, a rectangular frame measuring 50 cm in the x direction, 0 cm in the y direction, and 175 cm in the z direction, and is placed in the three-dimensional space so that the point 114 lies at its center in the x direction and 10 cm below its top in the z direction. The second bounding box 142 is, for example, a rectangular frame measuring 0 cm in the x direction, 50 cm in the y direction, and 175 cm in the z direction, and is placed in the three-dimensional space so that the point 114 lies at its center in the y direction and 10 cm below its top in the z direction. With this placement, the two bounding boxes 141 and 142 intersect (at right angles).
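The placement of the two crossed frames can be written out directly from the example dimensions. The sketch below assumes the z axis points upward; the function name `crossed_frames` and its parameters are hypothetical, and the defaults are the example values in the text.

```python
def crossed_frames(head_point, width=50.0, height=175.0, top_margin=10.0):
    """Return two orthogonal rectangular frames (4 corners each) around a head point.

    The head point lies at the horizontal center of each frame and `top_margin`
    below its top edge, as in the example values of the text.
    """
    hx, hy, hz = head_point
    top = hz + top_margin
    bottom = top - height
    half = width / 2
    # Frame in the x-z plane (zero extent in y), like bounding box 141.
    frame_xz = [(hx - half, hy, top), (hx + half, hy, top),
                (hx + half, hy, bottom), (hx - half, hy, bottom)]
    # Frame in the y-z plane (zero extent in x), like bounding box 142.
    frame_yz = [(hx, hy - half, top), (hx, hy + half, top),
                (hx, hy + half, bottom), (hx, hy - half, bottom)]
    return frame_xz, frame_yz
```

With a head point at z = 165 cm, both frames run from the floor (z = 0) to z = 175 cm and share the vertical line through the head point.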

The shapes, sizes, and so on of the two bounding boxes 141 and 142 described above are examples, and the embodiment is not limited to them. The bounding boxes 141 and 142 need not be quadrilaterals and may be, for example, triangles or polygons with five or more sides. The values such as 50 cm and 175 cm for the bounding boxes 141 and 142 may be changed as appropriate. These values may be determined in advance by, for example, the designer of the information processing system according to the present embodiment and stored in the storage unit 32 of the server device 3, together with the server program 32a, as default values for processing. The server device 3 may also accept these values as user input when it performs the processing. For example, the server device 3 can use the value entered by the user when one is entered, and use the default value when none is entered.

When the server device 3 has estimated the person's height (the height from the ground to the head) by the method shown in FIGS. 5 and 6 and elsewhere, it can set the height (the length in the z direction) of the two bounding boxes 141 and 142 to the estimated height.

The server device 3 converts the bounding boxes 141 and 142 generated in the three-dimensional space of the world coordinate system into three-dimensional bounding boxes 143 and 144 in the three-dimensional space of the camera coordinate system, and further converts them into bounding boxes 145 and 146 on a two-dimensional plane. Here the server device 3 may use equation (4) above to convert the bounding boxes 141 and 142 in the world coordinate system directly into the bounding boxes 145 and 146 on the two-dimensional plane.

The server device 3 converts the bounding boxes 145 and 146 on the distortion-free two-dimensional plane into bounding boxes on the image that contains the wide-angle lens distortion. Here the server device 3 can convert the coordinates of the bounding boxes based on equation (1) above. The server device 3 then selects one of the two bounding boxes obtained by the coordinate conversion and uses it as the bounding box 147 for image extraction. For example, the server device 3 can compare the widths, areas, or similar measures of the two bounding boxes on the distorted image and select the bounding box with the larger width or area.
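Selecting the larger of the two projected frames by area can be sketched with the shoelace formula. This is an illustrative sketch, not the patent's stated method: it treats each projected frame as a polygon through its four corner points, which is an approximation because the edges of a frame may curve after lens distortion is applied. The function names are hypothetical.

```python
def polygon_area(points):
    """Area of a simple polygon given as ordered (x, y) vertices (shoelace formula)."""
    twice_area = 0.0
    n = len(points)
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]
        twice_area += x0 * y1 - x1 * y0
    return abs(twice_area) / 2


def select_extraction_frame(frame_a, frame_b):
    """Return whichever projected frame covers the larger area in the image."""
    return frame_a if polygon_area(frame_a) >= polygon_area(frame_b) else frame_b
```

Intuitively, the frame seen more nearly face-on from the camera projects to the larger area, while the frame seen edge-on collapses toward a line, so the comparison picks the frame that better covers the person.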

Using the bounding box 147 for image extraction generated by the above procedure, the server device 3 extracts, from the distorted captured image acquired from the camera 1, the image region containing the person captured in that image. Instead of generating the bounding box 147 for image extraction by the procedure shown in FIG. 9, the server device 3 may generate a bounding box for image extraction based on the two bounding boxes 121 and 122 generated by the procedure shown in FIG. 4. In this case, the server device 3 can merge the bounding box 121 surrounding the person's head and the bounding box 122 surrounding the body into a single two-dimensional (rectangular) bounding box for image extraction. For example, the server device 3 can generate the bounding box for image extraction by generating the smallest rectangular frame that contains both the bounding box 121 surrounding the head and the bounding box 122 surrounding the body.

FIG. 10 is a schematic diagram for explaining image extraction based on a bounding box. The upper part of FIG. 10 shows an enlarged view of a person and the surrounding area in an image captured by the camera 1. Because the camera 1 captured the image through a wide-angle lens, the person in the illustrated image does not appear upright relative to the ground but appears tilted (distorted) at an angle. A bounding box 131 drawn as a solid-line rectangle and a bounding box 132 drawn as a broken-line rectangle are overlaid on this image. The solid-line bounding box 131 is the bounding box for image extraction generated by the server device 3 according to the present embodiment by the procedure shown in FIG. 9. The broken-line bounding box 132 is the bounding box that results when the whole person is detected and boxed by the YOLO method.

By extracting the image inside the bounding box 131 generated for the captured image, the server device 3 can extract the image region containing the person from the distorted captured image. The image extracted based on the bounding box 131 is shown at the lower left of FIG. 10. Because the bounding box 131 is tilted relative to the captured image, the server device 3 applies tilt correction (for example, image rotation) to the image extracted from it. The lower left of FIG. 10 also shows the extracted image after tilt correction. The image extracted based on the bounding box 131 generated by the server device 3 according to the present embodiment is one in which the detected person stands along the vertical and horizontal directions of the image. The tilt of the extracted image may be corrected by a method other than image rotation.
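The geometry of the tilt correction can be sketched by measuring how far the frame's vertical axis leans from the image's vertical axis and rotating by the opposite angle. The sketch below works on corner points only; rotating actual pixel data would be done with an image library using the same angle. The function names and the corner ordering are assumptions for illustration.

```python
import math


def rotate_points(points, angle_deg, center):
    """Rotate (x, y) points about `center` with the standard 2D rotation matrix."""
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    cx, cy = center
    return [
        (cx + c * (x - cx) - s * (y - cy), cy + s * (x - cx) + c * (y - cy))
        for x, y in points
    ]


def tilt_angle_deg(quad):
    """Angle between the frame's bottom-to-top axis and the image's vertical axis.

    `quad` lists corners as top-left, top-right, bottom-right, bottom-left in
    image coordinates (y grows downward). Zero means the frame is upright.
    """
    tl, tr, br, bl = quad
    top_mid = ((tl[0] + tr[0]) / 2, (tl[1] + tr[1]) / 2)
    bottom_mid = ((bl[0] + br[0]) / 2, (bl[1] + br[1]) / 2)
    dx = top_mid[0] - bottom_mid[0]
    dy = bottom_mid[1] - top_mid[1]      # positive when the top edge is above the bottom edge
    return math.degrees(math.atan2(dx, dy))


def deskew(quad):
    """Rotate the frame about its centroid so that its axis becomes vertical."""
    cx = sum(p[0] for p in quad) / 4
    cy = sum(p[1] for p in quad) / 4
    return rotate_points(quad, -tilt_angle_deg(quad), (cx, cy))
```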

In contrast, the YOLO bounding box 132 is a rectangle aligned with the vertical and horizontal directions of the distorted captured image. The image obtained by extracting the image region from the captured image based on the YOLO bounding box 132 is shown at the lower right of FIG. 10. In the image extracted based on the YOLO bounding box 132, the detected person appears tilted (distorted). The extracted image also contains more pixels belonging to things other than the person (for example, the background) than the image extracted based on the bounding box 131, and so carries more information unrelated to the person.

The server device 3 can use the image extracted based on the bounding box 131 as input to various kinds of processing, such as identifying who the person in the image is, recognizing the behavior of the person in the image, or tracking the person in the image. Compared with the image extracted based on the YOLO bounding box 132, the image extracted based on the bounding box 131 generated by the server device 3 according to the present embodiment can be expected to have a higher proportion of pixels occupied by the detected person relative to the total number of pixels, and can therefore be expected to improve the accuracy of subsequent processing.

FIG. 11 is a flowchart showing the procedure of the image extraction processing performed by the server device 3 according to the present embodiment. The processing unit 31 of the server device 3 acquires the image captured by the camera 1 with the image acquisition unit 31a (step S31) and detects a person with the person detection unit 31b using the YOLO learning model 32b (step S32). The person detection unit 31b calculates the coordinates of the center point of the person's head based on information, such as the position of the bounding box surrounding the head, output by the YOLO learning model 32b (step S33). The coordinate conversion unit 31c of the processing unit 31 converts the calculated coordinates of the head center point into two-dimensional coordinates on a distortion-free two-dimensional plane (step S34), and converts those two-dimensional coordinates into three-dimensional coordinates in the world coordinate system (step S35).

In the three-dimensional virtual space of the world coordinate system, the bounding box generation unit 31d of the processing unit 31 generates two rectangular planar bounding boxes that contain the three-dimensional coordinates of the head center point obtained in step S35 (step S36). The two bounding boxes can be, for example, a rectangular planar frame of 50 cm × 0 cm × 175 cm and a rectangular planar frame of 0 cm × 50 cm × 175 cm.

The coordinate inverse conversion unit 31e of the processing unit 31 converts the two generated bounding boxes into bounding boxes on a two-dimensional plane (step S37). The coordinate inverse conversion unit 31e then converts the two bounding boxes on the distortion-free two-dimensional plane obtained in step S37 into two bounding boxes on the distorted two-dimensional plane (the captured image) (step S38).

Of the two bounding boxes on the captured image generated in step S38, the processing unit 31 selects the one with the larger width, area, or similar measure as the bounding box for image extraction, and extracts from the captured image the image region enclosed by that bounding box (step S39). The processing unit 31 corrects the tilt of the extracted image region by applying processing such as rotation (step S40), and the image extraction processing ends.

<Summary>
In the information processing system according to the present embodiment configured as described above, the server device 3 acquires a captured image containing distortion (a distorted image) taken by the camera 1 through a wide-angle lens and detects an object (a person) in the acquired image. The server device 3 converts the two-dimensional coordinates of the detected person into three-dimensional coordinates in a virtual three-dimensional space and generates, in that space, a three-dimensional frame (a three-dimensional bounding box) surrounding the object. The server device 3 converts the three-dimensional coordinates of the generated three-dimensional frame back into two-dimensional coordinates in the image captured by the camera 1. The server device 3 superimposes the resulting planar frame (a planar bounding box) on the distorted image captured by the camera 1 and causes the terminal device 5 or the like to display it. The server device 3 can thus be expected to superimpose a bounding box suited to the distorted captured image and present the result of detecting the object in the captured image to the user.

The server device 3 according to the present embodiment also converts the two-dimensional coordinates of the object in the distorted image captured by the camera 1 into two-dimensional coordinates in an image with the distortion removed. The server device 3 converts the two-dimensional coordinates in the distortion-removed image into three-dimensional coordinates in the camera coordinate system centered on the camera 1, converts the three-dimensional coordinates in the camera coordinate system into three-dimensional coordinates in the world coordinate system, and generates a three-dimensional frame surrounding the object in the world coordinate system. The server device 3 can thus be expected to generate, with good accuracy, a three-dimensional frame surrounding the object from a distorted image captured through a wide-angle lens. In the present embodiment, the server device 3 performs the conversion from two-dimensional coordinates to three-dimensional coordinates in the camera coordinate system and the conversion from the camera coordinate system to the world coordinate system together based on equation (4), but the embodiment is not limited to this, and each coordinate conversion may be performed separately.

The server device 3 according to the present embodiment also detects a person's head in the distorted captured image, converts the two-dimensional coordinates of the head into three-dimensional coordinates, generates in the three-dimensional virtual space a first three-dimensional frame surrounding the head and a second three-dimensional frame surrounding the body, and combines the first and second three-dimensional frames to generate a three-dimensional frame surrounding the person. This can be expected to reduce the processing load of the coordinate conversion performed by the server device 3.

The server device 3 according to the present embodiment also generates a first three-dimensional frame of a predetermined size surrounding the person's head and a second three-dimensional frame of a predetermined size surrounding the person's body, combines the two to generate a three-dimensional frame surrounding the person, and converts the three-dimensional coordinates of this frame into two-dimensional coordinates on the two-dimensional plane of the captured image. The server device 3 calculates the size of the portion of the resulting planar frame that corresponds to the first three-dimensional frame, compares it with the size of the planar frame obtained by detecting the person's head in the distorted captured image, and adjusts the height of the second three-dimensional frame based on the comparison result. The server device 3 can thereby be expected to adjust the second three-dimensional frame, initially generated at a predetermined size, to a size suited to the height, posture, or the like of the detected person.

In addition, when the size of the portion corresponding to the first three-dimensional frame in the two-dimensional planar frame converted from the three-dimensional frame is larger than the size of the planar frame obtained by detecting the person's head in the captured image, the server device 3 according to the present embodiment reduces the height of the second three-dimensional frame in the three-dimensional virtual space. The server device 3 can thereby be expected to accurately adjust the second three-dimensional frame, initially generated at a predetermined size, to a size suited to the height, posture, or the like of the detected person.
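The adjustment rule can be sketched as a simple loop. The step size, lower bound, iteration limit, and the `projected_head_size` callback (which stands in for the inverse coordinate conversion of the head frame) are all assumptions made for illustration and are not specified in the text.

```python
def adjust_body_height(initial_height, detected_head_size,
                       projected_head_size, step=0.05, min_height=0.3,
                       max_iter=100):
    """Lower the body-frame height while the projected head frame is
    larger than the head frame detected in the image.

    projected_head_size: hypothetical callable mapping a body height to the
    size (in pixels) of the head frame after projection into the image.
    """
    height = initial_height
    for _ in range(max_iter):
        if projected_head_size(height) <= detected_head_size:
            break  # projected frame no longer exceeds the detection
        if height - step < min_height:
            break  # do not shrink below the assumed minimum
        height -= step
    return height


# Toy projection model: overhead camera 3 m above the floor, a 0.3 m head,
# focal length 500 px, so the head appears larger the higher it is.
def toy_projection(body_height):
    return 500.0 * 0.3 / (3.0 - body_height)


detected = toy_projection(1.22)  # size a shorter or crouching person would give
adjusted = adjust_body_height(1.5, detected, toy_projection)
```

In the toy model the loop stops at the first height whose projected head frame is no larger than the detected one. A real implementation would recompute the projection through the camera model at each step.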

The server device 3 according to the present embodiment also uses the YOLO algorithm to detect the object shown in the distorted captured image. YOLO is an accurate, well-proven algorithm for detecting objects in images and can attach a bounding box to each detected object, which makes it well suited to the processing performed by the server device 3 according to the present embodiment. The server device 3 may, however, use an algorithm other than YOLO to detect the object in the captured image.

In the information processing system according to the present embodiment, the server device 3 acquires a captured image containing distortion (a distorted image) taken by the camera 1 through a wide-angle lens, generates a three-dimensional frame (bounding box) surrounding the object (a person) shown in the acquired image, superimposes the generated three-dimensional frame on the captured image, and causes the display unit 54 of the terminal device 5 to display the result. The information processing system can thereby be expected to superimpose a bounding box suited to the distorted captured image and present the presence of the object shown in the image to the user.

The information processing system according to the present embodiment also extracts, from the distorted captured image, a partial image including the object based on the generated three-dimensional frame. Processing such as face authentication or action recognition can then be expected to be performed accurately on the basis of the extracted partial image.
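One simple way to extract such a partial image is sketched below. It assumes the frame has already been projected into image coordinates and crops the axis-aligned rectangle enclosing the projected points, clamped to the image bounds. That choice is a simplification made for illustration, and the function name and the row-list image representation are hypothetical.

```python
import math


def crop_by_frame(image, frame_points):
    """Crop the axis-aligned rectangle enclosing the projected frame.

    image: list of rows, each row a list of pixel values.
    frame_points: iterable of (x, y) image coordinates of the frame corners.
    Returns the cropped rows, or an empty list if the frame lies outside.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    xs = [p[0] for p in frame_points]
    ys = [p[1] for p in frame_points]
    # Clamp to the image so frames that extend past the border still work.
    x0 = max(0, int(math.floor(min(xs))))
    y0 = max(0, int(math.floor(min(ys))))
    x1 = min(width, int(math.ceil(max(xs))) + 1)
    y1 = min(height, int(math.ceil(max(ys))) + 1)
    if x0 >= x1 or y0 >= y1:
        return []
    return [row[x0:x1] for row in image[y0:y1]]


# Synthetic 6 x 8 image whose pixel value encodes its position (y * 10 + x).
image = [[y * 10 + x for x in range(8)] for y in range(6)]
patch = crop_by_frame(image, [(2.2, 1.0), (4.7, 1.5), (4.0, 3.9), (2.0, 3.0)])
```

Because a projected frame in a distorted image is generally not axis-aligned, the cropped rectangle may include some background around the object.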

The three-dimensional frame generated by the information processing system according to the present embodiment includes a first frame surrounding the head of a person detected in the captured image as the object and a second frame surrounding that person's body. Generating a three-dimensional frame made up of multiple frames in this way makes it possible to extract and use an image suited to processing such as face authentication or action recognition.

In the information processing system according to the present embodiment, the server device 3 also generates a planar frame on the two-dimensional plane of the captured image from the three-dimensional bounding box generated in the three-dimensional virtual space of the world coordinate system, and compares its size with that of the planar frame generated directly from the captured image by an algorithm such as YOLO. Based on the comparison result, the server device 3 can estimate, for example, the height of the object above the ground and its position in the three-dimensional virtual space.

The embodiments disclosed herein should be considered illustrative in all respects and not restrictive. The scope of the present invention is indicated not by the description above but by the claims, and is intended to include all modifications within the meaning and scope equivalent to the claims.

1 Camera
3 Server device
5 Terminal device
31 Processing unit
31a Image acquisition unit
31b Person detection unit
31c Coordinate conversion unit
31d Bounding box generation unit
31e Coordinate inverse conversion unit
31f Bounding box adjustment unit
31g Image superimposition unit
31h Image transmission unit
32 Storage unit
32a Server program
32b YOLO learning model
33 Communication unit
51 Processing unit
51a Image reception unit
51b Display processing unit
52 Storage unit
52a Program
53 Communication unit
54 Display unit
55 Operation unit
98, 99 Recording medium
101 Bounding box
N Network

Claims

An image processing method comprising:
acquiring a distorted image, containing distortion, captured by a camera;
generating a three-dimensional frame surrounding an object shown in the acquired distorted image; and
extracting, from the distorted image, a partial image including the object based on the generated three-dimensional frame,
wherein the three-dimensional frame has a shape in which two planar frames surrounding the object intersect.

The image processing method according to claim 1, wherein
the two-dimensional coordinates of the object in the distorted image are converted into three-dimensional coordinates in a virtual three-dimensional space,
two intersecting planar frames surrounding the object are generated in the three-dimensional space,
the three-dimensional coordinates of the generated frames are inversely converted into two-dimensional coordinates in the distorted image,
one of the two planar frames is selected based on its size in the distorted image, and
the partial image including the object is extracted from the distorted image based on the selected planar frame.

The image processing method according to claim 1 or 2, wherein a tilt of the extracted partial image is corrected.

A computer program causing a computer to execute processing of:
acquiring a distorted image, containing distortion, captured by a camera;
generating a three-dimensional frame surrounding an object shown in the acquired distorted image; and
extracting, from the distorted image, a partial image including the object based on the generated three-dimensional frame,
wherein the three-dimensional frame has a shape in which two planar frames surrounding the object intersect.

An image processing device comprising:
an acquisition unit that acquires a distorted image, containing distortion, captured by a camera;
a generation unit that generates a three-dimensional frame surrounding an object shown in the acquired distorted image; and
a processing unit that extracts, from the distorted image, a partial image including the object based on the generated three-dimensional frame,
wherein the three-dimensional frame has a shape in which two planar frames surrounding the object intersect.