JP2019175093A

JP2019175093A - Apparatus, method and program for estimation, and apparatus, method and program for learning

Info

Publication number: JP2019175093A
Application number: JP2018061911A
Authority: JP
Inventors: Kyoko Kawaguchi (川口 京子)
Original assignee: Panasonic Intellectual Property Management Co Ltd
Current assignee: Panasonic Intellectual Property Management Co Ltd
Priority date: 2018-03-28
Filing date: 2018-03-28
Publication date: 2019-10-10

Abstract

To provide an estimation device better suited to estimating a feature point, posture, or motion of a target object.

SOLUTION: An estimation device comprises a compression processing unit 23 that compresses each pixel value in the region of a target object shown in an image into a first gradation range within the full gradation range that represents the image, and an estimation unit 24 that applies image analysis using a trained first discriminator model Dm1 to the image subjected to the compression processing and estimates a feature point, posture, or motion of the target object. The first discriminator model Dm1 has been trained using first learning data Dt1 in which a teacher image subjected to the compression processing is associated with the feature point, posture, or motion of the target object shown in the teacher image.

SELECTED DRAWING: Figure 1

Description

The present disclosure relates to an estimation device, an estimation method, an estimation program, a learning device, a learning method, and a learning program.

In recent years, there has been demand for image recognition techniques that recognize features such as the joint positions, posture, or motion of a person appearing in an acquired image.

Against this background, image recognition techniques based on machine learning have attracted attention. In this type of technique, a discriminator is trained by machine learning using teacher images as learning data, so that the discriminator captures the characteristics of the probability distribution latent in the image data. A trained discriminator can then identify image patterns simply from the pixel value information of an input image.

For example, Non-Patent Document 1 discloses an image recognition technique that estimates human posture using a convolutional neural network as a discriminator.

Non-Patent Document 1: Alexander Toshev, et al., "Deep Pose: Human Pose Estimation via Deep Neural Networks", in CVPR, 2014 (URL: http://www.cv-foundation.org/openaccess/content_cvpr_2014/papers/Toshev_DeepPose_Human_Pose_2014_CVPR_paper.pdf)

In a discriminator such as a general convolutional neural network, however, color variation (for example, a person's clothing) is excess data when identifying the overall characteristics of the target object (for example, a person's joint positions or posture), and tends to degrade both the identification accuracy and the learning efficiency of the discriminator. Consequently, if images showing different clothing are used as learning data as they are, an enormous amount of learning time is needed before the discriminator can identify a person's posture and the like accurately.

The present disclosure has been made in view of the above problem, and aims to provide an estimation device, an estimation method, an estimation program, a learning device, a learning method, and a learning program that can reduce the learning time required for training.

The main aspect of the present disclosure that solves the above problem is an estimation device comprising:
a compression processing unit that compresses each pixel value in the region of a target object shown in an image into a first gradation range within the full gradation range that represents the image; and
an estimation unit that applies image analysis using a trained first discriminator model to the image subjected to the compression processing, and estimates a feature point, posture, or motion of the target object,
wherein the first discriminator model has been trained using first learning data in which a teacher image subjected to the compression processing is associated with the feature point, posture, or motion of the target object shown in the teacher image.

In another aspect, a learning device comprises:
a compression processing unit that compresses each pixel value in the region of a target object shown in a teacher image into a first gradation range within the full gradation range that represents the teacher image; and
a first learning processing unit that applies learning processing to a first discriminator model using first learning data in which the teacher image subjected to the compression processing is associated with a feature point, posture, or motion of the target object shown in the teacher image.

In another aspect, an estimation method comprises:
compressing each pixel value in the region of a target object shown in an image into a first gradation range within the full gradation range that represents the image; and
applying image analysis using a trained first discriminator model to the image subjected to the compression processing, and estimating a feature point, posture, or motion of the target object,
wherein the first discriminator model has been trained using first learning data in which a teacher image subjected to the compression processing is associated with the feature point, posture, or motion of the target object shown in the teacher image.

In another aspect, a learning method comprises:
compressing each pixel value in the region of a target object shown in a teacher image into a first gradation range within the full gradation range that represents the teacher image; and
applying learning processing to a first discriminator model using first learning data in which the teacher image subjected to the compression processing is associated with a feature point, posture, or motion of the target object shown in the teacher image.

In another aspect, an estimation program causes a computer to execute:
a process of compressing each pixel value in the region of a target object shown in an image into a first gradation range within the full gradation range that represents the image; and
a process of applying image analysis using a trained first discriminator model to the image subjected to the compression processing, and estimating a feature point, posture, or motion of the target object,
wherein the first discriminator model has been trained using first learning data in which a teacher image subjected to the compression processing is associated with the feature point, posture, or motion of the target object shown in the teacher image.

In another aspect, a learning program causes a computer to execute:
a process of compressing each pixel value in the region of a target object shown in a teacher image into a first gradation range within the full gradation range that represents the teacher image; and
a process of applying learning processing to a first discriminator model using first learning data in which the teacher image subjected to the compression processing is associated with a feature point, posture, or motion of the target object shown in the teacher image.

According to the present disclosure, the learning time required for training can be reduced.

FIG. 1 is a block diagram showing the overall configuration of an image recognition system according to a first embodiment.
FIG. 2 is a diagram showing the configuration of a general convolutional neural network.
FIG. 3 is a diagram showing an example of the hardware configuration of the learning device and the estimation device according to the first embodiment.
FIG. 4 is a diagram showing an example of a human region in an image detected by the region detection unit according to the first embodiment.
FIG. 5 is a flowchart showing an example of the learning processing for the second discriminator according to the first embodiment.
FIG. 6 is a diagram showing an example of the feature points and posture estimated by the estimation unit according to the first embodiment.
FIG. 7 is a flowchart showing an example of the learning processing for the first discriminator according to the first embodiment.
FIG. 8 is a flowchart showing an example of the processing executed when the estimation device according to the first embodiment identifies an image.
FIG. 9 is a diagram explaining the configuration of the compression processing unit according to a second embodiment.

Preferred embodiments of the present disclosure are described in detail below with reference to the accompanying drawings. In this specification and the drawings, components having substantially the same function are denoted by the same reference numerals, and redundant description is omitted.

(First embodiment)
[Overall configuration of image recognition system]
An example of the overall configuration of the image recognition system U according to the first embodiment is described below with reference to FIGS. 1 to 3.

FIG. 1 is a block diagram showing the overall configuration of the image recognition system U according to the present embodiment. The arrows in FIG. 1 represent the processing flow of each function and the flow of data D1 to D5.

The image recognition system U according to the present embodiment is, for example, a system that applies image processing to image data generated by an imaging device and estimates the joint positions, posture, motion, or the like of a person shown in the image (hereinafter abbreviated as “the posture and the like of a person”).

The image recognition system U according to the present embodiment includes a learning device 10 that applies learning processing to a first discriminator model Dm1 (hereinafter abbreviated as “first discriminator Dm1”) and a second discriminator model Dm2 (hereinafter abbreviated as “second discriminator Dm2”), and an estimation device 20 that estimates the posture and the like of a person using the first discriminator Dm1 and the second discriminator Dm2 trained by the learning device 10.

In the image recognition system U according to the present embodiment, the learning device 10 first performs machine learning of the second discriminator Dm2 in the second learning processing unit 11 (phase T1), and then, using that second discriminator Dm2, performs machine learning of the first discriminator Dm1 in the first learning processing unit 14 (phase T2). The estimation device 20 then acquires from the learning device 10 the model data of the trained second discriminator Dm2 and the trained first discriminator Dm1, and executes identification processing (phase T3). The phases T1, T2, and T3 are typically executed separately.

The learning device 10 includes a second learning processing unit 11 that executes the learning processing of phase T1, and a region detection unit 12, a compression processing unit 13, and a first learning processing unit 14 that execute the learning processing of phase T2.

The estimation device 20 includes an input unit 21 that acquires image data generated by an imaging device or the like, a region detection unit 22 that uses the trained second discriminator Dm2 to detect the region of the image in which the target object appears, a compression processing unit 23 that compresses the pixel value of each pixel region in that region into a predetermined gradation range, an estimation unit 24 that estimates the posture and the like of a person using the trained first discriminator Dm1, and an output unit 25.

More preferably, convolutional neural networks, which offer high identification performance and are robust to changes in the image, are used as the first discriminator Dm1 and the second discriminator Dm2. The model data of the first discriminator Dm1 and of the second discriminator Dm2 shown in FIG. 1 includes, for example, data on the structure of the input layer, intermediate layers, and output layer of the convolutional neural network, and data on the weighting coefficients and biases within the network.

FIG. 2 is a diagram showing the configuration of a general convolutional neural network.

The convolutional neural network has a feature extraction unit Ms and an identification unit Mt. The feature extraction unit Ms extracts image features from the input image, and the identification unit Mt identifies the target object from those image features.

The feature extraction unit Ms consists of a plurality of feature extraction layers Ms1, Ms2, ... connected hierarchically. Each feature extraction layer Ms1, Ms2, ... includes a convolution layer, an activation layer, and a pooling layer.

The first feature extraction layer Ms1 scans the input image in units of a predetermined size by raster scanning. It then extracts the features contained in the input image by applying feature extraction processing with the convolution layer, the activation layer, and the pooling layer to the scanned data. The first feature extraction layer Ms1 extracts relatively simple, individual features, such as a linear feature extending horizontally or a linear feature extending diagonally.
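The processing of a single feature extraction layer can be sketched as follows. This is a minimal illustration, not the implementation in the disclosure: the 3x3 kernel values, the use of ReLU as the activation, and 2x2 max pooling are all assumptions chosen to show how a horizontal line feature is picked up.

```python
# Minimal sketch of one feature extraction layer (convolution, activation, pooling).
# The kernel, ReLU activation, and 2x2 max pooling are illustrative assumptions.

def convolve2d(image, kernel):
    """Slide the kernel over the image in raster order (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[y + i][x + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]

def relu(feature_map):
    """Activation layer: clamp negative responses to zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]

def max_pool2x2(feature_map):
    """Pooling layer: keep the strongest response in each 2x2 block."""
    h = len(feature_map) // 2 * 2
    w = len(feature_map[0]) // 2 * 2
    return [
        [max(feature_map[y][x], feature_map[y][x + 1],
             feature_map[y + 1][x], feature_map[y + 1][x + 1])
         for x in range(0, w, 2)]
        for y in range(0, h, 2)
    ]

def feature_extraction_layer(image, kernel):
    return max_pool2x2(relu(convolve2d(image, kernel)))

# Hypothetical kernel that responds to horizontal lines.
HORIZONTAL_LINE = [[-1.0, -1.0, -1.0],
                   [2.0, 2.0, 2.0],
                   [-1.0, -1.0, -1.0]]

# 6x6 test image containing one horizontal line in row 2.
image = [[0] * 6, [0] * 6, [1] * 6, [0] * 6, [0] * 6, [0] * 6]

features = feature_extraction_layer(image, HORIZONTAL_LINE)
```

The resulting 2x2 feature map is non-zero only in its top row, which is where the line falls after pooling.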

The second feature extraction layer Ms2 scans the image input from the preceding feature extraction layer Ms1 (also called a feature map) in units of a predetermined size by, for example, raster scanning. It likewise extracts the features contained in its input by applying feature extraction processing with the convolution layer, the activation layer, and the pooling layer to the scanned data. By integrating the multiple features extracted by the first feature extraction layer Ms1 while taking their positional relationships into account, the second feature extraction layer Ms2 extracts higher-order composite features.

The feature extraction layers after the second layer (not shown) perform the same processing as the second feature extraction layer Ms2. The output of the final feature extraction layer (the values within the plurality of feature maps) is input to the identification unit Mt.

The identification unit Mt is, for example, a multilayer perceptron in which a plurality of fully connected layers are connected hierarchically. The fully connected layer on the input side of the identification unit Mt is fully connected to every value in the plurality of feature maps obtained from the feature extraction unit Ms, performs a product-sum operation on those values with a different weighting coefficient for each, and outputs the result. The fully connected layer in the next level of the identification unit Mt is fully connected to the values output by the elements of the preceding fully connected layer, and performs a product-sum operation on those values with a different weighting coefficient for each.
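The product-sum operation of stacked fully connected layers can be sketched as below. The weights, biases, and layer sizes are made-up values for illustration, and activation functions between the layers are omitted for brevity.

```python
# Minimal sketch of two stacked fully connected layers.
# All weights, biases, and sizes are hypothetical illustration values.

def flatten(feature_maps):
    """Turn a list of 2D feature maps into one flat list of values."""
    return [v for fmap in feature_maps for row in fmap for v in row]

def fully_connected(inputs, weights, biases):
    """Each output element is a weighted sum over ALL inputs plus a bias."""
    return [
        sum(w * x for w, x in zip(weight_row, inputs)) + b
        for weight_row, b in zip(weights, biases)
    ]

# One 2x2 feature map standing in for the output of the feature extraction unit.
feature_maps = [[[1.0, 2.0], [3.0, 4.0]]]
inputs = flatten(feature_maps)

# Input-side layer: 4 inputs, 2 output elements.
hidden = fully_connected(
    inputs,
    weights=[[1.0, 0.0, 0.0, 0.0], [0.5, 0.5, 0.5, 0.5]],
    biases=[0.0, 1.0],
)

# Next layer: connected to every element of the previous layer.
output = fully_connected(hidden, weights=[[1.0, -1.0]], biases=[0.5])
```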

The identification unit Mt applies, for example, a softmax function to the output values of the elements in the output layer of the multilayer perceptron, and outputs the identification result such that, among the plurality of classes, the class to which the identification target belongs has a large value from the product-sum operation.
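As a rough illustration of this output stage, the sketch below applies a softmax function to output-layer values and picks the class with the largest probability. The class names and the raw output values are hypothetical and do not come from the disclosure.

```python
import math

def softmax(values):
    """Convert raw output-layer values into probabilities that sum to 1."""
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def identify(values, class_names):
    """Return the most probable class and its probability."""
    probs = softmax(values)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]

# Hypothetical posture classes and raw output-layer values.
class_names = ["standing", "sitting", "lying"]
raw_outputs = [2.0, 0.5, -1.0]

probabilities = softmax(raw_outputs)
label, confidence = identify(raw_outputs, class_names)
```

Because softmax is monotonic, the class with the largest raw output value is also the class with the largest probability.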

The convolutional neural network functions as described above once its network parameters (for example, the weighting coefficients and biases of the convolution layers and of the fully connected layers) have been adjusted by learning processing using, for example, teacher images labeled with the correct class.

The algorithm for the arithmetic processing in a convolutional neural network is the same as in known methods (see, for example, Non-Patent Document 1), so its description is omitted here.

To improve the identification accuracy of such a general convolutional neural network, the network as a whole must become larger, and the amount of learning needed to optimize it also becomes enormous.

From this viewpoint, the image recognition system U according to the present embodiment separates the second discriminator Dm2, which detects the region of a person or the like, from the first discriminator Dm1, which estimates the posture of a person or the like. It executes, in order, the region detection unit 22 that detects the region of a person or the like using the trained second discriminator Dm2, the compression processing unit 23 that compresses the pixel value of each pixel region in the human region R1, and the estimation unit 24 that estimates the posture and the like of a person using the trained first discriminator Dm1.
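The ordering of the three stages can be sketched as a simple pipeline. The function `run_estimation` and the three toy stand-ins below are hypothetical names introduced only for illustration; in the system described here, the detection and estimation stages would wrap the trained discriminators Dm2 and Dm1 rather than these placeholders.

```python
# Minimal sketch of the stage ordering: region detection, compression, estimation.
# All function names and toy behaviors are assumptions for illustration only.

def run_estimation(image, detect_region, compress_region, estimate):
    """Apply the three stages in order and return the estimate."""
    region_mask = detect_region(image)                # role of region detection unit 22
    compressed = compress_region(image, region_mask)  # role of compression processing unit 23
    return estimate(compressed)                       # role of estimation unit 24

trace = []  # records the order in which the stages run

def toy_detect(image):
    trace.append("detect")
    # Placeholder: treat every non-zero pixel as part of the region.
    return [[value > 0 for value in row] for row in image]

def toy_compress(image, mask):
    trace.append("compress")
    # Placeholder: squeeze masked pixels into 200..255, leave the rest unchanged.
    return [[200 + value * 55 // 255 if inside else value
             for value, inside in zip(row, mask_row)]
            for row, mask_row in zip(image, mask)]

def toy_estimate(image):
    trace.append("estimate")
    # Placeholder: report the brightest pixel as a "feature point" (row, column).
    best = max((value, y, x)
               for y, row in enumerate(image)
               for x, value in enumerate(row))
    return best[1], best[2]

result = run_estimation([[0, 0, 0], [0, 100, 0], [0, 255, 0]],
                        toy_detect, toy_compress, toy_estimate)
```

Passing the stages in as callables keeps the two discriminators independent, which mirrors the separation of Dm2 and Dm1 described above.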

However, the convolutional neural networks constituting the first discriminator Dm1 and the second discriminator Dm2 are not limited to the structure shown in FIG. 2, and various known structures may be applied. For example, in the convolutional neural network constituting the second discriminator Dm2, which is used for segmentation, a convolution layer of the same size as the input image may be used as the identification unit Mt in place of the multilayer perceptron.

The first discriminator Dm1 and the second discriminator Dm2 are also not limited to convolutional neural networks; an SVM (Support Vector Machine), a Bayes classifier, or another discriminator may be used. The first discriminator Dm1 and the second discriminator Dm2 may be discriminators of different types. In addition, an ensemble model may be used as the first discriminator Dm1 or the second discriminator Dm2, a plurality of types of discriminators may be combined, or a discriminator may be combined with a preprocessing unit that performs region division processing, color division processing, or the like.

FIG. 3 is a diagram showing an example of the hardware configuration of the learning device 10 and the estimation device 20.

The learning device 10 and the estimation device 20 are each a computer whose main components include a CPU (Central Processing Unit) 101, a ROM (Read Only Memory) 102, a RAM (Random Access Memory) 103, an external storage device (for example, flash memory) 104, and a communication interface 105.

The functions of the learning device 10 and the estimation device 20 are realized, for example, by the CPU 101 referring to control programs (for example, processing programs) and various data stored in the ROM 102, the RAM 103, the external storage device 104, and the like. However, some or all of the functions may be realized by processing in a DSP (Digital Signal Processor) instead of, or together with, processing by the CPU. Likewise, some or all of the functions may be realized by processing in a dedicated hardware circuit instead of, or together with, processing by software. The learning device 10 and the estimation device 20 are communicatively connected via their respective communication interfaces 105.

[Details of configuration of estimation device and learning device]
Next, the components of the estimation device 20 and the learning device 10 according to the present embodiment are described in detail with reference to FIGS. 4 to 8. For convenience, the description below centers on the configuration of the estimation device 20.

<Configuration of input unit 21>
The input unit 21 acquires, for example from an imaging device, image data D1 generated by that imaging device (hereinafter abbreviated as “input image D1”). The input unit 21 may instead be configured to acquire image data D1 stored in the external storage device 104 or image data D1 provided via an Internet connection or the like.

The input unit 21 may perform preprocessing such as converting the image to a predetermined size or aspect ratio, or color division processing.

<Configuration of region detection unit 22>
The region detection unit 22 acquires the input image D1 from the input unit 21. It then applies image analysis using the trained second discriminator Dm2 to the input image D1 and detects the region of the image in which the target object appears (here, the region in which a person appears; hereinafter referred to as the “human region”).

FIG. 4 is a diagram showing an example of a human region in an image detected by the region detection unit 22.

In FIG. 4, Rall denotes the entire image area, R1 denotes the human region within the image, and R2 denotes the region of the image other than the human region R1 (hereinafter referred to as the “surrounding region”).

The convolutional neural network constituting the second discriminator Dm2 includes, for example, an output element for each pixel region that outputs the class to which that pixel region of the input image belongs (here, the person class or the non-person class).

The second discriminator Dm2 has been trained by the second learning processing unit 11 so that it distinguishes, within the input image, the pixel regions belonging to the human region R1 from those belonging to the surrounding region R2, and outputs the result.

FIG. 5 is a flowchart showing an example of the learning processing (learning phase T1) for the second discriminator Dm2.

In step S11, the learning device 10 first generates the second learning data Dt2. For example, the learning device 10 detects the human region R1 in a teacher image using a background difference method, template matching, HOG (Histograms of Oriented Gradients) features, or the like, and generates the second learning data Dt2 by associating that human region R1 with the teacher image.

The second learning data Dt2 is a data set in which a teacher image is associated with correct answer data on the human region R1 in that teacher image. The second learning data Dt2 contains a plurality of such data sets.
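One way to picture the generation of the second learning data in step S11 is the sketch below, which uses a simple background difference to label the human region. The names `TrainingSample`, `background_difference_mask`, and `build_learning_data`, as well as the threshold of 30, are assumptions for illustration; the disclosure names background difference only as one of several possible detection methods.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TrainingSample:
    """One data set: a teacher image paired with its region labels (hypothetical type)."""
    teacher_image: List[List[int]]
    region_mask: List[List[int]]  # 1 = human region R1, 0 = surrounding region R2

def background_difference_mask(frame, background, threshold=30):
    """Label a pixel 1 when it differs from the background by more than the threshold."""
    return [
        [1 if abs(f - b) > threshold else 0 for f, b in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]

def build_learning_data(frames, background, threshold=30):
    """Associate every teacher image with the region detected in it."""
    return [
        TrainingSample(frame, background_difference_mask(frame, background, threshold))
        for frame in frames
    ]

# Tiny grayscale example: a bright figure in the middle column.
background = [[10, 10, 10], [10, 10, 10]]
frames = [[[10, 200, 12], [11, 180, 10]]]

learning_data = build_learning_data(frames, background)
```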

In step S12, the learning device 10 (second learning processing unit 11) applies learning processing to the second discriminator Dm2 using the second learning data Dt2.

In step S12, the second learning processing unit 11, for example, inputs a teacher image of the second learning data Dt2 to the input elements of the second discriminator Dm2, sets the correct answer data on the human region R1 from the second learning data Dt2 at the output elements of the second discriminator Dm2 (for example, “1” for the human region R1 and “0” elsewhere), and optimizes the network parameters (weighting coefficients, biases, and the like) of the second discriminator Dm2. In doing so, the second learning processing unit 11 uses, for example, cross entropy as the loss function and optimizes the network parameters by a known method such as error backpropagation so that the loss function is minimized.
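The cross-entropy loss used in this step can be illustrated for the two-class, per-pixel case as follows. This sketch only evaluates the loss for given predictions; the backpropagation update itself is not shown, and the predicted probabilities are made-up values.

```python
import math

def pixelwise_cross_entropy(predicted, target, eps=1e-12):
    """Mean binary cross entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    count = 0
    for pred_row, target_row in zip(predicted, target):
        for p, t in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
            count += 1
    return total / count

# Correct answer data: 1 for the human region R1, 0 elsewhere.
target = [[1, 0], [0, 1]]

# Hypothetical network outputs (probability of the person class per pixel).
good_prediction = [[0.9, 0.1], [0.2, 0.8]]
uninformed_prediction = [[0.5, 0.5], [0.5, 0.5]]

good_loss = pixelwise_cross_entropy(good_prediction, target)
uninformed_loss = pixelwise_cross_entropy(uninformed_prediction, target)
```

A prediction closer to the correct answer data yields a smaller loss, which is the quantity the optimization drives down.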

In step S13, the learning device 10 stores the network parameters of the trained second discriminator Dm2 in, for example, the external storage device 104, supplies them to the region detection unit 12 and the region detection unit 22, and ends the series of processes.

領域検出部２２は、このようにして、学習処理が施された第２の識別器Ｄｍ２を用いて入力画像Ｄ１に対して画像解析（例えば、畳み込みニューラルネットワークの順伝搬処理）を施して、入力画像Ｄ１から人領域Ｒ１を検出する。 In this way, the region detection unit 22 performs image analysis (for example, forward propagation processing of a convolutional neural network) on the input image D1 using the second discriminator Dm2 that has undergone the learning processing, and detects the human region R1 in the input image D1.

＜圧縮処理部２３の構成について＞ <Configuration of Compression Processing Unit 23>
圧縮処理部２３は、領域検出部２２から入力画像Ｄ１、及び当該入力画像Ｄ１内において検出された人領域Ｒ１に係るデータＤ２を取得する。そして、圧縮処理部２３は、入力画像Ｄ１内における人領域Ｒ１の各画素値を、画像を表現する全階調域のうちの所定の階調範囲内に圧縮する。即ち、圧縮処理部２３は、人領域Ｒ１の色のバリエーションを低減する。 The compression processing unit 23 acquires, from the region detection unit 22, the input image D1 and data D2 on the human region R1 detected in the input image D1. The compression processing unit 23 then compresses each pixel value of the human region R1 in the input image D1 into a predetermined gradation range within the entire gradation range expressing the image. That is, the compression processing unit 23 reduces the color variation of the human region R1.

本発明の発明者等は、識別精度の向上、識別器の小型化、及び学習効率の向上等の観点から鋭意検討し、服装等の色のバリエーションが、人物の姿勢等を識別する際には過剰データとなっており、識別器の識別精度の悪化、及び識別器の学習効率の悪化の要因となりやすい、という知見を得て、当該圧縮処理部２３を導入するに到った。 The inventors of the present invention studied intensively from the viewpoints of improving identification accuracy, downsizing the discriminator, improving learning efficiency, and so on. They found that color variations such as those of clothing are excess data when identifying a person's posture and the like, and tend to degrade both the identification accuracy and the learning efficiency of the discriminator. This finding led to the introduction of the compression processing unit 23.

圧縮処理部２３は、例えば、２５６階調で表現された人領域Ｒ１における画素値（ここでは、０〜２５５のいずれかの値）を、５分の１程度の階調範囲２００〜２５５内に圧縮する。圧縮処理部２３が行う圧縮処理は、典型的には、各画素領域間における画素値の大小関係を維持したまま、画素値の差を低減する処理である。これによって、人物の姿勢等を識別するために必要となる人領域Ｒ１の主要部（例えば、人体の各部位）の位置関係の把握を可能としながら、人領域Ｒ１に含まれる過剰な色情報を圧縮する。 For example, the compression processing unit 23 compresses the pixel values in the human region R1 expressed in 256 gradations (here, any value from 0 to 255) into the gradation range 200 to 255, which is about one fifth of the full range. The compression processing performed by the compression processing unit 23 is typically processing that reduces the differences between pixel values while maintaining the magnitude relationship of the pixel values between the pixel regions. This compresses the excess color information contained in the human region R1 while still allowing the positional relationship of the main parts of the human region R1 (for example, each part of the human body), which is needed to identify the person's posture and the like, to be grasped.

この際、圧縮後の階調範囲は、より好適には、全階調域のうちの上限側の３分の１の階調範囲よりも狭い範囲（例えば、白色側に相当する２００〜２５５の階調範囲）、又は、下限側の３分の１の階調範囲よりも狭い範囲（例えば、黒色側に相当する０〜５０の階調範囲）のいずれかに設定される。これによって、人領域Ｒ１における画素値が、周囲領域Ｒ２に含まれる画素値から分離され、後段の推定部２４において、人物の姿勢等を識別する際に、人領域Ｒ１の画像に加えて、周囲領域Ｒ２の画像も有効に利用することが可能となる。 In this case, the gradation range after compression is more preferably set to either a range narrower than the upper third of the entire gradation range (for example, the gradation range 200 to 255 on the white side) or a range narrower than the lower third of the entire gradation range (for example, the gradation range 0 to 50 on the black side). As a result, the pixel values in the human region R1 are separated from the pixel values contained in the surrounding region R2, so that the estimation unit 24 in the subsequent stage can effectively use the image of the surrounding region R2 in addition to the image of the human region R1 when identifying the person's posture and the like.

圧縮処理部２３は、例えば、以下の式（１）を用いて、当該人領域Ｒ１の画素値の圧縮処理を行う。 The compression processing unit 23 performs compression processing on the pixel values of the human region R1 using, for example, the following equation (1).
Y = (X - x_min) × (max - min) / (x_max - x_min) + min … 式（１） (1)
（但し、X：入力される画素領域の画素値、Y：出力する圧縮後の画素値、 (where X: pixel value of the input pixel region, Y: output pixel value after compression,
max：圧縮後の階調範囲で表現し得る最大画素値（ここでは２５５）、 max: maximum pixel value that can be expressed in the gradation range after compression (here, 255),
min：圧縮後の階調範囲で表現し得る最小画素値（ここでは２００）、 min: minimum pixel value that can be expressed in the gradation range after compression (here, 200),
x_max：圧縮前の画像が表現し得る全階調域の最大画素値（ここでは２５５）、 x_max: maximum pixel value of the entire gradation range that the image before compression can express (here, 255),
x_min：圧縮前の画像が表現し得る全階調域の最小画素値（ここでは０）） x_min: minimum pixel value of the entire gradation range that the image before compression can express (here, 0))

圧縮処理部２３は、人領域Ｒ１の各画素領域の画素値を、式（１）のＸに対して入力し、圧縮後の画素値Ｙに変換する。これによって、２５６階調で表現された人領域Ｒ１における画素値を、各画素領域間における画素値の大小関係を維持したまま、階調範囲２００〜２５５内に圧縮することができる。 The compression processing unit 23 inputs the pixel value of each pixel area of the human area R1 to X in Expression (1), and converts it into a compressed pixel value Y. As a result, the pixel values in the human region R1 expressed in 256 gradations can be compressed within the gradation range 200 to 255 while maintaining the magnitude relationship of the pixel values between the pixel regions.
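Equation (1) can be written as a short function. This is a sketch only: the function and parameter names are illustrative, the defaults follow the example values in the text (0 to 255 mapped to 200 to 255), and leaving the result as a float rather than rounding it is an assumption.

```python
def compress_linear(x, out_min=200, out_max=255, in_min=0, in_max=255):
    """Equation (1): linearly map x from [in_min, in_max] to [out_min, out_max].

    x       -- input pixel value X
    out_min -- 'min' in equation (1), smallest value of the compressed range
    out_max -- 'max' in equation (1), largest value of the compressed range
    in_min  -- 'x_min', smallest value of the full gradation range
    in_max  -- 'x_max', largest value of the full gradation range
    """
    return (x - in_min) * (out_max - out_min) / (in_max - in_min) + out_min
```

Because the mapping is linear with a positive slope, a darker input pixel always stays darker than a brighter one after compression, which is the order-preserving property the paragraph describes.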

尚、圧縮処理部２３の圧縮処理は、式（１）以外の手法を用いてもよく、例えば、以下の式（２）を用いて、当該圧縮処理を行ってもよい。 Note that the compression processing of the compression processing unit 23 may use a method other than equation (1); for example, the compression processing may be performed using the following equation (2).
Y = {(X - μ) / σ} × (max - min) / 2 + (min + max) / 2 … 式（２） (2)
（但し、X：入力される画素領域の画素値、Y：出力する圧縮後の画素値、 (where X: pixel value of the input pixel region, Y: output pixel value after compression,
μ：人領域Ｒ１の画素値の平均値、 μ: mean of the pixel values in the human region R1,
σ：人領域Ｒ１の画素値の標準偏差、 σ: standard deviation of the pixel values in the human region R1,
max：圧縮後の階調範囲で表現し得る最大画素値（ここでは２５５）、 max: maximum pixel value that can be expressed in the gradation range after compression (here, 255),
min：圧縮後の階調範囲で表現し得る最小画素値（ここでは２００）） min: minimum pixel value that can be expressed in the gradation range after compression (here, 200))

式（２）によれば、人領域Ｒ１の画素値の平均値を０、分散を１にする正規化処理を行った上で、２５６階調で表現された人領域Ｒ１における画素値を、各画素領域間における画素値の大小関係を維持したまま、階調範囲２００〜２５５内に圧縮することができる。 According to equation (2), after normalization that sets the mean of the pixel values in the human region R1 to 0 and their variance to 1, the pixel values in the human region R1 expressed in 256 gradations can be compressed into the gradation range 200 to 255 while maintaining the magnitude relationship of the pixel values between the pixel regions.
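Equation (2) can be sketched in the same way. Taken literally, the formula places a pixel more than one standard deviation from the mean outside the range 200 to 255, so this sketch clips such values to the range limits. The clipping step, the use of the population standard deviation, and the handling of a uniform region are assumptions that the source does not state.

```python
import statistics

def compress_standardized(pixels, out_min=200, out_max=255):
    """Equation (2): standardize the pixels of one region, then rescale.

    pixels -- flat list of pixel values from the human region R1
    Returns a list of compressed values in [out_min, out_max].
    """
    mu = statistics.fmean(pixels)
    sigma = statistics.pstdev(pixels)      # population standard deviation (assumption)
    center = (out_min + out_max) / 2
    half_width = (out_max - out_min) / 2
    if sigma == 0:                         # uniform region: every pixel maps to the center
        return [center] * len(pixels)
    compressed = []
    for x in pixels:
        y = (x - mu) / sigma * half_width + center
        compressed.append(min(max(y, out_min), out_max))  # clipping is an added assumption
    return compressed
```

With clipping, the ordering of pixel values is preserved only weakly: values beyond one standard deviation on the same side of the mean collapse to the same limit.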

尚、本実施形態に係る圧縮処理部２３は、周囲領域Ｒ２の色情報の過度の低減を避けるため、周囲領域Ｒ２の画像については上記の圧縮処理を施さない構成としているが、周囲領域Ｒ２の画像についても、同様に、上記の圧縮処理を施してもよい（例えば、後述する第２の実施形態を参照）。 Note that the compression processing unit 23 according to the present embodiment is configured not to perform the above-described compression processing on the image of the surrounding area R2 in order to avoid excessive reduction of the color information of the surrounding area R2. Similarly, the image may be subjected to the above-described compression processing (see, for example, a second embodiment described later).
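The region-selective behavior described above, in which only the human region R1 is compressed and the surrounding region R2 is left untouched, can be sketched as follows. The grayscale nested-list image, the 0/1 mask representation, and the rounding to integers are assumptions made for illustration.

```python
def compress_region(image, mask, out_min=200, out_max=255, in_min=0, in_max=255):
    """Apply equation (1) only where mask is 1; copy other pixels unchanged.

    image -- nested list of grayscale pixel values
    mask  -- nested list of the same shape, 1 inside the human region R1, 0 elsewhere
    """
    scale = (out_max - out_min) / (in_max - in_min)
    result = []
    for image_row, mask_row in zip(image, mask):
        new_row = []
        for value, inside in zip(image_row, mask_row):
            if inside:
                new_row.append(round((value - in_min) * scale + out_min))
            else:
                new_row.append(value)  # surrounding region R2 keeps its original value
        result.append(new_row)
    return result
```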

＜推定部２４の構成について＞ <Configuration of Estimation Unit 24>
推定部２４は、入力画像Ｄ１に対して、圧縮処理部２３にて圧縮処理が施された画像データＤ３を取得する。そして、推定部２４は、学習済みの第１の識別器Ｄｍ１を用いた画像解析を施して、対象物体（ここでは、人物）の特徴点、姿勢又は動作を推定する。 The estimation unit 24 acquires image data D3 obtained by applying the compression processing of the compression processing unit 23 to the input image D1. The estimation unit 24 then performs image analysis using the learned first discriminator Dm1 and estimates the feature points, posture, or motion of the target object (here, a person).

図６は、推定部２４が推定する人物の特徴点及び姿勢の一例を示す図である。尚、図６は、図４に対応する図である。 FIG. 6 is a diagram illustrating an example of a person's feature points and posture estimated by the estimation unit 24. FIG. 6 corresponds to FIG. 4.

推定部２４が推定する特徴点、姿勢又は動作としては、典型的には、人体の関節位置、人体の各部位の位置（例えば、頭部の位置、足部の位置等）、人体の姿勢の種別（例えば、立ち上がろうとした状態、前屈みの状態等）、人体の動作の種別（例えば、物を取ろうとした状態、読書している状態等）、又はこれらの時間的変化等である。尚、推定部２４が出力するデータ形式は、種別形式、座標形式、又は部位間の相対位置等、任意である。 As the feature points, postures, or movements estimated by the estimation unit 24, typically, the joint position of the human body, the position of each part of the human body (for example, the position of the head, the position of the foot), and the posture of the human body The type (for example, the state of standing up, the state of bending forward), the type of movement of the human body (for example, the state of trying to pick up an object, the state of reading, etc.), or temporal changes thereof. The data format output by the estimation unit 24 is arbitrary, such as a type format, a coordinate format, or a relative position between parts.

図６には、推定部２４が推定する特徴点と姿勢の一例として、右足首ｐ０、右膝ｐ１、右腰ｐ２、左腰ｐ３、左膝ｐ４、左足首ｐ５、右手首ｐ６、右肘ｐ７、右肩ｐ８、左肩ｐ９、左肘ｐ１０、左手首ｐ１１、のどｐ１２、及び頭頂部ｐ１３の人体の関節位置、並びに、「姿勢クラスＮｏ：１（例えば、立位）」等の人体の姿勢クラスを示している。 As an example of the feature points and posture estimated by the estimation unit 24, FIG. 6 shows the joint positions of the human body, namely the right ankle p0, right knee p1, right hip p2, left hip p3, left knee p4, left ankle p5, right wrist p6, right elbow p7, right shoulder p8, left shoulder p9, left elbow p10, left wrist p11, throat p12, and top of the head p13, together with a posture class of the human body such as “posture class No: 1 (for example, standing)”.

第１の識別器Ｄｍ１を構成する畳み込みニューラルネットワークは、例えば、入力される画像に対して、人体の関節位置ｐ０〜ｐ１３それぞれの座標（Ｘ座標、Ｙ座標）を出力する出力素子、及び、人体の姿勢クラスを出力する出力素子を含んで構成される。 The convolutional neural network constituting the first discriminator Dm1 includes, for example, output elements that output the coordinates (X coordinate, Y coordinate) of each of the joint positions p0 to p13 of the human body for the input image, and output elements that output the posture class of the human body.
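One way to read such an output layer is sketched below. The flat layout (14 coordinate pairs followed by one score per posture class), the English joint names, and the function name are illustrative assumptions; the source specifies only that coordinates for p0 to p13 and a posture class are output.

```python
# Joint order p0..p13 as listed for FIG. 6; the English names are illustrative.
JOINT_NAMES = [
    "right_ankle", "right_knee", "right_hip", "left_hip", "left_knee",
    "left_ankle", "right_wrist", "right_elbow", "right_shoulder",
    "left_shoulder", "left_elbow", "left_wrist", "throat", "head_top",
]

def decode_output(vector, num_classes):
    """Split a flat output vector into joint coordinates and a posture class index.

    Assumed layout: [x0, y0, x1, y1, ..., x13, y13, score_0, ..., score_{K-1}].
    """
    num_joints = len(JOINT_NAMES)
    expected = 2 * num_joints + num_classes
    if len(vector) != expected:
        raise ValueError(f"expected {expected} values, got {len(vector)}")
    joints = {
        name: (vector[2 * i], vector[2 * i + 1])
        for i, name in enumerate(JOINT_NAMES)
    }
    scores = vector[2 * num_joints:]
    # Pick the class with the highest score.
    posture_class = max(range(num_classes), key=lambda k: scores[k])
    return joints, posture_class
```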

第１の識別器Ｄｍ１は、図６に示すように、入力された画像に基づいて、人体の関節位置ｐ０〜ｐ１３それぞれの座標、及び、人体の姿勢クラスを出力するように、第１の学習処理部１４によって、学習処理が施されている。 As shown in FIG. 6, the first discriminator Dm1 performs the first learning so as to output the coordinates of the joint positions p0 to p13 of the human body and the posture class of the human body based on the input image. A learning process is performed by the processing unit 14.

図７は、第１の識別器Ｄｍ１に対する学習処理（学習フェーズＴ２）の一例を示すフローチャートである。 FIG. 7 is a flowchart illustrating an example of a learning process (learning phase T2) for the first discriminator Dm1.

ステップＳ２１において、学習装置１０は、まず、第１の学習データＤｔ１を生成する。学習装置１０は、例えば、手入力により、人物が映る教師画像に対して、当該人物の姿勢等（ここでは、人体の関節位置ｐ０〜ｐ１３それぞれの座標、及び、人体の姿勢クラス）が設定されることによって、第１の学習データＤｔ１を生成する。 In step S21, the learning device 10 first generates first learning data Dt1. For example, the learning device 10 sets the posture of the person (in this case, the coordinates of the joint positions p0 to p13 of the human body and the posture class of the human body) with respect to the teacher image in which the person is reflected by manual input. Thus, the first learning data Dt1 is generated.

尚、第１の学習データＤｔ１は、教師画像と、当該教師画像内に映る人の姿勢等（例えば、人体の関節位置ｐ０〜ｐ１３それぞれの座標、及び、人体の姿勢クラス）に係る正解データとが関連付けられたデータセットである。そして、第１の学習データＤｔ１は、かかるデータセットを複数有している。 The first learning data Dt1 includes the teacher image and correct data related to the posture of the person shown in the teacher image (for example, the coordinates of the joint positions p0 to p13 of the human body and the posture class of the human body). Is the associated data set. The first learning data Dt1 has a plurality of such data sets.

ステップＳ２２において、学習装置１０（領域検出部１２）は、学習済みの第２の識別器Ｄｍ２を用いて、画像の人領域Ｒ１を検出する。 In step S22, the learning device 10 (region detection unit 12) detects the human region R1 of the image using the learned second discriminator Dm2.

尚、このステップＳ２２を実行する領域検出部１２は、典型的には、推定装置２０の領域検出部２２と同一の構成となっている。つまり、このステップＳ２２において、領域検出部１２は、学習済みの第２の識別器Ｄｍ２を用いた画像解析（例えば、畳み込みニューラルネットワークの順伝搬処理）を施して、画像内における人領域Ｒ１を検出する。 Note that the region detection unit 12 that executes step S22 typically has the same configuration as the region detection unit 22 of the estimation device 20. That is, in step S22, the region detection unit 12 performs image analysis (for example, forward propagation processing of a convolutional neural network) using the learned second discriminator Dm2 and detects the human region R1 in the image.

ステップＳ２３において、学習装置１０（圧縮処理部１３）は、ステップＳ２２において推定した人領域Ｒ１の各画素領域の画素値を所定の階調範囲内に圧縮する処理を施す。 In step S23, the learning device 10 (compression processor 13) performs a process of compressing the pixel value of each pixel area of the human area R1 estimated in step S22 within a predetermined gradation range.

尚、このステップＳ２３を実行する圧縮処理部１３は、典型的には、推定装置２０の圧縮処理部２３と同一の構成となっている。つまり、このステップＳ２３において、圧縮処理部１３は、２５６階調で表現された人領域Ｒ１の各画素領域の画素値（ここでは、０〜２５５）を、各画素領域間における画素値の大小関係を維持したまま、階調範囲２００〜２５５内に圧縮する。 Note that the compression processing unit 13 that executes step S23 typically has the same configuration as the compression processing unit 23 of the estimation device 20. That is, in step S23, the compression processing unit 13 compresses the pixel value of each pixel region of the human region R1 expressed in 256 gradations (here, 0 to 255) into the gradation range 200 to 255 while maintaining the magnitude relationship of the pixel values between the pixel regions.

ステップＳ２４において、学習装置１０（第１の学習処理部１４）は、第１の学習データＤｔ１を用いて、第１の識別器Ｄｍ１に対する学習処理を施す。 In step S24, the learning device 10 (first learning processing unit 14) performs learning processing on the first discriminator Dm1 using the first learning data Dt1.

尚、このステップＳ２４においては、第１の学習処理部１４は、例えば、第１の識別器Ｄｍ１の入力素子に教師画像を入力すると共に、正解データとして、第１の識別器Ｄｍ１の出力素子に当該画像内に映る人物の姿勢等を設定して（例えば、人体の関節位置ｐ０〜ｐ１３それぞれの座標の正解値を設定すると共に、人体の姿勢クラスの正解クラスを「１」、それ以外のクラスを「０」と設定する）、第１の識別器Ｄｍ１のネットワークパラメータ（重み係数、及びバイアス等）の最適化を行う。この際、第１の学習処理部１４は、例えば、交差エントロピーを損失関数として用いて、公知の誤差逆伝播法等によって、損失関数が最小化するように、第１の識別器Ｄｍ１のネットワークパラメータ（重み係数、及びバイアス等）の最適化を行う。 In step S24, the first learning processing unit 14, for example, inputs a teacher image to the input elements of the first discriminator Dm1, sets the posture and the like of the person shown in that image at the output elements of the first discriminator Dm1 as correct data (for example, it sets the correct values of the coordinates of each of the joint positions p0 to p13 of the human body, and sets the correct posture class of the human body to “1” and all other classes to “0”), and optimizes the network parameters (weight coefficients, biases, etc.) of the first discriminator Dm1. In doing so, the first learning processing unit 14 uses, for example, cross entropy as the loss function and optimizes the network parameters (weight coefficients, biases, etc.) of the first discriminator Dm1 by a known method such as error backpropagation so that the loss function is minimized.
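The posture-class part of the correct data in step S24 is a one-hot vector, and the cross entropy against it can be sketched as below. Converting raw output scores to probabilities with softmax is an assumption, as are the function names; the source names only the cross entropy loss and the 1/0 class labels. The loss applied to the joint coordinates is not detailed in this passage and is left out.

```python
import math

def one_hot(index, num_classes):
    """Correct-data vector: 1 for the correct posture class, 0 for all others."""
    return [1 if k == index else 0 for k in range(num_classes)]

def softmax(scores):
    """Turn raw class scores into probabilities (assumed output activation)."""
    peak = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probabilities, target, eps=1e-12):
    """Cross entropy between predicted class probabilities and a one-hot target."""
    return -sum(t * math.log(max(p, eps)) for p, t in zip(probabilities, target))
```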

ステップＳ２５において、学習装置１０は、学習処理によって調整した第１の識別器Ｄｍ１のネットワークパラメータを記憶部（例えば、外部記憶装置１０４）に格納すると共に、推定装置２０の推定部２４に対して入力する。 In step S25, the learning device 10 stores the network parameters of the first discriminator Dm1 adjusted by the learning processing in a storage unit (for example, the external storage device 104) and inputs them to the estimation unit 24 of the estimation device 20.

推定部２４は、このようにして学習処理が施された第１の識別器Ｄｍ１を用いて、圧縮処理部２３から入力される画像Ｄ３に対して画像解析（例えば、畳み込みニューラルネットワークの順伝搬処理）を施して、当該画像Ｄ３内に映る人物の姿勢等を推定する。 Using the first discriminator Dm1 that has undergone the learning processing in this way, the estimation unit 24 performs image analysis (for example, forward propagation processing of a convolutional neural network) on the image D3 input from the compression processing unit 23 and estimates the posture and the like of the person shown in the image D3.

尚、推定部２４が第１の識別器Ｄｍ１に入力する画像Ｄ３は、人領域Ｒ１の画像のみであってもよいが、より好適には、周囲領域Ｒ２の少なくとも一部の画像も含むものとする。これによって、周囲領域Ｒ２の物体等との関係から、人物の姿勢等を推定することができるため、より人物の姿勢等の識別精度を向上させることができる。 The image D3 input to the first discriminator Dm1 by the estimation unit 24 may be only the image of the human region R1, but more preferably includes at least a part of the image of the surrounding region R2. Accordingly, since the posture of the person can be estimated from the relationship with the object or the like in the surrounding region R2, the identification accuracy of the posture of the person can be further improved.

＜出力部２５の構成について＞ <Configuration of Output Unit 25>
出力部２５は、推定部２４から出力される人物の姿勢等のデータＤ４を取得する。そして、出力部２５は、当該データＤ４を所定の画像形式に加工して、当該加工後のデータＤ５を表示装置等に出力する。 The output unit 25 acquires data D4, such as the posture of the person, output from the estimation unit 24. The output unit 25 then processes the data D4 into a predetermined image format and outputs the processed data D5 to a display device or the like.

この際、出力部２５は、例えば、推定部２４によって推定された関節位置ｐ０〜ｐ１３それぞれの位置、及び、クラス該当度が最大の人物の姿勢に係るクラスが、入力画像Ｄ１と重畳して表示されるように、表示装置等に出力する。 At this time, for example, the output unit 25 superimposes and displays the positions of the joint positions p0 to p13 estimated by the estimation unit 24 and the class related to the posture of the person with the highest class matching degree with the input image D1. Output to a display device or the like.

［推定装置の動作］ [Operation of the Estimation Device]
図８は、推定装置２０が画像内に映る人物の姿勢等を推定する際に実行する処理の一例を示すフローチャートである。 FIG. 8 is a flowchart illustrating an example of the processing that the estimation device 20 executes when estimating the posture and the like of a person shown in an image.

ステップＳ３１において、推定装置２０（入力部２１）は、画像データＤ１を取得すると共に、当該画像データの画像を所定の形状に加工する等の前処理を行う。 In step S31, the estimation apparatus 20 (input unit 21) acquires image data D1 and performs preprocessing such as processing an image of the image data into a predetermined shape.

ステップＳ３２において、推定装置２０（領域検出部２２）は、ステップＳ３１で取得した入力画像に対して第２の識別器Ｄｍ２を用いた画像解析を施して、当該入力画像Ｄ１内において人物が映る人領域Ｒ１を検出する。 In step S32, the estimation device 20 (region detection unit 22) performs image analysis using the second discriminator Dm2 on the input image acquired in step S31, and a person in which a person appears in the input image D1 A region R1 is detected.

ステップＳ３３において、推定装置２０（圧縮処理部２３）は、上記した式（１）等を用いて、ステップＳ３２で推定された入力画像の人領域Ｒ１における各画素領域の画素値を、所定の階調範囲内に圧縮する。 In step S33, the estimation device 20 (compression processing unit 23) uses equation (1) above or the like to compress the pixel value of each pixel region in the human region R1 of the input image estimated in step S32 into a predetermined gradation range.

ステップＳ３４において、推定装置２０（推定部２４）は、ステップＳ３３で圧縮処理が施された画像Ｄ３に対して第１の識別器Ｄｍ１を用いた画像解析を施して、当該入力画像内において人物の姿勢等を推定する。 In step S34, the estimation device 20 (estimation unit 24) performs image analysis using the first discriminator Dm1 on the image D3 that was compressed in step S33, and estimates the posture and the like of the person in the input image.

ステップＳ３５において、推定装置２０（出力部２５）は、ステップＳ３４で生成された人物の姿勢等を、所定の画像形式に変換して、表示装置等に出力する。 In step S35, the estimation device 20 (output unit 25) converts the posture of the person generated in step S34 into a predetermined image format and outputs the image to a display device or the like.
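The flow of steps S31 to S35 can be summarized as a small pipeline. This is a structural sketch only: each stage is passed in as a callable, and all function and parameter names are hypothetical stand-ins for the input unit 21, region detection unit 22, compression processing unit 23, estimation unit 24, and output unit 25.

```python
def run_estimation(raw_image, preprocess, detect_region, compress_region,
                   estimate_pose, render):
    """Run the five stages of FIG. 8 in order and return the rendered result."""
    image = preprocess(raw_image)               # S31: acquire and preprocess image D1
    mask = detect_region(image)                 # S32: detect human region R1 (Dm2)
    compressed = compress_region(image, mask)   # S33: compress pixel values in R1 -> D3
    pose = estimate_pose(compressed)            # S34: estimate posture etc. (Dm1) -> D4
    return render(image, pose)                  # S35: convert to output format -> D5
```

Passing the stages as arguments keeps the sketch independent of any particular network or image library, so the same driver would also fit a region detector that is not a neural network.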

［効果］ [Effects]
以上のように、本実施形態に係る推定装置２０は、画像Ｄ１内の対象物体が映る領域（本実施形態に係る人領域Ｒ１）における各画素領域の画素値を、画像Ｄ１を表現する全階調域（例えば、０〜２５５）のうちの第１の階調範囲（例えば、２００〜２５５）内に圧縮する圧縮処理部２３と、圧縮処理が施された画像Ｄ３に対して、学習済みの第１の識別器Ｄｍ１を用いた画像解析を施して、人物の姿勢等を推定する推定部２４と、を備えている。 As described above, the estimation device 20 according to the present embodiment includes: a compression processing unit 23 that compresses the pixel value of each pixel region in the region of the image D1 in which the target object appears (the human region R1 in the present embodiment) into a first gradation range (for example, 200 to 255) within the entire gradation range expressing the image D1 (for example, 0 to 255); and an estimation unit 24 that performs image analysis using the learned first discriminator Dm1 on the compressed image D3 and estimates the posture and the like of the person.

従って、本実施形態に係る推定装置２０によれば、服装等の色のバリエーションに影響を受けにくく、より対象物体の特徴点、姿勢又は動作等を識別する用に好適な第１の識別器Ｄｍ１を構築することが可能である。換言すると、これによって、第１の識別器Ｄｍ１の小型化、第１の識別器Ｄｍ１による識別精度の向上、及び、第１の識別器Ｄｍ１の学習効率の向上等が可能となる。又、これによって、第１の識別器Ｄｍ１に対して学習処理を施すための学習データ量を削減することもできる。 Therefore, the estimation device 20 according to the present embodiment makes it possible to construct a first discriminator Dm1 that is less affected by color variations such as those of clothing and is better suited to identifying the feature points, posture, motion, or the like of the target object. In other words, this makes it possible to downsize the first discriminator Dm1, improve its identification accuracy, improve its learning efficiency, and so on. It also makes it possible to reduce the amount of learning data needed to train the first discriminator Dm1.

又、本実施形態に係る推定装置２０は、領域検出部２２用の第２の識別器Ｄｍ２と、推定部２４用の第１の識別器Ｄｍ１と、を別個に設け、当該第２の識別器Ｄｍ２と第１の識別器Ｄｍ１を用いた二段階の画像解析によって、対象物体の姿勢等を識別する。これによって、機械学習の特性である画像の多様性に対する頑健性を最大限に活かすことができる。又、これによって、人領域Ｒ１の検出精度も向上するため、結果として、第１の識別器Ｄｍ１の識別精度及び第１の識別器Ｄｍ１の学習効率をより向上させることができる。 In addition, the estimation device 20 according to the present embodiment separately includes a second discriminator Dm2 for the region detection unit 22 and a first discriminator Dm1 for the estimation unit 24, and the second discriminator. The posture or the like of the target object is identified by two-stage image analysis using Dm2 and the first classifier Dm1. This makes it possible to maximize the robustness against image diversity, which is a characteristic of machine learning. This also improves the detection accuracy of the human region R1, and as a result, the identification accuracy of the first discriminator Dm1 and the learning efficiency of the first discriminator Dm1 can be further improved.

又、本実施形態に係る推定装置２０において、圧縮処理部２３は、画像Ｄ１内の対象物体が映る領域（本実施形態に係る人領域Ｒ１）における各画素領域の画素値を、全階調域のうちの上限側の階調範囲又は下限側の階調範囲に圧縮する。これによって、より第１の識別器Ｄｍ１の識別精度を向上させ、又、第１の識別器Ｄｍ１の学習効率を向上させることができる。又、これによって、人領域Ｒ１と周囲領域Ｒ２とのコントラストを高めることができるため、対象物体の姿勢等の識別精度をより向上させることができる。 Further, in the estimation device 20 according to the present embodiment, the compression processing unit 23 compresses the pixel value of each pixel region in the region of the image D1 in which the target object appears (the human region R1 in the present embodiment) into a gradation range on the upper-limit side or on the lower-limit side of the entire gradation range. This further improves the identification accuracy of the first discriminator Dm1 and also improves its learning efficiency. It also increases the contrast between the human region R1 and the surrounding region R2, which further improves the accuracy of identifying the posture and the like of the target object.

尚、上記実施形態では、第１の識別器Ｄｍ１にて識別する対象物体の一例として、人物全体の関節位置、姿勢又は動作を示した。しかしながら、本発明の第１の識別器Ｄｍ１が識別する対象物体は、人物の特定の部位（例えば、頭部又は腕部等）であってもよい。その場合、領域検出部２２にて画像内に映る当該人物の特定の部位の領域を検出し、圧縮処理部２３にて圧縮処理を施す対象領域を当該人物の特定の部位のみとする構成としてもよい。例えば、識別する対象物体が頭部の場合、圧縮処理を施す対象領域を頭髪部としてもよい。また例えば、識別する対象物体が腕部の場合、圧縮処理を施す対象領域を服装部としてもよい。 In the above embodiment, the joint positions, posture, or motion of the whole person were shown as an example of the target object identified by the first discriminator Dm1. However, the target object identified by the first discriminator Dm1 of the present invention may be a specific part of a person (for example, the head or an arm). In that case, the region detection unit 22 may detect the region of that specific part of the person in the image, and the compression processing unit 23 may apply the compression processing only to that specific part. For example, when the target object to be identified is the head, the region to be compressed may be the hair. Likewise, when the target object to be identified is an arm, the region to be compressed may be the clothing.

（第２の実施形態） (Second Embodiment)
本実施形態に係る画像認識システムＵは、圧縮処理部２３（及び圧縮処理部１３）の構成の点で、第１の実施形態と相違する。 The image recognition system U according to the present embodiment differs from the first embodiment in the configuration of the compression processing unit 23 (and the compression processing unit 13).

図９は、本実施形態に係る圧縮処理部２３の構成について、説明する図である。 FIG. 9 is a diagram illustrating the configuration of the compression processing unit 23 according to the present embodiment.

図９は、全階調域（ここでは、０〜２５５）における人領域Ｒ１における各画素値Ｒ１ａの分布（画素値毎の出現頻度）、周囲領域Ｒ２における各画素値Ｒ２ａの分布（画素値毎の出現頻度）を模式的に表している。 FIG. 9 shows the distribution of the pixel values R1a in the human region R1 (appearance frequency for each pixel value) in the entire gradation region (here, 0 to 255) and the distribution of the pixel values R2a in the surrounding region R2 (for each pixel value). Frequency of occurrence) schematically.

本実施形態に係る圧縮処理部２３は、人領域Ｒ１における各画素値Ｒ１ａを、全階調域（ここでは、０〜２５５）のうちの上限側の３分の１の階調範囲よりも狭い範囲Ｒ１ｂ（ここでは、白色側に相当する２００〜２５５の階調範囲）に圧縮すると共に、周囲領域Ｒ２における画素値Ｒ２ａを、全階調域のうちの下限側の３分の１の階調範囲よりも狭い範囲Ｒ２ｂ（ここでは、黒色側に相当する０〜５０の階調範囲）に圧縮する。つまり、本実施形態に係る圧縮処理部２３は、画像内の人領域Ｒ１における各画素値Ｒ１ａと周囲領域Ｒ２における各画素値Ｒ１ｂとが、画像を表現する全階調域のうちの互いに分離されるようにする。 The compression processing unit 23 according to the present embodiment compresses each pixel value R1a in the human region R1 into a range R1b narrower than the upper third of the entire gradation range (here, 0 to 255), in this example the gradation range 200 to 255 on the white side, and compresses each pixel value R2a in the surrounding region R2 into a range R2b narrower than the lower third of the entire gradation range, in this example the gradation range 0 to 50 on the black side. That is, the compression processing unit 23 according to the present embodiment ensures that the pixel values R1a in the human region R1 and the pixel values R2a in the surrounding region R2 are separated from each other within the entire gradation range expressing the image.

このように、本実施形態に係る画像認識システムＵによれば、人領域Ｒ１における画素値と周囲領域Ｒ２における画素値との間のコントラストを高めることができ、これによって、第１の識別器Ｄｍ１における識別精度をより向上させることが可能である。 As described above, the image recognition system U according to the present embodiment can increase the contrast between the pixel values in the human region R1 and the pixel values in the surrounding region R2, which makes it possible to further improve the identification accuracy of the first discriminator Dm1.
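The two-range compression of the second embodiment can be sketched by applying the linear mapping of equation (1) twice with different target ranges. Using equation (1) for the surrounding region as well, the 0/1 mask representation, and the rounding are assumptions; the ranges 200 to 255 and 0 to 50 are the example values from the text.

```python
def compress_two_regions(image, mask, person_range=(200, 255),
                         surround_range=(0, 50), in_min=0, in_max=255):
    """Compress region R1 (mask == 1) and region R2 (mask == 0) into separate ranges."""
    def remap(value, low, high):
        # Linear mapping of equation (1) into [low, high]
        return round((value - in_min) * (high - low) / (in_max - in_min) + low)

    return [
        [
            remap(value, *person_range) if inside else remap(value, *surround_range)
            for value, inside in zip(image_row, mask_row)
        ]
        for image_row, mask_row in zip(image, mask)
    ]
```

Since the two target ranges do not overlap, every pixel of the human region ends up brighter than every pixel of the surrounding region, whatever the original values were.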

（その他の実施形態） (Other Embodiments)
本発明は、上記実施形態に限らず、種々に変形態様が考えられる。 The present invention is not limited to the above embodiments, and various modifications are conceivable.

上記実施形態では、画像認識システムＵにおける「識別対象」の一例として、人物の関節位置、姿勢又は動作を示したが、本発明に係る画像認識システムＵは、画像内に映る任意の物体の特徴点、姿勢又は動作を推定する用に、適用することが可能であり、例えば、動物、車両又は置物等の姿勢等を推定する用に、適用することも可能である。又、本発明に係る画像認識システムＵは、撮像装置が生成した画像データに代えて、イラスト画像の画像データ等に対しても、適用することが可能である。 In the above-described embodiment, the joint position, posture, or movement of a person is shown as an example of “identification target” in the image recognition system U. However, the image recognition system U according to the present invention is a feature of an arbitrary object reflected in an image. The present invention can be applied to estimate a point, posture, or motion, and can be applied to, for example, estimate the posture of an animal, a vehicle, a figurine, or the like. Further, the image recognition system U according to the present invention can be applied to image data of an illustration image instead of the image data generated by the imaging apparatus.

又、上記実施形態では、領域検出部２２の一例として、畳み込みニューラルネットワークによって構成される態様を示した。しかしながら、領域検出部２２が人領域Ｒ１を検出する手法は、必ずしも畳み込みニューラルネットワーク等の識別器（学習器）を用いる方法に限定されない。例えば、領域検出部２２は、背景差分法、テンプレートマッチング、ＨＯＧ（Histograms of Oriented Gradients）特徴量、又はＳＶＭ（Support Vector Machine）等の従来手法を用いて、人領域Ｒ１を検出してもよい。 In the above embodiments, a configuration based on a convolutional neural network was shown as an example of the region detection unit 22. However, the method by which the region detection unit 22 detects the human region R1 is not necessarily limited to one that uses a discriminator (learner) such as a convolutional neural network. For example, the region detection unit 22 may detect the human region R1 using a conventional method such as background subtraction, template matching, HOG (Histograms of Oriented Gradients) features, or an SVM (Support Vector Machine).
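As an illustration of the simplest of these alternatives, background subtraction can be sketched as a per-pixel comparison against a reference image of the empty scene. The grayscale nested-list images, the fixed threshold of 30, and the function name are assumptions chosen for the example; practical systems usually add noise filtering and background updating.

```python
def background_subtraction_mask(frame, background, threshold=30):
    """Return a 0/1 mask that is 1 where the frame differs from the background.

    frame, background -- nested lists of grayscale values with identical shapes
    threshold         -- minimum absolute difference treated as foreground (assumed value)
    """
    return [
        [1 if abs(pixel - reference) > threshold else 0
         for pixel, reference in zip(frame_row, background_row)]
        for frame_row, background_row in zip(frame, background)
    ]
```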

又、上記実施形態では、圧縮処理部２３の一例として、人領域Ｒ１の各画素値をＲＧＢの三原色それぞれを均等に圧縮する態様を示した。しかしながら、圧縮処理部２３は、ＲＧＢのそれぞれについて圧縮率（即ち、階調範囲）を異ならせたりしてもよい。 In the above-described embodiment, as an example of the compression processing unit 23, a mode in which the pixel values of the human region R1 are uniformly compressed for the three primary colors of RGB is shown. However, the compression processing unit 23 may change the compression rate (that is, the gradation range) for each of RGB.
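A per-channel variant could look like the following sketch, which applies equation (1) to each of the R, G, and B values with its own target range. The three ranges used as defaults are arbitrary illustrative values, not values from the source.

```python
def compress_rgb(pixel, ranges=((200, 255), (210, 255), (220, 255)),
                 in_min=0, in_max=255):
    """Compress one (R, G, B) pixel, using a separate target range per channel."""
    return tuple(
        round((channel - in_min) * (high - low) / (in_max - in_min) + low)
        for channel, (low, high) in zip(pixel, ranges)
    )
```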

又、上記実施形態では、学習装置１０と推定装置２０とが別体のコンピュータによって実現される態様を示したが、学習装置１０と推定装置２０とが一のコンピュータによって実現される態様としてものは勿論である。又、コンピュータに読み出されるプログラムやデータ、及び当該コンピュータが書き込むデータ等が、複数のコンピュータに分散して格納されてもよい。 In the above embodiments, the learning device 10 and the estimation device 20 were shown as implemented by separate computers, but they may of course be implemented by a single computer. Further, the programs and data read by a computer, the data written by that computer, and the like may be distributed and stored across a plurality of computers.

以上、本発明の具体例を詳細に説明したが、これらは例示にすぎず、請求の範囲を限定するものではない。請求の範囲に記載の技術には、以上に例示した具体例を様々に変形、変更したものが含まれる。 Specific examples of the present invention have been described in detail above, but these are merely examples and do not limit the scope of the claims. The technology described in the claims includes various modifications and changes of the specific examples illustrated above.

本明細書および添付図面の記載により、少なくとも以下の事項が明らかとなる。 At least the following matters will become apparent from the description of this specification and the accompanying drawings.

画像に映る対象物体の領域における各画素値を、前記画像を表現する全階調域のうちの第１の階調範囲内に圧縮する圧縮処理部２３と、
前記圧縮処理が施された前記画像に対して、学習済みの第１の識別器モデルＤｍ１を用いた画像解析を施して、前記対象物体の特徴点、姿勢又は動作を推定する推定部２４と、
を備え、
前記第１の識別器モデルＤｍ１は、前記圧縮処理が施された教師画像と当該教師画像に映る前記対象物体の特徴点、姿勢又は動作とが関連付けられた第１の学習データＤｔ１を用いて、学習処理が施されている
推定装置２０を開示する。 Disclosed is an estimation device 20 including: a compression processing unit 23 that compresses each pixel value in the region of a target object shown in an image into a first gradation range within the entire gradation range expressing the image; and an estimation unit 24 that performs image analysis using a learned first discriminator model Dm1 on the image subjected to the compression processing and estimates a feature point, posture, or motion of the target object, wherein the first discriminator model Dm1 has undergone learning processing using first learning data Dt1 in which a teacher image subjected to the compression processing is associated with the feature point, posture, or motion of the target object shown in that teacher image.

又、前記推定装置２０は、より好適には、前記画像に対して、学習済みの第２の識別器モデルＤｍ２を用いた画像解析を施して、前記画像に映る前記対象物体の領域を検出する領域検出部、を更に備え、
前記第２の識別器モデルＤｍ２は、教師画像と当該教師画像内において前記対象物体が映る領域とが関連付けられた第２の学習データＤｔ２を用いて、学習処理が施されている。 More preferably, the estimation device 20 further includes a region detection unit that performs image analysis using a learned second discriminator model Dm2 on the image and detects the region of the target object shown in the image, and the second discriminator model Dm2 has undergone learning processing using second learning data Dt2 in which a teacher image is associated with the region in which the target object appears in that teacher image.

又、前記推定装置２０において、前記圧縮処理部２３は、より好適には、各画素領域間における画素値の差を低減するように、前記圧縮処理を行う。 In the estimation device 20, the compression processing unit 23 more preferably performs the compression processing so as to reduce the difference in pixel values between the pixel regions.

又、前記推定装置２０において、前記第１の階調範囲は、より好適には、前記画像を表現する全階調域のうちの上限側の３分の１の階調範囲よりも狭い範囲、又は、下限側の３分の１の階調範囲よりも狭い範囲に設定される。 In the estimation device 20, the first gradation range is more preferably set to a range narrower than the upper third of the entire gradation range expressing the image, or to a range narrower than the lower third of that gradation range.

In the estimation device 20, the compression processing unit 23 more preferably also compresses each pixel value in the region surrounding the target object in the image into a second gradation range that is separated from the first gradation range within the entire gradation range representing the image.
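
The use of two separated gradation ranges can be sketched as below. The boolean mask representation, the linear mapping, and the names `rescale` and `compress_two_ranges` are assumptions for this example. For simplicity the sketch treats every pixel outside the mask as the surrounding region.

```python
def rescale(value, target_range, full_range=(0, 255)):
    """Map one pixel value linearly into `target_range` (assumed mapping)."""
    lo, hi = target_range
    lo_full, hi_full = full_range
    return lo + (value - lo_full) * (hi - lo) // (hi_full - lo_full)


def compress_two_ranges(image, mask, first_range, second_range):
    """Compress target pixels (mask True) into `first_range` and the
    remaining pixels into `second_range`; the ranges must not overlap."""
    separated = first_range[1] < second_range[0] or second_range[1] < first_range[0]
    if not separated:
        raise ValueError("first and second gradation ranges must be separated")
    return [
        [rescale(v, first_range if m else second_range) for v, m in zip(row, mask_row)]
        for row, mask_row in zip(image, mask)
    ]
```

Keeping the two ranges disjoint means that, after compression, a pixel value alone indicates whether the pixel belongs to the target object or to its surroundings.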

In the estimation device 20, the target object is more preferably a person.

In the estimation device 20, the first discriminator model Dm1 and the second discriminator model Dm2 more preferably each include a convolutional neural network.

In another aspect, a learning device 10 is disclosed, comprising:
a compression processing unit 13 that compresses each pixel value in the region of a target object shown in a teacher image into a first gradation range within the entire gradation range representing the teacher image; and
a first learning processing unit 14 that performs learning processing on a first discriminator model using first learning data in which the teacher image subjected to the compression processing is associated with a feature point, posture, or motion of the target object shown in that teacher image.
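
Preparing the first learning data on the learning side can be sketched as follows: each teacher image is compressed in the same way as at estimation time and then paired with its label. The sample tuple layout, the linear mapping, and the names `compress_region` and `build_first_learning_data` are assumptions for this example. The learning processing itself (for instance, training a convolutional neural network) is not shown.

```python
def compress_region(image, region, first_range, full_range=(0, 255)):
    """Map pixel values inside `region` into `first_range` (assumed linear mapping).

    region: (top, left, bottom, right), with bottom/right exclusive (assumed format)
    """
    lo_full, hi_full = full_range
    lo, hi = first_range
    top, left, bottom, right = region
    out = [row[:] for row in image]
    for y in range(top, bottom):
        for x in range(left, right):
            out[y][x] = lo + (image[y][x] - lo_full) * (hi - lo) // (hi_full - lo_full)
    return out


def build_first_learning_data(samples, first_range):
    """Pair each compressed teacher image with its label.

    samples: iterable of (teacher_image, region, label), where `label` holds the
    feature point, posture, or motion annotation (hypothetical layout).
    """
    return [
        (compress_region(image, region, first_range), label)
        for image, region, label in samples
    ]
```

Applying the same compression to teacher images and to images at estimation time keeps the input distribution seen by the model consistent between the two stages.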

In another aspect, an estimation method is disclosed, comprising:
compressing each pixel value in the region of a target object shown in an image into a first gradation range within the entire gradation range representing the image; and
performing image analysis on the image subjected to the compression processing using a learned first discriminator model Dm1 to estimate a feature point, posture, or motion of the target object,
wherein the first discriminator model Dm1 has been subjected to learning processing using first learning data Dt1 in which a teacher image subjected to the compression processing is associated with the feature point, posture, or motion of the target object shown in that teacher image.

In another aspect, a learning method is disclosed, comprising:
compressing each pixel value in the region of a target object shown in a teacher image into a first gradation range within the entire gradation range representing the teacher image; and
performing learning processing on a first discriminator model using first learning data in which the teacher image subjected to the compression processing is associated with a feature point, posture, or motion of the target object shown in that teacher image.

In another aspect, an estimation program is disclosed that causes a computer to execute:
a process of compressing each pixel value in the region of a target object shown in an image into a first gradation range within the entire gradation range representing the image; and
a process of performing image analysis on the image subjected to the compression processing using a learned first discriminator model Dm1 to estimate a feature point, posture, or motion of the target object,
wherein the first discriminator model Dm1 has been subjected to learning processing using first learning data Dt1 in which a teacher image subjected to the compression processing is associated with the feature point, posture, or motion of the target object shown in that teacher image.

In another aspect, a learning program is disclosed that causes a computer to execute:
a process of compressing each pixel value in the region of a target object shown in a teacher image into a first gradation range within the entire gradation range representing the teacher image; and
a process of performing learning processing on a first discriminator model using first learning data in which the teacher image subjected to the compression processing is associated with a feature point, posture, or motion of the target object shown in that teacher image.

The estimation device according to the present disclosure is better suited to estimating a feature point, posture, or motion of a target object.

U image recognition system
10 learning device
11 second learning processing unit
12 region detection unit
13 compression processing unit
14 first learning processing unit
20 estimation device
21 input unit
22 region detection unit
23 compression processing unit
24 estimation unit
25 output unit
Dm1 first discriminator
Dm2 second discriminator

Claims

An estimation device comprising:
a compression processing unit that compresses each pixel value in the region of a target object shown in an image into a first gradation range within the entire gradation range representing the image; and
an estimation unit that performs image analysis on the image subjected to the compression processing using a learned first discriminator model and estimates a feature point, posture, or motion of the target object,
wherein the first discriminator model has been subjected to learning processing using first learning data in which a teacher image subjected to the compression processing is associated with the feature point, posture, or motion of the target object shown in that teacher image.

The estimation device according to claim 1, further comprising a region detection unit that performs image analysis on the image using a learned second discriminator model and detects the region of the target object shown in the image,
wherein the second discriminator model has been subjected to learning processing using second learning data in which a teacher image is associated with the region in which the target object appears in that teacher image.

The estimation device according to claim 1 or 2, wherein the compression processing unit performs the compression processing so as to reduce the differences in pixel values between the pixel regions.

The estimation device according to any one of claims 1 to 3, wherein the first gradation range is set to a range narrower than the upper one-third of the entire gradation range representing the image, or to a range narrower than the lower one-third of that range.

The estimation device according to any one of claims 1 to 4, wherein the compression processing unit further compresses each pixel value in the region surrounding the target object in the image into a second gradation range that is separated from the first gradation range within the entire gradation range representing the image.

The estimation device according to any one of claims 1 to 5, wherein the target object is a person or a predetermined part of a person.

The estimation device according to claim 2, wherein the first discriminator model and the second discriminator model each include a convolutional neural network.

A learning device comprising:
a compression processing unit that compresses each pixel value in the region of a target object shown in a teacher image into a first gradation range within the entire gradation range representing the teacher image; and
a first learning processing unit that performs learning processing on a first discriminator model using first learning data in which the teacher image subjected to the compression processing is associated with a feature point, posture, or motion of the target object shown in that teacher image.

An estimation method comprising:
compressing each pixel value in the region of a target object shown in an image into a first gradation range within the entire gradation range representing the image; and
performing image analysis on the image subjected to the compression processing using a learned first discriminator model to estimate a feature point, posture, or motion of the target object,
wherein the first discriminator model has been subjected to learning processing using first learning data in which a teacher image subjected to the compression processing is associated with the feature point, posture, or motion of the target object shown in that teacher image.

A learning method comprising:
compressing each pixel value in the region of a target object shown in a teacher image into a first gradation range within the entire gradation range representing the teacher image; and
performing learning processing on a first discriminator model using first learning data in which the teacher image subjected to the compression processing is associated with a feature point, posture, or motion of the target object shown in that teacher image.

An estimation program causing a computer to execute:
a process of compressing each pixel value in the region of a target object shown in an image into a first gradation range within the entire gradation range representing the image; and
a process of performing image analysis on the image subjected to the compression processing using a learned first discriminator model to estimate a feature point, posture, or motion of the target object,
wherein the first discriminator model has been subjected to learning processing using first learning data in which a teacher image subjected to the compression processing is associated with the feature point, posture, or motion of the target object shown in that teacher image.

A learning program causing a computer to execute:
a process of compressing each pixel value in the region of a target object shown in a teacher image into a first gradation range within the entire gradation range representing the teacher image; and
a process of performing learning processing on a first discriminator model using first learning data in which the teacher image subjected to the compression processing is associated with a feature point, posture, or motion of the target object shown in that teacher image.