JP7230963B2 - Image processing method, device and storage medium

Info

Publication number: JP7230963B2
Application number: JP2021131444A
Authority: JP
Inventors: ジャオイン; ジャンイフェイ; ワンガン
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2020-08-12
Filing date: 2021-08-11
Publication date: 2023-03-01
Anticipated expiration: 2041-08-11
Also published as: JP2022033037A; CN114078193A

Description

The present invention belongs to the field of computer vision and, more specifically, relates to an image processing method, an apparatus, and a storage medium.

With the development of modern science and technology, and of computer technology in particular, giving machines human-like visual capabilities became a challenging research problem, and the discipline of computer vision emerged. Research in computer vision aims to build or reconstruct a model of the world from visual media such as images and video, and thereby to perceive the real world. In this world, human motion conveys a large amount of information that matters to human society. Interactions between people, between people and objects, and between people and their environment make up the main content of visual media. It is therefore important to study human motion information in visual media and to represent, analyze, and understand it effectively. Pose estimation has attracted attention as an important category of computer vision research. Depending on the application, the subject of pose estimation may be a real person or a virtual figure.

In general, pose estimation determines the keypoints of the body parts of the subject being estimated (for example, a real person or a virtual figure) and estimates the pose from those keypoints. In the images used for pose estimation, however, occlusion of the subject is unavoidable (for example, one person occluding another, or an object occluding a person). When part of the subject's body is occluded, robustly detecting the keypoints of each body part becomes a pressing problem.

Conventional pose estimation methods address occlusion by exploiting the relationships between adjacent keypoints. The contribution of distant keypoints, however, has not been studied sufficiently and in some cases has been ignored.

In view of the above, an object of the present invention is to provide a new image processing method, apparatus, and storage medium based on global relationships, which learn the relationships between arbitrary keypoints and optimize the final keypoint determination result based on these thoroughly explored keypoint relationships.

First, the present invention provides an image processing method comprising: extracting, based on an input image, a predetermined number of keypoint feature maps, each corresponding to a keypoint of a body part; extracting, based on the predetermined number of keypoint feature maps, relationship feature maps that represent, for each keypoint feature map, a plurality of relationships with other keypoint feature maps; updating each keypoint feature map based on that keypoint feature map and its corresponding relationship feature maps; and determining, based on the updated keypoint feature maps, the positions of the keypoints of the body parts in the input image.

Further, the image processing method according to an embodiment of the present invention includes, before extracting, based on the predetermined number of keypoint feature maps, the relationship feature maps that represent for each keypoint feature map a plurality of relationships with other keypoint feature maps: determining a center point feature map for a center point that is the average of all keypoints; and extracting, for each keypoint feature map, a relationship feature map representing its relationship with the center point feature map, and updating the keypoint feature map based on that relationship feature map.

Further, in the image processing method according to an embodiment of the present invention, extracting, based on the predetermined number of keypoint feature maps, the relationship feature maps that represent for each keypoint feature map a plurality of relationships with other keypoint feature maps includes repeatedly performing the following, with each keypoint feature map taken in turn as the current keypoint feature map: dividing the keypoint feature maps other than the current keypoint feature map into a plurality of groups, each containing at least one keypoint feature map; and performing convolution with the current keypoint feature map and each group of other keypoint feature maps as input, thereby obtaining relationship feature maps that represent the relationship between the current keypoint feature map and each of the groups.

Further, in the image processing method according to an embodiment of the present invention, each feature map in a group of other keypoint feature maps has a corresponding first weight in the convolution, and the initial value of the first weight is determined based on the relationship between the body part corresponding to that feature map and the body part corresponding to the current keypoint feature map.

Further, in the image processing method according to an embodiment of the present invention, updating each keypoint feature map based on that keypoint feature map and its corresponding relationship feature maps includes: performing convolution with each keypoint feature map and its corresponding relationship feature maps as input; and updating each keypoint feature map based on the feature map obtained by the convolution.

Further, in the image processing method according to an embodiment of the present invention, each relationship feature map has a corresponding second weight in the convolution, and the initial value of the second weight is determined based on the average pixel value or the maximum pixel value over all pixels of that relationship feature map.
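The initialization of the second weight can be illustrated with a minimal Python sketch. The text only states that the initial value is determined "based on" the mean or maximum pixel value; using that statistic directly as the weight is an assumption made here for illustration, and the function name `initial_second_weight` is hypothetical.

```python
def initial_second_weight(relation_map, mode="mean"):
    """Return an initial weight for one relationship feature map.

    relation_map is a 2-D list of pixel values. Using the statistic itself
    as the weight is an illustrative assumption, not the patent's formula.
    """
    pixels = [value for row in relation_map for value in row]
    if mode == "mean":
        return sum(pixels) / len(pixels)
    if mode == "max":
        return max(pixels)
    raise ValueError("mode must be 'mean' or 'max'")


# Hypothetical 2x2 relationship feature map.
toy_relation_map = [[0.0, 0.25],
                    [0.5, 0.25]]
mean_weight = initial_second_weight(toy_relation_map, "mean")
max_weight = initial_second_weight(toy_relation_map, "max")
```

Under this assumption, a relationship feature map with stronger responses starts with a larger weight, and training is then free to adjust it.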

Further, in the image processing method according to an embodiment of the present invention, the keypoint feature maps are extracted from the input image by a feature extraction network, the relationship feature maps are extracted by a relationship extraction network, and each keypoint feature map is updated by a fusion network based on that keypoint feature map and its corresponding relationship feature maps. The feature extraction network, the relationship extraction network, and the fusion network are trained by the steps of: supplying a training image as the input image to the feature extraction network; determining rough positions of the keypoints in the training image based on the keypoint feature maps output from the feature extraction network, and determining a second loss function based on the deviation between the rough keypoint positions and the ground-truth keypoint positions; and adjusting the parameters of the feature extraction network, the relationship extraction network, and the fusion network based on the first loss function and the second loss function.

The present invention also provides an image processing apparatus comprising: keypoint feature map extracting means for extracting, based on an input image, a predetermined number of keypoint feature maps, each corresponding to a keypoint of a body part; relationship feature map extracting means for extracting, based on the predetermined number of keypoint feature maps, relationship feature maps that represent, for each keypoint feature map, a plurality of relationships with other keypoint feature maps; keypoint feature map updating means for updating each keypoint feature map based on that keypoint feature map and its corresponding relationship feature maps; and keypoint feature map determining means for determining, based on the updated keypoint feature maps, the positions of the keypoints of the body parts in the input image.

Further, in the image processing apparatus according to an embodiment of the present invention, the keypoint feature map extracting means is further configured to: determine a center point feature map for a center point that is the average of all keypoints; and extract, for each keypoint feature map, a relationship feature map representing its relationship with the center point feature map, and update the keypoint feature map based on that relationship feature map.

Further, in the image processing apparatus according to an embodiment of the present invention, the relationship feature map extracting means is further configured to repeatedly perform the following, with each keypoint feature map taken in turn as the current keypoint feature map: dividing the keypoint feature maps other than the current keypoint feature map into a plurality of groups, each containing at least one keypoint feature map; and performing convolution with the current keypoint feature map and each group of other keypoint feature maps as input, thereby obtaining relationship feature maps that represent the relationship between the current keypoint feature map and each of the groups.

Further, in the image processing apparatus according to an embodiment of the present invention, the keypoint feature map updating means is configured to: perform convolution with each keypoint feature map and its corresponding relationship feature maps as input; and update each keypoint feature map based on the feature map obtained by the convolution.

Further, in the image processing apparatus according to an embodiment of the present invention, each relationship feature map has a corresponding second weight in the convolution, and the initial value of the second weight is determined based on the average pixel value or the maximum pixel value over all pixels of that relationship feature map.

Further, in the image processing apparatus according to an embodiment of the present invention, the keypoint feature maps are extracted from the input image by a feature extraction network, the relationship feature maps are extracted by a relationship extraction network, and each keypoint feature map is updated by a fusion network based on that keypoint feature map and its corresponding relationship feature maps. The feature extraction network, the relationship extraction network, and the fusion network are trained by the steps of: supplying a training image as the input image to the feature extraction network; determining approximate positions of the keypoints in the training image based on the keypoint feature maps output from the feature extraction network, and determining a second loss function based on the deviation between the approximate keypoint positions and the ground-truth keypoint positions; and adjusting the parameters of the feature extraction network, the relationship extraction network, and the fusion network based on the first loss function and the second loss function.

The present invention further provides an image processing apparatus comprising a memory in which a computer program is stored and a processor, wherein, when the processor executes the computer program stored in the memory, the following processing is performed: extracting, based on an input image, a predetermined number of keypoint feature maps, each corresponding to a keypoint of a body part; extracting, based on the predetermined number of keypoint feature maps, relationship feature maps that represent, for each keypoint feature map, a plurality of relationships with other keypoint feature maps; updating each keypoint feature map based on that keypoint feature map and its corresponding relationship feature maps; and determining, based on the updated keypoint feature maps, the positions of the keypoints of the body parts in the input image.

The present invention also provides a computer-readable storage medium storing a computer program that, when executed by a processor, causes the above image processing method to be performed.

The image processing method, apparatus, and storage medium of the present invention take global relationships into account, that is, relationships with distant keypoints in addition to relationships between adjacent keypoints. Context information is extracted from the related keypoints and used to estimate occluded keypoints, so that the positions of all keypoints in the input image are determined and even the keypoints of occluded body parts can be predicted reliably.

FIG. 1 is a diagram showing a scene to which the present invention is applicable.
FIG. 2 is a flowchart showing an image processing method according to an embodiment of the present invention.
FIG. 3 is a diagram showing an example of setting keypoints of a human body.
FIG. 4 is a diagram showing an example of the connection relationships between the human body keypoints shown in FIG. 3.
FIG. 5 is a diagram showing a network configuration for updating keypoint feature maps.
FIG. 6 is a flowchart for training the feature extraction network, the relationship extraction network, and the fusion network.
FIG. 7 is a block diagram showing the functional configuration of an image processing apparatus according to an embodiment of the present invention.
FIG. 8 is a diagram showing the architecture of a computing device according to an embodiment of the present invention.

Preferred embodiments of the present invention are described below with reference to the drawings. The description is illustrative and is intended to aid understanding of the embodiments of the invention, as defined by the claims and their equivalents, and of their various details. Various changes and modifications may therefore be made to the embodiments described here without departing from the scope and spirit of the invention. For brevity, detailed descriptions of functions and configurations that are well known in the art are omitted.

Before the image processing method according to the present invention is described, a scene to which it can be applied is explained with reference to FIG. 1. A camera 101 captures a user image, and the image processing method of the present invention identifies a plurality of keypoints 102 corresponding to the body parts in that user image. Based on the identified keypoints 102, the pose of a virtual figure 103 can be controlled in step with the user. In an interactive game, for example, the movements of a human player can be tracked in order to render the actions of the corresponding virtual character in the game.

The image processing method according to the embodiment of the present invention is of course not limited to the scene shown in FIG. 1 and can be applied in a variety of settings. By identifying a plurality of keypoints corresponding to the body parts, the method makes it possible to track changes in a person's pose over a period of time, which can be used for activity, gesture, and gait recognition. Possible applications include, but are not limited to, intelligent video surveillance, patient monitoring systems, autonomous driving, human-computer interaction, virtual reality, human body animation, smart rooms, intelligent security, and training support for athletes.

Next, the image processing method according to an embodiment of the present invention is described with reference to FIG. 2. As shown in FIG. 2, in step S201 a predetermined number of keypoint feature maps, each corresponding to a keypoint of a body part, are first extracted based on the input image.

Here, the keypoints of the body parts are skeletal keypoints. Skeletal keypoints are important elements for representing the pose of a figure (for example, a human, an animal, or a virtual figure) and for predicting its behavior. For a given kind of figure, a predetermined number of keypoints defining its skeletal parts can therefore be set in advance.

FIG. 3 shows an example of setting human body keypoints. As shown in FIG. 3, 17 human body keypoints are set, in order from the top: nose P1, left eye P2, right eye P3, left ear P4, right ear P5, left shoulder P6, right shoulder P7, left elbow P8, right elbow P9, left wrist P10, right wrist P11, left hip P12, right hip P13, left knee P14, right knee P15, left ankle P16, and right ankle P17. That is, when the human body keypoints are set as shown in FIG. 3 and a user image is input in step S201, feature maps for the keypoints of the 17 body parts shown in FIG. 3 are extracted. The following description uses the human body keypoint setting of FIG. 3 as an example. Other keypoint settings may of course be chosen freely according to the specific application.

The keypoint feature map extraction in step S201 is implemented by a convolutional neural network, referred to here as the feature extraction network. The convolutional neural network includes an input layer, hidden layers, and an output layer. The input layer receives the image to be recognized. The hidden layers analyze the input image and extract features from it. The output layer outputs the predetermined number of keypoint feature maps that are finally extracted.

The hidden layers include convolutional layers, pooling layers, and fully connected layers. In a convolutional neural network, multiple convolutional layers (arranged in series, in parallel, with cross connections, or in other complex structures) are used for feature extraction. Features at different levels are extracted through multiple convolution, normalization, and pooling operations with different parameters. First, the input image is convolved with convolution kernels to obtain a convolution map. Next, the convolution map is normalized using the usual rectified linear unit and batch normalization to obtain a normalized convolution map. Max or average pooling is then applied to the normalized convolution map. After feature extraction in a convolutional layer, the output feature map is passed to a pooling layer for feature selection and information filtering. The preset pooling function in the pooling layer replaces the result at a single point of the feature map with a statistic of the feature map over the neighboring region. For example, max pooling replaces the value of a single pixel with the maximum value of the pixels in the neighboring region, and average pooling replaces it with the average value of the pixels in the neighboring region.

To obtain diverse features, multiple convolutional layers and multiple pooling layers can be provided in the hidden layers, and the above process can be repeated several times. Through the repeated sampling, a variety of multi-scale feature maps are extracted.
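The two pooling functions described above can be sketched in plain Python as follows. This is a minimal illustration that assumes non-overlapping 2×2 windows, a common choice that the text does not specify; the helper names `pool2d`, `max_pool`, and `avg_pool` and the toy feature map are hypothetical.

```python
def pool2d(feature_map, size, reduce_fn):
    """Non-overlapping size x size pooling over a 2-D list of numbers."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        out_row = []
        for c in range(0, cols - size + 1, size):
            # Collect every pixel of the current window.
            window = [feature_map[r + dr][c + dc]
                      for dr in range(size) for dc in range(size)]
            out_row.append(reduce_fn(window))
        pooled.append(out_row)
    return pooled


def max_pool(feature_map, size=2):
    """Replace each window with its maximum value."""
    return pool2d(feature_map, size, max)


def avg_pool(feature_map, size=2):
    """Replace each window with its average value."""
    return pool2d(feature_map, size, lambda w: sum(w) / len(w))


# Hypothetical 4x4 feature map.
example = [
    [1, 2, 0, 1],
    [3, 4, 1, 0],
    [0, 0, 5, 6],
    [0, 8, 7, 8],
]
print(max_pool(example))  # [[4, 1], [8, 8]]
print(avg_pool(example))  # [[2.5, 0.5], [2.0, 6.5]]
```

Each 2×2 window collapses to one value, so the 4×4 map becomes 2×2. Repeating convolution and pooling in this way is what produces feature maps at several scales.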

Based on the multi-scale feature maps extracted by the hidden layers, the output layer finally outputs the predetermined number of feature maps. When the input image is an actual user image and the human body keypoints are set as shown in FIG. 3, the output layer outputs 17 keypoint feature maps.

The 17 keypoint feature maps output from the output layer are ordered according to predetermined body part types. For example, the first keypoint feature map is the left-eye feature map, the second is the right-eye feature map, and the third is the nose feature map. Each keypoint feature map contains only keypoints corresponding to one body part. For example, the left-eye feature map contains only features associated with left-eye keypoints. When only one person appears in the input image, each keypoint feature map contains only features associated with a single keypoint. For example, the left-eye feature map contains only features associated with one left-eye keypoint. When several people appear in the input image, however, a keypoint feature map may contain features associated with several keypoints. For example, the left-eye feature map may contain features associated with two left-eye keypoints, but both keypoints correspond to the same body part (the left eye).

Returning to FIG. 2, after the predetermined number of keypoint feature maps have been extracted in step S201, processing proceeds to step S202. In step S202, relationship feature maps that represent, for each keypoint feature map, a plurality of relationships with other keypoint feature maps are extracted based on the predetermined number of keypoint feature maps.

Specifically, the following steps are repeated with each keypoint feature map taken in turn as the current keypoint feature map.

First, the keypoint feature maps other than the current keypoint feature map are divided into a plurality of groups.

Next, convolution is performed with the current keypoint feature map and each group of other keypoint feature maps as input. This convolution is implemented, for example, by a convolutional neural network, referred to here as the relationship extraction network. Its specific network parameters (for example, network depth and node weights) naturally differ from those of the convolutional neural network used to extract the keypoint feature maps in step S201. Two or more keypoint feature maps are provided as input to the convolutional neural network, which performs feature extraction on them and outputs a feature description of the input keypoint feature maps. Because each input keypoint feature map is associated with a keypoint, the feature map output as the result of the convolution can be regarded as representing the relationship between those keypoints. In other words, convolving two or more keypoint feature maps yields the relationship between the current keypoint feature map and the other keypoint feature maps. The feature map resulting from this convolution is called a relationship feature map.
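As a rough illustration of how one convolution can combine a current keypoint feature map with another keypoint feature map, the sketch below stacks the maps as input channels and applies a single zero-padded multi-channel convolution in plain Python. It is a toy stand-in, not the relationship extraction network itself: a real network would use learned kernels, several layers, and nonlinearities. The function name `conv2d_multichannel`, the 1×1 kernel values, and the toy maps are all assumptions.

```python
def conv2d_multichannel(channels, kernels, bias=0.0):
    """Zero-padded 'same' cross-correlation, summed over input channels.

    channels: list of 2-D lists with identical shape (one per input map).
    kernels:  one square kernel with odd side length per input channel.
    """
    height, width = len(channels[0]), len(channels[0][0])
    k = len(kernels[0])
    pad = k // 2
    out = [[bias] * width for _ in range(height)]
    for channel, kernel in zip(channels, kernels):
        for y in range(height):
            for x in range(width):
                acc = 0.0
                for ky in range(k):
                    for kx in range(k):
                        yy, xx = y + ky - pad, x + kx - pad
                        # Pixels outside the map count as zero.
                        if 0 <= yy < height and 0 <= xx < width:
                            acc += kernel[ky][kx] * channel[yy][xx]
                out[y][x] += acc
    return out


# Hypothetical 2x2 maps: the current keypoint map and one other map.
current_map = [[0.0, 1.0],
               [0.0, 0.0]]
other_map = [[0.0, 0.0],
             [1.0, 0.0]]
# Illustrative 1x1 kernels: per-channel scalar weights.
kernels = [[[1.0]], [[0.5]]]
relation_map = conv2d_multichannel([current_map, other_map], kernels)
```

With 1×1 kernels the result is a per-pixel weighted sum of the two maps, here `[[0.0, 1.0], [0.5, 0.0]]`. Larger kernels would additionally mix information from neighboring pixels.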

For example, for each keypoint feature map, a relationship feature map representing its relationship with each individual feature map among the other keypoint feature maps is extracted. Alternatively, a relationship feature map representing its relationship with several feature maps among the other keypoint feature maps may be extracted. The keypoints corresponding to those feature maps are not limited to keypoints adjacent to the current keypoint. As described below, adjacent keypoints are keypoints at distance 1. The keypoints corresponding to the other feature maps also include distant keypoints that are far from the current keypoint (for example, keypoints at a distance greater than 1).

In one possible embodiment, each feature map in a group of other keypoint feature maps has a corresponding first weight in the convolution.

例えば、他のキーポイント特徴図からなるグループにおける各特徴図に対応する身体部位と、前記カレントキーポイント特徴図に対応する身体部位との間の関係に基づいて、第1重みの値を確定する。図3に示す身体キーポイントの設定を例とする場合に、各キーポイント特徴図に対応する重みは17行×17列の二次元マトリックスG_17×17から得られる。二次元マトリックスG_17×17はキーポイント間の距離dを表す。ここで、距離dは2つのキーポイント間に経由して接続される中間点の個数により決定される。なお、キーポイント間の接続関係(例えば、接続可能又は不可能なポイント、如何に接続するか)は、人体の骨格構造に基づいて予め決定される。例えば、図4はキーポイント間の接続関係例を示す。図4に示すように、鼻P1と右目P3は直接に接続可能であるが、鼻P1と左臀部P12は必ず中心点Pｃを経由して接続される。ここで、中心点Pｃは17個の身体キーポイントの位置に基づく平均値の点である。これにより、キーポイントの鼻P1とキーポイントの右目P3との間の距離は1で、キーポイント鼻P1と左臀部P12との距離は2である。これに応じて、G_17×17における要素G(1、3)=1、G(1、12)=2になる。G_17×17に含まれるすべての要素および対応する要素値は次のとおりである。 For example, the value of the first weight is determined based on the relationship between the body part corresponding to each feature map in a group of other keypoint feature maps and the body part corresponding to the current keypoint feature map. Taking the body keypoint setting shown in FIG. 3 as an example, the weight corresponding to each keypoint feature map is obtained from a two-dimensional matrix G_17×17 with 17 rows and 17 columns. The matrix G_17×17 represents the distance d between keypoints, where d is determined by the number of intermediate points through which two keypoints are connected. The connection relationship between keypoints (for example, which points can or cannot be connected, and how they are connected) is determined in advance based on the skeletal structure of the human body. For example, FIG. 4 shows an example of the connection relationships between keypoints. As shown in FIG. 4, the nose P1 and the right eye P3 can be connected directly, whereas the nose P1 and the left hip P12 are always connected via the center point Pc. Here, the center point Pc is the point at the mean of the positions of the 17 body keypoints. Thus, the distance between the nose P1 and the right eye P3 is 1, and the distance between the nose P1 and the left hip P12 is 2. Correspondingly, the elements of G_17×17 are G(1,3)=1 and G(1,12)=2. All the elements contained in G_17×17 and their values are as follows:

（外１） (External image 1)

i番目のキーポイント特徴図について、j番目キーポイント特徴図との関係特徴図を決定する場合に、j番目のキーポイント特徴図に対応する第1重みは二次元マトリックスG_17×17の要素G(I,j)に基づいて取得する。キーポイント間の距離が短いほど、その関連が大きくなると考えられるため、大きな重みを割り当てる。逆の場合にも同様である。 For the i-th keypoint feature map, when determining the relationship feature map with the j-th keypoint feature map, the first weight corresponding to the j-th keypoint feature map is obtained based on the element G(i,j) of the two-dimensional matrix G_17×17. The shorter the distance between two keypoints, the stronger their association is assumed to be, so a larger weight is assigned. Conversely, a longer distance is assigned a smaller weight.
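The distance rule above can be sketched as a breadth-first search over the skeleton graph. This is a minimal illustration only: the four-point toy graph, the function names, and the inverse-distance mapping from d to a weight are assumptions, since the text only says that shorter distances receive larger weights and the full matrix is given as an image.

```python
from collections import deque


def hop_distance_matrix(num_points, edges):
    """Return D where D[i][j] is the number of edges on the shortest path i -> j."""
    adjacency = [[] for _ in range(num_points)]
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)
    dist = [[None] * num_points for _ in range(num_points)]
    for src in range(num_points):
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if dist[src][v] is None:
                    dist[src][v] = dist[src][u] + 1
                    queue.append(v)
    return dist


def first_weights(dist_row, current):
    """Initial first weights for one current keypoint (assumed rule: 1 / d)."""
    return {j: 1.0 / d for j, d in enumerate(dist_row) if j != current and d}


# Toy graph (hypothetical indices): 0 = nose, 1 = right eye, 2 = center, 3 = left hip.
toy_edges = [(0, 1), (0, 2), (2, 3)]
D = hop_distance_matrix(4, toy_edges)
weights_for_nose = first_weights(D[0], current=0)
```

In the toy graph the nose reaches the right eye in one hop and the left hip in two hops through the center point, mirroring G(1,3)=1 and G(1,12)=2 in the text.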

また、可能な実施形態として、他のキーポイント特徴図における各特徴図に対応する第1重みは、後述するトレーニングプロセスを通して学習することにより得られる。なお、前記したようにキーポイント間の距離に基づいて決定された重みを初期値とする。前記初期値に基づいて監督ありにトレーニングが行われる。 Also, as a possible embodiment, the first weight corresponding to each feature map in the other keypoint feature map is obtained by learning through a training process described below. Note that the weight determined based on the distance between the keypoints as described above is used as the initial value. Supervised training is performed based on the initial values.

ステップS202では、キーポイント特徴図毎に他のキーポイント特徴図における関係特徴図を抽出する。即ち、隣接するキーポイント間の関係に加えて、遠距離のキーポイント間の関係も考慮される。 In step S202, for each keypoint feature map, relationship feature maps with the other keypoint feature maps are extracted. That is, in addition to the relationships between neighboring keypoints, the relationships between distant keypoints are also considered.

キーポイント特徴図毎に他のキーポイント特徴図との関係特徴図を抽出することにより、より多くの情報（例えば、他のキーポイントに関連付けられる情報）が導入される。また、1つのキーポイント特徴図を単一のチャネルと見なす場合には、当該キーポイント特徴図に対応する複数の関係特徴図が生成されるため、特徴図のチャネル数が増えると考えられる。これによって、特徴の表現能力は、キーポイント特徴図とそれに対応する関係特徴図とを組み合わせることにより、特徴の表現能力が強くなり、後続するキーポイントの検出に寄与される。 More information (eg, information associated with other keypoints) is introduced by extracting relationship feature maps for each keypoint feature map with other keypoint feature maps. Also, when one keypoint feature diagram is regarded as a single channel, a plurality of relational feature diagrams corresponding to the keypoint feature diagram are generated, so the number of channels of the feature diagram is considered to increase. Thereby, the expressive power of the feature is strengthened by combining the keypoint feature map and the corresponding relational feature map, which contributes to the detection of subsequent keypoints.

図3に示す人体キーポイント設定を例とする。この場合、キーポイント特徴図毎に16個の他のキーポイント特徴図のそれぞれとの関係特徴図を抽出する。即ち、1つのキーポイント特徴図に対して16個の関係特徴図が取得される。もちろん、1つのキーポイント特徴図に対して、16個の他のキーポイント特徴図における任意個(2つなど)の関係特徴図を抽出してもよい。例えば、1つのキーポイント特徴図に対して、2つの他のキーポイント特徴図との関連特徴図を抽出する場合、8個の関係特徴図が取得される。 Take the human body keypoint setting shown in FIG. 3 as an example. In this case, for each keypoint feature map, a relationship feature map with each of the other 16 keypoint feature maps is extracted. That is, 16 relationship feature maps are obtained for one keypoint feature map. Of course, for one keypoint feature map, any number (such as two) of relationship feature maps in 16 other keypoint feature maps may be extracted. For example, when extracting related feature maps with two other keypoint feature maps for one keypoint feature map, eight relational feature maps are obtained.
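The counts in this paragraph (16 relationship feature maps when each group holds one other keypoint, 8 when each group holds two) follow from how the 16 other feature maps are grouped. The sketch below assumes groups are formed from consecutive indices, which the text does not specify; the function name is hypothetical.

```python
def group_others(num_keypoints, current, group_size):
    """Split all keypoint indices except `current` into groups of `group_size`."""
    others = [k for k in range(num_keypoints) if k != current]
    return [others[i:i + group_size] for i in range(0, len(others), group_size)]


# One relationship feature map is produced per group.
pairwise_groups = group_others(17, current=0, group_size=1)  # 16 groups
paired_groups = group_others(17, current=0, group_size=2)    # 8 groups
```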

また、他の可能な実施方式として、ステップS202の処理を行う前に、ステップS201で取得された予所定数のキーポイント特徴図に対して初期最適化を行ってもよい。具体的には、ステップS201とステップS202の間で、以下のステップを更に含む。 Also, as another possible implementation, the initial optimization may be performed on a predetermined number of keypoint feature maps obtained in step S201 before performing the processing in step S202. Specifically, the following steps are further included between step S201 and step S202.

まず、中心点特徴図を特定する。上記の図4に示すように、前記中心点はすべてのキーポイントの平均である。中心点特徴図は、次の2つの方法で取得される。1つ目の方法は、すべてのキーポイント特徴図を平均した特徴図を中心点特徴図とする方法である。2つ目の方法は、所定数（例えば17個）のキーポイント特徴図を出力し、更に中心点特徴図をも出力するように、キーポイント特徴図を抽出するための畳込みニューラルネットワークの構造を調整する。 First, the center point feature map is determined. As shown in FIG. 4 above, the center point is the average of all keypoints. The center point feature map can be obtained in either of two ways. The first is to take the average of all keypoint feature maps as the center point feature map. The second is to adjust the structure of the convolutional neural network used to extract the keypoint feature maps so that it outputs the center point feature map in addition to the predetermined number (e.g., 17) of keypoint feature maps.
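The first of the two methods, averaging all keypoint feature maps, can be sketched as an element-wise mean. Feature maps are represented here as nested lists of equal size, and the function name is an assumption made for illustration.

```python
def center_feature_map(feature_maps):
    """Element-wise mean of equally sized 2-D feature maps."""
    count = len(feature_maps)
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    return [
        [sum(fm[y][x] for fm in feature_maps) / count for x in range(width)]
        for y in range(height)
    ]


# Two tiny 2x2 "keypoint feature maps" stand in for the 17 real ones.
center = center_feature_map([[[0, 2], [4, 6]], [[2, 4], [6, 8]]])
```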

次に、キーポイント特徴図毎に、中心点特徴図との関係特徴図を抽出し、前記関連特徴図に基づいて、キーポイント特徴図を更新する。ここでは、キーポイント特徴図毎に前記中心点特徴図との関係特徴図を抽出する具体的な方法は、キーポイント特徴図間の関係特徴図を抽出する方法に類似する。例えば、キーポイント特徴図に対して、それと中心点特徴図に対して畳込み処理を施して新しい特徴図が生成される。この新しい特徴図では、当該キーポイントと中心点と間の関係が融合されている。また、この新しい特徴図を最適化されたキーポイント特徴図とする。 Next, for each keypoint feature map, a relationship feature map with the center point feature map is extracted, and the keypoint feature map is updated based on the related feature map. Here, the specific method of extracting the relationship feature map with the center point feature map for each keypoint feature map is similar to the method of extracting the relationship feature map between keypoint feature maps. For example, a new feature map is generated by performing a convolution process on the keypoint feature map and the center point feature map. In this new feature map, the relationship between the keypoint and the center point is fused. Also, let this new feature map be the optimized keypoint feature map.

キーポイント特徴図毎に、中心点特徴図との関係特徴図を抽出し、得られた関係特徴図でキーポイント特徴図を置き換えることにより、各キーポイント特徴図において中心に関連する特徴が加えられる。さらに、後続するステップS202で各キーポイント間の関係を抽出する際には、これらキーポイントの間の空間関係を抽出すると共に、骨格関係を抽出することができる。このような関係はより正確である。 For each keypoint feature map, a relationship feature map with the center point feature map is extracted, and the keypoint feature map is replaced with the resulting relationship feature map, so that center-related features are added to each keypoint feature map. Furthermore, when the relationships between the keypoints are extracted in the subsequent step S202, skeletal relationships can be extracted along with the spatial relationships between these keypoints. Such relationships are more accurate.

図2に戻る。ステップS202の後に、処理はステップS203へ進む。ステップS203では、各キーポイント特徴図とそれに対応する関係特徴図に基づいて、当該キーポイント特徴図を更新する。 Return to Figure 2. After step S202, the process proceeds to step S203. In step S203, based on each keypoint feature map and its corresponding relational feature map, the keypoint feature map is updated.

具体的には、各キーポイント特徴図とそれに対応する関係特徴図を入力として、畳込み処理を実行することにより、キーポイント特徴図とそれに対応する関係特徴図における特徴を融合する。そして、畳込みにより得られた特徴図に基づいて、各キーポイント特徴図を更新する。ここで、畳込みニューラルネットワークを、例えば、融合ネットワークと称する。もちろん、ここの融合ネットワークと、ステップS201でキーポイント特徴図を抽出するための特徴抽出ネットワークおよびステップS202でキーポイント特徴図間の関係特徴図を抽出するための関係抽出ネットワークとで、具体的なネットワークパラメータ(例えば、ネットワーク深度、ノードウェイトなど)は異なる。また、畳込みニューラルネットワークの具体的な構造は、本分野の当業者に周知されるものであり、具体的な機能設計要求に応じて、畳込みニューラルネットワークの層数を適宜に増加または減少させ、各ノードの重みパラメータを修正することができる。キーポイント特徴図および対応するすべての関係特徴図を複数のチャネルの入力として畳込みニューラルネットワークに提供する。そして、ニューラルネットワークから1つのチャネルの特徴図を出力し、キーポイント特徴図と対応するすべての関係特徴図の畳込み結果とする。出力された特徴図でキーポイント特徴図を置き換えることにより、キーポイント特徴図が更新される。 Specifically, each keypoint feature map and its corresponding relationship feature maps are taken as input, and convolution processing is performed to fuse the features in the keypoint feature map and its corresponding relationship feature maps. Each keypoint feature map is then updated based on the feature map obtained by the convolution. Here, the convolutional neural network is called, for example, a fusion network. Of course, the specific network parameters (e.g., network depth, node weights) of this fusion network differ from those of the feature extraction network used to extract the keypoint feature maps in step S201 and of the relationship extraction network used to extract the relationship feature maps between keypoint feature maps in step S202. The specific structure of a convolutional neural network is well known to those skilled in the art; the number of layers can be increased or decreased as appropriate, and the weight parameters of each node modified, according to specific functional design requirements. The keypoint feature map and all of its corresponding relationship feature maps are provided to the convolutional neural network as a multi-channel input. The network then outputs a single-channel feature map as the convolution result of the keypoint feature map and all of its corresponding relationship feature maps. The keypoint feature map is updated by replacing it with the output feature map.

前記畳込み処理において、各関係特徴図はそれぞれ対応する第2重みを有する。第2重みは、初期値を基に、後述するトレーニングを通して学習することにより得られる。可能な実施形態として、第2重みの初期値は以下のように取得する。関係特徴図におけるすべての画素点の平均画素値または最大画素値に基づいて、前記第2重みの初期値を確定する。具体的には、関係特徴図に含まれる画すべての画素の値の中の最大値または平均値を確定し、その最大値または平均値に基づいて、当該関係特徴図に対応する第2の重みを確定する。 In the convolution processing, each relationship feature map has a corresponding second weight. The second weight is obtained by learning, through the training described later, starting from an initial value. As a possible embodiment, the initial value of the second weight is obtained as follows: it is determined based on the average pixel value or the maximum pixel value of all pixels in the relationship feature map. Specifically, the maximum or average of the values of all pixels contained in the relationship feature map is determined, and the second weight corresponding to that relationship feature map is determined based on that maximum or average.
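A minimal sketch of the initial value of the second weight follows. The text says only that the initial value is determined "based on" the mean or maximum pixel value; using that statistic directly as the weight, and the function name, are assumptions made here for illustration.

```python
def initial_second_weight(relation_map, mode="mean"):
    """Initial second weight from the mean or maximum pixel value of a relationship feature map."""
    values = [value for row in relation_map for value in row]
    if mode == "max":
        return max(values)
    if mode == "mean":
        return sum(values) / len(values)
    raise ValueError("mode must be 'mean' or 'max'")


relation_map = [[0.0, 0.5], [1.0, 0.5]]
w_mean = initial_second_weight(relation_map, "mean")
w_max = initial_second_weight(relation_map, "max")
```

These values would then serve only as starting points that the training described later adjusts.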

図5はキーポイント特徴図を更新するためのネットワークを示す図である。図5は、便宜のため、3つのキーポイント特徴図を例にして、キーポイント特徴図毎に他のキーポイント特徴図との関係特徴図を特定する例を示す。 FIG. 5 is a diagram showing a network for updating keypoint feature maps. For the sake of convenience, FIG. 5 shows an example of specifying a relational feature diagram with other keypoint feature diagrams for each keypoint characteristic diagram, using three keypoint characteristic diagrams as an example.

図5では、ステップS201で抽出されたキーポイントi、j、kの特徴図をそれぞれブロック501、502、503で示す。ここで、キーポイントiはある一種のキーポイント、例えば鼻キーポイントだけを表すが、キーポイントiの特徴図においてはキーポイントiしか含まれないことを意味するものではない。例えば、複数の人の場合は、複数の鼻キーポイントが存在する場合がある。 In FIG. 5, blocks 501, 502, and 503 indicate feature maps of keypoints i, j, and k extracted in step S201, respectively. Here, keypoint i represents only one kind of keypoint, for example, the nose keypoint, but it does not mean that only keypoint i is included in the feature map of keypoint i. For example, for multiple people, there may be multiple nose keypoints.

図5では、6つの異なる畳込みニューラルネットワークをそれぞれステップS202で関係特徴図抽出処理を行う丸504-509で表し、また、得られた各関係特徴図を四角510-515で表す。例えば、504は、入力が2チャネル(即ちキーポイントi特徴図とキーポイントj特徴図)、出力が1チャネル(即ちiとjの間の関係特性図)である畳込みニューラルネットワークを表す。ブロック510、511はキーポイントiとj間の関係特徴図及びキーポイントiとk間の関係特徴図を表す。 In FIG. 5, the six different convolutional neural networks that perform the relationship feature map extraction of step S202 are represented by circles 504 to 509, and the resulting relationship feature maps are represented by squares 510 to 515. For example, 504 represents a convolutional neural network with a 2-channel input (i.e., the feature map of keypoint i and the feature map of keypoint j) and a 1-channel output (i.e., the relationship feature map between i and j). Blocks 510 and 511 represent the relationship feature map between keypoints i and j and the relationship feature map between keypoints i and k, respectively.

また、図5において、丸516～518はそれぞれ3つの異なる畳込みニューラルネットワークを表し、ステップS203の特徴融合処理を実行する。3つの畳込みニューラルネットワークの入力は、入力がいずれも3チャネル(即ち、1つのキーポイント特徴図と2つ関連する関係特徴図)、出力が1チャネル(即ち特徴融合後に得られる新しい特徴図)である。また、3つの畳込みニューラルネットワークにより出力され特徴融合される特徴図は、それぞれブロック519～521で表し、キーポイントi、j、kの更新された特徴図とする。 Also, in FIG. 5, circles 516 to 518 represent three different convolutional neural networks, which perform the feature fusion processing of step S203. Each of the three convolutional neural networks has a 3-channel input (i.e., one keypoint feature map and its two associated relationship feature maps) and a 1-channel output (i.e., the new feature map obtained after feature fusion). The fused feature maps output by the three convolutional neural networks are represented by blocks 519 to 521, respectively, and serve as the updated feature maps of keypoints i, j, and k.

図5に示す破線枠500はすべての畳込みニューラルネットワークを含む。破線枠500に含まれるネットワークの全体を関係コードネットワークとし、キーポイント特徴図を受信して入力とし、任意のキーポイント特徴図間の関係を符号化し、最後に関係符号化された特徴図を出力して、更新されたキーポイント特徴図とする。 The dashed box 500 shown in FIG. 5 contains all convolutional neural networks. The entire network contained in the dashed frame 500 is defined as a relational code network, receives keypoint feature maps as input, encodes relationships between arbitrary keypoint feature maps, and finally outputs relationship-encoded feature maps. and the updated keypoint feature map.
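The wiring of FIG. 5 can be sketched as follows. Each real convolutional neural network is replaced by a stand-in that computes a per-pixel weighted sum over its input channels (equivalent to a single 1×1 convolution without bias), which is an assumption made to keep the sketch short; the function names and weight containers are hypothetical.

```python
from itertools import permutations


def conv1x1(channels, weights):
    """Stand-in for a CNN: per-pixel weighted sum over the input channels."""
    height, width = len(channels[0]), len(channels[0][0])
    return [
        [sum(w * ch[y][x] for w, ch in zip(weights, channels)) for x in range(width)]
        for y in range(height)
    ]


def relation_encode(feature_maps, relation_weights, fusion_weights):
    """Relationship extraction (step S202) followed by feature fusion (step S203)."""
    n = len(feature_maps)
    # One separate 2-channel-in, 1-channel-out network per ordered pair (i, j).
    relation = {
        (i, j): conv1x1([feature_maps[i], feature_maps[j]], relation_weights[(i, j)])
        for i, j in permutations(range(n), 2)
    }
    updated = []
    for i in range(n):
        # Fusion input: the keypoint feature map plus all of its relationship maps.
        inputs = [feature_maps[i]] + [relation[(i, j)] for j in range(n) if j != i]
        updated.append(conv1x1(inputs, fusion_weights[i]))
    return relation, updated


maps = [[[1.0]], [[2.0]], [[3.0]]]  # keypoints i, j, k as 1x1 feature maps
rel_w = {pair: (1.0, 1.0) for pair in permutations(range(3), 2)}
fuse_w = [(1.0, 1.0, 1.0)] * 3
relation_maps, updated_maps = relation_encode(maps, rel_w, fuse_w)
```

With three keypoints this yields six relationship networks and three fusion networks, matching circles 504 to 509 and 516 to 518.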

図2に戻る。ステップS203の後にステップS204に処理へ進む。ステップS204で、更新されたキーポイント特徴図に基づいて、入力された画像における身体部位のキーポイント位置を確定する。例えば、可能な実施形態として、更新されたキーポイント特徴図に基づいて、各キーポイントに対応するヒートマップが得られる。このヒートマップにおいて、各画素の画素値は、その画素がキーポイントであるか否かの確率値である。上記の概率値は0から1の値である。このため、ヒートマップにおいて、局部極値を特定することにより、キーポイント位置が確定される。 Return to FIG. 2. After step S203, the process proceeds to step S204. In step S204, the keypoint positions of the body parts in the input image are determined based on the updated keypoint feature maps. For example, as a possible embodiment, a heat map corresponding to each keypoint is obtained based on the updated keypoint feature maps. In this heat map, the pixel value of each pixel is the probability that the pixel is a keypoint. This probability is a value from 0 to 1. The keypoint positions are therefore determined by identifying local extrema in the heat map.
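Locating keypoints as local extrema of a heat map can be sketched as a scan for pixels that exceed all of their 8-connected neighbors. The probability threshold of 0.5 and the function name are assumptions added here to suppress weak peaks; the text itself only calls for identifying local extrema.

```python
def local_maxima(heatmap, threshold=0.5):
    """Return (x, y, probability) for every strict local maximum at or above threshold."""
    height, width = len(heatmap), len(heatmap[0])
    peaks = []
    for y in range(height):
        for x in range(width):
            value = heatmap[y][x]
            if value < threshold:
                continue
            neighbours = [
                heatmap[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            ]
            if all(value > n for n in neighbours):
                peaks.append((x, y, value))
    return peaks


nose_heatmap = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.2, 0.0],
    [0.1, 0.2, 0.1, 0.6],
    [0.0, 0.0, 0.3, 0.2],
]
peaks = local_maxima(nose_heatmap)
```

Returning every peak rather than only the global maximum allows one heat map to hold several instances of the same keypoint type, for example several noses when the image shows several people.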

このため、本発明の実施例にかかる画像処理方法において、十分な関係感知、即ち、隣接するキーポイント間の関係に加え、遠距離キーポイント間の関係を考慮することにより、入力画像に対して、より正確にキーポイントの位置を特定し、正確な姿勢認識を実現することができる。また、本発明によれば、遮蔽された身体部位のキーポイントでも確実に予測される。例えば、キーポイントが遮蔽される場合は、当該キーポイントと他のキーポイントとの関係特徴図により、そのキーポイントの大まかな位置(例えば、右目が遮蔽された場合、右目-鼻、右目-口、右目-右手などの関係により、その右目の凡その位置が予測される)を予測することができる。 Thus, in the image processing method according to the embodiment of the present invention, sufficient relationship awareness, that is, considering the relationships between distant keypoints in addition to those between adjacent keypoints, makes it possible to locate keypoints in the input image more accurately and to achieve accurate pose recognition. In addition, according to the present invention, keypoints of occluded body parts can also be predicted reliably. For example, when a keypoint is occluded, its approximate position can be predicted from the relationship feature maps between that keypoint and other keypoints (e.g., when the right eye is occluded, its approximate position is predicted from relationships such as right eye-nose, right eye-mouth, and right eye-right hand).

以上に述べられたように、図2におけるステップS201、ステップS202、ステップS203の処理は、具体的な畳込みニューラルネットワークを介して実現される。具体的には、特徴抽出ネットワークを介して入力画像からキーポイント特徴図を抽出し、関係抽出ネットワークを介して関係特徴図を抽出し、融合ネットワークを介して、各キーポイント特徴図及びそれに対応する関係特徴図に基づいて、当該キーポイント特徴図を更新する。 As described above, the processing of steps S201, S202, and S203 in FIG. 2 is implemented by specific convolutional neural networks. Specifically, keypoint feature maps are extracted from the input image by the feature extraction network, relationship feature maps are extracted by the relationship extraction network, and each keypoint feature map is updated by the fusion network based on that keypoint feature map and its corresponding relationship feature maps.

次に、図6を参照しながら特徴抽出ネットワーク、関係抽出ネットワーク、および融合ネットワークをトレーニングする具体的なステップについて説明する。 Next, specific steps for training the feature extraction network, relationship extraction network, and fusion network will be described with reference to FIG.

まず、ステップS601では、トレーニング画像を入力画像として特徴抽出ネットワークに供給する。ここで、トレーニング画像と上記ステップS201の入力画像との違いは、トレーニング画像がマーキングデータを有する。即ち、トレーニングデータにおいて、身体の各部のキーポイントの位置は既知である。 First, in step S601, training images are supplied to the feature extraction network as input images. Here, the difference between the training image and the input image in step S201 is that the training image has marking data. That is, in the training data, the positions of key points on each part of the body are known.

次に、ステップS602では、前記特徴抽出ネットワークから出力されるキーポイント特徴図に基づいて、前記トレーニング画像におけるキーポイントの大まかな位置を確定する。ここで、特徴抽出ネットワークから出力されるキーポイント特徴図は、後述する特徴融合を施されていないため、このようなキーポイント特徴図に基づいて得られるキーポイント位置は比較的に大まかでかつ不正確である。前記キーポイントの大まかな位置と真実のキーポイント位置とのズレに基づいて第1損失関数を確定する。 Next, in step S602, the rough positions of the keypoints in the training image are determined based on the keypoint feature maps output from the feature extraction network. Because the keypoint feature maps output from the feature extraction network have not yet undergone the feature fusion described later, the keypoint positions obtained from them are relatively rough and inaccurate. A first loss function is determined based on the deviation between the rough keypoint positions and the true keypoint positions.

次に、ステップS603で、特徴抽出ネットワーク、関連抽出ネットワーク、および前記融合ネットワークを介して更新されたキーポイント位置と、実際のキーポイント位置とのズレにより、第2損失関数を確定する。 Next, in step S603, a second loss function is determined according to the deviation between the keypoint positions updated through the feature extraction network, the relational extraction network and the fusion network and the actual keypoint positions.

最後にステップS604で、前記第1損失関数と前記第2損失関数に基づいて、前記特徴抽出ネットワーク、前記関係抽出ネットワーク及び前記融合ネットワークのパラメータを調整する。総損失関数は、第1損失関数と第2損失関数に基づいて決定される。例えば、第1損失関数と第2損失関数に対して平均を求めて、総損失関数が得られる。もちろん、本発明はこれに限らない。総損失関数は、別の方法で第1損失関数と第2損失関数を組み合わせて得られる。総損失関数が収束するときに、前記特徴抽出ネットワーク、前記関係抽出ネットワーク、および前記融合ネットワークのトレーニングが終了する。 Finally, in step S604, adjust the parameters of the feature extraction network, the relationship extraction network and the fusion network according to the first loss function and the second loss function. A total loss function is determined based on the first loss function and the second loss function. For example, averaging over the first loss function and the second loss function yields the total loss function. Of course, the present invention is not limited to this. A total loss function is obtained by combining the first loss function and the second loss function in another way. The training of the feature extraction network, the relationship extraction network and the fusion network is finished when the total loss function converges.
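The averaging example for the total loss can be sketched as follows. The text speaks only of a "deviation" between predicted and true keypoint positions; measuring it as the mean squared Euclidean distance is an assumption made here, and the function names are hypothetical.

```python
def position_loss(predicted, truth):
    """Mean squared Euclidean distance between predicted and true keypoint positions."""
    total = sum((px - tx) ** 2 + (py - ty) ** 2
                for (px, py), (tx, ty) in zip(predicted, truth))
    return total / len(truth)


def total_loss(coarse, refined, truth):
    """Average of the first loss (coarse positions) and the second loss (updated positions)."""
    first = position_loss(coarse, truth)
    second = position_loss(refined, truth)
    return (first + second) / 2


truth = [(0, 0), (2, 2)]
coarse = [(1, 0), (2, 4)]   # from the feature extraction network alone (step S602)
refined = [(0, 0), (2, 3)]  # after relationship extraction and fusion (step S603)
loss = total_loss(coarse, refined, truth)
```

Because both terms enter the total, the gradient reaches the feature extraction network through the coarse prediction as well as through the refined one.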

これにより、本発明はエンドツーエンドのアプローチによってキーポイント同士間の関係を学習し、任意の対象のキーポイント関係抽出を容易に適用される。 Thus, the present invention learns the relationships between keypoints by an end-to-end approach and is easily applied to keypoint relationship extraction for arbitrary targets.

以上に述べられたように、本発明の実施例にかかる画像処理方法によれば、入力画像に含まれる身体の各部のキーポイント位置が得られる。また、身体の各部に基づくキーポイント位置は具体的な適用場面に応じて、例えば、モニタリングやマンマシンインタラクションを行うように、イメージ(例えば、仮想人物またはリアルユーザなど)の姿勢をさらに推定することができる。もちろん、本発明はこれに限らない。抽出されたキーポイント位置を使用する必要がある任意の適用場面は、同様に本発明にかかる方法が適用される。 As described above, the image processing method according to the embodiment of the present invention yields the keypoint positions of each body part contained in the input image. Depending on the specific application scenario, the keypoint positions of the body parts can further be used to estimate the pose of a figure (e.g., a virtual character or a real user), for example for monitoring or human-machine interaction. Of course, the present invention is not limited to this. The method according to the present invention is equally applicable to any application scenario that needs the extracted keypoint positions.

以上、本発明にかかる画像処理方法を図1～図6に参照しながら詳細に説明した。以下、本発明にかかる画像処理装置を図7に参照しながら説明する。 The image processing method according to the present invention has been described in detail above with reference to FIGS. 1 to 6. An image processing apparatus according to the present invention is described below with reference to FIG. 7.

画像処理装置700は、キーポイント特徴図抽出手段701、関係特徴図抽出手段702、キーポイント特徴図更新手段703及びキーポイント位置決定手段704を含む。 The image processing apparatus 700 includes keypoint feature map extraction means 701 , relationship feature map extraction means 702 , keypoint feature map update means 703 and keypoint position determination means 704 .

キーポイント特徴図抽出装置701は、入力画像に基づいて、それぞれ身体部位におけるキーポイントに対応する所定数のキーポイント特徴図を抽出する。 The keypoint feature map extracting device 701 extracts a predetermined number of keypoint feature maps corresponding to keypoints in each body part based on the input image.

ここで、対応する身体部位のキーポイントは骨格キーポイントである。骨格キーポイントは、イメージ(例えば、人間、動物、仮想イメージ)の姿勢を表示してイメージの行動を予測するための重要な要素である。このため、特定のイメージに対して、骨格部位を規定する所定数のキーポイントをあらかじめ設定しておくことができる。例えば、図3に示すように、身体の各部に対応するキーポイントを予め設定しておくことができる。 Here, the corresponding body part keypoints are skeleton keypoints. Skeletal keypoints are important elements for displaying the pose of an image (eg, human, animal, virtual image) and predicting the behavior of the image. Therefore, a predetermined number of keypoints defining the skeletal part can be set in advance for a particular image. For example, as shown in FIG. 3, key points corresponding to each part of the body can be set in advance.

キーポイント特徴図抽出装置701は、例えば、畳込みニューラルネットワークによりキーポイント特徴図の抽出処理を実現する。ここで、畳込みニューラルネットワークを例えば、特徴抽出ネットワークと呼ぶ。畳込みニューラルネットワークには、入力層、抑制層、出力層を含む。入力層は識別する画像を受信する。抑制層は入力された画像に対して特徴の解析や抽出を実行する。出力層は最終に抽出された所定数のキーポイント特徴図を出力する。 The keypoint feature map extraction device 701 implements the extraction of keypoint feature maps by, for example, a convolutional neural network. Here, the convolutional neural network is called, for example, a feature extraction network. The convolutional neural network includes an input layer, a hidden layer, and an output layer. The input layer receives the image to be recognized. The hidden layer analyzes and extracts features from the input image. The output layer outputs the predetermined number of keypoint feature maps finally extracted.

抑制層には、畳込み層、プール化層、および全接続層を含む。畳込みニューラルネットワークにおいて、特徴の抽出には複数の畳込み層(直列、並列、クロス接続などの複雑な構造)が使用される。異なるパラメータを持つ複数の畳込み、正規化、およびプール化処理により、異なる階層の特徴が抽出される。まず、畳込みを利用して入力画像をチェックして畳込みを行うことで、畳込みマッピングを得る。次に、通常の校正線形ユニットとバッチ正規化法を利用して畳込みマッピングを正規化し、正規化された畳込みマッピングを得る。更に、正規化された畳込みマッピングに対して最大または平均のプール化処理を行う。畳込み層で特徴抽出後に、出力された特徴図はプール化層に転送され、特徴選択と情報フィルタリングを行う。プール化層に含まれた予め設定されたプール化関数は、特徴図における単一点の結果を隣接する領域の特徴図の統計量に置き換える機能を備える。例えば、最大プール化は、単一の画素点の値を隣接する領域の画素点の最大値に置き換える。平均プール化は、単一の画素点の値を隣接する領域の画素点の平均値に置き換える。多様な特徴を得るには、抑制層に複数の畳込み層と複数のプール化層を設定し、上記過程を複数回繰り返すことができる。なお、複数回のサンプリング過程により多様な多尺度の特徴マッピングが抽出される。 The hidden layer includes convolutional layers, pooling layers, and fully connected layers. In a convolutional neural network, multiple convolutional layers (in complex structures such as series, parallel, and cross connections) are used for feature extraction. Features at different levels are extracted through multiple convolution, normalization, and pooling operations with different parameters. First, the input image is convolved with a convolution kernel to obtain a convolution map. Next, the convolution map is normalized using the usual rectified linear unit and batch normalization to obtain a normalized convolution map. Max or average pooling is then applied to the normalized convolution map. After feature extraction in a convolutional layer, the output feature map is passed to a pooling layer for feature selection and information filtering. The pooling layer contains a preset pooling function that replaces the result at a single point of the feature map with a statistic of the feature map over the adjacent region. For example, max pooling replaces the value of a single pixel with the maximum value of the pixels in the adjacent region, and average pooling replaces it with the average value of those pixels. To obtain diverse features, multiple convolutional layers and multiple pooling layers can be set in the hidden layer, and the above process can be repeated multiple times. Diverse multi-scale feature maps are extracted through the multiple sampling passes.
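Max pooling and average pooling as described above can be sketched on a small feature map. The sketch uses non-overlapping 2×2 windows, a common configuration chosen here as an assumption; the text does not state the window size or stride, and the function name is hypothetical.

```python
def pool2x2(feature_map, mode="max"):
    """Replace each non-overlapping 2x2 window with its maximum or its average."""
    def average(values):
        return sum(values) / len(values)

    reduce_window = max if mode == "max" else average
    pooled = []
    for y in range(0, len(feature_map) - 1, 2):
        row = []
        for x in range(0, len(feature_map[0]) - 1, 2):
            window = [
                feature_map[y][x], feature_map[y][x + 1],
                feature_map[y + 1][x], feature_map[y + 1][x + 1],
            ]
            row.append(reduce_window(window))
        pooled.append(row)
    return pooled


fmap = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
max_pooled = pool2x2(fmap, "max")
avg_pooled = pool2x2(fmap, "avg")
```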

出力層は、抑制層により抽出された多尺度の特徴マッピングに基づいて、所定数の特徴図が最終に出力される。入力画像は実際のユーザ画像である場合は、図3に示すような人体キーポイント設定の場合、出力層から17個のキーポイント特徴図が出力される。 Based on the multi-scale feature maps extracted by the hidden layer, the output layer finally outputs the predetermined number of feature maps. When the input image is an image of a real user and the human body keypoint setting shown in FIG. 3 is used, 17 keypoint feature maps are output from the output layer.

関係特徴図抽出装置702は、前記所定数のキーポイント特徴図に基づいて、キーポイント特徴図毎に別のキーポイント特徴図との複数の関係を表す関係特徴図を抽出する。 Based on the predetermined number of keypoint feature maps, the relationship feature map extracting device 702 extracts a relationship feature map representing a plurality of relationships between each keypoint feature map and another keypoint feature map.

具体的には、上記の特徴図抽出手段702は、各キーポイント特徴図を順次カレントキーポイント特徴図として、以下のステップを繰り返し実行するように構成される。 Specifically, the feature map extraction means 702 is configured to take each keypoint feature map in turn as the current keypoint feature map and repeatedly execute the following steps.
まず、カレントキーポイント特徴図を除く他のキーポイント特徴図を複数のグループに振り分ける。 First, the keypoint feature maps other than the current keypoint feature map are divided into a plurality of groups.

次に、カレントキーポイント特徴図と、他のキーポイント特徴図からなるグループのそれぞれとを入力として、畳込み処理を行う。例えば、ここでの畳込み処理は、畳込みニューラルネットワークを介して実現される。また、例えば、ここでの畳込みニューラルネットワークを関係抽出ネットワークと呼ぶ。もちろん、ここでの畳込みニューラルネットワークは、キーポイント特徴図抽出手段701がキーポイント特性図を抽出するための畳込みニューラルネットワークとで具体的なネットワークパラメータ(例えば、ネットワーク深度、ノードウェイトなど)は異なる。2つ以上のキーポイント特徴図を入力として、畳込みニューラルネットワークに提供する。畳込みニューラルネットワークは、入力したキーポイント特徴図に特徴抽出を行い、入力されたキーポイント特徴図の特徴記述を出力する。入力された各キーポイント特徴図は、キーポイントに関連付けられているため、出力された畳込みの結果としての特徴図は、キーポイント間の関係を表すものとして見なしてもよい。換言すれば、2つ以上のキーポイント特徴図を畳込み処理することにより、カレントキーポイント特徴図と他のキーポイント特徴図との関係が得られる。なお、畳込み処理の結果としての特徴図は関係特徴図と呼ぶ。 Next, the current keypoint feature map and each group of other keypoint feature maps are input, and convolution processing is performed. For example, the convolution processing here is realized via a convolutional neural network. Also, for example, the convolutional neural network here is called a relationship extraction network. Of course, the convolutional neural network here is the convolutional neural network for the keypoint feature diagram extraction means 701 to extract the keypoint characteristic diagram, and the specific network parameters (for example, network depth, node weight, etc.) are different. Two or more keypoint feature maps are provided as inputs to a convolutional neural network. The convolutional neural network performs feature extraction on the input keypoint feature map and outputs a feature description of the input keypoint feature map. Since each input keypoint feature map is associated with a keypoint, the output feature map resulting from the convolution may be viewed as representing the relationship between the keypoints. In other words, by convolving two or more keypoint feature maps, the relationship between the current keypoint feature map and other keypoint feature maps is obtained. Note that the feature map as a result of the convolution process is called a relationship feature map.

例えば、他のキーポイント特徴図からなるグループにおける各特徴図に対応する身体部位と、前記カレントキーポイント特徴図に対応する身体部位との間の関係に基づいて、第1重みの値を確定する。例えば、前述のように、キーポイント間の距離に基づいて、対応する重みを決定する。キーポイント間の距離が短いほど、その関連が大きくなると考えられるため、大きな重みを割り当てる。逆の場合にも同様である。 For example, determine the value of the first weight based on the relationship between the body part corresponding to each feature map in a group of other keypoint feature maps and the body part corresponding to the current keypoint feature map. . For example, as described above, the corresponding weights are determined based on the distances between keypoints. Smaller distances between keypoints are assumed to be more relevant and are therefore assigned higher weights. The same is true in the opposite case.

関係特徴図抽出装置702は、キーポイント特徴図毎に他のキーポイント特徴図における関係特徴図を抽出する。即ち、関係特徴図抽出装置702は、隣接するキーポイント間の関係に加えて、遠距離のキーポイント間の関係も考慮される。 The relationship feature map extraction device 702 extracts, for each keypoint feature map, relationship feature maps with the other keypoint feature maps. That is, the relationship feature map extraction device 702 considers the relationships between distant keypoints in addition to the relationships between adjacent keypoints.

また、他の可能な実施方式として、前記キーポイント特徴図抽出手段701は、入力画像からキーポイント特徴図を抽出した後、さらに、すべてのキーポイントの平均である中心点の中心点特徴図を確定し; キーポイント特徴図毎に、中心点特徴図との関係特徴図を抽出し、前記関連特徴図に基づいて、キーポイント特徴図を更新するように構成される。 As another possible implementation, the keypoint feature map extracting means 701 extracts the keypoint feature map from the input image, and then extracts the center point feature map of the center point, which is the average of all the keypoints. determining; for each keypoint feature map, it is configured to extract a relationship feature map with the center point feature map, and update the keypoint feature map based on said related feature map.

前述のように、中心点特徴図は、次の2つの方法で取得される。1つ目の方法は、すべてのキーポイント特徴図を平均した特徴図を中心点特徴図とする方法である。2つ目の方法は、所定数（例えば17個）のキーポイント特徴図を出力し、更に中心点特徴図をも出力するように、キーポイント特徴図を抽出するための畳込みニューラルネットワークの構造を調整する。 As mentioned above, the center point feature map can be obtained in either of two ways. The first is to take the average of all keypoint feature maps as the center point feature map. The second is to adjust the structure of the convolutional neural network used to extract the keypoint feature maps so that it outputs the center point feature map in addition to the predetermined number (e.g., 17) of keypoint feature maps.

また、キーポイント特徴図毎に前記中心点特徴図との関係特徴図を抽出する具体的な方法は、キーポイント特徴図間の関係特徴図を抽出する方法に類似する。例えば、キーポイント特徴図に対して、それと中心点特徴図に対して畳込み処理を施して新しい特徴図が生成される。この新しい特徴図では、当該キーポイントと中心点と間の関係が融合されている。また、この新しい特徴図を最適化されたキーポイント特徴図とする。 Further, a specific method of extracting a relationship feature map with the center point feature map for each keypoint feature map is similar to a method of extracting a relationship feature map between keypoint feature maps. For example, a new feature map is generated by performing a convolution process on the keypoint feature map and the center point feature map. In this new feature map, the relationship between the keypoint and the center point is fused. Also, let this new feature map be the optimized keypoint feature map.

キーポイント特徴図毎に、中心点特徴図との関係特徴図を抽出し、得られた関係特徴図でキーポイント特徴図を置き換えることにより、各キーポイント特徴図において中心に関連する特徴が加えられる。さらに、後続する関係特徴図抽出手段702が各キーポイント間の関係を抽出する際には、これらキーポイントの間の空間関係を抽出すると共に、骨格関係を抽出することができる。このような関係はより正確である。 For each keypoint feature map, a relationship feature map with the center point feature map is extracted, and the keypoint feature map is replaced with the resulting relationship feature map, so that center-related features are added to each keypoint feature map. Furthermore, when the relationship feature map extraction means 702 subsequently extracts the relationships between the keypoints, skeletal relationships can be extracted along with the spatial relationships between these keypoints. Such relationships are more accurate.

キーポイント特徴図更新手段703は、各キーポイント特徴図とそれに対応する関係特徴図に基づいて、当該キーポイント特徴図を更新する。 The keypoint feature map updating means 703 updates the keypoint feature map based on each keypoint feature map and the relationship feature map corresponding thereto.

Specifically, the keypoint feature map updating means 703 is configured to take each keypoint feature map and its corresponding relationship feature maps as input and to perform convolution processing, thereby fusing the features of the keypoint feature map with those of its relationship feature maps. Each keypoint feature map is then updated based on the feature map obtained by the convolution. The convolutional neural network used here is referred to, for example, as a fusion network. Of course, the fusion network differs in its specific network parameters (for example, network depth and node weights) from the feature extraction network with which the keypoint feature map extracting means 701 extracts the keypoint feature maps, and from the relationship extraction network with which the relationship feature map extracting means 702 extracts the relationship feature maps between keypoint feature maps. The specific structure of a convolutional neural network is well known to those skilled in the art; the number of layers can be increased or decreased and the weight parameters of each node can be modified as appropriate to the specific functional design requirements. A keypoint feature map and all of its corresponding relationship feature maps are provided to the convolutional neural network as a multi-channel input. The network then outputs a single-channel feature map, which is the result of convolving the keypoint feature map with all of its corresponding relationship feature maps. The keypoint feature map is updated by replacing it with this output feature map.
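The multi-channel fusion described above can be illustrated with a minimal sketch. It assumes the simplest possible fusion network, a single 1x1 convolution followed by a ReLU; the function name `fuse_keypoint_map`, the weight values, and the choice of activation are hypothetical and are not taken from the specification, which leaves the network structure open.

```python
import numpy as np

def fuse_keypoint_map(keypoint_map, relation_maps, weights, bias=0.0):
    """Fuse one keypoint feature map with its relationship feature maps.

    keypoint_map:  (H, W) array.
    relation_maps: (R, H, W) array holding the R relationship maps.
    weights:       (R + 1,) array, one weight per input channel
                   (hypothetical values, normally learned in training).
    Returns a single-channel (H, W) map that replaces the keypoint map.
    """
    # Stack the keypoint map and its relationship maps as input channels.
    channels = np.concatenate([keypoint_map[None], relation_maps], axis=0)
    # A 1x1 convolution is a per-pixel weighted sum over the channels.
    fused = np.tensordot(weights, channels, axes=([0], [0])) + bias
    # ReLU activation (an assumption; the source names no activation).
    return np.maximum(fused, 0.0)
```

A real fusion network would typically use larger kernels and several layers; the sketch only shows how many input channels collapse into the one output channel that replaces the original keypoint feature map.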

The keypoint position determining means 704 determines the keypoint positions of the body parts in the input image based on the updated keypoint feature maps. For example, in one possible embodiment, the keypoint position determining means 704 obtains a heat map for each keypoint from the updated keypoint feature maps. In this heat map, the value of each pixel is the probability that the pixel is the keypoint, a value between 0 and 1. The keypoint position is therefore determined by identifying a local extremum in the heat map.
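As a rough illustration of reading a keypoint position from a heat map, the sketch below takes the strongest peak of a single heat map. Using the global maximum is a simplification of the local-extremum search mentioned above, and the function name `keypoint_from_heatmap` and the `min_prob` threshold are hypothetical.

```python
import numpy as np

def keypoint_from_heatmap(heatmap, min_prob=0.0):
    """Return (row, col, prob) of the strongest peak in a 2-D heat map.

    Returns None when the peak probability is below min_prob, which a
    caller could treat as "keypoint not found" (an assumed convention).
    """
    # Index of the largest probability in the flattened map.
    flat_index = int(np.argmax(heatmap))
    # Convert the flat index back to 2-D pixel coordinates.
    row, col = np.unravel_index(flat_index, heatmap.shape)
    prob = float(heatmap[row, col])
    if prob < min_prob:
        return None
    return int(row), int(col), prob
```

One heat map is processed per keypoint, so a full pose is obtained by calling the function once for each updated keypoint feature map.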

Accordingly, the image processing apparatus according to the embodiment of the present invention is fully relation-aware: by considering relationships between distant keypoints in addition to relationships between adjacent keypoints, it can locate the keypoints in the input image more accurately and achieve accurate pose recognition. According to the present invention, the keypoints of occluded body parts are also reliably predicted. For example, when a keypoint is occluded, its approximate position can be predicted from the relationship feature maps between that keypoint and other keypoints (for example, when the right eye is occluded, its approximate position is predicted from relationships such as right eye-nose, right eye-mouth, and right eye-right hand).

As described above, the keypoint feature map extracting means 701 extracts the keypoint feature maps from the input image through the feature extraction network, the relationship feature map extracting means 702 extracts the relationship feature maps through the relationship extraction network, and the keypoint feature map updating means 703 updates each keypoint feature map through the fusion network based on that keypoint feature map and its corresponding relationship feature maps.

The feature extraction network, the relationship extraction network, and the fusion network are trained by the training steps described above with reference to FIG. 6.
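The training is recited in the claims as using two loss functions: a first loss on the rough keypoint positions obtained from the feature extraction network alone, and a second loss on the positions obtained after the relationship extraction and fusion networks. The sketch below shows one way such a pair of losses could be combined. The mean squared distance, the weighted sum, and the names `position_loss` and `total_loss` are assumptions, since the source does not specify the form of either loss.

```python
import numpy as np

def position_loss(predicted, target):
    """Mean squared Euclidean distance between predicted and true keypoints.

    predicted, target: (K, 2) arrays of (x, y) positions for K keypoints.
    """
    diff = np.asarray(predicted, dtype=float) - np.asarray(target, dtype=float)
    return float(np.mean(np.sum(diff ** 2, axis=1)))

def total_loss(coarse, refined, target, coarse_weight=1.0, refined_weight=1.0):
    """Combine the first (coarse) and second (refined) losses.

    coarse:  positions read from the feature extraction network output.
    refined: positions read after relationship extraction and fusion.
    The weighted sum is an assumed way of combining the two terms.
    """
    first = position_loss(coarse, target)
    second = position_loss(refined, target)
    return coarse_weight * first + refined_weight * second
```

In an actual training loop the combined value would be minimized with respect to the parameters of all three networks, for example by gradient descent in a deep learning framework.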

As described above, the image processing apparatus according to the embodiment of the present invention obtains the keypoint position of each body part contained in the input image. Depending on the specific application, these keypoint positions can further be used to estimate the pose of the subject (for example, a virtual character or a real user), for instance for monitoring or human-machine interaction. Of course, the present invention is not limited to this; the apparatus according to the present invention is equally applicable to any application that needs to use the extracted keypoint positions.

The image processing apparatus of the present invention also includes a memory in which a computer program is stored, and a processor. When the processor executes the computer program stored in the memory, the following processing is performed: extracting, based on an input image, a predetermined number of keypoint feature maps, each corresponding to a keypoint of a body part; extracting, based on the predetermined number of keypoint feature maps and for each keypoint feature map, relationship feature maps representing a plurality of relationships with the other keypoint feature maps; updating each keypoint feature map based on that keypoint feature map and its corresponding relationship feature maps; and determining the position of the keypoint of each body part in the input image based on the updated keypoint feature maps.
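The order of the four processing steps can be sketched as a small driver function. The three networks and the position read-out are passed in as callables because the specification leaves their structure open; the function name `locate_keypoints` and all parameter names are hypothetical.

```python
import numpy as np

def locate_keypoints(image, extract_features, extract_relations, fuse, locate):
    """Run the four steps: extract, relate, update, locate.

    extract_features(image)        -> sequence of K keypoint feature maps
    extract_relations(current, others) -> relationship maps for `current`
    fuse(current, relation_maps)   -> updated keypoint feature map
    locate(updated_map)            -> keypoint position
    All four callables are placeholders for the trained networks.
    """
    # Step 1: one feature map per keypoint.
    keypoint_maps = list(extract_features(image))
    updated_maps = []
    for index, current in enumerate(keypoint_maps):
        # Step 2: relate the current map to every other keypoint map,
        # adjacent and distant alike.
        others = keypoint_maps[:index] + keypoint_maps[index + 1:]
        relation_maps = extract_relations(current, others)
        # Step 3: replace the map with its fused version.
        updated_maps.append(fuse(current, relation_maps))
    # Step 4: read one position from each updated map.
    return [locate(updated) for updated in updated_maps]
```

Passing the stages as arguments keeps the sketch independent of any particular network implementation; in practice each callable would wrap a trained model.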

Specifically, the method and apparatus according to the embodiments of the present invention can be implemented with the architecture of the computing device 800 shown in FIG. 8. As shown in FIG. 8, the computing device 800 includes a bus 810, one or more CPUs 820, a read-only memory (ROM) 830, a random access memory (RAM) 840, a communication port 850 connected to a network, input/output components 860, a hard disk 870, and the like. In the computing device 800, memories such as the ROM 830 and the hard disk 870 store the various data and files used in the processing and/or communication of the image processing method of the present invention, as well as the program instructions executed by the CPU. Of course, the architecture shown in FIG. 8 is merely exemplary, and one or more elements of the computing device shown in FIG. 8 may be omitted as needed in an actual device.

Furthermore, the embodiments of the present invention can be implemented as a computer-readable storage medium. The computer-readable storage medium according to the embodiment of the present invention stores computer-readable instructions. When the computer-readable instructions are executed by a processor, the image processing method according to the embodiments of the present invention described above with reference to the figures is performed. The computer-readable storage medium includes, for example and without limitation, volatile memory and/or non-volatile memory. The volatile memory includes, for example, random access memory (RAM) and cache memory. The non-volatile memory includes, for example, read-only memory (ROM), a hard disk, and flash memory.

The image processing method, apparatus, and storage medium according to the embodiments of the present invention have been described above with reference to FIGS. 1 to 8. According to the image processing method, apparatus, and storage medium of the present invention, the positions of all keypoints in the input image are determined by taking the global relationships into account, that is, relationships with distant keypoints in addition to relationships between adjacent keypoints, and by extracting context information from related keypoints to estimate occluded keypoints. As a result, even the keypoints of occluded body parts are reliably predicted.

In this specification, the term "include" and its variants denote non-exclusive inclusion. A process, method, article, or apparatus that includes a series of elements may therefore include, in addition to those elements, other elements that are not explicitly listed or that are inherent to it. Unless otherwise limited, a process, method, article, or apparatus that includes these elements may also contain additional, separate instances of the same elements.

Finally, the series of processes described above includes not only processes executed chronologically in the order described, but also processes executed in parallel or individually rather than in chronological order.

From the above description of the embodiments, it will be clear to those skilled in the art that the present invention can be implemented by software combined with the necessary hardware platform and, of course, can also be implemented entirely in software. Based on this understanding, all or part of the contribution that the present invention makes over the background art can be embodied in the form of a software product. This computer software product is stored in a storage medium such as a ROM/RAM, a floppy disk, or an optical disc, and includes instructions that cause a computer device (such as a PC, a server, or a network device) to execute the methods described in the embodiments, or in parts of the embodiments, of the present invention.

The present invention has been described in detail above. Specific examples have been used in this specification to explain the principles and embodiments of the present invention, but these examples are intended only to aid understanding of its principles and core ideas. Those skilled in the art may modify the specific embodiments and the scope of application based on the ideas of the present invention. The above description should not be construed as limiting the present invention.

Claims

1. An image processing method executed by a processor, comprising:
extracting, based on an input image, a predetermined number of keypoint feature maps, each corresponding to a keypoint of a body part;
extracting, based on the predetermined number of keypoint feature maps and for each keypoint feature map, relationship feature maps representing a plurality of relationships with other keypoint feature maps, the other keypoint feature maps including those of adjacent keypoints that are adjacent to the keypoint corresponding to the keypoint feature map and those of distant keypoints that are not adjacent to it;
updating each keypoint feature map based on that keypoint feature map and its corresponding relationship feature maps representing the plurality of relationships with the other keypoint feature maps; and
determining the positions of the keypoints of the body parts in the input image based on the updated keypoint feature maps.

2. The image processing method according to claim 1, further comprising, before extracting, based on the predetermined number of keypoint feature maps and for each keypoint feature map, the relationship feature maps representing the plurality of relationships between that keypoint feature map and the other keypoint feature maps:
determining a center point feature map of a center point that is the average of all the keypoints; and
extracting a relationship feature map representing the relationship between each keypoint feature map and the center point feature map, and updating the keypoint feature map based on that relationship feature map.

3. The image processing method according to claim 1, wherein extracting, based on the predetermined number of keypoint feature maps and for each keypoint feature map, the relationship feature maps representing the plurality of relationships between that keypoint feature map and the other keypoint feature maps comprises repeatedly performing:
taking each keypoint feature map in turn as a current keypoint feature map, and sorting the keypoint feature maps other than the current keypoint feature map into a plurality of groups each containing at least one keypoint feature map; and
taking the current keypoint feature map and each group of other keypoint feature maps as input and performing convolution processing to obtain a relationship feature map representing the relationship between the current keypoint feature map and each group of other keypoint feature maps.

4. The image processing method according to claim 3, wherein, in the convolution processing, each feature map in a group of other keypoint feature maps has a corresponding first weight, and
an initial value of the first weight is determined based on the relationship between the body part corresponding to each feature map in the group of other keypoint feature maps and the body part corresponding to the current keypoint feature map.

5. The described image processing method, wherein updating each keypoint feature map based on that keypoint feature map and its corresponding relationship feature maps comprises:
taking each keypoint feature map and its corresponding relationship feature maps as input and performing convolution processing; and updating each keypoint feature map based on the feature map obtained by the convolution.

6. The image processing method according to claim 5, wherein, in the convolution processing, each relationship feature map has a corresponding second weight, and
an initial value of the second weight is determined based on the average pixel value or the maximum pixel value of all pixels in that relationship feature map.

7. The described image processing method, wherein each keypoint feature map is extracted from the input image through a feature extraction network, the relationship feature maps are extracted through a relationship extraction network, and each keypoint feature map is updated through a fusion network based on that keypoint feature map and its corresponding relationship feature maps, and
wherein the feature extraction network, the relationship extraction network, and the fusion network are trained by:
providing training images, in which the keypoint position of each body part is known, to the feature extraction network as input images;
determining a rough position of each keypoint in the training image based on the feature map of each keypoint output from the feature extraction network, and determining a first loss function based on the difference between the rough position of each keypoint and the true keypoint position;
determining a second loss function based on the deviation between the position of each keypoint updated through the feature extraction network, the relationship extraction network, and the fusion network and the true keypoint position; and
adjusting parameters of the feature extraction network, the relationship extraction network, and the fusion network based on the first loss function and the second loss function.

8. An image processing apparatus comprising:
keypoint feature map extracting means for extracting, based on an input image, a predetermined number of keypoint feature maps, each corresponding to a keypoint of a body part;
relationship feature map extracting means for extracting, based on the predetermined number of keypoint feature maps and for each keypoint feature map, relationship feature maps representing a plurality of relationships with other keypoint feature maps, the other keypoint feature maps including those of adjacent keypoints that are adjacent to the keypoint corresponding to the keypoint feature map and those of distant keypoints that are not adjacent to it;
keypoint feature map updating means for updating each keypoint feature map based on that keypoint feature map and its corresponding relationship feature maps representing the plurality of relationships with the other keypoint feature maps; and
keypoint position determining means for determining the positions of the keypoints of each body part in the input image based on the updated keypoint feature maps.

9. An image processing apparatus comprising:
a memory in which a computer program is stored; and
a processor connected to the memory,
wherein the processor is configured to implement the image processing method according to any one of claims 1 to 7 by executing the computer program.

10. A program for causing a computer to execute the image processing method according to any one of claims 1 to 7.

11. A computer-readable storage medium storing the program according to claim 10.