JP7479809B2

JP7479809B2 - Image processing device, image processing method, and program

Info

Publication number: JP7479809B2
Application number: JP2019172192A
Authority: JP
Inventors: 寛之内山; 洋東條; 真司山本
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2019-09-20
Filing date: 2019-09-20
Publication date: 2024-05-09
Anticipated expiration: 2039-09-20
Also published as: JP2021051376A

Description

The present invention relates to the detection of persons in images.

In surveillance camera systems, there is a technique that detects an object such as a person in a camera image and determines whether it is the same as an object detected by another camera. When the object to be identified is a person, the object is first detected in the camera image. Next, a matching feature representing characteristics unique to the object is extracted from the object's region. The matching features of objects detected by different cameras are then compared to determine whether the objects are the same.

In Non-Patent Document 1, joint points are extracted from an image of a person, and image features in the vicinity of each joint are then extracted. A matching feature for the whole body is generated from the image features extracted for each joint.

Non-Patent Document 1: C. Su et al., "Pose-driven Deep Convolutional Model for Person Re-identification," IEEE, 2017

With the method of Non-Patent Document 1, when part of a person is occluded and not visible, the image features extracted from the area around an occluded joint are unlikely to contain features useful for person matching. Matching a partially occluded person is therefore likely to fail with that method. The present invention has been made in view of this problem, and its object is to enable appropriate person matching even when part of a person is occluded by another object.

To achieve this object, an image processing device according to the present invention comprises: acquisition means for acquiring, from an image of a person, a plurality of feature points corresponding to a plurality of parts of the person, together with a reliability indicating the likelihood that each feature point is the corresponding part; extraction means for extracting, as a matching feature amount to be compared with the feature amount of a person registered in advance, a feature amount obtained by weighting the feature amounts of the respective parts corresponding to the feature points according to the acquired reliabilities and integrating them; and recognition means for determining that the person in the image is the same person as the person registered in advance when the feature amount of the registered person matches the extracted matching feature amount.
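The reliability-weighted integration and matching described above can be illustrated with a minimal sketch. It is not the claimed implementation: the weighted average, the cosine similarity, the threshold value, and the names `integrate_part_features` and `is_same_person` are all assumptions chosen for illustration.

```python
import math


def integrate_part_features(part_features, reliabilities):
    """Combine per-part feature vectors into one vector.

    Assumption: integration is a reliability-weighted average, so parts
    with low reliability (e.g. occluded joints) contribute less.
    """
    if len(part_features) != len(reliabilities):
        raise ValueError("one reliability is required per part feature")
    dim = len(part_features[0])
    total = sum(reliabilities)
    if total == 0:
        return [0.0] * dim
    return [
        sum(r * f[i] for f, r in zip(part_features, reliabilities)) / total
        for i in range(dim)
    ]


def cosine_similarity(a, b):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_same_person(query_feature, registered_feature, threshold=0.9):
    """Hypothetical matching rule: similarity at or above a threshold."""
    return cosine_similarity(query_feature, registered_feature) >= threshold
```

With part features `[[1.0, 0.0], [0.0, 1.0]]` and reliabilities `[0.9, 0.1]`, the integrated vector is dominated by the first, more reliable part.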

According to the present invention, persons can be matched appropriately even when part of a person is occluded by another object.

Block diagram showing an example of the functional configuration of the image display device according to the embodiment
Block diagram showing an example of the functional configuration of the image feature determination unit
Block diagram showing an example of the hardware configuration
Flowchart showing the flow of processing executed by the image processing device according to the embodiment
Flowchart showing the flow of processing executed by the image processing device
Flowchart showing the flow of processing executed by the image processing device
Diagram illustrating an example of correction of the waist feature point
Diagram illustrating an example of correction of the foot feature points
Diagram illustrating the process of determining the region of an object
Flowchart showing the flow of processing executed by the image processing device
Diagram illustrating the process of correcting feature points outside the partial image region
Diagram illustrating an example of the configuration of the neural network
Flowchart showing the flow of processing for training the neural network
Diagram illustrating an example of the screen display
Diagram illustrating examples of parts of a face
Diagram illustrating an example of the configuration of a subnetwork
Diagram illustrating an example of the configuration of the image integration subnetwork
Diagram illustrating an example of an occluded portion of a person

Embodiments of the present invention are described below.

<Embodiment 1>
FIG. 3 shows an example of the hardware configuration of this embodiment. In FIG. 3, 301 is an image sensor (imaging means) composed of a CCD, CMOS, or the like, which converts a subject image from light into an electric signal. 302 is a signal processing circuit that processes the time-series signal of the subject image obtained from the image sensor 301 and converts it into a digital signal. 301 and 302 are connected to a bus as a camera. 303 is a CPU, which controls the entire device by executing a control program stored in the ROM 304. 304 is a ROM, which stores the control program executed by the CPU 303 and various parameter data. When executed by the CPU 303, the control program causes the device to function as the various means that execute the processes shown in the flowcharts described later. 305 is a RAM, which stores images and various information. The RAM 305 also serves as a work area for the CPU 303 and as a temporary save area for data. 306 is a display. 307 is an input device, such as a mouse or other pointing device or a keyboard, which accepts input from the user. 308 is a communication device such as a network or a bus, which exchanges data and control signals with other communication devices. In this embodiment, the processing corresponding to each step of the flowcharts described later is implemented in software using the CPU 303, but part or all of that processing may be implemented in hardware such as electronic circuits. The image display device of the present invention may be implemented on a general-purpose PC without the image sensor 301 and the signal processing circuit 302, or as a dedicated device. Software (a program) obtained via a network or various storage media may also be executed by a processing device (CPU, processor) such as a personal computer.

Before describing the embodiment, some terms are defined. A feature point is a point associated with a constituent unit of an object composed of multiple parts. In the following description, a feature point is specifically the position (two-dimensional coordinates) of a person's joint in an image. The reliability is calculated for each detected feature point and is a real number from 0 to 1 indicating the likelihood that the part corresponding to that feature point is present in the image. For example, when the position of a person's head is detected as a feature point, the reliability is high if the head appears clearly in the image. Conversely, if the head appears blurred or is occluded by some other object, the reliability of the feature point corresponding to the head is low. In other words, the reliability indicates the likelihood that the position indicated by the feature point is the part corresponding to that feature point. This embodiment uses a person as an example of the monitored object, but the object is not limited to a person and may be another object such as an animal or a car. That is, the embodiment is applicable to any structure composed of multiple parts. In this embodiment, a person is identified using feature amounts of the person's whole body. A person may instead be identified using the face, which is known by names such as "face authentication," "face matching," and "face search."

FIG. 1 shows the configuration of this embodiment. It comprises an image acquisition unit 101, a first detection unit 102, a feature group determination unit 103, a second detection unit 104, a feature point storage unit 105, a region determination unit 106, an image extraction unit 107, an image feature extraction unit 108, a recognition unit 109, a display unit 110, a learning unit 111, and an object storage unit 112.

The image acquisition unit 101 acquires from a camera an image frame capturing an object that has multiple parts. The first detection unit 102 detects the positions of the object's feature points and their reliabilities from the image frame. The method of detecting the positions of a person's joints in an image and their reliabilities is described in detail later. The feature group determination unit 103 determines, based on the positions and reliabilities of the feature points detected by the first detection unit 102, a feature point group used to detect feature points whose reliability is lower than a predetermined value. Combinations of feature points are prepared in advance, and one is chosen according to conditions on the feature point reliabilities. The specific method is described later. When the reliability of a predetermined feature point among those detected by the first detection unit 102 is lower than a predetermined value, the second detection unit 104 detects that feature point from the image by a method different from that of the first detection unit. This detection uses the relative positional relationships between feature points. The specific method is described later. The feature point storage unit 105 stores the detected feature points. The region determination unit 106 determines the region in which the object exists from the feature points. Using a predetermined combination of specific feature points among those detected, it determines the region containing the object that is the target of image feature extraction. The image extraction unit 107 cuts out the region determined by the region determination unit 106 from the image frame. The image feature extraction unit 108 extracts image features for identifying a person from the cut-out partial image, using a neural network or the like. The recognition unit 109 performs image recognition using the extracted image features. In this embodiment, the image recognition is person identification. Specifically, extracted image features are compared with each other to determine whether they belong to the same person. Details are described later. The display unit 110 displays the result of the image recognition on a screen. The learning unit 111 trains the neural network or the like that the image feature extraction unit 108 uses for image feature extraction. The object storage unit 112 stores information on the objects used by the recognition unit 109.

FIG. 2 shows an example of the configuration of the image feature extraction unit 108 in FIG. 1. The image feature extraction unit 108 comprises an out-of-region feature point correction unit 202, an object part extraction unit 203, an intermediate image feature extraction unit 204, a reliability conversion unit 205, a feature integration unit 206, and an image feature output unit 207.

The out-of-region feature point correction unit 202 corrects those feature points, among the feature points extracted by the first detection unit 102 in FIG. 1, that lie outside the partial image region. The object part extraction unit 203 extracts the parts of the object from the image. The intermediate image feature extraction unit 204 extracts a first image feature (intermediate image feature) from the image and the object parts. The reliability conversion unit 205 applies a conversion process to the reliabilities of the feature points extracted by the first detection unit 102. The feature integration unit 206 integrates the output of the intermediate image feature extraction unit 204 with the output of the reliability conversion unit 205. The image feature output unit 207 generates the image feature from the output of the feature integration unit 206.

The operation of the image processing device is described with reference to the flowchart in FIG. 4. The processing shown in the flowchart in FIG. 4 is executed by the CPU 303 in FIG. 3, which is a computer, in accordance with a computer program stored in the ROM 304.

In step 401, an image frame is acquired from the camera. This step corresponds to the operation of the image acquisition unit 101 in FIG. 1.

In step 402, a plurality of feature points associated with the parts of an object are detected from the image acquired in step 401, which captures an object having multiple parts (first detection method). This step corresponds to the operation of the first detection unit 102 in FIG. 1. Step 402 takes the image frame as input and extracts the feature points of a person present in the image together with their reliabilities. For each detected feature point, a reliability indicating the likelihood that the feature point appears in the image is obtained. If the image processing target is a person, the joint positions of the human body can be used as feature points. The feature points detected in this step are five points: the top of the head, the neck, the waist, the right ankle, and the left ankle. Convolutional Pose Machines are used to detect the feature points (Shih-En Wei et al., "Convolutional Pose Machines," IEEE, 2016). In this method, a trained model (neural network) is used to calculate reliability maps that indicate where each joint position is located in the image. Each reliability map is a two-dimensional map, and if the number of joint points is P, there are P+1 maps (one map corresponds to the background). In the reliability map for a given joint point, the position with the highest reliability is regarded as the position of that joint point. The reliability is a real number from 0 to 1 indicating the likelihood that the feature point exists; the closer it is to 1, the more certain it is that the joint point exists. A joint point occluded by another object is extracted from a non-person object, so its plausibility as a human joint decreases. Its positional reliability is therefore lower than that of a joint that is not occluded. Conversely, a joint that is not hidden by another object is extracted cleanly from the person, so its reliability is high.
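Reading joint positions out of the reliability maps can be sketched as follows. This is only an illustration of the "take the position with the highest reliability" rule described above; it assumes the P joint maps are given as nested lists with the background map already excluded, and the function name `keypoints_from_reliability_maps` is hypothetical. Producing the maps themselves (the Convolutional Pose Machines model) is outside the sketch.

```python
def keypoints_from_reliability_maps(maps, names):
    """Return {name: (x, y, reliability)} from per-joint reliability maps.

    maps  : list of P two-dimensional maps (lists of rows of floats in [0, 1]);
            the background map is assumed to have been removed already.
    names : list of P joint names, in the same order as the maps.
    """
    if len(maps) != len(names):
        raise ValueError("one name is required per reliability map")
    keypoints = {}
    for name, rel_map in zip(names, maps):
        best_x, best_y, best_value = 0, 0, -1.0
        for y, row in enumerate(rel_map):
            for x, value in enumerate(row):
                # Keep the first position holding the maximum reliability.
                if value > best_value:
                    best_x, best_y, best_value = x, y, value
        keypoints[name] = (best_x, best_y, best_value)
    return keypoints
```

An occluded joint would typically show up here as a keypoint whose third element (the peak reliability) is small, which is the value the later steps compare against their thresholds.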

The feature points of the object and their reliabilities may be detected by a method other than Convolutional Pose Machines. For example, a rule-based method may be used to identify each joint point from image features extracted for each joint point of the human body. Alternatively, image features of the head may be extracted from the image, and the position of the torso estimated from the position at which the head was extracted. This embodiment uses the joint points of the human body as feature points, but if the image processing target is a face, facial feature points can be used. Facial feature points may include the centers and end points of parts such as the eyes, eyebrows, nose, mouth, and ears, points on the contours of those parts, and points on the contour of the overall face shape.

In step 403, the feature point group to be used in the second detection method is determined. Step 403 corresponds to the operation of the feature group determination unit 103 in FIG. 1. Several combination patterns of feature points are prepared, and one is selected according to conditions on the feature point reliabilities. The selected group is used by the second detection method in step 404. A feature point group contains the feature points (here, the head, neck, or waist) used to determine the corrected position. In this embodiment, the predetermined feature points subject to correction are the waist, the right ankle, and the left ankle. Because the right ankle and the left ankle are corrected by the same procedure, only the correction of the right ankle is described. Hereinafter, the ankle being processed is simply referred to as the "ankle."

The operation of step 403 is described with reference to the flowchart in FIG. 5. As described below, six feature point groups, A1, A2, A3, B1, B2, and B3, are prepared in advance as candidates for the feature point group used for correction. One of the groups A1, A2, and A3, which concern correction of the waist, and one of the groups B1, B2, and B3, which concern detection of the ankle in the second detection method, are determined according to the conditions.

Details are given later, but the groups are as follows. Feature point group A1 is an empty set, and the detection result of the first detection unit is adopted as is. With feature point group A2, the waist position is detected from the head and neck positions in the current frame. With feature point group A3, the current waist position is detected from the head and waist positions in a past frame. Feature point group B1 is an empty set, and the detection result of the first detection unit is adopted as is. With feature point group B2, the ankle position is detected from the neck and waist positions in the current frame. With feature point group B3, the ankle position in the current frame is detected from the neck and ankle positions in a past frame.

In step 501 of FIG. 5, it is evaluated whether the reliability of the waist in the current frame, determined in step 402, is equal to or greater than a predetermined threshold. If it is, the process proceeds to step 503; otherwise, it proceeds to step 502.

In step 502, it is evaluated whether the reliability of the waist in the past frame, stored in the feature point storage unit 105, is equal to or greater than a threshold. If it is, the process proceeds to step 505; otherwise, it proceeds to step 504. The past frame is the image frame acquired in step 401 of the immediately preceding iteration of the loop in the flowchart of FIG. 4. However, if no feature points from a past frame are stored in the feature point storage unit 105, that is, if step 403 in FIG. 4 is being executed for the first time, the process proceeds to step 504.

In step 503, feature point group A1 is determined as the feature point group to be used in the second detection method, and the process proceeds to step 506. Feature point group A1 is determined when the waist feature point in the current frame is reliable, so the waist feature point does not need to be detected again in subsequent processing.

In step 504, feature point group A2 is determined as the feature point group to be used in the second detection method, and the process proceeds to step 506. Feature point group A2 is determined when the waist joint point is unreliable in both the current frame and the past frame; in subsequent processing, the waist position in the current frame is detected from the head and neck positions in the current frame.

In step 505, feature point group A3 is selected as the feature point group to be used for correction, and the process proceeds to step 506. Feature point group A3 is selected when the waist feature point in the current frame is unreliable but the waist feature point in the past frame is reliable; in subsequent processing, the current waist position is corrected from the head and waist positions in the past frame.

In step 506, it is evaluated whether the reliability of the ankle in the current frame, determined in step 402, is equal to or greater than a predetermined threshold. If it is, the process proceeds to step 508; otherwise, it proceeds to step 507.

In step 507, it is evaluated whether the reliability of the ankle in the past frame, stored in the feature point storage unit 105, is equal to or greater than a predetermined threshold. If it is, the process proceeds to step 510; otherwise, it proceeds to step 509. However, if no feature points from a past frame are stored in the feature point storage unit 105, that is, if step 403 in FIG. 4 is being executed for the first time, the process proceeds to step 509.

In this embodiment, the thresholds used in S501, S502, S506, and S507 are set to different values, but they may all be the same value.

In step 508, feature point group B1 is selected as the feature point group to be used for correction, and the processing of the flowchart in FIG. 5 ends. Feature point group B1 is selected when the foot feature point in the current frame is reliable, so the foot position does not need to be detected in later processing.

In step 509, feature point group B2 is selected as the feature point group to be used for correction, and the processing of the flowchart in FIG. 5 ends. Feature point group B2 is selected when the foot position is unreliable in both the current frame and the past frame; in subsequent processing, the foot position in the current frame is detected from the neck and waist positions in the current frame.

In step 510, feature point group B3 is selected as the feature point group to be used for correction, and the processing of the flowchart in FIG. 5 ends. Feature point group B3 is selected when the foot feature point is unreliable in the current frame but reliable in the past frame; in subsequent processing, the foot position in the current frame is detected from the neck and foot positions in the past frame.

The description of steps 506, 507, 508, 509, and 510 above covers only one ankle (the right ankle), but the feature point group used in the second detection method is determined in the same way for the other ankle (the left ankle). To detect the ankle position, it is preferable to estimate it from a feature point as close to the ankle as possible. Therefore, when the waist position can be used (its reliability is high), the ankle position is detected using the waist position. When the waist position is not known (its reliability is low), the ankle position is detected using the neck position, the neck being the next closest part to the ankle after the waist. The processing order described below follows this intent, but the order may be changed. The feature point group may also be determined so that only the ankle position is detected, without detecting the waist position.
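The branching of FIG. 5 described in steps 501 to 510 can be summarized in a short sketch. The function name `select_feature_point_group`, its parameters, and the use of `None` to represent "no past frame stored" are assumptions for illustration; the group labels A1 to A3 and B1 to B3 follow the text.

```python
def select_feature_point_group(prefix, current_reliability,
                               past_reliability,
                               current_threshold, past_threshold):
    """Choose group 1, 2, or 3 for the waist (prefix "A") or ankle (prefix "B").

    past_reliability is None when no past-frame feature points are stored,
    i.e. the first time step 403 is executed.
    """
    # Steps 501 / 506: the current-frame detection is reliable, so keep it.
    if current_reliability >= current_threshold:
        return prefix + "1"
    # Steps 502 / 507: fall back to the past frame only if it is reliable.
    if past_reliability is not None and past_reliability >= past_threshold:
        return prefix + "3"
    # Neither frame is reliable: estimate from other current-frame points.
    return prefix + "2"


def determine_groups(waist_now, waist_past, ankle_now, ankle_past,
                     thresholds=(0.5, 0.5, 0.5, 0.5)):
    """Run the waist branch, then the ankle branch (one ankle at a time).

    thresholds holds the values for S501, S502, S506, S507; the 0.5
    defaults are placeholders, not values from the source.
    """
    t501, t502, t506, t507 = thresholds
    waist_group = select_feature_point_group("A", waist_now, waist_past, t501, t502)
    ankle_group = select_feature_point_group("B", ankle_now, ankle_past, t506, t507)
    return waist_group, ankle_group
```

For example, a confidently detected waist combined with an ankle that is occluded now but was visible in the previous frame would give `("A1", "B3")`. The same call is repeated for the other ankle.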

In step 404 of FIG. 4, the predetermined feature points are detected by the second detection method using the feature point group determined in step 403. The processing of step 404 corresponds to the second detection unit 104 in FIG. 1. The operation of step 404 is described with reference to the flowchart in FIG. 6. In the processing of FIG. 6, the predetermined feature points (such as the ankle position) are detected based on the feature point groups A1, A2, A3, B1, B2, and B3 determined in the processing of the flowchart in FIG. 5.

As in step 403 of FIG. 4, the right ankle and the left ankle are handled by the same procedure, so only the detection of the right ankle is described. Hereinafter, the ankle being processed is simply referred to as the "ankle."

In step 601 of FIG. 6, it is determined which of the waist-related feature point groups A1, A2, and A3 has been selected. If feature point group A1 has been selected, the process proceeds to step 602; if A2, to step 603; if A3, to step 604. In steps 602, 603, and 604, the position of the waist feature point is handled according to the second detection method.

ステップ６０２は、腰の特徴点の位置を検出しない。なぜなら、以前の処理で腰の特徴点の信頼度があるしきい値より大きく、信頼できると考えられるためである。 Step 602 does not detect the location of the waist feature points because the reliability of the waist feature points in previous processing is greater than a certain threshold and is therefore considered reliable.

ステップ６０３は、現在の画像フレームで検出された頭と首の位置から、腰の位置を検出する。図７を用いて処理を説明する。図７（ａ）のように、図４のステップ４０２によって、頭頂７０１、首７０２、腰７０３、右足首７０４、左足首７０５の特徴点が検出されている。まず、図７（ｂ）のように、頭と首を結ぶ直線７０６を計算する。また、頭と首の間の距離をそれぞれの位置座標から計算する。ここで、人体の頭と首の距離と頭と腰の距離の比は、個人差はあるものの、およそ同じであると仮定できる。このため、腰の位置が、頭と首を結ぶ直線上となり、頭と首の距離と頭と腰の距離の比が所定のものとなるように検出する。図７（ｃ）に補正後の腰の特徴点７０７の例を示す。この所定の比は、例えば平均的な成人の人体部位の比から定めることができる。 In step 603, the position of the waist is detected from the positions of the head and neck detected in the current image frame. The process will be described with reference to FIG. 7. As shown in FIG. 7(a), the feature points of the top of the head 701, the neck 702, the waist 703, the right ankle 704, and the left ankle 705 are detected by step 402 in FIG. 4. First, as shown in FIG. 7(b), a straight line 706 connecting the head and neck is calculated. Also, the distance between the head and neck is calculated from the respective position coordinates. Here, it can be assumed that the ratio of the head-neck distance and the head-waist distance of the human body is approximately the same, although there are individual differences. For this reason, the position of the waist is detected so that it is on the straight line connecting the head and neck, and the ratio of the head-neck distance and the head-waist distance is a predetermined value. An example of the feature point 707 of the waist after correction is shown in FIG. 7(c). This predetermined ratio can be determined, for example, from the ratio of the parts of the average adult human body.
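The ratio-based placement described for step 603 can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function name `extrapolate_along_line` and the ratio value 3.0 are hypothetical, and image coordinates are assumed to be (x, y) pairs.

```python
def extrapolate_along_line(start, through, ratio):
    """Return the point on the line start -> through whose distance from
    `start` is `ratio` times the start-through distance."""
    sx, sy = start
    tx, ty = through
    return (sx + ratio * (tx - sx), sy + ratio * (ty - sy))


# Hypothetical predetermined ratio: (head-waist distance) / (head-neck distance).
HEAD_WAIST_OVER_HEAD_NECK = 3.0

head = (100.0, 50.0)
neck = (100.0, 80.0)
waist = extrapolate_along_line(head, neck, HEAD_WAIST_OVER_HEAD_NECK)
print(waist)
```

The same helper would apply to step 607 by passing the neck and waist positions together with a neck-ankle to neck-waist ratio.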

ステップ６０４は、過去フレームでの頭と腰の位置から現在の腰の位置を検出する。まず、特徴点記憶部１０５で記憶された過去のフレームの特徴点から、頭と腰の距離を計算する。次に、現在のフレームにおいて、図７（ｂ）と同様に、頭と首を結ぶ直線を計算する。ここで、過去のフレームにおける頭と腰の距離と現在のフレームにおける頭と腰の距離はおよそ同じであると仮定する。そして、腰の位置が頭と首を結ぶ直線上となり、現在のフレームにおける頭と腰の距離が過去のフレームにおける頭と腰の距離と等しくなるように、現在のフレームにおける腰の位置を検出する。 In step 604, the current waist position is detected from the head and waist positions in the previous frame. First, the distance between the head and waist is calculated from the feature points of the previous frame stored in the feature point storage unit 105. Next, in the current frame, a straight line connecting the head and neck is calculated as in FIG. 7(b). Here, it is assumed that the distance between the head and waist in the previous frame and the distance between the head and waist in the current frame are approximately the same. Then, the position of the waist in the current frame is detected so that the position of the waist is on the straight line connecting the head and neck, and the distance between the head and waist in the current frame is equal to the distance between the head and waist in the previous frame.

図６のステップ６０５では足首に関する特徴点群Ｂ１、Ｂ２、Ｂ３のいずれが選択されているか判定する。特徴点群Ｂ１が選択されていたらステップ６０６に進み、特徴点群Ｂ２が選択されていたらステップ６０７に進み、特徴点群Ｂ３が選択されていたらステップ６０８に進む。ステップ６０７、ステップ６０８では、足首の特徴点の位置を検出する。ステップ６０６は、足首の特徴点の位置を検出しない。 In step 605 of FIG. 6, it is determined which of the ankle-related feature point groups B1, B2, or B3 has been selected. If feature point group B1 has been selected, the process proceeds to step 606; if feature point group B2 has been selected, the process proceeds to step 607; if feature point group B3 has been selected, the process proceeds to step 608. In steps 607 and 608, the positions of the ankle feature points are detected. In step 606, the positions of the ankle feature points are not detected.

ステップ６０７は、現在フレームでの首と腰の位置から、足首の位置を検出する。図８を用いて処理を説明する。図８（ａ）のように、図４のステップ４０２によって、頭頂８０１、首８０２、腰８０３、右足首８０４、左足首８０５の特徴点が検出されている。まず、図８（ｂ）のように、首と腰を結ぶ直線８０６（体軸）を計算する。また、首と腰の間の距離をそれぞれの位置座標から計算する。ここで、人体の首と腰の距離と首と右足首の距離の比は、個人差はあるものの、およそ同じであると仮定できる。このため、足首の位置が、首と腰を結ぶ直線上となり、首と腰の距離と首と足首の距離の比が所定のものとなるように検出する。図８（ｃ）に足首８０７の特徴点の検出後の例を示す。 In step 607, the position of the ankle is detected from the positions of the neck and waist in the current frame. The process will be described with reference to FIG. 8. As shown in FIG. 8(a), the feature points of the top of the head 801, neck 802, waist 803, right ankle 804, and left ankle 805 have been detected by step 402 in FIG. 4. First, as shown in FIG. 8(b), a straight line 806 (body axis) connecting the neck and waist is calculated. The distance between the neck and waist is also calculated from their position coordinates. Here, it can be assumed that the ratio between the neck-waist distance and the neck-ankle distance of the human body is approximately constant, although there are individual differences. For this reason, the ankle is detected so that it lies on the straight line connecting the neck and waist and so that the ratio between the neck-waist distance and the neck-ankle distance takes a predetermined value. FIG. 8(c) shows an example of the ankle feature point 807 after detection.

ステップ６０８は、過去フレームでの首と足首の位置から現在のフレームでの足首の位置を検出する。まず、特徴点記憶部１０５で記憶された過去のフレームの特徴点から、首と足首の距離を計算する。次に、現在のフレームにおいて、図８（ｂ）と同様に、首と腰を結ぶ直線（体軸）を計算する。ここで、過去のフレームにおける首と足首の距離と現在のフレームにおける首と足首の距離はおよそ同じであると仮定する。そして、足首の位置が体軸上となり、現在のフレームにおける首と足首の距離が過去のフレームにおける首と足首の距離と等しくなるように、現在のフレームにおける足首の位置を検出する。 In step 608, the position of the ankle in the current frame is detected from the positions of the neck and ankle in a past frame. First, the distance between the neck and ankle is calculated from the feature points of the past frame stored in the feature point storage unit 105. Next, in the current frame, a straight line (body axis) connecting the neck and waist is calculated as in FIG. 8(b). Here, it is assumed that the neck-ankle distance in the past frame is approximately the same as the neck-ankle distance in the current frame. Then, the position of the ankle in the current frame is detected so that the ankle lies on the body axis and the neck-ankle distance in the current frame equals the neck-ankle distance in the past frame.
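The distance-preserving placement that uses a past frame can be sketched as below. This is only an illustration under stated assumptions: the function name `point_at_distance` and the sample coordinates are hypothetical, and a single past frame is assumed to supply the reference distance.

```python
import math


def point_at_distance(origin, through, distance):
    """Return the point on the ray origin -> through that lies `distance`
    away from `origin`."""
    ox, oy = origin
    tx, ty = through
    length = math.hypot(tx - ox, ty - oy)
    if length == 0.0:
        raise ValueError("origin and through must be distinct points")
    ux, uy = (tx - ox) / length, (ty - oy) / length  # unit vector of the body axis
    return (ox + distance * ux, oy + distance * uy)


# Past frame: neck and ankle as stored by the feature point storage unit.
past_neck, past_ankle = (200.0, 100.0), (200.0, 340.0)
neck_to_ankle = math.dist(past_neck, past_ankle)

# Current frame: the body axis runs from the neck through the waist.
current_neck, current_waist = (210.0, 110.0), (210.0, 190.0)
ankle = point_at_distance(current_neck, current_waist, neck_to_ankle)
print(ankle)
```

The waist case of step 604 would follow the same pattern, with the head and neck defining the line and the past head-waist distance as the reference.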

以上のステップ６０５、６０６、６０７、６０８の説明では右足首のみを対象としたが、左足首についても同様に検出を行う。この処理によって、足首部分がオクルージョンやノイズによって第１の検出部で上手く検出されない場合でも、より確からしい足首の位置を検出することができる。 In the above explanation of steps 605, 606, 607, and 608, only the right ankle was targeted, but the left ankle is also detected in the same way. This process makes it possible to detect a more accurate position of the ankle even if the ankle is not properly detected by the first detection unit due to occlusion or noise.

図４のステップ４０５では検出された前記特徴点に基づいて、前記物体が存在する領域を決定する。この部分画像領域は、撮像画像における人物が存在する領域を示し、後の処理で人物画像を画像フレームから抽出する領域の指定に用いる。ステップ４０５の動作は図１の領域決定部１０６に該当する。ステップ４０５の処理を図９（ａ）を用いて説明する。図９（ａ）のように、画像フレーム９０３中に頭頂、首、腰、右足首、左足首の特徴点が存在する。まず、右足首と左足首の中点を計算する。そして、頭とその中点を結ぶ直線９０１（体軸）を計算する。本実施形態では、部分画像領域は矩形であり、アスペクト比が事前に定められたものとする。矩形の縦方向が体軸に平行であり、矩形の中心軸が体軸と等しく、矩形の上辺が頭と接し、矩形の下辺が足首と接するように、矩形９０２を決定する。このとき、矩形の上辺と頭の間と、矩形の下辺と足首の間に余白を設けても構わない。例えば、頭と足首の距離（身長）に一定の係数を乗算した大きさの余白を設けても構わない。すなわち、部分画像領域は特徴点の外接矩形を基に決定する。本実施形態では、矩形のアスペクト比は後のニューラルネットワークへの入力を容易にするために固定としたが、後の処理の構成によっては固定でなくても構わない。なお、補正した関節位置を用いる場合、ここで決定した領域には人物の部位が遮蔽されていることや、ノイズが多く出ていることがあり得る。例えば、図１８のように、足首の部位が遮蔽物１８０３によって隠されている場合でも人物の部位を含む領域として決定する。このように領域を決定することで、矩形の中における人体の部位の配置が整合的な部分画像領域を決定できる。部位の配置を整合的にすることで、後段で行う特徴量の抽出処理において、各部位の特徴がより反映された各部位の特徴量を抽出できる効果がある。 In step 405 in FIG. 4, the area in which the object exists is determined based on the detected feature points. This partial image area indicates the area in the captured image in which a person exists, and is used to specify the area in which a person image is extracted from the image frame in a later process. The operation of step 405 corresponds to the area determination unit 106 in FIG. 1. The processing of step 405 will be explained using FIG. 9(a). As shown in FIG. 9(a), the image frame 903 has feature points for the top of the head, the neck, the waist, the right ankle, and the left ankle. First, the midpoint of the right ankle and the left ankle is calculated. Then, a straight line 901 (body axis) connecting the head and the midpoint is calculated. In this embodiment, the partial image area is a rectangle, and the aspect ratio is determined in advance. A rectangle 902 is determined so that the vertical direction of the rectangle is parallel to the body axis, the central axis of the rectangle is equal to the body axis, the upper side of the rectangle is in contact with the head, and the lower side of the rectangle is in contact with the ankles. 
At this time, a margin may be provided between the upper side of the rectangle and the head, and between the lower side of the rectangle and the ankles. For example, the margin may be the head-to-ankle distance (height) multiplied by a fixed coefficient. That is, the partial image area is determined based on the circumscribed rectangle of the feature points. In this embodiment, the aspect ratio of the rectangle is fixed to simplify input to the neural network used later, but it need not be fixed, depending on the configuration of the later processing. Note that when corrected joint positions are used, part of the person may be occluded, or noise may be heavy, within the area determined here. For example, as shown in FIG. 18, even when the ankles are hidden by an obstruction 1803, the area is determined so as to include that part of the person. Determining the area in this way yields a partial image area in which the body parts are arranged consistently within the rectangle. This consistent arrangement has the effect that the feature extraction performed later can extract, for each part, a feature amount that better reflects that part's characteristics.
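The rectangle construction of step 405 can be sketched as follows. This is a rough illustration rather than the embodiment's implementation: the function name `person_rectangle`, the aspect ratio 0.5 (width divided by height), and the margin coefficient are hypothetical values chosen for the example.

```python
import math


def person_rectangle(head, right_ankle, left_ankle, aspect=0.5, margin_coeff=0.0):
    """Return the four corners of a rectangle whose long axis is the body axis
    (head -> midpoint of the ankles), with a fixed width/height aspect ratio."""
    mid = ((right_ankle[0] + left_ankle[0]) / 2.0,
           (right_ankle[1] + left_ankle[1]) / 2.0)
    ax, ay = mid[0] - head[0], mid[1] - head[1]
    height = math.hypot(ax, ay)
    if height == 0.0:
        raise ValueError("head and ankle midpoint coincide")
    ux, uy = ax / height, ay / height   # unit vector along the body axis
    vx, vy = -uy, ux                    # unit vector perpendicular to the axis
    margin = margin_coeff * height      # optional margin above head / below ankles
    top = (head[0] - margin * ux, head[1] - margin * uy)
    bottom = (mid[0] + margin * ux, mid[1] + margin * uy)
    half_w = aspect * (height + 2.0 * margin) / 2.0
    return [
        (top[0] - half_w * vx, top[1] - half_w * vy),
        (top[0] + half_w * vx, top[1] + half_w * vy),
        (bottom[0] + half_w * vx, bottom[1] + half_w * vy),
        (bottom[0] - half_w * vx, bottom[1] - half_w * vy),
    ]


corners = person_rectangle((100.0, 20.0), (90.0, 220.0), (110.0, 220.0))
print(corners)
```

Because the corners are expressed along the body axis, a tilted person produces a tilted rectangle, which step 406 would then rotate upright when cropping.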

図４のステップ４０６では、ステップ４０５で決定した部分画像領域を人物画像として画像フレームから切り出す。ステップ４０５で決定した部分画像領域の矩形が傾斜している場合は、矩形が直立するように画像を回転する。図９（ａ）から切り出した例を図９（ｂ）に図示する。ステップ４０６の動作は図１の画像抽出部１０７に該当する。 In step 406 in FIG. 4, the partial image area determined in step 405 is cut out from the image frame as a person image. If the rectangle of the partial image area determined in step 405 is tilted, the image is rotated so that the rectangle stands upright. An example of the cutout from FIG. 9(a) is shown in FIG. 9(b). The operation of step 406 corresponds to the image extraction unit 107 in FIG. 1.

ステップ４０７では、現在フレームにおける補正後の部位を記憶する。ステップ４０７の動作は図１の特徴点記憶部１０５に該当する。 In step 407, the corrected part in the current frame is stored. The operation of step 407 corresponds to the feature point storage unit 105 in FIG. 1.

ステップ４０８は部分画像領域（人物画像）から特徴量を抽出する。ステップ４０８の動作は図１および図２の画像特徴抽出部１０８に該当する。ステップ４０８の動作を図１０のフローチャートを用いて説明する。 Step 408 extracts features from the partial image region (person image). The operation of step 408 corresponds to the image feature extraction unit 108 in Figs. 1 and 2. The operation of step 408 will be explained using the flowchart in Fig. 10.

図１０のステップ１００１は領域外特徴点補正部２０２が、部分画像領域と特徴点の座標に基づいて、部分画像領域外の特徴点の信頼度を補正する。ステップ１００１は図２の領域外特徴点補正部２０２に該当する。部分画像領域の矩形のアスペクト比が固定である場合、手足を広げているときなど、特徴点が部分画像領域に含まれない場合がある。部分画像領域外にある人体部位は特徴抽出の範囲外であり、この部分における特徴抽出の精度が低下する問題がある。このため、後のステップでその影響を軽減するために、部分領域外の特徴点の信頼度を減少させる調整を施す。例えば、図１１において、右足首１１０４が矩形１１０６の範囲外であり、この右足首の特徴点の信頼度を減少させる。本実施形態では、元の信頼度に１より小さいあらかじめ定めた実数値を乗じた値を補正後の信頼度とする。このように、部分領域外の特徴点の信頼度を減少させることで、部分領域外に人体パーツが配置されたことによる特徴抽出の精度の低下の問題と、遮蔽による特徴抽出の精度の低下の問題を、以降で共通の処理で対処することができる。 In step 1001 in FIG. 10, the outside-area feature point correction unit 202 corrects the reliability of the feature points outside the partial image area based on the coordinates of the partial image area and the feature points. Step 1001 corresponds to the outside-area feature point correction unit 202 in FIG. 2. When the aspect ratio of the rectangle of the partial image area is fixed, the feature points may not be included in the partial image area, such as when the arms and legs are spread apart. The human body parts outside the partial image area are outside the range of feature extraction, and there is a problem that the accuracy of feature extraction in this part decreases. For this reason, in order to reduce the influence in a later step, an adjustment is made to reduce the reliability of the feature points outside the partial area. For example, in FIG. 11, the right ankle 1104 is outside the range of the rectangle 1106, and the reliability of the feature point of this right ankle is reduced. In this embodiment, the value obtained by multiplying the original reliability by a predetermined real value smaller than 1 is set as the reliability after correction. In this way, by reducing the reliability of feature points outside the partial region, the problem of reduced accuracy of feature extraction due to the placement of human body parts outside the partial region and the problem of reduced accuracy of feature extraction due to occlusion can be addressed by using a common process described below.
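The reliability adjustment of step 1001 can be illustrated with the short sketch below. It assumes, for simplicity, that the feature points are expressed in the coordinates of the upright cropped image so that the partial image area is axis-aligned; the function name `attenuate_outside` and the factor 0.5 are hypothetical.

```python
def attenuate_outside(points, reliabilities, rect, factor=0.5):
    """Multiply the reliability of every feature point that lies outside
    rect = (x_min, y_min, x_max, y_max) by `factor` (a value below 1)."""
    x_min, y_min, x_max, y_max = rect
    adjusted = []
    for (x, y), reliability in zip(points, reliabilities):
        inside = x_min <= x <= x_max and y_min <= y <= y_max
        adjusted.append(reliability if inside else reliability * factor)
    return adjusted


points = [(10.0, 10.0), (120.0, 50.0)]   # second point lies outside the area
reliabilities = [0.9, 0.8]
adjusted = attenuate_outside(points, reliabilities, (0.0, 0.0, 100.0, 200.0))
print(adjusted)
```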

ステップ１００２は部分画像領域と特徴点の信頼度から特徴量を抽出する。特徴量の抽出はニューラルネットワークが使用できる。図１２にニューラルネットワークの構成例を示す。図１２のニューラルネットワークは画像１２０１と特徴点信頼度１２０６を入力とし、画像特徴１２１０を出力する。ニューラルネットワークは、画像変換サブネットワーク１２０２、信頼度変換サブネットワーク１２０７、統合サブネットワーク１２０８、特徴出力サブネットワーク１２０９で構成される。画像変換サブネットワーク１２０２は図２の中間画像特徴抽出部２０４に該当する。信頼度変換サブネットワーク１２０７は図２の信頼度変換部２０５に該当する。統合サブネットワーク１２０８は図２の特徴統合部２０６に該当する。特徴出力サブネットワーク１２０９は図２の画像特徴出力部２０７に該当する。 In step 1002, features are extracted from the partial image region and the reliability of the feature points. A neural network can be used to extract the features. FIG. 12 shows an example of the configuration of a neural network. The neural network in FIG. 12 receives an image 1201 and feature point reliability 1206 as input, and outputs an image feature 1210. The neural network is composed of an image conversion sub-network 1202, a reliability conversion sub-network 1207, an integration sub-network 1208, and a feature output sub-network 1209. The image conversion sub-network 1202 corresponds to the intermediate image feature extraction unit 204 in FIG. 2. The reliability conversion sub-network 1207 corresponds to the reliability conversion unit 205 in FIG. 2. The integration sub-network 1208 corresponds to the feature integration unit 206 in FIG. 2. The feature output sub-network 1209 corresponds to the image feature output unit 207 in FIG. 2.

ニューラルネットワークで扱う入力データ、中間データ、出力データはテンソルとして扱われる。テンソルは多次元の配列として表現されるデータで、その次元数は階数とよばれる。階数が０のテンソルはスカラー、階数が１のテンソルはベクトル、階数が２のテンソルは行列と呼ばれる。例えば、チャネル数が１の画像（グレースケール画像など）はサイズＨ×Ｗの階数２のテンソル、またはサイズＨ×Ｗ×１の階数３のテンソルとして扱える。また、ＲＧＢ成分を持つ画像はサイズＨ×Ｗ×３の階数３のテンソルとして扱える。 The input data, intermediate data, and output data handled by neural networks are treated as tensors. Tensors are data represented as multidimensional arrays, and their number of dimensions is called the rank. A tensor with rank 0 is called a scalar, a tensor with rank 1 is called a vector, and a tensor with rank 2 is called a matrix. For example, an image with one channel (such as a grayscale image) can be treated as a rank 2 tensor of size H x W, or a rank 3 tensor of size H x W x 1. Additionally, an image with RGB components can be treated as a rank 3 tensor of size H x W x 3.

テンソルをある次元のある位置で切断した面を取り出したデータおよびその操作をスライスと呼ぶ。例えば、サイズＨ×Ｗ×Ｃの階数３のテンソルを３番目の次元のｃ番目の位置でスライスすることで、Ｈ×Ｗの階数２のテンソルまたはＨ×Ｗ×１の階数３のテンソルが得られる。 Both the data obtained by extracting the cross-section of a tensor at a given position along a given dimension, and the operation of extracting it, are called a slice. For example, slicing a rank-3 tensor of size H x W x C at the c-th position in the third dimension yields a rank-2 tensor of size H x W or a rank-3 tensor of size H x W x 1.
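As a small illustration of slicing, the sketch below represents a rank-3 tensor of size H x W x C as nested Python lists and extracts the rank-2 slice at channel c. The nested-list representation and the function name `slice_channel` are assumptions made for this example only.

```python
def slice_channel(tensor, c):
    """Slice an H x W x C tensor (nested lists) at position c of the third
    dimension, returning an H x W tensor."""
    return [[pixel[c] for pixel in row] for row in tensor]


# A 2 x 2 x 3 tensor, for example a tiny RGB image.
image = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
]
green = slice_channel(image, 1)
print(green)
```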

あるテンソルに畳み込み演算を行う層をコンボリューション層（Ｃｏｎｖと略記）と呼ぶ。畳み込み演算に用いるフィルタの係数を「重み」と呼ぶ。一例として、コンボリューション層によって、Ｈ×Ｗ×Ｃの入力テンソルからＨ×Ｗ×Ｄの出力テンソルを生成する。 A layer that performs a convolution operation on a tensor is called a convolution layer (abbreviated as Conv). The filter coefficients used in the convolution operation are called "weights." As an example, a convolution layer generates an output tensor of HxWxD from an input tensor of HxWxC.

あるベクトルに重み行列を乗算し、バイアスベクトルを加算する操作を行う層を全結合層（ＦＣと略記）と呼ぶ。一例として、長さＣのベクトルから、全結合層を適用することで長さＤのベクトルを生成する。 A layer that multiplies a vector by a weight matrix and adds a bias vector is called a fully connected layer (abbreviated as FC). As an example, a vector of length D is generated from a vector of length C by applying a fully connected layer.
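The fully connected layer described above computes y = Wx + b. A minimal sketch in plain Python follows; the function name `fully_connected` and the sample weights are hypothetical, and the weight matrix is assumed to be stored as D rows of length C.

```python
def fully_connected(x, weights, bias):
    """Apply y = W x + b, where `weights` has D rows of length C and
    `bias` has length D. Returns a vector of length D."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]


x = [1.0, 2.0]                                   # length C = 2
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # D = 3 rows
bias = [0.0, 0.0, 1.0]
y = fully_connected(x, weights, bias)
print(y)
```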

あるテンソルを区間に区切り、その区間の最大値を取ることで、テンソルのサイズを縮小する操作を最大プーリングと呼ぶ。最大値ではなく、区間の平均値をとる場合には平均プーリングと呼ぶ。本実施形態では、最大プーリングを用い、これを行うニューラルネットワークの層を単にプーリング層（Ｐｏｏｌｉｎｇと略記）と呼ぶ。本実施形態では、プーリング層によって、１次元目と２次元目の大きさが入力の半分となるようなテンソルを出力する。具体的には、Ｈ×Ｗ×Ｃの入力テンソルからＨ／２×Ｗ／２×Ｃの出力テンソルを生成する。 The operation of dividing a tensor into intervals and taking the maximum value of the interval to reduce the size of the tensor is called max pooling. When the average value of the interval is taken instead of the maximum value, it is called average pooling. In this embodiment, max pooling is used, and the layer of the neural network that performs this is simply called the pooling layer (abbreviated as Pooling). In this embodiment, the pooling layer outputs a tensor whose first and second dimensions are half the size of the input. Specifically, an output tensor of H/2×W/2×C is generated from an input tensor of H×W×C.
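The halving behaviour of the pooling layer can be sketched as below. The text only states that the first and second dimensions are halved; the 2 x 2 window with stride 2 used here is an assumption, as are the nested-list tensor representation and the function name `max_pool_2x2`.

```python
def max_pool_2x2(tensor):
    """Max pooling over non-overlapping 2 x 2 windows of an H x W x C tensor
    (nested lists), producing an (H//2) x (W//2) x C tensor."""
    h, w, c = len(tensor), len(tensor[0]), len(tensor[0][0])
    return [
        [
            [
                max(tensor[2 * i + di][2 * j + dj][k]
                    for di in (0, 1) for dj in (0, 1))
                for k in range(c)
            ]
            for j in range(w // 2)
        ]
        for i in range(h // 2)
    ]


feature_map = [
    [[1], [5], [2], [0]],
    [[3], [4], [7], [1]],
]  # size 2 x 4 x 1
pooled = max_pool_2x2(feature_map)
print(pooled)
```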

ニューラルネットワークにおいて、通常コンボリューション層の後に適用する非線形関数を活性化関数と呼ぶ。活性化関数として正規化線形関数（ＲｅＬＵと略記）、シグモイド関数などがある。特に、シグモイド関数は出力値の範囲が０から１となる性質がある。本実施形態では、断りがなければ活性化関数としてＲｅＬＵを用いる。 In a neural network, a nonlinear function that is usually applied after a convolution layer is called an activation function. Activation functions include the rectified linear unit (abbreviated as ReLU) and the sigmoid function. In particular, the sigmoid function has the property that its output values lie between 0 and 1. In this embodiment, ReLU is used as the activation function unless otherwise specified.

ニューラルネットワークにおいて、テンソル同士をある次元方向に並べて連結する操作を「連結」と呼ぶ。 In neural networks, the operation of arranging and connecting tensors in a certain dimensional direction is called "concatenation."

Ｇｌｏｂａｌａｖｅｒａｇｅｐｏｏｌｉｎｇについて説明する。階数３のサイズＨ×Ｗ×Ｃのテンソルにおいて、３番目の次元の全ての位置でのスライスに対し、それぞれスライスに含まれる全要素の平均値をとる。そして、このＣ個の平均値を並べることで、長さＣのベクトルを生成する。この操作をＧｌｏｂａｌａｖｅｒａｇｅｐｏｏｌｉｎｇと呼ぶ。 Let us explain Global average pooling. In a rank-3 tensor of size H x W x C, for every slice in the third dimension, we take the average of all elements contained in each slice. Then, by arranging these C average values, we generate a vector of length C. This operation is called Global average pooling.
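Global average pooling as described above can be sketched in a few lines. The nested-list representation of the H x W x C tensor and the function name `global_average_pooling` are assumptions made for illustration.

```python
def global_average_pooling(tensor):
    """Average all H x W elements of each channel of an H x W x C tensor
    (nested lists), returning a vector of length C."""
    h, w, c = len(tensor), len(tensor[0]), len(tensor[0][0])
    return [
        sum(tensor[i][j][k] for i in range(h) for j in range(w)) / (h * w)
        for k in range(c)
    ]


feature_map = [
    [[1.0, 10.0], [2.0, 20.0]],
    [[3.0, 30.0], [4.0, 40.0]],
]  # size 2 x 2 x 2
vector = global_average_pooling(feature_map)
print(vector)
```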

図１２において、ニューラルネットワークの入力となる画像１２０１のサイズは幅Ｗ１、高さＨ１、チャネル数３とする。すなわち、画像はＨ１×Ｗ１×３のテンソルとみなせる。 In Figure 12, the size of the image 1201 that is the input to the neural network is width W1, height H1, and the number of channels is 3. In other words, the image can be considered as a tensor of H1 x W1 x 3.

画像変換サブネットワーク１２０２は画像１２０１を特徴マップに変換する。画像変換サブネットワーク１２０２はさらに前処理サブネットワーク１２０３、パーツ推定サブネットワーク１２０４、画像統合サブネットワーク１２０５で構成される。 The image conversion sub-network 1202 converts the image 1201 into a feature map. The image conversion sub-network 1202 is further composed of a preprocessing sub-network 1203, a part estimation sub-network 1204, and an image integration sub-network 1205.

画像変換サブネットワーク１２０２は、検出された特徴点に対応する部位毎に物体を識別するための特徴量を抽出する。具体的にはＬ．Ｚｈａｏらの論文のように、パーツを推定し、パーツの特徴を抽出するモジュールを含む。画像変換サブネットワーク１２０２は図２の物体部位抽出部２０３に該当する。（Ｌ．Ｚｈａｏｅｔａｌ．“Ｄｅｅｐｌｙ－ＬｅａｒｎｅｄＰａｒｔ－ＡｌｉｇｎｅｄＲｅｐｒｅｓｅｎｔａｔｉｏｎｓｆｏｒＰｅｒｓｏｎＲｅ－Ｉｄｅｎｔｉｆｉｃａｔｉｏｎ，” ＩＥＥＥ，２０１７。）本実施形態では特徴抽出を行うニューラルネットワーク内で物体部位抽出部２０３を動作させるが、このニューラルネットの外で物体部位抽出部２０３を動作させ、外からパーツの位置や大きさに関する情報を与えてもいい。また、物体部位抽出部２０３と図１の第１の検出部１０２は互いに用途を兼ねてもよく、第１の検出部１０２の出力に由来する情報を物体部位抽出部２０３の出力として用いてもよく、その逆を行ってもよい。なお、ここで抽出される部位毎の特徴量は、後の処理で全体特徴量として統合される。その際、特徴点毎の信頼度に応じて各部位の特徴量を全体特徴量に反映する重みづけをする。つまり、信頼度が小さい特徴点に対応する部位から抽出された特徴量が最終的な認識結果に寄与することを抑制する。信頼度が小さい特徴点は物体が遮蔽されていることや、ノイズが多くなっている可能性があり、その部位から抽出された特徴量は必ずしもその物体の部位の特徴を示しているとは限らないためである。このような処理を行うことで、物体の特徴をより反映した特徴量を生成でき、物体の認識精度が向上する効果が期待できる。 The image conversion sub-network 1202 extracts features for identifying an object for each part corresponding to the detected feature points. Specifically, as in the paper by L. Zhao et al., it includes a module that estimates parts and extracts the features of the parts. The image conversion sub-network 1202 corresponds to the object part extraction unit 203 in Figure 2. (L. Zhao et al. "Deeply-Learned Part-Aligned Representations for Person Re-Identification," IEEE, 2017.) In this embodiment, the object part extraction unit 203 is operated within the neural network that performs feature extraction, but the object part extraction unit 203 may be operated outside this neural network and information regarding the position and size of the parts may be provided from outside. In addition, the object part extraction unit 203 and the first detection unit 102 in FIG. 1 may share the same purpose, and information derived from the output of the first detection unit 102 may be used as the output of the object part extraction unit 203, or vice versa. The feature amounts for each part extracted here are integrated as an overall feature amount in a later process. 
At that time, the feature amount of each part is weighted according to the reliability of the corresponding feature point before it is reflected in the overall feature amount. In other words, feature amounts extracted from parts whose feature points have low reliability are suppressed from contributing to the final recognition result. This is because a feature point with low reliability may indicate that the object is occluded or that noise is heavy, so the feature amount extracted from that part does not necessarily represent that part of the object. Such processing can generate a feature amount that better reflects the characteristics of the object, and can be expected to improve object recognition accuracy.

画像変換サブネットワーク１２０２は１つ以上のコンボリューション層（Ｃｏｎｖ）、最大プーリング層（Ｐｏｏｌｉｎｇ）のシーケンスで構成できる。本実施形態では、「Ｃｏｎｖ、Ｃｏｎｖ、Ｐｏｏｌｉｎｇ、Ｃｏｎｖ、Ｐｏｏｌｉｎｇ、Ｃｏｎｖ、Ｐｏｏｌｉｎｇ、Ｃｏｎｖ」といったシーケンスで構成する。構成の概略を図１６（ａ）に示す。画像に画像変換サブネットワークを適用した結果、Ｈ２×Ｗ２×Ｃ２のテンソルを得る。 The image conversion sub-network 1202 can be configured with a sequence of one or more convolution layers (Conv) and max pooling layers (Pooling). In this embodiment, it is configured with a sequence of "Conv, Conv, Pooling, Conv, Pooling, Conv, Pooling, Conv". The outline of the configuration is shown in Figure 16 (a). As a result of applying the image conversion sub-network to an image, a tensor of H2 x W2 x C2 is obtained.

パーツ推定サブネットワーク１２０４は画像変換サブネットワーク１２０２の出力を入力とし、特徴マップであるＨ２×Ｗ２×Ｐ１のテンソルを出力する。ここで、Ｐ１は推定するパーツの数であり、事前に定められた任意の数でよい。このテンソルの３番目の次元の位置ｐでのスライス（サイズがＨ２×Ｗ２×１のテンソル）はｐ番目のパーツの存在位置を示すマスク画像である。それぞれの画素は０から１の値を取り、１に近いほどその位置にそのパーツが存在する確度が高い。パーツ推定サブネットワーク１２０４は１つのコンボリューション層とシグモイド関数で構成される。構成の概略を図１６（ｂ）に示す。パーツ推定ネットワークの構成はこれに限らず、複数のコンボリューション層で構成しても構わない。 The part estimation sub-network 1204 takes the output of the image conversion sub-network 1202 as input, and outputs a tensor of H2 x W2 x P1, which is a feature map. Here, P1 is the number of parts to be estimated, and may be any number determined in advance. A slice of this tensor at position p in the third dimension (a tensor of size H2 x W2 x 1) is a mask image indicating the location of the p-th part. Each pixel takes a value between 0 and 1, and the closer to 1 the pixel is, the higher the probability that the part is present at that position. The part estimation sub-network 1204 is composed of one convolution layer and a sigmoid function. An outline of the configuration is shown in Figure 16 (b). The configuration of the part estimation network is not limited to this, and it may be composed of multiple convolution layers.

画像統合サブネットワーク１２０５は画像変換サブネットワーク１２０２とパーツ推定サブネットワーク１２０４の出力を統合する。図１７に処理の流れを示す。まず、パーツ推定サブネットワークの出力テンソル１７０１の３番目の次元での位置ｐでのスライス１７０２（サイズがＨ２×Ｗ２×１のテンソル）をＣ２個コピーして３番目の次元方向に連結し、サイズＨ２×Ｗ２×Ｃ２のテンソル１７０３に拡張する。そして、このテンソルの各要素について、画像変換サブネットワーク１２０２の出力テンソル１７０４の各要素と乗算することで、新たなテンソル１７０５（サイズＨ２×Ｗ２×Ｃ２）を生成する。そして、このテンソルに対し、ｇｌｏｂａｌａｖｅｒａｇｅｐｏｏｌｉｎｇを適用することで、長さＣ２のベクトル１７０６を生成し、さらに全結合層を適用することで長さＣ３のベクトル１７０７を生成する。この処理をすべてのパーツのチャネルｐに対して適用し、それぞれの生成されたベクトルを連結したベクトル１７０８を生成する。すなわち、画像統合サブネットワークで生成されるベクトル１７０８の長さはＣ３×Ｐ１である。本実施形態では統合対象のデータがベクトルであるが、ベクトルはテンソルの一種であり、統合対象のデータが２階以上のテンソルである場合にも同様に結合によって統合しても構わない。 The image integration sub-network 1205 integrates the outputs of the image conversion sub-network 1202 and the part estimation sub-network 1204. The process flow is shown in FIG. 17. First, the slice 1702 (a tensor of size H2×W2×1) at position p in the third dimension of the output tensor 1701 of the part estimation sub-network is copied C2 times and concatenated along the third dimension, expanding it to a tensor 1703 of size H2×W2×C2. Each element of this tensor is then multiplied by the corresponding element of the output tensor 1704 of the image conversion sub-network 1202 to generate a new tensor 1705 (size H2×W2×C2). Global average pooling is applied to this tensor to generate a vector 1706 of length C2, and a fully connected layer is further applied to generate a vector 1707 of length C3. This process is applied to every part channel p, and the resulting vectors are concatenated into a vector 1708. That is, the length of the vector 1708 generated by the image integration sub-network is C3 × P1. In this embodiment, the data to be integrated are vectors, but a vector is a kind of tensor, and data that are tensors of rank 2 or higher may likewise be integrated by concatenation.
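The flow of FIG. 17 can be sketched in plain Python as follows. This is an illustrative sketch, not the embodiment's network: tensors are nested lists, the function name `integrate_parts` is hypothetical, and sharing one fully connected layer across all parts is an assumption (the text does not say whether the weights are shared).

```python
def integrate_parts(features, masks, fc_weights, fc_bias):
    """features: H2 x W2 x C2, masks: H2 x W2 x P1, fc_weights: C3 rows of
    length C2, fc_bias: length C3. Returns a vector of length C3 * P1."""
    h, w = len(features), len(features[0])
    c2, p1 = len(features[0][0]), len(masks[0][0])
    output = []
    for p in range(p1):
        # Broadcast the p-th mask over all C2 channels, multiply element-wise,
        # then apply global average pooling (vector of length C2).
        pooled = [
            sum(masks[i][j][p] * features[i][j][k]
                for i in range(h) for j in range(w)) / (h * w)
            for k in range(c2)
        ]
        # Fully connected layer: length C2 -> length C3.
        projected = [
            sum(wt * v for wt, v in zip(row, pooled)) + b
            for row, b in zip(fc_weights, fc_bias)
        ]
        output.extend(projected)  # concatenate the per-part vectors
    return output


features = [[[1.0, 2.0], [3.0, 4.0]]]   # 1 x 2 x 2 (C2 = 2)
masks = [[[1.0, 0.0], [0.0, 1.0]]]      # 1 x 2 x 2 (P1 = 2)
identity = [[1.0, 0.0], [0.0, 1.0]]     # C3 = 2
vector = integrate_parts(features, masks, identity, [0.0, 0.0])
print(vector)
```

With C3 = 2 and P1 = 2, the resulting vector has length 4, matching the stated length of vector 1708.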

特徴点信頼度１２０６は長さＣ４のベクトルとする。本実施形態では、図４のステップ４０２で検出される特徴点の数が５なのでＣ４＝５である。 The feature point reliability 1206 is a vector of length C4. In this embodiment, the number of feature points detected in step 402 of FIG. 4 is 5, so C4=5.

信頼度変換サブネットワーク１２０７は、特徴点信頼度１２０６を長さＣ５のベクトルに変換する。信頼度変換サブネットワーク１２０７は０個以上の全結合層で構成できる。本実施形態では１個の全結合層とする。構成の概略を図１６（ｃ）に示す。 The reliability conversion sub-network 1207 converts the feature point reliability 1206 into a vector of length C5. The reliability conversion sub-network 1207 can be composed of zero or more fully connected layers. In this embodiment, it consists of one fully connected layer. An outline of the configuration is shown in FIG. 16(c).

統合サブネットワーク１２０８は画像統合サブネットワーク１２０５の出力ベクトルと信頼度変換サブネットワーク１２０７の出力ベクトルを統合する。統合サブネットワーク１２０８は長さＣ６のベクトルを出力する。本実施形態では、この２つのベクトルを連結する。構成の概略を図１６（ｄ）に示す。そのため、Ｃ６＝Ｃ３×Ｐ１＋Ｃ５となる。 The integration sub-network 1208 integrates the output vector of the image integration sub-network 1205 and the output vector of the reliability conversion sub-network 1207. The integration sub-network 1208 outputs a vector of length C6. In this embodiment, these two vectors are concatenated, so that C6 = C3 × P1 + C5. An outline of the configuration is shown in FIG. 16(d).

特徴出力サブネットワーク１２０９は統合サブネットワーク１２０８の出力ベクトルを入力とし、長さＣ７のベクトルである画像特徴１２１０を出力する。特徴出力サブネットワーク１２０９は１つ以上の全結合層で構成できる。本実施形態では２つの全結合層で構成する。構成の概略を図１６（ｅ）に示す。この画像特徴は、「照合特徴」、「人物特徴」、「ディスクリプタ」、「ｅｍｂｅｄｄｉｎｇ」とも呼ばれる。 The feature output sub-network 1209 takes the output vector of the integration sub-network 1208 as input and outputs the image feature 1210, which is a vector of length C7. The feature output sub-network 1209 can be composed of one or more fully connected layers. In this embodiment, it is composed of two fully connected layers. An outline of the configuration is shown in FIG. 16(e). This image feature is also called a "matching feature", "person feature", "descriptor", or "embedding".

図４のステップ４０９は、ステップ４０８で抽出した人物画像の特徴量を人物データベースに保存してある特徴量と比較する。人物データベースとは、人物同定の対象のＮ人の人物の切り出し画像と特徴量があらかじめ登録されている記憶手段である。事前に人物同定対象の人物の画像を撮影しておき、ステップ４０２からステップ４０８と同様の方法で画像切り出しと特徴量抽出を行い保存してある。人物データベースは図１の物体記憶手段１１２に該当する。ステップ４０９では、人物データベース内の人物の特徴量とステップ４０８で抽出した人物画像の特徴量の距離を計算する。そして、距離順に人物データベース内の人物の並び替えを行い、最も距離の小さい人物を並びの先頭に配置する。ステップ４０９は図１の認識部１０９の処理に該当する。本実施形態では、特徴量の比較にユークリッド距離を用いる。特徴量の比較は他の方法でもよく、Ｌ１距離やコサイン距離などの他の距離指標でもよく、メトリクスラーニングやニューラルネットワークなどの機械学習を利用して比較しても構わない。 In step 409 of FIG. 4, the feature amount of the person image extracted in step 408 is compared with the feature amount stored in the person database. The person database is a storage means in which cut-out images and feature amounts of N people to be identified are registered in advance. An image of the person to be identified is photographed in advance, and the image is cut out and feature amounts are extracted and stored in the same manner as in steps 402 to 408. The person database corresponds to the object storage means 112 in FIG. 1. In step 409, the distance between the feature amount of the person in the person database and the feature amount of the person image extracted in step 408 is calculated. Then, the people in the person database are sorted in order of distance, and the person with the smallest distance is placed at the top of the order. Step 409 corresponds to the processing of the recognition unit 109 in FIG. 1. In this embodiment, Euclidean distance is used to compare the feature amounts. Other methods for comparing the feature amounts may be used, such as other distance indices such as L1 distance and cosine distance, and comparison may be made using machine learning such as metrics learning and neural networks.
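The comparison and sorting of step 409 can be sketched as below, using Euclidean distance as in this embodiment. The function name `rank_gallery`, the dictionary layout of the person database, and the sample feature vectors are hypothetical.

```python
import math


def rank_gallery(query_feature, database):
    """database maps a person identifier to its stored feature vector.
    Returns (identifier, distance) pairs sorted by ascending Euclidean distance."""
    scored = [
        (person_id, math.dist(query_feature, feature))
        for person_id, feature in database.items()
    ]
    return sorted(scored, key=lambda item: item[1])


database = {
    "person_a": [3.0, 4.0],
    "person_b": [1.0, 0.0],
    "person_c": [0.0, 2.0],
}
ranking = rank_gallery([0.0, 0.0], database)
print(ranking)
```

Swapping `math.dist` for an L1 or cosine distance function would give the alternative distance measures mentioned in the text.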

図４のステップ４１０はステップ４０９で該当する人物を画面に表示する。ステップ４１０は図１の画像表示部１１０の処理に該当する。表示画面例を図１４に示す。表示画面１４０１にはクエリ１４０２とギャラリ１４０３で構成される。クエリ１４０２は検索したい人物の画像であり、ステップ４０６で切り出した人物画像を表示する。ギャラリ１４０３は検索結果の一覧であり、ステップ４０９で距離順に並び替えた人物データベース内の画像を順番に上位５人を表示する。この際、上位５人を表示してもいいし、５人の中から距離が事前に定めたしきい値以下の人物だけを表示しても構わない。ギャラリに表示される画像は、図４のステップ４０１からステップ４０７と同様の方法で切り出されてもいいし、他の方法で切り出されたものでよい。クエリとギャラリの人物の画像には、図１４のように、検出した特徴点の位置を示すマーカを重畳表示しても構わない。 Step 410 in FIG. 4 displays the person found in step 409 on the screen. Step 410 corresponds to the processing of the image display unit 110 in FIG. 1. An example of the display screen is shown in FIG. 14. The display screen 1401 is composed of a query 1402 and a gallery 1403. The query 1402 is an image of the person to be searched, and displays the person image extracted in step 406. The gallery 1403 is a list of search results, and displays the top five people in the images in the person database sorted in order of distance in step 409. In this case, the top five people may be displayed, or only people whose distance is less than a predetermined threshold may be displayed from among the five people. The image displayed in the gallery may be extracted in the same manner as in steps 401 to 407 in FIG. 4, or may be extracted in another manner. Markers indicating the positions of detected feature points may be superimposed on the images of the person in the query and gallery, as shown in FIG. 14.
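The selection of gallery entries for display can be sketched as follows: take the top five from the sorted list and, optionally, keep only those within a distance threshold. The function name `select_for_display` and the threshold value are hypothetical.

```python
def select_for_display(ranking, top_k=5, max_distance=None):
    """ranking: (identifier, distance) pairs sorted by ascending distance.
    Keep the first top_k entries, optionally dropping those whose distance
    exceeds max_distance."""
    selected = ranking[:top_k]
    if max_distance is not None:
        selected = [(pid, d) for pid, d in selected if d <= max_distance]
    return selected


ranking = [("p1", 0.2), ("p2", 0.5), ("p3", 0.9), ("p4", 1.4), ("p5", 2.0), ("p6", 2.1)]
top_five = select_for_display(ranking)
within_threshold = select_for_display(ranking, max_distance=1.0)
print(top_five)
print(within_threshold)
```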

図４のステップ４１１はフローチャートの処理を終了するか否かを判定する。本実施形態では、ステップ４１１の実行回数が規定回数以上になった場合、終了すると判定する。そうでなかった場合、ステップ４０１に進み、フローチャートの処理を続行する。 Step 411 in FIG. 4 determines whether or not to end the processing of the flowchart. In this embodiment, if the number of times step 411 is executed reaches or exceeds a specified number, it is determined that the processing should end. If not, the process proceeds to step 401 and continues the processing of the flowchart.

＜ニューラルネットワークの学習＞ <Neural network training>
図１の画像特徴抽出部１０８で使用するニューラルネットワークの学習の方法を図１３のフローチャートを用いて説明する。図１３のフローチャートの処理は図１の学習手段１１１に該当する。 A method of training the neural network used in the image feature extraction unit 108 in FIG. 1 will be described with reference to the flowchart in FIG. 13. The processing of the flowchart in FIG. 13 corresponds to the learning means 111 in FIG. 1.

ニューラルネットワークの構造は上述のように図１２で示される。ニューラルネットワークは画像１２０１と特徴点信頼度１２０６を入力とし、画像特徴１２１０を出力する。 The structure of the neural network is shown in Figure 12, as described above. The neural network takes an image 1201 and feature point reliability 1206 as input, and outputs image features 1210.

ニューラルネットワークはｔｒｉｐｌｅｔｌｏｓｓで学習する。（Ｆ．Ｓｈｒｏｆｆｅｔａｌ．“ＦａｃｅＮｅｔ：ＡＵｎｉｆｉｅｄＥｍｂｅｄｄｉｎｇｆｏｒＦａｃｅＲｅｃｏｇｎｉｔｉｏｎａｎｄＣｌｕｓｔｅｒｉｎｇ，”ａｒＸｉｖ：１５０３．０３８３２）。ｔｒｉｐｌｅｔｌｏｓｓでは、アンカーサンプルと呼ばれるサンプル、ポジティブサンプルと呼ばれるアンカーと同じ人物のサンプル、ネガティブサンプルと呼ばれるアンカーと異なる人物のサンプルで構成される三つ組（ｔｒｉｐｌｅｔ）を使用する。アンカーサンプル、ポジティブサンプル、ネガティブサンプルから得られるそれぞれの特徴量を比較してロス関数を計算することで、ネットワークを更新する。 The neural network is trained using triplet loss (F. Schroff et al., "FaceNet: A Unified Embedding for Face Recognition and Clustering," arXiv:1503.03832). Triplet loss uses triplets, each consisting of a sample called the anchor sample, a sample of the same person as the anchor called the positive sample, and a sample of a different person from the anchor called the negative sample. The network is updated by comparing the feature amounts obtained from the anchor, positive, and negative samples to calculate a loss function.
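For reference, the triplet loss in the form given in the cited FaceNet paper, max(‖a − p‖² − ‖a − n‖² + α, 0), can be sketched as below. The function name `triplet_loss` and the margin value are hypothetical, and the text does not state which margin or distance the embodiment uses.

```python
def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge on (squared anchor-positive distance) minus
    (squared anchor-negative distance) plus a margin."""
    d_pos = sum((a - p) ** 2 for a, p in zip(anchor, positive))
    d_neg = sum((a - n) ** 2 for a, n in zip(anchor, negative))
    return max(d_pos - d_neg + margin, 0.0)


# Negative is already far from the anchor: no loss.
easy = triplet_loss([0.0, 0.0], [1.0, 0.0], [3.0, 0.0])
# Negative is closer than the positive: positive loss.
hard = triplet_loss([0.0, 0.0], [2.0, 0.0], [1.0, 0.0])
print(easy, hard)
```

The loss is zero once the negative sample is farther from the anchor than the positive sample by at least the margin, so only triplets that violate this condition contribute to the weight update.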

図１３のステップ１３０１はネットワークを構成するコンボリューション層と全結合層の重みを初期化する。本実施形態では、重みの初期値として乱数を使用する。 Step 1301 in FIG. 13 initializes the weights of the convolution layer and the fully connected layer that make up the network. In this embodiment, random numbers are used as the initial values of the weights.

ステップ１３０２では学習データ群から学習データをランダムに取得する。１つの学習データは三つ組（ｔｒｉｐｌｅｔ）であり、アンカーサンプル、ポジティブサンプル、ネガティブサンプルを１つずつ含む。アンカーサンプル、ポジティブサンプル、ネガティブサンプルは、それぞれ画像と特徴点信頼度で構成される。画像と特徴点信頼度は図４のフローチャートで使用するニューラルネットワークに入力するものと同様の手順で生成が行われている。 In step 1302, training data is randomly acquired from the training data group. Each item of training data is a triplet containing one anchor sample, one positive sample, and one negative sample. The anchor, positive, and negative samples each consist of an image and feature point reliability. The images and feature point reliabilities are generated by the same procedure as the inputs to the neural network used in the flowchart of FIG. 4.

ステップ１３０３は学習データでネットワークを更新する。まず、アンカーサンプル、ポジティブサンプル、ネガティブサンプルに対し、現在の状態のネットワークを適用して、それぞれ特徴量を計算する。これらの３つの特徴量に対し、ｔｒｉｐｌｅｔｌｏｓｓによってロスを計算する。そして、バックプロパゲーション法によって、ネットワーク内の重みを更新する。 Step 1303 updates the network with the training data. First, the current state of the network is applied to the anchor sample, positive sample, and negative sample to calculate the features for each. Losses are calculated for these three features using triplet loss. Then, the weights in the network are updated using the backpropagation method.
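As a non-limiting illustration, the loss computed in step 1303 from the three features might be sketched as follows, using the squared Euclidean distance and margin formulation of the cited FaceNet paper. The margin value of 0.2, the function names, and the use of plain Python lists as feature vectors are assumptions of this sketch; the embodiment does not state them. The backpropagation update itself is not shown.

```python
from typing import Sequence


def squared_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two feature vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 0.2,  # assumed value; not specified in the embodiment
) -> float:
    """Loss is zero once the negative is farther from the anchor than the
    positive by at least the margin (in squared distance)."""
    d_pos = squared_distance(anchor, positive)
    d_neg = squared_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Negative far from the anchor: the constraint is satisfied, loss is zero.
easy = triplet_loss([0.0, 0.0], [0.1, 0.0], [1.0, 0.0])
# Negative almost as close as the positive: the loss is positive.
hard = triplet_loss([0.0, 0.0], [0.1, 0.0], [0.2, 0.0])
```

A triplet that yields zero loss contributes no gradient, so only triplets that violate the margin change the weights.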

ステップ１３０４で学習を終了するか判定する。ステップ１３０４を規定回数実行した場合、終了すると判定し、図１３のフローチャートの一連の処理を終了する。終了しないと判定した場合、ステップ１３０２に進む。 In step 1304, it is determined whether to end the learning. If step 1304 has been executed a specified number of times, it is determined that the learning has ended, and the series of processes in the flowchart of FIG. 13 is terminated. If it is determined that the learning has not ended, the process proceeds to step 1302.

本実施形態によれば、特徴群決定部１０３および第２の検出部１０４において、良好な特徴点から良好でない特徴点をもう一度検出することができる。そのため、物体の一部が他の物体に遮蔽された状況や外乱を受けている状況においても、領域決定部１０６による物体領域決定の誤りを低減する効果が見込める。 According to this embodiment, the feature group determination unit 103 and the second detection unit 104 can detect poorly detected feature points again by using the well-detected feature points. Therefore, even when part of the object is occluded by another object or is subject to disturbance, errors in object region determination by the region determination unit 106 can be expected to decrease.

物体の一部が他の物体に遮蔽された領域や外乱を受けている領域において、第１の検出部１０２で取得される特徴点の信頼度は正常時よりも低下して出力されると仮定できる。このとき、これらの局所領域から抽出される画像認識のための画像特徴の品質も同時に低下すると考えられる。そのため、画像特徴抽出部１０８において、ある局所領域の信頼性を表す指標として特徴点の信頼度の情報を用いることで、画像特徴の品質の低下を軽減する効果が見込める。したがって、画像認識の精度が向上する効果が見込める。 In a region where part of an object is occluded by another object or is subject to disturbance, the reliability of the feature points output by the first detection unit 102 can be assumed to be lower than normal. The quality of the image features for image recognition extracted from these local regions is likely to decrease as well. Therefore, by using the feature point reliability in the image feature extraction unit 108 as an index of how trustworthy each local region is, the loss of image feature quality can be expected to be mitigated, and the accuracy of image recognition can accordingly be expected to improve.

図１０のステップ１００１は部分画像領域外の特徴点の信頼度を減少させる。部分画像領域外にある人体部位は特徴抽出の範囲外であり、この部分における特徴抽出の精度が低下する問題がある。このため、後のステップでその影響を軽減するために、部分領域外の特徴点の信頼度を減少させることで、画像特徴の品質の低下を軽減する効果が見込める。 Step 1001 in Figure 10 reduces the reliability of feature points outside the partial image region. Body parts outside the partial image region are outside the range of feature extraction, and there is a problem that the accuracy of feature extraction in this part decreases. Therefore, in order to reduce the impact in a later step, the reliability of feature points outside the partial region is reduced, which is expected to have the effect of reducing the deterioration of the quality of image features.

ステップ４０３とステップ４０４において、現在のフレームだけでなく過去のフレームの特徴点も用いて補正に用いる特徴点群の選択と特徴点の補正を行っている。過去のフレームの特徴点を用いることで、現在のフレームで特徴点の信頼度が低い場合においても、特徴点の補正精度を向上させる効果が見込める。 In steps 403 and 404, feature points from past frames as well as the current frame are used to select the group of feature points to be used for correction and to correct the feature points. By using feature points from past frames, it is expected that the accuracy of correction of feature points can be improved even when the reliability of the feature points in the current frame is low.

ステップ４０３において、特徴点の選択を予め定められた順序で行っている。ステップ４０４の特徴点の位置の補正において精度がよりよいと見込まれる特徴点を優先的に選択することで、より正しく特徴点位置を修正できる効果が見込める。 In step 403, feature points are selected in a predetermined order. By preferentially selecting feature points that are expected to have better accuracy in correcting the feature point positions in step 404, it is expected that the feature point positions can be corrected more accurately.

ステップ４０４において、所定の順序で特徴点を補正している。ここでは、腰、足という順番で特徴点を補正している。これは、人物は首、腰、足という順番で体の部位がつながっているためである。まず、腰の位置を修正した後、そのより正しい腰の位置を用いて足を修正することができる。このように、所定の順序で特徴点を比較することで、より正しく特徴点位置を修正できる効果が見込める。 In step 404, the feature points are corrected in a specified order. Here, the feature points are corrected in the order of the waist, then the legs. This is because a person's body parts are connected in the order of neck, waist, legs. First, the position of the waist is corrected, and then the legs can be corrected using this more accurate position of the waist. In this way, by comparing feature points in a specified order, it is expected that the feature point positions can be corrected more accurately.

ステップ４０４において、特徴点間の相対位置関係から特徴点の位置を補正している。実施形態では、特徴点間の距離の比や、特徴点から求められる直線（体軸）を基に特徴点を補正している。このように、物体の構造に関する事前知識を用いることで、より正しく特徴点の位置を修正できる効果が見込める。 In step 404, the positions of the feature points are corrected based on the relative positional relationship between the feature points. In this embodiment, the feature points are corrected based on the ratio of the distances between the feature points and the straight line (body axis) obtained from the feature points. In this way, by using prior knowledge about the structure of the object, it is expected that the positions of the feature points can be corrected more accurately.

＜実施形態１の変形例＞ <Modification of the First Embodiment>
ステップ４０２で抽出する特徴点は、頭頂、首、腰、右足首、左足首に限らず、手首、肘、膝など、他の部位でも構わない。また、必ずしも体の部位上でなくてもよく、右足首と左足首の中間点や体軸と左足首・右足首を結ぶ線の交点など、体の部位の位置関係から決まる他の点でも構わない。 The feature points extracted in step 402 are not limited to the top of the head, neck, waist, right ankle, and left ankle, and may be other parts such as the wrists, elbows, and knees. They also do not necessarily have to lie on a body part, and may be other points determined from the positional relationship of body parts, such as the midpoint between the right ankle and the left ankle, or the intersection of the body axis with the line connecting the left ankle and the right ankle.

ステップ６０４で、過去フレームでの頭と腰の距離から現在のフレームにおける腰の位置を補正したが、他の方法でも構わない。過去のフレームでの頭と腰の位置座標の差異から、現在フレームの腰の位置を補正しても構わない。例えば、過去フレームでの頭と腰の位置座標の差異として、腰のｘ座標・ｙ座標は、頭のｘ座標・ｙ座標よりそれぞれＸピクセル、Ｙピクセル大きいとする。この過去フレームでの頭と腰との位置座標の差異と等しくなるように、現在のフレームにおいて腰の位置を補正しても構わない。また、頭と腰の位置座標の差異の代わりに、首と腰の位置座標の差異を用いても構わない。 In step 604, the position of the waist in the current frame is corrected from the distance between the head and waist in a past frame, but other methods may be used. The position of the waist in the current frame may instead be corrected from the difference between the position coordinates of the head and waist in a past frame. For example, suppose that in the past frame the x and y coordinates of the waist are larger than those of the head by X pixels and Y pixels, respectively. The position of the waist in the current frame may then be corrected so that its offset from the head equals this difference. The difference between the position coordinates of the neck and waist may also be used instead of that of the head and waist.
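As a non-limiting illustration, the offset-based variant described above might be sketched as follows: the head-to-waist coordinate difference (X, Y) measured in a past frame is re-applied to the head position of the current frame. The function name and the `(x, y)` tuple representation are assumptions of this sketch.

```python
from typing import Tuple

Point = Tuple[float, float]  # (x, y) in image pixels


def correct_waist_from_offset(
    head_now: Point, head_prev: Point, waist_prev: Point
) -> Point:
    """Place the current waist so that its offset from the current head
    equals the head-to-waist offset observed in the past frame."""
    dx = waist_prev[0] - head_prev[0]  # X pixels in the text
    dy = waist_prev[1] - head_prev[1]  # Y pixels in the text
    return (head_now[0] + dx, head_now[1] + dy)


# Past frame: waist is 4 px right of and 80 px below the head.
corrected = correct_waist_from_offset(
    head_now=(120.0, 60.0), head_prev=(100.0, 50.0), waist_prev=(104.0, 130.0)
)
```

Passing neck coordinates in place of the head coordinates gives the neck-based variant mentioned in the same paragraph.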

ステップ６０７では、人体の首と腰の距離と首と右足首（または左足首）の距離の比を用いたが、これに限らず、他の特徴点間の比を用いても構わない。一例として、頭と腰の距離と頭と右足首（または左足首）の距離の比のように、首の代わりに頭を用いてもよい。他の例として、頭と首の距離と腰と右足首（または左足首）の距離の比を用いてもよい。ステップ６０８も同様である。 In step 607, the ratio of the distance between the neck and waist of the human body and the distance between the neck and the right ankle (or the left ankle) is used, but this is not limiting and other ratios between feature points may be used. As one example, the head may be used instead of the neck, such as the ratio of the distance between the head and waist and the distance between the head and the right ankle (or the left ankle). As another example, the ratio of the distance between the head and neck and the distance between the waist and the right ankle (or the left ankle) may be used. The same applies to step 608.

ステップ６０７では、右足首と左足首が体軸の上になるように補正した。これに限らず、特徴点間の比があらかじめ定めたものとなるように、右足首（または左足首）を体軸方向に移動させることで補正しても構わない。ステップ６０８も同様である。 In step 607, the right ankle and the left ankle are corrected so that they are on the body axis. This is not limiting, and correction may be made by moving the right ankle (or the left ankle) in the direction of the body axis so that the ratio between the feature points becomes a predetermined value. The same is true for step 608.
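As a non-limiting illustration, one possible reading of this variant is sketched below: the ankle is shifted parallel to the body axis until the ratio of the neck-to-waist distance to the neck-to-ankle distance equals a predetermined value. Taking the body axis as the neck-to-waist line, keeping the ankle's sideways offset from that axis unchanged, and the function name are all assumptions of this sketch rather than details stated in the embodiment.

```python
import math
from typing import Tuple

Point = Tuple[float, float]  # (x, y) in image pixels


def move_ankle_along_axis(
    neck: Point, waist: Point, ankle: Point, target_ratio: float
) -> Point:
    """Shift the ankle parallel to the neck->waist axis so that
    dist(neck, waist) / dist(neck, ankle) equals target_ratio."""
    ax, ay = waist[0] - neck[0], waist[1] - neck[1]
    axis_len = math.hypot(ax, ay)
    if axis_len == 0.0 or target_ratio <= 0.0:
        raise ValueError("need distinct neck/waist points and a positive ratio")
    ux, uy = ax / axis_len, ay / axis_len  # unit vector along the body axis

    # Split the neck->ankle vector into along-axis and perpendicular parts.
    wx, wy = ankle[0] - neck[0], ankle[1] - neck[1]
    along = wx * ux + wy * uy
    px, py = wx - along * ux, wy - along * uy  # perpendicular part is kept

    target_len = axis_len / target_ratio  # required neck-to-ankle distance
    perp_sq = px * px + py * py
    if target_len ** 2 < perp_sq:
        raise ValueError("ankle is too far from the axis for this ratio")
    new_along = math.sqrt(target_len ** 2 - perp_sq)
    return (neck[0] + px + new_along * ux, neck[1] + py + new_along * uy)


# Neck-to-waist distance 5, target ratio 0.5 -> neck-to-ankle distance 10.
moved = move_ankle_along_axis((0.0, 0.0), (0.0, 5.0), (6.0, 3.0), 0.5)
```

The positive square root places the corrected ankle on the same side of the neck as the waist, which matches an upright or inverted figure alike because the direction is taken from the detected points.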

領域決定部１０６では、部分画像領域を矩形としたが、他の形状でも構わない。例えば、多角形でもいいし、曲線に囲まれていてもよい。図形ではなく、物体領域とその他の領域を区別するマスク画像でもよい。 In the region determination unit 106, the partial image region is rectangular, but other shapes are also acceptable. For example, it may be polygonal, or may be surrounded by curves. Instead of a figure, a mask image that distinguishes the object region from other regions may also be acceptable.

実施形態１のニューラルネットワークの構造はこれに限定されない。例えば、サブネットワークの間に別のサブネットワークが挿入されてもいい。また、ネットワークの分岐構造が異なっていても構わない。サブネットワークの構成について、コンボリューション層やプーリング層、全結合層などの構成要素の種類や数が異なっていても構わない。 The structure of the neural network in the first embodiment is not limited to this. For example, another sub-network may be inserted between the sub-networks. The branching structure of the network may be different. The configuration of the sub-network may have different types and numbers of components such as convolution layers, pooling layers, and fully connected layers.

図１２の統合サブネットワーク１２０８では２つのベクトルを結合することで２つのベクトルを統合したが、他の演算方法を用いても構わない。例えば、２つのベクトルのサイズが同じであれば、ベクトルの要素同士の乗算や加算を代わりに用いても構わない。 In the integration subnetwork 1208 of FIG. 12, the two vectors are integrated by concatenating them, but other operations may be used. For example, if the two vectors are the same size, element-wise multiplication or addition of the vectors may be used instead.
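As a non-limiting illustration, the three integration options mentioned above (joining the two vectors end to end, or element-wise multiplication or addition when the sizes match) might be sketched as follows. The function name `integrate` and the mode strings are assumptions of this sketch; in the embodiment the operation is part of the neural network rather than a standalone function.

```python
from typing import List, Sequence


def integrate(a: Sequence[float], b: Sequence[float], mode: str = "concat") -> List[float]:
    """Integrate two vectors by concatenation, or by element-wise
    multiplication or addition when they have the same size."""
    if mode == "concat":
        return list(a) + list(b)
    if mode not in ("multiply", "add"):
        raise ValueError(f"unknown mode: {mode}")
    if len(a) != len(b):
        raise ValueError("element-wise integration needs equal-size vectors")
    if mode == "multiply":
        return [x * y for x, y in zip(a, b)]
    return [x + y for x, y in zip(a, b)]


image_vec = [1.0, 2.0]        # e.g. output of the image branch
reliability_vec = [3.0, 4.0]  # e.g. output of the reliability branch
joined = integrate(image_vec, reliability_vec)
scaled = integrate(image_vec, reliability_vec, mode="multiply")
summed = integrate(image_vec, reliability_vec, mode="add")
```

Concatenation doubles the vector length and works for any sizes, whereas the element-wise options keep the length unchanged but require matching sizes.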

図２の信頼度変換部２０５を図１２のように信頼度変換サブネットワーク１２０７として実施しているが、信頼度変換部２０５はニューラルネットワークの外部に設けても構わない。例えば、特徴点の信頼度に正規化処理や変換処理などの処理をニューラルネットワークの外部で行い、その処理結果をニューラルネットワークの入力の１つとしても構わない。 The reliability conversion unit 205 in FIG. 2 is implemented as a reliability conversion sub-network 1207 as in FIG. 12, but the reliability conversion unit 205 may be provided outside the neural network. For example, normalization processing, conversion processing, and other processing may be performed on the reliability of feature points outside the neural network, and the processing result may be one of the inputs to the neural network.

図４のステップ４０３およびステップ特徴点を補正において、現在のフレームと１つ前のフレームから補正に用いる特徴点群の選択や特徴点の補正を行った。１つ前のフレームだけでなく、それ以前のフレームを用いて特徴点群の選択や特徴点の補正を行っても構わない。さらに、現在のフレームと合わせ、３フレーム以上のフレームを用いても構わない。 In step 403 and the feature point correction step (step 404) in FIG. 4, the selection of the feature point group used for correction and the correction of the feature points are performed using the current frame and the immediately preceding frame. Frames earlier than the immediately preceding frame may also be used for the selection and the correction. Furthermore, three or more frames, including the current frame, may be used.

画像特徴抽出部１０８をニューラルネットで構成したが、ニューラルネット以外の方法を用いても構わない。例えば、ＨＯＧ（ＨｉｓｔｏｇｒａｍｏｆＯｒｉｅｎｔｅｄＧｒａｄｉｅｎｔｓ）特徴やＬＢＰ（ＬｏｃａｌＢｉｎａｒｙＰａｔｔｅｒｎ）特徴を抽出して、これを基に画像特徴を決定してもいい。他には、ＨＯＧ特徴やＬＢＰ特徴からパーツ推定を行ってもいい。 Although the image feature extraction unit 108 is configured with a neural network, methods other than a neural network may be used. For example, HOG (Histogram of Oriented Gradients) features or LBP (Local Binary Pattern) features may be extracted and image features may be determined based on these. Alternatively, parts may be estimated from HOG features or LBP features.

図６のステップ６０３で頭と首から図７の直線７０６を計算したが、頭または首のみから直線を計算しても構わない。例えば、人物の体軸が画像フレームのｙ軸と平行であると仮定できる場合には、直線は画像フレームのｙ軸に平行であると仮定することができ、首または頭のどちらか１点から直線を計算できる。同様に、図４のステップ４０５でも複数点から図９の直線９０１を計算しているが、１点から計算しても構わない。 In step 603 of Figure 6, straight line 706 in Figure 7 was calculated from the head and neck, but it is also possible to calculate the straight line from just the head or neck. For example, if it can be assumed that the axis of a person's body is parallel to the y-axis of the image frame, then it can be assumed that the straight line is parallel to the y-axis of the image frame, and the straight line can be calculated from a single point on either the neck or the head. Similarly, in step 405 of Figure 4, straight line 901 in Figure 9 was calculated from multiple points, but it is also possible to calculate it from a single point.

図１０のＳ１００１では、元の信頼度に１より小さいあらかじめ定めた実数値を乗じた値を補正後の信頼度としたが、他の方法でも構わない。信頼度の更新方法はこれに限らず、信頼度を０としてもいいし、信頼度からあらかじめ定めた実数値を減じてもいいし、他の方法を用いても構わない。 In S1001 of FIG. 10, the corrected reliability is determined by multiplying the original reliability by a predetermined real value smaller than 1, but other methods may be used. The method of updating the reliability is not limited to this, and the reliability may be set to 0, a predetermined real value may be subtracted from the reliability, or other methods may be used.
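As a non-limiting illustration, the three update rules mentioned above for feature points outside the partial image region (multiplying by a value smaller than 1, setting to 0, or subtracting a fixed value) might be sketched as follows. The rectangular region format, the `(x, y, reliability)` tuple layout, the function names, and the clamping of the subtracted result at 0 are assumptions of this sketch.

```python
from typing import List, Tuple

FeaturePoint = Tuple[float, float, float]    # (x, y, reliability)
Region = Tuple[float, float, float, float]   # (left, top, right, bottom)


def is_inside(point: FeaturePoint, region: Region) -> bool:
    """True if the point lies within the rectangle, borders included."""
    x, y, _ = point
    left, top, right, bottom = region
    return left <= x <= right and top <= y <= bottom


def reduce_outside_reliability(
    points: List[FeaturePoint],
    region: Region,
    mode: str = "scale",
    amount: float = 0.5,
) -> List[FeaturePoint]:
    """Lower the reliability of every feature point outside the region."""
    if mode not in ("scale", "zero", "subtract"):
        raise ValueError(f"unknown mode: {mode}")
    updated = []
    for x, y, reliability in points:
        if not is_inside((x, y, reliability), region):
            if mode == "scale":
                reliability = reliability * amount  # amount < 1
            elif mode == "zero":
                reliability = 0.0
            else:
                # Clamping at 0 is an assumption of this sketch.
                reliability = max(0.0, reliability - amount)
        updated.append((x, y, reliability))
    return updated


points = [(10.0, 10.0, 0.75), (50.0, 10.0, 0.75)]  # second point is outside
region = (0.0, 0.0, 20.0, 20.0)
scaled = reduce_outside_reliability(points, region, mode="scale", amount=0.5)
```

Points inside the region are returned unchanged in every mode, so the function can be applied to the full set of detected feature points.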

以上のように、実施形態１で説明した処理によって、人物の一部が他の物体に遮蔽された状況においても、適切に人物の照合が行える。 As described above, the process described in embodiment 1 allows appropriate matching of a person even in a situation where part of the person is occluded by another object.

＜実施形態２＞ <Embodiment 2>
実施形態１では人物の全身を画像処理の対象としたが、代わりに顔を画像処理の対象にしても構わない。実施形態２では実施形態１との差分のみ説明する。 In the first embodiment, the whole body of a person is the subject of image processing, but the face may be the subject of image processing instead. In the second embodiment, only the differences from the first embodiment will be described.

顔を対象とする場合、図４のステップ４０２では顔特徴点を検出する。図１５に図示する。ここでは、右目１５０１、左目１５０２、鼻１５０３、口の右端１５０４、口の左端１５０５を特徴点として検出するとする。 When the target is a face, facial feature points are detected in step 402 in FIG. 4. This is illustrated in FIG. 15. Here, the right eye 1501, left eye 1502, nose 1503, right edge of the mouth 1504, and left edge of the mouth 1505 are detected as feature points.

実施形態２においては、ステップ４０３、４０４において、右目の特徴点を鼻と口から補正するケースを考える。左目については、右目と同様の処理である。 In the second embodiment, we consider a case where the feature points of the right eye are corrected from the nose and mouth in steps 403 and 404. The left eye is processed in the same way as the right eye.

ステップ４０３の処理を説明する。まず右目の特徴点の信頼度を評価する。信頼度がしきい値以上の場合は特徴点群Ｃ１を選択する。信頼度がしきい値より小さい場合は、過去のフレームでの右目の信頼度がしきい値以上でなかったら特徴点群Ｃ２を選択し、しきい値以上だったら特徴点群Ｃ３を選択する。 The process of step 403 will now be described. First, the reliability of the feature points of the right eye is evaluated. If the reliability is equal to or greater than a threshold, feature point group C1 is selected. If the reliability is less than the threshold, and the reliability of the right eye in the previous frame was not equal to or greater than the threshold, feature point group C2 is selected, and if it is equal to or greater than the threshold, feature point group C3 is selected.
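As a non-limiting illustration, the selection rule of step 403 for the right eye might be sketched as follows. The function name, the string labels, and the example threshold of 0.5 are assumptions of this sketch; the embodiment does not give a threshold value.

```python
def select_feature_group(
    current_reliability: float, past_reliability: float, threshold: float
) -> str:
    """Choose the feature point group used to correct the right eye.

    C1: current detection is reliable, so no correction is needed.
    C3: current detection is unreliable but the past frame was reliable.
    C2: neither the current nor the past detection is reliable.
    """
    if current_reliability >= threshold:
        return "C1"
    if past_reliability >= threshold:
        return "C3"
    return "C2"


# Example with an assumed threshold of 0.5.
group_now_ok = select_feature_group(0.9, 0.1, 0.5)     # "C1"
group_past_ok = select_feature_group(0.2, 0.8, 0.5)    # "C3"
group_neither = select_feature_group(0.2, 0.3, 0.5)    # "C2"
```

The current reliability is checked first, so the past frame is consulted only when the current detection falls below the threshold.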

ステップ４０４の処理を説明する。補正に用いる特徴点群が特徴点群Ｃ１であったら、右目の位置を補正しない。特徴点群Ｃ２であったら、現在フレームの鼻と口の右端と口の左端の位置関係から、平均的な人物の顔のパーツの配置に近くなるように、現在フレームの右目の位置を補正する。特徴点群Ｃ３であったら、過去のフレームの右目、鼻、口の右端、口の左端の配置に近くなるように、現在フレームの右目の位置を補正する。 The processing of step 404 will now be described. If the feature point group used for correction is feature point group C1, the position of the right eye is not corrected. If it is feature point group C2, the position of the right eye in the current frame is corrected, based on the positional relationship among the nose, the right edge of the mouth, and the left edge of the mouth in the current frame, so that it approaches the arrangement of facial parts of an average person. If it is feature point group C3, the position of the right eye in the current frame is corrected so that it approaches the arrangement of the right eye, nose, right edge of the mouth, and left edge of the mouth in the past frame.

他のステップの処理も全身から抽出する特徴点を顔の特徴点に置き換えれば、実施形態１の処理と同様である。 The processing in the other steps is similar to that in embodiment 1 if the feature points extracted from the whole body are replaced with facial feature points.

実施形態２では顔特徴点を右目、左目、鼻、口の右端、口の左端としたが、目じり、目頭、瞳、鼻の右端、鼻の下端、眉毛、顔の輪郭など、他の部分を特徴点としても構わない。そして、ステップ４０３やステップ４０４の処理をそれに合わせて変更しても構わない。 In the second embodiment, the facial feature points are the right eye, the left eye, the nose, the right edge of the mouth, and the left edge of the mouth, but other parts such as the corners of the eyes, the inner corners of the eyes, the pupils, the right edge of the nose, the bottom edge of the nose, the eyebrows, and the facial contour may be used as feature points. The processing in steps 403 and 404 may be changed accordingly.

実施形態２によれば、画像フレームからの顔画像の切り出しや顔認識の性能を向上させる効果が見込める。例えば、顔がサングラスやマスクなどのアクセサリで一部分が覆われているケースや、手などで一時的に顔の一部が隠れるケースにおいて有効である。 According to the second embodiment, it is expected that the performance of cutting out a face image from an image frame and of face recognition can be improved. For example, this is effective in cases where the face is partially covered by an accessory such as sunglasses or a mask, or where part of the face is temporarily hidden by a hand or the like.

本発明は、以下の処理を実行することによっても実現される。即ち、上述した実施形態の機能を実現するソフトウェア（プログラム）を、データ通信用のネットワーク又は各種記憶媒体を介してシステム或いは装置に供給する。そして、そのシステム或いは装置のコンピュータ（またはＣＰＵやＭＰＵ等）がプログラムを読み出して実行する処理である。また、そのプログラムをコンピュータが読み取り可能な記録媒体に記録して提供してもよい。 The present invention can also be realized by executing the following process. That is, software (programs) that realize the functions of the above-described embodiments are supplied to a system or device via a data communication network or various storage media. The computer (or CPU, MPU, etc.) of the system or device then reads and executes the program. The program may also be provided by recording it on a computer-readable recording medium.

Reference Signs List
１０１画像取得部 101 Image acquisition unit
１０２第１の検出部 102 First detection unit
１０３特徴群決定部 103 Feature group determination unit
１０４第２の検出部 104 Second detection unit
１０５特徴点記憶部 105 Feature point storage unit
１０６領域決定部 106 Area determination unit
１０７画像抽出部 107 Image extraction unit
１０８画像特徴抽出部 108 Image feature extraction unit
１０９認識部 109 Recognition unit
１１０表示部 110 Display unit
１１１学習部 111 Learning unit
１１２物体記憶部 112 Object storage unit

Claims

1. An image processing device comprising:
an acquisition means for acquiring, from an image of a person, a plurality of feature points corresponding to a plurality of body parts of the person and a reliability indicating a likelihood that each of the feature points corresponds to the body part;
an extraction means for extracting, as a matching feature amount to be compared with a feature amount of a person registered in advance, a feature amount obtained by weighting and integrating the feature amounts for the plurality of body parts corresponding to the plurality of feature points in accordance with the reliability acquired by the acquisition means; and
a recognition means for determining that the person in the image is the same person as the person registered in advance when the feature amount of the person registered in advance matches the matching feature amount extracted by the extraction means.

2. The image processing device according to claim 1, wherein the extraction means extracts the matching feature amount using a neural network that receives as input a partial image including the person and the reliability, and outputs the matching feature amount.

3. The image processing device according to claim 1 or 2, wherein one or more of the plurality of feature points correspond to positions of joints of a person.

4. The image processing apparatus according to claim 1, wherein one or more of the plurality of feature points correspond to positions of facial features of a person.

5. The image processing device according to claim 1, further comprising a control means, wherein
the recognition means compares the matching feature amount extracted by the extraction means with feature amounts of a plurality of people registered in advance, and
the control means causes a display unit to display information representing the person, among the plurality of people, whose feature amount has the shortest distance from the matching feature amount.

6. The image processing device according to claim 5, wherein the control means identifies a predetermined number of people from among the plurality of people in ascending order of distance from the matching feature amount, and causes information representing the predetermined number of people to be displayed on the display unit.

7. A program for causing a computer to function as each of the means included in the image processing device according to any one of claims 1 to 6.

8. An image processing method comprising:
an acquisition step of acquiring, from an image of a person, a plurality of feature points corresponding to a plurality of body parts of the person and a reliability indicating a likelihood that each of the feature points corresponds to the body part;
an extraction step of extracting, as a matching feature amount to be compared with a feature amount of a person registered in advance, a feature amount obtained by weighting and integrating the feature amounts for the plurality of body parts corresponding to the plurality of feature points in accordance with the reliability acquired in the acquisition step; and
a recognition step of determining that the person in the image is the same person as the person registered in advance when the feature amount of the person registered in advance matches the matching feature amount extracted in the extraction step.

9. The image processing method according to claim 8, wherein the extraction step extracts the matching feature amount using a neural network that receives as input a partial image including the person and the reliability, and outputs the matching feature amount.