
JP7300895B2 - Image processing device, image processing method, program, and storage medium

Info

Publication number: JP7300895B2
Application number: JP2019100724A
Authority: JP
Inventors: 知小松
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2019-05-29
Filing date: 2019-05-29
Publication date: 2023-06-30
Anticipated expiration: 2039-05-29
Also published as: JP2020194454A

Description

The present invention relates to a technique for estimating attribute information of an object in an image.

In wildlife observation and livestock growth management, it is sometimes difficult to measure an object's dimensions (volume) or mass directly, so there is a need to acquire attribute information about animals without contact.

Known methods for non-contact measurement of an object's surface shape include pattern projection, multi-view imaging, and TOF (Time of Flight). Patent Document 1 proposes a method for estimating volume from a surface shape measured without contact. As a method for estimating the volume of agricultural produce, Patent Document 1 measures the surface shape with the light-section method, treats the hidden side of the produce as solid volume down to the ground-contact surface, and then applies a correction coefficient to obtain an approximate volume.

Unlike stationary objects such as agricultural produce, animals move, and their movement is difficult to control. Acquiring attribute information such as an animal's dimensions therefore requires knowing the animal's posture. Patent Document 2 proposes a method of estimating the posture of a target object by extracting feature points from an image captured with an imaging device and comparing them with feature values acquired in advance through learning.

Patent Document 1: Japanese Patent No. 5686058
Patent Document 2: Japanese Patent No. 4449410

Patent Document 1 neither measures nor estimates the shape of the back side of the object being measured. An approximate volume can be estimated, but because the three-dimensional shape of the target object cannot be obtained, the accuracy of the volume and mass estimates is reduced.

In wildlife observation and livestock growth management, when the target is an animal it is difficult to control the target's posture, and when the overall length of the animal is estimated from a captured image, different postures yield different estimates. Patent Document 2 recognizes a target object by estimating its posture through comparison with feature points learned in advance for recognition, but it does not estimate the target object's attribute information (dimensions, shape, volume, mass, and so on).

The present invention has been made in view of the above problems, and its object is to realize a technique capable of estimating attribute information of an object in an image.

To solve the above problems and achieve the object, an image processing apparatus of the present invention includes: depth generation means for generating, from a captured image of a subject, depth information indicating the distance distribution of the subject in the depth direction; object detection means for detecting a region of a specific object from the captured image; posture estimation means for estimating the posture of the specific object; posture conversion means for converting the posture of the specific object in the captured image and the depth information into a specific posture suitable for estimating attribute information of the specific object; and attribute information estimation means for estimating the attribute information of the specific object from the posture-converted captured image and depth information and from the imaging conditions of the image.

According to the present invention, attribute information of an object in an image can be estimated.

FIG. 1: (a) Block diagram showing the device configuration of Embodiment 1, (b) diagram showing the pixel array of the image sensor, and (c) schematic diagram showing the cross-sectional structure of the image sensor.
FIG. 2: Schematic diagrams explaining the relationship among the image sensor, the optical system, and the image in Embodiment 1.
FIG. 3: Flowchart showing the attribute information estimation processing of Embodiment 1.
FIG. 4: Diagram showing an example of the target object selection screen of Embodiment 1.
FIG. 5: Diagram explaining an example of the posture estimation processing of Embodiment 1.
FIG. 6: Diagram showing an example of the posture conversion processing and the dimension measurement positions of Embodiment 1.
FIG. 7: Diagram explaining an example of a method for inputting dimension measurement positions in Embodiment 1.
FIG. 8: Flowchart showing the attribute information estimation processing of Embodiment 2.
FIG. 9: Flowchart showing the posture estimation and conversion processing of Embodiment 2.

Embodiments are described in detail below with reference to the accompanying drawings. The following embodiments do not limit the invention as set forth in the claims. Although the embodiments describe multiple features, not all of them are essential to the invention, and the features may be combined in any manner. In the accompanying drawings, identical or similar components are given the same reference numbers, and redundant description is omitted.

[Embodiment 1] Embodiment 1 is described below.
The following describes an example embodiment in which the present invention is applied to a digital camera, as an example of an image processing apparatus, that can acquire depth information indicating the distance distribution of a subject. The present invention is, however, applicable to any device capable of estimating an object's attribute information (dimensions, shape, volume, mass, and so on) from a captured image, the depth information corresponding to the captured image, and the imaging conditions of the image.

<Configuration of Digital Camera>
First, the configuration and functions of the digital camera 100 of this embodiment are described with reference to FIG. 1.

The imaging optical system 10 is the photographing lens of the digital camera 100 and forms an optical image of the subject on the image sensor 11. The imaging optical system 10 consists of multiple lenses (not shown) arranged along an optical axis 102 and has an exit pupil 101 at a predetermined distance from the image sensor 11. In this specification, the direction parallel to the optical axis 102 is the z direction or depth direction; the direction orthogonal to the optical axis 102 and parallel to the horizontal direction of the image sensor 11 is the x direction; and the direction parallel to the vertical direction of the image sensor 11 is the y direction. Axes are defined accordingly.

The image sensor 11 is, for example, a CCD (charge-coupled device) or CMOS (complementary metal-oxide-semiconductor) sensor. The image sensor 11 photoelectrically converts the subject image formed on its imaging surface via the imaging optical system 10 and outputs an image signal for that subject image. In this embodiment, the image sensor 11 also has a ranging function based on the imaging-plane phase-difference ranging method, described later, and can generate and output, in addition to the captured image, distance information indicating the distance from the imaging apparatus to the subject (the subject distance).

The control unit 12 is a control device such as a CPU or microprocessor and controls the operation of each block of the digital camera 100. For example, the control unit 12 performs autofocus (AF) during imaging, changes the focus position, changes the F-number (aperture), captures images, and controls the storage unit 14, the input unit 15, the display unit 16, and the communication unit 17.

The image processing device 13 is a block that implements the various image processing functions of the digital camera 100. As illustrated, the image processing device 13 has the following image processing blocks: an image generation unit 130, a depth generation unit 131, an object detection unit 132, a posture estimation unit 133, a posture conversion unit 134, and an attribute information estimation unit 135. It also has a memory 136 used as a work area for image processing. The image processing device 13 can be built from logic circuits. Alternatively, it may consist of a central processing unit (CPU) and a memory that stores an arithmetic processing program.

The image generation unit 130 performs various signal processing on the image signal output from the image sensor 11, such as noise removal, demosaicing, luminance signal conversion, aberration correction, white balance adjustment, and color correction. The image data (captured image) output from the image generation unit 130 is stored in the memory 136 and used by the object detection unit 132 and the display unit 16.

The depth generation unit 131 generates a depth image representing the distribution of depth information, based on signals obtained from the ranging pixels of the image sensor 11, described later. The depth image is two-dimensional information in which the value stored at each pixel is the subject distance of the subject present in the region of the captured image corresponding to that pixel.

Using the captured image generated by the image generation unit 130, the object detection unit 132 detects the object in that image that has been designated in advance as the measurement target and determines its position and size in the captured image. If the type of object to be measured has not been designated in advance, the object detection unit 132 identifies the type. In this embodiment, the target object is assumed to be a non-human animal.

The posture estimation unit 133 estimates the posture of the target object in the object region detected by the object detection unit 132, using information acquired in advance through learning and stored in the storage unit 14.

For the target object whose posture was estimated by the posture estimation unit 133, the posture conversion unit 134 converts the posture of the target object in the viewing image and the depth image into a specific posture suitable for estimating attribute information. The specific posture differs by target object and is determined from posture information stored in advance in the storage unit 14 and/or the memory 136, based on the object type designated in advance or the object information identified by the object detection unit 132.

From the viewing image and the depth image in which the target object's posture has been converted to the specific posture by the posture conversion unit 134, the attribute information estimation unit 135 estimates at least one of dimensions, shape, volume, and mass as attribute information of the target object. In dimension estimation, the positions at which dimensions are measured differ by target object. The estimate is therefore made using dimension-measurement information stored in advance in the storage unit 14 and/or the memory 136, based on the object type designated in advance or identified by the object detection unit 132. In shape estimation, the surface shape is obtained from the posture-converted depth image, and the estimate is made by referring to three-dimensional shape data of the target object stored in advance in the storage unit 14 and/or the memory 136, according to the object type identified by the object detection unit 132. In volume estimation, the estimate is made from the three-dimensional shape of the target object obtained by shape estimation and the imaging parameters. In mass estimation, the estimate is made from the volume obtained by volume estimation and density information for the target object. The density information is measured in advance for each object and stored in the storage unit 14.
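The mass estimation step described above amounts to multiplying the estimated volume by a stored density for the object type. The following is a minimal Python sketch under stated assumptions: the table `DENSITY_KG_PER_M3`, its keys and values, the fallback density, and the function name `estimate_mass_kg` are all hypothetical placeholders for illustration and do not come from the source.

```python
# Hypothetical density table (kg per cubic metre). The values are
# placeholders for illustration, not measurements from the source.
DENSITY_KG_PER_M3 = {"cattle": 1050.0, "pig": 1000.0}

# Assumed generic density used when the object type has no stored entry.
DEFAULT_DENSITY_KG_PER_M3 = 1000.0


def estimate_mass_kg(volume_m3: float, object_type: str) -> float:
    """Return mass = volume x density for the given object type."""
    if volume_m3 < 0:
        raise ValueError("volume must be non-negative")
    density = DENSITY_KG_PER_M3.get(object_type, DEFAULT_DENSITY_KG_PER_M3)
    return volume_m3 * density
```

Keeping the density lookup separate from the volume estimate mirrors the description, in which density is measured in advance per object and only read at estimation time.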

The storage unit 14 is a non-volatile recording medium that records captured image data, intermediate data generated during the operation of each block, parameters referenced in the operation of the image processing device 13 and the digital camera 100, and the like. The storage unit 14 may be any recording medium that can be read and written at high speed and has a large capacity, provided it ensures the processing performance acceptable for the processing; flash memory, for example, is preferable.

The input unit 15 is a user interface, such as a dial, button, switch, or touch panel, that detects information input and setting-change operations performed on the digital camera 100. When the input unit 15 detects an operation input, it outputs a corresponding control signal to the control unit 12.

The display unit 16 is a display device such as a liquid crystal display or an organic EL display. It is used to check composition during shooting through live-view display of the captured image, and to present various setting screens and messages. In this embodiment, the display unit 16 also displays object detection results and estimation results such as shape, volume, and mass.

The communication unit 17 is a communication interface of the digital camera 100 that transmits information to and receives information from external devices. The communication unit 17 may be configured to send the captured image, the depth information, the estimated attribute information of the subject (dimensions, shape, volume, mass), and the like to other devices.

<Configuration of Image Sensor>
Next, the detailed configuration of the image sensor 11 of this embodiment is described with reference to FIGS. 1(b) and 1(c).

As shown in FIG. 1(b), the image sensor 11 is formed by connecting and arraying multiple 2-row × 2-column pixel groups 110 to which different color filters are applied. As shown in the enlarged view, red (R), green (G), and blue (B) color filters are arranged in each pixel group 110, and each pixel (photoelectric conversion element) outputs an image signal carrying R, G, or B color information. This embodiment is described with the color filters distributed as illustrated, as one example, but it will be readily understood that implementation of the present invention is not limited to this arrangement.

To realize the ranging function of the imaging-plane phase-difference ranging method, each pixel (photoelectric conversion element) of the image sensor 11 of this embodiment has multiple photoelectric conversion units arranged side by side in the I-I' cross section of FIG. 1(b), which runs in the horizontal direction of the image sensor 11. More specifically, as shown in FIG. 1(c), each pixel consists of a light guide layer 113, which includes a microlens 111 and a color filter 112, and a light-receiving layer 114, which includes a first photoelectric conversion unit 115 and a second photoelectric conversion unit 116.

In the light guide layer 113, the microlens 111 is configured to guide the light flux incident on the pixel efficiently to the first photoelectric conversion unit 115 and the second photoelectric conversion unit 116. The color filter 112 passes light in a predetermined wavelength band: it passes only light in one of the R, G, and B wavelength bands described above and guides it to the first photoelectric conversion unit 115 and the second photoelectric conversion unit 116 behind it.

The light-receiving layer 114 contains two photoelectric conversion units (the first photoelectric conversion unit 115 and the second photoelectric conversion unit 116) that convert received light into analog image signals, and the two kinds of signals output from these units are used for ranging. That is, every pixel of the image sensor 11 has two photoelectric conversion units arranged side by side in the horizontal direction, and two image signals are used: one composed of the signals output from the first photoelectric conversion units 115 of all pixels, and one composed of the signals output from the second photoelectric conversion units 116. In other words, the first photoelectric conversion unit 115 and the second photoelectric conversion unit 116 each receive part of the light flux that enters the pixel through the microlens 111. The two image signals finally obtained are therefore a group of pupil-divided images formed by light fluxes that passed through different regions of the exit pupil of the imaging optical system 10. The signal obtained by combining, for each pixel, the image signals photoelectrically converted by the first photoelectric conversion unit 115 and the second photoelectric conversion unit 116 is equivalent to the image signal (for viewing) that would be output by the single photoelectric conversion unit of a pixel having only one such unit.

With this structure, the image sensor 11 of this embodiment can output both a viewing image signal and ranging image signals (the two pupil-divided images). This embodiment is described on the assumption that every pixel of the image sensor 11 has two photoelectric conversion units so that dense depth information can be output, but implementation of the present invention is not limited to this.

<Ranging Principle of the Imaging-Plane Phase-Difference Ranging Method>
The principle by which the digital camera 100 of this embodiment derives the subject distance from the group of pupil-divided images output from the first photoelectric conversion units 115 and the second photoelectric conversion units 116 is described here with reference to FIG. 2.

FIG. 2(a) is a schematic diagram showing the exit pupil 101 of the imaging optical system 10 and the light flux received by the first photoelectric conversion unit 115 of a pixel in the image sensor 11. FIG. 2(b) is a similar schematic diagram showing the light flux received by the second photoelectric conversion unit 116.

The microlens 111 shown in FIGS. 2(a) and 2(b) is arranged so that the exit pupil 101 and the light-receiving layer 114 are optically conjugate. A light flux that has passed through the exit pupil 101 of the imaging optical system 10 is condensed by the microlens 111 and guided to the first photoelectric conversion unit 115 or the second photoelectric conversion unit 116. As shown in FIGS. 2(a) and 2(b), the first photoelectric conversion unit 115 and the second photoelectric conversion unit 116 mainly receive light fluxes that have passed through different pupil regions: the first photoelectric conversion unit 115 receives the flux that passed through the first pupil region 210, and the second photoelectric conversion unit 116 receives the flux that passed through the second pupil region 220.

The multiple first photoelectric conversion units 115 of the image sensor 11 mainly receive the light flux that passed through the first pupil region 210 and output a first image signal. At the same time, the multiple second photoelectric conversion units 116 of the image sensor 11 mainly receive the light flux that passed through the second pupil region 220 and output a second image signal. The first image signal gives the intensity distribution of the image formed on the image sensor 11 by the flux that passed through the first pupil region 210, and the second image signal gives the intensity distribution of the image formed by the flux that passed through the second pupil region 220.

The relative positional shift between the first image signal and the second image signal (the so-called parallax amount) takes a value that depends on the defocus amount. The relationship between the parallax amount and the defocus amount is described with reference to FIGS. 2(c), 2(d), and 2(e), which are schematic diagrams of the image sensor 11 and the imaging optical system 10 of this embodiment. In these figures, reference numeral 211 denotes the first light flux, which passes through the first pupil region 210, and reference numeral 221 denotes the second light flux, which passes through the second pupil region 220.

FIG. 2(c) shows the in-focus state, in which the first light flux 211 and the second light flux 221 converge on the image sensor 11. In this state, the parallax amount between the first image signal formed by the first light flux 211 and the second image signal formed by the second light flux 221 is 0. FIG. 2(d) shows a state defocused in the negative z direction on the image side. Here the parallax amount between the first image signal formed by the first light flux and the second image signal formed by the second light flux is not 0 but negative. FIG. 2(e) shows a state defocused in the positive z direction on the image side. Here the parallax amount between the first image signal and the second image signal is positive. Comparing FIGS. 2(d) and 2(e) shows that the direction of the positional shift reverses with the sign of the defocus amount. It also shows that the positional shift arises in accordance with the defocus amount, following the imaging (geometric) relationship of the imaging optical system. The parallax amount, that is, the positional shift between the first image signal and the second image signal, can be detected by a region-based matching technique described later.
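To make the idea of a signed parallax amount concrete, the following is a minimal Python sketch of region-based matching on one-dimensional signals. It is not the matching procedure of this publication, which is described later in the source. The cost function (sum of absolute differences), the integer-only shifts, the function name `estimate_parallax`, and the sign convention (positive when the second signal is displaced toward larger indices) are all assumptions made for illustration.

```python
def estimate_parallax(s1, s2, center, half_window, max_shift):
    """Return the integer shift d that best aligns s2 with s1 around `center`.

    The window s1[center-half_window .. center+half_window] is compared with
    the same window in s2 displaced by d, using the sum of absolute
    differences (SAD) as the matching cost. Ties keep the first shift found.
    """
    low = center - half_window - max_shift
    high = center + half_window + max_shift
    if low < 0 or high >= len(s1) or high >= len(s2):
        raise ValueError("search window extends outside the signals")

    best_shift, best_cost = 0, float("inf")
    for d in range(-max_shift, max_shift + 1):
        cost = 0.0
        for k in range(-half_window, half_window + 1):
            cost += abs(s1[center + k] - s2[center + k + d])
        if cost < best_cost:
            best_shift, best_cost = d, cost
    return best_shift
```

With identical signals the function returns 0, matching the in-focus case, and the sign of the result flips when the second signal is displaced in the opposite direction, matching the reversal between the two defocused cases.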

<Attribute Information Estimation Processing>
Next, the processing executed in the digital camera 100 of this embodiment to estimate attribute information of a target object from a captured image is described using the flowchart of FIG. 3(a). The processing corresponding to the flowchart of FIG. 3(a) can be realized by the control unit 12 reading the corresponding processing program stored, for example, in the storage unit 14, loading it into a volatile memory (not shown), executing it, and controlling each part of the digital camera 100. The same applies to FIGS. 8 and 9, described later.

In S301, the control unit 12 selects the object to be measured. At shooting time, a list of measurable objects is displayed on the display unit 16, and the user can select the desired object with the input unit 15. This makes it possible to skip estimating the object type and prevents misrecognition. Here the input unit 15 and the display unit 16 are separate, but the display unit 16 may take on the function of the input unit 15 by means of a touch panel or the like. FIG. 4 shows an example display of the list of measurement targets. In FIG. 4, animal categories and the detailed animal types within each category are displayed selectably, and the user simply selects one of the displayed options. If the subject does not fit any of the object types prepared in advance, an "Other" option may be displayed, as shown in FIG. 4, and attribute information may be estimated using general parameters.

In S302, the control unit 12 performs imaging with the configured imaging settings, such as focus position, aperture, and exposure time. More specifically, the control unit 12 causes the image sensor 11 to capture an image, has the captured image transferred to the image processing device 13, and has it stored in the memory 136. The captured image here consists of two image signals: image signal S1, composed only of signals output from the first photoelectric conversion units 115 of the image sensor 11, and image signal S2, composed only of signals output from the second photoelectric conversion units 116.

In S303, the image processing device 13 generates a viewing image and a depth image from the captured image. More specifically, the image generation unit 130 of the image processing device 13 first generates a single Bayer-array image by adding the pixel values of corresponding pixels in image signal S1 and image signal S2. The image generation unit 130 then demosaics the Bayer-array image into R, G, and B color images to generate the viewing image. Demosaicing is performed according to the color filters arranged on the image sensor, and any demosaicing method may be used. The image generation unit 130 also performs processing such as noise removal, luminance signal conversion, aberration correction, white balance adjustment, and color correction to generate the final viewing image, which it stores in the memory 136.
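The first operation in S303, adding S1 and S2 pixel by pixel to form one Bayer-array image, can be sketched in a few lines of Python. This sketch assumes the two signals are given as equally shaped lists of rows; the function name `combine_pupil_images` is hypothetical, and demosaicing and the later corrections are deliberately left out.

```python
def combine_pupil_images(s1, s2):
    """Add two pupil-divided images pixel by pixel.

    s1 and s2 are lists of rows holding the raw values from the first and
    second photoelectric conversion units. The result has the same shape
    and still carries the sensor's colour filter pattern.
    """
    if len(s1) != len(s2) or any(len(a) != len(b) for a, b in zip(s1, s2)):
        raise ValueError("S1 and S2 must have the same shape")
    return [[a + b for a, b in zip(row1, row2)] for row1, row2 in zip(s1, s2)]
```

Because the sum is taken before demosaicing, the combined image can be processed in the same way as the output of a sensor with one photoelectric conversion unit per pixel.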

<Depth image generation processing>
The depth image, on the other hand, is generated by the depth generation unit 131. The processing for depth image generation is described with reference to the flowchart of FIG. 3(b).

In S311, the depth generation unit 131 performs light amount correction on the image signals S1 and S2. At peripheral angles of view of the imaging optical system 10, vignetting causes the first pupil region 210 and the second pupil region 220 to differ in shape, so the light amount balance between the image signals S1 and S2 is lost. In this step, the depth generation unit 131 therefore corrects the light amounts of the image signals S1 and S2 using, for example, light amount correction values stored in advance in the storage unit 14 and/or the memory 136.

In S312, the depth generation unit 131 reduces noise generated during conversion in the image sensor 11. Specifically, the depth generation unit 131 applies filtering to the image signals S1 and S2. In general, the SN ratio is lower and noise components are relatively larger at higher spatial frequencies. The depth generation unit 131 therefore applies a low-pass filter whose pass rate decreases as the spatial frequency increases. Because the light amount correction in S311 may not give the desired result, for example owing to manufacturing errors in the imaging optical system 10, it is desirable for the depth generation unit 131 to apply a band-pass filter that blocks the DC component and has a low pass rate for high-frequency components.

In S313, the depth generation unit 131 derives the amount of parallax between the image signals S1 and S2. Specifically, the depth generation unit 131 sets, in the image signal S1, a point of interest corresponding to the representative pixel information and a matching region centered on that point. The matching region may be a rectangular region, for example a square with sides of a predetermined length centered on the point of interest. Next, the depth generation unit 131 sets a reference point in the image signal S2 and a reference region centered on that reference point. The reference region has the same size and shape as the matching region. While moving the reference point sequentially, the depth generation unit 131 derives the degree of correlation between the image in the matching region of the image signal S1 and the image in the reference region of the image signal S2, and identifies the reference point with the highest correlation as the corresponding point in the image signal S2 for the point of interest. The relative positional shift between the corresponding point and the point of interest is the amount of parallax at the point of interest.

By calculating the amount of parallax while sequentially changing the point of interest according to the representative pixel information, the depth generation unit 131 derives the amounts of parallax at the plurality of pixel positions defined by that information. For simplicity, in this embodiment the pixel positions at which the parallax is calculated (the pixel group included in the representative pixel information) are set to equal the number of pixels in the viewing image, so that depth information is obtained at the same resolution as the viewing image. The degree of correlation may be derived using a method such as NCC (Normalized Cross-Correlation), SSD (Sum of Squared Difference), or SAD (Sum of Absolute Difference).
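The region matching described for S313 can be illustrated with a minimal sketch. It assumes SAD as the correlation measure (lower cost meaning higher correlation), a purely horizontal search, and images given as lists of rows; the function names, the window half-size, and the search range are illustrative choices, not taken from the patent. The point of interest is assumed to lie far enough from the image border for the matching region to fit.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized 2-D blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def crop(image, cx, cy, half):
    """Square block of side 2*half+1 centred on (cx, cy)."""
    return [row[cx - half:cx + half + 1]
            for row in image[cy - half:cy + half + 1]]


def disparity_at(s1, s2, cx, cy, half=1, max_shift=3):
    """Horizontal shift of the S2 reference region that best matches
    the S1 matching region centred on the point of interest (cx, cy)."""
    width = len(s1[0])
    template = crop(s1, cx, cy, half)
    best_shift, best_cost = 0, float("inf")
    for shift in range(-max_shift, max_shift + 1):
        rx = cx + shift  # candidate reference point in S2
        if rx - half < 0 or rx + half >= width:
            continue  # reference region would leave the image
        cost = sad(template, crop(s2, rx, cy, half))
        if cost < best_cost:
            best_shift, best_cost = shift, cost
    return best_shift
```

Swapping `sad` for an NCC or SSD function changes only the cost computation; with NCC the comparison would select the maximum rather than the minimum.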

The derived amount of parallax can be converted, using a predetermined conversion coefficient, into a defocus amount, which is the distance from the image sensor 11 to the focal point of the imaging optical system 10. With K denoting the predetermined conversion coefficient and ΔL the defocus amount, the amount of parallax d is converted into the defocus amount by Formula 1 below.

(Formula 1)
ΔL = K × d
The defocus amount ΔL can then be converted into a subject distance using Formula 2 below, the lens formula of geometrical optics.
(Formula 2)
1/A + 1/B = 1/F
Here, A is the distance from the object plane to the principal point of the imaging optical system 10 (the subject distance), B is the distance from the principal point of the imaging optical system 10 to the image plane, and F is the focal length of the imaging optical system 10. Because the value of B in the lens formula can be derived from the defocus amount ΔL, the subject distance A can be derived based on the focal length setting at the time of imaging.
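As a worked illustration of Formulas 1 and 2, the sketch below converts a parallax value into a subject distance. The text does not state how B is obtained from ΔL, so the code assumes B = B0 + ΔL, where B0 (here `focused_image_distance`) is a hypothetical image distance for the in-focus plane; that relation, its sign convention, and the parameter names are assumptions made for illustration only.

```python
def subject_distance(d, k, focal_length, focused_image_distance):
    """Convert parallax d into subject distance A.

    All lengths share one unit (for example millimetres).
    """
    delta_l = k * d                        # Formula 1: ΔL = K × d
    b = focused_image_distance + delta_l   # assumption: B = B0 + ΔL
    if b <= focal_length:
        raise ValueError("image distance B must exceed the focal length F")
    # Formula 2: 1/A + 1/B = 1/F  ->  A = B*F / (B - F)
    return b * focal_length / (b - focal_length)
```

For example, with F = 50 and B0 = 52 in the same unit and zero parallax, the function returns 52 × 50 / 2 = 1300.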

The depth generation unit 131 constructs two-dimensional information whose pixel values are the subject distances derived in this way, and stores it in the memory 136 as the depth image.

Meanwhile, in S304, the object detection unit 132 detects the target object region. Based on the type of target object selected in S301, the object detection unit 132 identifies the target object region using information learned in advance and stored in the storage unit 14, and extracts the target object along the contour of the identified region. The depth image may also be used to assist extraction of the target object region. Pixels outside the extracted target object region are set to a specific or constant value, generating an object extraction image in which only the target object region remains. Likewise, in the depth image, pixels outside the target object region are replaced with a specific or constant value, generating an object extraction depth image in which only the target object region has valid values. The object extraction image and the object extraction depth image are stored in the memory 136 and used in subsequent processing. Various machine learning techniques, such as deep learning, can be used to learn to extract the target object region; the method is not limited to any particular one.

In S305, the posture estimation unit 133 estimates the posture of the target object in the object extraction image. From the detection result of the target object region in S304, the posture estimation unit 133 extracts feature points within the region in the object extraction image and estimates the posture using three-dimensional shape feature point data learned in advance and stored in the storage unit 14. Using the object extraction depth image as well allows finer posture changes to be estimated. Posture estimation mainly uses the object extraction image to estimate information such as where parts of the target animal (head, torso, legs, tail) are located and which way the head is facing. The orientation of the whole target object as seen from the shooting direction can be determined from the orientation of the torso, and the object extraction depth image is used for this. The normal direction at each pixel is calculated from the object extraction depth image, and the normal directions of the torso are obtained using the torso position information. Because an animal's torso is a curved surface, the normal direction is not constant, so principal component analysis or the like is performed to calculate the main normal direction. A plane perpendicular to this main normal direction is estimated as the plane representing the orientation of the target object. For example, as shown in FIG. 5, a plane Pv that passes through the center of the torso and cuts the target animal vertically from head to tail, and a plane Ph that cuts it horizontally, are estimated and used for the posture transformation described later.
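One way to obtain a main normal direction from many per-pixel normals is sketched below. It assumes an uncentred principal component analysis: the dominant eigenvector of the scatter matrix of the normals, found by power iteration. The patent only says "principal component analysis or the like", so this particular choice, the function name, and the iteration count are assumptions.

```python
import math


def dominant_direction(normals, iterations=50):
    """Principal eigenvector of the 3x3 scatter matrix sum(n n^T),
    computed by power iteration. `normals` is a list of (x, y, z)."""
    m = [[sum(n[i] * n[j] for n in normals) for j in range(3)]
         for i in range(3)]
    v = [1.0, 1.0, 1.0]  # arbitrary start; must not be orthogonal to the answer
    for _ in range(iterations):
        w = [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            raise ValueError("degenerate set of normals")
        v = [c / norm for c in w]
    return v
```

The returned unit vector would serve as the normal of the plane that represents the torso orientation.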

In this embodiment, target object region detection and posture estimation are separate processes, but they may be performed simultaneously using machine learning.

In S306, the posture transformation unit 134 uses the posture of the target object estimated in S305 to transform the posture of the target object in both the object extraction image and the object extraction depth image. For example, when estimating the dimensions of an animal as attribute information from the object extraction image, each type of animal has a particular posture in which it is easy to measure. For a mammal, as shown in FIG. 6, converting the captured image (FIG. 6(a)) into an image as captured from the side (FIG. 6(b)) makes it easy to measure dimensions such as total length La, head-body length Lb, and body height Lc.

Birds, fish, and the like are also preferably converted into images as captured from the side. For birds, dimensions such as total length La and wing length Ld are measured as shown in FIG. 6(c). For a bird with its wings spread, as in FIG. 6(d), it is preferable to convert to an image viewed from above and measure the wingspan Le. Reptiles, amphibians, and arthropods such as insects are also preferably converted to an image viewed from above. For any animal, however, it is desirable to convert to a side view or a top view as appropriate for the posture of the target object in the captured image.

The posture transformation here is basically an image transformation by geometric transformation. Taking a mammal as an example, the vertical cutting plane Pv estimated in the posture estimation as representing the animal's posture is used, and a rotation angle is calculated such that the normal of this plane is perpendicular to the imaging apparatus 1. A rotation matrix R is generated from the resulting rotation angle, and the object extraction image and the object extraction depth image are rotated as in Formula 3 below, converting the posture of the target animal in FIG. 6(a) into a posture like that in FIG. 6(b).
(Formula 3)
P' = R P
P denotes a position (x, y, z) in the image before transformation, and P' is the position in the image after transformation. Information at pixel positions left empty by the transformation is interpolated from surrounding pixels to generate a transformed image without gaps.
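The rotation step can be sketched as follows, assuming the transformation has the form P' = R P and that the goal is to turn the normal of the plane Pv onto the optical axis, here taken as (0, 0, 1). The construction of R by the Rodrigues formula, the function names, and the choice of axis are illustrative assumptions rather than the patent's stated procedure.

```python
def rotation_aligning(src, dst):
    """Rotation matrix R that turns unit vector src onto unit vector dst,
    using R = I + K + K^2 / (1 + c) with K the cross-product matrix of
    src x dst and c = src . dst (Rodrigues formula)."""
    v = [src[1] * dst[2] - src[2] * dst[1],
         src[2] * dst[0] - src[0] * dst[2],
         src[0] * dst[1] - src[1] * dst[0]]
    c = sum(s * d for s, d in zip(src, dst))
    if abs(1.0 + c) < 1e-12:
        raise ValueError("src and dst are opposite; rotation axis is ambiguous")
    k = [[0.0, -v[2], v[1]],
         [v[2], 0.0, -v[0]],
         [-v[1], v[0], 0.0]]
    k2 = [[sum(k[i][m] * k[m][j] for m in range(3)) for j in range(3)]
          for i in range(3)]
    return [[(1.0 if i == j else 0.0) + k[i][j] + k2[i][j] / (1.0 + c)
             for j in range(3)] for i in range(3)]


def rotate(r, p):
    """Apply P' = R P to one position p = (x, y, z)."""
    return [sum(r[i][j] * p[j] for j in range(3)) for i in range(3)]
```

In use, each valid pixel of the object extraction depth image would be lifted to (x, y, z), passed through `rotate`, and resampled onto the output grid, with gaps filled by interpolation as the text describes.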

The posture-transformed object extraction image and object extraction depth image are stored in the memory 136 and used in subsequent processing. This embodiment describes posture transformation using a rotation matrix as an example, but transformations that add translation, scaling, and the like to rotation may also be used. Posture transformation also has the advantage of reducing the data that must be measured and held in advance for the attribute information estimation described later.

In S307, the attribute information estimation unit 135 estimates the attribute information of the target object. The attribute information represents the dimensions, shape, volume, and mass of the target object, and at least one of these is estimated.

First, dimension estimation, one type of attribute information, is described. In dimension estimation, the posture-transformed object extraction image is displayed on the display unit 16, and the user uses the input unit 15 to designate the desired measurement positions on the animal in the displayed image. The positions may be designated by specifying two points (FIG. 7(a)), by specifying three or more points that are connected linearly (FIG. 7(b)) or by a polynomial, or by having the user trace the part to be measured (FIG. 7(c)). FIG. 7(a) shows the case where two points P1 and P2 are designated and the horizontal length between them is measured. It is desirable to also allow measuring the vertical length or the Euclidean distance between the two points. FIG. 7(b) shows an example in which four points P1 to P4 are designated and connected by straight lines. Alternatively, a spline curve or the like may be used to interpolate between the points and measure the length. FIG. 7(c) is an example of measuring the length of a curve traced by the user from P1 to P2. Among these ways of designating measurement positions and measuring the designated section, designating two points measures only a straight line and is simple, whereas designating three or more points or tracing also allows the length of a curve to be measured, which increases the freedom of measurement. Measurement along a curve is particularly effective when the posture transformation could not produce the desired posture, for example for animals that are difficult to observe stretched out in a straight line, such as when measuring the total length La of a snake as shown in FIG. 6(e).
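The measurement modes described above reduce to a few short functions. The sketch below assumes designated points are (x, y) pixel coordinates and that a traced curve is approximated by its sampled points; the function names are illustrative.

```python
import math


def horizontal_length(p1, p2):
    """Horizontal length between two designated points (FIG. 7(a) style)."""
    return abs(p2[0] - p1[0])


def vertical_length(p1, p2):
    """Vertical length between two designated points."""
    return abs(p2[1] - p1[1])


def polyline_length(points):
    """Total length of straight segments joining consecutive points
    (FIG. 7(b) style); also approximates a traced curve when the trace
    is densely sampled (FIG. 7(c) style)."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))
```

All results are in pixels, so they are image-space lengths that still need conversion to object space.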

After the measurement points are designated, the length is measured in pixel or sub-pixel units by counting the number of pixels between the designated measurement positions. The value measured at this point is a length in image space. To convert the length in pixels into an actual length in object space, it is first converted into an image-space length in SI units using the size of one pixel of the image sensor 11. The shooting magnification M is then obtained from the shooting parameters, and the actual length in object space is calculated as the product of the measured image-space length and M.

The shooting magnification M can be calculated by Formula 4 below from the shooting parameters, namely the focal length F of the imaging optical system 10 and the target object distance Z.
(Formula 4)
M = Z/F
The target object distance Z is obtained by measuring in advance the focus distance corresponding to each position of the focus lens included in the imaging optical system 10, detecting the focus lens position at the time of shooting, and taking the corresponding focus distance as Z.
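The conversion from a pixel count to an object-space length can be written out as a short sketch. It follows the two steps in the text (pixel count to image-space length via the pixel size, then multiplication by M = Z/F); the parameter names and the example numbers are illustrative assumptions.

```python
def object_length(pixel_count, pixel_pitch, target_distance, focal_length):
    """Convert a length measured in pixels into an object-space length.

    pixel_pitch, target_distance and focal_length must share one unit
    (for example millimetres); the result is in that unit.
    """
    image_length = pixel_count * pixel_pitch   # image-space length
    m = target_distance / focal_length         # Formula 4: M = Z / F
    return image_length * m
```

For example, 1000 pixels at a 0.004 mm pitch give a 4 mm image-space length; with Z = 2000 mm and F = 50 mm, M = 40 and the object-space length is 160 mm.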

Next, shape estimation, one type of attribute information, is described. Shape estimation estimates the three-dimensional shape of the target object, using the information on the type of object selected in S301 and the posture-transformed object extraction image and object extraction depth image generated in S306. Because the values of the posture-transformed object extraction depth image depend on the distance from the digital camera 100 to the target object, subtracting the distance to the target object gives the depth image (that is, the shape) of the target object surface as seen from the digital camera 100. In the following description, the particular measured face of the target object is called the front surface. A single shot can measure the shape of only one face of the target object; the opposite face, which is not visible from the shooting direction, cannot be measured. To estimate the opposite face, the three-dimensional shapes of a plurality of objects to be measured are measured in advance and stored as reference three-dimensional shapes in the storage unit 14 and/or the memory 136. A single average three-dimensional shape may be stored for each object, but it is desirable to hold several in order to improve the accuracy of estimating the opposite face. The reference three-dimensional shape may be data in voxel units or data expressed in SI units, and the conversion described below differs depending on which is used. Here, a voxel means a three-dimensional pixel size obtained by extending one pixel in the x, y, and z directions, and data in SI units means the size of the target object measured on the object side in SI units. In the front surface shape information calculated above, the depth information (Z direction) is in SI units, whereas the size of the target object in the X and Y directions is in pixels. Depending on the data units of the reference three-dimensional shape, either the depth information is converted into voxel units or the X and Y size of the target object is converted into SI units. The conversion between voxel units and SI units is performed, as described above, by obtaining the shooting magnification M from the shooting parameters using Formula 4.

Next, the reference three-dimensional shape is scaled so that it is the same size as the detected target object. Position matching is then performed between the front surface shape of the target object and the scaled reference three-dimensional shape. The matching determines which face of the reference three-dimensional shape the measured surface shape corresponds to, and at the same time identifies, in the reference three-dimensional shape, the opposite face of the target object that was not measured. The three-dimensional shape of the target object is estimated by combining this opposite face of the reference three-dimensional shape with the measured surface shape of the target object. When several different reference three-dimensional shapes are stored, the opposite face is estimated from the one that best matches the measured shape.

To further improve the estimation accuracy of the three-dimensional shape, the difference between the measured surface shape and the matching face of the reference three-dimensional shape is calculated, and this difference is added to or subtracted from the opposite face of the reference three-dimensional shape to correct it, giving the estimated opposite face. Alternatively, the difference relative to the thickness of the shape may be calculated, and the correction amount varied according to the thickness of the shape at the opposite face.

If the size of one pixel in object space is large relative to the size of the target object, the estimated three-dimensional shape is stepped and inaccurate. It is therefore desirable for the image sensor 11 to have a large number of pixels and for the object to fill as much of the frame as possible when shooting. To reduce pixel-level steps, interpolation may be applied to obtain a smoother shape, which may further be converted into polygon data.

Next, volume estimation, one type of attribute information, is described. In volume estimation, the volume is calculated from the three-dimensional shape estimated in the shape estimation. When the estimated three-dimensional shape is voxel data, the number of voxels in the shape is counted and the length of one side of a voxel is converted using Formula 4 to estimate the volume in object space. When the estimated three-dimensional shape is already expressed in SI units in object space, the volume is estimated by integrating over the interior of the shape. The estimated volume may be derived on a voxel basis, the voxel being the unit volume element (regular grid unit) underlying the image processing, or on the basis of actual real-world dimensions.
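For the voxel case, the volume calculation can be sketched as below. It assumes cubic voxels whose edge in object space equals the pixel size multiplied by M = Z/F (Formula 4); the text does not spell out this edge conversion in detail, so the relation and the parameter names are assumptions.

```python
def voxel_volume(voxel_count, pixel_pitch, target_distance, focal_length):
    """Object-space volume of a shape made of `voxel_count` cubic voxels.

    pixel_pitch, target_distance and focal_length share one length unit;
    the result is in that unit cubed.
    """
    # Assumed voxel edge in object space: pixel size scaled by M = Z / F.
    edge = pixel_pitch * target_distance / focal_length
    return voxel_count * edge ** 3
```

For example, 1000 voxels with a 0.005 mm pixel size, Z = 2000 mm, and F = 50 mm give an edge of 0.2 mm and a volume of 8 cubic millimetres.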

Next, mass estimation, one type of attribute information, is described. In mass estimation, the mass of the target object is estimated by multiplying the volume derived in the volume estimation by density information for the target object stored in the storage unit 14 and/or the memory 136. The density may be treated as uniform over the target object, but to estimate mass more accurately, different values may be held and used for each part. Using skeletal analysis of the target object or the like, the object is divided into parts such as head, torso, arms, and legs, and the mass is estimated using a different density for each. Although this embodiment calculates mass from density information, the specific weight of the target object may instead be stored in advance in the storage unit 14 and/or the memory 136 and multiplied by the volume of the estimated three-dimensional shape to estimate the weight.
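Per-part mass estimation amounts to a sum of volume times density over the parts. The sketch below assumes the part volumes have already been obtained by splitting the estimated shape; the function name and the dictionary layout are illustrative, and any numbers passed in are placeholders rather than measured densities.

```python
def estimate_mass(part_volumes, part_densities):
    """Sum of volume x density over body parts.

    Both arguments map a part name (for example "head", "torso") to a
    value; volumes and densities must use consistent units.
    """
    missing = set(part_volumes) - set(part_densities)
    if missing:
        raise KeyError(f"no density stored for parts: {sorted(missing)}")
    return sum(volume * part_densities[part]
               for part, volume in part_volumes.items())
```

A uniform-density estimate is the special case of a single part covering the whole object.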

In S308, the control unit 12 displays the attribute information estimated in S307 on the display unit 16 and stores it in the storage unit 14. The attribute information estimated in S307 is preferably recorded, in association with the depth image, as metadata of the viewing image generated in S303.

As described above, this embodiment makes it possible to estimate attribute information of an object in an image from the depth image generated from the captured image and the conditions under which the image was captured. Specifically, the attribute information of the target object can be estimated using the depth image, information learned in advance for object region detection, posture estimation, and posture transformation, shapes measured in advance, shooting parameters, and the mass ratio.

[Embodiment 2] Next, Embodiment 2 is described.

In Embodiment 1, the user selects the object to be measured. Embodiment 2 differs in that there is no selection input by the user. In Embodiment 2, the configuration and functions of the digital camera 100 are the same as in FIGS. 1 and 3 of Embodiment 1, and the description focuses on the differences from the attribute information estimation processing of Embodiment 1.

FIG. 8 shows the attribute information estimation processing of Embodiment 2; processes identical to those in FIG. 3 of Embodiment 1 are given the same step numbers.

In S801, as in S302 of FIG. 3, the control unit 12 performs imaging with the configured imaging settings, such as focal position, aperture, and exposure time.

In S802, as in S303 of FIG. 3, the image generation unit 130 generates a viewing image and a depth image.

In S803, the object detection unit 132 recognizes the subject. In subject recognition and region detection, objects in the image are identified and their positions and contours extracted based on object classification and type information acquired in advance by machine learning, and an object extraction image is generated. The extracted position and contour information is also applied to the depth image, and the depth generation unit 131 generates an object extraction depth image. The machine learning is not limited to any particular method.

In S804, the posture estimation unit 133 and the posture transformation unit 134 estimate and transform the posture of the object identified and extracted in S803. The details of S804, performed by the posture estimation unit 133 and the posture transformation unit 134 of the image processing device 13, are described with reference to the flowchart of FIG. 9.

In S8041, the posture estimation unit 133 obtains the shape of the surface of the target object as seen from the shooting direction by subtracting the distance from the digital camera 100 to a reference position on the target object from the object extraction depth image generated in S803. The reference position may be set at the point on the target object nearest to the digital camera 100 or at the farthest point, and is not particularly limited.

Ｓ８０４２で、姿勢推定部１３３は、Ｓ８０４１で得られた対象物体の表面形状と、予め記憶部１４に格納されている対象物体の３次元形状とを比較し、同じ大きさになるようにいずれか一方の大きさを変更する。その後の属性情報推定を考慮した場合、予め記憶部１４および／またはメモリ１３６に格納されている３次元形状の大きさを対象物体の表面形状の大きさに合わせるのが望ましく、変換係数を予め記憶部１４および／またはメモリ１３６に記憶しておくことが望ましい。 In S8042, the posture estimation unit 133 compares the surface shape of the target object obtained in S8041 with the three-dimensional shape of the target object stored in advance in the storage unit 14, and changes the size of one of them so that the two have the same size. In view of the subsequent attribute information estimation, it is desirable to match the size of the three-dimensional shape stored in advance in the storage unit 14 and/or the memory 136 to the size of the surface shape of the target object, and to store the conversion coefficient in advance in the storage unit 14 and/or the memory 136.
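
The description does not state how the conversion coefficient of S8042 is computed. The following Python sketch assumes, purely for illustration, a uniform scale derived from the largest bounding-box side of each point set; the names `extent`, `conversion_coefficient`, and `scale` are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def extent(points: List[Point]) -> float:
    """Largest bounding-box side length of a point set."""
    return max(
        max(p[axis] for p in points) - min(p[axis] for p in points)
        for axis in range(3)
    )


def conversion_coefficient(measured: List[Point], model: List[Point]) -> float:
    """Uniform factor that brings the stored model to the measured size (assumed rule)."""
    model_extent = extent(model)
    if model_extent == 0:
        raise ValueError("model has zero extent")
    return extent(measured) / model_extent


def scale(points: List[Point], k: float) -> List[Point]:
    """Apply the conversion coefficient to every model point."""
    return [(k * x, k * y, k * z) for x, y, z in points]


# Made-up coordinates for illustration only.
stored_model = [(0.0, 0.0, 0.0), (2.0, 1.0, 1.0)]
measured_surface = [(0.0, 0.0, 0.0), (3.0, 1.5, 1.5)]
k = conversion_coefficient(measured_surface, stored_model)
scaled_model = scale(stored_model, k)
```

Scaling the stored model rather than the measurement, as the text recommends, keeps the measured surface in real-world units for the later attribute estimation.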

Ｓ８０４３で、姿勢推定部１３３は、Ｓ８０４１で取得した対象物体の表面形状と、予め記憶部１４および／またはメモリ１３６に格納されている３次元形状とのマッチング処理を行い、対象物体を撮影している方向を特定する。この方法は、撮影した物体の姿勢が予め記憶部１４および／またはメモリ１３６に格納されている属性情報が取得しやすい姿勢と類似した姿勢であって、撮影方向が異なる場合に有効である。一方、対象物体の姿勢が予め記憶部１４および／またはメモリ１３６に格納されている姿勢と大きく異なる場合は、マッチング処理における評価値（マッチングスコア）が低下する。よって、Ｓ８０４４においてマッチングスコアを閾値と比較する。Ｓ８０４４においてマッチングスコアが閾値より高い場合は、Ｓ８０４５で対象物体の向きを表す撮影面が特定される。マッチングスコアが閾値より低い場合は、Ｓ８０４６で対象物体の関節部位および骨格の特定を行う。この特定も予め学習して取得した情報を利用して行う。 In S8043, the posture estimation unit 133 performs matching between the surface shape of the target object acquired in S8041 and the three-dimensional shape stored in advance in the storage unit 14 and/or the memory 136, and identifies the direction from which the target object is being photographed. This method is effective when the posture of the photographed object is similar to the posture stored in advance in the storage unit 14 and/or the memory 136, from which attribute information is easy to acquire, and only the shooting direction differs. On the other hand, when the posture of the target object differs greatly from the posture stored in advance in the storage unit 14 and/or the memory 136, the evaluation value of the matching (matching score) decreases. The matching score is therefore compared with a threshold in S8044. If the matching score is higher than the threshold in S8044, the imaging plane representing the orientation of the target object is specified in S8045. If the matching score is lower than the threshold, the joint parts and skeleton of the target object are identified in S8046. This identification is also performed using information acquired by learning in advance.
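
The branch in S8043 to S8046, together with the return to S8043 after the posture transformation of S8047, can be sketched as a small control loop. This is only an outline under stated assumptions: the callables `match` and `to_reference_pose`, the bound `max_rounds`, and the treatment of a score equal to the threshold as "low" are not specified in the description and are chosen here for illustration.

```python
from typing import Callable, Optional, Tuple, TypeVar

S = TypeVar("S")  # hypothetical surface-shape representation


def find_imaging_plane(
    surface: S,
    match: Callable[[S], Tuple[float, str]],
    to_reference_pose: Callable[[S], S],
    threshold: float,
    max_rounds: int = 3,
) -> Optional[str]:
    """Return the imaging plane once the matching score exceeds the threshold."""
    for _ in range(max_rounds):
        score, plane = match(surface)          # S8043: match against stored 3D shape
        if score > threshold:                  # S8044: compare score with threshold
            return plane                       # S8045: imaging plane specified
        surface = to_reference_pose(surface)   # S8046 and S8047: skeleton, then re-pose
    return None                                # no confident match (assumed fallback)


# Toy stand-ins so the sketch runs; real matching works on 3D shapes.
def toy_match(pose: str) -> Tuple[float, str]:
    return (0.9, "left side") if pose == "standing" else (0.4, "unknown")


def toy_transform(pose: str) -> str:
    return "standing"


plane = find_imaging_plane("sitting", toy_match, toy_transform, threshold=0.7)
```

In the toy run, the first match of the sitting pose scores below the threshold, the pose is converted, and the second match succeeds.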

Ｓ８０４７で、姿勢変換部１３４は、予め用意された３次元形状における骨格位置との違いを算出し、関節位置を支点に関節位置より先端方向部分の部位を回転させて、基準となる姿勢に類似するように表面形状を変換する。例えば、座った状態の牛を撮影した場合、大腿部、飛節、前膝などの脚部の関節位置や長さ、回転角を推定し、関節を回転の支点として回転させて立ち上がった状態の推定画像を生成する。変換された表面形状を再びＳ８０４３のマッチング処理に入力し、再度対象物体の向きを表す撮影面の特定を行う。 In S8047, the posture transformation unit 134 calculates the difference from the skeleton position in the three-dimensional shape prepared in advance, and rotates the part distal to each joint position about that joint position as a fulcrum, thereby transforming the surface shape so that it resembles the reference posture. For example, when a sitting cow is photographed, the positions, lengths, and rotation angles of the leg joints, such as the thigh, hock, and front knee, are estimated, and the parts are rotated about the joints as fulcrums to generate an estimated image of the cow in a standing state. The transformed surface shape is input again to the matching in S8043, and the imaging plane representing the orientation of the target object is specified again.
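
The rotation of a distal part about a joint in S8047 can be illustrated in two dimensions. The sketch below assumes, for simplicity, that the part lies in a single plane and that the rotation angle has already been estimated; the function name `rotate_about_joint` and the coordinates are hypothetical.

```python
import math
from typing import List, Tuple

Point2 = Tuple[float, float]


def rotate_about_joint(points: List[Point2], joint: Point2, angle_rad: float) -> List[Point2]:
    """Rotate the points of a distal part about the joint position (the fulcrum)."""
    jx, jy = joint
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [
        (jx + c * (x - jx) - s * (y - jy), jy + s * (x - jx) + c * (y - jy))
        for x, y in points
    ]


# Made-up example: a lower-leg point one unit from the joint, rotated by 90 degrees.
joint = (1.0, 0.0)
lower_leg = [(2.0, 0.0)]
moved = rotate_about_joint(lower_leg, joint, math.pi / 2)
```

Each rotated point keeps its distance from the joint, so the estimated limb length is preserved by the transformation.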

Ｓ８０４８で、姿勢変換部１３４は、Ｓ８０４５で特定された撮影面情報に基づき、図３のＳ３０６と同様に、対象物体を側面から撮影したように、物体抽出画像および表面形状に対して幾何変換を利用して姿勢変換する。変換された物体抽出画像および表面形状はメモリ１３６に記憶され、以降の処理に利用される。 In S8048, based on the imaging plane information specified in S8045, the posture transformation unit 134 applies a geometric transformation to the object extraction image and the surface shape, as in S306 of FIG. 3, so that the target object appears as if photographed from the side. The transformed object extraction image and surface shape are stored in the memory 136 and used in subsequent processing.

図８の説明に戻り、Ｓ８０５で、図３のＳ３０７と同様に、属性情報推定部１３５により対象物体の属性情報の推定を行う。Ｓ８０５で推定された属性情報は、表示部１６に表示されるとともに記憶部１４に記憶される。推定された属性情報は、Ｓ８０２で生成された鑑賞用画像のメタデータとして深度画像と共に記憶することが望ましい。 Returning to the description of FIG. 8, in S805, the attribute information estimation unit 135 estimates the attribute information of the target object, as in S307 of FIG. 3. The attribute information estimated in S805 is displayed on the display unit 16 and stored in the storage unit 14. The estimated attribute information is desirably stored, together with the depth image, as metadata of the viewing image generated in S802.

ここで、Ｓ８０５の属性情報推定がＳ３０７と相違するところを説明する。 Here, the difference between attribute information estimation in S805 and S307 will be described.

属性情報の１つである寸法推定について、実施形態１では、ユーザが計測位置を指定し、指定された位置で寸法を計測していた。これに対して、実施形態２では、画像処理装置１３が、Ｓ８０３で得られた物体の識別結果により物体の種類を特定し、必要な寸法情報（全長・頭胴長・体高など）から寸法の計測位置を決定する。そして、Ｓ８０４６と同様に事前に学習して取得した情報を利用して骨格認識を行い、Ｓ８０３の被写体認識で得られた輪郭情報を利用して計測位置を特定する。 Regarding dimension estimation, which is one item of attribute information, in the first embodiment the user designates the measurement position and the dimension is measured at the designated position. In the second embodiment, by contrast, the image processing device 13 identifies the type of the object from the identification result obtained in S803, and determines the dimension measurement positions from the required dimension information (total length, head-body length, body height, etc.). Then, as in S8046, skeleton recognition is performed using information acquired by learning in advance, and the measurement positions are specified using the contour information obtained by the subject recognition in S803.

属性情報の１つである形状推定については、Ｓ８０４の姿勢推定・変換で生成した表面形状と、予め記憶部１４および／またはメモリ１３６に格納されている３次元形状の大きさを合わせるための変換係数を利用する。変換係数を利用して記憶部１４に記憶されている３次元形状の大きさを変換し、計測対象物体の表面形状以外の背面部分を、大きさが変換された３次元形状から取得する。計測対象物体の表面形状と３次元形状から取得した背面形状とを合成することで計測対象物体の３次元形状を生成する。合成にあたり、接続部分は滑らかになるように平滑化処理を行う。 Shape estimation, which is one item of attribute information, uses the conversion coefficient for matching the size of the surface shape generated by the posture estimation and transformation in S804 with that of the three-dimensional shape stored in advance in the storage unit 14 and/or the memory 136. The size of the three-dimensional shape stored in the storage unit 14 is converted using the conversion coefficient, and the back portion of the measurement target object, which is not covered by the surface shape, is obtained from the size-converted three-dimensional shape. The three-dimensional shape of the measurement target object is generated by combining its surface shape with the back shape obtained from the three-dimensional shape. In the combination, smoothing is applied so that the joined portions are smooth.

属性情報の１つである体積および質量の推定については、対象物体の体毛を考慮した推定を行う。実施形態１では体毛を考慮しておらず、体毛も同じ密度として質量の算出を行っていたため、実際の質量と推定された質量との誤差が大きくなる場合がある。また、羊毛生産などにおいては、体毛体積のみの推定が必要な場合もある。 The estimation of volume and mass, which are items of attribute information, takes the body hair of the target object into account. In the first embodiment, body hair is not taken into account and the mass is calculated by treating the body hair as having the same density as the body, so the error between the actual mass and the estimated mass may become large. In addition, in wool production and the like, it may be necessary to estimate the volume of the body hair alone.

以上説明したように、本実施形態によれば、実施形態１の処理に加え、体毛量を推定し補正を行う。体毛量の推定には、まずジョイントバイラテラルフィルタやガイデットフィルターなどを利用してマッチング処理を行い、アルファマットを算出する。アルファマットが１以下の領域を体毛領域としてその厚みを算出し、推定した３次元形状から体毛領域を除いた体積を算出する。同様に質量推定についても、体毛領域を除いた体積を利用して、物体の種類ごとに格納されている密度情報を乗算することで体毛領域を除いた質量を推定する。または、体毛を含めた体積と体毛を除いた体積から体毛領域のみの体積を算出し、体毛の密度情報を利用して体毛質量を推定し、体毛を除いた質量との和をとることで、体毛の密度の違いを考慮した質量推定を行う。 As described above, according to the present embodiment, in addition to the processing of the first embodiment, the amount of body hair is estimated and corrected. In estimating the amount of body hair, matching processing is first performed using a joint bilateral filter, a guided filter, or the like, and an alpha matte is calculated. A region with an alpha matte of 1 or less is treated as a hair region, and its thickness is calculated, and a volume obtained by excluding the hair region from the estimated three-dimensional shape is calculated. Similarly, for mass estimation, the volume excluding the hair region is used and the density information stored for each type of object is multiplied to estimate the mass excluding the hair region. Alternatively, by calculating the volume of only the hair region from the volume including the hair and the volume excluding the hair, estimating the hair mass using the density information of the hair, and taking the sum of the mass excluding the hair, Perform mass estimation considering differences in hair density.
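
The second mass estimation variant described above (hair volume obtained as the difference of two volumes, then weighted by a separate hair density) can be written as a short Python sketch. The function name and all numeric values are illustrative assumptions; the description does not give densities or volumes.

```python
from typing import Tuple


def mass_with_hair(
    total_volume: float,   # volume including the body hair region
    body_volume: float,    # volume excluding the body hair region
    body_density: float,   # stored density for the object type
    hair_density: float,   # stored density for body hair
) -> Tuple[float, float, float]:
    """Return (body mass, hair mass, total mass) using separate densities."""
    if not 0 <= body_volume <= total_volume:
        raise ValueError("body volume must lie between 0 and the total volume")
    hair_volume = total_volume - body_volume
    body_mass = body_volume * body_density
    hair_mass = hair_volume * hair_density
    return body_mass, hair_mass, body_mass + hair_mass


# Made-up values (m^3 and kg/m^3) for illustration only.
body_mass, hair_mass, total_mass = mass_with_hair(0.75, 0.5, 1000.0, 200.0)
```

Applying the body density to the whole volume in this toy case would give 750 instead of 550, which is the kind of overestimate the hair correction is meant to avoid.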

［他の実施形態］ [Other embodiments]
本実施形態では、撮像素子１１が撮像面位相差測距方式の光電変換素子を有し、鑑賞用画像と深度画像とを取得できるものとして説明したが、本発明の実施において、深度情報の取得はこれに限られるものではない。深度情報は、例えば両眼の撮像装置や複数の異なる撮像装置から得られた複数枚の撮像画像に基づいて、ステレオ測距方式で取得するものであってもよい。あるいは、例えば光照射部と撮像装置を用いたステレオ測距方式や、ＴＯＦ（Time of Flight）方式と撮像装置の組み合わせによる方式などを用いて取得するものであってもよい。 In the present embodiment, the imaging element 11 has been described as having photoelectric conversion elements of the imaging-plane phase-difference ranging type and as being capable of acquiring a viewing image and a depth image; however, in implementing the present invention, the acquisition of depth information is not limited to this. The depth information may be acquired by a stereo ranging method based on, for example, a plurality of captured images obtained from a binocular imaging device or from a plurality of different imaging devices. Alternatively, it may be acquired using, for example, a stereo ranging method that uses a light irradiation unit and an imaging device, or a method that combines a TOF (Time of Flight) method with an imaging device.

実施形態１と実施形態２の属性情報推定処理はそれぞれの実施形態に限定するものではなく、同じ情報を用いる処理を入れ替えても実現可能である。 The attribute information estimation processes of Embodiments 1 and 2 are not limited to their respective embodiments; processes that use the same information may be interchanged between the embodiments.

また、本実施形態として適用可能な画像処理装置は、デジタルスチルカメラ、デジタルビデオカメラ、車載カメラ、携帯電話やスマートフォンなどを含む。 Image processing apparatuses that can be applied as the present embodiment include digital still cameras, digital video cameras, vehicle-mounted cameras, mobile phones, smart phones, and the like.

また、本発明は、上述の実施形態の１以上の機能を実現するプログラムを、ネットワーク又は記憶媒体を介してシステム又は装置に供給し、そのシステム又は装置のコンピュータにおける１つ以上のプロセッサがプログラムを読出し実行する処理でも実現可能である。また、１以上の機能を実現する回路（例えば、ＡＳＩＣ）によっても実現可能である。 Further, the present invention supplies a program that implements one or more functions of the above-described embodiments to a system or device via a network or a storage medium, and one or more processors in the computer of the system or device executes the program. It can also be realized by a process of reading and executing. It can also be implemented by a circuit (for example, ASIC) that implements one or more functions.

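本発明は、撮像装置を利用した非接触での物体の属性情報推定であり、例えば、簡易的な家畜の成長記録、動物園での動物の健康管理、野生動物の遠方からの属性取得などメジャーや質量計による計測が困難な状況において有用である。 The present invention estimates attribute information of an object without contact using an imaging device, and is useful in situations where measurement with a tape measure or a scale is difficult, for example, simple growth recording of livestock, health management of animals in zoos, and acquisition of attributes of wild animals from a distance.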

発明は上記実施形態に制限されるものではなく、発明の精神および範囲から離脱することなく、様々な変更および変形が可能である。従って、発明の範囲を公にするために請求項を添付する。 The invention is not limited to the embodiments described above, and various modifications and variations are possible without departing from the spirit and scope of the invention. Accordingly, the claims are appended to make public the scope of the invention.

１００…デジタルカメラ、１２…制御部、１３…画像処理装置、１３０…画像生成部、１３１…深度生成部、１３２…物体検出部、１３３…姿勢推定部、１３４…姿勢変換部、１３５…属性情報推定部 DESCRIPTION OF SYMBOLS 100... Digital camera 12... Control part 13... Image processing apparatus 130... Image generation part 131... Depth generation part 132... Object detection part 133... Posture estimation part 134... Posture conversion part 135... Attribute information estimation part

Claims

Depth generation means for generating depth information indicating the distance distribution of the subject in the depth direction from a captured image of the subject;
an object detection means for detecting an area of a specific object from the captured image;
posture estimation means for estimating the posture of the specific object;
attitude transformation means for transforming the attitude of the specific object in the captured image and the depth information into a specific attitude suitable for estimating attribute information of the specific object;
and attribute information estimation means for estimating attribute information of the specific object from the captured image whose attitude has been changed, depth information, and image capturing conditions.

2. The object detection means detects the type and area of the specific object by extracting the area of the specific object using information about the object learned and acquired in advance. The image processing device according to .

2. The image processing apparatus according to claim 1, wherein the orientation estimation means estimates the orientation of the specific object using object information obtained by learning in advance.

4. The apparatus according to claim 3, wherein the posture estimation means estimates the posture of the specific object by using preliminarily acquired three-dimensional shape data of the object and part data of the specific object. Image processing device.

The attitude transforming means uses the attitude of the specific object in the captured image and the depth information estimated by the attitude estimation means and the attitude of the specific object acquired in advance to convert the captured image and the depth information. 5. The image processing apparatus according to any one of claims 1 to 4, wherein the geometric transformation is performed on the .

The pose transforming means uses the information of the part of the specific object estimated by the pose estimating means and the information of the pose and part of the object obtained in advance to convert the specified part in the captured image and the depth information. 6. The image processing apparatus according to claim 5, wherein the geometric transformation is performed for each part of the object.

The posture estimating means calculates a main normal direction of the body of the specific object and obtains a plane perpendicular to the normal direction,
7. The image processing apparatus according to claim 6, wherein said attitude transformation means performs geometric transformation on said captured image and said depth information with reference to said plane.

8. The image processing apparatus according to claim 1, wherein said attribute information is at least one of size, shape, volume and mass.

3. The attribute information estimating means specifies a measurement position corresponding to the type of the specific object, and estimates the dimension at the measurement position using object information acquired by learning in advance. 9. The image processing device according to 8.

The attribute information estimating means estimates the shape of the part of the object that cannot be obtained from the captured image by using the three-dimensional shape of the object obtained in advance, and synthesizes the shape with the shape of the part of the object obtained from the depth information of the object. 9. The image processing apparatus according to claim 8, wherein the three-dimensional shape of said specific object is estimated by:

9. The attribute information estimation means according to claim 8, wherein said attribute information estimation means estimates a volume from the three-dimensional shape of said specific object, and estimates a mass of said specific object using said volume and density of said object. Image processing device.

12. The image processing apparatus according to claim 11, wherein the attribute information estimating means estimates the mass of the specific object by using the parts of the object obtained from the captured image and the density of each part of the object. .

When the specific object is an animal, the attribute information estimating means calculates a volume of the region of the specific object excluding the hair region, and calculates the volume of the specific object using the volume excluding the hair region and the density of the object. 12. The image processing apparatus according to claim 11, wherein the mass of is estimated.

13. The image processing apparatus according to claim 1, further comprising input means for inputting information regarding said specific object.

The object detection means determines the type of the specific object by using information for identifying the type of the object acquired by learning in advance and information on the object input by the input means. 15. The image processing apparatus according to claim 14.

further comprising designating means for designating the measurement position of the specific object;
10. The image processing apparatus according to claim 9, wherein said attribute information estimating means calculates dimensions of the measurement position of said specific object specified by said specifying means.

16. The designating means designates the measurement positions in accordance with an operation by a user of designating two locations, an operation of designating three or more locations, or a tracing operation with respect to the specific object. The image processing device according to .

18. The image processing apparatus according to any one of claims 1 to 17, wherein said specific object is an animal other than human.

The image processing device is an imaging device having an imaging element that captures an image,
18. The image processing apparatus according to any one of claims 1 to 17, wherein said depth generating means generates depth information from images having different parallaxes captured by said imaging element.

20. The image processing apparatus according to any one of claims 1 to 19, further comprising recording means for recording the captured image, the depth information, and the attribute information in association with each other.

20. The image processing apparatus according to claim 19, wherein said image sensor has a plurality of photoelectric conversion units in one pixel.

a step in which the depth generating means generates depth information indicating a distance distribution of the subject in the depth direction from the captured image of the subject;
an object detection means detecting an area of a specific object from the captured image;
a pose estimating means estimating the pose of the particular object;
a step of transforming the pose of the specific object in the captured image and the depth information into a specific pose suitable for estimating attribute information of the specific object;
An image processing method, wherein attribute information estimating means estimates attribute information of the specific object from the captured image whose attitude has been changed, depth information, and photographing conditions of the image.

A program for causing a computer to function as the image processing apparatus according to any one of claims 1 to 21.

A storage medium storing a program for causing a computer to function as the image processing apparatus according to any one of claims 1 to 21.