WO2022224498A1 - Recognition device, recognition method, and program - Google Patents

Recognition device, recognition method, and program

Info

Publication number
WO2022224498A1
Authority
WO
WIPO (PCT)
Prior art keywords
recognition
image
sensor
recognition target
lidar sensor
Prior art date
Application number
PCT/JP2022/000218
Other languages
French (fr)
Japanese (ja)
Inventor
達雄 藤原
Original Assignee
ソニーグループ株式会社
Priority date
Filing date
Publication date
Application filed by ソニーグループ株式会社
Priority to CN202280028267.4A (CN117178293A)
Publication of WO2022224498A1

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01B MEASURING LENGTH, THICKNESS OR SIMILAR LINEAR DIMENSIONS; MEASURING ANGLES; MEASURING AREAS; MEASURING IRREGULARITIES OF SURFACES OR CONTOURS
    • G01B 11/00 Measuring arrangements characterised by the use of optical techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/50 Depth or shape recovery
    • G06T 7/521 Depth or shape recovery from laser ranging, e.g. using interferometry; from the projection of structured light
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/50 Depth or shape recovery
    • G06T 7/55 Depth or shape recovery from multiple images
    • G06T 7/593 Depth or shape recovery from multiple images from stereo images

Definitions

  • the present technology relates to a recognition device, a recognition method, and a program related to recognition of a recognition target.
  • Patent Document 1 describes providing the user with an image of the user reaching out for the virtual object in an augmented reality image in which the virtual object is superimposed on the camera image.
  • an object of the present technology is to provide a recognition device, a recognition method, and a program capable of improving recognition accuracy of a recognition target object.
  • a recognition device includes a processing unit.
  • the processing unit captures an image of a LiDAR (Light Detection and Ranging) sensor having a light emitting unit that irradiates a recognition target with light and a light receiving unit that receives light reflected from the recognition target, and the recognition target.
  • a LiDAR Light Detection and Ranging
  • the depth value of the recognition target obtained by the LiDAR sensor of the device equipped with an image sensor. to correct.
  • the depth correction information may include difference information between the depth value of the recognition target object based on the sensing result of the LiDAR sensor and the actual depth value of the recognition target object.
  • the device comprises a plurality of image sensors and one LiDAR sensor
  • the depth correction information includes the depth value of the recognition target calculated by triangulation using the position information of the recognition target detected from the sensing results of each of the plurality of image sensors, and the sensing result of the LiDAR sensor. may include difference information from the depth value of the recognition target object based on the depth image as .
  • the device comprises at least one image sensor and one LiDAR sensor;
  • the depth correction information is the position information of the recognition target detected from the sensing result of one of the image sensors and the position information of the recognition target detected from the reliability image as the sensing result of the LiDAR sensor. and the depth value of the recognition target object calculated by triangulation using the LiDAR sensor, and the difference information between the depth value of the recognition target object based on the depth image as the sensing result of the LiDAR sensor.
  • the object to be recognized may be a translucent body.
  • the object to be recognized may be human skin.
  • the object to be recognized may be a human hand.
  • the processing unit may recognize a gesture motion of a person who is the object to be recognized.
  • the processing unit may generate the depth correction information using the sensing result of the LiDAR sensor and the sensing result of the image sensor.
  • the device has a display
  • the processing unit may generate an image to be displayed on the display unit using the corrected depth value of the recognition target object.
  • a recognition method includes a LiDAR (Light Detection and Ranging) sensor having a light emitting unit that irradiates light onto an object to be recognized and a light receiving unit that receives light reflected from the object to be recognized, and the object to be recognized.
  • Depth correction information generated using the sensing result of the LiDAR sensor and the sensing result of the image sensor, the depth value of the recognition target acquired by the LiDAR sensor of the device comprising an image sensor that captures the Refer to and correct.
  • the program related to this technology is A LiDAR (Light Detection and Ranging) sensor having a light emitting section that irradiates light onto a recognition target and a light receiving section that receives light reflected from the recognition target, and an image sensor that captures the recognition target.
  • a step of correcting the depth value of the recognition object acquired by the LiDAR sensor of the device by referring to depth correction information generated using the sensing result of the LiDAR sensor and the sensing result of the image sensor. Let the device do it.
  • FIG. 1 is an external view of a mobile terminal as a recognition device according to an embodiment of the present technology.
  • FIG. 2 is a schematic configuration diagram of the mobile terminal.
  • FIG. 3 is a configuration diagram including functional configuration blocks of the mobile terminal.
  • FIG. 4 is a flowchart of a recognition method for a recognition target object.
  • FIG. 5 is a diagram for explaining a correction map.
  • FIG. 6 is a schematic diagram illustrating a method of generating a correction map according to the first embodiment.
  • FIG. 7 is a flowchart of a correction map generation method according to the first embodiment.
  • FIG. 8 is a diagram for explaining a basic image displayed on the display unit when generating a correction map.
  • FIG. 9 is a diagram for explaining a more detailed image displayed on the display unit when generating a correction map.
  • FIG. 10 is a flowchart relating to a method of displaying an image on the display unit when generating a correction map.
  • FIG. 11 is a schematic diagram illustrating a method of generating a correction map according to the second embodiment.
  • FIG. 12 is a flowchart of a correction map generation method according to the second embodiment.
  • FIG. 1 is an external view of a mobile terminal 1 as a recognition device.
  • FIG. 1A is a plan view of the mobile terminal 1 as seen from the front 1a side where the display unit 34 is located
  • FIG. 1B is a plan view of the mobile terminal 1 as seen from the back 1b side.
  • the XYZ coordinate directions orthogonal to each other shown in the drawings correspond to the width, length, and height of the mobile terminal 1, which has a substantially rectangular parallelepiped shape.
  • a plane parallel to the front surface 1a and the rear surface 1b is defined as an XY plane
  • the thickness direction of the mobile terminal 1 corresponding to the height direction is defined as the Z axis.
  • the mobile terminal 1 functions as a recognition device that recognizes a recognition target object.
  • the mobile terminal 1 is a device having a first camera 2A and a second camera 2B, which are image sensors, a LiDAR sensor 3, and a display unit 34.
  • a mobile terminal 1 is a device having a multi-view camera.
  • the mobile terminal 1 has a housing 4, a display section 34, a first camera 2A, a second camera 2B, and a LiDAR sensor 3.
  • the mobile terminal 1 is configured such that a housing 4 holds a display panel constituting a display unit 34, a first camera 2A, a second camera 2B, a LiDAR sensor 3, other various sensors, a drive circuit, and the like.
  • the mobile terminal 1 has a front surface 1a and a rear surface 1b located on the opposite side of the front surface 1a.
  • a display section 34 is arranged on the front face 1a side.
  • the display unit 34 is configured by a display panel (image display means) such as a liquid crystal display or an organic EL display (Organic Electro-Luminescence Display).
  • the display unit 34 can display images transmitted to and received from an external device through a communication unit 41 described later, images generated by a display image generation unit 54 described later, input operation buttons, through images captured by the first camera 2A and the second camera 2B, and the like. Images include still images and moving images.
  • the imaging lens of the first camera 2A, the imaging lens of the second camera 2B, and the imaging lens of the LiDAR sensor 3 are positioned on the rear surface 1b side.
  • the first camera 2A, the second camera 2B, and the LiDAR sensor 3 are preliminarily calibrated so that the coordinate values of the same recognition object (subject) sensed in the shooting space are the same.
  • RGB information (RGB image data)
  • depth information (depth image data)
  • FIG. 2 is a schematic configuration diagram of the mobile terminal 1.
  • FIG. 3 is a configuration diagram including functional configuration blocks of the mobile terminal 1.
  • As shown in FIG. 2, the mobile terminal 1 includes a sensor unit 10, a communication unit 41, a CPU (Central Processing Unit) 42, a display unit 34, a GNSS reception unit 44, a main memory 45, a flash memory 46, an audio device unit 47, and a battery 48.
  • the sensor unit 10 includes imaging devices such as the first camera 2A, the second camera 2B, and the LiDAR sensor 3 and various sensors such as the touch sensor 43 .
  • the touch sensor 43 is typically arranged on a display panel that constitutes the display section 34 .
  • the touch sensor 43 receives input operations such as settings performed by the user on the display unit 34 .
  • the communication unit 41 is configured to communicate with an external device.
  • the CPU 42 controls the entire mobile terminal 1 by executing an operating system.
  • the CPU 42 also executes various programs read from a removable recording medium and loaded into the main memory 45 or downloaded via the communication section 41 .
  • the GNSS receiver 44 is a Global Navigation Satellite System (GNSS) signal receiver.
  • the GNSS receiver 44 acquires position information of the mobile terminal 1 .
  • the main memory 45 is composed of a RAM (Random Access Memory) and stores programs and data necessary for processing. Flash memory 46 is an auxiliary storage device. Audio device section 47 includes a microphone and a speaker. A battery 48 is a power source for driving the mobile terminal 1 .
  • As shown in FIG. 3, the mobile terminal 1 has a sensor unit 10, a processing unit 50, a storage unit 56, and a display unit 34.
  • In the sensor unit 10 of FIG. 3, only the main sensors related to the present technology are illustrated.
  • the sensing results of the first camera 2A, the second camera 2B, and the LiDAR sensor 3 included in the sensor unit 10 are output to the processing unit 50.
  • the first camera 2A and the second camera 2B have the same configuration.
  • the camera 2 is an RGB camera capable of capturing a color two-dimensional image (also called an RGB image) of a subject as image data.
  • the RGB image is the sensing result of camera 2 .
  • the camera 2 is an image sensor that captures an image of a recognition target (object).
  • the image sensor is, for example, a CCD (Charge-Coupled Device) sensor or a CMOS (Complementary Metal Oxide Semiconductor) sensor.
  • the image sensor has a photodiode, which is a light receiving portion, and a signal processing circuit. In the image sensor, the light received by the light receiving portion is subjected to signal processing by a signal processing circuit, and image data corresponding to the amount of light incident on the light receiving portion is obtained.
  • The LiDAR sensor 3 captures a depth image (also referred to as a distance image) of a recognition target (subject).
  • a depth image is a sensing result of the LiDAR sensor 3 .
  • a depth image is depth information including a depth value of a recognition object.
  • the LiDAR sensor 3 is a ranging sensor that uses remote sensing technology (LiDAR: Light Detection and Ranging) using laser light.
  • LiDAR sensors include a ToF (Time of flight) method and an FMCW (Frequency Modulated Continuous Wave) method, and although either method may be used, the ToF method can be preferably used.
  • Hereinafter, a ToF-type LiDAR sensor is referred to as a ToF sensor.
  • There are two types of ToF sensors, the "direct method" and the "indirect method", and either type of ToF sensor may be used.
  • the "direct method” irradiates a subject with a light pulse that emits light for a short time, and measures the time it takes for the reflected light to reach the ToF sensor.
  • the “indirect method” uses light that blinks periodically and detects the time delay as the phase difference when the light makes a round trip to and from the subject. From the viewpoint of increasing the number of pixels, it is more preferable to use an indirect ToF sensor.
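  • As a point of reference (this relation is not stated in the publication), an indirect ToF sensor typically converts the measured phase difference Δφ of light modulated at frequency f_mod into a one-way distance d = c·Δφ / (4π·f_mod), where c is the speed of light; the factor 4π accounts for both the full modulation period and the round trip of the light.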
  • the LiDAR sensor 3 has a light emitting part, a photodiode as a light receiving part, and a signal processing circuit.
  • the light emitting unit emits laser light, typically near-infrared light (NIR light).
  • the light receiving unit receives return light (reflected light) when the NIR light emitted from the light emitting unit is reflected by a recognition object (object).
  • the received return light is signal-processed by the signal processing circuit, and a depth image corresponding to the subject is acquired.
  • the light emitting unit includes, for example, a light emitting member such as a light emitting diode (LED) and a driver circuit for causing the light emitting member to emit light.
  • When the recognition target is a translucent object, a deviation may occur between the depth value measured by the LiDAR sensor 3 (measured value) and the actual depth value (hereinafter referred to as the actual value).
  • The three-dimensional measurement accuracy of the recognition target deteriorates due to the reflection characteristics of the material of the recognition target and individual differences between sensor devices.
  • In particular, subsurface scattering (also called subcutaneous scattering) occurs in a translucent object; when the object to be recognized is human skin, an error of about 20 mm may occur between the measured value and the actual depth value.
  • Human skin, marble, milk, etc. are known as examples of translucent bodies.
  • a translucent body is an object within which light transmission and scattering occurs.
  • the depth value acquired by the LiDAR sensor 3 is corrected with reference to a correction map, which is depth correction information.
  • the correction map can be generated using sensing results of the first camera 2A, the second camera 2B, and the LiDAR sensor 3, respectively. Details of the correction map will be described later.
  • In the following, an example will be described in which the recognition target is a human hand, whose translucent skin is exposed.
  • the processing unit 50 corrects the depth value acquired by the LiDAR sensor 3 using the correction map.
  • the processing unit 50 may generate a correction map.
  • the processing unit 50 has an acquisition unit 51 , a recognition unit 52 , a correction unit 53 , a display image generation unit 54 and a correction map generation unit 55 .
  • the acquisition unit 51 acquires the sensing results of the first camera 2A, the second camera 2B, and the LiDAR sensor 3, that is, the RGB image and the depth image.
  • the recognition unit 52 detects a hand region from the depth image and the RGB image acquired by the acquisition unit 51 .
  • the recognition unit 52 detects the position of the characteristic point of the hand from the image area obtained by cutting out the detected hand area.
  • Characteristic points of the hand for recognizing the position of the hand include parts of the hand such as fingertips, finger joints, and wrists.
  • the recognition unit 52 detects the two-dimensional feature point positions of the hands from the hand regions of the RGB images respectively acquired by the first camera 2A and the second camera 2B.
  • the detected two-dimensional feature point positions are output to the correction map generator 55 .
  • Hereinafter, the two-dimensional feature point position may be referred to as the "two-dimensional position".
  • the recognition unit 52 also estimates and detects the three-dimensional feature point positions of the hand from the hand region of the depth image acquired by the LiDAR sensor 3 .
  • the three-dimensional feature point positions of the recognition target detected based on the depth image of the LiDAR sensor 3 are output to the correction unit 53 .
  • Hereinafter, the three-dimensional feature point position may be referred to as the "three-dimensional position".
  • the three-dimensional position includes depth value information.
  • the detection of the hand region and the detection of the feature point position can be performed by a known method.
  • For example, the position of the hand in the image can be recognized by hand recognition techniques for the human body based on a DNN (deep neural network), such as Hand Pose Detection, Hand Pose Estimation, and Hand Segmentation; by feature-point extraction methods such as HOG (Histogram of Oriented Gradients) and SIFT (Scale Invariant Feature Transform); by object recognition methods based on pattern recognition such as Boosting and SVM (Support Vector Machine); and by region extraction methods such as Graph Cut.
  • The correction unit 53 corrects the depth value of the recognition target (the hand in this embodiment) detected based on the depth image of the LiDAR sensor 3, with reference to the correction map.
  • By this correction, the depth value is adjusted so that the deviation (error) between the value measured by the LiDAR sensor 3 and the actual value caused by subsurface scattering is eliminated.
  • the correction using the correction map makes it possible to obtain the three-dimensional position information of the actual recognition target from the sensing result of the LiDAR sensor 3, thereby recognizing the recognition target with high accuracy.
  • the depth value of the recognition target object corrected by the correction unit 53 is output to the display image generation unit 54 .
  • the display image generation section 54 generates an image signal to be output to the display section 34 .
  • the image signal is output to the display section 34, and an image is displayed on the display section 34 based on the image signal.
  • the display image generation unit 54 may generate an image in which the virtual object is superimposed on the through image (camera image) acquired by the camera 2 .
  • the virtual object may be a virtual object used when generating a correction map, which will be described later.
  • the virtual object may be, for example, a virtual object forming an augmented reality image by a game application.
  • For example, the display image generation unit 54 can use the corrected depth value of the hand, which is the recognition target, to generate an augmented reality image in which the positional relationship between the hand and a virtual object such as a wall is represented appropriately.
  • This prevents unnatural images in which, for example, the virtual wall object is superimposed on part of the hand so that the hand becomes partially invisible, or the finger appears to be stuck into the wall.
  • the correction map generation unit 55 generates a correction map, which is depth correction information, using the sensing results of the first camera 2A and the second camera 2B and the sensing results of the LiDAR sensor 3 .
  • The correction map generation unit 55 calculates the three-dimensional feature point positions of the recognition target by triangulation, using the two-dimensional feature point positions of the recognition target (hand) detected by the recognition unit 52 from the RGB images of the respective cameras 2. The three-dimensional feature point positions calculated by this triangulation are assumed to correspond to the three-dimensional feature point positions of the actual recognition target and include the actual depth values of the recognition target.
  • The correction map generation unit 55 generates the correction map using difference information between the depth value of the recognition target calculated by triangulation and the depth value of the recognition target based on the depth image of the LiDAR sensor 3 detected by the recognition unit 52. A method of generating the correction map will be described later.
  • The storage unit 56 includes a memory device such as a RAM and a non-volatile recording medium such as a hard disk drive, and stores programs for causing the mobile terminal 1 to execute the recognition processing for the recognition target, the correction map (depth correction information) generation processing, and the like.
  • the recognition processing program for the recognition target object stored in the storage unit 56 is for causing the recognition device (mobile terminal 1 in this embodiment) to execute the following steps.
  • That is, a step of correcting the depth value of the recognition target acquired by the LiDAR sensor of a device (the mobile terminal 1 in this embodiment) provided with a LiDAR sensor and an image sensor, by referring to depth correction information (a correction map) generated using the sensing result of the LiDAR sensor and the sensing result of the image sensor.
  • the correction map (depth correction information) generation processing program stored in the storage unit 56 is for causing the recognition device (mobile terminal 1 in this embodiment) to execute the following steps.
  • The above steps include a step of calculating the three-dimensional position of the recognition target by triangulation from the two-dimensional positions of the recognition target detected from the RGB images of each of the plurality of cameras, a step of detecting the three-dimensional position of the recognition target from the depth image of the LiDAR sensor, and a step of generating the correction map from the difference between the two.
  • the storage unit 56 may store a pre-generated correction map.
  • the correction unit 53 may refer to the correction map prepared in advance to correct the depth value acquired by the LiDAR sensor 3 .
  • FIG. 4 is a flow diagram of a method for recognizing a recognition object. As shown in FIG. 4, when the recognition process starts, the acquisition unit 51 acquires the sensing result (depth image) of the LiDAR sensor 3 (ST1).
  • the hand region is detected by the recognition unit 52 using the depth image acquired by the acquisition unit 51 (ST2).
  • the recognition unit 52 estimates and detects the three-dimensional feature point positions of the hand, which is the object to be recognized, from the depth image (ST3).
  • the detected three-dimensional feature point position information of the recognition target object is output to the correction unit 53 .
  • the correction unit 53 corrects the Z position of the detected three-dimensional feature point position of the recognition object using the correction map (ST4).
  • the corrected 3D feature point positions of the recognition target object correspond to the actual 3D feature point positions of the recognition target object.
  • the corrected three-dimensional feature point position information of the object to be recognized is output to the display image generator 54 (ST5).
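  • As a rough sketch of the flow of FIG. 4 (ST1 to ST5), not taken from the publication, the processing could be organized as follows; the helper objects and function names (recognizer, CorrectionMap.offset_at, and so on) are hypothetical placeholders, and the sign convention for applying the offset is an assumption.

```python
import numpy as np

def recognize_hand(lidar_depth_image, correction_map, recognizer):
    """Sketch of ST1-ST5: depth image in, corrected 3D feature points out."""
    # ST2: detect the hand region in the depth image (detector assumed given).
    hand_region = recognizer.detect_hand_region(lidar_depth_image)
    # ST3: estimate 3D feature point positions (x, y, z) of the hand.
    keypoints_3d = recognizer.estimate_keypoints_3d(hand_region)
    # ST4: correct only the Z (depth) component of each feature point.
    corrected = []
    for x, y, z in keypoints_3d:
        offset = correction_map.offset_at(np.array([x, y, z]))  # interpolated offset
        corrected.append((x, y, z - offset))  # assumes offset = measured - actual
    # ST5: the corrected positions are passed on to the display image generator.
    return corrected
```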
  • According to the recognition method of the present embodiment, even if the recognition target is translucent human skin, the sensing result of the LiDAR sensor 3 is corrected using the correction map, so the recognition accuracy of the recognition target is improved.
  • the correction map is depth correction information for correcting the depth value (Z value) of the recognition target detected by the LiDAR sensor 3 .
  • the measured value of the LiDAR sensor 3 has an error from the actual value due to subsurface scattering on the skin, which is the object to be recognized, and individual differences of the LiDAR sensor 3 .
  • a correction map corrects for this error.
  • As shown in FIG. 5A, a three-dimensional grid 9 is arranged in the real space of the imaging area 8 that can be acquired by the LiDAR sensor 3.
  • The three-dimensional grid 9 is partitioned by a plurality of evenly spaced grid lines parallel to the X axis, a plurality of evenly spaced grid lines parallel to the Y axis, and a plurality of evenly spaced grid lines parallel to the Z axis.
  • FIG. 5B is a schematic diagram of FIG. 5A viewed from the Y-axis direction. In FIGS. 5A and 5B, reference numeral 30 indicates the center of the LiDAR sensor 3.
  • the correction map is a map that holds an offset value related to depth on each grid point of the three-dimensional grid 9 .
  • The "offset value related to depth" is a value indicating how much, in the positive or negative Z-axis direction, the depth value (measured value) obtained by the LiDAR sensor 3 deviates from the actual depth value (actual value).
  • the black circle located on the grid point A indicates the three-dimensional position 13 of the recognition object based on the depth image acquired by the LiDAR sensor 3 .
  • The circle drawn with a white interior (open circle) indicates the three-dimensional position 12 of the actual object to be recognized.
  • the three-dimensional position of the recognition object includes depth value information.
  • reference numeral 13 indicates the position measured by the LiDAR sensor 3
  • reference numeral 12 indicates the actual position.
  • The difference a between the depth value of the three-dimensional position 13 of the recognition target based on the depth image of the LiDAR sensor 3 and the depth value of the actual three-dimensional position 12 of the recognition target is the "offset value related to depth" at the grid point A.
  • In the example of FIG. 5B, the "offset value related to depth" at grid point A is positive.
  • an “offset value related to depth” is set for each grid point of the three-dimensional grid 9 arranged in the imaging region 8 .
  • Hereinafter, the "offset value related to depth" is simply referred to as the "offset value".
  • the three-dimensional position of the object to be recognized acquired by the LiDAR sensor 3 is called “measured position”.
  • the “measured position” is a pre-correction three-dimensional position and includes pre-correction depth value information.
  • an offset value is set for each lattice point of the three-dimensional grid 9.
  • the offset value set at the grid point is used to correct the depth value of the measurement position.
  • an offset value at the measurement position is calculated, and the offset value can be used to correct the depth value of the measurement position.
  • the offset value at the measurement position is calculated as follows.
  • For example, suppose the measurement position lies in an XY plane passing through four grid points formed by two adjacent grid lines extending in the X-axis direction and two adjacent grid lines extending in the Y-axis direction.
  • In this case, the offset value at the measurement position is calculated from the offset values at each of the four grid points, weighted according to the distances in the X-axis and Y-axis directions between the measurement position and those grid points (a weighted average).
  • The case where the measurement position lies within a plane passing through four grid points has been described as an example, but the offset value can be calculated in the same way in other cases. That is, when the measurement position lies inside a minimum-unit cell of the three-dimensional grid 9 partitioned by the grid lines, the offset value at the measurement position can be calculated as a weighted average of the offset values at the eight grid points forming that cell, weighted by the distances between the measurement position and those grid points along each of the X, Y, and Z axes (see the sketch below).
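  • A minimal sketch of this offset interpolation is shown below, written here for illustration only; the regular grid spacing and the NumPy array layout of the offsets are assumptions, not details from the publication.

```python
import numpy as np

def interpolate_offset(offsets, origin, spacing, p):
    """Trilinearly interpolate a depth offset at point p from a regular 3D grid.

    offsets : np.ndarray of shape (Nx, Ny, Nz), offset value stored at each grid point
    origin  : np.ndarray (3,), world coordinates of grid point (0, 0, 0)
    spacing : float, distance between adjacent grid points
    p       : np.ndarray (3,), measurement position (x, y, z)
    """
    # Continuous grid coordinates of p, then the lower corner of the enclosing cell.
    g = (p - origin) / spacing
    i0 = np.clip(np.floor(g).astype(int), 0, np.array(offsets.shape) - 2)
    t = g - i0  # fractional position inside the cell

    # Weighted average over the 8 corners of the cell (trilinear interpolation).
    result = 0.0
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = ((t[0] if dx else 1 - t[0]) *
                     (t[1] if dy else 1 - t[1]) *
                     (t[2] if dz else 1 - t[2]))
                result += w * offsets[i0[0] + dx, i0[1] + dy, i0[2] + dz]
    return result

# Usage (the sign convention is an assumption):
# corrected_z = measured_z - interpolate_offset(offsets, origin, spacing, position)
```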
  • a correction map can be generated using the sensing results of the first camera 2A and the second camera 2B and the sensing results of the LiDAR sensor 3 .
  • the outline of the correction map generation method will be described below with reference to FIGS. 6 and 7.
  • FIG. 6 is a schematic diagram illustrating an example of generating a correction map using the mobile terminal 1 having two cameras and one LiDAR sensor.
  • the correction map is generated in a state in which the hand of the user U, which is the object to be recognized, is positioned within the shooting area of the mobile terminal 1 .
  • a plurality of small white circles superimposed on the hand of the user U indicate the characteristic point positions 6 of the hand of the user U, and indicate joint positions, fingertip positions, wrist positions, and the like.
  • a case of recognizing the fingertip position of the index finger will be described.
  • The white circle with reference numeral 120 indicates the three-dimensional feature point position of the tip of the index finger calculated by triangulation using the two-dimensional feature point positions detected from the RGB images acquired by the first camera 2A and the second camera 2B, respectively.
  • the fingertip position 120 calculated using this triangulation corresponds to the actual fingertip position and includes information on the depth value of the actual recognition object.
  • reference numeral 130 indicates the three-dimensional feature point positions of the tip of the index finger based on the depth image acquired by the LiDAR sensor 3.
  • the fingertip position 130 of the index finger acquired by the LiDAR sensor 3 is deviated from the actual fingertip position 120 of the object to be recognized due to subsurface scattering during measurement by the LiDAR sensor 3 .
  • the difference between the fingertip position 120 calculated using triangulation and the fingertip position 130 of the index finger based on the depth image of the LiDAR sensor 3 is the error component.
  • This error component becomes the "offset value related to depth" in the correction map.
  • A correction map for correcting the measurement error originating from the LiDAR sensor 3 when the recognition target of the mobile terminal 1 is human skin can be generated by acquiring such error component data over the entire imaging area.
  • the three-dimensional feature point positions of the recognition object are detected from the depth image of the LiDAR sensor 3 (ST11).
  • the three-dimensional feature point positions based on this depth image correspond to reference numeral 130 in FIG.
  • two-dimensional feature point positions are detected from the RGB images of the first camera 2A and the second camera 2B (ST12).
  • the three-dimensional feature point positions of the recognition object are calculated by triangulation (ST13).
  • the three-dimensional feature point positions calculated by this triangulation are the actual three-dimensional feature point positions of the recognition object.
  • Three-dimensional feature point positions calculated by triangulation correspond to reference numeral 120 in FIG.
  • The difference between the three-dimensional feature point positions calculated in ST13 based on the RGB images of the plurality of cameras (the first camera 2A and the second camera 2B) and the three-dimensional feature point positions estimated in ST11 from the depth image of the LiDAR sensor 3 is calculated as an error component (ST14).
  • a correction map is generated by acquiring such error component data for the entire imaging area.
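  • For context, the triangulation in ST13 could be computed as in the following sketch (not from the publication); it assumes rectified pinhole cameras with a purely horizontal baseline, which is a simplification of the general calibrated two-camera case.

```python
import numpy as np

def triangulate_rectified(u_left, v_left, u_right, fx, fy, cx, cy, baseline):
    """Triangulate a 3D point from matched pixels of two rectified cameras.

    u_left, v_left : pixel coordinates of the feature in the first camera
    u_right        : horizontal pixel coordinate of the same feature in the second camera
    fx, fy, cx, cy : shared intrinsics of the rectified camera pair
    baseline       : distance between the two camera centers
    """
    disparity = u_left - u_right
    if disparity <= 0:
        raise ValueError("non-positive disparity: invalid feature match")
    z = fx * baseline / disparity   # depth from disparity
    x = (u_left - cx) * z / fx      # back-project to 3D
    y = (v_left - cy) * z / fy
    return np.array([x, y, z])

# The error component of ST14 would then be, for example,
# error = z_from_lidar_depth_image - triangulate_rectified(...)[2]
# (the sign convention is an assumption).
```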
  • the correction map includes difference information between the depth value of the recognition target object based on the sensing result of the LiDAR sensor 3 and the actual depth value of the recognition target object.
  • FIG. 8 is a diagram for explaining a basic image displayed on the display section 34 when the correction map is generated.
  • When the correction map is generated, the display unit 34 of the mobile terminal 1 displays an image in which a target sphere 7, which is a virtual object for correction map generation, is superimposed on the through image acquired by the first camera 2A or the second camera 2B.
  • the virtual object for generating the correction map is not limited to a spherical shape, and can have various shapes.
  • the user U holds the mobile terminal 1 with one hand and positions the other hand within the imaging area so that the other hand is displayed on the display unit 34 .
  • the correction map is generated by the user U viewing the image displayed on the display unit 34 and moving the other hand.
  • the target sphere 7 is displayed so that its position can be changed within the imaging area.
  • the user U moves the other hand so as to chase the target ball 7 according to the movement of the target ball 7 displayed on the display unit 34 . In this way, by moving the hand according to the movement of the target sphere 7, it is possible to obtain error component data in the entire imaging area, and use the data to generate a correction map.
  • FIG. 9 is a diagram for explaining an image displayed on the display unit 34 when the correction map is generated.
  • FIG. 10 is a flowchart relating to display of an image displayed on the display unit 34 when the correction map is generated.
  • the user U holds the mobile terminal 1 with one hand and positions the other hand so as to be within the field of view of the camera 2 .
  • the user U moves the other hand according to the moving direction and size of the target ball displayed on the display unit 34 while looking at the display unit 34 .
  • a correction map is generated based on this hand movement information.
  • An image displayed when the correction map is generated will be described with reference to FIG. 9, following the flow of FIG. 10.
  • a through image captured by the first camera 2A or the second camera 2B is displayed on the display section 34 of the mobile terminal 1 (ST21).
  • the target sphere 7 is superimposed on the through-the-lens image and displayed at the target location (ST22).
  • A user recognition result sphere 11, which indicates the recognized position of the hand of the user U, is superimposed and displayed (ST23).
  • Hereinafter, the "user recognition result sphere" will be referred to as the "user sphere".
  • Both the target sphere 7 and the user sphere 11 are virtual objects.
  • The target sphere 7 and the user sphere 11 are displayed in different colors, for example yellow and blue respectively, so that they can be distinguished from each other.
  • the size of the target sphere 7 does not change and is always displayed at a constant size.
  • the user sphere 11 is displayed at a predetermined position of the recognized user U's hand. For example, in the example shown in FIG. 8, the user sphere 11 is displayed such that the center of the user sphere 11 is positioned near the base of the middle finger.
  • a user sphere 11 indicates a recognition result based on the sensing result of the LiDAR sensor 3 .
  • the user sphere 11 is displayed so as to follow the movement of the hand of the user U within the XY plane. Furthermore, the size of the user sphere 11 changes according to the movement of the hand of the user U in the Z-axis direction. In other words, the size of the user sphere 11 changes according to the position (depth value) of the hand of the user U in the Z-axis direction.
  • the mobile terminal 1 guides the user, for example, by voice, to move the hand so that the user sphere 11 matches the target sphere 7 as shown in FIG. 9(B) (ST24).
  • the match between the target sphere 7 and the user sphere 11 means that the positions and sizes of the two spheres are substantially the same.
  • Guidance for the match between the target sphere 7 and the user sphere 11 may be displayed on the display unit 34 in text as well as voice.
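  • One way to realize the "match" check described above, given purely as an illustration: the publication only says that the positions and sizes are substantially the same, so the tolerances below are assumptions.

```python
import numpy as np

def spheres_match(target_center, target_radius, user_center, user_radius,
                  pos_tol_px=15.0, radius_tol_px=10.0):
    """Return True when the user sphere roughly coincides with the target sphere."""
    close_in_position = (np.linalg.norm(np.asarray(target_center) -
                                        np.asarray(user_center)) <= pos_tol_px)
    close_in_size = abs(target_radius - user_radius) <= radius_tol_px
    return close_in_position and close_in_size
```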
  • the target sphere 7 moves as shown in FIG. 9(D).
  • the portable terminal 1 guides the user U to follow the movement of the target ball 7 by voice or the like.
  • the target sphere 7 moves throughout the imaging area.
  • the correction map generation unit 55 acquires information on the movement of the hand of the user U, which moves so as to follow the target sphere 7 that moves over the entire imaging area. That is, the three-dimensional position information of the object (hand) to be recognized by the LiDAR sensor 3 in the entire imaging area is acquired by the correction map generation unit 55 (ST25).
  • the correction map generation unit 55 obtains the three-dimensional position information of the recognition target object by the LiDAR sensor 3, and in parallel, the three-dimensional position information calculated by triangulation. is also obtained. That is, the correction map generation unit 55 acquires the RGB images of the two cameras 2A and 2B, and uses the two-dimensional position information of the recognition target detected from the RGB images of each camera to perform triangulation to determine the three-dimensional image of the recognition target. The original position is calculated. Three-dimensional position information calculated by this triangulation is also acquired over the entire imaging area.
  • Then, the difference between the three-dimensional position information of the recognition target based on the depth image (sensing result) of the LiDAR sensor 3 and the three-dimensional position information of the recognition target calculated by triangulation from the RGB images (sensing results) of the two cameras 2A and 2B is calculated as the error component.
  • A correction map is generated by the correction map generation unit 55 using the error component data for the entire imaging area. In this way, the user can generate, for each mobile terminal 1, a correction map for correcting the measurement error (ranging error) of the LiDAR sensor 3, and an adjustment suited to the mounted LiDAR sensor 3 becomes possible.
  • The correction map may be generated by the user for each mobile terminal 1 as described above, or may be prepared in advance.
  • Since the type of sensor mounted on each type of device (the mobile terminal in this embodiment) is known in advance, a correction map may be generated and prepared in advance for each device type. The same applies to the second embodiment, which will be described later.
  • <Second embodiment> In the first embodiment, an example of generating a correction map using the sensing results of two cameras and one LiDAR sensor was given, but the present technology is not limited to this. In this embodiment, an example of generating a correction map using the sensing results of one camera and one LiDAR sensor mounted on a device (a mobile terminal in this embodiment) is given.
  • The mobile terminal as the device in this embodiment differs from the mobile terminal in the first embodiment in the number of cameras. While the mobile terminal in the first embodiment is equipped with a compound-eye (multi-view) camera, the mobile terminal in the second embodiment is equipped with a monocular camera. The differences will mainly be described below.
  • In this embodiment, the program for generating the correction map (depth correction information) stored in the storage unit 56 of the mobile terminal 1, which also functions as the recognition device, causes the mobile terminal 1 to execute the following steps.
  • The steps include a step of detecting the two-dimensional position of the recognition target from the RGB image (sensing result) of the one camera, a step of detecting the two-dimensional position of the recognition target from the reliability image (sensing result) of the LiDAR sensor, a step of calculating the three-dimensional position of the recognition target by triangulation from these two-dimensional positions, a step of detecting the three-dimensional position of the recognition target from the depth image of the LiDAR sensor, and a step of generating the correction map from the difference between the two three-dimensional positions.
  • FIG. 11 is a schematic diagram illustrating an example of generating a correction map using the mobile terminal 1.
  • In FIG. 11, a plurality of small white circles superimposed on the hand of the user U indicate the feature point positions 6 of the hand of the user U.
  • FIG. 12 is a flowchart of the correction map generation method according to this embodiment. The image displayed on the display unit when generating the correction map is the same as in the first embodiment.
  • In FIG. 11, reference numeral 121 denotes the three-dimensional fingertip position of the index finger calculated by triangulation using the two-dimensional feature point positions detected from the RGB image of the camera 2 and the two-dimensional feature point positions detected from the reliability image of the LiDAR sensor 3. The fingertip position 121 calculated using this triangulation is assumed to correspond to the actual fingertip position and includes information on the actual depth value of the recognition target.
  • a fingertip position 121 is a three-dimensional feature point position of the recognition object.
  • a reliability image is reliability information that represents the reliability of depth information acquired by the LiDAR sensor 3 for each pixel.
  • the reliability is calculated at the same time when depth information is acquired by the LiDAR sensor 3 .
  • the reliability is calculated using luminance information and contrast information of the image used for depth information calculation.
  • the reliability is determined for each pixel using a real value, and finally a reliability image is generated as a grayscale image in which the reliability is a luminance value.
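  • The publication does not specify how the luminance and contrast are combined inside the sensor; the following is only an illustrative sketch of producing such a grayscale reliability image from a ToF amplitude image, with the choice of statistics and the equal weighting being assumptions.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def reliability_image(ir_amplitude):
    """Illustrative per-pixel confidence from the IR amplitude image of a ToF sensor."""
    amp = ir_amplitude.astype(np.float64)
    brightness = amp / (amp.max() + 1e-9)          # normalized luminance

    # Local contrast: standard deviation in a 5x5 neighborhood.
    mean = uniform_filter(amp, size=5)
    mean_sq = uniform_filter(amp * amp, size=5)
    std = np.sqrt(np.maximum(mean_sq - mean * mean, 0.0))
    contrast = std / (std.max() + 1e-9)

    conf = 0.5 * brightness + 0.5 * contrast       # equal weighting is an assumption
    return (conf * 255).astype(np.uint8)           # grayscale: reliability as luminance
```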
  • reference numeral 131 indicates the three-dimensional feature point positions of the tip of the index finger based on the depth image acquired by the LiDAR sensor 3.
  • the fingertip position 131 of the index finger acquired by the LiDAR sensor 3 is deviated from the actual fingertip position 121 of the object to be recognized due to subsurface scattering during measurement by the LiDAR sensor 3 .
  • the difference between the fingertip position 121 calculated using triangulation and the fingertip position 131 of the index finger based on the depth image of the LiDAR sensor 3 is the error component.
  • This error component becomes the "offset value related to depth" in the correction map.
  • A correction map for correcting the measurement error originating from the LiDAR sensor 3 when the recognition target of the mobile terminal 1 is human skin can be generated by acquiring such error component data over the entire imaging area.
  • the correction map includes difference information between the depth value of the recognition target object based on the sensing result of the LiDAR sensor 3 and the actual depth value of the recognition target object.
  • That is, the correction map generation unit 55 generates the correction map using the difference between the three-dimensional position information of the recognition target based on the depth image (sensing result) of the LiDAR sensor 3 and the three-dimensional position information of the recognition target calculated by triangulation from the RGB image (sensing result) of the one camera 2 and the reliability image (sensing result) of the LiDAR sensor 3.
  • the flow of correction map generation processing in the processing unit 50 will be described below with reference to FIG. 12 .
  • the three-dimensional feature point positions of the recognition object are detected from the depth image of the LiDAR sensor 3 (ST31).
  • the three-dimensional feature point positions based on this depth image correspond to reference numeral 131 in FIG.
  • two-dimensional feature points are detected from the reliability image of the LiDAR sensor 3 (ST32).
  • two-dimensional feature point positions are detected from the RGB image of camera 2 (ST33).
  • Using the two-dimensional feature point positions detected in ST32 and ST33, the three-dimensional feature point positions of the recognition target are calculated by triangulation.
  • the three-dimensional feature point positions calculated using this triangulation correspond to the actual three-dimensional feature point positions of the recognition object.
  • Three-dimensional feature point positions calculated by triangulation correspond to reference numeral 121 in FIG.
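  • A sketch (not from the publication) of one standard way to triangulate the 3D feature point from the 2D detection in the RGB image and the 2D detection in the reliability image, assuming that both views have known 3x4 projection matrices obtained from the calibration mentioned earlier:

```python
import numpy as np

def triangulate_dlt(P_cam, P_lidar, uv_cam, uv_lidar):
    """Linear (DLT) triangulation of one 3D point from two calibrated views.

    P_cam, P_lidar   : 3x4 projection matrices of the RGB camera and of the
                       LiDAR sensor's reliability-image view
    uv_cam, uv_lidar : matched 2D feature point positions (u, v) in each view
    """
    def equations(P, uv):
        u, v = uv
        return np.array([u * P[2] - P[0],
                         v * P[2] - P[1]])

    A = np.vstack([equations(P_cam, uv_cam), equations(P_lidar, uv_lidar)])
    _, _, vt = np.linalg.svd(A)      # least-squares solution of A X = 0
    X = vt[-1]
    return X[:3] / X[3]              # dehomogenize to (x, y, z)
```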
  • As described above, in the present technology, the depth value acquired by the LiDAR sensor of a device equipped with a LiDAR sensor and a camera (image sensor) is corrected by referring to a correction map (depth correction information) generated using the sensing result of the LiDAR sensor and the sensing result of the camera. As a result, the error in the depth value of the LiDAR sensor's sensing result can be corrected in accordance with the individual differences of the LiDAR sensor, and the recognition accuracy of the recognition target can be improved.
  • This technology is particularly preferably applied when the object to be recognized is translucent like human skin.
  • Even when the recognition target is a translucent object, correcting the depth value acquired by the LiDAR sensor using the correction map corrects the deviation (error) between the measured value of the LiDAR sensor and the actual value caused by subsurface scattering in the recognition target and by individual differences between sensor devices.
  • This enables stable and highly accurate measurement of the recognition target object, thereby improving the recognition accuracy of the recognition target object. Therefore, as described above, the present technology can be particularly preferably applied to the recognition of human hands whose skin is frequently exposed.
  • the technology may also be applied to gesture recognition to recognize gesture actions performed by a user.
  • gesture recognition results of hand gestures performed by users can be used to input operations for games and home appliances. Since the present technology enables highly accurate recognition of a recognition target, stable and accurate operation input is possible.
  • Embodiments of the present technology are not limited to the above-described embodiments, and various modifications are possible without departing from the gist of the present technology.
  • For example, an RGB-D camera may be used.
  • In the first embodiment, instead of two cameras and one LiDAR sensor, one camera and one RGB-D camera may be used.
  • In the second embodiment, instead of one camera and one LiDAR sensor, one RGB-D camera may be used.
  • a mobile terminal which is a device equipped with an image sensor and a LiDAR sensor, functions as a recognition device that recognizes a recognition target.
  • the recognition device that recognizes the recognition target object may be an external device different from the device including the image sensor and the LiDAR sensor.
  • part or all of the processing unit 50 shown in FIG. 3 may be configured by an external device such as a server different from the device including the image sensor and the LiDAR sensor.
  • (1) A recognition device including a processing unit that corrects a depth value of a recognition target acquired by a LiDAR (Light Detection and Ranging) sensor of a device including the LiDAR sensor, which has a light emitting unit that irradiates the recognition target with light and a light receiving unit that receives light reflected from the recognition target, and an image sensor that captures the recognition target, by referring to depth correction information generated using a sensing result of the LiDAR sensor and a sensing result of the image sensor.
  • (2) The recognition device according to (1) above, wherein the depth correction information includes difference information between the depth value of the recognition target based on the sensing result of the LiDAR sensor and the actual depth value of the recognition target.
  • (3) The recognition device according to (1) or (2) above, wherein the device includes a plurality of the image sensors and one LiDAR sensor, and the depth correction information includes difference information between the depth value of the recognition target calculated by triangulation using position information of the recognition target detected from the sensing results of each of the plurality of image sensors and the depth value of the recognition target based on a depth image as the sensing result of the LiDAR sensor.
  • (4) The recognition device according to (1) or (2) above, wherein the device includes at least one image sensor and one LiDAR sensor, and the depth correction information includes difference information between the depth value of the recognition target calculated by triangulation using position information of the recognition target detected from the sensing result of one of the image sensors and position information of the recognition target detected from a reliability image as the sensing result of the LiDAR sensor, and the depth value of the recognition target based on a depth image as the sensing result of the LiDAR sensor.
  • (5) The recognition device according to any one of (1) to (4) above, wherein the object to be recognized is a translucent object.
  • (6) The recognition device according to (5) above, wherein the recognition target is human skin.
  • (7) The recognition device according to (6) above, wherein the recognition target is a human hand.
  • (8) The recognition device according to any one of (1) to (7) above, wherein the processing unit recognizes a gesture motion of a person who is the recognition target.
  • (9) The recognition device according to any one of (1) to (8) above, wherein the processing unit generates the depth correction information using the sensing result of the LiDAR sensor and the sensing result of the image sensor.
  • (10) The recognition device according to any one of (1) to (9) above, wherein the device includes a display unit, and the processing unit generates an image to be displayed on the display unit using the corrected depth value of the recognition target.
  • (11) A recognition method including correcting a depth value of a recognition target acquired by a LiDAR (Light Detection and Ranging) sensor of a device including the LiDAR sensor, which has a light emitting unit that irradiates the recognition target with light and a light receiving unit that receives light reflected from the recognition target, and an image sensor that captures the recognition target, by referring to depth correction information generated using a sensing result of the LiDAR sensor and a sensing result of the image sensor.
  • (12) A program that causes a recognition device to execute a step of correcting a depth value of a recognition target acquired by a LiDAR (Light Detection and Ranging) sensor of a device including the LiDAR sensor, which has a light emitting unit that irradiates the recognition target with light and a light receiving unit that receives light reflected from the recognition target, and an image sensor that captures the recognition target, by referring to depth correction information generated using a sensing result of the LiDAR sensor and a sensing result of the image sensor.
  • Reference numerals: 1: mobile terminal (recognition device, device); 2: camera (image sensor); 2A: first camera (image sensor); 2B: second camera (image sensor); 3: LiDAR sensor; 12, 120, 121: actual fingertip position / fingertip position calculated by triangulation (three-dimensional position of the recognition target including the actual depth value); 13, 130, 131: fingertip positions based on the LiDAR sensor's sensing results (three-dimensional positions of the recognition target including depth values based on the LiDAR sensor's sensing results); 34: display unit; 50: processing unit

Abstract

[Problem] To provide a recognition device, a recognition method, and a program that enable improvement in the accuracy of recognizing a recognition subject. [Solution] A recognition device of the present technology is equipped with a processing unit. The processing unit corrects a depth value of a recognition subject acquired with a Light Detection and Ranging (LiDAR) sensor of an apparatus provided with the LiDAR sensor and an image sensor that images the recognition subject, the LiDAR sensor including a light-emitting unit for irradiating the recognition subject with light and a light-receiving unit for receiving light that is reflected back from the recognition subject. The depth value is corrected by consulting depth correction information generated by using a sensing result of the LiDAR sensor and a sensing result of the image sensor.

Description

Recognition device, recognition method, and program
 The present technology relates to a recognition device, a recognition method, and a program related to recognition of a recognition target.
 Patent Document 1 describes providing the user with an image of the user reaching out for a virtual object in an augmented reality image in which the virtual object is superimposed on a camera image.
JP 2020-064592 A
 For example, when generating an image in which the user reaches out toward a virtual object in an augmented reality image on which the virtual object is superimposed, low hand recognition accuracy can result in an unnatural augmented reality image, for example one in which the virtual object is superimposed over the hand so that the hand becomes invisible.
 In view of the circumstances described above, an object of the present technology is to provide a recognition device, a recognition method, and a program capable of improving the recognition accuracy of a recognition target.
 A recognition device according to the present technology includes a processing unit.
 The processing unit corrects a depth value of a recognition target acquired by a LiDAR (Light Detection and Ranging) sensor of a device that includes the LiDAR sensor, which has a light emitting unit that irradiates the recognition target with light and a light receiving unit that receives light reflected from the recognition target, and an image sensor that captures the recognition target, by referring to depth correction information generated using a sensing result of the LiDAR sensor and a sensing result of the image sensor.
 According to such a configuration, it is possible to correct the measurement error derived from the LiDAR sensor and improve the recognition accuracy of the recognition target.
 The depth correction information may include difference information between the depth value of the recognition target based on the sensing result of the LiDAR sensor and the actual depth value of the recognition target.
 The device may include a plurality of the image sensors and one LiDAR sensor, and the depth correction information may include difference information between the depth value of the recognition target calculated by triangulation using position information of the recognition target detected from the sensing results of each of the plurality of image sensors and the depth value of the recognition target based on a depth image as the sensing result of the LiDAR sensor.
The device may include at least one image sensor and one LiDAR sensor, and the depth correction information may include difference information between the depth value of the recognition target calculated by triangulation using position information of the recognition target detected from the sensing result of the one image sensor and position information of the recognition target detected from a reliability image serving as the sensing result of the LiDAR sensor, and the depth value of the recognition target based on a depth image serving as the sensing result of the LiDAR sensor.
The recognition target may be a translucent body.
The recognition target may be human skin.
The recognition target may be a human hand.
The processing unit may recognize a gesture motion of a person who is the recognition target.
The processing unit may generate the depth correction information using the sensing result of the LiDAR sensor and the sensing result of the image sensor.
The device may include a display unit, and the processing unit may generate an image to be displayed on the display unit using the corrected depth value of the recognition target.
A recognition method according to the present technology corrects a depth value of a recognition target acquired by a LiDAR (Light Detection and Ranging) sensor of a device that includes the LiDAR sensor, which has a light emitting unit that irradiates the recognition target with light and a light receiving unit that receives light reflected from the recognition target, and an image sensor that images the recognition target, with reference to depth correction information generated using a sensing result of the LiDAR sensor and a sensing result of the image sensor.
A program according to the present technology causes a recognition device to execute a step of correcting a depth value of a recognition target acquired by a LiDAR (Light Detection and Ranging) sensor of a device that includes the LiDAR sensor, which has a light emitting unit that irradiates the recognition target with light and a light receiving unit that receives light reflected from the recognition target, and an image sensor that images the recognition target, with reference to depth correction information generated using a sensing result of the LiDAR sensor and a sensing result of the image sensor.
[Brief Description of Drawings]
FIG. 1 is an external view of a mobile terminal as a recognition device according to an embodiment of the present technology.
FIG. 2 is a schematic configuration diagram of the mobile terminal.
FIG. 3 is a configuration diagram including functional configuration blocks of the mobile terminal.
FIG. 4 is a flowchart of a method of recognizing a recognition target.
FIG. 5 is a diagram for explaining a correction map.
FIG. 6 is a schematic diagram for explaining a method of generating a correction map according to a first embodiment.
FIG. 7 is a flowchart of the correction map generation method according to the first embodiment.
FIG. 8 is a diagram for explaining basic images displayed on a display unit when the correction map is generated.
FIG. 9 is a diagram for explaining more detailed images displayed on the display unit when the correction map is generated.
FIG. 10 is a flowchart relating to a method of displaying images on the display unit when the correction map is generated.
FIG. 11 is a schematic diagram for explaining a method of generating a correction map according to a second embodiment.
FIG. 12 is a flowchart of the correction map generation method according to the second embodiment.
Hereinafter, embodiments according to the present technology will be described with reference to the drawings. In the following description, the same reference numerals are assigned to similar configurations, and description of configurations that have already been described may be omitted.
<First Embodiment>
[Appearance Configuration of Recognition Device]
FIG. 1 is an external view of a mobile terminal 1 as a recognition device. FIG. 1(A) is a plan view of the mobile terminal 1 as viewed from the front 1a side where a display unit 34 is located, and FIG. 1(B) is a plan view of the mobile terminal 1 as viewed from the back 1b side.
In this specification, the mutually orthogonal XYZ coordinate directions shown in the drawings correspond to the width, length, and height of the substantially rectangular parallelepiped mobile terminal 1. A plane parallel to the front 1a and the back 1b is defined as the XY plane, and the thickness direction of the mobile terminal 1, corresponding to the height direction, is defined as the Z axis. In this specification, the Z-axis direction corresponds to the depth direction.
In this embodiment, the mobile terminal 1 functions as a recognition device that recognizes a recognition target. The mobile terminal 1 is a device that includes a first camera 2A and a second camera 2B, which are image sensors, a LiDAR sensor 3, and the display unit 34. The mobile terminal 1 is a device having a multi-view camera.
As shown in FIGS. 1(A) and 1(B), the mobile terminal 1 includes a housing 4, the display unit 34, the first camera 2A, the second camera 2B, and the LiDAR sensor 3. In the mobile terminal 1, the housing 4 holds a display panel constituting the display unit 34, the first camera 2A, the second camera 2B, the LiDAR sensor 3, other various sensors, a drive circuit, and the like.
The mobile terminal 1 has a front 1a and a back 1b located on the opposite side of the front 1a.
As shown in FIG. 1(A), the display unit 34 is arranged on the front 1a side. The display unit 34 is configured by a display panel (image display means) such as a liquid crystal display or an organic EL (Organic Electro-Luminescence) display. The display unit 34 is configured to be capable of displaying images transmitted to and received from an external device through a communication unit 41 described later, images generated by a display image generation unit 54 described later, buttons for input operations, through images captured by the first camera 2A and the second camera 2B, and the like. The images include still images and moving images.
As shown in FIG. 1(B), the imaging lens of the first camera 2A, the imaging lens of the second camera 2B, and the imaging lens of the LiDAR sensor 3 are located on the back 1b side.
The first camera 2A, the second camera 2B, and the LiDAR sensor 3 are each calibrated in advance so that the same recognition target (subject) sensed in the imaging space has the same coordinate values. Accordingly, by integrating the RGB information (RGB image data) and the depth information (depth image data) sensed by the first camera 2A, the second camera 2B, and the LiDAR sensor 3, it is possible to construct a point cloud (a set of information in which each point has three-dimensional coordinates).
The configurations of the first camera 2A, the second camera 2B, and the LiDAR sensor 3 will be described later.
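As a rough illustration of how calibrated RGB and depth data can be fused into such a point cloud, the following sketch back-projects a depth image into 3D with a pinhole camera model and attaches RGB colors. The function name, the intrinsic parameters (fx, fy, cx, cy), and the assumption that the RGB image is already registered to the depth image are illustrative assumptions, not details taken from this description.

```python
import numpy as np

def depth_to_point_cloud(depth, rgb, fx, fy, cx, cy):
    """Back-project a depth image (in meters) into an N x 6 array of XYZRGB points.

    Assumes the RGB image has already been registered (calibrated) to the
    depth image, as described for the cameras 2A and 2B and the LiDAR sensor 3.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx            # pinhole back-projection along X
    y = (v - cy) * z / fy            # pinhole back-projection along Y
    valid = z > 0                    # drop pixels with no depth return
    points = np.stack([x[valid], y[valid], z[valid]], axis=1)
    colors = rgb[valid].astype(np.float32) / 255.0
    return np.concatenate([points, colors], axis=1)
```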
[Overall Configuration of Recognition Device and Configuration of Each Part]
FIG. 2 is a schematic configuration diagram of the mobile terminal 1. FIG. 3 is a configuration diagram including functional configuration blocks of the mobile terminal 1.
As shown in FIG. 2, the mobile terminal 1 includes a sensor unit 10, a communication unit 41, a CPU (Central Processing Unit) 42, the display unit 34, a GNSS reception unit 44, a main memory 45, a flash memory 46, an audio device unit 47, and a battery 48.
The sensor unit 10 includes imaging devices such as the first camera 2A, the second camera 2B, and the LiDAR sensor 3, and various sensors such as a touch sensor 43. The touch sensor 43 is typically arranged on the display panel constituting the display unit 34. The touch sensor 43 receives input operations, such as settings, performed by the user on the display unit 34.
The communication unit 41 is configured to be capable of communicating with external devices.
The CPU 42 controls the entire mobile terminal 1 by executing an operating system. The CPU 42 also executes various programs read from a removable recording medium and loaded into the main memory 45, or downloaded via the communication unit 41.
The GNSS reception unit 44 is a Global Navigation Satellite System (GNSS) signal receiver. The GNSS reception unit 44 acquires position information of the mobile terminal 1.
The main memory 45 is configured by a RAM (Random Access Memory) and stores programs and data necessary for processing.
The flash memory 46 is an auxiliary storage device.
The audio device unit 47 includes a microphone and a speaker.
The battery 48 is a power source for driving the mobile terminal 1.
As shown in FIG. 3, the mobile terminal 1 includes the sensor unit 10, a processing unit 50, a storage unit 56, and the display unit 34. In the sensor unit 10 of FIG. 3, only the main sensors related to the present technology are illustrated.
The sensing results of the first camera 2A, the second camera 2B, and the LiDAR sensor 3 included in the sensor unit 10 are output to the processing unit 50.
(Camera)
The first camera 2A and the second camera 2B have the same configuration. Hereinafter, when there is no particular need to distinguish between the first camera 2A and the second camera 2B, they are referred to as the camera 2.
The camera 2 is an RGB camera capable of capturing a color two-dimensional image (also referred to as an RGB image) of a subject as image data. The RGB image is a sensing result of the camera 2.
The camera 2 is an image sensor that images the recognition target (subject). The image sensor is, for example, a CCD (Charge-Coupled Device) sensor or a CMOS (Complementary Metal Oxide Semiconductor) sensor. The image sensor has a photodiode serving as a light receiving unit and a signal processing circuit. In the image sensor, the light received by the light receiving unit is processed by the signal processing circuit, and image data corresponding to the amount of light incident on the light receiving unit is acquired.
(LiDAR sensor)
The LiDAR sensor 3 captures a depth image (also referred to as a distance image) of the recognition target (subject). The depth image is a sensing result of the LiDAR sensor 3. The depth image is depth information including the depth value of the recognition target.
The LiDAR sensor 3 is a distance-measuring sensor that uses a remote sensing technology (LiDAR: Light Detection and Ranging) using laser light.
LiDAR sensors include a ToF (Time of Flight) type and an FMCW (Frequency Modulated Continuous Wave) type, and either type may be used, but the ToF type can be suitably used. In this embodiment, an example using a ToF-type LiDAR sensor (hereinafter referred to as a ToF sensor) is given.
ToF sensors include a "direct" type and an "indirect" type, and either type of ToF sensor may be used. The direct type irradiates the subject with a short light pulse and directly measures the time it takes for the reflected light to reach the ToF sensor. The indirect type uses periodically modulated light and detects the time delay of the light making a round trip to and from the subject as a phase difference. From the viewpoint of increasing the number of pixels, it is more preferable to use an indirect ToF sensor.
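For reference, the indirect ToF principle mentioned above recovers depth from the phase difference of the modulated light. The following is the generic textbook relation, not a formula given in this description, where c is the speed of light, f_m is the modulation frequency, and Δφ is the measured phase difference:

```latex
d \;=\; \frac{c\,\Delta\varphi}{4\pi f_{\mathrm{m}}},
\qquad 0 \le \Delta\varphi < 2\pi,
\qquad d_{\max} \;=\; \frac{c}{2 f_{\mathrm{m}}}
```

Here d_max is the unambiguous measurement range; any systematic bias in the measured phase (for example, the extra delay caused by subsurface scattering described below) appears directly as a depth error.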
The LiDAR sensor 3 has a light emitting unit, a photodiode serving as a light receiving unit, and a signal processing circuit. The light emitting unit emits laser light, typically near-infrared (NIR) light. The light receiving unit receives return light (reflected light) produced when the NIR light emitted from the light emitting unit is reflected by the recognition target (subject). In the LiDAR sensor 3, the received return light is processed by the signal processing circuit, and a depth image corresponding to the subject is acquired. The light emitting unit includes, for example, a light emitting member such as a light emitting diode (LED) and a driver circuit for causing the light emitting member to emit light.
Here, when depth information of a recognition target (subject) is obtained using a LiDAR sensor and the recognition target is a translucent body, an error (distance measurement error) arises between the measured value and the actual value (hereinafter referred to as the actual value) due to subsurface scattering at the recognition target and individual differences among sensor devices. In other words, there has been a problem that the three-dimensional measurement accuracy of the recognition target deteriorates depending on the reflection characteristics of the material of the recognition target and individual differences among sensor devices.
In a LiDAR sensor, when a translucent body such as human skin is the recognition target, the light emitted from the light emitting unit takes extra time to be reflected by the recognition target and return, owing to subsurface scattering (also referred to as subcutaneous scattering). For this reason, the LiDAR sensor measures a depth value slightly deeper than the actual value. For example, when the recognition target is human skin, an error of about 20 mm may occur between the measured depth value and the actual depth value.
Known examples of translucent bodies include human skin, marble, and milk. A translucent body is an object within which light transmission and scattering occur.
In contrast, in the present technology, the depth value acquired by the LiDAR sensor 3 is corrected with reference to a correction map, which is depth correction information. This makes the three-dimensional measurement of the recognition target highly accurate and improves the recognition accuracy of the recognition target.
In this embodiment, the correction map can be generated using the respective sensing results of the first camera 2A, the second camera 2B, and the LiDAR sensor 3. Details of the correction map will be described later.
Hereinafter, description will be given using an example in which the recognition target is a human hand with exposed skin, which is a translucent body, and the hand is recognized.
(Processing unit)
The processing unit 50 corrects the depth value acquired by the LiDAR sensor 3 using the correction map.
The processing unit 50 may generate the correction map.
The processing unit 50 includes an acquisition unit 51, a recognition unit 52, a correction unit 53, the display image generation unit 54, and a correction map generation unit 55.
((Acquisition unit))
The acquisition unit 51 acquires the sensing results of the first camera 2A, the second camera 2B, and the LiDAR sensor 3, that is, the RGB images and the depth image.
((Recognition unit))
The recognition unit 52 detects a hand region from the depth image and the RGB images acquired by the acquisition unit 51. The recognition unit 52 detects feature point positions of the hand from an image region obtained by cutting out the detected hand region. Feature points of the hand for recognizing the position of the hand include the fingertips, finger joints, the wrist, and the like. The fingertips, finger joints, and wrist are parts constituting the hand.
More specifically, the recognition unit 52 detects two-dimensional feature point positions of the hand from the hand regions of the RGB images acquired by the first camera 2A and the second camera 2B. The detected two-dimensional feature point positions are output to the correction map generation unit 55. Hereinafter, a "two-dimensional feature point position" may be referred to as a "two-dimensional position".
The recognition unit 52 also estimates and detects three-dimensional feature point positions of the hand from the hand region of the depth image acquired by the LiDAR sensor 3. The three-dimensional feature point positions of the recognition target detected based on the depth image of the LiDAR sensor 3 are output to the correction unit 53. Hereinafter, a "three-dimensional feature point position" may be referred to as a "three-dimensional position". The three-dimensional position includes depth value information.
The detection of the hand region and the detection of the feature point positions can be performed by known methods. For example, the position of the hand in an image can be recognized by human hand recognition techniques such as deep neural networks (DNN), hand pose detection, hand pose estimation, and hand segmentation; feature point extraction methods such as HOG (Histogram of Oriented Gradients) and SIFT (Scale Invariant Feature Transform); subject recognition methods based on pattern recognition such as boosting and SVM (Support Vector Machine); and region extraction methods such as graph cut.
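As a minimal sketch of this two-stage recognition (hand-region detection followed by keypoint detection on the cropped region), assuming hypothetical `region_detector` and `keypoint_model` callables as stand-ins for any of the known methods listed above:

```python
import numpy as np

def recognize_hand_keypoints(rgb_image, region_detector, keypoint_model):
    """Detect a hand region, crop it, and return 2D feature point positions.

    `region_detector` and `keypoint_model` are placeholders for any of the
    known methods mentioned above (DNN-based detectors, HOG/SIFT features
    with an SVM classifier, etc.); they are not specified by this description.
    """
    box = region_detector(rgb_image)            # (x, y, w, h) of the detected hand region
    if box is None:
        return None                             # no hand in the image
    x, y, w, h = box
    crop = rgb_image[y:y + h, x:x + w]
    keypoints = keypoint_model(crop)            # (N, 2) positions in crop coordinates
    return keypoints + np.array([x, y])         # convert back to full-image coordinates
```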
((Correction unit))
When the recognition unit 52 recognizes that the region of the recognition target is human skin such as a hand, the correction unit 53 corrects the depth values (positions in the Z-axis direction) of the three-dimensional feature point positions of the recognition target (the hand in this embodiment) detected based on the depth image of the LiDAR sensor 3, with reference to the correction map.
As a result, even when the recognition target is a translucent body such as human skin, the depth values are corrected so that the deviation (error) between the values measured by the LiDAR sensor 3 and the actual values caused by subsurface scattering is eliminated.
That is, through the correction using the correction map, the actual three-dimensional position information of the recognition target can be obtained from the sensing result of the LiDAR sensor 3, and the recognition target can be recognized with high accuracy.
The depth values of the recognition target corrected by the correction unit 53 are output to the display image generation unit 54.
((Display image generation unit))
The display image generation unit 54 generates an image signal to be output to the display unit 34. The image signal is output to the display unit 34, and the display unit 34 displays an image based on the image signal.
The display image generation unit 54 may generate an image in which a virtual object is superimposed on a through image (camera image) acquired by the camera 2. The virtual object may be a virtual object used when generating the correction map, which will be described later. The virtual object may also be, for example, a virtual object constituting an augmented reality image of a game application.
Here, consider an example in which an image of the user touching a wall, which is a virtual object, with a hand is displayed on the display unit 34 for an augmented reality image in which the virtual wall object is superimposed on a camera image.
In generating the display image, the display image generation unit 54 can use the corrected depth values of the hand, which is the recognition target, to generate an augmented reality image in which the positional relationship between the hand and the virtual wall object is appropriate.
Accordingly, for example, where an image of the hand touching the surface of the virtual wall object should be displayed, it does not happen that the virtual wall object is superimposed on part of the hand so that part of the hand becomes invisible and the image looks as if the finger were stuck into the wall.
((Correction map generation unit))
The correction map generation unit 55 generates the correction map, which is depth correction information, using the sensing results of the first camera 2A and the second camera 2B and the sensing result of the LiDAR sensor 3.
More specifically, the correction map generation unit 55 calculates three-dimensional feature point positions of the recognition target (hand) by triangulation, using the two-dimensional feature point positions of the recognition target detected by the recognition unit 52 from the RGB images of the respective cameras 2. The three-dimensional feature point positions of the recognition target calculated using this triangulation are regarded as corresponding to the actual three-dimensional feature point positions of the recognition target and as including the actual depth values of the recognition target.
The correction map generation unit 55 generates the correction map using difference information between the depth values of the recognition target calculated by triangulation and the depth values of the recognition target based on the depth image of the LiDAR sensor 3 detected by the recognition unit 52.
A method of generating the correction map will be described later.
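As an illustrative sketch of the two-view triangulation referred to above, the following uses the standard linear (DLT) method under assumed 3x4 projection matrices for the two calibrated RGB cameras; this is a generic technique, not code from this description:

```python
import numpy as np

def triangulate_point(p1, p2, P1, P2):
    """Linear (DLT) triangulation of one feature point from two camera views.

    p1, p2 : (u, v) pixel positions of the same hand feature point detected in
             the first and second camera images.
    P1, P2 : assumed 3x4 projection matrices of the calibrated cameras 2A, 2B.
    Returns the 3D position (X, Y, Z); its Z component is the depth value used
    as the reference ("actual") depth when building the correction map.
    """
    A = np.stack([
        p1[0] * P1[2] - P1[0],
        p1[1] * P1[2] - P1[1],
        p2[0] * P2[2] - P2[0],
        p2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]          # dehomogenize the least-squares solution
```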
(Storage unit)
The storage unit 56 includes a memory device such as a RAM and a non-volatile recording medium such as a hard disk drive, and stores programs for causing the mobile terminal 1 to execute the recognition processing of the recognition target, the generation processing of the correction map (depth correction information), and the like.
The program for the recognition processing of the recognition target stored in the storage unit 56 is for causing the recognition device (the mobile terminal 1 in this embodiment) to execute the following step.
The step is a step of correcting the depth value of the recognition target acquired by the LiDAR sensor of a device including the LiDAR sensor and an image sensor (the mobile terminal 1 in this embodiment), with reference to the depth correction information (correction map) generated using the sensing result of the LiDAR sensor and the sensing result of the image sensor.
The program for the generation processing of the correction map (depth correction information) stored in the storage unit 56 is for causing the recognition device (the mobile terminal 1 in this embodiment) to execute the following steps.
The steps are: a step of calculating the three-dimensional position of the recognition target by triangulation from the two-dimensional positions of the recognition target detected from the RGB images of the respective cameras; a step of detecting the three-dimensional position of the recognition target from the depth image of the LiDAR sensor; and a step of generating the correction map (depth correction information) using difference information between the three-dimensional position of the recognition target calculated by triangulation and the three-dimensional position of the recognition target based on the depth image of the LiDAR sensor.
The storage unit 56 may also store a correction map generated in advance. The correction unit 53 may correct the depth value acquired by the LiDAR sensor 3 with reference to this correction map prepared in advance.
[Recognition Method]
FIG. 4 is a flowchart of the method of recognizing the recognition target.
As shown in FIG. 4, when the recognition processing starts, the acquisition unit 51 acquires the sensing result (depth image) of the LiDAR sensor 3 (ST1).
Next, the recognition unit 52 detects the hand region using the depth image acquired by the acquisition unit 51 (ST2).
The recognition unit 52 estimates and detects the three-dimensional feature point positions of the hand, which is the recognition target, from the depth image (ST3). The detected three-dimensional feature point position information of the recognition target is output to the correction unit 53.
Next, the correction unit 53 corrects the Z positions of the detected three-dimensional feature point positions of the recognition target using the correction map (ST4). The corrected three-dimensional feature point positions of the recognition target correspond to the actual three-dimensional feature point positions of the recognition target.
The corrected three-dimensional feature point position information of the recognition target is output to the display image generation unit 54 (ST5).
As described above, in the recognition method of this embodiment, even when the recognition target is human skin, which is a translucent body, correcting the sensing result of the LiDAR sensor 3 using the correction map improves the recognition accuracy of the recognition target.
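The flow of ST1 to ST5 can be summarized by the following sketch. The callables passed as arguments are hypothetical stand-ins for the acquisition unit 51, recognition unit 52, and correction unit 53; how the offset is looked up in the correction map is detailed in the correction map sections below.

```python
def recognize_hand(depth_image, detect_hand_region, estimate_3d_keypoints,
                   lookup_offset, correction_map):
    """Sketch of the recognition flow ST1-ST5 with hypothetical helper callables.

    depth_image is the LiDAR sensing result acquired in ST1; lookup_offset
    returns the depth offset (measured depth minus actual depth) stored in
    the correction map at a given 3D position.
    """
    hand_region = detect_hand_region(depth_image)                    # ST2: detect hand region
    keypoints_3d = estimate_3d_keypoints(depth_image, hand_region)   # ST3: 3D feature point positions
    corrected = []
    for (x, y, z) in keypoints_3d:
        offset = lookup_offset(correction_map, x, y, z)
        corrected.append((x, y, z - offset))                         # ST4: correct the Z position
    return corrected                                                  # ST5: to display image generation unit 54
```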
[Correction Map]
The correction map is depth correction information for correcting the depth value (Z value) of the recognition target detected by the LiDAR sensor 3. The value measured by the LiDAR sensor 3 has an error from the actual value due to subsurface scattering at the skin, which is the recognition target, and individual differences of the LiDAR sensor 3. The correction map corrects this error.
The correction map will be described with reference to FIG. 5.
As shown in FIG. 5(A), a three-dimensional grid 9 is arranged in the real space of an imaging region 8 that can be acquired by the LiDAR sensor 3. The three-dimensional grid 9 is divided by a plurality of uniformly spaced grid lines parallel to the X axis, a plurality of uniformly spaced grid lines parallel to the Y axis, and a plurality of uniformly spaced grid lines parallel to the Z axis.
FIG. 5(B) is a schematic diagram of FIG. 5(A) as viewed from the Y-axis direction.
In FIGS. 5(A) and 5(B), reference numeral 30 indicates the center of the LiDAR sensor 3.
The correction map is a map that holds an offset value relating to depth at each grid point of the three-dimensional grid 9. The "offset value relating to depth" is a value indicating how much the depth value (measured value) acquired by the LiDAR sensor 3 deviates, in the plus or minus Z-axis direction, from the actual depth value (actual value).
The "offset value relating to depth" will now be described.
In the example shown in FIG. 5(B), the filled black circle located on a grid point A indicates a three-dimensional position 13 of the recognition target based on the depth image acquired by the LiDAR sensor 3. The white circle indicates an actual three-dimensional position 12 of the recognition target. The three-dimensional position of the recognition target includes depth value information. In other words, reference numeral 13 indicates the position measured by the LiDAR sensor 3, and reference numeral 12 indicates the actual position.
The difference a between the depth value of the three-dimensional position 13 of the recognition target based on the depth image of the LiDAR sensor 3 and the depth value of the actual three-dimensional position 12 of the recognition target is the "offset value relating to depth" at the grid point A. In the example shown in FIG. 5(B), the "offset value relating to depth" at the grid point A is positive.
In the correction map, an "offset value relating to depth" is set for every grid point of the three-dimensional grid 9 arranged in the imaging region 8.
By correcting the depth value of the recognition target acquired by the LiDAR sensor 3 with reference to such a correction map, the three-dimensional measurement of the recognition target is made highly accurate, and the recognition accuracy of the recognition target can be improved.
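A minimal sketch of such a correction map as a data structure, assuming a uniformly spaced grid over the imaging region; the class name, field names, and grid parameters are illustrative assumptions:

```python
import numpy as np

class CorrectionMap:
    """Holds a depth offset (measured depth minus actual depth) at each grid
    point of a uniform 3D grid spanning the imaging region 8."""

    def __init__(self, origin, spacing, shape):
        self.origin = np.asarray(origin, dtype=float)     # XYZ of grid point (0, 0, 0)
        self.spacing = np.asarray(spacing, dtype=float)   # grid pitch along X, Y, Z
        self.offsets = np.zeros(shape, dtype=float)       # one offset value per grid point

    def set_offset(self, index, value):
        self.offsets[tuple(index)] = value                # index = (ix, iy, iz)
```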
[Correction Method Using the Correction Map]
A method of correcting the depth value using the above-described correction map will be described. Hereinafter, the "offset value relating to depth" is simply referred to as the "offset value". The three-dimensional position of the recognition target acquired by the LiDAR sensor 3 is referred to as the "measurement position". The "measurement position" is the three-dimensional position before correction and includes the depth value information before correction.
As described above, in the correction map, an offset value is set for each grid point of the three-dimensional grid 9. When the measurement position is on a grid point, the depth value of the measurement position is corrected using the offset value set at that grid point.
On the other hand, when the measurement position is not on a grid point, an offset value at the measurement position can be calculated using, for example, bilinear interpolation, and the depth value of the measurement position can be corrected using that offset value.
In the bilinear interpolation processing, the offset value at the measurement position is calculated, for example, as follows.
Consider, as an example, a case where the measurement position lies in an XY plane passing through four grid points formed by the intersections of two adjacent grid lines extending in the X-axis direction and two adjacent grid lines extending in the Y-axis direction.
The offset value at the measurement position is calculated using the offset values at the four grid points, a weighting factor based on the ratio of the distances in the X-axis direction between the measurement position and the two grid points adjacent in the X-axis direction, and a weighting factor based on the ratio of the distances in the Y-axis direction between the measurement position and the two grid points adjacent in the Y-axis direction. That is, the offset value at the measurement position is calculated from a weighted average of the offset values at the four grid points according to the distances between the four grid points and the measurement position in the X-axis and Y-axis directions.
Here, for convenience, the case where the measurement position lies in a plane passing through four grid points has been described as an example; when the measurement position does not lie in such a plane, the offset value at the measurement position can be calculated as follows.
That is, in the three-dimensional grid 9, when the measurement position is inside a minimum-unit three-dimensional cell partitioned by the grid lines, the offset value at the measurement position can be calculated from a weighted average of the offset values at the eight grid points constituting that minimum cell according to the distances between the eight grid points and the measurement position in the X-, Y-, and Z-axis directions.
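A sketch of this eight-grid-point weighted average (trilinear interpolation), written against the CorrectionMap sketch above; normalizing the distances by the grid spacing and clamping at the grid boundary are implementation assumptions:

```python
import numpy as np

def offset_at(correction_map, position):
    """Interpolate the depth offset at an arbitrary measurement position using
    the eight grid points of the enclosing cell, weighted by the distances
    along the X, Y, and Z axes as described above."""
    g = (np.asarray(position, dtype=float) - correction_map.origin) / correction_map.spacing
    i0 = np.clip(np.floor(g).astype(int), 0, np.array(correction_map.offsets.shape) - 2)
    t = g - i0                                   # normalized distances within the cell, in [0, 1]
    value = 0.0
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = ((t[0] if dx else 1.0 - t[0]) *
                     (t[1] if dy else 1.0 - t[1]) *
                     (t[2] if dz else 1.0 - t[2]))
                value += w * correction_map.offsets[i0[0] + dx, i0[1] + dy, i0[2] + dz]
    return value

# The corrected depth is then: corrected_z = measured_z - offset_at(correction_map, (x, y, measured_z))
```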
[Correction Map Generation Method]
(Outline of the Correction Map Generation Method)
The correction map can be generated using the sensing results of the first camera 2A and the second camera 2B and the sensing result of the LiDAR sensor 3. An outline of the correction map generation method will be described below with reference to FIGS. 6 and 7.
FIG. 6 is a schematic diagram for explaining an example of generating the correction map using the mobile terminal 1, which includes two cameras and one LiDAR sensor. The correction map is generated in a state where the hand of a user U, which is the recognition target, is positioned within the imaging region of the mobile terminal 1.
In FIG. 6, the plurality of small white circles shown overlapping the hand of the user U indicate feature point positions 6 of the hand of the user U, such as joint positions, fingertip positions, and the wrist position.
Here, a case of recognizing the fingertip position of the index finger will be described.
In FIG. 6, the white circle denoted by reference numeral 120 indicates the three-dimensional feature point position of the tip of the index finger calculated by triangulation using the two-dimensional feature point positions detected from the RGB images acquired by the first camera 2A and the second camera 2B. The fingertip position 120 calculated using this triangulation is regarded as corresponding to the actual fingertip position and as including the actual depth value of the recognition target.
In FIG. 6, reference numeral 130 indicates the three-dimensional feature point position of the tip of the index finger based on the depth image acquired by the LiDAR sensor 3. The fingertip position 130 acquired by the LiDAR sensor 3 deviates in depth value from the actual fingertip position 120 of the recognition target due to subsurface scattering at the time of measurement by the LiDAR sensor 3.
The difference between the fingertip position 120 calculated using triangulation and the fingertip position 130 of the index finger based on the depth image of the LiDAR sensor 3 is the error component. This error component is the "offset value relating to depth" in the correction map.
By acquiring such error component data over the entire imaging region, a correction map for correcting the measurement error originating from the LiDAR sensor 3 when the recognition target of the mobile terminal 1 is human skin can be generated.
The flow of the correction map generation processing in the processing unit 50 will be described with reference to FIG. 7.
As shown in FIG. 7, the three-dimensional feature point positions of the recognition target are detected from the depth image of the LiDAR sensor 3 (ST11). The three-dimensional feature point positions based on this depth image correspond to reference numeral 130 in FIG. 6.
In addition, two-dimensional feature point positions are detected from the RGB images of the first camera 2A and the second camera 2B (ST12). Using the detected two-dimensional feature point positions, the three-dimensional feature point positions of the recognition target are calculated by triangulation (ST13). The three-dimensional feature point positions calculated by this triangulation are regarded as the actual three-dimensional feature point positions of the recognition target and correspond to reference numeral 120 in FIG. 6.
Next, the difference of the three-dimensional feature point positions based on the depth image of the LiDAR sensor 3 estimated in ST11 with respect to the three-dimensional feature point positions calculated in ST13 from the RGB images of the plurality of cameras (the first camera 2A and the second camera 2B) is calculated as an error component (ST14).
A correction map is generated by acquiring such error component data over the entire imaging region.
In this way, the correction map includes difference information between the depth value of the recognition target based on the sensing result of the LiDAR sensor 3 and the actual depth value of the recognition target.
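As an illustrative sketch of ST11 to ST14, combining the triangulation and CorrectionMap sketches above: each captured frame yields a pair of 3D positions for the same feature point, and the depth error component is stored in the map. Assigning each sample to its nearest grid point and averaging repeated samples are assumptions made for illustration; the description only requires that error data be collected over the entire imaging region.

```python
import numpy as np
from collections import defaultdict

def build_correction_map(samples, correction_map):
    """Fill a CorrectionMap from (p_lidar, p_tri) sample pairs.

    samples : iterable of (p_lidar, p_tri), both 3D positions of the same hand
              feature point; p_lidar comes from the LiDAR depth image (ST11),
              p_tri from two-camera triangulation (ST12-ST13).
    """
    buckets = defaultdict(list)
    for p_lidar, p_tri in samples:
        error = p_lidar[2] - p_tri[2]                       # ST14: depth error component
        g = (np.asarray(p_lidar, dtype=float) - correction_map.origin) / correction_map.spacing
        index = np.round(g).astype(int)                     # nearest grid point (assumption)
        if np.any(index < 0) or np.any(index >= correction_map.offsets.shape):
            continue                                        # ignore samples outside the grid
        buckets[tuple(index)].append(error)
    for index, errors in buckets.items():
        correction_map.set_offset(index, float(np.mean(errors)))   # average repeated samples
    return correction_map
```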
FIG. 8 is a diagram for explaining basic images displayed on the display unit 34 when the correction map is generated.
When the correction map is generated, as shown in FIGS. 8(A) and 8(B), the display unit 34 of the mobile terminal 1 displays an image in which a target sphere 7, which is a virtual object for correction map generation, is superimposed on the through image acquired by the first camera 2A or the second camera 2B. Note that the virtual object for correction map generation is not limited to a spherical shape and can have various shapes.
For example, the user U holds the mobile terminal 1 with one hand and positions the other hand within the imaging region so that the other hand appears on the display unit 34. The correction map is generated by the user U viewing the image displayed on the display unit 34 and moving the other hand.
The target sphere 7 is displayed so that its position can change within the imaging region. The user U moves the other hand so as to chase the target sphere 7 in accordance with the movement of the target sphere 7 displayed on the display unit 34. By moving the hand in accordance with the movement of the target sphere 7 in this way, error component data over the entire imaging region can be acquired, and the correction map can be generated using the data.
A more specific method of generating the correction map will be described below.
(Specific Example of the Correction Map Generation Method)
A more specific method of generating the correction map will be described with reference to FIGS. 9 and 10.
FIG. 9 is a diagram for explaining images displayed on the display unit 34 when the correction map is generated.
FIG. 10 is a flowchart relating to the display of images on the display unit 34 when the correction map is generated.
As described above, during the correction map generation processing, the user U holds the mobile terminal 1 with one hand and positions the other hand so as to be within the field of view of the cameras 2.
While looking at the display unit 34, the user U moves the other hand in accordance with the movement direction and size of the target sphere displayed on the display unit 34. The correction map is generated based on this hand movement information.
Images displayed when the correction map is generated will be described with reference to FIG. 9, following the flow of FIG. 10.
When the correction map generation processing starts, as shown in FIG. 9(A), a through image captured by the first camera 2A or the second camera 2B is displayed on the display unit 34 of the mobile terminal 1 (ST21). Furthermore, as shown in FIG. 9(A), the target sphere 7 is displayed at a target location superimposed on the through image (ST22), and a user recognition result sphere 11 is displayed as the recognition result of the hand of the user U chasing the target sphere 7 (ST23). Hereinafter, the "user recognition result sphere" is referred to as the "user sphere".
The target sphere 7 and the user sphere 11 are both virtual objects. They are displayed in mutually different colors, for example, yellow for the target sphere 7 and blue for the user sphere 11, so that they can be distinguished from each other.
The size of the target sphere 7 does not change and is always displayed at a constant size.
The user sphere 11 is displayed at a predetermined position on the recognized hand of the user U. For example, in the example shown in FIG. 8, the user sphere 11 is displayed so that its center is positioned near the base of the middle finger. The user sphere 11 indicates the recognition result based on the sensing result of the LiDAR sensor 3. In the image displayed on the display unit 34, the user sphere 11 is displayed so as to follow the movement of the hand of the user U in the XY plane. Furthermore, the size of the user sphere 11 changes according to the movement of the hand of the user U in the Z-axis direction. In other words, the size of the user sphere 11 changes according to the position (depth value) of the hand of the user U in the Z-axis direction.
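As a small illustration of how the displayed size of the user sphere 11 can encode the hand's depth, the following sketch assumes a simple inverse-proportional (pinhole-style) scaling; the reference radius, reference depth, and clamping limits are assumptions, not values from this description.

```python
def user_sphere_radius_px(hand_depth_m, reference_depth_m=0.5,
                          radius_at_reference_px=60, min_px=10, max_px=200):
    """Return the on-screen radius of the user sphere 11 for a given hand depth.

    The sphere is drawn larger as the hand moves closer to the terminal and
    smaller as it moves away (inverse proportionality to the depth value).
    """
    radius = radius_at_reference_px * reference_depth_m / max(hand_depth_m, 1e-3)
    return int(min(max(radius, min_px), max_px))
```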
The mobile terminal 1 guides the user, for example, by voice, to move the hand so that the user sphere 11 coincides with the target sphere 7, as shown in FIG. 9(B) (ST24). Here, the target sphere 7 and the user sphere 11 coinciding means that their positions and sphere sizes become substantially the same. The guidance for making the target sphere 7 and the user sphere 11 coincide may also be displayed as text on the display unit 34 in addition to the voice.
Next, as shown in FIG. 9(C), when the coincidence of the target sphere 7 and the user sphere 11 is recognized, the target sphere 7 moves as shown in FIG. 9(D). The mobile terminal 1 guides the user U, by voice or the like, to make the hand of the user U follow the movement of the target sphere 7. The target sphere 7 moves throughout the entire imaging region.
The correction map generation unit 55 acquires movement information of the hand of the user U moving so as to chase the target sphere 7, which moves throughout the entire imaging region. That is, the correction map generation unit 55 acquires the three-dimensional position information of the recognition target (hand) obtained by the LiDAR sensor 3 over the entire imaging region (ST25).
Furthermore, in the correction map generation processing of ST11 to ST15 described above, the correction map generation unit 55 also acquires the three-dimensional position information calculated by triangulation in parallel with the acquisition of the three-dimensional position information of the recognition target by the LiDAR sensor 3.
That is, the correction map generation unit 55 acquires the RGB images of the two cameras 2A and 2B and calculates the three-dimensional position of the recognition target by triangulation using the two-dimensional position information of the recognition target detected from the RGB image of each camera. The three-dimensional position information calculated by this triangulation is also acquired over the entire imaging region.
Then, as described using the flowchart of FIG. 7, the error between the three-dimensional position information of the recognition target based on the depth image (sensing result) of the LiDAR sensor 3 and the three-dimensional position information based on the RGB images (sensing results) of the two cameras 2A and 2B is calculated. The correction map generation unit 55 generates the correction map using the error component data over the entire imaging region.
In this way, the user can generate a correction map for correcting the measurement error (distance measurement error) of the LiDAR sensor 3 for each mobile terminal 1, which enables an adjustment suited to the mounted LiDAR sensor 3.
Note that the correction map may be generated by the user for each mobile terminal 1 as described above, or may be prepared in advance. In a device including a LiDAR sensor and a camera (the mobile terminal in this embodiment), the types of sensors mounted on each type of device are known in advance, so a correction map for the case where the recognition target is human skin may be generated and prepared in advance for each model or sensor. The same applies to a second embodiment described later.
<Second Embodiment>
In the first embodiment, an example in which the correction map is generated using the respective sensing results of two cameras and one LiDAR sensor has been described, but the present technology is not limited to this.
In this embodiment, an example in which the correction map is generated using the respective sensing results of one camera and one LiDAR sensor mounted on a device (a mobile terminal in this embodiment) will be described.
The mobile terminal as the device in this embodiment differs from the mobile terminal of the first embodiment in the number of cameras; the other basic configurations are similar, and the configuration of the processing unit 50 is substantially the same. The mobile terminal in the first embodiment is equipped with a multi-view camera, whereas the mobile terminal in the second embodiment is equipped with a monocular camera. The differences will be mainly described below.
In the second embodiment, the program for the generation processing of the correction map (depth correction information) stored in the storage unit 56 of the mobile terminal 1, which also functions as the recognition device, is for causing the recognition device (the mobile terminal 1 in this embodiment) to execute the following steps.
The steps are: a step of detecting the two-dimensional position of the recognition target from the RGB image (sensing result) of the one camera; a step of detecting the two-dimensional position of the recognition target from the reliability image (sensing result) of the LiDAR sensor; a step of calculating the three-dimensional position of the recognition target by triangulation using the two-dimensional position of the recognition target based on the RGB image of the camera and the two-dimensional position of the recognition target based on the reliability image of the LiDAR sensor; a step of detecting the three-dimensional position of the recognition target from the depth image of the LiDAR sensor; and a step of generating the depth correction information (correction map) using the difference between the three-dimensional position of the recognition target calculated by triangulation and the three-dimensional position of the recognition target based on the depth image of the LiDAR sensor.
A method of generating the correction map according to this embodiment will be described with reference to FIGS. 11 and 12.
FIG. 11 is a schematic diagram illustrating an example of generating the correction map using the mobile terminal 1.
In FIG. 11, the small white circles shown superimposed on the hand of the user U indicate feature point positions 6 of the hand of the user U. Here, the case of recognizing the fingertip position of the index finger is described.
FIG. 12 is a flowchart of the correction map generation method according to this embodiment.
The image displayed on the display unit when the correction map is generated is the same as in the first embodiment.
In FIG. 11, reference numeral 121 denotes the fingertip position of the index finger calculated by triangulation using the two-dimensional feature point position detected from the RGB image of the camera 2 and the two-dimensional feature point position detected from the reliability image of the LiDAR sensor 3. The fingertip position 121 calculated by triangulation is regarded as corresponding to the actual fingertip position and as containing the information of the actual depth value of the recognition target. The fingertip position 121 is a three-dimensional feature point position of the recognition target.
The reliability image is reliability information that represents, for each pixel, the reliability of the depth information acquired by the LiDAR sensor 3. The reliability is calculated at the same time the depth information is acquired by the LiDAR sensor 3, using the luminance information and contrast information of the image used for the depth calculation. The reliability is determined as a real value for each pixel, and the reliability image is finally generated as a grayscale image in which the reliability is expressed as a luminance value.
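The specification does not give a concrete formula for this reliability, so the following is only a loose illustration under assumed weighting: a per-pixel score derived from the return luminance and its local contrast, scaled into an 8-bit grayscale image.

    import numpy as np

    def reliability_image(amplitude, window=5, eps=1e-6):
        """Illustrative per-pixel reliability from the received-light amplitude.

        amplitude : 2D array of received-light intensity (luminance) per pixel.
        Returns an 8-bit grayscale image whose brightness encodes reliability.
        The combination of luminance and contrast used here is an assumption.
        """
        h, w = amplitude.shape
        pad = window // 2
        padded = np.pad(amplitude.astype(float), pad, mode="edge")
        # Local contrast: RMS difference between each pixel and its neighborhood.
        contrast = np.zeros((h, w))
        for dy in range(window):
            for dx in range(window):
                contrast += (padded[dy:dy + h, dx:dx + w] - amplitude) ** 2
        contrast = np.sqrt(contrast / (window * window))

        score = amplitude * contrast            # brighter and sharper -> more reliable
        score = score / (score.max() + eps)
        return (score * 255).astype(np.uint8)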
In FIG. 11, reference numeral 131 denotes the three-dimensional feature point position of the tip of the index finger based on the depth image acquired by the LiDAR sensor 3. The depth value of the fingertip position 131 acquired by the LiDAR sensor 3 deviates from that of the actual fingertip position 121 of the recognition target because of subsurface scattering during measurement by the LiDAR sensor 3.
The difference between the fingertip position 121 calculated by triangulation and the fingertip position 131 of the index finger based on the depth image of the LiDAR sensor 3 is the error component. This error component becomes the "offset value related to depth" in the correction map.
By acquiring such error component data over the entire imaging area, a correction map can be generated for correcting the measurement error originating from the LiDAR sensor 3 of the mobile terminal 1 when the recognition target is human skin.
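As one way to picture how these error samples could be aggregated into a correction map (the grid resolution and the averaging scheme below are assumptions, not taken from the specification): each sample stores the depth offset at its image position, and the offsets are averaged per cell of a coarse grid covering the imaging area.

    import numpy as np

    def build_correction_map(samples, image_size, grid=(8, 8)):
        """Illustrative correction map: mean depth offset per image-grid cell.

        samples : iterable of (u, v, depth_lidar, depth_triangulated) tuples
                  collected while the hand is moved over the imaging area.
        image_size : (width, height) of the depth image.
        Returns a grid of offsets (LiDAR depth minus triangulated depth).
        """
        w, h = image_size
        offsets = np.zeros(grid)
        counts = np.zeros(grid)
        for u, v, d_lidar, d_tri in samples:
            gx = min(int(u / w * grid[0]), grid[0] - 1)
            gy = min(int(v / h * grid[1]), grid[1] - 1)
            offsets[gx, gy] += d_lidar - d_tri   # the per-sample error component
            counts[gx, gy] += 1
        return np.divide(offsets, counts,
                         out=np.zeros(grid), where=counts > 0)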
As described above, the correction map includes difference information between the depth value of the recognition target based on the sensing result of the LiDAR sensor 3 and the actual depth value of the recognition target.
In the correction map generation process of this embodiment, the correction map generation unit 55 generates the correction map using the three-dimensional position information of the recognition target based on the depth image (sensing result) of the LiDAR sensor 3 and the three-dimensional position information of the recognition target based on the RGB image (sensing result) of the single camera 2 and the reliability image (sensing result) of the LiDAR sensor 3.
The flow of the correction map generation process in the processing unit 50 is described below with reference to FIG. 12.
As shown in FIG. 12, the three-dimensional feature point position of the recognition target is detected from the depth image of the LiDAR sensor 3 (ST31). The three-dimensional feature point position based on this depth image corresponds to reference numeral 131 in FIG. 11.
In addition, a two-dimensional feature point position is detected from the reliability image of the LiDAR sensor 3 (ST32).
A two-dimensional feature point position is also detected from the RGB image of the camera 2 (ST33).
Next, the three-dimensional feature point position of the recognition target is calculated by triangulation using the two-dimensional feature point position detected from the reliability image and the two-dimensional feature point position detected from the RGB image of the camera 2 (ST34). The three-dimensional feature point position calculated by this triangulation corresponds to the actual three-dimensional feature point position of the recognition target, and corresponds to reference numeral 121 in FIG. 11.
Next, the difference between the three-dimensional feature point position of the recognition target calculated by triangulation in ST34 and the three-dimensional feature point position based on the depth image of the LiDAR sensor 3 estimated in ST31 is calculated as the error component (ST35).
By acquiring such error component data over the entire imaging area, the correction map is generated.
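For orientation only, steps ST31 to ST35 could be wired together per captured frame as in the sketch below; the feature detectors, the triangulation routine, and the sample accumulator are passed in as callables because the specification does not prescribe particular algorithms for them (the earlier DLT sketch could serve as the triangulation callable).

    def process_frame(depth_image, reliability_image, rgb_image,
                      P_lidar, P_camera,
                      detect_2d, detect_3d_from_depth, triangulate, add_sample):
        """One hypothetical iteration of ST31-ST35 (all helpers are injected)."""
        p3d_lidar = detect_3d_from_depth(depth_image)              # ST31
        uv_rel = detect_2d(reliability_image)                      # ST32
        uv_rgb = detect_2d(rgb_image)                              # ST33
        p3d_true = triangulate(P_lidar, P_camera, uv_rel, uv_rgb)  # ST34
        error = p3d_lidar[2] - p3d_true[2]                         # ST35
        add_sample(uv_rel, error)                                  # accumulate for the map
        return error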
As in each of the embodiments described above, the present technology corrects the depth value acquired by the LiDAR sensor of a device including a LiDAR sensor and a camera (image sensor) by referring to a correction map (depth correction information) generated using the sensing result of the LiDAR sensor and the sensing result of the camera. This makes it possible to correct the error in the depth value of the sensing result of the LiDAR sensor according to the individual differences of the LiDAR sensor, and the recognition accuracy of the recognition target can be improved.
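A minimal sketch of the correction step itself, assuming the correction map is the per-cell offset grid illustrated above (the grid shape and the sign convention, LiDAR depth minus actual depth, are assumptions):

    import numpy as np

    def correct_depth(depth_image, correction_map):
        """Subtract the per-region depth offset from the LiDAR depth image."""
        h, w = depth_image.shape
        gx, gy = correction_map.shape
        # Correction-map cell index for every pixel column and row.
        us = np.minimum((np.arange(w) / w * gx).astype(int), gx - 1)
        vs = np.minimum((np.arange(h) / h * gy).astype(int), gy - 1)
        offsets = correction_map[us[None, :], vs[:, None]]
        return depth_image - offsets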
The present technology is particularly preferably applied when the recognition target is a translucent body such as human skin. In the present technology, even if the recognition target is a translucent body, the deviation (error) between the value measured by the LiDAR sensor and the actual value, caused by subsurface scattering in the recognition target and by individual differences among sensor devices, is corrected by correcting the depth value acquired by the LiDAR sensor using the correction map. This enables stable and highly accurate measurement of the recognition target and improves the recognition accuracy of the recognition target.
For this reason, as described above, the present technology can be particularly preferably applied to recognition of human hands, whose skin is often exposed.
The present technology can also be applied to gesture recognition for recognizing gesture motions performed by a user. As an alternative to controllers and remote controllers for games, home appliances, and the like, the gesture recognition result of a hand gesture performed by the user can be used as an operation input for a game or a home appliance. Since the present technology enables highly accurate recognition of the recognition target, stable and accurate operation input is possible.
<Other configuration examples>
Embodiments of the present technology are not limited to the embodiments described above, and various modifications are possible without departing from the gist of the present technology.
For example, in the first and second embodiments described above, an RGB camera and a LiDAR sensor are used as separate devices, but an RGB-D camera, which is a single device capable of simultaneously capturing an RGB image and a depth image (NIR image), may be used instead.
In the first embodiment, one camera and one RGB-D camera may be used instead of the two cameras and one LiDAR sensor.
In the second embodiment, one RGB-D camera may be used instead of the one camera and one LiDAR sensor.
Further, for example, in the embodiments described above, the mobile terminal, which is a device including an image sensor and a LiDAR sensor, functions as the recognition device that recognizes the recognition target. Alternatively, the recognition device that recognizes the recognition target may be an external device separate from the device including the image sensor and the LiDAR sensor. For example, part or all of the processing unit 50 shown in FIG. 3 may be configured by an external device, such as a server, separate from the device including the image sensor and the LiDAR sensor.
The present technology can also have the following configurations.
(1) A recognition device including a processing unit that corrects a depth value of a recognition target acquired by a LiDAR (Light Detection and Ranging) sensor of a device including the LiDAR sensor and an image sensor, the LiDAR sensor having a light emitting unit that irradiates the recognition target with light and a light receiving unit that receives light reflected from the recognition target, the image sensor capturing an image of the recognition target, the correction being made by referring to depth correction information generated using a sensing result of the LiDAR sensor and a sensing result of the image sensor.
(2) The recognition device according to (1) above, in which
the depth correction information includes difference information between the depth value of the recognition target based on the sensing result of the LiDAR sensor and an actual depth value of the recognition target.
(3) The recognition device according to (1) or (2) above, in which
the device includes a plurality of the image sensors and one LiDAR sensor, and
the depth correction information includes difference information between a depth value of the recognition target calculated by triangulation using position information of the recognition target detected from sensing results of the plurality of image sensors and the depth value of the recognition target based on a depth image as the sensing result of the LiDAR sensor.
(4) The recognition device according to (1) or (2) above, in which
the device includes at least one image sensor and one LiDAR sensor, and
the depth correction information includes difference information between a depth value of the recognition target calculated by triangulation using position information of the recognition target detected from the sensing result of one image sensor and position information of the recognition target detected from a reliability image as the sensing result of the LiDAR sensor, and the depth value of the recognition target based on a depth image as the sensing result of the LiDAR sensor.
(5) The recognition device according to any one of (1) to (4) above, in which the recognition target is a translucent body.
(6) The recognition device according to (5) above, in which the recognition target is human skin.
(7) The recognition device according to (6) above, in which the recognition target is a human hand.
(8) The recognition device according to any one of (1) to (7) above, in which the processing unit recognizes a gesture motion of a human who is the recognition target.
(9) The recognition device according to any one of (1) to (8) above, in which the processing unit generates the depth correction information using the sensing result of the LiDAR sensor and the sensing result of the image sensor.
(10) The recognition device according to any one of (1) to (9) above, in which
the device includes a display unit, and
the processing unit generates an image to be displayed on the display unit using the corrected depth value of the recognition target.
(11) A recognition method including correcting a depth value of a recognition target acquired by a LiDAR (Light Detection and Ranging) sensor of a device including the LiDAR sensor and an image sensor, the LiDAR sensor having a light emitting unit that irradiates the recognition target with light and a light receiving unit that receives light reflected from the recognition target, the image sensor capturing an image of the recognition target, the correction being made by referring to depth correction information generated using a sensing result of the LiDAR sensor and a sensing result of the image sensor.
(12) A program that causes a recognition device to execute a step of correcting a depth value of a recognition target acquired by a LiDAR (Light Detection and Ranging) sensor of a device including the LiDAR sensor and an image sensor, the LiDAR sensor having a light emitting unit that irradiates the recognition target with light and a light receiving unit that receives light reflected from the recognition target, the image sensor capturing an image of the recognition target, the correction being made by referring to depth correction information generated using a sensing result of the LiDAR sensor and a sensing result of the image sensor.
1 ... Mobile terminal (recognition device, device)
2 ... Camera (image sensor)
2A ... First camera (image sensor)
2B ... Second camera (image sensor)
3 ... LiDAR sensor
12, 120, 121 ... Actual fingertip position / fingertip position calculated by triangulation (three-dimensional position of the recognition target including the actual depth value)
13, 130, 131 ... Fingertip position based on the sensing result of the LiDAR sensor (three-dimensional position of the recognition target including the depth value based on the sensing result of the LiDAR sensor)
34 ... Display unit
50 ... Processing unit

Claims (12)

1. A recognition device comprising a processing unit that corrects a depth value of a recognition target acquired by a LiDAR (Light Detection and Ranging) sensor of a device comprising the LiDAR sensor and an image sensor, the LiDAR sensor having a light emitting unit that irradiates the recognition target with light and a light receiving unit that receives light reflected from the recognition target, the image sensor capturing an image of the recognition target, the correction being made by referring to depth correction information generated using a sensing result of the LiDAR sensor and a sensing result of the image sensor.
2. The recognition device according to claim 1, wherein
the depth correction information includes difference information between the depth value of the recognition target based on the sensing result of the LiDAR sensor and an actual depth value of the recognition target.
3. The recognition device according to claim 2, wherein
the device comprises a plurality of the image sensors and one LiDAR sensor, and
the depth correction information includes difference information between a depth value of the recognition target calculated by triangulation using position information of the recognition target detected from sensing results of the plurality of image sensors and the depth value of the recognition target based on a depth image as the sensing result of the LiDAR sensor.
4. The recognition device according to claim 2, wherein
the device comprises at least one image sensor and one LiDAR sensor, and
the depth correction information includes difference information between a depth value of the recognition target calculated by triangulation using position information of the recognition target detected from the sensing result of one image sensor and position information of the recognition target detected from a reliability image as the sensing result of the LiDAR sensor, and the depth value of the recognition target based on a depth image as the sensing result of the LiDAR sensor.
5. The recognition device according to claim 1, wherein the recognition target is a translucent body.
6. The recognition device according to claim 5, wherein the recognition target is human skin.
7. The recognition device according to claim 6, wherein the recognition target is a human hand.
8. The recognition device according to claim 1, wherein the processing unit recognizes a gesture motion of a human who is the recognition target.
9. The recognition device according to claim 1, wherein the processing unit generates the depth correction information using the sensing result of the LiDAR sensor and the sensing result of the image sensor.
10. The recognition device according to claim 1, wherein
the device comprises a display unit, and
the processing unit generates an image to be displayed on the display unit using the corrected depth value of the recognition target.
11. A recognition method comprising correcting a depth value of a recognition target acquired by a LiDAR (Light Detection and Ranging) sensor of a device comprising the LiDAR sensor and an image sensor, the LiDAR sensor having a light emitting unit that irradiates the recognition target with light and a light receiving unit that receives light reflected from the recognition target, the image sensor capturing an image of the recognition target, the correction being made by referring to depth correction information generated using a sensing result of the LiDAR sensor and a sensing result of the image sensor.
12. A program that causes a recognition device to execute a step of correcting a depth value of a recognition target acquired by a LiDAR (Light Detection and Ranging) sensor of a device comprising the LiDAR sensor and an image sensor, the LiDAR sensor having a light emitting unit that irradiates the recognition target with light and a light receiving unit that receives light reflected from the recognition target, the image sensor capturing an image of the recognition target, the correction being made by referring to depth correction information generated using a sensing result of the LiDAR sensor and a sensing result of the image sensor.
PCT/JP2022/000218 2021-04-22 2022-01-06 Recognition device, recognition method, and program WO2022224498A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202280028267.4A CN117178293A (en) 2021-04-22 2022-01-06 Identification device, identification method, and program

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2021-072234 2021-04-22
JP2021072234A JP2022166872A (en) 2021-04-22 2021-04-22 Recognition apparatus, recognition method, and program

Publications (1)

Publication Number Publication Date
WO2022224498A1 true WO2022224498A1 (en) 2022-10-27

Family

ID=83722279

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/000218 WO2022224498A1 (en) 2021-04-22 2022-01-06 Recognition device, recognition method, and program

Country Status (3)

Country Link
JP (1) JP2022166872A (en)
CN (1) CN117178293A (en)
WO (1) WO2022224498A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000261617A (en) * 1999-03-09 2000-09-22 Minolta Co Ltd Image reader
JP2016085602A (en) * 2014-10-27 2016-05-19 株式会社日立製作所 Sensor information integrating method, and apparatus for implementing the same
JP2021051347A (en) * 2019-09-20 2021-04-01 いすゞ自動車株式会社 Distance image generation apparatus and distance image generation method

Also Published As

Publication number Publication date
CN117178293A (en) 2023-12-05
JP2022166872A (en) 2022-11-04

Similar Documents

Publication Publication Date Title
US11928838B2 (en) Calibration system and method to align a 3D virtual scene and a 3D real world for a stereoscopic head-mounted display
EP3788403B1 (en) Field calibration of a structured light range-sensor
US11215711B2 (en) Using photometric stereo for 3D environment modeling
US9208566B2 (en) Speckle sensing for motion tracking
US9646384B2 (en) 3D feature descriptors with camera pose information
US10254546B2 (en) Optically augmenting electromagnetic tracking in mixed reality
US20190179146A1 (en) Selective tracking of a head-mounted display
US20170374342A1 (en) Laser-enhanced visual simultaneous localization and mapping (slam) for mobile devices
EP2531980B1 (en) Depth camera compatibility
US10613228B2 (en) Time-of-flight augmented structured light range-sensor
US10091489B2 (en) Image capturing device, image processing method, and recording medium
EP2531979B1 (en) Depth camera compatibility
US10936900B2 (en) Color identification using infrared imaging
JP2011123071A (en) Image capturing device, method for searching occlusion area, and program
US10019839B2 (en) Three-dimensional object scanning feedback
EP3646147B1 (en) Display apparatus for computer-mediated reality
WO2022224498A1 (en) Recognition device, recognition method, and program
Pal et al. 3D point cloud generation from 2D depth camera images using successive triangulation
CN112424641A (en) Using time-of-flight techniques for stereo image processing
WO2021253308A1 (en) Image acquisition apparatus
US20240103133A1 (en) Information processing apparatus, information processing method, and sensing system
Kraus Wireless Optical Communication: Infrared, Time-Of-Flight, Light Fields, and Beyond

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22791286

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22791286

Country of ref document: EP

Kind code of ref document: A1