WO2022123654A1 - Information processing device and information processing method - Google Patents

Information processing device and information processing method

Info

Publication number
WO2022123654A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
warning
condition
unit
information processing
Application number
PCT/JP2020/045684
Other languages
French (fr)
Japanese (ja)
Inventor
政人 土屋
良枝 今井
Original Assignee
三菱電機株式会社 (Mitsubishi Electric Corporation)
Application filed by 三菱電機株式会社 (Mitsubishi Electric Corporation)
Priority to PCT/JP2020/045684
Priority to JP2021524283A
Publication of WO2022123654A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis

Definitions

  • This disclosure relates to an information processing device and an information processing method.
  • Patent Document 1 discloses a technique for visualizing a judgment basis in a case where a disease in the body is judged from an endoscopic image.
  • in a warning system using object detection technology, it is necessary to select which information to present when notifying the user of recognition results. For example, if the judgment was unreliable at the time of detection, or if the detected object is not a threat at that moment, informing the user may instead hinder the user's operation or concentration.
  • visualization techniques so far have been applied to single tasks such as image classification, and conventional visualization techniques cannot be applied as they are to multitask problems such as object detection, which combines image classification and position detection.
  • accordingly, one or more aspects of the present disclosure are intended to enable visualization of the judgment basis for object detection.
  • the information processing apparatus includes an object detection unit that detects an object from a target image using a feature map, and a judgment basis visualization unit that generates a judgment basis visualization image: an image, in the same pixel arrangement as the target image, in which the importance of the image regions that served as the judgment basis for detecting the object can be identified according to the feature map.
  • in the information processing method, an object is detected from a target image using a feature map, and a judgment basis visualization image is generated: an image, in the same pixel arrangement as the target image, in which the importance of the image regions that served as the judgment basis for detecting the object can be identified according to the feature map.
  • FIG. 1 is a block diagram schematically showing a configuration of a warning system 100 as an information processing system (for example, a visualization system) according to an embodiment.
  • the warning system 100 includes a camera 101, a distance measuring sensor 102, a warning device 110 as an information processing device, a monitor 103, and a speaker 104.
  • the warning system 100 uses a warning device 110 mounted as an in-vehicle device to give a warning notification to a user such as a driver or other occupants together with visualization of the judgment basis.
  • however, this warning system 100 is not limited to being mounted on an automobile and can be widely applied to mobile vehicles such as railroad vehicles and two-wheeled vehicles; the description of the first embodiment therefore does not narrow the scope of the claims.
  • in the warning system 100, the warning device 110 is connected to the camera 101 as an imaging device that acquires an image in the traveling direction, the distance measuring sensor 102 as a ranging device that measures the distance to an object in the traveling direction, the monitor 103 as a display device that visually conveys the warning content to the user, and the speaker 104 as an acoustic device that conveys the warning content to the user by sound.
  • the monitor 103 and the speaker 104 function as a warning output device that outputs a warning.
  • the camera 101 is installed facing the front of the vehicle, converts visible light into a digital signal, and can acquire an image of the area in front of the vehicle as RGB values.
  • the camera 101 outputs an image output signal indicating the captured image to the warning device 110.
  • the camera 101 does not have to be a visible light camera, and may be an infrared camera, a depth camera, or the like.
  • the distance measuring sensor 102 measures the distance to the object in front of the camera 101 and transmits it to the warning device 110.
  • the distance measuring sensor 102 outputs a depth image output signal indicating a depth image representing the measured distance for each pixel to the warning device 110.
  • a conventional radar or a sensor such as LiDAR is assumed as the distance measuring sensor 102.
  • the monitor 103 is a device used to visually convey a warning from the warning device 110 to the user by an image.
  • an image obtained by superimposing a warning display on the image acquired from the camera 101 is used.
  • the speaker 104 is a device used to convey a warning from the warning device 110 to the user by voice.
  • FIG. 2 is a block diagram schematically showing the configuration of the warning device 110.
  • the warning device 110 includes an input unit 111, an image signal acquisition unit 112, a preprocessing unit 113, an object detection unit 114, a judgment basis visualization unit 115, a condition judgment unit 116, a warning execution unit 117, and an output unit 118.
  • the input unit 111 receives the input of the image output signal from the camera 101 and the depth image output signal from the distance measuring sensor 102.
  • the input unit 111 gives the image output signal and the depth image output signal to the image signal acquisition unit 112.
  • the image signal acquisition unit 112 converts an image output signal, which is an analog signal from the camera 101, into a digitized image signal. Further, the image signal acquisition unit 112 converts the depth image output signal, which is an analog signal from the distance measuring sensor 102, into a digitized depth image signal. Then, the image signal acquisition unit 112 supplies the image signal and the depth image signal to the preprocessing unit 113.
  • the preprocessing unit 113 converts the image signal into a processed image signal in a format convenient for the object detection unit 114 and the judgment basis visualization unit 115 in terms of calculation efficiency. The preprocessing unit 113 likewise converts the depth image signal into a processed depth image signal.
  • the preprocessing can include various processes such as standardization or whitening, but here the preprocessing is simply a resizing process that converts the resolution of the two-dimensional image signal into a square. The preprocessing unit 113 then gives the processed image signal and the processed depth image signal to the object detection unit 114, the judgment basis visualization unit 115, and the warning execution unit 117.
  • the image signal acquisition unit 112 and the preprocessing unit 113 constitute an image acquisition unit that acquires a target image for object detection.
  • the target image is the image represented by the processed image signal.
  • the object detection unit 114 detects objects from the target image using a feature map. For example, the object detection unit 114 detects object positions based on the processed image and the processed depth image, which are the images indicated by the processed image signal and the processed depth image signal received from the preprocessing unit 113, and generates object detection result information indicating the object detection result. The object detection result information is given to the judgment basis visualization unit 115, the condition judgment unit 116, and the warning execution unit 117.
  • FIG. 3 is a schematic diagram showing a first example of object detection result information.
  • the object detection result information 120 shown in FIG. 3 is table-format information including a detection candidate ID column 120a, a detection position column 120b, an estimated distance column 120c, and a reliability column 120d.
  • the detection candidate ID column 120a stores the detection candidate ID, which is object candidate identification information for identifying the detected object.
  • the detection position column 120b stores the detection position, which is the position at which the object candidate was detected.
  • the detection position indicates the position of the detected object in the image space.
  • here, the positions are those obtained when the position information of the object detection model used by the object detection unit 114 is formulated as a Dirac delta distribution.
  • the estimated distance column 120c stores the distance of the detected object.
  • the reliability column 120d stores the reliability, which indicates the probability that the detected object is the target object.
  • FIG. 4 is a schematic diagram showing a second example of the object detection result information. Like the object detection result information 120 shown in FIG. 3, the object detection result information 121 shown in FIG. 4 is table-format information including a detection candidate ID column 121a, a detection position column 121b, an estimated distance column 121c, and a reliability column 121d. In the example shown in FIG. 4, the detection position column 121b shows the positions obtained when the position information of the object detection model used by the object detection unit 114 is formulated as a normal distribution.
  • when the object detection unit 114 uses a deep learning model as the object detection model, a feature map is computed in the course of performing object detection. "Feature map" is a general term for the multidimensional tensor obtained by applying spatial convolution operations to an image signal.
  • the object detection unit 114 generates feature map information indicating this feature map, and gives the feature map information to the judgment basis visualization unit 115.
  • the judgment basis visualization unit 115 generates a judgment basis visualization image: an image, in the same pixel arrangement as the target image, in which the importance of the image regions that served as the judgment basis for detecting an object can be identified according to the feature map. For example, the judgment basis visualization unit 115 generates the judgment basis visualization image from the processed image signal and the processed depth image signal received from the preprocessing unit 113 and from the object detection result information and the feature map information received from the object detection unit 114.
  • the judgment basis visualization image is a grayscale image with the same resolution as the resized processed image signal and processed depth image signal, in which higher values are set for image regions that served as more important judgment bases for the object detection performed by the object detection unit 114.
  • typically, the judgment basis visualization image is converted with a color map that assigns colors such as blue, green, yellow, and red from low to high values, and is often displayed superimposed on the original image using alpha blending or the like.
  • the judgment basis visualization image is created based on the information of the detection position included in the object detection result, and the distribution of the detection position is used as it is for visualization.
  • when the detection position of the object detection model used by the object detection unit 114 is formulated as a Dirac delta distribution, a simple visualization can be calculated as in equation (1).
  • in equation (1), k is the detection candidate ID, x is the position in image space of the object identified by k, conf is the reliability value of the object identified by k, and Z is a normalization constant for making the maximum value of E at each position x, over all object classes at the time of object detection, equal to 1.
  • when the position information of the object detection model is formulated as a normal distribution, or when the feature map F computed by the object detection unit 114 in the course of calculating the object detection result is also used, a calculation such as equation (2) can be used. Here, c is a channel of the feature map; the circle symbol in equation (2) denotes the Hadamard product, and the outer double lines denote the L2 norm.
  • the judgment basis visualization unit 115 gives the judgment basis visualization signal indicating the judgment basis visualization image to the warning execution unit 117.
  • the condition determination unit 116 determines the warning setting level based on the object detection result information.
  • the reliability and estimated distance included in the object detection result are used to determine the warning setting level.
  • FIG. 5 is a table for determining the warning setting level L. As shown in FIG. 5, the warning setting level L is determined according to the combination of the reliability and the estimated distance. Since the warning setting level L is used to determine the operation of the warning execution unit 117, warning setting level information indicating the warning setting level L is given to the warning execution unit 117.
  • the warning setting level 1 is a level when the first condition that no object is detected from the target image is satisfied.
  • the warning setting level 1 is a level when the reliability is less than 50% and the second condition that the distance is 10 m or more is satisfied.
  • the warning setting level 2 is a level when the third condition that the reliability is less than 50% and the distance is less than 10 m, or the reliability is 50% or more and the distance is 10 m or more is satisfied.
  • the warning setting level 3 is a level when the reliability is 50% or more and the fourth condition that the distance is less than 10 m is satisfied.
  • here, the second condition is a condition under which a warning about the detected object is more necessary than under the first condition, the third condition is a condition under which such a warning is more necessary than under the second condition, and the fourth condition is a condition under which such a warning is more necessary than under the third condition.
  • the warning execution unit 117 is an output execution unit that determines the content of the warning to the user based on the warning setting level indicated by the warning setting level information received from the condition determination unit 116 and executes the output.
  • FIG. 6 is a table for determining the warning content from the warning setting level L. As shown in FIG. 6, the warning execution unit 117 determines the warning content according to the warning setting level L. Then, the warning execution unit 117 generates warning information according to the determined warning content, and gives a warning signal indicating the warning information to the output unit 118.
  • if the warning setting level L is 0, the warning execution unit 117 does not generate warning information. In this case, the warning execution unit 117 outputs the processed image signal indicating the target image to the output unit 118.
  • if the warning setting level L is 1, the warning execution unit 117 generates an output image using the judgment basis visualization image as the warning image, and generates warning information indicating the warning image. Specifically, the warning execution unit 117 generates the output image by alpha blending the judgment basis visualization image received from the judgment basis visualization unit 115 onto the processed image indicated by the processed image signal.
  • if the warning setting level L is 2, the warning execution unit 117 generates, as the warning image, a processed image in which the object detected in the target image is surrounded by a rectangle, and generates warning information indicating the warning image. Specifically, the warning execution unit 117 superimposes the detection position included in the object detection result received from the object detection unit 114 as a rectangle on the processed image indicated by the processed image signal to generate the warning image, and generates warning information indicating the warning image.
  • if the warning setting level L is 3, the warning execution unit 117 generates the above processed image as the warning image, additionally generates a warning sound, which is audio for warning, and generates warning information indicating the warning image and the warning sound. Specifically, the warning execution unit 117 superimposes the detection position included in the object detection result received from the object detection unit 114 as a rectangle on the processed image indicated by the processed image signal to generate the warning image, and at the same time generates an audible warning sound.
  • the output unit 118 outputs the warning information given by the warning execution unit 117 and thereby conveys the warning to the user through at least one of the monitor 103 and the speaker 104.
  • FIG. 7 is a hardware configuration diagram of the warning device 110. As shown in FIG. 7, the warning device 110 can be configured by a computer 130.
  • the memory 131 stores a program that functions as an image signal acquisition unit 112, a preprocessing unit 113, a judgment basis visualization unit 115, an object detection unit 114, a condition judgment unit 116, and a warning execution unit 117, and data used by these.
  • the memory 131 is, for example, a semiconductor memory such as a RAM (Random Access Memory), a ROM (Read Only Memory), a flash memory, an EPROM (Erasable Programmable Read Only Memory), or an EEPROM (Electrically Erasable Programmable Read-Only Memory), or a magnetic disk, an optical disk, a magneto-optical disk, or the like.
  • the processor 132 reads out a program that functions as an image signal acquisition unit 112, a preprocessing unit 113, a judgment basis visualization unit 115, an object detection unit 114, a condition judgment unit 116, and a warning execution unit 117, and executes processing.
  • the processor 132 uses, for example, a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a microprocessor, a microcontroller, a DSP (Digital Signal Processor), or the like.
  • the acoustic interface 133 is used when the warning execution unit 117 outputs an audible warning from the speaker 104 to warn the user. The acoustic interface 133 functions as the output unit 118.
  • the image interface 134 transmits the analog signals obtained from the camera 101 and the distance measuring sensor 102 to the image signal acquisition unit 112; in this case, the image interface 134 functions as the input unit 111.
  • the image interface 134 is also used when the warning execution unit 117 outputs the final screen output signal to the monitor 103 to warn the user; in this case, the image interface 134 functions as the output unit 118.
  • the network interface 135 is used when the image signal acquisition unit 112 receives an image input signal from an external network environment; in this case, the network interface 135 functions as the input unit 111. The network interface 135 is not required if the device is not configured to communicate with an external network environment.
  • although the memory 131 is arranged inside the warning device 110 in FIG. 2, it may be configured as an external memory, such as a USB memory, from which the programs and data are read. The internal memory and an external memory may also be used together as the memory 131.
  • FIG. 8 is a flowchart showing a process performed by the warning device 110.
  • the input unit 111 receives the input of the image output signal from the camera 101 and the depth image output signal from the distance measuring sensor 102 (S10).
  • the input image output signal and depth image output signal are given to the image signal acquisition unit 112, converted from analog signals to digital signals by the image signal acquisition unit 112, and given to the preprocessing unit 113 as an image signal and a depth image signal.
  • the preprocessing unit 113 performs preprocessing on the image signal and the depth image signal to convert them into a processed image signal and a processed depth image signal, respectively (S11).
  • the processed image signal and the processed depth image signal are given to the object detection unit 114, the judgment basis visualization unit 115, and the warning execution unit 117.
  • the object detection unit 114 performs object detection from the processed image signal and the processed depth image signal, gives object detection result information indicating the object detection result to the judgment basis visualization unit 115, the condition judgment unit 116, and the warning execution unit 117, and further gives feature map information indicating the feature map obtained in the course of object detection to the judgment basis visualization unit 115 (S12).
  • the condition judgment unit 116 determines the warning setting level according to the combination of the estimated distance and the reliability included in the object detection result information, and gives warning setting level information indicating the determined warning setting level to the warning execution unit 117 (S13).
  • the warning execution unit 117 checks the warning setting level indicated by the given warning setting level information (S14); if the warning setting level is 0, the processing ends, if the warning setting level corresponds to a weak warning, the processing proceeds to step S15, and if the warning setting level corresponds to a medium to strong warning, the processing proceeds to step S16.
  • in step S15, the warning execution unit 117 generates a warning image by alpha blending the judgment basis visualization image received from the judgment basis visualization unit 115 onto the processed image indicated by the processed image signal, and generates warning information indicating the warning image. The generated warning information is given to the output unit 118, and the processing proceeds to step S17.
  • in step S16, if the warning setting level L is 2, the warning execution unit 117 superimposes the detection position included in the object detection result received from the object detection unit 114 as a rectangle on the processed image indicated by the processed image signal to generate a warning image, and generates warning information indicating the warning image. If the warning setting level L is 3, the warning execution unit 117 likewise superimposes the detection position as a rectangle on the processed image to generate a warning image, additionally generates an audible warning sound, and generates warning information indicating the warning image and the warning sound. The processing then proceeds to step S17.
  • in step S17, the output unit 118 issues a warning to the user using at least one of sound and image, based on the warning information generated by the warning execution unit 117.
  • FIGS. 9(A) and 9(B) are schematic views showing examples of screen images output to the monitor 103. Both the screen image 140 and the screen image 150 assume a situation in which vehicles 141 and 151 cross to the right from the corner of the road ahead.
  • the screen image 140 is a warning screen image generated by the warning execution unit 117 when it is determined that the warning setting level is medium to strong.
  • the detection position included in the object detection result is used, and a warning by the rectangle 142 is displayed at that position.
  • the screen image 150 is a warning screen image generated by the warning execution unit 117 when it is determined that the warning setting level is equivalent to a weak warning.
  • the attention area 152 in which the information indicated by the judgment basis visualization signal is superimposed on the original image is displayed as the warning content.
  • in the above description, since the image is resized by the preprocessing unit 113, the warning execution unit 117 generates the warning image using the processed image produced by the preprocessing unit 113. However, the present embodiment is not limited to such an example; for example, the warning execution unit 117 can generate the warning image using the image represented by the image signal digitized by the image signal acquisition unit 112.
  • Reference signs: 100 warning system, 101 camera, 102 distance measuring sensor, 103 monitor, 104 speaker, 110 warning device, 111 input unit, 112 image signal acquisition unit, 113 preprocessing unit, 114 object detection unit, 115 judgment basis visualization unit, 116 condition judgment unit, 117 warning execution unit, 118 output unit.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Closed-Circuit Television Systems (AREA)

Abstract

A warning device (110) includes an object detection unit (114) that detects an object from a target image using a feature map, and a determination basis visualization unit (115) that generates a determination basis visualization image: an image, in a pixel arrangement identical to that of the target image, that makes it possible to identify, according to the feature map, the importance of the image regions that served as the basis for determination when detecting the object.

Description

Information processing device and information processing method
The present disclosure relates to an information processing device and an information processing method.
Machine learning technology is being applied in an expanding range of cases and has come to be used in settings familiar to general users. In such a situation, clarifying which parts of an observed signal contribute to the inference result of a trained model is important for giving users a sense of conviction. In particular, the end-to-end learning methods of recent deep learning become black boxes that do not clearly capture the generation process of the observed signal as a physical phenomenon. For this reason, techniques that visualize the judgment basis in order to improve explainability are attracting a high degree of attention.
As a conventional technique, for example, Patent Document 1 discloses a technique for visualizing the judgment basis in a case where a disease in the body is judged from an endoscopic image.
Patent Document 1: Japanese Unexamined Patent Application Publication No. 2020-89712
Techniques that visualize the judgment basis in order to improve explainability, such as the conventional technique, have mostly been used in image classification and have not been applied to more complicated tasks such as object detection.
In a warning system using object detection technology, it is necessary to select which information to present when notifying the user of recognition results. For example, if the judgment was unreliable at the time of detection, or if the detected object is not a threat at that moment, informing the user may instead hinder the user's operation or concentration.
In addition, visualization techniques so far have been applied to single tasks such as image classification, and conventional visualization methods cannot be applied as they are to multitask problems such as object detection, which combines image classification and position detection.
Therefore, one or more aspects of the present disclosure are intended to enable visualization of the judgment basis for object detection.
An information processing apparatus according to one aspect of the present disclosure includes an object detection unit that detects an object from a target image using a feature map, and a judgment basis visualization unit that generates a judgment basis visualization image: an image, in the same pixel arrangement as the target image, in which the importance of the image regions that served as the judgment basis for detecting the object can be identified according to the feature map.
In an information processing method according to one aspect of the present disclosure, an object is detected from a target image using a feature map, and a judgment basis visualization image is generated: an image, in the same pixel arrangement as the target image, in which the importance of the image regions that served as the judgment basis for detecting the object can be identified according to the feature map.
According to one or more aspects of the present disclosure, the judgment basis for object detection can be visualized.
Brief description of the drawings:
FIG. 1 is a block diagram schematically showing the configuration of a warning system.
FIG. 2 is a block diagram schematically showing the configuration of a warning device.
FIG. 3 is a schematic diagram showing a first example of object detection result information.
FIG. 4 is a schematic diagram showing a second example of object detection result information.
FIG. 5 is a table for determining the warning setting level.
FIG. 6 is a table for determining the warning content from the warning setting level.
FIG. 7 is a hardware configuration diagram of the warning device.
FIG. 8 is a flowchart showing the processing performed by the warning device.
FIGS. 9(A) and 9(B) are schematic views showing examples of screen images output to a monitor.
FIG. 1 is a block diagram schematically showing the configuration of a warning system 100 as an information processing system (for example, a visualization system) according to an embodiment.
The warning system 100 includes a camera 101, a distance measuring sensor 102, a warning device 110 as an information processing device, a monitor 103, and a speaker 104.
The warning system 100 uses the warning device 110, mounted as an in-vehicle device, to give warning notifications combined with visualization of the judgment basis to a user such as the driver or another occupant.
However, the warning system 100 is not limited to being mounted on an automobile and can be widely applied to mobile vehicles such as railroad vehicles and two-wheeled vehicles; the description of the first embodiment therefore does not narrow the scope of the claims.
In the warning system 100, the warning device 110 is connected to the camera 101 as an imaging device that acquires an image in the traveling direction, the distance measuring sensor 102 as a ranging device that measures the distance to an object in the traveling direction, the monitor 103 as a display device that visually conveys the warning content to the user, and the speaker 104 as an acoustic device that conveys the warning content to the user by sound. The monitor 103 and the speaker 104 function as warning output devices that output warnings.
The camera 101 is installed facing the front of the vehicle, converts visible light into a digital signal, and can acquire an image of the area in front of the vehicle as RGB values. The camera 101 outputs an image output signal indicating the captured image to the warning device 110. The camera 101 need not be a visible-light camera and may be an infrared camera, a depth camera, or the like.
The distance measuring sensor 102 measures the distance to objects in front of the camera 101 and conveys it to the warning device 110. The distance measuring sensor 102 outputs, to the warning device 110, a depth image output signal indicating a depth image that represents the measured distance for each pixel. A conventional radar or a sensor such as LiDAR is assumed as the distance measuring sensor 102.
The monitor 103 is a device used to visually convey a warning from the warning device 110 to the user with an image. The displayed image is obtained by superimposing a warning display on the video acquired from the camera 101.
The speaker 104 is a device used to convey a warning from the warning device 110 to the user by sound.
FIG. 2 is a block diagram schematically showing the configuration of the warning device 110.
The warning device 110 includes an input unit 111, an image signal acquisition unit 112, a preprocessing unit 113, an object detection unit 114, a judgment basis visualization unit 115, a condition judgment unit 116, a warning execution unit 117, and an output unit 118.
The input unit 111 receives the input of the image output signal from the camera 101 and the depth image output signal from the distance measuring sensor 102, and gives the image output signal and the depth image output signal to the image signal acquisition unit 112.
The image signal acquisition unit 112 converts the image output signal, which is an analog signal from the camera 101, into a digitized image signal. The image signal acquisition unit 112 also converts the depth image output signal, which is an analog signal from the distance measuring sensor 102, into a digitized depth image signal. The image signal acquisition unit 112 then gives the image signal and the depth image signal to the preprocessing unit 113.
The preprocessing unit 113 converts the image signal into a processed image signal in a format convenient for the object detection unit 114 and the judgment basis visualization unit 115 in terms of calculation efficiency. The preprocessing unit 113 likewise converts the depth image signal into a processed depth image signal.
Preprocessing can include various processes such as standardization or whitening, but here the preprocessing is simply a resizing process that converts the resolution of the two-dimensional image signal into a square.
The preprocessing unit 113 then gives the processed image signal and the processed depth image signal to the object detection unit 114, the judgment basis visualization unit 115, and the warning execution unit 117.
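As an illustration only (not part of the original disclosure), the following minimal Python sketch shows the kind of square-resize preprocessing described above; the function name, the target size, and the use of OpenCV are assumptions.

    import cv2
    import numpy as np

    def preprocess_to_square(image: np.ndarray, size: int = 512) -> np.ndarray:
        # Resize a 2-D image (camera image or depth image) to a square resolution,
        # mirroring the simple resize preprocessing described above.
        return cv2.resize(image, (size, size), interpolation=cv2.INTER_LINEAR)

    # Hypothetical usage: the same resize is applied to the camera image and the
    # depth image so that both keep the same pixel arrangement.
    # processed_image = preprocess_to_square(camera_image)
    # processed_depth = preprocess_to_square(depth_image)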
Here, the image signal acquisition unit 112 and the preprocessing unit 113 constitute an image acquisition unit that acquires a target image for object detection. The target image is the image represented by the processed image signal.
The object detection unit 114 detects objects from the target image using a feature map.
For example, the object detection unit 114 detects object positions based on the processed image and the processed depth image, which are the images indicated by the processed image signal and the processed depth image signal received from the preprocessing unit 113, and generates object detection result information indicating the object detection result. The object detection result information is given to the judgment basis visualization unit 115, the condition judgment unit 116, and the warning execution unit 117.
FIG. 3 is a schematic diagram showing a first example of the object detection result information.
The object detection result information 120 shown in FIG. 3 is table-format information including a detection candidate ID column 120a, a detection position column 120b, an estimated distance column 120c, and a reliability column 120d.
The detection candidate ID column 120a stores the detection candidate ID, which is object candidate identification information for identifying the detected object.
The detection position column 120b stores the detection position, which is the position at which the object candidate was detected. The detection position indicates the position of the detected object in the image space. Here, the positions are those obtained when the position information of the object detection model used by the object detection unit 114 is formulated as a Dirac delta distribution.
The estimated distance column 120c stores the distance of the detected object.
The reliability column 120d stores the reliability, which indicates the probability that the detected object is the target object.
FIG. 4 is a schematic diagram showing a second example of the object detection result information.
Like the object detection result information 120 shown in FIG. 3, the object detection result information 121 shown in FIG. 4 is table-format information including a detection candidate ID column 121a, a detection position column 121b, an estimated distance column 121c, and a reliability column 121d.
In the example shown in FIG. 4, the detection position column 121b shows the positions obtained when the position information of the object detection model used by the object detection unit 114 is formulated as a normal distribution.
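A minimal sketch (not part of the original disclosure) of a data structure for one row of such object detection result information; the field names and the box representation of the detection position are assumptions based on the columns described for FIG. 3 and FIG. 4.

    from dataclasses import dataclass
    from typing import Tuple

    @dataclass
    class DetectionResult:
        # One detection candidate, loosely mirroring the columns of FIG. 3 / FIG. 4.
        candidate_id: int                              # detection candidate ID (column 120a / 121a)
        position: Tuple[float, float, float, float]    # detection position, here assumed to be a box (x1, y1, x2, y2)
        estimated_distance_m: float                    # estimated distance to the object, in meters
        reliability: float                             # probability (0.0-1.0) that the detection is the target object

    # With the normal-distribution formulation of FIG. 4, the position could instead
    # carry a mean and a variance per coordinate; that variant is omitted here.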
When the object detection unit 114 uses a deep learning model as the object detection model, a feature map is computed in the course of performing object detection. "Feature map" is a general term for the multidimensional tensor obtained by applying spatial convolution operations to an image signal. The object detection unit 114 generates feature map information indicating this feature map and gives the feature map information to the judgment basis visualization unit 115.
Returning to FIG. 2, the judgment basis visualization unit 115 generates a judgment basis visualization image: an image, in the same pixel arrangement as the target image, in which the importance of the image regions that served as the judgment basis for detecting an object can be identified according to the feature map.
For example, the judgment basis visualization unit 115 generates the judgment basis visualization image from the processed image signal and the processed depth image signal received from the preprocessing unit 113 and from the object detection result information and the feature map information received from the object detection unit 114.
The judgment basis visualization image is a grayscale image with the same resolution as the resized processed image signal and processed depth image signal, in which higher values are set for image regions that served as more important judgment bases for the object detection performed by the object detection unit 114.
Typically, the judgment basis visualization image is converted with a color map that assigns colors such as blue, green, yellow, and red from low to high values, and is often displayed superimposed on the original image using alpha blending or the like.
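As an illustration only (not part of the original disclosure), the sketch below shows one common way to apply such a color map and alpha blending with OpenCV; the JET colormap and the alpha value of 0.4 are assumptions, not choices made by the patent.

    import cv2
    import numpy as np

    def overlay_heatmap(image_bgr: np.ndarray, heatmap: np.ndarray, alpha: float = 0.4) -> np.ndarray:
        # Color-map an importance map in [0, 1] (same height and width as the image)
        # and alpha-blend it onto the image, as described above.
        heat_u8 = np.uint8(np.clip(heatmap, 0.0, 1.0) * 255)
        colored = cv2.applyColorMap(heat_u8, cv2.COLORMAP_JET)   # low values -> blue, high values -> red
        return cv2.addWeighted(colored, alpha, image_bgr, 1.0 - alpha, 0.0)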
The judgment basis visualization image is created based on the detection position information included in the object detection result, and the distribution of the detection positions is used as it is for visualization. Specifically, when the detection position of the object detection model used by the object detection unit 114 is formulated as a Dirac delta distribution, a simple visualization can be calculated as in equation (1) below.
(Equation (1) is reproduced only as an image in the original publication.)
In equation (1), k is the detection candidate ID, x is the position in image space of the object identified by k, conf is the reliability value of the object identified by k, and Z is a normalization constant for making the maximum value of E at each position x, over all object classes at the time of object detection, equal to 1.
On the other hand, when the position information of the object detection model is formulated as a normal distribution, or when the feature map F computed by the object detection unit 114 in the course of calculating the object detection result is also used, a calculation such as equation (2) below can be used.
(Equation (2) is reproduced only as an image in the original publication.)
Here, c is a channel of the feature map. The circle symbol in equation (2) denotes the Hadamard product, and the outer double lines denote the L2 norm.
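Because equations (1) and (2) appear only as images, the Python sketch below is a loose illustration, under stated assumptions, of the kind of computation the surrounding text describes for the Dirac delta case: accumulating detection confidences at the detection positions and normalizing so that the maximum value becomes 1 (the role of Z). It is not the patent's formula, and the Gaussian-plus-feature-map variant of equation (2) is only hinted at in a comment.

    import numpy as np

    def judgment_basis_map_dirac(detections, height, width):
        # detections: iterable of (x, y, conf) with integer pixel positions and
        # reliability values. Each detection contributes its confidence at its
        # position (a Dirac-delta-like term); the map is then normalized so its
        # maximum is 1, which is the role the text assigns to the constant Z.
        E = np.zeros((height, width), dtype=np.float32)
        for x, y, conf in detections:
            E[y, x] += conf
        z = E.max()
        return E / z if z > 0 else E

    # For the variant around equation (2), each detection would instead contribute
    # a 2-D Gaussian centered at its position, combined channel-wise with the
    # feature map F (Hadamard product) and reduced with an L2 norm over the
    # channels c; the exact form is not reproduced here.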
The judgment basis visualization unit 115 gives a judgment basis visualization signal indicating the judgment basis visualization image to the warning execution unit 117.
The condition judgment unit 116 determines the warning setting level based on the object detection result information. In the first embodiment, the reliability and the estimated distance included in the object detection result are used to determine the warning setting level.
FIG. 5 is a table for determining the warning setting level L.
As shown in FIG. 5, the warning setting level L is determined according to the combination of the reliability and the estimated distance. Since the warning setting level L is used to determine the operation of the warning execution unit 117, warning setting level information indicating the warning setting level L is given to the warning execution unit 117.
Specifically, the warning setting level 1 is the level when the first condition, that no object is detected from the target image, is satisfied.
The warning setting level 1 is the level when the second condition, that the reliability is less than 50% and the distance is 10 m or more, is satisfied.
The warning setting level 2 is the level when the third condition, that the reliability is less than 50% and the distance is less than 10 m, or that the reliability is 50% or more and the distance is 10 m or more, is satisfied.
The warning setting level 3 is the level when the fourth condition, that the reliability is 50% or more and the distance is less than 10 m, is satisfied.
Here, the second condition is a condition under which a warning about the detected object is more necessary than under the first condition, the third condition is a condition under which such a warning is more necessary than under the second condition, and the fourth condition is a condition under which such a warning is more necessary than under the third condition.
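A minimal sketch (not part of the original disclosure) of the level decision described for FIG. 5. The 50% and 10 m thresholds come from the text above; mapping the "no object detected" case to level 0 is an assumption based on the level-0 behavior described below, since the text above labels that case level 1.

    from typing import Optional

    def warning_setting_level(reliability: Optional[float], distance_m: Optional[float]) -> int:
        # reliability: probability (0.0-1.0) that the detection is the target object,
        #              or None if no object was detected.
        # distance_m:  estimated distance in meters, or None if no object was detected.
        if reliability is None or distance_m is None:
            return 0          # assumed: "no object detected" maps to the no-warning level
        if reliability < 0.5 and distance_m >= 10.0:
            return 1          # second condition: weak warning
        if reliability >= 0.5 and distance_m < 10.0:
            return 3          # fourth condition: strong warning
        return 2              # third condition: medium warning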
The warning execution unit 117 is an output execution unit that determines the content of the warning to the user based on the warning setting level indicated by the warning setting level information received from the condition judgment unit 116, and executes its output.
FIG. 6 is a table for determining the warning content from the warning setting level L.
As shown in FIG. 6, the warning execution unit 117 determines the warning content according to the warning setting level L. The warning execution unit 117 then generates warning information according to the determined warning content and gives a warning signal indicating the warning information to the output unit 118.
For example, if the warning setting level L is 0, the warning execution unit 117 does not generate warning information. In this case, the warning execution unit 117 outputs the processed image signal indicating the target image to the output unit 118.
If the warning setting level L is 1, the warning execution unit 117 generates an output image using the judgment basis visualization image as the warning image, and generates warning information indicating the warning image. Specifically, the warning execution unit 117 generates the output image by alpha blending the judgment basis visualization image received from the judgment basis visualization unit 115 onto the processed image indicated by the processed image signal.
If the warning setting level L is 2, the warning execution unit 117 generates, as the warning image, a processed image in which the object detected in the target image is surrounded by a rectangle, and generates warning information indicating the warning image. Specifically, the warning execution unit 117 superimposes the detection position included in the object detection result received from the object detection unit 114 as a rectangle on the processed image indicated by the processed image signal to generate the warning image, and generates warning information indicating the warning image.
If the warning setting level L is 3, the warning execution unit 117 generates the above processed image as the warning image, additionally generates a warning sound, which is audio for warning, and generates warning information indicating the warning image and the warning sound. Specifically, the warning execution unit 117 superimposes the detection position included in the object detection result received from the object detection unit 114 as a rectangle on the processed image indicated by the processed image signal to generate the warning image, and at the same time generates an audible warning sound.
The output unit 118 outputs the warning information given by the warning execution unit 117 and thereby conveys the warning to the user through at least one of the monitor 103 and the speaker 104.
FIG. 7 is a hardware configuration diagram of the warning device 110.
As shown in FIG. 7, the warning device 110 can be configured by a computer 130.
The memory 131 stores programs that function as the image signal acquisition unit 112, the preprocessing unit 113, the judgment basis visualization unit 115, the object detection unit 114, the condition judgment unit 116, and the warning execution unit 117, as well as the data they use. The memory 131 is, for example, a semiconductor memory such as a RAM (Random Access Memory), a ROM (Read Only Memory), a flash memory, an EPROM (Erasable Programmable Read Only Memory), or an EEPROM (Electrically Erasable Programmable Read-Only Memory), or a magnetic disk, an optical disk, a magneto-optical disk, or the like.
The processor 132 reads out the programs that function as the image signal acquisition unit 112, the preprocessing unit 113, the judgment basis visualization unit 115, the object detection unit 114, the condition judgment unit 116, and the warning execution unit 117, and executes the processing. The processor 132 is, for example, a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a microprocessor, a microcontroller, or a DSP (Digital Signal Processor).
The acoustic interface 133 is used when the warning execution unit 117 outputs an audible warning from the speaker 104 to warn the user. The acoustic interface 133 functions as the output unit 118.
The image interface 134 transmits the analog signals obtained from the camera 101 and the distance measuring sensor 102 to the image signal acquisition unit 112; in this case, the image interface 134 functions as the input unit 111.
The image interface 134 is also used when the warning execution unit 117 outputs the final screen output signal to the monitor 103 to warn the user; in this case, the image interface 134 functions as the output unit 118.
The network interface 135 is used when the image signal acquisition unit 112 receives an image input signal from an external network environment; in this case, the network interface 135 functions as the input unit 111. The network interface 135 is not required if the device is not configured to communicate with an external network environment.
Although the memory 131 is arranged inside the warning device 110 in FIG. 2, it may be configured as an external memory, such as a USB memory, from which the programs and data are read. The internal memory and an external memory may also be used together as the memory 131.
 図8は、警告装置110が行う処理を示すフローチャートである。
 まず、入力部111は、カメラ101からの画像出力信号、及び、測距センサー102からの深度画像出力信号の入力を受け付ける(S10)。入力された画像出力信号及び深度画像出力信号は、画像信号取得部112に与えられ、画像信号取得部112により、アナログ信号からデジタル信号に変換されて、画像信号及び深度画像信号として、前処理部113に与えられる。
FIG. 8 is a flowchart showing a process performed by the warning device 110.
First, the input unit 111 receives the input of the image output signal from the camera 101 and the depth image output signal from the distance measuring sensor 102 (S10). The input image output signal and depth image output signal are given to the image signal acquisition unit 112, converted from an analog signal to a digital signal by the image signal acquisition unit 112, and used as an image signal and a depth image signal in the preprocessing unit. Given to 113.
 次に、前処理部113は、画像信号及び深度画像信号に前処理を施すことで、処理済画像信号及び処理済深度画像信号へとそれぞれ変換する(S11)。処理済画像信号及び処理済深度画像信号は、物体検出部114、判断根拠可視化部115及び警告実行部117に与えられる。 Next, the preprocessing unit 113 performs preprocessing on the image signal and the depth image signal to convert them into a processed image signal and a processed depth image signal, respectively (S11). The processed image signal and the processed depth image signal are given to the object detection unit 114, the judgment basis visualization unit 115, and the warning execution unit 117.
 Next, the object detection unit 114 performs object detection on the processed image signal and the processed depth image signal, supplies object detection result information indicating the resulting object detection result to the judgment basis visualization unit 115, the condition judgment unit 116, and the warning execution unit 117, and further supplies feature map information indicating the feature map obtained in the course of object detection to the judgment basis visualization unit 115 (S12).
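 The point of this step is that the detector exposes its intermediate feature map together with the detection result, so that the judgment basis visualization unit 115 can work from the same activations the detector used. The toy detector below is only a schematic stand-in for this interface, not an actual trained model; its single random filter, its thresholding rule, and its confidence value are assumptions made for illustration.

    import numpy as np

    class TinyDetector:
        """Illustrative stand-in for the object detection unit: returns detections and the feature map."""

        def __init__(self, seed: int = 0):
            # A single random 3x3 filter plays the role of a learned backbone.
            self.kernel = np.random.default_rng(seed).normal(size=(3, 3))

        def __call__(self, image: np.ndarray):
            gray = image.mean(axis=2) if image.ndim == 3 else image
            h, w = gray.shape
            fmap = np.zeros((h - 2, w - 2))
            for dy in range(3):                               # naive 3x3 convolution
                for dx in range(3):
                    fmap += self.kernel[dy, dx] * gray[dy:dy + h - 2, dx:dx + w - 2]
            ys, xs = np.where(fmap > fmap.mean() + 2 * fmap.std())
            detections = []
            if ys.size:                                       # one box around the strongest activations
                detections.append({"box": (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())),
                                   "confidence": float(fmap[ys, xs].mean())})
            return detections, fmap                           # detection result and feature map (S12)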
 Next, the condition judgment unit 116 determines the warning setting level according to the combination of the estimated distance and the reliability included in the object detection result information, and supplies warning setting level information indicating the determined warning setting level to the warning execution unit 117 (S13).
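 One way such a mapping could look is sketched below; the specific distance and reliability thresholds, and the function name, are hypothetical values chosen only to illustrate how a (distance, reliability) pair might be turned into the four warning setting levels used in steps S14 to S16.

    def warning_level(distance_m: float, reliability: float,
                      near_m: float = 10.0, far_m: float = 30.0,
                      high_conf: float = 0.7, low_conf: float = 0.3) -> int:
        """Map an estimated distance and a detection reliability to a warning setting level 0-3."""
        if reliability < low_conf:
            return 0                                  # too unreliable: no warning
        if distance_m <= near_m and reliability >= high_conf:
            return 3                                  # close and reliable: strong warning with sound
        if distance_m <= far_m and reliability >= high_conf:
            return 2                                  # moderately close and reliable: medium warning
        return 1                                      # otherwise: weak warning (judgment basis only)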
 The warning execution unit 117 checks the warning setting level indicated by the supplied warning setting level information (S14). If the warning setting level is 0, the processing ends; if the warning setting level corresponds to a weak warning, the processing proceeds to step S15; and if the warning setting level corresponds to a medium to strong warning, the processing proceeds to step S16.
 In step S15, the warning execution unit 117 generates a warning image by alpha-blending the judgment basis visualization image received from the judgment basis visualization unit 115 onto the processed image indicated by the processed image signal, and generates warning information indicating the warning image. The generated warning information is supplied to the output unit 118. The processing then proceeds to step S17.
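 The alpha blending in step S15 can be sketched as follows, assuming the judgment basis visualization image is supplied as an importance map with values in [0, 1] of the same height and width as an H x W x 3 processed image; the red highlight color and the blending weight are illustrative choices, not part of the disclosure.

    import numpy as np

    def blend_visualization(image: np.ndarray, importance: np.ndarray, alpha: float = 0.4) -> np.ndarray:
        """Alpha-blend an importance map onto an H x W x 3 image, highlighting important regions in red."""
        overlay = np.zeros_like(image, dtype=np.float32)
        overlay[..., 0] = importance * 255.0                   # importance shown in the red channel
        weight = alpha * importance[..., None]                 # per-pixel blending weight
        blended = (1.0 - weight) * image.astype(np.float32) + weight * overlay
        return blended.astype(image.dtype)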
 On the other hand, in step S16, if the warning setting level L is 2, the warning execution unit 117 generates a warning image by superimposing a rectangle at the detection position included in the object detection result received from the object detection unit 114 on the processed image indicated by the processed image signal, and generates warning information indicating the warning image. If the warning setting level L is 3, the warning execution unit 117 superimposes the rectangle at the detection position on the processed image in the same way to generate the warning image, additionally generates an audible warning sound, and generates warning information indicating the warning image and the warning sound. The processing then proceeds to step S17.
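 A minimal sketch of this rectangle overlay is given below, assuming an H x W x 3 image; the box format (x1, y1, x2, y2), the red border color, and the use of a boolean flag to stand in for the audible warning at level 3 are all assumptions made for illustration.

    import numpy as np

    def draw_box(image: np.ndarray, box, color=(255, 0, 0), thickness: int = 2) -> np.ndarray:
        """Superimpose a rectangle at the detected position (x1, y1, x2, y2) on a copy of the image."""
        out = image.copy()
        x1, y1, x2, y2 = box
        out[y1:y1 + thickness, x1:x2 + 1] = color              # top edge
        out[y2 - thickness + 1:y2 + 1, x1:x2 + 1] = color      # bottom edge
        out[y1:y2 + 1, x1:x1 + thickness] = color              # left edge
        out[y1:y2 + 1, x2 - thickness + 1:x2 + 1] = color      # right edge
        return out

    def warning_outputs(level: int, image: np.ndarray, box):
        """Return the warning image for levels 2-3 and a flag indicating whether a warning sound is also required."""
        return draw_box(image, box), (level == 3)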
 In step S17, the output unit 118 issues a warning to the user using at least one of sound and image, based on the warning information generated by the warning execution unit 117.
 FIGS. 9(A) and 9(B) are schematic diagrams showing examples of screen images output to the monitor 103.
 Both the screen image 140 and the screen image 150 assume a situation in which a vehicle 141 or 151 crosses to the right from a corner of the road ahead.
 The screen image 140 is the warning screen image generated by the warning execution unit 117 when the warning setting level is judged to correspond to a medium to strong warning. In the screen image 140, the detection position included in the object detection result is used, and a warning is displayed as a rectangle 142 at that position.
 On the other hand, the screen image 150 is the warning screen image generated by the warning execution unit 117 when the warning setting level is judged to correspond to a weak warning. In the screen image 150, an attention area 152, in which the information indicated by the judgment basis visualization signal is superimposed on the original image, is displayed as the warning content.
 In the embodiment described above, the preprocessing unit 113 resizes the image, and the warning execution unit 117 therefore generates the warning image using the processed image produced by the preprocessing unit 113; however, the present embodiment is not limited to such an example. For example, when the preprocessing unit 113 does not resize the image, the warning execution unit 117 may generate the warning image using the image represented by the image signal digitized by the image signal acquisition unit 112.
 100 warning system, 101 camera, 102 distance measuring sensor, 103 monitor, 104 speaker, 110 warning device, 111 input unit, 112 image signal acquisition unit, 113 preprocessing unit, 114 object detection unit, 115 judgment basis visualization unit, 116 condition judgment unit, 117 warning execution unit, 118 output unit.

Claims (8)

  1.  An information processing device comprising:
      an object detection unit that detects an object from a target image using a feature map; and
      a judgment basis visualization unit that generates a judgment basis visualization image, which is an image in which, in the same pixel arrangement as the target image, the importance of an image region serving as the judgment basis when detecting the object can be identified according to the feature map.
  2.  The information processing device according to claim 1, further comprising:
      a condition judgment unit that judges whether the detection result of the object satisfies a first condition and whether it satisfies a second condition; and
      an output execution unit that outputs the target image when the detection result satisfies the first condition, and outputs an output image using the judgment basis visualization image when the detection result satisfies the second condition.
  3.  The information processing device according to claim 2, wherein the second condition is a condition under which the need to warn about the object is higher than under the first condition.
  4.  The information processing device according to claim 2 or 3, wherein the condition judgment unit further judges whether the detection result satisfies a third condition, and
      the output execution unit outputs, when the detection result satisfies the third condition, a processed image in which the object is enclosed by a rectangle in the target image.
  5.  The information processing device according to claim 4, wherein the third condition is a condition under which the need to warn about the object is higher than under the second condition.
  6.  The information processing device according to claim 4 or 5, wherein the condition judgment unit further judges whether the detection result satisfies a fourth condition, and
      the output execution unit outputs, when the detection result satisfies the fourth condition, the processed image and a sound for giving a warning.
  7.  The information processing device according to claim 6, wherein the fourth condition is a condition under which the need to warn about the object is higher than under the third condition.
  8.  An information processing method comprising:
      detecting an object from a target image using a feature map; and
      generating a judgment basis visualization image, which is an image in which, in the same pixel arrangement as the target image, an image region serving as the judgment basis when detecting the object can be identified according to the feature map.
PCT/JP2020/045684 2020-12-08 2020-12-08 Information processing device and information processing method WO2022123654A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2020/045684 WO2022123654A1 (en) 2020-12-08 2020-12-08 Information processing device and information processing method
JP2021524283A JPWO2022123654A1 (en) 2020-12-08 2020-12-08

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/045684 WO2022123654A1 (en) 2020-12-08 2020-12-08 Information processing device and information processing method

Publications (1)

Publication Number Publication Date
WO2022123654A1 true WO2022123654A1 (en) 2022-06-16

Family

ID=81973386

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/045684 WO2022123654A1 (en) 2020-12-08 2020-12-08 Information processing device and information processing method

Country Status (2)

Country Link
JP (1) JPWO2022123654A1 (en)
WO (1) WO2022123654A1 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016114592A (en) * 2014-12-12 2016-06-23 キヤノン株式会社 Information processing device, information processing method, and program
JP2019156641A (en) * 2018-03-08 2019-09-19 コニカミノルタ株式会社 Image processing device for fork lift and control program

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024161535A1 (en) * 2023-02-01 2024-08-08 三菱電機株式会社 Information processing device, program, information processing system, and information processing method

Also Published As

Publication number Publication date
JPWO2022123654A1 (en) 2022-06-16

Similar Documents

Publication Publication Date Title
CN109017570B (en) Vehicle surrounding scene presenting method and device and vehicle
JP4359710B2 (en) Vehicle periphery monitoring device, vehicle, vehicle periphery monitoring program, and vehicle periphery monitoring method
JP5121389B2 (en) Ultrasonic diagnostic apparatus and method for measuring the size of an object
JP4171501B2 (en) Vehicle periphery monitoring device
EP3499456B1 (en) Circuit device, electronic instrument, and error detection method
EP2150053A1 (en) Vehicle periphery monitoring system, vehicle periphery monitoring program and vehicle periphery monitoring method
EP3534331A1 (en) Circuit device and electronic apparatus
JP5718920B2 (en) Vehicle periphery monitoring device
US9826166B2 (en) Vehicular surrounding-monitoring control apparatus
WO2012004938A1 (en) Device for monitoring vicinity of vehicle
JP2009037622A (en) Method and device for evaluating image
US9197860B2 (en) Color detector for vehicle
JP2008268190A (en) Determination of surface characteristics
WO2022123654A1 (en) Information processing device and information processing method
JP2003028635A (en) Image range finder
JP4813304B2 (en) Vehicle periphery monitoring device
JP5906696B2 (en) Vehicle periphery photographing apparatus and vehicle periphery image processing method
WO2010115020A2 (en) Color and pattern detection system
JP7505596B2 (en) IMAGE PROCESSING APPARATUS, IMAGE PROCESSING METHOD, AND IMAGE PROCESSING PROGRAM
CN110556024B (en) Anti-collision auxiliary driving method and system and computer readable storage medium
JP2011081614A (en) Recognition system, recognition method, and program
KR101194152B1 (en) METHOD AND SySTEM FOR AVOIDING PEDESTRIAN COLLISION
JP4869835B2 (en) Vehicle perimeter monitoring system
JP2018072884A (en) Information processing device, information processing method and program
JP2966683B2 (en) Obstacle detection device for vehicles

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2021524283

Country of ref document: JP

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20965038

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20965038

Country of ref document: EP

Kind code of ref document: A1