WO2024069778A1 - Object detection system, camera, and object detection method - Google Patents

Object detection system, camera, and object detection method

Info

Publication number
WO2024069778A1
WO2024069778A1 (PCT/JP2022/036062)
Authority
WO
WIPO (PCT)
Prior art keywords
image
camera
area
camera image
detection
Prior art date
Application number
PCT/JP2022/036062
Other languages
French (fr)
Japanese (ja)
Inventor
一成 岩永
Original Assignee
株式会社日立国際電気
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 株式会社日立国際電気
Priority to PCT/JP2022/036062
Publication of WO2024069778A1


Abstract

The purpose of the present invention is to facilitate the setting of an appropriate distant area for realizing high-speed, high-accuracy object detection regardless of the camera installation conditions. An image processing device 120 according to the present disclosure comprises: a function for setting, on the basis of positions and sizes of a plurality of object frames (221, 222) that respectively surround a plurality of persons that are included in a camera image (200) that is captured in advance by an image capture device 110, a distant area that includes an area in which the sizes of the object frames in the image area of the camera image become equal to or less than a threshold value; and a function for generating, on the basis of a camera image that is captured during operation, a reduced image that is a reduction of the camera image and a trimmed image in which the distant area portion is trimmed from the camera image, inputting both the reduced image and the trimmed image into a low-resolution learning model to perform person detection, and synthesizing and outputting the detection result based on the reduced image and the detection result based on the trimmed image.

Description

OBJECT DETECTION SYSTEM, CAMERA, AND OBJECT DETECTION METHOD
The present invention relates to an object detection system that detects objects from camera images taken of a surveillance area.
Conventionally, object detection systems that detect targets by analyzing camera images of a surveillance area with a learning model for object detection have been researched and developed. Object detection can be performed using a variety of methods that utilize AI (Artificial Intelligence) technology.
In the first conventional method, shown in FIG. 1, a high-resolution camera image 11, such as a 4K video image, is input as is into a high-resolution learning model 12 for analysis. As the analysis result, the high-resolution learning model 12 outputs a result image 13 in which detection frames indicating detected objects are added to the input image. The result image 13 is displayed on a monitor terminal, either as is or after being adjusted for user confirmation. The first conventional method achieves highly accurate detection, but analysis by the high-resolution learning model 12 takes time, and the model itself also requires a significant amount of time to train.
In the second conventional method, shown in FIG. 2, a high-resolution camera image 21 is reduced to a predetermined size (e.g., VGA size) to generate a reduced image 22, which is input into a low-resolution learning model 23 for analysis. The low-resolution learning model 23 outputs, as the analysis result, a result image 24 in which detection frames indicating detected objects are added to the input image. Based on this result image 24, a final output image 25 is generated in which the detection frames are reflected in the original camera image 21 (or a camera image adjusted for user confirmation) and displayed on a monitor terminal. The second conventional method allows the low-resolution learning model 23 to analyze quickly, but image reduction makes small, distant objects hard to see, lowering the detection rate.
In the third conventional method, shown in FIG. 3, a reduced image 32 is generated by shrinking a high-resolution camera image 31 to a predetermined size, and a cropped image 33 is generated by cutting out the distant area from the camera image; both are input into a low-resolution learning model 34 for analysis. As the analysis results, the low-resolution learning model 34 outputs a result image 35 showing the detection result based on the reduced image 32 and a result image 36 showing the detection result based on the cropped image 33. These result images 35 and 36 are combined to generate a final output image 37 in which the detection frames are reflected in the original camera image 31, and this is displayed on the monitor terminal.
The third conventional method can achieve high-speed, high-precision object detection. When the installation conditions of the camera are fixed, as with the vehicle-mounted camera disclosed in Patent Document 1, the method poses no particular problem because the distant area in the camera image is also fixed. However, when diverse installation conditions are expected depending on the situation at the site, as with surveillance cameras, the distant-area setting differs for each camera. Moreover, setting an appropriate distant area requires specialized knowledge and is a tedious, time-consuming task.
Patent Document 1: JP 2020-4366 A
The present invention was made in view of the conventional circumstances described above, and aims to make it easy to set an appropriate far region for achieving high-speed, high-precision object detection, regardless of the camera installation conditions.
To achieve the above object, an object detection system according to one aspect of the present invention is configured as follows. In an object detection system that detects objects from camera images taken of a surveillance area, before the start of operation, a process is executed to set a distant area including an area in the image area of a camera image, captured in advance, where the size of the object frames is equal to or smaller than a threshold, based on the positions and sizes of multiple object frames each surrounding one of the multiple objects included in that camera image. During operation, a process is executed to generate, from a captured camera image, a first image obtained by reducing the camera image and a second image obtained by cutting out the distant-area portion of the camera image, to input both the first image and the second image into a predetermined learning model to detect objects, and to combine and output the detection result based on the first image and the detection result based on the second image.
Here, the multiple object frames may be set by user operations on the previously captured camera image. Alternatively, the multiple object frames may be set based on detection results obtained by inputting the previously captured camera image into a learning model.
A camera according to another aspect of the present invention is configured as follows. A camera that photographs a surveillance area to detect objects has a function of setting a distant area including an area in the image area of a camera image, photographed before the start of operation, where the size of the object frames is equal to or smaller than a threshold, based on the positions and sizes of multiple object frames each surrounding one of the multiple objects included in that camera image; and a function of generating, from a camera image photographed during operation, a first image obtained by reducing the camera image and a second image obtained by cutting out the distant-area portion of the camera image, inputting both the first image and the second image into a predetermined learning model to detect objects, and combining and outputting the detection result based on the first image and the detection result based on the second image.
An object detection method according to yet another aspect of the present invention is configured as follows. The object detection method detects an object from camera images taken of a surveillance area and includes the steps of: before the start of operation, setting a distant area including an area in the image area of a camera image, taken in advance, where the size of the object frames is equal to or smaller than a threshold, based on the positions and sizes of multiple object frames each surrounding one of the multiple objects included in that camera image; and, during operation, generating, from a captured camera image, a first image obtained by reducing the camera image and a second image obtained by cutting out the distant-area portion of the camera image, inputting both the first image and the second image into a predetermined learning model to detect the object, and combining and outputting the detection result based on the first image and the detection result based on the second image.
According to the present invention, an appropriate far region for achieving high-speed, high-precision object detection can be set easily, regardless of the camera installation conditions.
FIG. 1 is a diagram showing an overview of object detection by the first conventional method.
FIG. 2 is a diagram showing an overview of object detection by the second conventional method.
FIG. 3 is a diagram showing an overview of object detection by the third conventional method.
FIG. 4 is a diagram showing a configuration example of an object detection system according to an embodiment of the present invention.
FIG. 5 is a diagram showing an example of setting a detection area and object frames for a camera image.
FIG. 6 is a diagram showing an example of setting a far region for a camera image.
FIG. 7 is a diagram showing another example of setting a far region for a camera image.
FIG. 8 is a diagram showing another example of setting object frames for a camera image.
One embodiment of the present invention will be described with reference to the drawings. FIG. 4 shows a configuration example of an object detection system according to one embodiment of the present invention. As shown in FIG. 4, the object detection system of this example includes an imaging device 110, an image processing device 120, a monitor terminal 130, and an operation terminal 140. These devices can be communicably connected to one another by wire or wirelessly, and an arbitrary network such as the Internet may be interposed between them.
The imaging device 110 is a device such as a surveillance camera that captures images of a monitored area. In this example, a camera capable of outputting high-quality camera images such as 4K video is used as the imaging device 110. The imaging device 110 can be installed under installation conditions that suit the situation at the site, so its angle of view and tilt are not particularly limited. However, for simplicity of explanation, the imaging device 110 in this example is installed in a posture with virtually no tilt and captures camera images that view the monitored area along a horizontal or nearly horizontal line of sight. The high-resolution camera images captured by the imaging device 110 are transmitted to the image processing device 120.
The image processing device 120 is, for example, a computer equipped with hardware resources such as a processor and memory, and is configured to read programs implementing the functions described below from the memory and execute them with the processor. The image processing device 120 has a function of performing object detection, based on the high-resolution camera images received from the imaging device 110 during system operation, in the same manner as the third conventional method described above. The image processing device 120 further has a function of setting the far region. The far region is set before system operation starts; its setting is explained in detail below.
Here, the learning model used by the image processing device 120 detects people contained in images; it has, for example, a yolo v3 network structure and is trained on 640 x 360 pix input images. A threshold Th represents the size (height) in the image of a person that the model can detect stably; here, Th = 50 pix. In other words, a person 50 pix or taller in the image can be detected stably, but detection accuracy drops below that size. Note that Th = 50 pix is merely an example; the value varies with the model structure and the data provided during training.
To set the far region, the image processing device 120 receives from the user a detection area to be subject to person detection in a high-resolution camera image captured in advance by the imaging device 110. Similarly, it receives from the user multiple object frames, each surrounding one of the multiple people included in the camera image. In FIG. 5, one detection area 210 and two object frames 221 and 222 are set for a camera image 200 captured in advance. Detection areas may be set in two or more locations, and object frames may be set in three or more locations.
These settings are input by the user through the operation terminal 140 and provided to the image processing device 120. In this example, the image processing device 120 displays the camera image 200 on the operation terminal 140 and accepts the detection-area and object-frame settings through operations on that image. The above is only one example; the method for setting the detection area and object frames is not particularly limited.
After receiving the detection area and object frame settings, the image processing device 120 estimates, by linear interpolation, how large a person appears at each coordinate within the detection area. As an example, in a 4K (3840 x 2160 pix) camera image, let the topmost Y coordinate of the detection area be y_u = 400 and the bottommost y_b = 2150; let the first object frame have Y coordinate y1 = 500 and height h1 = 75 pix, and the second object frame Y coordinate y2 = 1000 and height h2 = 450 pix. Assume also that every person's height is h = 170 cm. Defining the resize ratio as r, converting a 4K (3840 x 2160 pix) camera image to the predetermined size (e.g., 640 x 360 pix) gives r = 640/3840 ≈ 0.1667.
Here, the height H of a person at an arbitrary Y coordinate can be expressed by the following formulas:

  p1 = h1 / h
  p2 = h2 / h
  H = (Y × (p2 − p1) / (y2 − y1) + (p2 − (p2 − p1) / (y2 − y1) × y1)) × h × r
According to the above formula, H = 50 when Y = 199.94, matching the threshold Th. In an image reduced from 3840 x 2160 to 640 x 360, positions with Y = 200 or greater therefore form the range in which people can be detected stably. Setting the remaining image range, i.e., (0, 0) to (3840, 200), as the far region thus satisfies the condition to the fullest extent across all areas of the camera image 200.
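By way of a hedged illustration, the following Python sketch implements the interpolation formula above with this section's example values and solves it for the far-region boundary row. The patent defines no code, so the function and variable names here are ours.

```python
# Sketch of the far-region boundary computation described above,
# using the formula and example values from this section.

h = 170.0            # assumed person height (cm)
r = 640.0 / 3840.0   # resize ratio from 4K width to the 640 x 360 input
Th = 50.0            # stable-detection height threshold in the reduced image (pix)

y1, h1 = 500.0, 75.0     # first object frame: Y coordinate, height (pix)
y2, h2 = 1000.0, 450.0   # second object frame: Y coordinate, height (pix)

p1, p2 = h1 / h, h2 / h
slope = (p2 - p1) / (y2 - y1)
intercept = p2 - slope * y1   # as in the formula above

def estimated_height(y: float) -> float:
    """Estimated height H (pix) of a person at row y in the reduced image."""
    return (y * slope + intercept) * h * r

# Solving H(y) = Th for y gives the boundary row of the far region:
# rows above it are where people appear at or below the threshold size.
y_boundary = (Th / (h * r) - intercept) / slope
```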
In the example of FIG. 6, 14 far-region frames 230 (7 horizontal x 2 vertical), each 640 x 360 in size, are set over the image range (0, 0) to (3840, 200). In other words, the system is set to cut out 14 cropped images from the camera image during operation. As shown in FIG. 6, the multiple far-region frames 230 may be positioned so that they overlap their neighbors at the boundaries. This avoids the drop in detection accuracy that would occur if a person standing on the boundary between adjacent far-region frames were cut off in the cropped images.
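The patent fixes the frame count (7 x 2) and frame size (640 x 360) for FIG. 6 but does not spell out a placement rule. The Python sketch below is one way to space the frames evenly with overlap; the `tile_origins` helper and the assumed 560-pix vertical extent are illustrative, not taken from the patent.

```python
# Evenly spaced, overlapping far-region frames as in FIG. 6 (a sketch;
# the patent does not prescribe this placement rule).

def tile_origins(extent: int, tile: int, count: int) -> list[int]:
    """Origins of `count` tiles of length `tile` spanning `extent` pixels.
    Adjacent tiles overlap whenever count * tile > extent."""
    if count == 1:
        return [0]
    step = (extent - tile) / (count - 1)
    return [round(i * step) for i in range(count)]

# 7 columns across the 3840-pix width: adjacent frames overlap by about
# 107 pix, so a person on a frame boundary appears whole in some tile.
xs = tile_origins(3840, 640, 7)

# 2 rows assumed to span 560 pix together (two 360-pix rows overlapping
# by 160 pix), so the (0, 0)-(3840, 200) far region lies inside them.
ys = tile_origins(560, 360, 2)

frames = [(x, y, x + 640, y + 360) for y in ys for x in xs]
assert len(frames) == 14  # 7 x 2, matching FIG. 6
```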
The size of the far-region frames may be any other value that satisfies the above formula. For example, the image may be cropped at a size larger than the expected input size of the low-resolution learning model and then reduced to match it, such as cropping at 1280 x 720 and resizing to 640 x 360. Also, as shown in FIG. 7, far-region frames 230 need not be set outside the detection area 210 of the camera image 200; in FIG. 7, fewer frames than the 14 of FIG. 6, namely seven far-region frames 230, are set.
The image processing device 120 performs the above processing before system operation to set the far region for the camera image of the imaging device 110. During subsequent system operation, the image processing device 120 performs object detection on the camera images received from the imaging device 110, following the far-region setting, in the same manner as the third conventional method described above. That is, the image processing device 120 generates a reduced image obtained by shrinking the high-resolution camera image and cropped images obtained by cutting out the far-region portions of the camera image: 14 cropped images for the setting of FIG. 6, and 7 cropped images for the setting of FIG. 7. The image processing device 120 inputs both the reduced image and the cropped images into the low-resolution learning model to perform person detection. It then combines the detection result based on the reduced image with the multiple detection results based on the multiple cropped images and outputs the result as the final output image, which is transmitted to the monitor terminal 130 and displayed there.
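To make this run-time flow concrete, here is a minimal Python sketch of the two detection paths and the mapping of boxes back to full-image coordinates. It is an illustration under assumptions: `detect` stands in for the low-resolution learning model and is assumed to return (x, y, w, h, score) boxes in its input image's coordinates, an interface the patent does not specify; deduplication of overlapping detections (e.g., NMS) is omitted.

```python
import cv2  # OpenCV, assumed available for resizing

def run_detection(frame, far_frames, detect, input_size=(640, 360)):
    """Sketch of the operating-phase flow: detect on the reduced whole
    image and on each far-region crop, then merge the results in
    full-image coordinates. `frame` is the high-resolution image
    (H x W x 3 array); `far_frames` is a list of (x0, y0, x1, y1) boxes."""
    H, W = frame.shape[:2]
    results = []

    # Path 1: reduce the whole camera image, detect, scale boxes back up.
    reduced = cv2.resize(frame, input_size)
    sx, sy = W / input_size[0], H / input_size[1]
    results += [(x * sx, y * sy, w * sx, h * sy, s)
                for (x, y, w, h, s) in detect(reduced)]

    # Path 2: detect in each far-region crop and translate boxes back
    # into full-image coordinates (resizing first if the crop is larger
    # than the model input, e.g., 1280 x 720 -> 640 x 360).
    for (x0, y0, x1, y1) in far_frames:
        crop = frame[y0:y1, x0:x1]
        if crop.shape[:2] != (input_size[1], input_size[0]):
            crop = cv2.resize(crop, input_size)
        rx, ry = (x1 - x0) / input_size[0], (y1 - y0) / input_size[1]
        results += [(x0 + x * rx, y0 + y * ry, w * rx, h * ry, s)
                    for (x, y, w, h, s) in detect(crop)]

    return results  # combined detections from both paths
```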
The reduced image used in the above object detection may be generated by removing the areas outside the detection area and then shrinking to the expected input size of the low-resolution learning model. If the aspect ratio after removing the areas outside the detection area differs from that of the expected input size, the range to be removed may be widened so that the aspect ratios match. Alternatively, the image may be reduced while keeping its aspect ratio after the removal, with the missing portion padded.
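A common way to realize the keep-aspect-ratio-and-pad variant mentioned above is a letterbox resize. The sketch below is our illustration of that idea under the same assumptions as the previous sketches, not the patent's implementation.

```python
import cv2
import numpy as np

def reduce_with_padding(img, input_size=(640, 360)):
    """Resize `img` to fit inside input_size (w, h) without changing its
    aspect ratio, padding the remainder (bottom/right) with black."""
    tw, th = input_size
    h, w = img.shape[:2]
    scale = min(tw / w, th / h)
    nw, nh = max(1, int(w * scale)), max(1, int(h * scale))
    resized = cv2.resize(img, (nw, nh))
    canvas = np.zeros((th, tw) + img.shape[2:], dtype=img.dtype)
    canvas[:nh, :nw] = resized  # remaining pixels stay as padding
    return canvas
```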
As described above, the image processing device 120 of this example has a function of setting a far region including an area in the image area of a camera image (200), captured in advance by the imaging device 110, where the size of the object frames is equal to or smaller than a threshold (Th), based on the positions and sizes of the multiple object frames (221, 222) each surrounding one of the multiple people included in that camera image; and a function of generating, from a camera image captured during system operation, a reduced image obtained by shrinking the camera image and cropped images obtained by cutting out the far-region portions of the camera image, inputting both the reduced image and the cropped images into the low-resolution learning model to perform person detection, and combining and outputting the detection result based on the reduced image and the detection results based on the cropped images. With this configuration, an appropriate far region for achieving high-speed, high-precision object detection can be set easily, regardless of the installation conditions of the imaging device 110.
The above explanation used a person detection system that detects people from camera images as an example, but this technology can be applied to any object detection system that detects various other objects. The above explanation also assumed that the imaging device 110 is installed in a posture with substantially no tilt; however, the installation manner of the imaging device 110 is not limited to this, and the imaging device 110 may, for example, be installed at an angle. In that case, setting at least three object frames for the camera image makes it possible to set the far region appropriately.
In the above description, the object frames are set by the user operating the operation terminal 140, but the object frame setting can also be automated. Specifically, as shown in FIG. 8, multiple provisional far-region frames 240 are set so as to cover the entire camera image captured in advance by the imaging device 110, and the system is given a trial run. In the example of FIG. 8, a total of 35 provisional far-region frames 240 (7 horizontal x 5 vertical) are set. Trial-running the system with these settings increases the processing load on the image processing device 120 and takes some time, but it can accurately detect the people included in the camera image. The multiple object frames surrounding each person can therefore be set automatically, without user operation (see the sketch below). Automating the object frame setting is particularly effective when, for example, the attitude of the imaging device 110 is changed during system operation.
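As an illustration of this trial run, the Python sketch below tiles the whole image with a 7 x 5 grid of provisional frames and keeps every person box the model returns as a candidate object frame. The grid-cell sizing and the `detect` interface are assumptions carried over from the earlier sketches.

```python
import cv2

def auto_object_frames(frame, detect, input_size=(640, 360), cols=7, rows=5):
    """Trial-run sketch for FIG. 8: split the image into cols x rows
    provisional frames, run the detector on each resized cell, and
    collect person boxes in full-image coordinates as object frames."""
    H, W = frame.shape[:2]
    cw, ch = W / cols, H / rows
    boxes = []
    for j in range(rows):
        for i in range(cols):
            x0, y0 = round(i * cw), round(j * ch)
            x1, y1 = round((i + 1) * cw), round((j + 1) * ch)
            crop = cv2.resize(frame[y0:y1, x0:x1], input_size)
            rx, ry = (x1 - x0) / input_size[0], (y1 - y0) / input_size[1]
            boxes += [(x0 + x * rx, y0 + y * ry, w * rx, h * ry, s)
                      for (x, y, w, h, s) in detect(crop)]
    return boxes  # slow (35 crops per image) but needs no user-drawn frames
```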
In the above description, the imaging device 110 and the image processing device 120 are separate devices, but they may be integrated. That is, the imaging device 110 may have not only a function of capturing camera images, but also a function of setting a far region including an area in the image area of a camera image where the size of the object frames is equal to or smaller than a threshold, based on the positions and sizes of multiple object frames each surrounding one of the multiple people included in a camera image captured before system operation; and a function of generating, from a camera image captured during system operation, a reduced image obtained by shrinking the camera image and cropped images obtained by cutting out the far-region portions of the camera image, inputting both the reduced image and the cropped images into the low-resolution learning model to perform person detection, and combining and outputting the detection result based on the reduced image and the detection results based on the cropped images.
The embodiments of the present invention described above are merely illustrative and do not limit the technical scope of the present invention. The present invention can take various other embodiments, and various modifications such as omissions and substitutions can be made without departing from the gist of the present invention. These embodiments and their modifications are included in the scope and gist of the invention described in this specification and in the scope of the invention described in the claims and their equivalents.
The present invention can be provided not only as the devices described above or as systems composed of these devices, but also as methods executed by these devices, programs for implementing the functions of these devices with a processor, and storage media storing such programs in a computer-readable manner.
The present invention relates to an object detection system that detects objects from camera images taken of a surveillance area.
110: imaging device; 120: image processing device; 130: monitor terminal; 140: operation terminal

Claims (5)

1.  In an object detection system that detects objects from camera images taken of a surveillance area,
    before the start of operation, a process is executed to set a distant area including an area in the image area of a camera image, captured in advance, where the size of the object frames is equal to or smaller than a threshold, based on the positions and sizes of a plurality of object frames each surrounding one of a plurality of objects included in that camera image; and
    during operation, a process is executed to generate a first image obtained by reducing a captured camera image and a second image obtained by cutting out a portion of the distant area from the camera image, to input both the first image and the second image into a predetermined learning model to detect objects, and to synthesize and output the detection result based on the first image and the detection result based on the second image.
2.  The object detection system according to claim 1, wherein the plurality of object frames are set by a user operation on the pre-captured camera image.
3.  The object detection system according to claim 1, wherein the plurality of object frames are set based on detection results obtained by inputting the pre-captured camera image into the learning model.
4.  A camera that captures images of a surveillance area to detect objects, the camera having:
    a function of setting a distant area including an area in the image area of a camera image where the size of the object frames is equal to or smaller than a threshold, based on the positions and sizes of a plurality of object frames each surrounding one of a plurality of objects included in a camera image captured before the start of operation; and
    a function of generating, based on a camera image captured during operation, a first image obtained by reducing the camera image and a second image obtained by cutting out a portion of the distant area from the camera image, inputting both the first image and the second image into a predetermined learning model to detect objects, and synthesizing and outputting the detection result based on the first image and the detection result based on the second image.
5.  An object detection method for detecting an object from camera images of a surveillance area, comprising the steps of:
    before the start of operation, setting a distant area including an area in the image area of a camera image where the size of the object frames is equal to or smaller than a threshold, based on the positions and sizes of a plurality of object frames each surrounding one of a plurality of objects included in a camera image captured in advance; and
    during operation, generating a first image obtained by reducing a captured camera image and a second image obtained by cutting out a portion of the distant area from the camera image, inputting both the first image and the second image into a predetermined learning model to detect the object, and synthesizing and outputting the detection result based on the first image and the detection result based on the second image.
PCT/JP2022/036062 2022-09-28 2022-09-28 Object detection system, camera, and object detection method WO2024069778A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/036062 WO2024069778A1 (en) 2022-09-28 2022-09-28 Object detection system, camera, and object detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/036062 WO2024069778A1 (en) 2022-09-28 2022-09-28 Object detection system, camera, and object detection method

Publications (1)

Publication Number Publication Date
WO2024069778A1 (en)

Family

ID=90476784

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/036062 WO2024069778A1 (en) 2022-09-28 2022-09-28 Object detection system, camera, and object detection method

Country Status (1)

Country Link
WO (1) WO2024069778A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012164804A1 (en) * 2011-06-02 2012-12-06 パナソニック株式会社 Object detection device, object detection method, and object detection program
JP2020004366A (en) * 2018-06-25 2020-01-09 パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカPanasonic Intellectual Property Corporation of America Information processing device, information processing method, and program

