CN109726741B - Method and device for detecting multiple target objects - Google Patents

Method and device for detecting multiple target objects

Info

Publication number
CN109726741B
CN109726741B (application CN201811488003.9A)
Authority
CN
China
Prior art keywords
target
image
detection
camera
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811488003.9A
Other languages
Chinese (zh)
Other versions
CN109726741A (en)
Inventor
夏炎
刘镇
吕李娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University of Science and Technology
Original Assignee
Jiangsu University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University of Science and Technology filed Critical Jiangsu University of Science and Technology
Priority to CN201811488003.9A
Publication of CN109726741A
Application granted
Publication of CN109726741B
Legal status: Active

Links

Images

Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a method and a device for detecting multiple target objects. The method comprises the following steps: connecting the target object detection device; creating a pre-trained multi-target object detection model using a convolutional neural network; installing deep learning framework software; sequentially reading each frame of image from the camera; reducing the image read from the camera to 448×448 pixels; dividing the reduced image into a 7×7 grid of equal-sized cells; judging whether an object is in a 7×7 grid cell using the coordinate values; sending the grid cells containing objects into the pre-trained network model to obtain frame regression values; outputting the frame regression values of 90 object categories for each grid cell; outputting the position value and confidence of each frame regression object; setting a threshold to filter out low-scoring frames; and performing non-maximum suppression on the retained frames and merging them to obtain the final detection result. The invention solves the prior-art problems of complicated image feature extraction design, low detection speed and poor multi-target concurrency.

Description

Method and device for detecting multiple target objects
Technical Field
The invention belongs to the technical field of computer image processing and machine vision, relates to a method for detecting multiple target objects, and particularly relates to a method and a device for detecting multiple target objects using a two-dimensional video camera.
Background
Conventional target detection generally uses a sliding-window framework with three main steps: (1) using sliding windows of different sizes to frame part of the image as a candidate region; (2) extracting visual features from the candidate region, such as Haar features commonly used for face detection, and HOG features commonly used for pedestrian detection and general target detection; (3) performing recognition with a classifier, such as a conventional SVM model. Among conventional methods, the multi-scale deformable part model (DPM) treats an object as a set of parts (such as the nose and mouth of a face) and describes the object through the relations among these parts, which fits the non-rigid character of many natural objects well. DPM can be regarded as an extension of HOG+SVM; it inherits the advantages of both and achieves good results on tasks such as face detection and pedestrian detection, but DPM is relatively complex and slow to detect, so many improved methods have appeared. Among them, target detection based on deep learning has been a research hotspot in recent years. In its early stage, deep-learning-based target detection struggled to achieve practical breakthroughs; for example, the mAP of OverFeat on the ILSVRC2013 test set reached only 24.3%. Many subsequent improvements combined traditional vision methods, such as Selective Search and image pyramids, with deep learning; these are all based on region proposals. This approach requires substantial computing resources and has difficulty handling multiple targets simultaneously. When detecting multiple targets in a real-time camera video stream, accelerated training on multiple GPU graphics cards is often required, so target object detection equipment is poorly portable and is difficult to apply in end-to-end real-time processing scenarios without a network and with high mobility requirements.
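As an illustration of this conventional pipeline, the following is a minimal sketch of the candidate-region generation in step (1); the window sizes and stride are arbitrary assumptions, not values from any cited method:

```python
def sliding_windows(image_shape, window_sizes=((64, 64), (128, 128)), stride=32):
    """Yield (x, y, w, h) candidate regions by sliding windows of several sizes over the image."""
    H, W = image_shape[:2]
    for wh, ww in window_sizes:
        for y in range(0, H - wh + 1, stride):
            for x in range(0, W - ww + 1, stride):
                # each candidate is then described by Haar/HOG features and classified by an SVM
                yield x, y, ww, wh

for box in list(sliding_windows((256, 256)))[:3]:
    print(box)  # (0, 0, 64, 64), (32, 0, 64, 64), (64, 0, 64, 64)
```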
Disclosure of Invention
The invention aims to overcome the problems and defects of the prior art by providing a method and a device for detecting multiple target objects with a two-dimensional video camera.
The method detects target objects with lower power consumption, lower computing resource consumption and good portability; it is suitable for network-free environments and realizes end-to-end real-time target object detection. The invention can detect up to 90 target categories.
In order to achieve the above purpose, the technical scheme provided by the invention is as follows:
a multi-target object detection method comprising the steps of:
step 1: connecting a target object detection device;
step 2: creating a pretrained multi-target object detection model by using a convolutional neural network;
step 3: installing deep learning framework application software on the target object detection device;
step 4: using camera-reading application software, sequentially reading each frame of image from the camera;
step 5: reducing the image read from the camera to 448×448 pixels;
step 6: dividing the reduced image into a 7×7 grid of equal-sized cells;
step 7: judging whether the object is in a 7×7 grid cell using the coordinate values;
step 8: sending the grid cells judged in step 7 to contain objects into the pre-trained network model to obtain frame regression values;
step 9: outputting the frame regression values of 90 object categories for each grid cell through the 90-category discriminators;
step 10: outputting the position value and confidence of each frame regression object through the 90-category discriminators;
step 11: after the position value and confidence value of each frame are obtained, setting a threshold and filtering out low-scoring frames;
step 12: performing non-maximum suppression on the retained frames and merging them to obtain the final detection result.
Further, the devices in step 1 are connected as follows:
the mobile terminal graphics card chip is connected with the embedded main board, the camera is connected with the embedded main board, the power adapter is connected with the embedded main board, and the hard disk is connected with the embedded main board.
Further, the specific contents and steps of creating the pre-trained multi-target object detection model in step 2 are as follows:
(1) Preparing training sample pictures of the target objects to be detected;
(2) Manually calibrating the position and size frames of the targets in the sample pictures;
(3) Scaling the calibrated sample pictures down to 448×448 pixels;
(4) Performing feature extraction on the reduced samples with a 24-layer convolutional neural network to obtain frame regression coordinates, the confidence that each frame contains an object, and category probabilities;
(5) Performing non-maximum suppression on all frames, and outputting a single frame after screening.
Further, in step 7, whether the object is in a 7×7 grid cell is determined by comparing the coordinates of the object's center point with the coordinate range of the grid cell.
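A minimal sketch of this center-point test, assuming a 448×448 image and a 7×7 grid (the function name and the clamping of border pixels into the last cell are illustrative choices, not specified in the patent):

```python
def grid_cell_of(center_x, center_y, img_size=448, grid=7):
    """Map an object's center point to the (row, col) of the grid cell that must predict it."""
    cell = img_size / grid                      # 64 pixels per cell for a 448x448 image
    col = min(int(center_x // cell), grid - 1)  # clamp the right/bottom border into the last cell
    row = min(int(center_y // cell), grid - 1)
    return row, col

print(grid_cell_of(100, 430))  # (6, 1): the cell in row 6, column 1 predicts this object
```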
Further, the detection method further comprises comparing the confidence of the object with a threshold of the target image to judge whether the video to be detected contains the target image: the confidence value is compared with the threshold of the target image; when the confidence value is greater than or equal to the threshold, the video to be detected is judged to contain the target image; when the confidence value is smaller than the threshold, the video to be detected is judged not to contain the target image.
Further, the detection method further comprises comparing the position value of the object with the position value in the pre-trained multi-target object detection model to judge the accuracy of target detection: the intersection-over-union between the detected position and the manually calibrated target position in the sample is computed; when it is greater than or equal to the intersection-over-union threshold, the detection is judged correct; when it is smaller than the threshold, the detection is judged incorrect.
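A minimal sketch of this intersection-over-union check (the (x1, y1, x2, y2) box format and the 0.5 default threshold are assumptions; the patent only states that an IoU threshold is used):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
             + (box_b[2] - box_b[0]) * (box_b[3] - box_b[1]) - inter)
    return inter / union if union else 0.0

def detection_correct(pred_box, calibrated_box, iou_threshold=0.5):
    """Judge a detection correct when its IoU with the manually calibrated box meets the threshold."""
    return iou(pred_box, calibrated_box) >= iou_threshold
```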
In order to achieve the above purpose, another technical scheme provided by the invention is as follows:
the device for detecting the multi-target object comprises a mobile terminal display card chip, an embedded main board, a camera, a power adapter and a hard disk, wherein the embedded main board is a hardware platform for detecting the whole multi-target object; the mobile terminal display card chip is an embedded image processing module for processing the video stream image; the camera is used for acquiring video images; the power adapter is responsible for supplying power to the embedded motherboard; the hard disk is used for storing data; the embedded main board is respectively connected with the display card chip, the camera, the power adapter and the hard disk.
The method for detecting the multi-target object has the characteristics and beneficial effects that:
1. the method uses a low-power supply, so the power consumption of the device is lower than that of deep learning target detection on computers and servers;
2. the method can be used without a network and does not need to transmit data to a server for real-time computation, so it can be used outdoors and in environments with poor network coverage;
3. the device is small, runs on embedded hardware, and is suitable for terminal applications of multi-target object recognition;
4. the method processes 18-32 frames per second (FPS) with 75.1% accuracy on the VOC 2007 dataset, so high precision is achieved under low power consumption;
5. the method can process 90 categories of multiple targets simultaneously.
Drawings
Fig. 1 is a flowchart of a multi-target object detection method according to the present invention.
Fig. 2 is a device connection diagram according to the present invention.
FIG. 3 is a flow chart of creating a pre-trained multi-target object detection model in accordance with the present invention.
Fig. 4 is a flow chart of installation of device software for multi-target object detection according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
Fig. 1 is a flowchart of a multi-target object detection method provided by the present invention. The invention provides a method for detecting a multi-target object, which comprises the following steps:
s101, connecting a target object detection device and connecting devices required by a multi-target object detection method;
s102, creating a pre-training multi-target object detection model by using a convolutional neural network. A model of multi-objective detection needs to be pre-trained before installation into the device, with the goal of detection without a network;
s103, installing deep learning framework application software on the target object detection equipment device;
s104, calling application software by using a camera, and sequentially reading each frame of image from the video camera;
s105, scaling the image read in the S104 to 448 x 448 pixels by using image scaling application software;
s106, dividing the image scaled in the S105 into grids of 7*7 by using image cutting application software;
s107, whether the broken object is in the grid cell of 7*7 or not is determined by the coordinate values. If the center coordinates of the object's frame fall within this grid, this grid is used to predict the object; if the object's bounding box center coordinates are not in the grid, the grid is not used to predict the object. Each grid predicts multiple frame regressions, each frame regressions needs to be accompanied by a confidence value for prediction besides the position of the frame regressions;
s108, sending the picture of the judging object in the grid of S107 to a pretrained multi-target object detection model of S102;
s109, calculating an output boundary frame position value and a confidence value by utilizing a multi-target object detection model network, wherein the confidence value represents the confidence of the predicted frame containing an object and the predicted multi-accuracy information of the frame, and calculating the confidence value of the position by adopting the following formula:
$\text{confidence} = \Pr(\text{Object}) \times \mathrm{IOU}^{\text{truth}}_{\text{pred}}$
As described in the above equation, the first term is 1 if an object falls in the grid cell and 0 otherwise. The second term is the intersection-over-union between the predicted bounding box regression and the actual manually labeled bounding box. Each frame regression predicts 5 values: the center-point abscissa, the center-point ordinate, the width, the height, and the confidence value; each grid cell also predicts 90 category probabilities. With 7×7 grid cells, each predicting 2 bounding box regressions plus 90 categories, the output is a tensor of 7×7×(5×2+90).
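The output layout and the confidence definition can be summarized in a short sketch (variable names are assumptions; only the tensor shape and the formula come from the description):

```python
import numpy as np

S, B, C = 7, 2, 90                       # grid size, frame regressions per cell, classes
output = np.zeros((S, S, B * 5 + C))     # the 7x7x(5*2+90) = 7x7x100 output tensor

def box_confidence(object_in_cell, iou_pred_truth):
    """confidence = Pr(Object) * IOU(pred, truth); Pr(Object) is 1 if an object falls in the cell, else 0."""
    return (1.0 if object_in_cell else 0.0) * iou_pred_truth

print(output.shape)                # (7, 7, 100)
print(box_confidence(True, 0.8))   # 0.8
```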
S110, filtering the values obtained in S109, using a confidence threshold of 0.5.
In the invention, the step of comparing the position confidence value and the confidence score with the target image threshold to judge whether the video to be detected contains the target image comprises the following steps:
comparing the position confidence value and the confidence score with the threshold of the target image;
if the score is greater than or equal to the threshold of the target image, judging that the video to be detected contains the target image;
if the score is smaller than the threshold of the target image, judging that the video to be detected does not contain the target image.
S111, after the video to be detected is judged in S110 to contain the target image, the position value and the confidence value of each frame are obtained; a threshold is set and low-scoring frames are filtered out;
S112, performing non-maximum suppression on the retained frames and merging them to obtain the final detection result, keeping only the single frame with the highest value as the detection result.
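A minimal sketch of the non-maximum suppression in S112 (greedy highest-score-first merging; the 0.5 overlap threshold is an assumption):

```python
def _iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box and drop overlapping lower-scoring ones (S112)."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if _iou(boxes[i], boxes[best]) < iou_threshold]
    return keep  # indices of the retained boxes

print(non_max_suppression([(0, 0, 10, 10), (1, 1, 11, 11)], [0.9, 0.6]))  # [0]
```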
Based on the above object detection methods, the present invention further provides a multi-object detection device, which is used for executing the multi-object detection method.
Fig. 2 is a connection diagram of the devices for multi-target object detection according to the present invention, showing the multi-target object detection device 1200 provided by the invention. It comprises 1 mobile terminal graphics card chip, 1 embedded main board, 1 camera, 1 power adapter and 1 hard disk. The embedded main board is the hardware platform for the whole target object detection. The mobile terminal graphics card chip is an embedded image processing module responsible for processing the video stream images. The camera is used for acquiring video images. The power adapter supplies power to the embedded main board. The hard disk is used for storing data. The mobile terminal graphics card chip 804 is connected with the embedded main board 805, the camera 802 is connected with the embedded main board 805, the power adapter 801 is connected with the embedded main board 805, and the hard disk 803 is connected with the embedded main board 805.
FIG. 3 is a flowchart of creating a pre-trained multi-target object detection model according to the present invention, comprising the following steps:
S110: preparing training sample pictures of the target objects to be detected, with no fewer than 10,000 pictures per target;
S111: manually calibrating the position and size frame of each target in the sample pictures, using image processing software to mark the ground-truth frame positions;
S112: using image processing software to reduce the calibrated samples to 448×448 pixels;
S113: using a 24-layer convolutional neural network followed by 2 fully connected layers and a 1×1×90 convolutional layer to perform feature extraction on the reduced samples, obtaining frame regression coordinate values, confidence values that frames contain objects, and class probability values for 90 targets;
S114: performing non-maximum suppression screening on the feature values obtained in S113 and finally merging them into a single frame.
Fig. 4 is a flowchart of installation of equipment software for multi-target object detection according to the present invention.
S201, first, a 64-bit Ubuntu operating system is installed on a computer, using a long-term support (LTS) version;
S202, after S201 is completed, the host computer with the flashed 64-bit Ubuntu system is connected to the target object detection device with a data cable;
S203, after the system installation of S202 is completed, the CUDA deep learning image acceleration package is installed using the NVIDIA embedded installation package;
S204, after S203 is completed, the Google deep learning framework is installed, so that the target detection problem can be handled with deep learning methods;
S205, after S204 is completed, camera-calling software and image processing software are installed, mainly for operations such as image reading, scaling and cutting;
S206, the object detection framework is installed: the Google target detection application framework is used to integrate proven algorithms and recognition framework algorithms.
The invention provides a method for detecting multiple target objects with a two-dimensional video camera, solving the prior-art problems of complicated image feature extraction design, low detection speed, heavy equipment and high power consumption. Target object detection with this method has lower power consumption, lower computing resource consumption and good portability, and is suitable for end-to-end real-time target object detection in a network-free environment. The device uses a 90 W low-power supply, so its power consumption is lower than that of deep learning target detection on computers and servers. The device can be used without a network and does not need to transmit data to a server for real-time computation, so it can be used outdoors in areas with poor network coverage. The equipment is small, about 40 cm in overall size, and is suitable for terminal applications of multi-target object recognition. The method processes 18-32 frames per second (FPS) with 75.1% accuracy on the VOC 2007 dataset, achieving high precision under low power consumption. The method can process 90 categories of multiple targets simultaneously.

Claims (1)

1. A detection method using a detection device, the detection device comprising a mobile terminal graphics card chip, an embedded main board, a camera, a power adapter and a hard disk, wherein the embedded main board is the hardware platform for the whole multi-target object detection; the mobile terminal graphics card chip is an embedded image processing module responsible for processing video stream images; the camera is used for acquiring video images; the power adapter supplies power to the embedded main board; the hard disk is used for storing data; and the embedded main board is connected with the graphics card chip, the camera, the power adapter and the hard disk respectively; the detection method is characterized by comprising the following steps:
step 1: connecting the target object detection device, specifically: connecting the mobile terminal graphics card chip with the embedded main board, the camera with the embedded main board, the power adapter with the embedded main board, and the hard disk with the embedded main board;
step 2: creating a pretrained multi-target object detection model by using a convolutional neural network, wherein the method comprises the following specific contents and steps:
(1) Preparing a training sample picture of a target object to be detected;
(2) Manually calibrating the position and size frames of the target in the sample picture;
(3) Scaling the calibrated sample pictures down to 448×448 pixels;
(4) Performing feature extraction on the reduced samples with a 24-layer convolutional neural network to obtain frame regression coordinates, the confidence that each frame contains an object, and category probabilities;
(5) Performing non-maximum suppression on all frames, and outputting a single frame after screening;
step 3: installing deep learning framework application software on the target object detection device;
step 4: using camera-reading application software, sequentially reading each frame of image from the camera;
step 5: reducing the image read from the camera to 448×448 pixels;
step 6: dividing the reduced image into a 7×7 grid of equal-sized cells;
step 7: judging whether the object is in a 7×7 grid cell by using the coordinate values, specifically by comparing the coordinates of the object's center point with the coordinate range of the grid cell;
step 8: sending the grid cells judged in step 7 to contain objects into the pre-trained network model to obtain frame regression values;
step 9: outputting the frame regression values of 90 object categories for each grid cell through the 90-category discriminators;
step 10: outputting the position value and confidence of each frame regression object through the 90-category discriminators;
step 11: after the position value and confidence value of each frame are obtained, setting a threshold and filtering out low-scoring frames;
step 12: performing non-maximum suppression on the retained frames and merging them to obtain the final detection result;
the detection method further comprises comparing the confidence of the object with a threshold of the target image to judge whether the video to be detected contains the target image; the confidence score is compared with the threshold of the target image,
when the confidence score is greater than or equal to the threshold of the target image, the video to be detected is judged to contain the target image;
when the confidence score is smaller than the threshold of the target image, the video to be detected is judged not to contain the target image;
the detection method further comprises comparing the position value of the object with the position value in the pre-trained multi-target object detection model to judge the accuracy of target detection;
the intersection-over-union between the position value and the manually calibrated target position in the sample is computed,
when the intersection-over-union is greater than or equal to the threshold, the detection is judged correct;
and when the intersection-over-union is smaller than the threshold, the detection is judged incorrect.
CN201811488003.9A 2018-12-06 2018-12-06 Method and device for detecting multiple target objects Active CN109726741B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811488003.9A CN109726741B (en) 2018-12-06 2018-12-06 Method and device for detecting multiple target objects

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811488003.9A CN109726741B (en) 2018-12-06 2018-12-06 Method and device for detecting multiple target objects

Publications (2)

Publication Number Publication Date
CN109726741A CN109726741A (en) 2019-05-07
CN109726741B true CN109726741B (en) 2023-05-30

Family

ID=66295623

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811488003.9A Active CN109726741B (en) 2018-12-06 2018-12-06 Method and device for detecting multiple target objects

Country Status (1)

Country Link
CN (1) CN109726741B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110348312A (en) * 2019-06-14 2019-10-18 武汉大学 A kind of area video human action behavior real-time identification method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106682697A (en) * 2016-12-29 2017-05-17 华中科技大学 End-to-end object detection method based on convolutional neural network
CN106803071A (en) * 2016-12-29 2017-06-06 浙江大华技术股份有限公司 Object detecting method and device in a kind of image
CN108647655A (en) * 2018-05-16 2018-10-12 北京工业大学 Low latitude aerial images power line foreign matter detecting method based on light-duty convolutional neural networks
CN108875595A (en) * 2018-05-29 2018-11-23 重庆大学 A kind of Driving Scene object detection method merged based on deep learning and multilayer feature

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107665336A (en) * 2017-09-20 2018-02-06 厦门理工学院 Multi-target detection method based on Faster RCNN in intelligent refrigerator
CN107844770A (en) * 2017-11-03 2018-03-27 东北大学 A kind of electric melting magnesium furnace unusual service condition automatic recognition system based on video

Also Published As

Publication number Publication date
CN109726741A (en) 2019-05-07

Similar Documents

Publication Publication Date Title
US10885365B2 (en) Method and apparatus for detecting object keypoint, and electronic device
CN108230357B (en) Key point detection method and device, storage medium and electronic equipment
CN111080693A (en) Robot autonomous classification grabbing method based on YOLOv3
CN110348522B (en) Image detection and identification method and system, electronic equipment, and image classification network optimization method and system
CN108229673B (en) Convolutional neural network processing method and device and electronic equipment
CN110930296B (en) Image processing method, device, equipment and storage medium
CN112669344A (en) Method and device for positioning moving object, electronic equipment and storage medium
CN112857268B (en) Object area measuring method, device, electronic equipment and storage medium
CN108229494B (en) Network training method, processing method, device, storage medium and electronic equipment
CN113012200B (en) Method and device for positioning moving object, electronic equipment and storage medium
US11694331B2 (en) Capture and storage of magnified images
CN109544516B (en) Image detection method and device
CN115861715B (en) Knowledge representation enhancement-based image target relationship recognition algorithm
CN113344862A (en) Defect detection method, defect detection device, electronic equipment and storage medium
CN109726741B (en) Method and device for detecting multiple target objects
CN113780145A (en) Sperm morphology detection method, sperm morphology detection device, computer equipment and storage medium
CN113469087B (en) Picture frame detection method, device, equipment and medium in building drawing
CN114757941A (en) Transformer substation equipment defect identification method and device, electronic equipment and storage medium
CN114612971A (en) Face detection method, model training method, electronic device, and program product
CN113936158A (en) Label matching method and device
CN113537026A (en) Primitive detection method, device, equipment and medium in building plan
CN114037865B (en) Image processing method, apparatus, device, storage medium, and program product
CN117523345B (en) Target detection data balancing method and device
CN111753625B (en) Pedestrian detection method, device, equipment and medium
CN114092739B (en) Image processing method, apparatus, device, storage medium, and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant