CN110889425A

CN110889425A - Target detection method based on deep learning

Info

Publication number: CN110889425A
Application number: CN201811644255.6A
Authority: CN
Inventors: 邓远志; 林淼; 刘志永; 陈志列
Original assignee: EVOC Intelligent Technology Co Ltd
Current assignee: EVOC Intelligent Technology Co Ltd
Priority date: 2018-12-29
Filing date: 2018-12-29
Publication date: 2020-03-17

Abstract

The invention provides a target detection method based on deep learning. The method trains the model directly on the whole image and integrates the two stages of candidate region extraction and feature detection; that is, the classification categories and rectangular bounding boxes of real targets are regressed directly at multiple positions of the image. Stored features are read and written using video memory, and a softmax function replaces the SVM for classifying the features. As a result, the speed of target detection can be increased, targets and background regions can be better distinguished because training uses the whole image directly, and the precision of target detection can be improved.

Description

Target detection method based on deep learning

Technical Field

The invention relates to the technical field of computer vision, in particular to a target detection method based on deep learning.

Background

Target detection is the basis for complex visual tasks such as target retrieval, target tracking, abnormal behavior detection and scene understanding. Detecting targets in images or videos algorithmically provides more evidence for high-level decision-making, so a good target detection model is an important link.

Currently, target detection methods based on the region-based convolutional neural network (R-CNN) are dominant in the field of target detection. The detection process of such a method is as follows: first, a set of candidate regions is generated, where the candidate regions are obtained by finding the possible positions of targets in the image in advance using information such as texture, edges and color; then all candidate regions are input as training samples into a convolutional neural network (CNN) for training; next, the CNN features extracted from each candidate region are input into an SVM classifier for training; finally, bounding-box regression is applied to the candidate regions classified by the SVM to correct them, so that the windows extracted from the candidate regions match the real target windows more closely.

In the process of implementing the invention, the inventor finds that at least the following technical problems exist in the prior art:

In the target detection algorithm based on R-CNN, training must be performed in two parts, namely candidate region training and CNN feature training, and the algorithm needs to frequently read and write the stored features through a hard disk. As a result, on the same hardware platform and for images of the same resolution, the conventional target detection method is relatively time-consuming.

Disclosure of Invention

In the target detection method based on deep learning provided by the invention, the two stages of candidate region extraction and feature detection are integrated, the stored features are read and written using video memory, and a softmax function replaces the SVM for classifying the features, so that the speed and precision of target detection can be improved.

The invention provides a target detection method based on deep learning, which comprises the following steps:

(1) loading the image and the corresponding annotation information file into a computer video memory, and randomly initializing a weight matrix;

the annotation information file comprises the category of each real target in the image and the coordinates of a rectangular bounding box containing the real target;

(2) carrying out grid division on the image to obtain a plurality of grid subimages, and predicting a candidate area of each grid subimage;

(3) performing convolution operation on a plurality of candidate area matrix vectors of each grid sub-image to obtain a feature map of the grid sub-image, performing convolution operation on the feature map on different convolution layers by using convolution kernels of different scales, and performing integral fusion on the feature maps of different scales corresponding to each grid sub-image;

(4) performing pooling operation on the fused feature map, and performing convolution operation on the pooled feature map and a convolution kernel with a fixed scale to further optimize the feature map;

(5) performing a pooling operation on the output feature map of the step (4) by using a filter;

(6) taking the output of the step (5) as the input of the fully connected layer, and performing a convolution operation by adopting a fixed step length;

(7) taking the output of the step (6) as the input of a classification function Softmax, calculating the confidence of the image target class and the predicted coordinate information, calculating the error between these predictions and the real annotation information, and calculating the corresponding gradient value from the error to update the weight matrix of each layer;

(8) stopping training if the training times reach the set times, otherwise, returning to the step (3);

(9) obtaining a trained model after the set training times are reached, and performing product calculation on the image to be detected and the model weight matrix to obtain a target detection result in the image.

According to the target detection method based on deep learning provided by the embodiment of the invention, the model is trained directly on the whole image and the target detection problem is converted into a regression problem; that is, the classification categories and rectangular bounding boxes of real targets are regressed directly at a plurality of positions of the input image. Compared with the prior art, on one hand, candidate region extraction and feature detection are integrated, and during training the stored features are read and written using video memory rather than a hard disk, so that read/write efficiency is significantly improved and the speed of target detection can be increased. On the other hand, convolution operations are performed on different convolution layers with convolution kernels of different scales, and the feature maps of different scales are fused after the convolution calculation so as to adapt to real targets of multiple scales; combined with using a softmax function instead of an SVM to classify the features, this improves the precision of target detection.

Drawings

FIG. 1 is a flowchart of a deep learning-based target detection method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of an application of the deep learning-based target detection method in a security platform.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The invention provides a target detection method based on deep learning. As shown in FIG. 1, the method comprises the following steps:

(1) Loading the image and the corresponding annotation information file into a computer video memory, and randomly initializing a weight matrix.

The annotation information file comprises the category of each real target in the image and the coordinates of a rectangular bounding box containing the real target.

(2) Carrying out grid division on the image to obtain a plurality of grid sub-images, and predicting the candidate regions of each grid sub-image.

(3) Performing a convolution operation on a plurality of candidate region matrix vectors of each grid sub-image to obtain a feature map of the grid sub-image, performing convolution operations on the feature map on different convolution layers using convolution kernels of different scales, and integrating and fusing the feature maps of different scales corresponding to each grid sub-image.

(4) Performing a pooling operation on the fused feature map, and convolving the pooled feature map with a convolution kernel of a fixed scale to further optimize the feature map.

Step (4) reduces the feature dimensions and enhances the robustness of the features against interference (such as interference caused by image stretching, rotation and other operations).

(5) Performing a pooling operation on the output feature map of step (4) using a filter.

(6) Taking the output of step (5) as the input of the fully connected layer, and performing a convolution operation with a fixed step.

Specifically, the features output in step (5) are scaled to 1 × 1000, that is, a 1000-dimensional feature map is obtained, and then the feature map is convolved with a fixed step.

(7) Taking the output of step (6) as the input of a classification function Softmax, first calculating the confidence of the image target class and the predicted coordinate information, then calculating the error between these predictions and the real annotation information, and calculating the corresponding gradient value from the error so as to update the weight matrix of each layer.

Specifically, the image features output by step (6) are used as the input of the classification function Softmax, and the confidence of the target category in the image and the coordinate information corresponding to the target are calculated. The standard Euclidean distance between these predictions and the annotated information of the target in the current image is computed as the error, the corresponding gradient value is calculated from the error, and the weight matrices of all layers are updated so that the target confidence and corresponding coordinates obtained in the next training iteration are closer to the real values.

(8) Stopping training if the number of training iterations reaches the set number; otherwise, returning to step (3).

Specifically, a trained model is obtained once the set number of training iterations is reached. An image to be detected is then input and passed through the convolution, pooling and other calculations of steps (2) to (6), and finally the category and coordinate information of the detected targets is obtained through the classification function softmax; that is, product calculation is performed on the image to be detected and the model weight matrix to obtain the target detection result in the image.
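Steps (7) and (8) can be illustrated with the following minimal sketch. It is not the implementation of the invention: a single linear layer followed by softmax stands in for the whole network, and the function names `softmax`, `squared_euclidean` and `train`, the learning rate of 0.1 and the iteration count of 200 are all hypothetical choices made for the example.

```python
import math
import random


def softmax(logits):
    """Map raw class scores to confidences that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def squared_euclidean(pred, target):
    """Squared Euclidean distance between a prediction and its annotation."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def train(features, target, weights, lr=0.1, max_iters=200):
    """Update `weights` in place for a fixed number of iterations.

    Assumption: one linear layer followed by softmax stands in for the
    network, and `target` is a one-hot class vector.
    Returns the error recorded at the start of each iteration.
    """
    errors = []
    for _ in range(max_iters):  # step (8): stop at the set count
        logits = [sum(w * x for w, x in zip(row, features)) for row in weights]
        conf = softmax(logits)
        errors.append(squared_euclidean(conf, target))
        # Gradient of the squared error with respect to the logits,
        # propagated back through the softmax.
        g = [2.0 * (c - t) for c, t in zip(conf, target)]
        dot = sum(gk * ck for gk, ck in zip(g, conf))
        dz = [ck * (gk - dot) for ck, gk in zip(conf, g)]
        for k, row in enumerate(weights):
            for j, x in enumerate(features):
                row[j] -= lr * dz[k] * x  # step (7): gradient update
    return errors


rng = random.Random(0)  # random initialization, as in step (1)
weights = [[rng.uniform(-0.1, 0.1) for _ in range(2)] for _ in range(2)]
errors = train([1.0, 0.5], [0.0, 1.0], weights)
print(errors[0], errors[-1])  # the error shrinks as training proceeds
```

The loop stops purely on the iteration count, mirroring step (8); a convergence test on the error is a common alternative but is not described in the text above.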

According to the target detection method based on deep learning provided by the embodiment of the invention, the model is trained directly on the whole image and the target detection problem is converted into a regression problem; that is, the classification categories and rectangular bounding boxes of real targets are regressed directly at a plurality of positions of the input image. Compared with the prior art, on one hand, steps (2) to (7) integrate the whole series of processes from candidate region extraction to Softmax classification of the features, realizing end-to-end training from input to output, and during training the stored features are read and written using video memory rather than a hard disk, so that read/write efficiency is significantly improved and the speed of target detection can be increased. On the other hand, convolution operations are performed on different convolution layers with convolution kernels of different scales, and the feature maps of different scales are fused after the convolution calculation so as to adapt to real targets of multiple scales; combined with using a softmax function instead of an SVM to classify the features, good performance is still maintained in high-dimensional feature classification, and the accuracy of target detection is improved.
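The multi-scale convolution and fusion described above can be sketched as follows. This is an illustration under stated assumptions only: the description does not specify the fusion operator, so element-wise addition is assumed here, and the averaging kernels and the names `conv2d_same`, `averaging_kernel` and `fuse` are hypothetical stand-ins for learned kernels and the actual network layers.

```python
def conv2d_same(image, kernel):
    """Stride-1 2-D convolution (cross-correlation form, as is usual in
    deep learning) with zero padding, so the output keeps the input size."""
    h, w = len(image), len(image[0])
    k = len(kernel)
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    yy, xx = y + i - pad, x + j - pad
                    if 0 <= yy < h and 0 <= xx < w:  # outside pixels count as zero
                        acc += image[yy][xx] * kernel[i][j]
            out[y][x] = acc
    return out


def averaging_kernel(k):
    """Hypothetical k x k kernel; a trained network would learn these values."""
    return [[1.0 / (k * k)] * k for _ in range(k)]


def fuse(feature_maps):
    """Assumed fusion operator: element-wise sum of equally sized maps."""
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    return [[sum(m[y][x] for m in feature_maps) for x in range(w)]
            for y in range(h)]


image = [[1.0] * 5 for _ in range(5)]
maps = [conv2d_same(image, averaging_kernel(k)) for k in (3, 5)]
fused = fuse(maps)
print(fused[2][2])  # response at the image center after fusion
```

Zero padding keeps the 3x3 and 5x5 feature maps the same size, which is what makes an element-wise fusion possible; channel concatenation would be another plausible reading of the fusion step.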

Optionally, if the center coordinate of the rectangular bounding box is located within the coordinate range of a grid sub-image, product calculation is performed on the matrix vector of the grid sub-image and the weight matrix to predict a plurality of candidate regions; otherwise, no candidate region prediction is performed for that grid sub-image.
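The test of whether a bounding-box center falls inside a grid sub-image can be sketched as below. The 448x448 image size, the 7x7 grid and the function names are hypothetical values chosen for the example; the description does not state the grid size.

```python
def cell_of_center(box, image_w, image_h, grid):
    """Return (row, col) of the grid sub-image containing the box center.

    `box` is (x_min, y_min, x_max, y_max) in pixel coordinates.
    """
    x_min, y_min, x_max, y_max = box
    cx = (x_min + x_max) / 2.0
    cy = (y_min + y_max) / 2.0
    # Clamp so a center lying exactly on the right or bottom edge
    # still maps to the last cell.
    col = min(int(cx * grid / image_w), grid - 1)
    row = min(int(cy * grid / image_h), grid - 1)
    return row, col


def cells_to_predict(boxes, image_w, image_h, grid):
    """Only cells holding at least one box center get candidate prediction."""
    return {cell_of_center(b, image_w, image_h, grid) for b in boxes}


boxes = [(100, 200, 200, 300), (400, 400, 448, 448)]
active = cells_to_predict(boxes, image_w=448, image_h=448, grid=7)
print(sorted(active))
```

All grid sub-images not in the returned set are skipped, which corresponds to the "otherwise" branch in the paragraph above.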

Optionally, before loading the image and the corresponding annotation information file into the computer video memory, the method further includes:

marking each real target in the image with an image annotation tool, and generating an annotation information file.
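The description states only that the annotation information file holds the category of each real target and the coordinates of its rectangular bounding box; it does not specify a file format. The sketch below assumes a hypothetical JSON layout (the field names `image`, `objects`, `category` and `box` and the sample values are invented for illustration) and shows how such a file might be read and checked.

```python
import json

# Hypothetical annotation content; real tools emit their own formats.
SAMPLE = """
{
  "image": "road_0001.jpg",
  "objects": [
    {"category": "pedestrian", "box": [34, 50, 80, 190]},
    {"category": "vehicle", "box": [120, 90, 300, 210]}
  ]
}
"""


def load_annotations(text):
    """Parse annotation text into a list of (category, box) tuples.

    Each box is (x_min, y_min, x_max, y_max) in pixel coordinates.
    """
    record = json.loads(text)
    targets = []
    for obj in record["objects"]:
        x_min, y_min, x_max, y_max = obj["box"]
        if x_min >= x_max or y_min >= y_max:
            raise ValueError("degenerate bounding box: %r" % (obj["box"],))
        targets.append((obj["category"], (x_min, y_min, x_max, y_max)))
    return targets


targets = load_annotations(SAMPLE)
print(targets)
```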

Optionally, after loading the image and the corresponding annotation information file into the computer video memory and before grid-dividing the image into a plurality of grid sub-images, the method further includes:

initializing coordinates of a candidate region of the image as null.

Optionally, the convolution kernel with a fixed scale is a 3x3 convolution kernel or a 5x5 convolution kernel, the filter is a 2x2 filter, and the fixed step is a 1x1 step.
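To see how these sizes affect the feature map, the standard output-size arithmetic for convolution and pooling can be applied. The sketch below assumes no padding for the convolution and a pooling step equal to the filter size (the text gives only the 2x2 filter size); the 32-pixel input width and the function names are hypothetical.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Output width of a convolution over an input of width n."""
    return (n + 2 * padding - kernel) // stride + 1


def pool_output_size(n, window, stride=None):
    """Output width of pooling; the stride defaults to the window size."""
    stride = window if stride is None else stride
    return (n - window) // stride + 1


n = 32  # hypothetical input width
after_3x3 = conv_output_size(n, kernel=3, stride=1)
after_5x5 = conv_output_size(n, kernel=5, stride=1)
after_pool = pool_output_size(after_3x3, window=2)
print(after_3x3, after_5x5, after_pool)
```

With a 1x1 step the convolution shrinks the map by only kernel minus one pixels, so most of the dimension reduction comes from the 2x2 pooling filter.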

The target detection algorithm based on deep learning applies well to security images. After the algorithm is embedded into a security platform, target detection can be performed on the road scenes in traffic security images. The target detection workflow of the security platform is as follows:

1) A road traffic camera records video of the traffic road scene, and the recorded video is uploaded at regular intervals.

2) The server decodes the video into frames, initializes the graphics accelerator and loads the deep learning model.

3) The image to be detected is input into the deep learning network model to obtain the target category and position coordinate information in the road traffic image, such as the position of a pedestrian and the position and model of a vehicle.

4) The recognized targets are framed and displayed in the image; the recognition result is shown in FIG. 2.
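The framing of recognized targets in the last step of this workflow can be sketched in plain Python as follows. The frame is modeled as a list of pixel rows and `draw_box` is a hypothetical helper; a deployed platform would more likely draw on decoded frames with an image library, which the description does not name.

```python
def draw_box(frame, box, value=255):
    """Draw a one-pixel rectangle outline on a grayscale frame, in place.

    `frame` is a list of rows of pixel values and `box` is
    (x_min, y_min, x_max, y_max); coordinates are clamped to the frame.
    """
    h, w = len(frame), len(frame[0])
    x_min, y_min, x_max, y_max = box
    x_min, x_max = max(0, x_min), min(w - 1, x_max)
    y_min, y_max = max(0, y_min), min(h - 1, y_max)
    if x_min > x_max or y_min > y_max:
        return frame  # box lies entirely outside the frame
    for x in range(x_min, x_max + 1):  # top and bottom edges
        frame[y_min][x] = value
        frame[y_max][x] = value
    for y in range(y_min, y_max + 1):  # left and right edges
        frame[y][x_min] = value
        frame[y][x_max] = value
    return frame


frame = [[0] * 6 for _ in range(6)]
detections = [("pedestrian", (1, 1, 4, 3))]  # hypothetical model output
for category, box in detections:
    draw_box(frame, box)
for row in frame:
    print(row)
```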

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A target detection method based on deep learning is characterized by comprising the following steps:

2. The method of claim 1, wherein predicting the candidate regions for each grid sub-image comprises:

if the center coordinate of the rectangular bounding box is located within the coordinate range of the grid sub-image, performing product calculation on the matrix vector of the grid sub-image and a weight matrix to predict a plurality of candidate regions; otherwise, not performing candidate region prediction processing on the grid sub-image.

3. The method of claim 1, wherein before loading the image and the corresponding annotation information file into the computer video memory, the method further comprises:

and marking each real target in the image by adopting an image marking tool to generate a marking information file.

4. The method of claim 1, wherein after loading the image and the corresponding annotation information file into the computer video memory, and before performing the grid division on the image to obtain a plurality of grid sub-images, the method further comprises:

initializing coordinates of a candidate region of the image as null.

5. The method of claim 1, wherein the fixed-scale convolution kernel is a 3x3 convolution kernel or a 5x5 convolution kernel.

6. The method of claim 1, wherein the filter is a 2x2 filter.

7. The method of claim 1, wherein the fixed stride is a 1x1 stride.